CN105404614A - Subject and predicate coding based text watermark embedding and extraction method

Info

Publication number: CN105404614A
Application number: CN201510743382.1A
Authority: CN
Inventors: 陈建平; 李桂森; 朱晓辉; 施佺; 马海英; 王进
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2015-11-05
Filing date: 2015-11-05
Publication date: 2016-03-16
Anticipated expiration: 2035-11-05
Also published as: CN105404614B

Abstract

The invention relates to a text watermark embedding and extraction method based on subject and predicate coding. The embedding method comprises: 1) representing each character of the watermark information by its Unicode code to form a Unicode code string; 2) detecting the subjects and predicates of the sentences in the text to be watermarked and storing them in a set; 3) dividing the Unicode code string into a number of segments according to the number of detected subject-predicate pairs, using each subject-predicate pair to represent one segment, and assigning each segment a number; and 4) storing, in order, each subject-predicate pair, its corresponding Unicode code segment, and the segment number to form a codebook, which completes the coding and embeds the watermark. The extraction method comprises: finding the subjects and predicates in the text under examination; looking up the codebook to retrieve the Unicode code segment and segment number corresponding to each subject-predicate pair; splicing the Unicode code segments in numerical order; and converting the resulting Unicode code string into the corresponding characters to recover the watermark information. The method makes no change to the format or content of the text, offers good concealment and robustness, and is simple to construct and easy to implement.

Description

A text watermark embedding and extraction method based on subject-predicate coding

Technical field

The present invention relates to watermark embedding and extraction techniques, and in particular to a text watermark embedding and extraction method based on subject-predicate coding.

Background technology

With the spread of the Internet and information technology, more and more text is published, distributed, and used in digital form. While this brings convenience to people's study, work, and daily life, it also makes text easy to copy illegally and to plagiarize, and the protection of intellectual property in digital text has drawn wide attention in the industry. Text watermarking is a technique developed in recent years to protect the intellectual property of digital text. It embeds copyright or authentication information (the watermark) in a digital text by some means; when the text is found to have been illegally copied or plagiarized, this information can be extracted to prove copyright ownership, confirm the infringement, and protect the rights and interests of the copyright owner or holder. Text watermarking can also be used to hide and transmit secret information in text, to authenticate text content, and to track text.

Text watermarking currently falls into two main classes: watermarking based on text format and watermarking based on natural language. Format-based watermarking embeds watermark information by exploiting the fact that slight changes to text formatting, such as line spacing, word spacing, or character size, are hard to notice. Such methods are simple to construct and easy to implement, but a format conversion of the text may destroy the embedded watermark, so their robustness is weak. Natural-language watermarking encodes the watermark information using the syntax and semantics of the text content; most current implementations encode the watermark through synonym substitution and syntactic transformation. Compared with format-based watermarking, natural-language watermarking offers better concealment and robustness, and format conversion does not affect the watermark. However, because of the complexity of the Chinese language, synonym substitution and syntactic transformation may introduce ambiguity or change the meaning, and they are unsuitable when the text content must not be altered.

Summary of the invention

The object of the invention is to overcome the above deficiencies of the prior art by providing a text watermark embedding and extraction method based on subject-predicate coding that has good concealment and robustness. It is realized by the following technical scheme:

The text watermark embedding method based on subject-predicate coding comprises:

1) Representing each character of the watermark information by its Unicode code to form a Unicode code string.

2) Detecting the subject and predicate of each sentence in the text to be watermarked and storing them in a set.

3) Dividing the Unicode code string into a number of segments according to the number of subject-predicate pairs detected, each subject-predicate pair being used to represent one of the segments. Because changing the order of sentences in the text could otherwise prevent the watermark from being extracted correctly, the Unicode code segment corresponding to each subject-predicate pair is assigned a number, and the Unicode code string is spliced according to these numbers when the watermark is extracted.

4) Storing, in order, each subject-predicate pair, its corresponding Unicode code segment, and the corresponding number to form a codebook, which completes the coding and embeds the watermark.

The Unicode coding uses the UTF-16 format, in which each character is represented by four hexadecimal digits, forming a hexadecimal Unicode code string.
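As a worked illustration of the four-digit rule, the following minimal C# sketch encodes a watermark string as a hexadecimal code string. The class name WatermarkHex and the method name Encode are hypothetical and chosen only for this example. The sketch treats each UTF-16 code unit as one character, which matches the four-digit rule for characters in the Basic Multilingual Plane; a character outside that plane occupies two code units and would therefore yield eight digits.

```csharp
using System;
using System.Text;

public static class WatermarkHex
{
    // Encode every UTF-16 code unit of the watermark as four lowercase hex digits.
    public static string Encode(string watermark)
    {
        var sb = new StringBuilder(watermark.Length * 4);
        foreach (char c in watermark)
        {
            sb.Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // U+4E2D followed by U+0041
        Console.WriteLine(Encode("中A")); // prints 4e2d0041
    }
}
```

For the sample input, the two code units are U+4E2D and U+0041, so the code string has eight hexadecimal digits.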

Detecting the subject-predicate pairs in the text to be watermarked in step 2) comprises the following steps:

A) Converting the text to be watermarked into a character string;

B) Submitting the character string to the Language Technology Platform (LTP) for dependency parsing, and obtaining a string in XML format that contains the dependency relations between the sentence constituents of the text;

C) Converting the XML string into an XML document, parsing it with DOM, and traversing the document in a loop to find the subject and predicate of each sentence, based on the link between the head relation and the subject-verb relation in the relation attribute of the sentence constituents.

In a further design of the text watermark embedding method based on subject-predicate coding, the subject-predicate pair, the Unicode code segment, and the number in each line of the codebook are separated from one another by spaces.

Based on the above embedding method, an extraction method for the text watermark based on subject-predicate coding is proposed. It comprises: finding the subject-predicate pairs in the text under examination; consulting the codebook formed when the watermark was embedded and retrieving from it the Unicode code segment and number corresponding to each subject-predicate pair; splicing the Unicode code segments in the order of their numbers to obtain the Unicode code string representing the watermark information; and converting that string into the corresponding characters to recover the embedded watermark information.

In a further design of the extraction method, the step of retrieving the Unicode code segment and number corresponding to each subject-predicate pair in the text under examination comprises: comparing each subject-predicate pair found in the text under examination with each subject-predicate pair in the codebook one by one and, where the two match, retrieving from the codebook the Unicode code segment and number corresponding to that pair.

The advantages of the invention are as follows:

The invention proposes a new text watermark embedding and extraction method that embeds the watermark by using the subjects and predicates of the sentences in the text to encode the watermark information. The method makes no change to the format or content of the text and has no effect on the original; the embedding leaves no trace that could be noticed, giving good concealment. Format conversion of the text (including changes to line spacing, word spacing, character size, font, color, and so on), rearranging paragraphs, and changing the order of sentences do not affect correct extraction of the watermark, giving good robustness. The algorithm is also simple to construct and easy to implement.

Embodiment

The scheme of the invention is described in detail below.

The text watermark embedding method based on subject-predicate coding provided by this embodiment comprises the following steps: 1) Represent each character of the watermark information by its Unicode code in UTF-16 format, each character being four hexadecimal digits, to form a hexadecimal Unicode code string. 2) Detect the subject-predicate pairs in the text to be watermarked and store them in a set. 3) Divide the Unicode code string into a number of segments according to the number of subject-predicate pairs detected, use each subject-predicate pair to represent one of the segments, and assign a number to the Unicode code segment corresponding to each pair, so that the Unicode code string can be spliced according to these numbers when the watermark is extracted. 4) Store, in order, each subject-predicate pair together with its corresponding Unicode code segment and number to form a codebook, which completes the coding and embeds the watermark. In the codebook, the subject-predicate pair, the Unicode code segment, and the number in each line are separated by spaces.

Further, detecting the subject-predicate pairs in the text to be watermarked in step 2) comprises the following steps: A) Convert the text to be watermarked into a character string. B) Submit the character string to the Language Technology Platform (LTP) for dependency parsing, and obtain a string in XML format that contains the dependency relations between the sentence constituents of the text. C) Convert the XML string into an XML document, parse it with DOM, and traverse the document in a loop to find the subject and predicate of each sentence, based on the link between the head relation and the subject-verb relation in the relation attribute of the sentence constituents.

The Language Technology Platform (LTP) mentioned above is a complete, open, online Chinese natural language processing system developed over ten years by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. It provides six Chinese-processing functions in three areas: lexical analysis (word segmentation, part-of-speech tagging, and named entity recognition), syntactic analysis (dependency parsing), and semantic analysis (word sense disambiguation and semantic role labeling). The platform is open to the public and easy to use. It provides an application programming interface (API): the user sets the API parameters according to the application's needs, constructs an HTTP request, submits the text content to the system, and obtains the analysis result online. Only the dependency parsing function of LTP is used here. The text to be analyzed is submitted to the platform, which returns the dependency relations between the constituents of each sentence in the text. The subject and predicate of each sentence are then obtained from these dependency relations by further processing. The basic procedure is as follows:

The text content to be analyzed is converted into a character string, the API parameters and calling method are set, and the string containing the text content is submitted to LTP for dependency parsing. The result returned by LTP is a document in XML format containing the dependency relations between the sentence constituents of the text. The document contains para (paragraph), sent (sentence), and word (segmented word) nodes, among others. Each word node has the following attributes: id, the sequence number of the word in the sentence; cont, the content of the word; parent, the id of the word's parent node in the dependency parse; and relate, the corresponding dependency relation. To find all subject-predicate pairs in the text, every paragraph, every sentence, and every word is traversed in a loop. When the traversal reaches a word node whose relate attribute equals "HED" (HED denotes the head relation), the cont value of that node is the predicate of the sentence. The sentence is then searched for a word node whose relate attribute is SBV (SBV denotes the subject-verb relation) and whose parent attribute equals the id of the predicate node. If such a node exists, its cont value is the subject of the sentence. The subject and predicate are extracted and stored in a set.

The specific code implementing the watermark embedding process, with comments, is given below:

As required by the LTP application programming interface (API), the text to be analyzed must be converted into a character string. This can be done with the System.IO.File.ReadAllText(path, Encoding.Default) call of C#. The corresponding code is:

string text = System.IO.File.ReadAllText(path, Encoding.Default);

Here, path is the path of the text file to be analyzed, and text is the string containing the text content.

The API parameters are then set. They include the address of the LTP web service (urlbase), the key for using the API (api_key, obtained when the user registers), the analysis mode (pattern, set to dp for dependency parsing), the result format (format, set to XML), and the HTTP request method (GET). The string containing the text content (text) is submitted to the LTP platform for dependency parsing. The core code for this step is as follows:

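string urlbase = "http://api.ltp-cloud.com/analysis/";
string api_key = "k2r3q7tqGgWp5zBZRSnEHvNKfTRSFhjMtnHQ0QeP";
string pattern = "dp";      // dependency parsing
string format = "xml";      // result format
// The parameters form the query string; the text is URL-encoded so that
// characters such as '&' or '#' cannot break the request.
string strParam = "?api_key=" + api_key + "&text=" + Uri.EscapeDataString(text)
    + "&pattern=" + pattern + "&format=" + format;
Encoding encoding = Encoding.GetEncoding("utf-8");
HttpWebRequest req = WebRequest.Create(urlbase + strParam) as HttpWebRequest;
req.Method = "GET";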

The result returned by LTP is then read, giving a string in XML format that contains the dependency relations between the sentence constituents of the text. This can be done with the StreamReader class of C#. The corresponding code is:

HttpWebResponse webResponse = req.GetResponse() as HttpWebResponse;
StreamReader streamReader = new StreamReader(webResponse.GetResponseStream(), encoding);
string result = streamReader.ReadToEnd();

result holds the result of the analysis.

The XML string is converted into an XML document and parsed with DOM. Based on the link between the HED (head relation) and SBV (subject-verb relation) values of the relate attribute in the dependency parsing result, every paragraph, every sentence, and every word is traversed in a loop to find the subject and predicate of each sentence. The core code for this step is as follows:

XmlDocument doc = new XmlDocument();            // convert the string to an XML document
doc.LoadXml(result);
XmlElement root = doc.DocumentElement;
HashSet<string> hs = new HashSet<string>();     // hs is not declared in the listing; a HashSet<string> is assumed
XmlNodeList list1, list2, list3;
XmlNode list4;
list1 = root.SelectNodes("//para");
foreach (XmlNode node1 in list1) {              // loop over para nodes
    list2 = node1.ChildNodes;
    foreach (XmlNode node2 in list2) {          // loop over sent nodes
        list3 = node2.ChildNodes;
        foreach (XmlNode node3 in list3) {      // loop over word nodes
            if (node3.Attributes["relate"].InnerText == "HED") {    // predicate found
                list4 = node3;
                foreach (XmlNode node4 in list3) {
                    // subject: an SBV word whose parent is the predicate
                    if (node4.Attributes["parent"].InnerText == list4.Attributes["id"].InnerText
                        && node4.Attributes["relate"].InnerText == "SBV")
                        hs.Add(node4.Attributes["cont"].InnerText + list4.Attributes["cont"].InnerText);
                }
            }
        }
    }
}
List<string> sbv = new List<string>();
sbv.AddRange(hs);

The set sbv holds the subject-predicate pairs of the text.
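The HED/SBV rule can also be tried without calling LTP. The sketch below, under stated assumptions, applies the rule to a small hand-written XML string. The sample XML is hypothetical and only imitates the para, sent, and word structure described above; the class name SubjectPredicateFinder, the SampleXml constant, and the use of XPath queries in place of nested loops are choices made for this illustration and are not part of the original listing.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SubjectPredicateFinder
{
    // Hypothetical sample shaped like the para/sent/word output described in the text.
    public const string SampleXml =
        "<doc><para id=\"0\"><sent id=\"0\">" +
        "<word id=\"0\" cont=\"我们\" parent=\"1\" relate=\"SBV\"/>" +
        "<word id=\"1\" cont=\"保护\" parent=\"-1\" relate=\"HED\"/>" +
        "<word id=\"2\" cont=\"版权\" parent=\"1\" relate=\"VOB\"/>" +
        "</sent></para></doc>";

    // Returns one "subject + predicate" string per SBV word attached to a sentence head.
    public static List<string> Find(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var pairs = new List<string>();
        foreach (XmlNode sent in doc.SelectNodes("//sent"))
        {
            // The word whose relate attribute is HED is the predicate of the sentence.
            XmlNode head = sent.SelectSingleNode("word[@relate='HED']");
            if (head == null) continue;
            string headId = head.Attributes["id"].Value;
            // Subjects are SBV words whose parent is the predicate.
            foreach (XmlNode subject in sent.SelectNodes(
                "word[@relate='SBV' and @parent='" + headId + "']"))
            {
                string pair = subject.Attributes["cont"].Value + head.Attributes["cont"].Value;
                if (!pairs.Contains(pair)) pairs.Add(pair);   // keep each pair once
            }
        }
        return pairs;
    }

    public static void Main()
    {
        foreach (string pair in Find(SampleXml)) Console.WriteLine(pair);
    }
}
```

For the sample sentence, the only SBV word attached to the head is the first word, so the result contains a single pair formed from the first and second words.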

The watermark information to be embedded is encoded in UTF-16 and converted into a Unicode code string. The corresponding code is:

string uc = "";
byte[] bts = Encoding.Unicode.GetBytes(info);    // UTF-16, little-endian byte order
for (int i = 0; i < bts.Length; i += 2)
    // high byte first, so that each character yields four hexadecimal digits
    uc += bts[i + 1].ToString("x").PadLeft(2, '0') + bts[i].ToString("x").PadLeft(2, '0');

Here, info holds the watermark information, and uc holds the generated Unicode code string.

The Unicode code string is then encoded using the subject-predicate pairs in the set. Each pair in the set is taken in turn, allocated one segment of the Unicode code string, and given a number. The subject-predicate pair, the Unicode code segment, and the number are separated by spaces, and together the lines form the codebook. The core code for this step is as follows.

Code that determines the number of digits of the Unicode code string allocated to each subject-predicate pair:

int strU_size = uc.Length;                // number of digits in the Unicode code string
int sbv_size = sbv.Count;                 // number of subject-predicate pairs
int count_size = strU_size / sbv_size;    // number of digits allocated to each pair

Code that allocates one segment of the Unicode code string to each subject-predicate pair:

for (int x = 0; x < sbv_size; x++) {
    if (x == sbv_size - 1)
        // last pair: takes all remaining digits, so its length may differ
        code_list.Add(sbv[x] + " " + uc.Substring(x * count_size) + " " + (x + 1));
    else
        // earlier pairs: equal shares of count_size digits each
        code_list.Add(sbv[x] + " " + uc.Substring(x * count_size, count_size) + " " + (x + 1));
}

The set code_list holds the content of the codebook; writing it to a txt file gives the codebook file for the embedded watermark.
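To make the allocation concrete, here is a minimal, self-contained C# sketch of the segmentation step. The class name CodebookBuilder, the method name Build, and the placeholder pair names are hypothetical. It assumes that the pairs contain no spaces and that the code string has at least as many digits as there are pairs; with fewer digits the integer division gives a share of zero and the earlier segments come out empty.

```csharp
using System;
using System.Collections.Generic;

public static class CodebookBuilder
{
    // Builds codebook lines of the form "pair segment number".
    public static List<string> Build(IList<string> pairs, string code)
    {
        var lines = new List<string>();
        int n = pairs.Count;
        if (n == 0) return lines;          // nothing to encode with
        int size = code.Length / n;        // digits per pair (integer division)
        for (int x = 0; x < n; x++)
        {
            // The last pair takes whatever remains; the others take equal shares.
            string segment = (x == n - 1)
                ? code.Substring(x * size)
                : code.Substring(x * size, size);
            lines.Add(pairs[x] + " " + segment + " " + (x + 1));
        }
        return lines;
    }

    public static void Main()
    {
        var pairs = new List<string> { "pairA", "pairB", "pairC" };
        foreach (string line in Build(pairs, "4e2d0041")) Console.WriteLine(line);
        // prints:
        // pairA 4e 1
        // pairB 2d 2
        // pairC 0041 3
    }
}
```

With eight digits and three pairs, the share is 8 / 3 = 2 digits, so the first two pairs receive two digits each and the last pair receives the remaining four.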

Based on the above embedding method, an extraction method for the text watermark based on subject-predicate coding is proposed. Its embodiment is as follows:

When the watermark needs to be extracted, the text under examination is submitted to the LTP platform for dependency parsing, and the analysis result is processed further to obtain the subject-predicate pairs in the text, which are stored in a set. The code for this step is the same as that used when embedding the watermark.

The codebook file formed when the watermark was embedded is opened, and each subject-predicate pair in the set is decoded against the codebook. That is, each pair in the set is compared one by one with each pair in the codebook and, where the two match, the Unicode code segment and number corresponding to that pair are retrieved. The Unicode code segments obtained are spliced together in the order of their numbers to give the Unicode code string representing the watermark information.

The code for the main operations of this process, with comments, is given below:

Each line of the codebook is read into an array. The code is:

string[] lines = File.ReadAllLines(path);

Here, path is the path of the codebook file, and lines is the array containing the lines of the codebook.

The subject-predicate pair of each line is split off at the first space and compared one by one with the subject-predicate pairs of the text under examination stored in the set above. Where there is a match, the Unicode code segment and number of that line are retrieved and put into a set. The code is:

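for (int i = 0; i < sbv.Count; i++) {
    for (int j = 0; j < lines.Length; j++) {
        // split into the subject-predicate pair and the remainder ("segment number")
        string[] lgs = lines[j].ToString().Split(new char[] { ' ' }, 2);
        if (sbv[i] == lgs[0])
            st.Add(lgs[1]);
    }
}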

st is the set holding the Unicode code segments and their numbers.

Each Unicode code segment is separated from its number at the space, and the segments are spliced together in the order of their numbers to give the Unicode code string representing the watermark information. The code is:

for (int x = 0; x < st.Count; x++) {
    for (int y = 0; y < st.Count; y++) {
        string[] lgs = st[y].ToString().Split(new char[] { ' ' }, 2);
        if (Convert.ToInt32(lgs[1]) == (x + 1))    // the segment numbered x + 1
            drawUc.Append(lgs[0]);
    }
}

drawUc holds the Unicode code string representing the watermark information.
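The lookup and ordered splicing can be sketched in a self-contained form as follows. This is an illustrative variant rather than the listing above: it uses a SortedDictionary keyed by segment number instead of nested loops, and the names CodebookReader and Recover, as well as the placeholder pair names, are hypothetical. It assumes codebook lines of the form "pair segment number" with single spaces and no spaces inside the pair.

```csharp
using System;
using System.Collections.Generic;

public static class CodebookReader
{
    // Rebuilds the code string from the codebook lines whose pair occurs in the text.
    public static string Recover(IEnumerable<string> codebookLines,
                                 IEnumerable<string> detectedPairs)
    {
        var detected = new HashSet<string>(detectedPairs);
        var segments = new SortedDictionary<int, string>();   // ordered by segment number
        foreach (string line in codebookLines)
        {
            string[] parts = line.Split(' ');
            if (parts.Length != 3) continue;                  // skip malformed lines
            int number;
            if (!int.TryParse(parts[2], out number)) continue;
            if (detected.Contains(parts[0]))
                segments[number] = parts[1];
        }
        return string.Concat(segments.Values);
    }

    public static void Main()
    {
        var codebook = new[] { "pairA 4e 1", "pairB 2d 2", "pairC 0041 3" };
        // The sentences, and therefore the detected pairs, appear in a different order.
        var detected = new[] { "pairC", "pairA", "pairB" };
        Console.WriteLine(Recover(codebook, detected)); // prints 4e2d0041
    }
}
```

Because the segments are ordered by their stored numbers and not by the order in which the pairs are found, reordering the sentences of the text does not change the recovered code string.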

Following the UTF-16 coding rule used when the watermark was embedded, the Unicode code string is converted into the corresponding characters, which gives the embedded watermark information. The core code for this step is as follows:

// str is the Unicode code string obtained above; each match covers four hexadecimal digits
MatchCollection mc = Regex.Matches(str, @"([\w]{2})([\w]{2})",
    RegexOptions.Compiled | RegexOptions.IgnoreCase);
byte[] bts = new byte[2];
foreach (Match m in mc) {
    bts[0] = (byte)int.Parse(m.Groups[2].Value, NumberStyles.HexNumber);    // low byte
    bts[1] = (byte)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber);    // high byte
    toStr += Encoding.Unicode.GetString(bts);
}

toStr holds the extracted watermark information.
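The conversion back to characters can also be written without a regular expression. The following minimal C# sketch, under stated assumptions, reads the code string four hexadecimal digits at a time and turns each group into one UTF-16 code unit. The names WatermarkHexDecoder and Decode are hypothetical, and the sketch assumes a well-formed code string whose length is a multiple of four.

```csharp
using System;
using System.Text;

public static class WatermarkHexDecoder
{
    // Converts groups of four hex digits back into UTF-16 code units.
    public static string Decode(string hex)
    {
        if (hex.Length % 4 != 0)
            throw new ArgumentException("Length must be a multiple of four.", "hex");
        var sb = new StringBuilder(hex.Length / 4);
        for (int i = 0; i < hex.Length; i += 4)
        {
            // Each group is the big-endian hexadecimal value of one code unit.
            sb.Append((char)Convert.ToUInt16(hex.Substring(i, 4), 16));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string watermark = Decode("4e2d0041");
        Console.WriteLine(watermark.Length); // prints 2
    }
}
```

For the sample code string, the two groups 4e2d and 0041 decode to the characters U+4E2D and U+0041.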

Claims

1. A text watermark embedding method based on subject-predicate coding, characterized by comprising:

1) representing each character of the watermark information by its Unicode code to form a Unicode code string;

2) detecting the subject and predicate of each sentence in the text to be watermarked and storing them in a set;

3) dividing the Unicode code string into a number of segments according to the number of subject-predicate pairs detected, each subject-predicate pair being used to represent one of the segments, and assigning a number to the Unicode code segment corresponding to each subject-predicate pair, so that the Unicode code string can be spliced according to these numbers when the watermark is extracted;

2. The text watermark embedding method based on subject-predicate coding according to claim 1, characterized in that the Unicode coding uses the UTF-16 format, each character being four hexadecimal digits, forming a hexadecimal Unicode code string.

3. The text watermark embedding method according to claim 1, characterized in that detecting the subject-predicate pairs in the text to be watermarked in step 2) comprises the following steps:

converting the text to be watermarked into a character string;

4. The text watermark embedding method based on subject-predicate coding according to claim 1, characterized in that the subject-predicate pair, the Unicode code segment, and the number in each line of the codebook are separated from one another by spaces.

5. An extraction method for a text watermark based on subject-predicate coding, proposed on the basis of the text watermark embedding method based on subject-predicate coding according to any one of claims 1-4, characterized by comprising: finding the subject-predicate pairs in the text under examination; consulting the codebook formed when the watermark was embedded and retrieving from it the Unicode code segment and number corresponding to each subject-predicate pair; splicing the Unicode code segments in the order of their numbers to obtain the Unicode code string representing the watermark information; and converting that string into the corresponding characters to recover the embedded watermark information.

6. The extraction method for a text watermark based on subject-predicate coding according to claim 5, characterized in that the step of retrieving the Unicode code segment and number corresponding to each subject-predicate pair in the text under examination comprises: comparing each subject-predicate pair found in the text under examination with each subject-predicate pair in the codebook one by one and, where the two match, retrieving from the codebook the Unicode code segment and number corresponding to that pair.