CN102495881A - Genetic word-based file processing method and device - Google Patents

Genetic word-based file processing method and device

Info

Publication number
CN102495881A
CN102495881A CN2011104002534A CN201110400253A CN102495881A CN 102495881 A CN102495881 A CN 102495881A CN 2011104002534 A CN2011104002534 A CN 2011104002534A CN 201110400253 A CN201110400253 A CN 201110400253A CN 102495881 A CN102495881 A CN 102495881A
Authority
CN
China
Prior art keywords
character
source
word
gene
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104002534A
Other languages
Chinese (zh)
Other versions
CN102495881B (en)
Inventor
郝佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Founder International Co Ltd
Founder International Beijing Co Ltd
Original Assignee
Founder International Co Ltd
Founder International Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Founder International Co Ltd, Founder International Beijing Co Ltd
Priority to CN201110400253.4A
Publication of CN102495881A
Application granted
Publication of CN102495881B
Legal status: Active
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a genetic word-based file processing method and device. The method comprises the following steps: one or more source characters are extracted from an original file according to a genetic word stock to obtain a source character set, wherein each source character in the source character set has a corresponding genetic word in the genetic word stock; the repetition frequency of each source character in the source character set is calculated, and the source characters are sorted according to the repetition frequency and the character internal code of each source character; the sorted source characters are grouped by a snakelike algorithm according to a preset group number to obtain the preset number of character groups; and all source characters in one or more of the character groups are replaced with their corresponding genetic words in the genetic word stock to obtain a file with embedded genetic words. With this method, when a file with embedded genetic words is identified, the character information in the file is read more accurately.

Description

Gene word-based document processing method and device
Technical field
The present invention relates to the field of document processing, and in particular to a gene word-based document processing method and device.
Background art
Electronic official-document exchange is a technology for transmitting electronic official documents between different units over a computer information network. With the development of information technology, and of Internet technology in particular, units and the departments within a unit can be interconnected through a local area network or the World Wide Web. At the same time, units and departments now generally draft official documents with computer word-processing software. Electronic official-document exchange builds on this foundation: by standardizing the electronic document format and unifying the delivery workflow and its records, it provides a technology and system for secure transmission over the Internet, so that an official document can be delivered quickly in electronic form from the issuing unit to the receiving unit over the network. No dedicated courier is needed between units, which reduces workload and improves efficiency.
With the continuous development of information technology, the exchange of official documents, electronic ones in particular, is becoming increasingly frequent. Whether in the administration of state affairs by Party and government organs or in the daily management of enterprises and institutions, official documents are an important carrier for conveying important information and implementing directives from higher levels. It is therefore particularly important to strengthen the management of official documents, especially electronic ones, so that they have a degree of confidentiality and resistance to forgery; for the special documents of certain special departments, confidentiality and anti-counterfeiting matter even more. In the prior art, most official documents have no anti-counterfeiting function, and their source and authenticity are usually judged from the serial number or official seal on the document. However, a serial number can easily be covered or copied, and current color scanning, copying and printing technology makes the official seal easy to reproduce as well.
The prior art addresses these problems through encrypted identification, which is generally implemented with digital watermarking of text. Digital watermarking is an important technique in the field of information hiding, although image watermarking is the more common form. In practice, a large amount of text (such as electronic official documents) must be kept confidential. Encryption can restrict electronic text from flowing out of an electronic document system, and such systems often also restrict conversion to paper, for example by limiting the number of printouts. Once a document has been converted to paper, however, the system can no longer restrict its copying, and the original source of the paper document usually cannot be traced.
Gene words are the set of all characters in a special-purpose character library. Their glyphs differ only subtly from those of the original character library, so they are hard to forge or notice, yet they can be detected easily by a dedicated program. By embedding gene words, a technician can therefore address the problem that the number of printouts or copies of a document converted to paper cannot be limited. However, existing ways of embedding gene words in a document have unbalanced redundancy and low utilization, so character recognition accuracy is low when the system reads a document in which gene words have been embedded.
No effective solution has yet been proposed for this problem in the related art, namely that unbalanced redundancy and low utilization in gene word embedding lead to low character recognition accuracy when the system reads a document embedded with gene words.
Summary of the invention
The present invention is proposed in view of the problem, for which no effective solution has yet been put forward, that gene word embedding in the related art has unbalanced redundancy and low utilization, so that character recognition accuracy is low when the system reads a document embedded with gene words. To this end, the main purpose of the present invention is to provide a gene word-based document processing method and device that solve this problem.
To achieve the above purpose, according to one aspect of the present invention, a gene word-based document processing method is provided. The method comprises: extracting one or more source characters from a source document according to a gene character library to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene character library; calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character; grouping the source characters of the sorted source character set according to a snakelike algorithm and a preset group number, to obtain the preset number of character groups; and replacing all source characters in one or more of the character groups with their corresponding gene words in the gene character library, to obtain a document embedded with gene words.
Further, calculating the repetition frequency of each source character in the source character set and sorting the source characters according to the repetition frequency and the character internal code of each source character comprises: sorting the source characters in the source character set in order of repetition frequency from high to low, to obtain a first ordered set of the source character set; and sorting the source characters in the first ordered set that have the same repetition frequency in descending or ascending order of character internal code.
Further, calculating the repetition frequency of each source character in the source character set and sorting the source characters according to the repetition frequency and the character internal code of each source character comprises: sorting the source characters in the source character set in order of repetition frequency from low to high, to obtain a first ordered set of the source character set; and sorting the source characters in the first ordered set that have the same repetition frequency in descending or ascending order of character internal code.
Further, before the source characters of the sorted source character set are grouped according to the snakelike algorithm and the preset group number to obtain the preset number of character groups, the method further comprises: setting embedding information to obtain the number of bits of the embedding information, wherein the number of bits of the embedding information is the preset group number; and encrypting the embedding information to obtain secure embedding information.
Further, after the source characters of the sorted source character set are grouped according to the snakelike algorithm and the preset group number to obtain the preset number of character groups, the method further comprises: reading the character information of all source characters in each character group to obtain the corresponding information of each character group, wherein, in any character group, when the number of source characters whose character information is 0 is greater than the number whose character information is 1, the corresponding information of that character group is 0; and when the number of source characters whose character information is 1 is greater than the number whose character information is 0, the corresponding information of that character group is 1.
Further, replacing all source characters in one or more of the character groups with their corresponding gene words in the gene character library to obtain a document embedded with gene words comprises: when the corresponding information of a character group is 0, replacing all source characters of that character group with their corresponding gene words in the gene character library; and when the corresponding information of a character group is 1, performing no replacement on the source characters of that character group.
Further, replacing all source characters in one or more of the character groups with their corresponding gene words in the gene character library to obtain a document embedded with gene words comprises: when the corresponding information of a character group is 1, replacing all source characters of that character group with their corresponding gene words in the gene character library; and when the corresponding information of a character group is 0, performing no replacement on the source characters of that character group.
To achieve the above purpose, according to another aspect of the present invention, a gene word-based document processing device is provided. The device comprises: an extraction module, used to extract one or more source characters from a source document according to a gene character library to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene character library; a processing module, used to calculate the repetition frequency of each source character in the source character set and to sort the source characters in the source character set according to the repetition frequency and the character internal code of each source character; a grouping module, used to group the source characters of the sorted source character set according to a snakelike algorithm and a preset group number, to obtain the preset number of character groups; and a replacement module, used to replace all source characters in one or more of the character groups with their corresponding gene words in the gene character library, to obtain a document embedded with gene words.
Further, the processing module comprises: a first sorting module, used to sort the source characters in the source character set in order of repetition frequency from high to low or from low to high, to obtain a first ordered set of the source character set; and a second sorting module, used to sort the source characters in the first ordered set that have the same repetition frequency in descending or ascending order of character internal code.
Further, the device also comprises: a setting module, used to set embedding information to obtain the number of bits of the embedding information, wherein the number of bits of the embedding information is the preset group number, and to encrypt the embedding information to obtain secure embedding information.
Further, the device also comprises: a reading module, used to read the character information of all source characters in each character group to obtain the corresponding information of each character group, wherein, in any character group, when the number of source characters whose character information is 0 is greater than the number whose character information is 1, the corresponding information of that character group is 0; and when the number of source characters whose character information is 1 is greater than the number whose character information is 0, the corresponding information of that character group is 1.
Further, the replacement module comprises: a first replacement module, used to replace all source characters of a character group with their corresponding gene words in the gene character library when the corresponding information of the character group is 0, and to perform no replacement on the source characters of the character group when its corresponding information is 1; or a second replacement module, used to replace all source characters of a character group with their corresponding gene words in the gene character library when the corresponding information of the character group is 1, and to perform no replacement on the source characters of the character group when its corresponding information is 0.
With the present invention, one or more source characters are extracted from a source document according to a gene character library to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene character library; the repetition frequency of each source character in the source character set is calculated, and the source characters are sorted according to the repetition frequency and the character internal code of each source character; the source characters of the sorted source character set are grouped according to a snakelike algorithm and a preset group number to obtain the preset number of character groups; and all source characters in one or more of the character groups are replaced with their corresponding gene words in the gene character library to obtain a document embedded with gene words. Because this way of embedding gene words uses a balanced statistical method that makes gene words easy to reuse, it solves the problem in the related art that unbalanced redundancy and low utilization lead to low character recognition accuracy when the system reads a document embedded with gene words. As a result, when a document embedded with gene words is identified, the character information in it is read more accurately.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a structural diagram of a gene word-based document processing device according to an embodiment of the present invention;
Fig. 2 is a flowchart of a gene word-based document processing method according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features of those embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 is a structural diagram of a gene word-based document processing device according to an embodiment of the present invention. As shown in Fig. 1, the device comprises: an extraction module 10, used to extract one or more source characters from a source document according to a gene character library to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene character library; a processing module 30, used to calculate the repetition frequency of each source character in the source character set and to sort the source characters in the source character set according to the repetition frequency and the character internal code of each source character; a grouping module 50, used to group the source characters of the sorted source character set according to a snakelike algorithm and a preset group number, to obtain the preset number of character groups; and a replacement module 70, used to replace all source characters in one or more of the character groups with their corresponding gene words in the gene character library, to obtain a document embedded with gene words.
In the above embodiment of this application, the processing module and the grouping module sort the extracted source characters that correspond to gene words and group them in snakelike order; once this is done, the corresponding source characters in the original document are replaced with gene words from the gene character library. The device uses different combinations of gene words, their word frequencies and the grouping to carry a large amount of information. Because the balanced statistical technique of snakelike ordering makes gene words easy to reuse, the device solves the problem that existing gene word embedding has unbalanced redundancy and low utilization, which leads to low character recognition accuracy when the system reads a document embedded with gene words. Character information is therefore read more accurately when such a document is identified, and the robustness of a document using gene words is greatly improved when the embedded information is extracted.
The device embeds gene words in the original document according to established rules. Because the redundancy of each group of gene words embedded in the original document is dynamically balanced, the information is well hidden; even after the document has been printed or copied many times, the system can judge accurately whether the encrypted document has exceeded the permitted number of printouts or copies.
The processing module in the above embodiment of this application may comprise: a first sorting module 301, used to sort the source characters in the source character set in order of repetition frequency from high to low or from low to high, to obtain a first ordered set of the source character set; and a second sorting module 302, used to sort the source characters in the first ordered set that have the same repetition frequency in descending or ascending order of character internal code. The sorting combinations in this embodiment are equivalent in effect; their main purpose is to supply the subsequent grouping process with source characters ordered in a balanced way on the basis of the statistics.
The combination of the processing module and the grouping module in the above embodiment dynamically allocates the number of characters in each group according to the word frequencies obtained from the statistics, thereby dynamically balancing the redundancy of the characters in each group. This helps hide the information and allows the relevant gene words to be reused, which greatly improves the utilization frequency of gene words and the balance of information across groups, and helps increase the amount of information that can be embedded.
The device in the above embodiment of this application may also comprise: a setting module 80, used to set embedding information to obtain the number of bits of the embedding information, wherein the number of bits of the embedding information is the preset group number, and to encrypt the embedding information to obtain secure embedding information. This embodiment presets the number of groups into which the characters are divided, and the embedding information may be encrypted to improve security. For example, when the embedding information is 0110, encryption can make the embedding information seen by a non-legitimate user appear as 0011 or 1100 rather than 0110; only a legitimate user can recognize the correct embedding information.
Therefore, for the source characters of all character sets in the source document, the above embodiment dynamically allocates the number of characters in each group according to the code length of the embedding information and the word frequencies obtained from the statistics. This dynamically balances the redundancy in each group of source characters, helps hide the information, and allows the relevant gene words to be reused.
The device in the above embodiment may also comprise: a reading module 90, used to read the character information of all source characters in each character group to obtain the corresponding information of each character group, wherein, in any character group, when the number of source characters whose character information is 0 is greater than the number whose character information is 1, the corresponding information of that character group is 0; and when the number of source characters whose character information is 1 is greater than the number whose character information is 0, the corresponding information of that character group is 1.
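The majority rule applied by the reading module can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `group_information` and the list-of-bits input are hypothetical, and because the text does not say what happens when the counts are equal, the sketch returns `None` for a tie.

```python
def group_information(character_bits):
    """Return the corresponding information (0 or 1) of one character group.

    `character_bits` holds the character information (0 or 1) read from
    every source character in the group. A tie is not covered by the
    description, so it is reported as None (an assumption of this sketch).
    """
    zeros = character_bits.count(0)
    ones = character_bits.count(1)
    if zeros > ones:
        return 0
    if ones > zeros:
        return 1
    return None


# A few misread characters do not change the group's information.
print(group_information([0, 0, 1]))     # 0
print(group_information([1, 1, 0, 1]))  # 1
```

Because the result depends only on the majority, a group tolerates some misrecognized characters, which is why balanced group sizes matter for reading accuracy.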
The replacement module in the above embodiment of this application may comprise: a first replacement module, used to replace all source characters of a character group with their corresponding gene words in the gene character library when the corresponding information of the character group is 0, and to perform no replacement on the source characters of the character group when its corresponding information is 1; or a second replacement module, used to replace all source characters of a character group with their corresponding gene words in the gene character library when the corresponding information of the character group is 1, and to perform no replacement on the source characters of the character group when its corresponding information is 0.
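The two replacement conventions differ only in which bit value triggers replacement, so both can be sketched with one parameter. The names `embed_gene_words` and `replace_on` are hypothetical, and the gene character library is modeled as a plain dictionary in which uppercase letters stand in for gene glyphs; a real gene word is a subtly different glyph, not a different letter.

```python
def embed_gene_words(text, groups, group_bits, gene_library, replace_on=0):
    """Replace the source characters of selected groups with gene words.

    groups       -- list of character groups (lists of source characters)
    group_bits   -- corresponding information (0 or 1) of each group
    gene_library -- hypothetical mapping: source character -> gene word
    replace_on   -- 0 models the first replacement module, 1 the second
    """
    to_replace = {
        ch
        for group, bit in zip(groups, group_bits)
        if bit == replace_on
        for ch in group
    }
    return "".join(gene_library[ch] if ch in to_replace else ch for ch in text)


library = {"a": "A", "b": "B"}          # stand-in gene words
groups = [["a"], ["b"]]
bits = [0, 1]

print(embed_gene_words("abcab", groups, bits, library, replace_on=0))  # AbcAb
print(embed_gene_words("abcab", groups, bits, library, replace_on=1))  # aBcaB
```

Characters outside the library (here `c`) are never touched, matching the requirement that only source characters with a corresponding gene word take part.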
Fig. 2 is a flowchart of a gene word-based document processing method according to an embodiment of the present invention. As shown in Fig. 2, the method comprises the following steps:
Step S102: the extraction module in Fig. 1 extracts one or more source characters from the source document according to the gene character library to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene character library.
Step S104: the processing module in Fig. 1 calculates the repetition frequency of each source character in the source character set and sorts the source characters in the source character set according to the repetition frequency and the character internal code of each source character.
Step S106: the grouping module in Fig. 1 groups the source characters of the sorted source character set according to the snakelike algorithm and the preset group number, to obtain the preset number of character groups.
Step S108: the replacement module in Fig. 1 replaces all source characters in one or more of the character groups with their corresponding gene words in the gene character library, to obtain a document embedded with gene words.
In the above embodiment of this application, the extracted source characters that correspond to gene words are sorted and grouped in snakelike order, after which the corresponding source characters in the original document are replaced with gene words from the gene character library. The method uses different combinations of gene words, their word frequencies and the grouping to carry a large amount of information. Because the balanced statistical technique of snakelike ordering makes gene words easy to reuse, the method solves the problem that existing gene word embedding has unbalanced redundancy and low utilization, which leads to low character recognition accuracy when the system reads a document embedded with gene words. Character information is therefore read more accurately when such a document is identified, and the robustness of a document using gene words is greatly improved when the embedded information is extracted.
The method embeds gene words in the original document according to established rules. Because the redundancy of each group of gene words embedded in the original document is dynamically balanced, the information is well hidden; even after the document has been printed or copied many times, the system can judge accurately whether the encrypted document has exceeded the permitted number of printouts or copies.
Step S104 in the above embodiment of this application, namely calculating the repetition frequency of each source character in the source character set and sorting the source characters according to the repetition frequency and the character internal code of each source character, may be implemented through the following steps: sorting the source characters in the source character set in order of repetition frequency from high to low, to obtain a first ordered set of the source character set; and sorting the source characters in the first ordered set that have the same repetition frequency in descending or ascending order of character internal code. Alternatively, step S104 may be implemented through the following steps: sorting the source characters in the source character set in order of repetition frequency from low to high, to obtain a first ordered set of the source character set; and sorting the source characters in the first ordered set that have the same repetition frequency in descending or ascending order of character internal code. After counting the number (including repetitions) and the word frequency of the source characters in the source document that have corresponding gene words, the above embodiment sorts the source characters of the source character set by word frequency, and by character internal code when word frequencies are equal. This guarantees that the character ordering is unique.
Specifically, the method is implemented as follows. First, the source characters in the source document that correspond to gene words in the gene character library are extracted, and the word frequency of each extracted source character is then counted. The character information of each source character can be represented by a binary digit (0 or 1); when the word frequency with which a source character appears as a gene word is n, that source character can be characterized by n bits of 0 or 1.
The following passage is used as an example. The passage of the source document, rendered character by character, is: "the mountain is main to be had longly in the exploitation has skill that living mountain master is arranged". Each character of the passage is treated as a source character. By comparing the passage against the gene character library and looking up the gene words that correspond to its characters, a source character set is obtained consisting of the characters for "mountain", "main", the two characters of "exploitation", "long", "skill" and "living". Each source character in this set has a corresponding gene word in the gene character library; that is, the passage contains 7 characters with corresponding gene words in total (not counting repetitions).
The statistics give:
[Figure: table listing the word frequency of each source character; image not reproduced]
After the word frequency of the genetic word corresponding to each source character has been counted, the internal code of each source character is also obtained. Suppose the characters are sorted by repetition frequency from high to low and, within the same frequency, by character internal code in ascending order. For example, 山 and 主 both have a word frequency of 2; the internal code of 山 is 0x5c71 and the internal code of 主 is 0x4e3b, so 主 is placed before 山, and the remaining characters are handled in the same way. The sorted character sequence is then 主, 山, 发, 开, 技, 生, 长, with word frequencies 2, 2, 1, 1, 1, 1, 1 respectively.
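The counting and sorting of step S104 can be sketched as below. The sketch assumes that the character internal code equals the Unicode code point returned by `ord`, which agrees with the two codes quoted above (0x5c71 and 0x4e3b). The function name, its parameters and the sample string are hypothetical.

```python
from collections import Counter

STOCK_CHARS = set("山主开发长技生")   # characters that have a genetic word
SAMPLE = "山主开发长技生，山主"        # stand-in text with the example's counts


def sort_source_characters(text, stock_chars,
                           high_to_low=True, code_ascending=True):
    """Sort the distinct source characters by repetition frequency,
    breaking ties by internal code (assumed to be the code point)."""
    counts = Counter(ch for ch in text if ch in stock_chars)

    def key(ch):
        freq = -counts[ch] if high_to_low else counts[ch]
        code = ord(ch) if code_ascending else -ord(ch)
        return (freq, code)

    return sorted(counts, key=key), counts


ordered, counts = sort_source_characters(SAMPLE, STOCK_CHARS)
print(ordered, [counts[ch] for ch in ordered])
```

Because no two characters share a code point, the two-part key never produces a tie, which is what makes the order unique.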
Based on the above embodiment, before step S106 (grouping the sorted source characters in the source character set by the snakelike algorithm according to the preset group number, to obtain the preset number of character groups), the method may further include: setting the embedding information to obtain the number of bits of the embedding information, where the number of bits of the embedding information is the preset group number; and encrypting the embedding information to obtain secure embedding information. Continuing with the same example passage, the embedding information can be set as required. For example, if the embedding information is set to 0110, it has 4 bits, so the source characters in the source character set, once sorted as described above, are distributed into 4 groups. In addition, the embedding information can be encrypted to improve security. For example, when the embedding information is 0110, encryption can make the embedding information seen by non-legitimate users 0011 or 1100 rather than 0110, so that only a legitimate user can recognize the correct embedding information.
Once the number of groups determined by the embedding information has been obtained, each sorted source character in the source character set is assigned to a character group by the snakelike algorithm, according to the number of bits of the embedding information. Specifically, according to the length of the embedding information, each group first takes one source character in turn; the remaining characters are then distributed to the character groups following the snakelike principle until all genetic words have been assigned. This handles uneven word frequencies during assignment, dynamically balances the redundancy within each group, and allows the relevant genetic words to be reused. In other words, the number of source characters represented by 0 or 1 is roughly the same in every character group. This avoids the problem, seen with other grouping methods, of some character groups containing very many or very few such source characters, and it prevents a character group from having so few 0s or 1s characterizing its source characters that the information of that group is read incorrectly. As a result, after the source characters in the character groups are replaced with genetic words, the utilization of the genetic words is higher, and the robustness of subsequent recognition or detection of the file is greatly improved.
Specifically, continuing with the same example passage and the embedding information 0110 (length 4), the sorted characters are divided into the following four groups:
First group: 主
Second group: 山, 长
Third group: 发, 生
Fourth group: 开, 技
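The four groups listed above are what a back-and-forth ("snake") distribution of the sorted sequence 主, 山, 发, 开, 技, 生, 长 produces. The sketch below is one reading of the snakelike algorithm, offered as an assumption; the function name `snake_group` is hypothetical.

```python
def snake_group(sorted_chars, group_count):
    """Distribute sorted_chars over group_count groups, filling the groups
    left to right on even passes and right to left on odd passes."""
    groups = [[] for _ in range(group_count)]
    for i, ch in enumerate(sorted_chars):
        pass_no, pos = divmod(i, group_count)
        # Reverse the direction on every second pass.
        index = pos if pass_no % 2 == 0 else group_count - 1 - pos
        groups[index].append(ch)
    return groups


EMBEDDING_INFO = "0110"                  # 4 bits, so 4 groups
SORTED_CHARS = list("主山发开技生长")     # order after frequency/code sorting

groups = snake_group(SORTED_CHARS, len(EMBEDDING_INFO))
for number, group in enumerate(groups, start=1):
    print(number, group)
```

Reversing direction on each pass pairs the characters that come early in the sorted order with those that come late, which is how the scheme keeps the total number of occurrences per group roughly even.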
After the source characters in each group have been replaced with the corresponding genetic words from the genetic word stock, the system recognizes the document, that is, it detects the genetic words. Because no character group has too few 0s or 1s characterizing its source characters, the embedded information is not corrupted (a bit that should be 0 is not read as 1, and a bit that should be 1 is not read as 0), which improves the accuracy of genetic word detection.
Based on the above embodiment, after step S106 (grouping the sorted source characters in the source character set by the snakelike algorithm according to the preset group number, to obtain the preset number of character groups), the method may further include: reading the character information of all source characters in each character group to obtain the corresponding information of each character group. In any character group, when the number of source characters whose character information is 0 is greater than the number of source characters whose character information is 1, the corresponding information of the character group is 0; when the number of source characters whose character information is 1 is greater than the number of source characters whose character information is 0, the corresponding information of the character group is 1. In this embodiment, because each character group consists of a number of bits that are each 0 or 1, whether the group is characterized by 0 or by 1 can be determined from the numbers of 0s and 1s in the group. For example, if a character group contains more source characters characterized by "0" than by "1", the group is characterized by "0".
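The majority rule for reading one character group can be sketched as follows. The function name `group_information` is hypothetical, and the source does not say what happens when a group holds equally many 0s and 1s; returning `None` in that case is an assumption of this sketch.

```python
def group_information(bits):
    """Return 0 or 1 according to which value is more frequent in bits,
    the list of 0/1 readings for one character group.
    Returns None on a tie (a case the source leaves unspecified)."""
    zeros = bits.count(0)
    ones = bits.count(1)
    if zeros > ones:
        return 0
    if ones > zeros:
        return 1
    return None


# One reading per occurrence: a character with word frequency n contributes n bits.
print(group_information([0, 0, 1]))
print(group_information([1, 1, 1, 0]))
```

A single misread occurrence does not change the result as long as the group still has a clear majority, which is why groups with comparable numbers of occurrences are desirable.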
In the above embodiment of the present application, step S108 (replacing all source characters in one or more character groups with the corresponding genetic words in the genetic word stock, to obtain a file with embedded genetic words) may be implemented as follows: when the corresponding information of a character group is 0, all source characters of the group are replaced with the corresponding genetic words in the genetic word stock; when the corresponding information of a character group is 1, no replacement is performed on the source characters of the group. Alternatively, the step may be implemented the other way round: when the corresponding information of a character group is 1, all source characters of the group are replaced with the corresponding genetic words in the genetic word stock; when the corresponding information of a character group is 0, no replacement is performed on the source characters of the group. In a concrete implementation, the system can define the meaning of 0 and 1 as required: it can define that all source characters in a character group characterized by "0" are to be replaced with genetic words while those in a group characterized by "1" are left unchanged, or it can define that all source characters in a character group characterized by "1" are to be replaced with genetic words.
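The replacement step can be sketched as below, assuming that each group's corresponding information is the matching bit of the embedding information. All names are hypothetical, the bracketed strings are placeholders for genetic-word glyphs, and `SAMPLE` is a stand-in text rather than the patent's passage. The `replace_on` parameter selects which of the two conventions described above is used.

```python
# Hypothetical stock: source character -> placeholder for its genetic word.
GENETIC_STOCK = {ch: f"[{ch}]" for ch in "山主开发长技生"}
SAMPLE = "山主开发长技生，山主"
GROUPS = [["主"], ["山", "长"], ["发", "生"], ["开", "技"]]


def embed(text, groups, embedding_bits, stock, replace_on="0"):
    """Replace every occurrence of the characters belonging to groups whose
    bit equals replace_on; all other characters are left unchanged."""
    to_replace = set()
    for group, bit in zip(groups, embedding_bits):
        if bit == replace_on:
            to_replace.update(group)
    return "".join(stock[ch] if ch in to_replace else ch for ch in text)


print(embed(SAMPLE, GROUPS, "0110", GENETIC_STOCK))
print(embed(SAMPLE, GROUPS, "0110", GENETIC_STOCK, replace_on="1"))
```

With the embedding information 0110 and replacement on "0", the first and fourth groups (主, 开, 技) are replaced; with replacement on "1", the second and third groups (山, 长, 发, 生) are replaced instead.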
The main advantages of the present invention are that the embedding replacement is simple, fast and easy to implement; the utilization of genetic words is high; the redundancy is relatively balanced; the information hiding is good; and the amount of information that can be embedded is relatively large.
As can be seen from the above, the method embodiment of the present application embeds the corresponding genetic words from the genetic word table into an existing original file. The key is the use of a distribution technique based on the snakelike algorithm during embedding: the set of source characters to be replaced with genetic words is first extracted from the original file, and the source characters in this set are then sorted and grouped, which balances source characters of different word frequencies. In addition, so that the validity and integrity of the original file and the genetic word table can be verified, the embedding information and its number of bits need to be set.
It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in a different order.
From the above description it can be seen that the present invention achieves the following technical effects: the way in which the invention embeds genetic words in a document makes the embedding replacement simple, fast and easy to implement; the utilization of genetic words is high; the redundancy is relatively balanced; the information hiding is good; and the amount of information that can be embedded is relatively large.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention can have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A genetic word-based file processing method, characterized by comprising:
extracting one or more source characters from an original file according to a genetic word stock to obtain a source character set, wherein the source characters in the source character set have corresponding genetic words in the genetic word stock;
calculating a repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and a character internal code of each source character;
grouping the sorted source characters in the source character set by a snakelike algorithm according to a preset group number, to obtain the preset number of character groups;
replacing all source characters in one or more of the character groups with the corresponding genetic words in the genetic word stock, to obtain a file with embedded genetic words.
2. The method according to claim 1, characterized in that calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character, comprises:
sorting the source characters in the source character set by repetition frequency from high to low, to obtain a first sorted set of the source character set;
sorting the source characters with the same repetition frequency in the first sorted set by character internal code in descending or ascending order.
3. The method according to claim 1, characterized in that calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character, comprises:
sorting the source characters in the source character set by repetition frequency from low to high, to obtain a first sorted set of the source character set;
sorting the source characters with the same repetition frequency in the first sorted set by character internal code in descending or ascending order.
4. The method according to any one of claims 1-3, characterized in that, before grouping the sorted source characters in the source character set by the snakelike algorithm according to the preset group number to obtain the preset number of character groups, the method further comprises:
setting embedding information to obtain the number of bits of the embedding information, wherein the number of bits of the embedding information is the preset group number;
encrypting the embedding information to obtain secure embedding information.
5. The method according to any one of claims 1-3, characterized in that, after grouping the sorted source characters in the source character set by the snakelike algorithm according to the preset group number to obtain the preset number of character groups, the method further comprises:
reading the character information of all source characters in each character group to obtain the corresponding information of each character group, wherein,
in any character group, when the number of source characters whose character information is 0 is greater than the number of source characters whose character information is 1, the corresponding information of the character group is 0;
when the number of source characters whose character information is 1 is greater than the number of source characters whose character information is 0, the corresponding information of the character group is 1.
6. The method according to claim 5, characterized in that replacing all source characters in one or more of the character groups with the corresponding genetic words in the genetic word stock, to obtain a file with embedded genetic words, comprises:
when the corresponding information of the character group is 0, replacing all source characters of the character group with the corresponding genetic words in the genetic word stock;
when the corresponding information of the character group is 1, performing no replacement on the source characters of the character group.
7. The method according to claim 5, characterized in that replacing all source characters in one or more of the character groups with the corresponding genetic words in the genetic word stock, to obtain a file with embedded genetic words, comprises:
when the corresponding information of the character group is 1, replacing all source characters of the character group with the corresponding genetic words in the genetic word stock;
when the corresponding information of the character group is 0, performing no replacement on the source characters of the character group.
8. A genetic word-based file processing device, characterized by comprising:
an extraction module, configured to extract one or more source characters from an original file according to a genetic word stock to obtain a source character set, wherein the source characters in the source character set have corresponding genetic words in the genetic word stock;
a processing module, configured to calculate a repetition frequency of each source character in the source character set, and to sort the source characters in the source character set according to the repetition frequency and a character internal code of each source character;
a grouping module, configured to group the sorted source characters in the source character set by a snakelike algorithm according to a preset group number, to obtain the preset number of character groups;
a replacement module, configured to replace all source characters in one or more of the character groups with the corresponding genetic words in the genetic word stock, to obtain a file with embedded genetic words.
9. The device according to claim 8, characterized in that the processing module comprises:
a first sorting module, configured to sort the source characters in the source character set by repetition frequency from high to low or from low to high, to obtain a first sorted set of the source character set;
a second sorting module, configured to sort the source characters with the same repetition frequency in the first sorted set by character internal code in descending or ascending order.
10. The device according to claim 8 or 9, characterized in that the device further comprises:
a setting module, configured to set embedding information to obtain the number of bits of the embedding information, wherein the number of bits of the embedding information is the preset group number, and to encrypt the embedding information to obtain secure embedding information.
11. The device according to claim 8 or 9, characterized in that the device further comprises:
a reading module, configured to read the character information of all source characters in each character group to obtain the corresponding information of each character group, wherein, in any character group, when the number of source characters whose character information is 0 is greater than the number of source characters whose character information is 1, the corresponding information of the character group is 0; when the number of source characters whose character information is 1 is greater than the number of source characters whose character information is 0, the corresponding information of the character group is 1.
12. The device according to claim 11, characterized in that the replacement module comprises:
a first replacement module, configured to replace all source characters of the character group with the corresponding genetic words in the genetic word stock when the corresponding information of the character group is 0, and to perform no replacement on the source characters of the character group when the corresponding information of the character group is 1; or,
a second replacement module, configured to replace all source characters of the character group with the corresponding genetic words in the genetic word stock when the corresponding information of the character group is 1, and to perform no replacement on the source characters of the character group when the corresponding information of the character group is 0.
CN201110400253.4A 2011-12-06 2011-12-06 Genetic word-based file processing method and device Active CN102495881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110400253.4A CN102495881B (en) 2011-12-06 2011-12-06 Genetic word-based file processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110400253.4A CN102495881B (en) 2011-12-06 2011-12-06 Genetic word-based file processing method and device

Publications (2)

Publication Number Publication Date
CN102495881A true CN102495881A (en) 2012-06-13
CN102495881B CN102495881B (en) 2014-06-25

Family

ID=46187706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110400253.4A Active CN102495881B (en) 2011-12-06 2011-12-06 Genetic word-based file processing method and device

Country Status (1)

Country Link
CN (1) CN102495881B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1740943A (en) * 2004-08-27 2006-03-01 北京北大方正电子有限公司 A file enciphering method
WO2007062554A1 (en) * 2005-12-01 2007-06-07 Peking University Founder Group Co. Ltd A method and device for embedding digital watermark into a text document and detecting it
US20070157123A1 (en) * 2005-12-22 2007-07-05 Yohei Ikawa Character string processing method, apparatus, and program

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103199548A (en) * 2013-04-02 2013-07-10 国家电网公司 Capacitor grouping balancing system and capacitor grouping balancing method
CN107169722A (en) * 2017-03-23 2017-09-15 高泽 A kind of complete intelligent tracing management system of official document operating and method
CN117891787A (en) * 2024-03-15 2024-04-16 武汉磐电科技股份有限公司 Current transformer quantity value tracing data processing method, system and equipment
CN117891787B (en) * 2024-03-15 2024-05-28 武汉磐电科技股份有限公司 Current transformer quantity value tracing data processing method, system and equipment

Also Published As

Publication number Publication date
CN102495881B (en) 2014-06-25

Similar Documents

Publication Publication Date Title
Camara et al. Distortion‐Free Watermarking Approach for Relational Database Integrity Checking
US20060095775A1 (en) Fragile watermarks
CN105512523B (en) The digital watermark embedding and extracting method of a kind of anonymization
CN104463529A (en) Logistics distribution bill generating method based on two-dimension code and encryption technology
CN108829899B (en) Data table storage, modification, query and statistical method
CN104392197A (en) Method for increasing reading rate and encryption of website two-dimensional code tags
CN106126982B (en) A kind of PDF document copy-right protection method based on digital finger-print
CN114884697B (en) Data encryption and decryption method and related equipment based on cryptographic algorithm
CN102647423A (en) Identifying method and system of digital signature and seal
CN109840401A (en) For the watermark embedding method of data text
CN102495881B (en) Genetic word-based file processing method and device
CN113822675A (en) Block chain based message processing method, device, equipment and storage medium
CN104504507A (en) Network verification system with verification code seal and operation method of network verification system
CN102194292B (en) Billing server, tax copying system and tax copying method
CN111612963A (en) Bill voucher anti-counterfeiting detection method and device based on intelligent equipment
CN115840787A (en) Supply chain data sharing method, device, equipment and medium based on block chain
Tiwari et al. A novel watermarking scheme for secure relational databases
CN102270312B (en) Method for making point bitmap, and goods-fleeing prevention verification method
CN102142073A (en) System for preventing and identifying disclosure of paper documents based on hidden watermarks
CN106910149A (en) Replacement number generation system and the generation method of a kind of citizen ID certificate number
CN112910923A (en) Intelligent financial big data processing system
CN102833069A (en) Cross-platform electronic certificate based on combination of plain code and password
CN102496137B (en) Method and device for dynamically generating watermark
CN103984550A (en) Serial number elongation algorithm of distributed modular system
Murugan et al. A robust watermarking technique for copyright protection for relational databases

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant