CN102495881A

CN102495881A - Genetic word-based file processing method and device

Info

Publication number: CN102495881A
Application number: CN2011104002534A
Authority: CN
Inventors: 郝佳
Original assignee: Founder International Co Ltd; Founder International Beijing Co Ltd
Current assignee: Founder International Co Ltd; Founder International Beijing Co Ltd
Priority date: 2011-12-06
Filing date: 2011-12-06
Publication date: 2012-06-13
Anticipated expiration: 2031-12-06
Also published as: CN102495881B

Abstract

The invention discloses a genetic-word-based file processing method and device. The method comprises: extracting one or more source characters from an original file according to a genetic word stock to obtain a source character set, wherein each source character in the set has a corresponding genetic word in the genetic word stock; calculating the repetition frequency of each source character in the set, and sorting the source characters by repetition frequency and character internal code; grouping the sorted source characters into a preset number of groups using a snake-like algorithm to obtain that number of character groups; and replacing all source characters in one or more of the character groups with the corresponding genetic words in the genetic word stock to obtain a file with embedded genetic words. With this method, character information is read more accurately when a file with embedded genetic words is identified.

Description

Document processing method and device based on gene words

Technical field

The present invention relates to the field of document processing, and in particular to a gene-word-based document processing method and device.

Background technology

Electronic document exchange is a technology for transmitting electronic official documents between different organizations over a computer network. With the development of information technology, and of the Internet in particular, organizations and their internal departments can connect to one another through a LAN or the World Wide Web. At the same time, organizations and departments generally draft official documents with word-processing software. Electronic document exchange builds on this: by standardizing the electronic document format and unifying the delivery process and its records, it provides a technology and system for secure transmission over the Internet, so that an official document can be delivered quickly from the issuing unit to the receiving unit in electronic form. No courier is needed to carry documents between units, which reduces workload and improves efficiency.

As information technology continues to develop, the exchange of official documents, and of electronic documents in particular, grows more frequent by the day. Whether in the management of state affairs by Party and government organs or in the daily administration of enterprises and institutions, official documents are an important carrier for passing on important information and implementing directives from higher levels. It is therefore particularly important to strengthen the management of official documents, especially electronic ones, so that they have a degree of confidentiality and resistance to forgery; for the special documents of certain special agencies, confidentiality and anti-counterfeiting matter even more. In the prior art, most official documents have no anti-counterfeiting function, and their source and authenticity are usually judged from a serial number or an official seal on the document. However, a serial number can easily be covered or copied, and current colour scanning, photocopying and printing technology makes the official seal easy to copy as well.

The prior art addresses these problems through encryption and identification. To achieve this, digital watermarking of text is generally used; digital watermarking is an important technique in the field of information hiding, and image watermarking is its more common form. In practice a large amount of text, such as electronic official documents, needs to be kept confidential. An electronic document system can restrict the outflow of encrypted electronic text, and such systems often also limit conversion to paper, for example by restricting the number of printings. Once a document has been converted to paper, however, the system can no longer restrict copying of it, and often cannot trace the original source of a paper copy.

Gene words are the set of all characters in a special-purpose character library. Their glyphs differ only slightly from those of the original character library, so they are hard to forge or notice, yet a dedicated program can detect them very easily. A technician can therefore embed gene words in a document to address the problem that printing or copying of a document converted to paper cannot be limited. However, existing ways of embedding gene words in a document have unbalanced redundancy and low utilization, so character recognition accuracy is low when the system reads a document with embedded gene words.

At present, no effective solution has been proposed for the problem that the related art's way of embedding gene words in a document has unbalanced redundancy and low utilization, which lowers character recognition accuracy when the system reads such a document.

Summary of the invention

The present invention is proposed to address the problem, for which no effective solution yet exists, that the related art's way of embedding gene words in a document has unbalanced redundancy and low utilization and therefore lowers character recognition accuracy when the system reads a document with embedded gene words. The main object of the invention is accordingly to provide a gene-word-based document processing method and device that solve this problem.

To achieve the above object, according to one aspect of the present invention, a gene-word-based document processing method is provided. The method comprises: extracting one or more source characters from an original document according to a gene character library to obtain a source character set, wherein each source character in the set has a corresponding gene word in the gene character library; calculating the repetition frequency of each source character in the source character set, and sorting the source characters according to their repetition frequency and character internal code; grouping the sorted source characters into a preset number of groups according to a snake-like algorithm to obtain that number of character groups; and replacing all source characters in one or more of the character groups with their corresponding gene words in the gene character library to obtain a document with embedded gene words.

Further, calculating the repetition frequency of each source character and sorting the source characters by repetition frequency and character internal code comprises: sorting the source characters in the source character set in descending order of repetition frequency to obtain a first ordered set of the source character set; and sorting the source characters in the first ordered set that share the same repetition frequency in descending or ascending order of character internal code.

Further, calculating the repetition frequency of each source character and sorting the source characters by repetition frequency and character internal code comprises: sorting the source characters in the source character set in ascending order of repetition frequency to obtain a first ordered set of the source character set; and sorting the source characters in the first ordered set that share the same repetition frequency in descending or ascending order of character internal code.

Further, before the sorted source characters are grouped into the preset number of groups according to the snake-like algorithm, the method also comprises: setting embedding information and obtaining its number of bits, wherein the number of bits of the embedding information is the preset number of groups; and encrypting the embedding information to obtain secure embedding information.

Further, after the sorted source characters are grouped into the preset number of groups according to the snake-like algorithm, the method also comprises: reading the character information of all source characters in each character group to obtain the information corresponding to each character group, wherein, in any character group, when the number of source characters whose character information is 0 exceeds the number whose character information is 1, the information corresponding to that group is 0; and when the number of source characters whose character information is 1 exceeds the number whose character information is 0, the information corresponding to that group is 1.

Further, replacing all source characters in one or more of the character groups with their corresponding gene words to obtain the document with embedded gene words comprises: when the information corresponding to a character group is 0, replacing all source characters of that group with their corresponding gene words in the gene character library; and when the information corresponding to a character group is 1, performing no replacement on the source characters of that group.

Further, replacing all source characters in one or more of the character groups with their corresponding gene words to obtain the document with embedded gene words comprises: when the information corresponding to a character group is 1, replacing all source characters of that group with their corresponding gene words in the gene character library; and when the information corresponding to a character group is 0, performing no replacement on the source characters of that group.

To achieve the above object, according to another aspect of the present invention, a gene-word-based document processing device is provided. The device comprises: an extraction module, configured to extract one or more source characters from an original document according to a gene character library to obtain a source character set, wherein each source character in the set has a corresponding gene word in the gene character library; a processing module, configured to calculate the repetition frequency of each source character in the source character set and to sort the source characters according to their repetition frequency and character internal code; a grouping module, configured to group the sorted source characters into a preset number of groups according to a snake-like algorithm to obtain that number of character groups; and a replacement module, configured to replace all source characters in one or more of the character groups with their corresponding gene words in the gene character library to obtain a document with embedded gene words.

Further, the processing module comprises: a first sorting module, configured to sort the source characters in the source character set in descending or ascending order of repetition frequency to obtain a first ordered set of the source character set; and a second sorting module, configured to sort the source characters in the first ordered set that share the same repetition frequency in descending or ascending order of character internal code.

Further, the device also comprises: a setting module, configured to set embedding information and obtain its number of bits, wherein the number of bits of the embedding information is the preset number of groups, and to encrypt the embedding information to obtain secure embedding information.

Further, the device also comprises: a reading module, configured to read the character information of all source characters in each character group to obtain the information corresponding to each character group, wherein, in any character group, when the number of source characters whose character information is 0 exceeds the number whose character information is 1, the information corresponding to that group is 0; and when the number of source characters whose character information is 1 exceeds the number whose character information is 0, the information corresponding to that group is 1.

Further, the replacement module comprises: a first replacement module, configured to replace all source characters of a character group with their corresponding gene words in the gene character library when the information corresponding to that group is 0, and to perform no replacement on the source characters of a group whose corresponding information is 1; or a second replacement module, configured to replace all source characters of a character group with their corresponding gene words in the gene character library when the information corresponding to that group is 1, and to perform no replacement on the source characters of a group whose corresponding information is 0.

In the present invention, one or more source characters are extracted from an original document according to a gene character library to obtain a source character set, wherein each source character in the set has a corresponding gene word in the gene character library; the repetition frequency of each source character in the set is calculated, and the source characters are sorted according to their repetition frequency and character internal code; the sorted source characters are grouped into a preset number of groups according to a snake-like algorithm to obtain that number of character groups; and all source characters in one or more of the character groups are replaced with their corresponding gene words in the gene character library to obtain a document with embedded gene words. Because this way of embedding gene words uses a balanced statistical method and makes gene words easy to reuse, it addresses the problem that the related art's embedding has unbalanced redundancy and low utilization and therefore lowers character recognition accuracy when the system reads a document with embedded gene words. As a result, character information is read more accurately when a document with embedded gene words is identified.

Description of drawings

The accompanying drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a structural diagram of the gene-word-based document processing device according to an embodiment of the invention;

Fig. 2 is a flowchart of the gene-word-based document processing method according to an embodiment of the invention.

Embodiment

It should be noted that, provided they do not conflict, the embodiments in this application and the features in them may be combined with one another. The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.

Fig. 1 is a structural diagram of the gene-word-based document processing device according to an embodiment of the invention. As shown in Fig. 1, the device comprises: an extraction module 10, configured to extract one or more source characters from an original document according to a gene character library to obtain a source character set, wherein each source character in the set has a corresponding gene word in the gene character library; a processing module 30, configured to calculate the repetition frequency of each source character in the source character set and to sort the source characters according to their repetition frequency and character internal code; a grouping module 50, configured to group the sorted source characters into a preset number of groups according to a snake-like algorithm to obtain that number of character groups; and a replacement module 70, configured to replace all source characters in one or more of the character groups with their corresponding gene words in the gene character library to obtain a document with embedded gene words.

In this embodiment, the processing module and the grouping module sort the extracted source characters that have corresponding gene words and group them in snake-like order; once grouping is complete, the corresponding source characters in the original document are replaced with gene words from the gene character library. The device uses different combinations of gene words and their word frequencies, together with grouping, to carry a large amount of information. Because the balanced distribution produced by snake-like ordering makes gene words easy to reuse, the device addresses the problem that existing embedding methods have unbalanced redundancy and low utilization and therefore lower character recognition accuracy when the system reads a document with embedded gene words. Character information is thus read more accurately when a document with embedded gene words is identified, and extraction of the embedded information from such a file becomes considerably more robust.

The device embeds gene words in the original document according to established rules. Because the redundancy of each group of gene words embedded in the original is dynamically balanced, the information is well hidden, and after the document has been printed or copied many times the system can still judge accurately whether this encrypted file has exceeded its permitted number of printings or copies.

The processing module in the above embodiment may comprise: a first sorting module 301, configured to sort the source characters in the source character set in descending or ascending order of repetition frequency to obtain a first ordered set of the source character set; and a second sorting module 302, configured to sort the source characters in the first ordered set that share the same repetition frequency in descending or ascending order of character internal code. The sorting combinations in this embodiment are equivalent in effect; their main purpose is to supply the later grouping step with source characters in a balanced, statistics-based order.

Together, the processing module and the grouping module of the above embodiment dynamically allocate the number of characters in each group according to the counted word frequencies, thereby dynamically balancing the redundancy of the characters in each group. This helps hide the information and allows the relevant gene words to be reused, which greatly improves how often gene words are used and how evenly information is balanced across groups, and helps increase the amount of information that can be embedded.

The device in the above embodiment may also comprise: a setting module 80, configured to set embedding information and obtain its number of bits, wherein the number of bits of the embedding information is the preset number of groups, and to encrypt the embedding information to obtain secure embedding information. This embodiment thereby presets the number of groups into which the characters are divided. To improve security, the embedding information may be encrypted. For example, when the embedding information is 0110, encryption can make it appear to unauthorized users as 0011, 1100 or the like rather than 0110; only authorized users can recover the correct embedding information.

The above embodiment can thus dynamically allocate the number of characters in each group, for all source characters in the source character set of the original document, according to the code length of the embedding information and the counted word frequencies. This dynamically balances the redundancy within each group of source characters, which helps hide information and allows the relevant gene words to be reused.

The device in the above embodiment may also comprise: a reading module 90, configured to read the character information of all source characters in each character group to obtain the information corresponding to each character group, wherein, in any character group, when the number of source characters whose character information is 0 exceeds the number whose character information is 1, the information corresponding to that group is 0; and when the number of source characters whose character information is 1 exceeds the number whose character information is 0, the information corresponding to that group is 1.
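The majority rule applied by the reading module can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `group_bit` is hypothetical, and because the text does not say what happens when 0s and 1s are equally frequent, the sketch simply raises an error in that case.

```python
from collections import Counter


def group_bit(char_bits):
    """Return the bit (0 or 1) carried by one character group.

    char_bits holds the bit read from each source character of the group.
    The group takes the value of whichever bit occurs more often.
    """
    counts = Counter(char_bits)
    if counts[0] > counts[1]:
        return 0
    if counts[1] > counts[0]:
        return 1
    # The described rule does not cover ties; raising is this sketch's choice.
    raise ValueError("equal numbers of 0 and 1 are not covered by the rule")


print(group_bit([0, 0, 1]))  # prints 0: one misread character is outvoted
```

The vote is what makes balanced groups useful: a group with several characters tolerates a few misreads, whereas a group with a single character does not.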

The replacement module in the above embodiment may comprise: a first replacement module, configured to replace all source characters of a character group with their corresponding gene words in the gene character library when the information corresponding to that group is 0, and to perform no replacement on the source characters of a group whose corresponding information is 1; or a second replacement module, configured to replace all source characters of a character group with their corresponding gene words in the gene character library when the information corresponding to that group is 1, and to perform no replacement on the source characters of a group whose corresponding information is 0.
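The two replacement variants differ only in which bit value triggers replacement, so one hedged sketch can cover both. Everything named here is an assumption for illustration: `embed_bits`, the `replace_on` parameter, the toy groups, and the use of Unicode Private Use Area code points to stand in for gene-word glyphs.

```python
def embed_bits(text, groups, bits, gene_library, replace_on=0):
    """Replace the source characters of every group whose bit equals replace_on.

    groups[i] lists the source characters of group i and bits[i] is its bit.
    replace_on=0 mirrors the first replacement module, replace_on=1 the second.
    """
    if len(groups) != len(bits):
        raise ValueError("need exactly one bit per character group")
    to_replace = set()
    for group, bit in zip(groups, bits):
        if bit == replace_on:
            to_replace.update(group)
    return "".join(gene_library[ch] if ch in to_replace else ch for ch in text)


# Hypothetical library: gene glyphs are modelled as Private Use Area characters.
source_chars = "主山发开技生长"
gene_library = {ch: chr(0xE000 + i) for i, ch in enumerate(source_chars)}
groups = [["主"], ["山", "长"], ["发", "生"], ["开", "技"]]

# Bits 0, 1, 1, 0 with replace_on=0 replace groups 1 and 4 (主, 开, 技).
marked = embed_bits("山主开发", groups, [0, 1, 1, 0], gene_library)
```

In a real system the replacement would switch the glyph to the special-purpose font rather than change the code point; the code-point swap is used here only so the result can be inspected.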

Fig. 2 is a flowchart of the gene-word-based document processing method according to an embodiment of the invention. As shown in Fig. 2, the method comprises the following steps:

Step S102: the extraction module in Fig. 1 extracts one or more source characters from the original document according to the gene character library to obtain a source character set, wherein each source character in the set has a corresponding gene word in the gene character library.

Step S104: the processing module in Fig. 1 calculates the repetition frequency of each source character in the source character set and sorts the source characters according to their repetition frequency and character internal code.

Step S106: the grouping module in Fig. 1 groups the sorted source characters into a preset number of groups according to the snake-like algorithm to obtain that number of character groups.

Step S108: the replacement module in Fig. 1 replaces all source characters in one or more of the character groups with their corresponding gene words in the gene character library to obtain a document with embedded gene words.
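As a rough illustration of step S102, extraction amounts to keeping the characters of the document that have an entry in the gene character library. The function name, the two-entry library and the sample text below are all made up for this sketch; the library is modelled as a plain dictionary from source character to gene glyph.

```python
def extract_source_characters(text, gene_library):
    """Return, in document order and with repeats, the characters of text
    that have a corresponding gene word in gene_library."""
    return [ch for ch in text if ch in gene_library]


# Hypothetical library with two entries; the values stand in for gene glyphs.
gene_library = {"山": "\ue000", "主": "\ue001"}

found = extract_source_characters("山有主，主有山", gene_library)
print(found)  # ['山', '主', '主', '山']

# The source character set holds each extracted character once.
source_character_set = list(dict.fromkeys(found))
print(source_character_set)  # ['山', '主']
```

Keeping the repeats in `found` matters for step S104, which counts how often each source character occurs.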

In the above embodiment, the extracted source characters that have corresponding gene words are sorted and grouped in snake-like order, after which the corresponding source characters in the original document are replaced with gene words from the gene character library. The method uses different combinations of gene words and their word frequencies, together with grouping, to carry a large amount of information. Because the balanced distribution produced by snake-like ordering makes gene words easy to reuse, the method addresses the problem that existing embedding methods have unbalanced redundancy and low utilization and therefore lower character recognition accuracy when the system reads a document with embedded gene words. Character information is thus read more accurately when a document with embedded gene words is identified, and extraction of the embedded information from such a file becomes considerably more robust.

The method embeds gene words in the original document according to established rules. Because the redundancy of each group of gene words embedded in the original is dynamically balanced, the information is well hidden, and after the document has been printed or copied many times the system can still judge accurately whether this encrypted file has exceeded its permitted number of printings or copies.

Step S104 of the above embodiment, in which the repetition frequency of each source character is calculated and the source characters are sorted by repetition frequency and character internal code, may be implemented as follows: the source characters in the source character set are sorted in descending order of repetition frequency to obtain a first ordered set of the source character set, and the source characters in the first ordered set that share the same repetition frequency are sorted in descending or ascending order of character internal code. Alternatively, step S104 may be implemented as follows: the source characters in the source character set are sorted in ascending order of repetition frequency to obtain a first ordered set, and the source characters in the first ordered set that share the same repetition frequency are sorted in descending or ascending order of character internal code. After counting the number of source characters in the original document that have corresponding gene words (including repetitions) and their word frequencies, the embodiment sorts all source characters in the source character set by word frequency, falling back to the character internal code when word frequencies are equal. This guarantees that the character order is unique.

Specifically, the method is implemented as follows. First, the source characters in the original document that have corresponding gene words in the gene character library are extracted, and the word frequency of each extracted source character is then counted. The character information of each source character may be represented by a binary digit (0 or 1); when a source character occurs n times as a gene word, that is, when its word frequency is n, the source character can be represented as n bits of 0 or 1.

Consider, as an example, a short passage of the original document in which, among other characters, 山 and 主 each occur twice and 开, 发, 长, 技 and 生 each occur once. Each character of the passage is treated as a candidate source character, and the gene character library is queried to find which characters have a corresponding gene word. This yields the source character set 山 主 开 发 长 技 生. Every source character in this set has a corresponding gene word in the gene character library, so the passage contains 7 distinct characters with gene words (not counting repetitions).

The statistics give the following word frequencies: 山 2, 主 2, 开 1, 发 1, 长 1, 技 1, 生 1.

After the word frequency of the gene word corresponding to each source character has been counted, the internal code of each source character is also obtained. Suppose the characters are sorted by repetition frequency from high to low and, within equal frequency, by ascending character internal code. For example, 山 and 主 both have word frequency 2; the internal code of 山 is 0x5c71 and that of 主 is 0x4e3b, so 主 comes before 山, and the rest are ordered likewise. The sorted character sequence is then 主 山 发 开 技 生 长, with word frequencies 2, 2, 1, 1, 1, 1, 1.
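The counting and sorting of this example can be reproduced with a short sketch. Two assumptions are made: the internal code is taken to be the Unicode code point returned by `ord`, which agrees with the two codes quoted above (0x5c71 and 0x4e3b), and the input list is a stand-in with the same character counts as the example rather than the original passage. The function name is hypothetical.

```python
from collections import Counter


def sort_source_characters(extracted, descending_frequency=True):
    """Sort distinct source characters by word frequency, then by ascending
    internal code (assumed here to be the Unicode code point)."""
    freq = Counter(extracted)
    sign = -1 if descending_frequency else 1
    ordered = sorted(freq, key=lambda ch: (sign * freq[ch], ord(ch)))
    return ordered, [freq[ch] for ch in ordered]


# Stand-in for the extracted characters: 山 and 主 twice, the others once.
extracted = list("山主开发长技生山主")

ordered, freqs = sort_source_characters(extracted)
print("".join(ordered))  # 主山发开技生长
print(freqs)             # [2, 2, 1, 1, 1, 1, 1]
```

Because the secondary key is a unique code, two runs over the same document always produce the same order, which is the uniqueness property the embodiment relies on.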

Based on the foregoing description; Source word symbol in step S106 closes the source character set after sorting according to S-Shaped Algorithm divides into groups according to preset group number; Before the character group of obtaining predetermined number; Method can also comprise: embedding information is set to obtain the figure place of embedding information, wherein, the figure place that embeds information is preset group number; Embedding information is encrypted, to obtain safe embedding information.Concrete, still the passage with source document is: the mountain is main to be had longly in the exploitation has skill that living mountain master is arranged, and is illustrated; At this moment, can embedding information be set according to demand, for example; Embedding information is set is: 0110; Therefore totally 4, can know that the source word symbol of source character set in closing after adopting said method to sort, can be divided into 4 groups and distribute.In addition, can be in order to improve security with should embedding information encrypting, for example; When embedding information is 0110, can pass through encryption, the embedding information that makes other non-validated user see is 0011 or 1100 etc.; Rather than 0110, have only validated user can discern correct embedding information.
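The relation between the embedding information and the group count can be sketched as follows. The text does not name the cipher used to protect the embedding information, so `scramble` and `unscramble` swap in a keyed bit permutation purely as a stand-in; the function names and the integer `key` are assumptions of this sketch and are not taken from the patent.

```python
import random


def group_count(embedding_bits):
    """The preset number of groups equals the number of embedded bits."""
    return len(embedding_bits)


def scramble(bits, key):
    """Stand-in for the unspecified encryption: permute the bit positions
    with a pseudo-random order derived from key."""
    order = list(range(len(bits)))
    random.Random(key).shuffle(order)
    return "".join(bits[i] for i in order)


def unscramble(scrambled, key):
    """Invert scramble() by rebuilding the same permutation from key."""
    order = list(range(len(scrambled)))
    random.Random(key).shuffle(order)
    restored = [""] * len(scrambled)
    for position, source_index in enumerate(order):
        restored[source_index] = scrambled[position]
    return "".join(restored)


embedding = "0110"
print(group_count(embedding))  # 4
hidden = scramble(embedding, key=2011)
recovered = unscramble(hidden, key=2011)
```

A permutation keeps the number of bits unchanged, so the group count is the same before and after scrambling; a production system would use a real cipher in its place.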

Once the number of groups determined by the embedded information has been obtained in step S106, each sorted source character in the source character set can be assigned to a character group according to the S-shaped algorithm and the number of bits of the embedded information. Specifically, according to the length of the embedded information, each group first takes one source character in turn, and the remaining characters are then distributed to the character groups following the snake principle until all gene words have been assigned. This better addresses the problem of uneven frequencies during assignment, keeps the redundancy of each group dynamically balanced, and allows the relevant gene words to be reused. In other words, the number of source characters representing 0 or 1 is roughly even across the character groups. This avoids the problem, which arises with other grouping methods, of a character group containing very many or very few source characters representing 0 or 1. No character group ends up with so few source characters representing its 0 or 1 that the information of that group is read incorrectly. As a result, after gene words replace the source characters in the character groups, the utilization of the gene words increases and the robustness of subsequent recognition or detection of the document is greatly improved.
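The balancing effect can be illustrated by comparing total frequency per group under a snake assignment and under a plain round-robin assignment. The sketch below is only an illustration under stated assumptions: the frequency list is invented, and `snake_assign` and `group_totals` are hypothetical helper names.

```python
def snake_assign(n_items, n_groups):
    """For each item index (in sorted order), return its group index:
    left to right on even passes, right to left on odd passes."""
    assignment = []
    for i in range(n_items):
        row, col = divmod(i, n_groups)
        assignment.append(col if row % 2 == 0 else n_groups - 1 - col)
    return assignment

def group_totals(freqs, assignment, n_groups):
    """Sum the frequencies that land in each group."""
    totals = [0] * n_groups
    for f, g in zip(freqs, assignment):
        totals[g] += f
    return totals

freqs = [9, 7, 5, 4, 3, 2, 1, 1]  # invented frequencies, sorted high to low
snake_totals = group_totals(freqs, snake_assign(len(freqs), 4), 4)
round_robin_totals = group_totals(freqs, [i % 4 for i in range(len(freqs))], 4)
print(snake_totals, round_robin_totals)
```

For this invented input the snake totals differ by at most 3 between groups, while the round-robin totals differ by 7.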

Specifically, continuing with the same example passage, the length of the embedded information 0110 gives four groups, and the characters above can be divided as follows:

Group 1: 主 (main)

Group 2: 山, 长 (mountain, long)

Group 3: 发, 生 (send, live)

Group 4: 开, 技 (open, skill)
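The four groups above follow from dealing the sorted sequence out in snake order. A minimal sketch, assuming the sorted sequence 主山发开技生长 and a hypothetical helper named `snake_groups`:

```python
def snake_groups(sorted_chars, n_groups):
    """Deal characters into n_groups, reversing direction after every pass."""
    groups = [[] for _ in range(n_groups)]
    for i, ch in enumerate(sorted_chars):
        row, col = divmod(i, n_groups)
        index = col if row % 2 == 0 else n_groups - 1 - col
        groups[index].append(ch)
    return groups

embed_info = "0110"
groups = snake_groups("主山发开技生长", len(embed_info))
for number, group in enumerate(groups, start=1):
    print(f"Group {number}: {''.join(group)}")
```

The first pass fills groups 1 to 4 with 主, 山, 发, 开; the second pass runs backwards and gives 技 to group 4, 生 to group 3, and 长 to group 2.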

After the source characters in each group have been replaced with the corresponding gene words from the gene character library, and the system recognizes the document, that is, when the gene words are detected, the original information is not corrupted by a character group having too few source characters representing its 0 or 1 (a bit that should be 0 being read as 1, or a bit that should be 1 being read as 0). The accuracy of gene word detection is thereby improved.

Based on the above embodiment, after step S106 groups the source characters of the sorted source character set into the preset number of groups according to the S-shaped algorithm to obtain the preset number of character groups, the method may further comprise: reading the character information of all source characters in each character group to obtain the corresponding information of each character group, wherein, in any character group, when the number of source characters whose character information is 0 is greater than the number of source characters whose character information is 1, the corresponding information of that character group is 0; and when the number of source characters whose character information is 1 is greater than the number of source characters whose character information is 0, the corresponding information of that character group is 1. In this embodiment, because each character group consists of a number of 0 and 1 bits, whether the group is represented by 0 or by 1 can be determined from the numbers of 0s and 1s in the group. Here, if a character group contains more source characters represented by "0" than source characters represented by "1", "0" can be used to represent that character group.
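The majority rule for reading a group can be sketched as below. The function name `group_bit` and the list-of-integers input are assumptions; the text does not say what happens on a tie, so the sketch returns `None` in that case rather than guessing.

```python
def group_bit(char_bits):
    """Return the bit carried by a character group by majority vote.

    `char_bits` holds the 0/1 character information read from each
    source character occurrence in the group.
    """
    zeros = char_bits.count(0)
    ones = char_bits.count(1)
    if zeros > ones:
        return 0
    if ones > zeros:
        return 1
    return None  # tie: behaviour not specified in the text

print(group_bit([0, 0, 1]), group_bit([1, 1, 0, 1]), group_bit([0, 1]))
```

Because the decision is a vote, one misread character in a group of three does not flip the group's bit.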

In the above embodiment of the present application, step S108, in which all source characters in one or more character groups are replaced with the corresponding gene words in the gene character library to obtain the document with embedded gene words, may comprise the following implementation: when the corresponding information of a character group is 0, all source characters of that character group are replaced with the corresponding gene words in the gene character library; when the corresponding information of a character group is 1, no replacement is performed on the source characters of that character group. Alternatively, the step may comprise the following implementation: when the corresponding information of a character group is 1, all source characters of that character group are replaced with the corresponding gene words in the gene character library; when the corresponding information of a character group is 0, no replacement is performed on the source characters of that character group. In a specific implementation, the system can define the handling of 0 and 1 as required. It can be defined that all source characters in a character group represented by "0" are to be replaced with gene words, while the source characters in a character group represented by "1" are not replaced. It can equally be defined that all source characters in a character group represented by "1" are to be replaced with gene words.
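Both replacement conventions can be covered by one function with a switch. The sketch below uses lowercase ASCII letters as stand-in source characters and their uppercase forms as stand-in gene words; these, the function name `embed`, and the parameter `replace_on` are hypothetical and serve only to make the substitution visible.

```python
def embed(text, groups, bits, gene_library, replace_on="0"):
    """Replace every character belonging to a group whose bit equals
    `replace_on` with its gene word; leave all other characters as they are."""
    to_replace = {
        ch
        for group, bit in zip(groups, bits)
        if bit == replace_on
        for ch in group
    }
    return "".join(gene_library[ch] if ch in to_replace else ch for ch in text)

# Hypothetical stand-ins: lowercase = source character, uppercase = gene word.
gene_library = {c: c.upper() for c in "abcdefg"}
groups = [["a"], ["b", "g"], ["c", "f"], ["d", "e"]]
bits = "0110"

replaced_on_zero = embed("abxcdefgab", groups, bits, gene_library, replace_on="0")
replaced_on_one = embed("abxcdefgab", groups, bits, gene_library, replace_on="1")
print(replaced_on_zero, replaced_on_one)
```

With `replace_on="0"` groups 1 and 4 are replaced; with `replace_on="1"` groups 2 and 3 are replaced. The character `x` is outside the gene character library and is never touched.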

The key points of the present invention are that the embedding replacement is simple, fast, and easy to implement; the gene word utilization is high; the redundancy is relatively balanced; the information is well hidden; and the amount of embedded information is relatively large.

As can be seen from the above, the method embodiment of the present application embeds, in an existing source document, the corresponding gene words from the gene word table. The key is that the embedding process uses a distribution technique based on the S-shaped algorithm: the set of source characters to be replaced with gene words is first extracted from the source document, and the source characters in this set are then sorted and grouped, which balances source characters of different frequencies. In addition, to ensure that the validity and integrity of the source document and the gene word table can be verified, the embedded information and its number of bits need to be set.

It should be noted that the steps shown in the flow charts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flow charts, in some cases the steps shown or described can be executed in an order different from the one given here.

From the above description it can be seen that the present invention achieves the following technical effects: with the way of embedding gene words in a document realized by the present invention, the embedding replacement is simple, fast, and easy to implement; the gene word utilization is high; the redundancy is relatively balanced; the information is well hidden; and the amount of embedded information is relatively large.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by a plurality of computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; alternatively, they can each be made into an individual integrated circuit module, or a plurality of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A document processing method based on gene words, characterized by comprising:

extracting one or more source characters from a source document according to a gene character library to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene character library;

calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character;

grouping the source characters of the sorted source character set into a preset number of groups according to an S-shaped algorithm to obtain the preset number of character groups; and

replacing all source characters in one or more of the character groups with the corresponding gene words in the gene character library to obtain a document with embedded gene words.

2. The method according to claim 1, characterized in that calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character, comprises:

sorting the source characters in the source character set in order of repetition frequency from high to low to obtain a first ordered set of the source character set; and

sorting the source characters with the same repetition frequency in the first ordered set in descending or ascending order of character internal code.

3. The method according to claim 1, characterized in that calculating the repetition frequency of each source character in the source character set, and sorting the source characters in the source character set according to the repetition frequency and the character internal code of each source character, comprises:

sorting the source characters in the source character set in order of repetition frequency from low to high to obtain a first ordered set of the source character set;

4. The method according to any one of claims 1-3, characterized in that, before grouping the source characters of the sorted source character set into the preset number of groups according to the S-shaped algorithm to obtain the preset number of character groups, the method further comprises:

setting embedded information to obtain the number of bits of the embedded information, wherein the number of bits of the embedded information is the preset number of groups; and

encrypting the embedded information to obtain secure embedded information.

5. The method according to any one of claims 1-3, characterized in that, after grouping the source characters of the sorted source character set into the preset number of groups according to the S-shaped algorithm to obtain the preset number of character groups, the method further comprises:

reading the character information of all source characters in each character group to obtain the corresponding information of each character group, wherein,

in any character group, when the number of source characters whose character information is 0 is greater than the number of source characters whose character information is 1, the corresponding information of that character group is 0; and

when the number of source characters whose character information is 1 is greater than the number of source characters whose character information is 0, the corresponding information of that character group is 1.

6. The method according to claim 5, characterized in that replacing all source characters in one or more of the character groups with the corresponding gene words in the gene character library to obtain the document with embedded gene words comprises:

when the corresponding information of a character group is 0, replacing all source characters of that character group with the corresponding gene words in the gene character library; and

when the corresponding information of a character group is 1, performing no replacement on the source characters of that character group.

7. The method according to claim 5, characterized in that replacing all source characters in one or more of the character groups with the corresponding gene words in the gene character library to obtain the document with embedded gene words comprises:

when the corresponding information of a character group is 1, replacing all source characters of that character group with the corresponding gene words in the gene character library; and

when the corresponding information of a character group is 0, performing no replacement on the source characters of that character group.

8. A document processing device based on gene words, characterized by comprising:

an extraction module, configured to extract one or more source characters from a source document according to a gene character library to obtain a source character set, wherein each source character in the source character set has a corresponding gene word in the gene character library;

a processing module, configured to calculate the repetition frequency of each source character in the source character set, and to sort the source characters in the source character set according to the repetition frequency and the character internal code of each source character;

a grouping module, configured to group the source characters of the sorted source character set into a preset number of groups according to an S-shaped algorithm to obtain the preset number of character groups; and

a replacement module, configured to replace all source characters in one or more of the character groups with the corresponding gene words in the gene character library to obtain a document with embedded gene words.

9. The device according to claim 8, characterized in that the processing module comprises:

a first sorting module, configured to sort the source characters in the source character set in order of repetition frequency from high to low or from low to high to obtain a first ordered set of the source character set; and

a second sorting module, configured to sort the source characters with the same repetition frequency in the first ordered set in descending or ascending order of character internal code.

10. The device according to claim 8 or 9, characterized in that the device further comprises:

a setting module, configured to set embedded information to obtain the number of bits of the embedded information, wherein the number of bits of the embedded information is the preset number of groups, and to encrypt the embedded information to obtain secure embedded information.

11. The device according to claim 8 or 9, characterized in that the device further comprises:

a reading module, configured to read the character information of all source characters in each character group to obtain the corresponding information of each character group, wherein, in any character group, when the number of source characters whose character information is 0 is greater than the number of source characters whose character information is 1, the corresponding information of that character group is 0; and when the number of source characters whose character information is 1 is greater than the number of source characters whose character information is 0, the corresponding information of that character group is 1.

12. The device according to claim 11, characterized in that the replacement module comprises:

a first replacement module, configured to replace all source characters of a character group with the corresponding gene words in the gene character library when the corresponding information of that character group is 0, and to perform no replacement on the source characters of a character group when the corresponding information of that character group is 1; or

a second replacement module, configured to replace all source characters of a character group with the corresponding gene words in the gene character library when the corresponding information of that character group is 1, and to perform no replacement on the source characters of a character group when the corresponding information of that character group is 0.