CN117648681B - OFD format electronic document hidden information extraction and embedding method - Google Patents

Info

Publication number
CN117648681B
Authority
CN
China
Prior art keywords
chinese character
chinese
stroke
characters
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410123051.7A
Other languages
Chinese (zh)
Other versions
CN117648681A (en)
Inventor
杨瑞钦
范红达
陆猛
朱静宇
赵云
庄玉龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dianju Information Technology Co ltd
Original Assignee
Beijing Dianju Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dianju Information Technology Co ltd
Priority to CN202410123051.7A
Publication of CN117648681A
Application granted
Publication of CN117648681B
Active
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention relates to the technical field of document information processing, and in particular to a method for extracting and embedding hidden information in OFD format electronic documents. The method comprises the following steps: acquiring an OFD format electronic document; extracting the steganographic carrier text with an OCR algorithm and obtaining the Chinese character initial groups; constructing a word frequency co-occurrence adhesion from the frequency characteristics of the Chinese characters in each initial group and the words they form; constructing a Chinese character stroke comparison matrix; obtaining the Chinese character stroke matrix and stroke order sequence of each Chinese character; analyzing the differences between each Chinese character and the other Chinese characters in its initial group to obtain a stroke-method steganographic embedding evaluation factor; completing the grouping of each paragraph; and embedding the hidden information according to the stroke-method steganographic embedding evaluation factor of each Chinese character. The invention aims to overcome the defect that uneven grouping of Chinese characters makes the steganographic embedding uneven and easy to detect, thereby improving the embedding quality of the hidden information.

Description

OFD format electronic document hidden information extraction and embedding method
Technical Field
The application relates to the technical field of document information extraction, in particular to an OFD format electronic document hidden information extraction and embedding method.
Background
OFD (Open Fixed-layout Document) is an electronic document standard for storing office documents. A document created according to the OFD standard supports multimedia and digital signatures and is compatible across platforms. Text steganography is an information hiding technology that embeds hidden information into text, and it is widely used in information security, copyright protection, text steganography traceability and related fields.
Steganography based on micro-deformation of Chinese characters is simple to implement, visually imperceptible, and well suited to Chinese text. It embeds hidden information through small changes to the strokes of Chinese characters, offers good concealment and robustness, and preserves the readability and phonetic integrity of the text. However, the approach has a weakness: the size of the Chinese character groups affects the embedding rate and the extraction difficulty during steganographic embedding. Traditional Chinese character grouping usually adopts uniform grouping or random grouping. Uniform grouping divides the text directly into Chinese character blocks of equal size, which easily causes local anomalies in the Chinese characters and concentrates the hidden information in a single group. Random grouping increases the difficulty of steganographic embedding and extraction and may also make the text look less natural.
Disclosure of Invention
In order to solve the technical problems, the invention provides an OFD format electronic document hidden information extraction and embedding method, which aims to solve the existing problems.
The invention discloses an OFD format electronic document hidden information extraction and embedding method, which adopts the following technical scheme:
the embodiment of the invention provides an OFD format electronic document hidden information extraction and embedding method, which comprises the following steps:
acquiring an OFD format electronic document; extracting each sentence sequence of the OFD format electronic document with an OCR algorithm to form a steganographic carrier text; taking the Chinese characters of the same sentence in the steganographic carrier text as a Chinese character initial group; obtaining a Chinese character frequency table according to the distribution of each Chinese character in the text and in the paragraphs;
for each Chinese character initial group in the steganographic carrier text: obtaining the word frequency joint coefficient of each Chinese character in the initial group according to its frequency characteristics in the Chinese character frequency table; obtaining the word frequency co-occurrence adhesion of each Chinese character in the initial group according to the probability that it forms words with the other Chinese characters in the group, combined with its word frequency joint coefficient; constructing a Chinese character stroke comparison matrix according to the stroke characteristics of all Chinese characters in the steganographic carrier text; obtaining the Chinese character stroke matrix and stroke order sequence of each Chinese character according to the stroke characteristics of each Chinese character in the initial group, combined with the Chinese character stroke comparison matrix; obtaining the stroke double-end discrimination degree of each Chinese character in the initial group according to the Chinese character stroke matrix and the word frequency co-occurrence adhesion of each Chinese character;
obtaining the stroke-method steganographic embedding evaluation factor of each Chinese character in the initial group according to the differences in stroke matrix and stroke order sequence between that Chinese character and the other Chinese characters in the group, combined with its stroke double-end discrimination degree; obtaining the group size of each paragraph according to the distribution of the stroke-method steganographic embedding evaluation factors of the Chinese characters in each paragraph of the steganographic carrier text;
and embedding the hidden information according to the stroke-method steganographic embedding evaluation factor of each Chinese character.
Preferably, obtaining the Chinese character frequency table according to the distribution of each Chinese character in the text and in the paragraphs specifically comprises:
for each Chinese character in the steganographic carrier text:
taking the probability that the Chinese character appears in its paragraph as its segment probability, and taking the probability that it appears in the whole text as its text probability;
and saving the table consisting of the segment probabilities and text probabilities of all Chinese characters as the Chinese character frequency table.
Preferably, obtaining the word frequency joint coefficient of each Chinese character in the Chinese character initial group according to its frequency characteristics in the Chinese character frequency table specifically comprises:
obtaining the segment probability and the text probability of each Chinese character from the Chinese character frequency table; presetting a first weight adjustment factor and a second weight adjustment factor; multiplying the segment probability of each Chinese character by the first weight adjustment factor; multiplying the text probability of each Chinese character by the second weight adjustment factor; and taking the sum of the two products as the word frequency joint coefficient of each Chinese character.
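The frequency table and the word frequency joint coefficient described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout keyed by `(paragraph_index, character)`, and the default weights of 0.5 are assumptions, since the text only states that the two weight adjustment factors are preset. Paragraphs are assumed to already contain Chinese characters only.

```python
from collections import Counter


def char_frequency_table(paragraphs):
    """Map (paragraph_index, char) to (segment_probability, text_probability).

    `paragraphs` is a list of strings that are assumed to hold only
    Chinese characters (punctuation already removed).
    """
    text_counts = Counter(ch for para in paragraphs for ch in para)
    text_total = sum(text_counts.values())
    table = {}
    for idx, para in enumerate(paragraphs):
        para_counts = Counter(para)
        for ch, cnt in para_counts.items():
            segment_prob = cnt / len(para)           # share within the paragraph
            text_prob = text_counts[ch] / text_total  # share within the whole text
            table[(idx, ch)] = (segment_prob, text_prob)
    return table


def joint_coefficient(segment_prob, text_prob, alpha1=0.5, alpha2=0.5):
    """Weighted sum of the two probabilities; alpha1/alpha2 are hypothetical presets."""
    return alpha1 * segment_prob + alpha2 * text_prob
```

For example, `char_frequency_table(["湖水湖", "水"])` gives the character 湖 a segment probability of 2/3 in the first paragraph and a text probability of 1/2.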
Preferably, obtaining the word frequency co-occurrence adhesion of each Chinese character in the Chinese character initial group according to the probability that it forms words with the other Chinese characters in the group, combined with its word frequency joint coefficient, specifically comprises:
the preset word length is denoted $n$, where the length of a word is the number of Chinese characters it contains; the word frequency co-occurrence adhesion $G_{k,i}$ of the i-th Chinese character in the k-th Chinese character initial group is given by an expression [expression not reproduced in the source text]
in which $Z_{k,i}$ denotes the word frequency joint coefficient of the i-th Chinese character in the k-th Chinese character initial group; $x_{k,i}$, $x_{k,i-1}$ and $x_{k,i-n-1}$ denote the i-th, (i-1)-th and (i-n-1)-th Chinese characters in the k-th Chinese character initial group, respectively; and $\mathrm{count}(\cdot)$ denotes a counting function.
Preferably, constructing the Chinese character stroke comparison matrix according to the stroke characteristics of all Chinese characters in the steganographic carrier text specifically comprises:
obtaining the first $m$ kinds of strokes with the highest probability of occurrence in the steganographic carrier text; numbering the strokes starting from 1; and taking the strokes and their corresponding numbers as the elements of the Chinese character stroke comparison matrix, where $m$ is a preset value.
Preferably, obtaining the Chinese character stroke matrix and stroke order sequence of each Chinese character according to the stroke characteristics of each Chinese character in the initial group, combined with the Chinese character stroke comparison matrix, specifically comprises:
obtaining the stroke set and the stroke sequence of each Chinese character in the Chinese character initial group;
for each element of the Chinese character stroke comparison matrix, marking the position of that element as 1 if the stroke appears in the stroke set and as 0 otherwise; taking the marking result as the Chinese character stroke matrix of the Chinese character;
and replacing each element of the stroke sequence of the Chinese character with its number in the Chinese character stroke comparison matrix to obtain the stroke order sequence of the Chinese character.
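The comparison matrix, stroke matrix and stroke order sequence can be sketched as below. This is a hedged illustration under several assumptions: strokes are represented by hypothetical English labels, the comparison matrix is stored as a flat mapping from stroke to number rather than the two-dimensional layout of FIG. 2, the stroke matrix is a flat 0/1 list in the same order, and strokes that fall outside the top $m$ are simply skipped in the stroke order sequence (the text does not say how they are handled).

```python
from collections import Counter


def build_comparison(stroke_sequences, m):
    """Number the m most frequent stroke kinds in the text, starting from 1."""
    counts = Counter(s for seq in stroke_sequences for s in seq)
    return {stroke: number
            for number, (stroke, _) in enumerate(counts.most_common(m), start=1)}


def stroke_matrix(strokes, comparison):
    """Flat 0/1 vector: 1 where the comparison stroke occurs at least once."""
    present = set(strokes)
    return [1 if stroke in present else 0 for stroke in comparison]


def stroke_order_sequence(strokes, comparison):
    """Replace each stroke by its number; unnumbered strokes are skipped (assumption)."""
    return [comparison[s] for s in strokes if s in comparison]
```

Because a stroke is marked 1 no matter how often it occurs, two characters with the same stroke kinds share a stroke matrix; the stroke order sequence is what keeps them apart.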
Preferably, obtaining the stroke double-end discrimination degree of each Chinese character in the Chinese character initial group according to the Chinese character stroke matrix and the word frequency co-occurrence adhesion of each Chinese character specifically comprises:
for each Chinese character in the Chinese character initial group:
calculating the F norm of the Chinese character stroke matrix of the Chinese character; calculating the sum of the Chinese character stroke matrices of the preceding and following Chinese characters; calculating the base-2 logarithm of the normalized word frequency co-occurrence adhesion of the Chinese character; calculating the product of the logarithm and the F norm of the sum; and taking the ratio of the F norm of the character's own stroke matrix to the product as the stroke double-end discrimination degree of the Chinese character in the Chinese character initial group.
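A literal sketch of this computation follows, assuming stroke matrices are flat 0/1 lists (the F norm of a matrix equals the Euclidean norm of its flattened entries). The function names and the input checks are assumptions. Note that with a normalized adhesion strictly between 0 and 1 the base-2 logarithm is negative, so this literal reading returns negative values; the text does not say how the sign, the boundary values 0 and 1, or the first and last characters of a group are handled.

```python
import math


def frobenius(matrix):
    """F norm of a stroke matrix given as a flat list of numbers."""
    return math.sqrt(sum(v * v for v in matrix))


def double_end_discrimination(prev_m, cur_m, next_m, adhesion_norm):
    """||M_i||_F / (log2(adhesion) * ||M_{i-1} + M_{i+1}||_F), read literally."""
    if not 0.0 < adhesion_norm < 1.0:
        raise ValueError("normalized adhesion must lie strictly between 0 and 1")
    neighbour_sum = [a + b for a, b in zip(prev_m, next_m)]
    neighbour_norm = frobenius(neighbour_sum)
    if neighbour_norm == 0.0:
        raise ValueError("neighbouring stroke matrices are both empty")
    return frobenius(cur_m) / (math.log2(adhesion_norm) * neighbour_norm)
```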
Preferably, obtaining the stroke-method steganographic embedding evaluation factor of each Chinese character in the Chinese character initial group according to the differences in stroke matrix and stroke order sequence between that Chinese character and the other Chinese characters in the group, combined with its stroke double-end discrimination degree, specifically comprises:
for the i-th Chinese character in the k-th Chinese character initial group:
the stroke-method penalty factor $C_{k,i}$ of the i-th Chinese character in the k-th Chinese character initial group is given by an expression [expression not reproduced in the source text]
in which $N_k$ denotes the number of Chinese characters in the k-th Chinese character initial group; $S_{k,j}$ and $S_{k,i}$ denote the stroke order sequences of the j-th and i-th Chinese characters in the k-th Chinese character initial group of the steganographic carrier text, respectively; $\mathrm{DTW}(\cdot,\cdot)$ denotes the DTW distance; $M_{k,j}$ and $M_{k,i}$ denote the stroke matrices of the j-th and i-th Chinese characters in the k-th Chinese character initial group of the steganographic carrier text, respectively; and $\|\cdot\|_1$ denotes the L1 norm of a matrix;
taking the stroke-method penalty factor as the exponent of an exponential function with the natural constant as its base; calculating the product of the exponential function value and the stroke double-end discrimination degree of the Chinese character; obtaining the length of the stroke order sequence of the Chinese character; calculating the base-2 logarithm of the sum of the length and 1; and taking the ratio of the product to the logarithm as the stroke-method steganographic embedding evaluation factor of the Chinese character.
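The final combination step can be sketched as below. Because the expression for the stroke-method penalty factor is not available in this text, the sketch takes the penalty factor as an already computed input; the function name and argument names are assumptions.

```python
import math


def embedding_evaluation_factor(penalty, discrimination, stroke_order_sequence):
    """exp(penalty) * discrimination / log2(len(sequence) + 1)."""
    length = len(stroke_order_sequence)
    if length == 0:
        # log2(0 + 1) is 0, so an empty sequence has no defined factor.
        raise ValueError("stroke order sequence must not be empty")
    return math.exp(penalty) * discrimination / math.log2(length + 1)
```

The logarithm in the denominator grows slowly with the number of strokes, so very long stroke order sequences reduce the factor only mildly.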
Preferably, obtaining the group size of each paragraph according to the distribution of the stroke-method steganographic embedding evaluation factors of the Chinese characters in each paragraph of the steganographic carrier text specifically comprises:
for each paragraph of the steganographic carrier text:
taking the negative of the stroke-method steganographic embedding evaluation factor of each Chinese character as the exponent of an exponential function with the natural constant as its base; taking the sum of the exponential function values of all Chinese characters in the paragraph as the grouping factor of the paragraph; presetting a grouping adjustment factor; and taking the larger of the grouping factor rounded down and the grouping adjustment factor as the group size of the paragraph.
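The group size rule can be sketched in a few lines. The default grouping adjustment factor of 2 is a hypothetical value chosen for illustration; the text only states that it is preset.

```python
import math


def paragraph_group_size(evaluation_factors, adjustment=2):
    """max(floor(sum(exp(-E))), adjustment) over one paragraph's characters."""
    grouping_factor = sum(math.exp(-e) for e in evaluation_factors)
    return max(math.floor(grouping_factor), adjustment)
```

Characters with large evaluation factors contribute almost nothing to the sum, so a paragraph rich in good carriers gets small groups (bounded below by the adjustment factor), while a paragraph of poor carriers gets larger ones.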
Preferably, embedding the hidden information according to the stroke-method steganographic embedding evaluation factor of each Chinese character specifically comprises:
for each Chinese character group obtained by the grouping:
obtaining the Chinese character with the largest stroke-method steganographic embedding evaluation factor in the Chinese character group, and performing a micro-deformation operation on that Chinese character according to a steganographic font library to complete the embedding of the hidden information.
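Carrier selection within each group can be sketched as follows. The sketch assumes that a paragraph is cut into consecutive groups of the computed group size (the text does not spell out how groups are laid out) and it stops at choosing the carrier positions; the micro-deformation itself depends on the steganographic font library and is not modeled. The function name is hypothetical.

```python
def select_carriers(evaluation_factors, group_size):
    """Return, for each consecutive group, the index of its highest-scoring character."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    picks = []
    for start in range(0, len(evaluation_factors), group_size):
        end = min(start + group_size, len(evaluation_factors))
        # Index of the character with the largest evaluation factor in this group.
        best = max(range(start, end), key=lambda i: evaluation_factors[i])
        picks.append(best)
    return picks
```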
The invention has at least the following beneficial effects:
The invention constructs a Chinese character stroke matrix to obtain the stroke double-end discrimination degree of each Chinese character, which measures the complexity of the Chinese character within the steganographic carrier text; it constructs stroke order sequences to distinguish Chinese characters that share the same stroke composition but differ in meaning; and by analyzing the stroke matrices and stroke order sequences it finally determines the size of the Chinese character groups in each paragraph. Compared with traditional uniform grouping and random grouping, the method overcomes the defect that uneven grouping of Chinese characters makes the steganographic embedding uneven and easy to detect. Measuring the stroke information of the Chinese characters through the stroke matrix and the stroke order sequence determines the Chinese character group size of each paragraph, so that the hidden information is embedded evenly into the steganographic carrier text, which improves both the embedding rate and the robustness of the embedding algorithm.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an OFD format electronic document hidden information extraction and embedding method provided by the invention;
FIG. 2 is a diagram of a Chinese character stroke contrast matrix;
FIG. 3 is a flow chart of group size acquisition.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of specific implementation, structure, characteristics and effects of the method for extracting and embedding hidden information of an OFD format electronic document according to the invention in combination with the accompanying drawings and the preferred embodiment. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of the method for extracting and embedding hidden information of an OFD format electronic document provided by the invention with reference to the accompanying drawings.
The invention provides an OFD format electronic document hidden information extraction and embedding method.
Specifically, the method for extracting and embedding hidden information of an OFD format electronic document is provided, please refer to fig. 1, and the method comprises the following steps:
Step S001: obtaining the OFD format electronic document as the steganographic carrier and extracting its Chinese character data.
First, the text in which information is to be hidden is obtained. In this embodiment, an OCR (Optical Character Recognition) algorithm is used to extract the text; since OCR is a well-known technique, it is not described further here. The OFD format electronic document is input to the OCR algorithm, which outputs the corresponding Chinese character sequence. For simplicity, this embodiment calls the result the steganographic carrier text.
Because the steganographic carrier text usually contains many Chinese characters, the resulting Chinese character sequence is long, which hinders the analysis of Chinese character steganography. Therefore, based on the original paragraph and sentence divisions of the steganographic carrier text, $x_{k,i}$ denotes the i-th Chinese character of the k-th sentence in the steganographic carrier text, and the group of Chinese characters delimited by each sentence is called a Chinese character initial group.
Thus, the initial grouping of the steganographic carrier text and the Chinese characters is obtained.
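The sentence-based initial grouping can be sketched as below, assuming the OCR step has already produced a plain string. The set of sentence-ending punctuation marks and the use of the CJK Unified Ideographs range U+4E00 to U+9FFF to recognize Chinese characters are assumptions made for this illustration; the text does not specify either.

```python
import re

# Hypothetical choice of sentence terminators (full-width punctuation).
SENTENCE_END = "。！？；"


def initial_groups(carrier_text):
    """Split the text into sentences and keep only the Chinese characters of each."""
    groups = []
    for sentence in re.split(f"[{SENTENCE_END}]", carrier_text):
        chars = re.findall(r"[\u4e00-\u9fff]", sentence)
        if chars:
            groups.append(chars)
    return groups
```

With this layout, `initial_groups(text)[k][i]` plays the role of the i-th Chinese character of the k-th sentence (zero-based in the code).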
Step S002: obtaining the word frequency co-occurrence adhesion from the frequency of the Chinese characters in the steganographic carrier text and the degree of association between them; constructing the Chinese character stroke matrix, which measures how feasible it is to hide information in a Chinese character; obtaining the stroke double-end discrimination degree of each Chinese character from its stroke matrix and word frequency co-occurrence adhesion; constructing the stroke order sequence from the stroke order of each Chinese character; obtaining the stroke-method steganographic embedding evaluation factor for the Chinese characters in the current initial group; and finally obtaining the adaptive Chinese character group size of the current paragraph.
Usually, chinese character patterns are formed by minimum structural units of various specific points and lines, each Chinese character can be formed by arranging and combining three layers of components, strokes and strokes in a two-dimensional space, and the special meaning of the Chinese character is endowed by fine change of the Chinese character by adjusting the strokes and the structures of the Chinese character, so that the steganography of the Chinese character is realized. The Chinese characters are distinguished by changing the tiny strokes of the Chinese characters based on the steganography of the micro-deformation of the Chinese characters, and the Chinese characters formed by the difference of strokes of different fonts correspond to different information, so that the aim of transmitting the hidden information of the Chinese character text is fulfilled. The Chinese character grouping will seriously affect the embedding rate and the steganography effect of the hidden information, so that the Chinese character grouping needs to be adaptively adjusted.
In text steganography, the size of the Chinese character groups directly determines the performance and reliability of the steganographic system when hidden information is embedded into the text. A suitable grouping lets the hidden information be embedded evenly into every group, which raises the embedding rate. Because the hidden information is then spread across the whole text, it also becomes harder to discover, and the robustness and efficiency of the algorithm improve. The grouping of Chinese characters therefore needs to be adjusted adaptively according to the Chinese character information in the steganographic carrier text.
A Chinese character frequency table B covering the full steganographic carrier text is obtained by counting. Counting the frequency of each Chinese character gives the segment probability and the text probability of the i-th Chinese character, which represent the probability that the current Chinese character occurs in its paragraph and in the whole text, respectively.
Although Chinese characters are independent individuals, each Chinese character does not exist independently, semantic relativity often exists among Chinese characters, and relativity among Chinese characters is often embodied in terms or idioms. The frequency co-occurrence adhesion degree of the current Chinese character can be obtained by combining the occurrence probability of each Chinese character and the relevance of the Chinese characters, and the expression is as follows:
[Expression for the word frequency co-occurrence adhesion $G_{k,i}$ not reproduced in the source text.] The word frequency joint coefficient used in it is
$$Z_{k,i} = \alpha_1 P^{d}_{k,i} + \alpha_2 P^{w}_{k,i}$$
where $G_{k,i}$ denotes the word frequency co-occurrence adhesion of the i-th Chinese character in the k-th initial group; $Z_{k,i}$ denotes the word frequency joint coefficient of the i-th Chinese character in the k-th initial group; $n$ denotes the word length; $x_{k,i}$, $x_{k,i-1}$ and $x_{k,i-n-1}$ denote the i-th, (i-1)-th and (i-n-1)-th Chinese characters in the k-th Chinese character initial group, respectively; $\mathrm{count}(\cdot)$ denotes a counting function; $P^{d}_{k,i}$ and $P^{w}_{k,i}$ denote the segment probability and the text probability of the i-th Chinese character of the k-th Chinese character initial group in the Chinese character frequency table; and $\alpha_1$ and $\alpha_2$ denote the first and second weight adjustment factors. It should be noted that word length is measured in Chinese characters and is generally 2, 3 or 4. In this embodiment $n$, $\alpha_1$ and $\alpha_2$ are preset empirically, and the implementer can set them according to the actual situation.
It should further be noted that $\mathrm{count}(\cdot)$ gives the number of occurrences of the sequential combination of Chinese characters from $x_{k,i-n-1}$ to $x_{k,i}$ in the k-th Chinese character initial group. Only the range available within the current Chinese character initial group is taken, and when $i=1$ the value is set to 0.
The word frequency co-occurrence adhesion reflects both how frequent the current Chinese character is and how strongly it combines with neighboring Chinese characters into words. If the current Chinese character is tightly associated with others in its initial group, the Chinese character combination occurs often, namely the value of the counting function is large; in addition, when the Chinese character occurs with high probability in the text, its word frequency joint coefficient is large, so the word frequency co-occurrence adhesion of the current Chinese character is large. Conversely, if the current Chinese character is relatively independent and occurs with low probability in the whole text, its word frequency co-occurrence adhesion is small.
Because Chinese characters are associated with one another, characters that are tightly associated and occur frequently are poor candidates for steganography, whereas characters that are relatively independent in the text and occur rarely are the best carriers for hidden information. The word frequency co-occurrence adhesion evaluates, at the semantic level, how suitable a Chinese character is for steganography by analyzing the relations between Chinese characters and the frequency of each single character. However, characters selected in this way may still be difficult to embed into. For example, if the Chinese character selected by the word frequency co-occurrence adhesion is 'one', its strokes are so simple that a steganographic change is easily noticed.
Association exists between Chinese characters, but each single Chinese character is itself composed of several strokes. Because their strokes differ, different Chinese characters differ in how hard it is to hide information in them: the more complex the strokes of a Chinese character, the easier it is to hide information in it and the harder that information is to detect. According to the printed regular-script standard in the GB13000.1 Character Set Chinese Character Bent Stroke Specification, Chinese character strokes are generally divided into 32 types, and the first 25 types of strokes with high occurrence probability in the steganographic carrier text are: point, horizontal, vertical, skim, right-falling, lifting, transverse fold, transverse skim, transverse hook, transverse fold lifting, transverse fold, transverse oblique hook, vertical fold skim, skim point, skim, oblique hook, vertical lift, vertical fold, horizontal hook and other strokes. A Chinese character stroke comparison matrix is constructed from them, as shown in FIG. 2. The Chinese character stroke matrix has the same layout as the comparison matrix: a position is recorded as 1 when the corresponding stroke occurs in the Chinese character, and it is recorded as 1 only once no matter how many times the stroke occurs, so the Chinese character stroke matrix contains only 0 and 1. A Chinese character stroke matrix can be constructed in this way, for example, for the Chinese character 'lake'.
For each Chinese character $x_{k,i}$ in a Chinese character initial group, the corresponding Chinese character stroke matrix $M_{k,i}$ is obtained. Combining the Chinese character stroke matrix with the word frequency co-occurrence adhesion gives the stroke double-end discrimination degree of the Chinese character:
$$D_{k,i} = \frac{\|M_{k,i}\|_F}{\log_2\left(\hat{G}_{k,i}\right) \cdot \|M_{k,i-1} + M_{k,i+1}\|_F}$$
where $D_{k,i}$ denotes the stroke double-end discrimination degree of the i-th Chinese character in the k-th Chinese character initial group of the steganographic carrier text; $M_{k,i}$ denotes the Chinese character stroke matrix of the i-th Chinese character in the k-th Chinese character initial group; $\|\cdot\|_F$ denotes the F norm of a matrix; $\log_2(\cdot)$ denotes the base-2 logarithm; $\hat{G}_{k,i}$ denotes the normalized word frequency co-occurrence adhesion of the i-th Chinese character in the k-th Chinese character initial group; and $M_{k,i-1}$ and $M_{k,i+1}$ denote the Chinese character stroke matrices of the (i-1)-th and (i+1)-th Chinese characters in the k-th Chinese character initial group, respectively. It should be noted that the F norm is a well-known technique and is not described further in this embodiment.
The Chinese character stroke double-end region graduation can reflect the independence and stroke complexity of the current Chinese character, and the more complex strokes, the easier the Chinese character is to carry out information steganography. If the Chinese character stroke composition is more complex, a corresponding Chinese character stroke matrix is obtainedThe larger the norm value of the Chinese character is, the larger the distinguishing degree of the Chinese character and the front and the rear Chinese characters is, the larger the ratio between the Chinese character and the rear Chinese character is, namelyThe larger the value of the Chinese character is, the smaller the value of the co-occurrence adhesion of the normalized character frequency is obtained when the Chinese character is relatively independent, and finally the value of the indexing of the strokes and the double end areas of the Chinese character is increased. In contrast, if the correlation degree between the current Chinese character is relatively simple and the Chinese character is larger, the value of the stroke and double-end region graduation of the Chinese character is reduced.
The stroke double-end distinction degree measures the stroke complexity of a Chinese character, but it has a limitation: the stroke matrix $M_{k,i}$ is built from the strokes that make up the character and can therefore only express differences in composition. More complex Chinese characters often contain every stroke of the stroke matrix, so such characters cannot be distinguished by the stroke matrix alone, and a further analysis based on stroke order is needed. A stroke-order sequence is therefore constructed on the basis of the stroke matrix. Taking the character for "lake" (湖) as an example, each stroke of the character, in writing order, is replaced by its sequence number in the stroke matrix, which yields the stroke-order sequence $S_{k,i}$ of the character.
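The construction above can be sketched as follows. The five-entry stroke table, the romanized stroke names, the flat row layout of the stroke matrix, and the stroke lists of the two sample characters are all illustrative assumptions; the patent builds its table from the most frequent strokes of the carrier text, and that table is not reproduced here.

```python
# Hypothetical stroke comparison table: stroke type -> number (numbering starts at 1).
STROKE_TABLE = {"heng": 1, "shu": 2, "pie": 3, "dian": 4, "zhe": 5}


def stroke_matrix(strokes, table=STROKE_TABLE):
    """Mark 1 for every table entry that occurs in the character, 0 otherwise."""
    present = set(strokes)
    ordered = sorted(table, key=table.get)
    return [1 if name in present else 0 for name in ordered]


def stroke_order_sequence(strokes, table=STROKE_TABLE):
    """Replace each stroke, in writing order, by its number in the table.

    Assumption: strokes missing from the table are skipped.
    """
    return [table[name] for name in strokes if name in table]


# Two characters written with the same stroke types in a different order.
char_a = ["heng", "shu", "heng"]
char_b = ["shu", "heng", "heng"]

print(stroke_matrix(char_a), stroke_order_sequence(char_a))
print(stroke_matrix(char_b), stroke_order_sequence(char_b))
```

Both sample characters produce the same stroke matrix but different stroke-order sequences, which is the situation the stroke-order sequence is meant to resolve.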
A stroke-order sequence is constructed for each Chinese character in the current initial group, which reduces the interference that characters with the same composition, i.e. the same stroke matrix, cause in Chinese character steganography. For each initial group, a stroke-method steganographic embedding evaluation factor is then constructed from the strokes and the stroke order of the characters, with the following expression:
$$E_{k,i}=\frac{D_{k,i}\cdot\exp\left(P_{k,i}\right)}{\log_2\left(L\left(S_{k,i}\right)+1\right)}$$
where $E_{k,i}$ denotes the stroke-method steganographic embedding evaluation factor of the i-th Chinese character in the k-th initial group; $D_{k,i}$ denotes the stroke double-end distinction degree of that character; $P_{k,i}$ denotes the stroke-method penalty factor of the i-th Chinese character in the k-th initial group, which is computed from the quantities listed next; $L(\cdot)$ denotes the function that returns the length of a sequence; $n_k$ denotes the number of Chinese characters in the k-th initial group; $S_{k,j}$ and $S_{k,i}$ denote the stroke-order sequences of the j-th and i-th Chinese characters in the k-th initial group of the steganographic carrier text, respectively; $\mathrm{DTW}(\cdot,\cdot)$ denotes the DTW distance; $M_{k,j}$ and $M_{k,i}$ denote the stroke matrices of the j-th and i-th Chinese characters in the k-th initial group, respectively; and $\|\cdot\|_1$ denotes the L1 norm of a matrix. It should be noted that the calculation of the DTW distance and of the L1 norm is a known technique and is not described in detail in this embodiment.
The stroke-method steganographic embedding evaluation factor is a combined evaluation of the strokes and the stroke order of the Chinese characters in the current initial group, and it measures how feasible it is to embed hidden information successfully in the current character. If the strokes of the current character are complex and its stroke order is distinctive, the character differs more from the other characters in the group and its stroke-method penalty factor is larger. If, in addition, the independence and stroke complexity of the character are high, its stroke double-end distinction degree is larger, and the resulting evaluation factor $E_{k,i}$ is larger. Conversely, if the strokes and stroke order of the current character are relatively simple and the character is strongly associated with its neighbouring characters, the resulting evaluation factor $E_{k,i}$ is small. It should be noted that more complex characters are generally better carriers, but an excessively complex character, although easier to embed in, is also more likely to attract attention. The length of the stroke-order sequence, $L(S_{k,i})$, is therefore introduced as a fine adjustment that moderately reduces the evaluation factor of the more complex characters.
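The following sketch shows the two pieces of this evaluation that are fully specified in the text: a textbook DTW distance between two stroke-order sequences, and the final combination described in claim 1 (the double-end distinction degree times the exponential of the penalty factor, divided by the base-2 logarithm of the sequence length plus 1). The exact expression of the penalty factor is not reproduced in this text, so the sketch takes it as an input. Using the absolute difference of stroke numbers as the local DTW cost is an assumption, as are all function names.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping distance with |a - b| as local cost (assumed)."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def evaluation_factor(distinction, penalty, sequence_length):
    """E = D * exp(P) / log2(L + 1), following the wording of claim 1."""
    if sequence_length < 1:
        raise ValueError("the stroke-order sequence must contain at least one stroke")
    return distinction * math.exp(penalty) / math.log2(sequence_length + 1)


print(dtw_distance([1, 2, 1], [2, 1, 1]))
print(evaluation_factor(2.0, 0.0, 3))
print(evaluation_factor(2.0, 0.0, 7))
```

With the distinction degree and the penalty held fixed, a longer stroke-order sequence gives a smaller factor, which matches the fine adjustment described above.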
Traversing every Chinese character in the current paragraph yields the stroke-method steganographic embedding evaluation factor of each character, which reflects how feasible it is to embed successfully in that character. The Chinese characters of the paragraph are then adaptively grouped according to the evaluation factors $E_{k,i}$ of the characters in the current paragraph:
$$G_c=\sum_{k=1}^{m_c}\sum_{i=1}^{n_k}\exp\left(-E_{k,i}\right),\qquad B_c=\max\left(\left\lfloor G_c\right\rfloor,\ \beta\right)$$
where $G_c$ denotes the grouping factor of the c-th paragraph of the steganographic carrier text; $m_c$ denotes the number of initial groups in the c-th paragraph; $n_k$ denotes the number of Chinese characters in the current initial group; $E_{k,i}$ denotes the stroke-method steganographic embedding evaluation factor of the i-th Chinese character in the k-th initial group; $\exp(\cdot)$ denotes the exponential function with the natural constant as base; $B_c$ denotes the group size of the c-th paragraph; $\max(\cdot)$ denotes the maximum function; $\lfloor\cdot\rfloor$ denotes rounding down; and $\beta$ denotes a grouping adjustment factor whose value is set empirically. The flow of obtaining the group size of each paragraph is shown in fig. 3.
If the Chinese characters in the current paragraph are complex and independent, that is, each character can easily carry hidden information, the group size should be reduced so that the embedding rate of the paragraph increases. If the characters in the current paragraph are relatively simple and strongly correlated with one another, the group size should be increased so that the success rate of hiding the information increases.
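A minimal sketch of the paragraph group-size rule, following claim 5: the grouping factor is the sum of exp(-E) over all characters of the paragraph, and the group size is the larger of its floor and the grouping adjustment factor. The function name and the value 2 used for the adjustment factor are placeholders, since the empirical value is not given in this text.

```python
import math


def paragraph_group_size(evaluation_factors, adjustment_factor):
    """Group size = max(floor(sum(exp(-E))), adjustment_factor)."""
    grouping_factor = sum(math.exp(-e) for e in evaluation_factors)
    return max(math.floor(grouping_factor), adjustment_factor)


# Simple, strongly associated characters (small factors) give larger groups.
print(paragraph_group_size([0.0, 0.0, 0.0], adjustment_factor=2))
# Complex, independent characters (large factors) fall back to the lower bound.
print(paragraph_group_size([5.0, 5.0, 5.0], adjustment_factor=2))
```

The adjustment factor acts as a lower bound, so a paragraph made up entirely of good carrier characters still receives a usable group size.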
Step S003: the Chinese characters are grouped according to the adaptive group size of each paragraph, steganographic embedding is performed on the Chinese characters in each group, and the hidden information is extracted at the receiving end.
The adaptive group size of the current paragraph is obtained through the above steps and adjusts itself to the complexity and independence of the Chinese characters in the paragraph. The Chinese characters of the current paragraph are then grouped according to the adaptive group size $B_c$. Within each resulting group, the characters are sorted by their stroke-method steganographic embedding evaluation factor from largest to smallest, and the character with the largest factor is processed according to the steganographic font library: its strokes are slightly deformed (micro-deformation) so that it represents two bits of the binary sequence A of the hidden information, which embeds the hidden information. The steganographic carrier text after the hidden information has been embedded is called the steganographic text. The micro-deformation operation is a known technique and is not described in detail in this embodiment.
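The selection step can be sketched as below: the characters of a paragraph are cut into consecutive groups of the adaptive size, the character with the largest evaluation factor is chosen in each group, and each chosen character is paired with two bits of the hidden bit string. The micro-deformation itself is not shown. The function names, the treatment of a shorter final group as a group of its own, and the sample factors are assumptions of this sketch.

```python
def choose_carriers(factors, group_size):
    """Return, for each consecutive group, the index of the largest factor."""
    if group_size < 1:
        raise ValueError("group size must be at least 1")
    carriers = []
    for start in range(0, len(factors), group_size):
        group = range(start, min(start + group_size, len(factors)))
        carriers.append(max(group, key=lambda i: factors[i]))
    return carriers


def assign_bits(carriers, bits):
    """Pair each carrier index with two bits of the hidden sequence."""
    pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    return list(zip(carriers, pairs))


# Illustrative evaluation factors for a seven-character paragraph.
factors = [0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.5]
carriers = choose_carriers(factors, group_size=3)
print(carriers)
print(assign_bits(carriers, "0110"))
```

Each carrier holds two bits, so a paragraph with g groups can hold at most 2g bits; any remaining bits would have to go into later paragraphs.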
The steganographic text can be sent from the sender to the receiver over an ordinary channel, whereas the font library used for Chinese character steganography must not be transmitted publicly. At the receiver, the Chinese characters of the steganographic text are compared against the steganographic font library; font recognition yields glyph codes such as "01", "10" or "11", and concatenating the obtained codes recovers the hidden information A contained in the steganographic text.
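A toy sketch of the concatenation step at the receiver is shown below. It assumes that font recognition has already labelled every character with a glyph variant. The variant labels, the dictionary standing in for the steganographic font library, and the rule that unmodified glyphs contribute no code are all hypothetical; the text does not say how the code "00" is represented, so it is left out.

```python
# Hypothetical stand-in for the steganographic font library: variant label -> code.
FONT_LIBRARY = {"variant_a": "01", "variant_b": "10", "variant_c": "11"}


def extract_hidden_bits(recognized_variants, library=FONT_LIBRARY):
    """Concatenate the codes of deformed glyphs, skipping unmodified ones."""
    return "".join(library[v] for v in recognized_variants if v in library)


recognized = ["plain", "variant_a", "plain", "variant_c"]
print(extract_hidden_bits(recognized))
```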
In summary, the embodiment of the invention constructs the Chinese character stroke matrix to obtain the stroke double-end distinction degree, which measures the complexity of the Chinese characters in the steganographic carrier text. To distinguish Chinese characters that have the same components but different meanings, it constructs the stroke-order sequence, analyses the stroke matrix and the stroke-order sequence to obtain the stroke-method steganographic embedding evaluation factor, and finally determines the group size of the Chinese characters in each paragraph. Compared with traditional uniform grouping and random grouping, the method overcomes the defect that uneven grouping of Chinese characters makes the steganographic embedding uneven and easy to detect. The stroke information of the Chinese characters is measured through the stroke matrix and the stroke-order sequence to determine the paragraph group size, so that the hidden information is embedded evenly in the steganographic carrier text, which improves both the embedding rate and the robustness of the embedding algorithm.
It should be noted that the order of the embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The foregoing describes specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Any modification of the technical solutions described in the foregoing embodiments, or any equivalent replacement of some of their technical features, that does not cause the essence of the corresponding technical solution to depart from the scope of the technical solutions of the embodiments of the present application shall fall within the protection scope of the present application.

Claims (6)

1. A method for extracting and embedding hidden information in an OFD format electronic document, characterized by comprising the following steps:
acquiring an OFD format electronic document; extracting each sentence sequence of the OFD format electronic document with an OCR algorithm to form a steganographic carrier text; taking the Chinese characters of the same sentence in the steganographic carrier text as an initial group of Chinese characters; acquiring a Chinese character frequency table according to the distribution of each Chinese character in the text and in the paragraphs;
for each initial group of Chinese characters in the steganographic carrier text: acquiring the character-frequency joint coefficient of each Chinese character of the initial group according to the character-frequency features of that character in the Chinese character frequency table; obtaining the character-frequency co-occurrence adhesion of each Chinese character of the initial group according to the probability that the character forms words with the remaining Chinese characters of the initial group, combined with the character-frequency joint coefficient of the character; constructing a Chinese character stroke comparison matrix according to the stroke features of all Chinese characters in the steganographic carrier text; obtaining the stroke matrix and the stroke-order sequence of each Chinese character of the initial group according to the stroke features of the character combined with the Chinese character stroke comparison matrix; acquiring the stroke double-end distinction degree of each Chinese character of the initial group according to the stroke matrix and the character-frequency co-occurrence adhesion of the character;
acquiring the stroke-method steganographic embedding evaluation factor of each Chinese character of the initial group according to the differences in stroke matrix and stroke-order sequence between that character and the other Chinese characters of the initial group, combined with the stroke double-end distinction degree of the character; obtaining the group size of each paragraph according to the distribution features of the stroke-method steganographic embedding evaluation factors of the Chinese characters in each paragraph of the steganographic carrier text;
completing the embedding of the hidden information in combination with the stroke-method steganographic embedding evaluation factor of each Chinese character;
wherein acquiring the Chinese character frequency table according to the distribution of each Chinese character in the text and in the paragraphs specifically comprises:
for each Chinese character in the steganographic carrier text:
taking the probability that the Chinese character appears in its paragraph as the paragraph probability of the character; taking the probability that the Chinese character appears in the whole text as the text probability of the character;
saving the table consisting of the paragraph probabilities and text probabilities of all Chinese characters as the Chinese character frequency table;
wherein acquiring the character-frequency joint coefficient of each Chinese character of the initial group according to the character-frequency features of that character in the Chinese character frequency table specifically comprises:
acquiring the paragraph probability and the text probability of each Chinese character from the Chinese character frequency table; presetting a first weight adjustment factor and a second weight adjustment factor; calculating the product of the paragraph probability of each Chinese character and the first weight adjustment factor; calculating the result of multiplying the text probability of each Chinese character by the second weight adjustment factor; taking the sum of the product and the result as the character-frequency joint coefficient of each Chinese character;
wherein obtaining the character-frequency co-occurrence adhesion of each Chinese character of the initial group according to the probability that the character forms words with the remaining Chinese characters of the initial group, combined with the character-frequency joint coefficient of the character, specifically comprises:
denoting a preset word length as $n$, where the length of a word is the number of Chinese characters it contains; the character-frequency co-occurrence adhesion $Z_{k,i}$ of the i-th Chinese character in the k-th initial group is expressed as:
where $W_{k,i}$ denotes the character-frequency joint coefficient of the i-th Chinese character in the k-th initial group; $c_{k,i}$, $c_{k,i-1}$ and $c_{k,i-n-1}$ denote the i-th, (i-1)-th and (i-n-1)-th Chinese characters in the k-th initial group, respectively; and $\mathrm{count}(\cdot)$ denotes a counting function;
wherein acquiring the stroke-method steganographic embedding evaluation factor of each Chinese character of the initial group according to the differences in stroke matrix and stroke-order sequence between that character and the other Chinese characters of the initial group, combined with the stroke double-end distinction degree of the character, specifically comprises:
for the i-th Chinese character in the k-th initial group:
the stroke-method penalty factor $P_{k,i}$ of the i-th Chinese character in the k-th initial group is expressed as:
where $n_k$ denotes the number of Chinese characters in the k-th initial group; $S_{k,j}$ and $S_{k,i}$ denote the stroke-order sequences of the j-th and i-th Chinese characters in the k-th initial group of the steganographic carrier text, respectively; $\mathrm{DTW}(\cdot,\cdot)$ denotes the DTW distance; $M_{k,j}$ and $M_{k,i}$ denote the stroke matrices of the j-th and i-th Chinese characters in the k-th initial group of the steganographic carrier text, respectively; and $\|\cdot\|_1$ denotes the L1 norm of a matrix;
taking the stroke-method penalty factor as the exponent of an exponential function with the natural constant as base; calculating the product of the exponential function and the stroke double-end distinction degree of the Chinese character; acquiring the length of the stroke-order sequence of the Chinese character; calculating the base-2 logarithm whose argument is the sum of the length and 1; and taking the ratio of the product to the logarithm value as the stroke-method steganographic embedding evaluation factor of the Chinese character.
2. The method for extracting and embedding hidden information in an OFD format electronic document according to claim 1, characterized in that constructing the Chinese character stroke comparison matrix according to the stroke features of all Chinese characters in the steganographic carrier text comprises:
acquiring the $N$ stroke types with the highest probability of occurrence in the steganographic carrier text; numbering the strokes from 1; and taking the strokes and their corresponding numbers as the elements of the Chinese character stroke comparison matrix, where $N$ is a preset value.
3. The method for extracting and embedding hidden information in an OFD format electronic document according to claim 1, characterized in that obtaining the stroke matrix and the stroke-order sequence of each Chinese character of the initial group according to the stroke features of the character combined with the Chinese character stroke comparison matrix comprises:
acquiring the stroke set and the stroke sequence of each Chinese character of the initial group;
for each element of the Chinese character stroke comparison matrix, marking the position of the element as 1 if the element appears in the stroke set and as 0 otherwise; taking the marking result as the stroke matrix of the Chinese character;
and taking, for each element of the stroke sequence of the Chinese character, its number in the Chinese character stroke comparison matrix as the corresponding element of the stroke-order sequence of the Chinese character.
4. The method for extracting and embedding hidden information in an OFD format electronic document according to claim 1, characterized in that acquiring the stroke double-end distinction degree of each Chinese character of the initial group according to the stroke matrix and the character-frequency co-occurrence adhesion of the character specifically comprises:
for each Chinese character in the initial group:
calculating the F-norm of the stroke matrix of the Chinese character; calculating the sum of the stroke matrices of the Chinese characters immediately before and after the Chinese character; calculating the base-2 logarithm whose argument is the normalized value of the character-frequency co-occurrence adhesion of the Chinese character; calculating the product of the logarithm value and the F-norm of the sum; and taking the ratio of the F-norm of the stroke matrix to the product as the stroke double-end distinction degree of the Chinese character in the initial group.
5. The method for extracting and embedding hidden information in an OFD format electronic document according to claim 1, characterized in that obtaining the group size of each paragraph according to the distribution features of the stroke-method steganographic embedding evaluation factors of the Chinese characters in each paragraph of the steganographic carrier text comprises:
for each paragraph of the steganographic carrier text:
taking the negative of the stroke-method steganographic embedding evaluation factor of each Chinese character as the exponent of an exponential function with the natural constant as base; taking the sum of the exponential functions of all Chinese characters in the paragraph as the grouping factor of the paragraph; presetting a grouping adjustment factor; and taking the maximum of the rounded-down grouping factor and the grouping adjustment factor as the group size of the paragraph.
6. The method for extracting and embedding hidden information in an OFD format electronic document according to claim 1, characterized in that completing the embedding of the hidden information in combination with the stroke-method steganographic embedding evaluation factor of each Chinese character comprises:
for each group of Chinese characters after grouping:
acquiring the Chinese character with the largest stroke-method steganographic embedding evaluation factor in the group, and performing a micro-deformation operation on that Chinese character according to the steganographic font library to complete the embedding of the hidden information.
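As an illustration of the frequency-table and joint-coefficient steps recited in claim 1, the following sketch computes the paragraph probability and text probability of every character and combines them with two weights. The function names, the dictionary layout keyed by (paragraph index, character), the equal weights of 0.5, and the treatment of every character of the input as a Chinese character are assumptions of this sketch.

```python
from collections import Counter


def frequency_table(paragraphs):
    """Map (paragraph index, character) -> (paragraph probability, text probability)."""
    text_counts = Counter(ch for paragraph in paragraphs for ch in paragraph)
    total = sum(text_counts.values())
    table = {}
    for index, paragraph in enumerate(paragraphs):
        for ch, count in Counter(paragraph).items():
            table[(index, ch)] = (count / len(paragraph), text_counts[ch] / total)
    return table


def joint_coefficient(paragraph_prob, text_prob, w1=0.5, w2=0.5):
    """Weighted sum of the two probabilities; w1 and w2 are preset weights."""
    return w1 * paragraph_prob + w2 * text_prob


# Placeholder text: Latin letters stand in for Chinese characters.
table = frequency_table(["aab", "bc"])
print(table[(0, "a")])
print(joint_coefficient(*table[(1, "b")]))
```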
CN202410123051.7A 2024-01-30 2024-01-30 OFD format electronic document hidden information extraction and embedding method Active CN117648681B (en)

Publications (2)

Publication Number Publication Date
CN117648681A CN117648681A (en) 2024-03-05
CN117648681B true CN117648681B (en) 2024-04-05

Family

ID=90046444

Country Status (1)

Country Link
CN (1) CN117648681B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5212769A (en) * 1989-02-23 1993-05-18 Pontech, Inc. Method and apparatus for encoding and decoding chinese characters
US6813367B1 (en) * 2000-09-11 2004-11-02 Seiko Epson Corporation Method and apparatus for site selection for data embedding
CN1740943A (en) * 2004-08-27 2006-03-01 北京北大方正电子有限公司 A file enciphering method
CN101639826A (en) * 2009-09-01 2010-02-03 西北大学 Text hidden method based on Chinese sentence pattern template transformation
CN103258314A (en) * 2005-09-16 2013-08-21 北京书生国际信息技术有限公司 Method for embedding and detecting cryptical code
CN104834864A (en) * 2015-04-09 2015-08-12 南京安斯克信息科技有限公司 Print document information tracing method based on topological invariance and image morphing
CN109992783A (en) * 2019-04-03 2019-07-09 同济大学 Chinese term vector modeling method
CN111274793A (en) * 2018-11-19 2020-06-12 阿里巴巴集团控股有限公司 Text processing method and device and computing equipment
CN114048314A (en) * 2021-11-11 2022-02-15 长沙理工大学 Natural language steganalysis method
CN115114597A (en) * 2022-06-19 2022-09-27 北卡科技有限公司 Tracing watermark embedding and extracting method based on character information
CN115409020A (en) * 2022-08-24 2022-11-29 杭州电子科技大学 Chinese character grouping test method and system based on word balance and computer readable storage medium
CN115952528A (en) * 2023-03-14 2023-04-11 南京信息工程大学 Multi-scale combined text steganography method and system
CN116192507A (en) * 2023-02-27 2023-05-30 盐城工学院 Information hiding method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
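Title
"An image steganography method based on texture perception"; Lianqiang Niu et al.; 2022 IEEE 2nd International Conference on Data Science and Computer Application (ICDSCA); 2022-12-29; full text *
"Design of a coverless text steganography method based on Chinese character features" (基于汉字字符特征的文本无载体隐写方法设计); Yu Xiangmei (于翔美); China Master's Theses Full-text Database (Information Science and Technology); 2020-02-15; full text *
"Text steganography method based on a Chinese character stroke encoding matrix" (基于汉字笔画编码矩阵的文本隐写方法); Yu Xiangmei (于翔美), Wang Kaixi (王开西); Journal of Qingdao University (Natural Science Edition); 2019-05-15; No. 02; full text *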

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant