CN112990178A

CN112990178A - Text digital information embedding and extracting method and system based on character segmentation

Info

Publication number: CN112990178A
Application number: CN202110392436.XA
Authority: CN
Inventors: 史祎诗; 祝玉鹏; 吕文晋; 陶冶; 孙鑫凯; 方俊
Original assignee: Fuxin Kunpeng Beijing Information Technology Co ltd; University of Chinese Academy of Sciences
Current assignee: Fuxin Kunpeng Beijing Information Technology Co ltd; University of Chinese Academy of Sciences
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2021-06-18
Anticipated expiration: 2041-04-13
Also published as: CN112990178B

Abstract

The invention relates to a text digital information embedding and extracting method and system based on character segmentation. The method comprises the following steps: performing row segmentation on the binarized text image to obtain the row height threshold of each row; performing column segmentation on each row according to its row height threshold to obtain the spacing between every two adjacent characters; determining a minimum spacing value from the spacings between adjacent characters; marking the spacing between characters adjacent to a punctuation mark as a non-coding region, and marking the spacing between non-punctuation characters as a coding region; binary coding the information to be embedded; and adjusting each spacing marked as a coding region in the text image according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the spacing between the two adjacent characters remains unchanged, and if the character 0 is embedded, the spacing between the two adjacent characters is adjusted to be smaller than the minimum spacing value. The invention improves the robustness and hiding capacity of information embedding.

Description

Text digital information embedding and extracting method and system based on character segmentation

Technical Field

The invention relates to the field of text digital watermarks, in particular to a text digital information embedding and extracting method and system based on character segmentation.

Background

With the popularization of the internet, the dissemination and use of digital image products and electronic documents have become more convenient. Text is currently an important medium through which people obtain information in daily life, and documents such as periodicals, newspapers and books have greatly enriched the ways people read. However, a text document contains far less redundant information than an image, which makes its copyright protection particularly important. Meanwhile, there is a risk that insiders in many security-sensitive departments leak confidential documents. Embedding the name of the operator and the date in an electronic document can therefore effectively address security problems such as copyright protection and tracing after a confidential document is leaked.

After tracing information is embedded in an electronic document, the most difficult problem is that the tracing information is lost, or errors occur during extraction, after the document has been printed multiple times. Existing tracing technologies for print-and-scan-resistant text documents fall roughly into three types: algorithms based on the text image, algorithms based on the text format, and algorithms based on the text content. The first type mainly embeds information by changing the edge pixels of segmented characters, and its robustness deteriorates after several rounds of printing and scanning. The second type mostly hides information by changing the line or column spacing of text characters or the file format, but the amount of information that can be hidden is too small to be effective for ordinary texts. The third type mainly changes the text content, typically embedding hidden information through synonym replacement, but many documents must not have their content modified, so its range of application is narrow.

Disclosure of Invention

The invention aims to provide a text digital information embedding and extracting method and system based on character segmentation, which improve the robustness and hidden capacity of information embedding.

In order to achieve the purpose, the invention provides the following scheme:

a text digital information embedding method based on character segmentation comprises the following steps:

carrying out binarization processing on the text image to obtain a binarized text image;

performing row segmentation on the binarized text image to obtain the row height threshold of each row;

performing column segmentation on each row according to the row height threshold of that row to obtain the spacing between every two adjacent characters;

determining a minimum distance value according to the distance between two adjacent characters;

marking the space between the characters adjacent to the punctuation marks as a non-coding region, and marking the space between the characters of the non-punctuation marks as a coding region;

binary coding is carried out on the information to be embedded;

adjusting the distance marked as the coding region between every two adjacent characters in the text image according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the distance between two adjacent characters is kept unchanged, and if the character 0 is embedded, the distance between two adjacent characters is adjusted to be smaller than the minimum distance value.

Optionally, the segmenting the binarized text image to obtain row height threshold values of each row specifically includes:

and determining the row height threshold value of each row according to the row projection of each row.
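By way of a non-limiting illustration, the row projection described above can be sketched in Python with NumPy. The function name `segment_lines`, the parameter `min_ink`, and the convention that ink pixels are stored as 1 are assumptions made for this sketch only; they are not specified by the invention.

```python
import numpy as np

def segment_lines(binary, min_ink=1):
    """Return (top, bottom) pixel-row spans of text rows; bottom is exclusive.

    binary:  2D array, 1 for black (ink) pixels, 0 for background (assumed convention).
    min_ink: hypothetical minimum number of ink pixels for a pixel row to count as text.
    """
    profile = binary.sum(axis=1)      # row projection: ink count of each pixel row
    has_ink = profile >= min_ink
    spans, start = [], None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                 # a text row begins
        elif not ink and start is not None:
            spans.append((start, y))  # the text row ended on the previous pixel row
            start = None
    if start is not None:             # text row touching the bottom edge
        spans.append((start, len(has_ink)))
    return spans

# Toy page with two text rows of height 3 and 5 pixels.
page = np.zeros((12, 20), dtype=np.uint8)
page[1:4, 2:18] = 1
page[6:11, 2:18] = 1
lines = segment_lines(page)
heights = [bottom - top for top, bottom in lines]  # one row height per text row
```

In this sketch the height of each span plays the role of the row height threshold of that row.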

Optionally, the performing column segmentation on each row according to the row height threshold of each row to obtain the spacing between every two adjacent characters specifically includes:

determining the width between characters according to the column projection of each column;

obtaining the ratio of the width between the characters to the height threshold of the line corresponding to the characters;

and if the ratio is within the range of the set ratio, performing column segmentation between the characters, otherwise, not performing the column segmentation, and recording the space between the two characters subjected to the column segmentation.
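The column segmentation with a ratio test can be sketched as follows. This is only an illustration under stated assumptions: the set ratio range is not given in the text, so `ratio_range=(0.1, 0.5)` is a hypothetical value, the height of the row image stands in for the row height threshold, and the function names are invented for this sketch.

```python
import numpy as np

def column_gaps(row_img):
    """Return (start, end) column spans of blank runs lying between ink columns."""
    profile = row_img.sum(axis=0)             # column projection of one text row
    ink_cols = np.flatnonzero(profile > 0)
    gaps = []
    for left, right in zip(ink_cols[:-1], ink_cols[1:]):
        if right - left > 1:                  # at least one blank column in between
            gaps.append((int(left) + 1, int(right)))
    return gaps

def split_gaps(row_img, ratio_range=(0.1, 0.5)):
    """Keep only the gaps whose width / row height lies inside ratio_range.

    Returns (start, end, width) for every gap at which a column cut is made.
    """
    row_height = row_img.shape[0]             # stand-in for the row height threshold
    lo, hi = ratio_range
    kept = []
    for start, end in column_gaps(row_img):
        width = end - start
        if lo <= width / row_height <= hi:
            kept.append((start, end, width))  # cut here and record the spacing
    return kept

# Toy row of height 20: a 4-column gap (ratio 0.2) and a 1-column gap (ratio 0.05).
row = np.zeros((20, 31), dtype=np.uint8)
row[:, 0:10] = 1
row[:, 14:24] = 1
row[:, 25:31] = 1
cuts = split_gaps(row)
```

With these toy values only the 4-column gap is cut; the 1-column gap is treated as belonging to one character, which is one plausible reading of why the ratio test is applied.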

The invention also discloses a text digital information embedding system based on character segmentation, which comprises:

the binarization text image obtaining module is used for carrying out binarization processing on the text image to obtain a binarization text image;

the row height threshold value acquisition module is used for segmenting the binary text image to obtain row height threshold values of all rows;

the column segmentation module is used for performing column segmentation on each row according to the row height threshold of each row to obtain the spacing between every two adjacent characters;

the minimum distance value acquisition module is used for determining a minimum distance value according to the distance between two adjacent characters;

the coding region marking module is used for marking the space between the characters adjacent to the punctuation marks as a non-coding region and marking the space between the characters of the non-punctuation marks as a coding region;

the embedded information coding module is used for carrying out binary coding on the information to be embedded;

the information embedding module is used for adjusting the distance marked as the coding region between every two adjacent characters in the text image according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the distance between two adjacent characters is kept unchanged, and if the character 0 is embedded, the distance between two adjacent characters is adjusted to be smaller than the minimum distance value.

Optionally, the row height threshold obtaining module specifically includes:

and the row height threshold value determining unit is used for determining the row height threshold value of each row according to the row projection of each row.

Optionally, the column segmentation module specifically includes:

a first inter-character width determining unit for determining a width between characters according to a column projection of each column;

a ratio determination unit between the first width and the row height threshold value, for obtaining the ratio between the width between the characters and the row height threshold value of the row corresponding to the characters;

and the column segmentation unit is used for performing column segmentation between the characters if the ratio is within a set ratio range, otherwise not performing the column segmentation, and recording the distance between the two characters subjected to the column segmentation.

The invention also discloses a text digital information extraction method based on character segmentation, which is applied to the text digital information embedding method based on character segmentation and comprises the following steps:

carrying out binarization processing on a printed-and-scanned copy of the text embedded with the information to obtain a binarized text image of the embedded information;

segmenting the binary text image of the embedded information to obtain line height threshold values of all lines;

performing column segmentation on each row according to the row height threshold of each row, and carrying out size statistics on the character images obtained by the column segmentation;

extracting embedded information according to the size of each character image: if the ratio of the width to the height of the character image is within the set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.

Optionally, the performing column segmentation on each row according to the row height threshold of each row and carrying out size statistics on the character images obtained by the column segmentation specifically includes:

and if the ratio is within the set ratio range, performing column segmentation between the characters, otherwise, not performing column segmentation, and recording the size of each character after the column segmentation, wherein the size of each character comprises the width of the character and the height of the character.

The invention also discloses a text digital information extraction system based on character segmentation, which comprises:

the embedded information binarized text image acquisition module is used for carrying out binarization processing on a printed-and-scanned copy of the text embedded with the information to obtain a binarized text image of the embedded information;

the line segmentation module is used for segmenting the binary text image of the embedded information to obtain line height threshold values of all lines;

the character image size statistics module is used for performing column segmentation on each row according to the row height threshold of each row and carrying out size statistics on the character images obtained by the column segmentation;

an embedded information extraction module for extracting embedded information according to the size of each character image: if the ratio of the width to the height of the character image is within the set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.

Optionally, the character image size statistics module specifically includes:

a second inter-character width determining unit for determining a width between characters according to the column projection of each column;

a ratio determination unit between the second width and the row height threshold value, configured to obtain a ratio between the width between the characters and the row height threshold value of the row corresponding to the character;

and the character size statistical unit is used for performing column segmentation between characters if the ratio is within a set ratio range, otherwise, not performing column segmentation, and recording the size of each character after the column segmentation, wherein the size of each character comprises the width of the character and the height of the character.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention adjusts the space between every two adjacent characters in the text image according to the character 0 or the character 1 in the binary coding of the embedded information, reduces the influence of printing and scanning on the text image, improves the robustness of information embedding, and improves the capacity of the embedded information by embedding the information in a mode of adjusting the space between every two adjacent characters.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a text digital information embedding method based on character segmentation according to the present invention;

FIG. 2 is a schematic structural diagram of a text digital information embedding system based on character segmentation according to the present invention;

FIG. 3 is a schematic flow chart of a text digital information extraction method based on character segmentation according to the present invention;

FIG. 4 is a schematic structural diagram of a text digital information extraction system based on character segmentation according to the present invention;

FIG. 5 is a text image before information is embedded according to an embodiment of the present invention;

FIG. 6 is a text image after information is embedded according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a schematic flow chart of the text digital information embedding method based on character segmentation. As shown in fig. 1, the method includes the following steps:

step 101: carrying out binarization processing on the text image to obtain a binarized text image;

step 102: and segmenting the binary text image to obtain the line height threshold values of all lines.

The segmenting the binarization text image to obtain the line height threshold values of all lines specifically comprises:

and determining the row height threshold value of each row according to the row projection of each row. The row height obtained from the row projection of a row is used as the row height threshold of that row.

Step 103: performing column segmentation on each row according to the row height threshold of each row to obtain the spacing between every two adjacent characters;

the line-row segmentation is performed on each line according to the line height threshold of each line to obtain the distance between each two adjacent characters, and the method specifically comprises the following steps:

Step 104: determining a minimum distance value according to the distance between two adjacent characters;

step 105: marking the space between the characters adjacent to the punctuation marks as a non-coding region, and marking the space between the characters of the non-punctuation marks as a coding region;

step 106: binary coding is carried out on the information to be embedded;

step 107: adjusting the distance marked as the coding region between every two adjacent characters in the text image according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the distance between two adjacent characters remains unchanged, and if the character 0 is embedded, the distance between two adjacent characters is adjusted to be smaller than the minimum distance value; specifically, it is adjusted to be two pixels smaller than the minimum distance value.
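As a non-limiting sketch, steps 105 to 107 can be expressed on a list of measured spacings rather than on pixels. Several details here are assumptions for illustration: the text does not name a binary code, so UTF-8 is used; the punctuation set is a small hypothetical sample; and the function names are invented.

```python
PUNCTUATION = set("，。、；：？！,.;:?!")  # hypothetical sample, not an exhaustive set

def text_to_bits(message):
    """Binary-code a message as UTF-8 bits, most significant bit first (assumed code)."""
    return [int(b) for byte in message.encode("utf-8") for b in format(byte, "08b")]

def coding_mask(line_chars):
    """Gap i lies between line_chars[i] and line_chars[i + 1].

    A gap is a coding region only if neither neighbour is a punctuation mark.
    """
    return [a not in PUNCTUATION and b not in PUNCTUATION
            for a, b in zip(line_chars, line_chars[1:])]

def plan_gaps(gaps, mask, bits, shrink=2):
    """Return the target spacing of every gap and the number of bits embedded.

    Bit 1 keeps the spacing; bit 0 sets it to (minimum spacing - shrink) pixels.
    """
    target = max(min(gaps) - shrink, 0)   # two pixels below the minimum spacing
    planned, used = [], 0
    for gap, codable in zip(gaps, mask):
        if codable and used < len(bits):
            planned.append(gap if bits[used] == 1 else target)
            used += 1
        else:
            planned.append(gap)           # non-coding region or no bits left
    return planned, used

chars = "文本水印，信息"                   # 7 characters, 6 gaps
gaps = [6, 5, 7, 9, 9, 6]                  # measured spacings in pixels (toy values)
planned, used = plan_gaps(gaps, coding_mask(chars), [1, 0, 0, 1])
```

In this toy row the two gaps next to the comma are skipped, so the four bits land on gaps 0, 1, 2 and 5, and both 0 bits shrink their gaps from 5 and 7 pixels to 3 pixels.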

The invention discloses a text digital information embedding method based on character segmentation, which has the following technical effects:

1. Invisibility of the embedded information is improved: the adjusted character spacing varies with the image resolution, and it is only two pixels smaller than the minimum character spacing before the information is embedded. Tests on 600 dpi text images show that the text image with embedded information has good invisibility.

2. The robustness of the embedded information is improved: the influence of printing and scanning on the text image is reduced. After one round of printing and scanning, the tracing success rate for text images with embedded information is 100%; after copying and scanning, it is 90%.

3. The hiding capacity of the embedded information is improved: because the number of characters in each text line differs, the hiding capacity varies slightly from one text image to another. In tests on a large number of document text images, the method can embed about 40 to 50 bits of information per 100 Chinese characters.

Fig. 2 is a schematic structural diagram of the text digital information embedding system based on character segmentation. As shown in fig. 2, the invention also discloses a text digital information embedding system based on character segmentation, which includes:

a binarization text image obtaining module 201, configured to perform binarization processing on the text image to obtain a binarization text image;

a row height threshold value obtaining module 202, configured to segment the binarized text image to obtain row height threshold values of each row;

the row height threshold obtaining module specifically includes:

a column segmentation module 203, configured to perform column segmentation on each row according to the row height threshold of each row, so as to obtain the distance between every two adjacent characters.

The column segmentation module specifically comprises:

A minimum distance value obtaining module 204, configured to determine a minimum distance value according to a distance between two adjacent characters;

an encoding region marking module 205, configured to mark a space between characters adjacent to the punctuation mark as a non-encoding region, and mark a space between characters of the non-punctuation mark as an encoding region;

an embedded information encoding module 206, configured to perform binary encoding on information to be embedded;

an information embedding module 207, configured to adjust, according to the character 0 or the character 1 in the binary code, a distance marked as the encoding area between each two adjacent characters in the text image: if the character 1 is embedded, the distance between two adjacent characters is kept unchanged, and if the character 0 is embedded, the distance between two adjacent characters is adjusted to be smaller than the minimum distance value.

Fig. 3 is a schematic flow chart of the text digital information extraction method based on character segmentation. As shown in fig. 3, the method includes the following steps:

step 301: carrying out binarization processing on a printed-and-scanned copy of the text embedded with the information to obtain a binarized text image of the embedded information;

step 302: segmenting the binary text image of the embedded information to obtain line height threshold values of all lines;

step 303: performing column segmentation on each row according to the row height threshold of each row, and carrying out size statistics on the character images obtained by the column segmentation;

the row and column segmentation is performed on each row according to the row height threshold value of each row, and the size statistics is performed on the character images segmented from the rows, and the method specifically comprises the following steps:

Step 304: extracting embedded information according to the size of each character image: if the ratio of the width to the height of the character image is within the set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.

Fig. 4 is a schematic structural diagram of the text digital information extraction system based on character segmentation. As shown in fig. 4, the invention also discloses a text digital information extraction system based on character segmentation, which includes:

an information-embedded binarized text image acquisition module 401, configured to perform binarization processing on a printed-and-scanned copy of an information-embedded text to obtain an information-embedded binarized text image;

a line segmentation module 402, configured to segment the binarized text image of the embedded information to obtain line height thresholds of each line;

a character image size statistics module 403, configured to perform column segmentation on each row according to the row height threshold of each row, and perform size statistics on the character images obtained by the column segmentation;

an embedded information extraction module 404, configured to extract embedded information according to a size of each character image: if the ratio of the width to the height of the character image is within the set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.

The character image size statistics module 403 specifically includes:

The following describes in detail the text digital information embedding and extracting method based on character segmentation according to the present invention.

An information embedding section:

Step 1: performing binarization processing on an input text image, then performing line segmentation on the binarized text image to obtain the line heights h_1, h_2, …, h_M of the character images (M is the number of lines in the text image); recording the spacings SP_1, SP_2, …, SP_N between the characters (N is the number of character spacings in the current line); and finding the minimum spacing value min_SP;

Line segmentation: segmentation is performed according to a threshold generated by line projection; specifically, the black pixels in each pixel row of the whole image are summed to judge whether that pixel row contains characters.

Column segmentation: segmentation is performed according to a threshold (the character width threshold) generated by column projection, in the same manner as described above.

Step 2: recording the screened character spacings SP_X, …, SP_Y after removing the spacings at punctuation marks in each line (X and Y denote the positions of the punctuation marks).

Step 3: binary coding the character information to be embedded, and adjusting the spacing between adjacent characters according to the embedded binary information. The adjustment strategy is: if the hidden bit is 1, the characters are kept still; if the hidden bit is 0, the two adjacent characters are each moved toward the middle, so that the character spacing after the move is smaller than the minimum spacing value min_SP.
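The pixel-level move for a hidden bit of 0 can be sketched as follows. This is a minimal illustration, assuming one text line is held as a NumPy array with ink stored as 1 and that the column spans of the two characters are already known from column segmentation; the function name `close_gap` and the even split of the shift between the two characters are assumptions.

```python
import numpy as np

def close_gap(row_img, left_span, right_span, new_gap):
    """Move two adjacent characters toward each other until their gap is new_gap.

    row_img:    2D array of one text line, 1 for ink, 0 for background.
    left_span:  (start, end) columns of the left character, end exclusive.
    right_span: (start, end) columns of the right character, end exclusive.
    """
    if new_gap < 0:
        raise ValueError("new_gap must be non-negative")
    (l0, l1), (r0, r1) = left_span, right_span
    shift = (r0 - l1) - new_gap            # total number of columns to remove
    if shift <= 0:
        return row_img.copy()              # gap is already small enough
    move_left = shift // 2                 # left character moves right by this much
    move_right = shift - move_left         # right character moves left by this much
    left = row_img[:, l0:l1].copy()
    right = row_img[:, r0:r1].copy()
    out = row_img.copy()
    out[:, l0:r1] = 0                      # erase both characters and the old gap
    out[:, l0 + move_left:l1 + move_left] = left
    out[:, r0 - move_right:r1 - move_right] = right
    return out

# Two 5-column characters separated by a 5-pixel gap; take min_SP = 5, so target = 3.
line = np.zeros((4, 20), dtype=np.uint8)
line[:, 2:7] = 1
line[:, 12:17] = 1
moved = close_gap(line, (2, 7), (12, 17), new_gap=3)
```

Note that in this sketch the spacings on the outer sides of the two characters grow by the amount each character moved, so a full implementation would have to take that into account when handling the neighbouring gaps.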

an information extraction section:

Step 1: acquiring a printed-and-scanned copy of the document text with embedded information and binarizing the image; first, line segmentation is used to obtain the line images line_1, line_2, …, line_M (M is the number of lines in the text image), whose line heights are h'_1, h'_2, …, h'_M.

Step 2: performing column segmentation on each line image line_1, line_2, …, line_M, where the segmentation distance for each line is its line height h'_1, h'_2, …, h'_M multiplied by a suitable scaling factor θ.

Step 3: carrying out size statistics on the character images cut out by the column segmentation.

Step 4: because the character spacing was changed, some of the segmented characters are stuck together, and the image of stuck characters is larger than that of a normal character; therefore, whether the hidden information is 0 or 1 is judged from the image size of each character and the line height h'_1, h'_2, …, h'_M of its line, so that the embedded information is extracted accurately.

For example, let the line height of the first line be h'_1. According to statistics on available font libraries, the width and height of a character (line height and character height are about the same) should theoretically not differ by a multiple, whereas the width of the stuck characters produced by the embedding operation is significantly larger than the width of a single character. The counted character width is therefore compared with the line height: if the width exceeds a certain multiple of the line height h'_1, the binary information is taken to be 0; otherwise it is taken to be 1. The text image before information embedding is shown in fig. 5, and the text image after information embedding is shown in fig. 6.
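The decision rule in this example can be sketched as below. The multiple of the line height is not specified in the text, so `factor=1.5` is a hypothetical value; decoding the bits as UTF-8 is likewise an assumption, since the text does not name the binary code; the function names are invented for this sketch.

```python
def extract_bits(char_widths, line_height, factor=1.5):
    """Classify each segmented character image of one line.

    A width above factor * line_height is read as stuck characters (bit 0);
    any other width is read as a normal character (bit 1).
    """
    return [0 if width > factor * line_height else 1 for width in char_widths]

def bits_to_text(bits):
    """Decode bits (most significant bit first) as UTF-8; trailing partial bytes are dropped."""
    usable = len(bits) - len(bits) % 8
    data = bytes(
        int("".join(str(b) for b in bits[i:i + 8]), 2)
        for i in range(0, usable, 8)
    )
    return data.decode("utf-8", errors="replace")

# Toy line with height 40 pixels: widths near 80 are stuck pairs, widths near 40 are single.
widths = [80, 38, 79, 81, 82, 80, 78, 39]
bits = extract_bits(widths, line_height=40)
message = bits_to_text(bits)
```

With these toy widths the threshold is 60 pixels, so the line yields the bits 01000001, which decode to the single letter "A" under the assumed UTF-8 code.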

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A text digital information embedding method based on character segmentation is characterized by comprising the following steps:

binary coding is carried out on the information to be embedded;

2. The method for embedding text digital information based on character segmentation as claimed in claim 1, wherein the step of segmenting the binarized text image to obtain row height threshold values of each row specifically comprises:

3. The method for embedding text digital information based on character segmentation as claimed in claim 1, wherein the segmenting of rows and columns according to the row height threshold value of each row to obtain the space between each two adjacent characters specifically comprises:

4. A system for embedding text digital information based on character segmentation, comprising:

5. The system for embedding text digital information based on character segmentation as claimed in claim 4, wherein the row height threshold obtaining module specifically comprises:

6. The system for embedding text digital information based on character segmentation as claimed in claim 4, wherein the column segmentation module specifically comprises:

7. A method for extracting digital information from a text based on character segmentation, which is applied to the method for embedding digital information from a text based on character segmentation as claimed in any one of claims 1 to 3, and comprises:

8. The method for extracting digital information from a text based on character segmentation as claimed in claim 7, wherein the step of segmenting rows and columns according to the row height threshold and performing size statistics on the character images segmented from the columns comprises:

9. A system for extracting text digital information based on character segmentation is characterized by comprising:

10. The system for extracting digital information from a text based on character segmentation according to claim 9, wherein the character image size statistics module specifically includes: