CN112990178A - Text digital information embedding and extracting method and system based on character segmentation - Google Patents


Info

Publication number
CN112990178A
Authority
CN
China
Prior art keywords
character
characters
row
segmentation
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110392436.XA
Other languages
Chinese (zh)
Other versions
CN112990178B (en
Inventor
史祎诗
祝玉鹏
吕文晋
陶冶
孙鑫凯
方俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuxin Kunpeng Beijing Information Technology Co ltd
University of Chinese Academy of Sciences
Original Assignee
Fuxin Kunpeng Beijing Information Technology Co ltd
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuxin Kunpeng Beijing Information Technology Co ltd, University of Chinese Academy of Sciences filed Critical Fuxin Kunpeng Beijing Information Technology Co ltd
Priority to CN202110392436.XA priority Critical patent/CN112990178B/en
Publication of CN112990178A publication Critical patent/CN112990178A/en
Application granted granted Critical
Publication of CN112990178B publication Critical patent/CN112990178B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

The invention relates to a text digital information embedding and extracting method and system based on character segmentation. The method comprises the following steps: segmenting the binarized text image to obtain the line height threshold of each line; performing column segmentation on each line according to its line height threshold to obtain the spacing between every two adjacent characters; determining a minimum spacing value from the spacings between adjacent characters; marking the spacings adjacent to punctuation marks as non-coding regions, and marking the spacings between non-punctuation characters as coding regions; binary-coding the information to be embedded; and adjusting each spacing marked as a coding region in the text image according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the spacing between the two adjacent characters remains unchanged, and if the character 0 is embedded, the spacing between the two adjacent characters is adjusted to be smaller than the minimum spacing value. The invention improves the robustness and hiding capacity of information embedding.

Description

Text digital information embedding and extracting method and system based on character segmentation
Technical Field
The invention relates to the field of text digital watermarks, in particular to a text digital information embedding and extracting method and system based on character segmentation.
Background
With the popularization of the internet, the dissemination and use of digital image products and electronic documents have become more convenient. Text is currently an important medium through which people obtain information in daily life, and documents such as periodicals, newspapers and books have greatly enriched the ways people read. However, because a text document carries far less information redundancy than an image, its copyright protection is particularly important. In addition, insiders in many security-sensitive departments pose a risk of leaking confidential material. Embedding the name of the operator and the date in the electronic document can therefore effectively address security problems such as copyright protection and tracing the source after a confidential document is leaked.
After tracing information is embedded in an electronic document, the hardest problem to solve is that the tracing information is lost, or is extracted with errors, after the document has been printed multiple times. Existing tracing techniques for text documents that resist printing and scanning fall roughly into three types: algorithms based on the text image, algorithms based on the text format, and algorithms based on the text content. The first type mainly embeds information by changing the edge pixels of segmented characters, and its robustness deteriorates after several rounds of printing and scanning. The second type mostly hides information by changing the line and column spacing of text characters or the file format, but the amount of information it can hide is too small to be effective in ordinary texts. The third type mainly changes the content of the text, for example by embedding hidden information through synonym replacement, but many files must not have their content modified, so its range of application is small.
Disclosure of Invention
The invention aims to provide a text digital information embedding and extracting method and system based on character segmentation, which improve the robustness and hidden capacity of information embedding.
In order to achieve the purpose, the invention provides the following scheme:
a text digital information embedding method based on character segmentation comprises the following steps:
carrying out binarization processing on the text image to obtain a binarized text image;
segmenting the binary text image to obtain line height threshold values of all lines;
segmenting rows and columns according to the row height threshold values of the rows to obtain the space between every two adjacent characters;
determining a minimum distance value according to the distance between two adjacent characters;
marking the space between the characters adjacent to the punctuation marks as a non-coding region, and marking the space between the characters of the non-punctuation marks as a coding region;
binary coding is carried out on the information to be embedded;
adjusting the distance marked as the coding region between every two adjacent characters in the text image according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the distance between two adjacent characters is kept unchanged, and if the character 0 is embedded, the distance between two adjacent characters is adjusted to be smaller than the minimum distance value.
Optionally, the segmenting the binarized text image to obtain row height threshold values of each row specifically includes:
and determining the row height threshold value of each row according to the row projection of each row.
Optionally, the segmenting each row of rows according to the row height threshold of each row to obtain the distance between each two adjacent characters specifically includes:
determining the width between characters according to the column projection of each column;
obtaining the ratio of the width between the characters to the height threshold of the line corresponding to the characters;
and if the ratio is within the range of the set ratio, performing column segmentation between the characters, otherwise, not performing the column segmentation, and recording the space between the two characters subjected to the column segmentation.
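The optional column segmentation rule above can be illustrated with a minimal Python sketch. It assumes a binarized row image stored as lists of 0 (white) and 1 (black) pixels; the function names `find_gaps` and `select_cuts` and the example ratio range are hypothetical, since the text does not state the set ratio range.

```python
def find_gaps(row_image):
    """Return (start, width) for each run of blank columns between inked columns."""
    height = len(row_image)
    width = len(row_image[0])
    # Column projection: number of black pixels in each pixel column.
    projection = [sum(row_image[y][x] for y in range(height)) for x in range(width)]
    gaps, start, seen_ink = [], None, False
    for x, count in enumerate(projection):
        if count == 0:
            if seen_ink and start is None:
                start = x                     # a blank run begins after some ink
        else:
            if start is not None:
                gaps.append((start, x - start))
                start = None
            seen_ink = True
    return gaps                               # leading and trailing margins are ignored


def select_cuts(gaps, row_height, ratio_range):
    """Keep only the gaps whose width / row-height ratio lies in the set range."""
    low, high = ratio_range                   # assumed values, not given in the text
    return [(s, w) for s, w in gaps if low <= w / row_height <= high]


row_image = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
]
gaps = find_gaps(row_image)
cuts = select_cuts(gaps, row_height=2, ratio_range=(1.0, 3.0))
print(gaps, cuts)
```

In this toy row the one-pixel gap is rejected (for instance a gap inside a left-right structured character) while the three-pixel gap is accepted as a cut and its spacing recorded.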
The invention also discloses a text digital information embedding system based on character segmentation, which comprises:
the binarization text image obtaining module is used for carrying out binarization processing on the text image to obtain a binarization text image;
the row height threshold value acquisition module is used for segmenting the binary text image to obtain row height threshold values of all rows;
the column segmentation module is used for performing column segmentation on each row according to the row height threshold of each row to obtain the spacing between every two adjacent characters;
the minimum distance value acquisition module is used for determining a minimum distance value according to the distance between two adjacent characters;
the coding region marking module is used for marking the space between the characters adjacent to the punctuation marks as a non-coding region and marking the space between the characters of the non-punctuation marks as a coding region;
the embedded information coding module is used for carrying out binary coding on the information to be embedded;
the information embedding module is used for adjusting the distance marked as the coding region between every two adjacent characters in the text image according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the distance between two adjacent characters is kept unchanged, and if the character 0 is embedded, the distance between two adjacent characters is adjusted to be smaller than the minimum distance value.
Optionally, the row height threshold obtaining module specifically includes:
and the row height threshold value determining unit is used for determining the row height threshold value of each row according to the row projection of each row.
Optionally, the column splitting module specifically includes:
a first inter-character width determining unit for determining a width between characters according to a column projection of each column;
a ratio determination unit between the first width and the row height threshold value, for obtaining the ratio between the width between the characters and the row height threshold value of the row corresponding to the characters;
and the column segmentation unit is used for performing column segmentation between the characters if the ratio is within a set ratio range, otherwise not performing the column segmentation, and recording the distance between the two characters subjected to the column segmentation.
The invention also discloses a text digital information extraction method based on character segmentation, which is applied to the text digital information embedding method based on character segmentation and comprises the following steps:
carrying out binarization processing on a printed scanning piece of the text embedded with the information to obtain a binarization text image of the embedded information;
segmenting the binary text image of the embedded information to obtain line height threshold values of all lines;
segmenting rows and columns of each row according to the row height threshold value of each row, and carrying out size statistics on character images segmented from the columns;
extracting embedded information according to the size of each character image: if the ratio of the width to the height of the character image is within the set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.
Optionally, the segmenting rows and columns according to the row height threshold of each row and performing size statistics on the character images segmented from the rows specifically includes:
determining the width between characters according to the column projection of each column;
obtaining the ratio of the width between the characters to the height threshold of the line corresponding to the characters;
and if the ratio is within the set ratio range, performing column segmentation between the characters, otherwise, not performing column segmentation, and recording the size of each character after the column segmentation, wherein the size of each character comprises the width of the character and the height of the character.
The invention also discloses a text digital information extraction system based on character segmentation, which comprises:
the embedded information binaryzation text image acquisition module is used for carrying out binaryzation processing on a printing scanning piece of the embedded information text to obtain an embedded information binaryzation text image;
the line segmentation module is used for segmenting the binary text image of the embedded information to obtain line height threshold values of all lines;
the character image size statistic module is used for segmenting rows and columns according to the row height threshold value of each row and carrying out size statistics on the character images segmented from the rows;
an embedded information extraction module for extracting embedded information according to the size of each character image: if the ratio of the width to the height of the character image is within the set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.
Optionally, the character image size statistics module specifically includes:
a second inter-character width determining unit for determining a width between characters according to the column projection of each column;
a ratio determination unit between the second width and the row height threshold value, configured to obtain a ratio between the width between the characters and the row height threshold value of the row corresponding to the character;
and the character size statistical unit is used for performing column segmentation between characters if the ratio is within a set ratio range, otherwise, not performing column segmentation, and recording the size of each character after the column segmentation, wherein the size of each character comprises the width of the character and the height of the character.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention adjusts the space between every two adjacent characters in the text image according to the character 0 or the character 1 in the binary coding of the embedded information, reduces the influence of printing and scanning on the text image, improves the robustness of information embedding, and improves the capacity of the embedded information by embedding the information in a mode of adjusting the space between every two adjacent characters.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a text digital information embedding method based on character segmentation according to the present invention;
FIG. 2 is a schematic structural diagram of a text digital information embedding system based on character segmentation according to the present invention;
FIG. 3 is a schematic flow chart of a text digital information extraction method based on character segmentation according to the present invention;
FIG. 4 is a schematic structural diagram of a text digital information extraction system based on character segmentation according to the present invention;
FIG. 5 is a text image before information is embedded according to an embodiment of the present invention;
FIG. 6 is a text image after information is embedded according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a text digital information embedding and extracting method and system based on character segmentation, which improve the robustness and hidden capacity of information embedding.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of the text digital information embedding method based on character segmentation. As shown in Fig. 1, the method includes the following steps:
step 101: carrying out binarization processing on the text image to obtain a binarized text image;
Step 102: segmenting the binarized text image to obtain the row height threshold of each row.
Segmenting the binarized text image to obtain the row height threshold of each row specifically comprises:
determining the row height threshold of each row according to the row projection of that row. The row height produced by the row projection of a row is taken as the row height threshold of that row.
Step 103: segmenting rows and columns according to the row height threshold values of the rows to obtain the space between every two adjacent characters;
Performing column segmentation on each row according to the row height threshold of each row to obtain the spacing between every two adjacent characters specifically comprises:
determining the width between characters according to the column projection of each column;
obtaining the ratio of the width between the characters to the height threshold of the line corresponding to the characters;
and if the ratio is within the range of the set ratio, performing column segmentation between the characters, otherwise, not performing the column segmentation, and recording the space between the two characters subjected to the column segmentation.
Step 104: determining a minimum distance value according to the distance between two adjacent characters;
step 105: marking the space between the characters adjacent to the punctuation marks as a non-coding region, and marking the space between the characters of the non-punctuation marks as a coding region;
step 106: binary coding is carried out on the information to be embedded;
step 107: adjusting each spacing marked as a coding region between adjacent characters in the text image according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the spacing between the two adjacent characters remains unchanged, and if the character 0 is embedded, the spacing between the two adjacent characters is adjusted to be smaller than the minimum spacing value; specifically, it is adjusted to be two pixels smaller than the minimum spacing value.
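Steps 104 to 107 can be sketched on a list of gap widths rather than on pixels. The following is a minimal illustration, not the patented implementation: the function name `embed_bits`, the boolean `coding` mask, and the choice to clamp the target spacing at zero are assumptions made for the example.

```python
def embed_bits(gaps, coding, bits, shrink=2):
    """Return new gap widths after embedding bits into the coding-region gaps.

    gaps:   gap widths in pixels between adjacent characters of a text image
    coding: True where the gap is a coding region, False next to punctuation
    bits:   sequence of 0/1 values to embed
    shrink: how far below the minimum spacing a '0' gap is set (two pixels in the text)
    """
    min_sp = min(gaps)                      # step 104: minimum spacing value
    target = max(min_sp - shrink, 0)        # clamp at zero (assumption)
    adjusted = list(gaps)
    remaining = iter(bits)
    for i, is_coding in enumerate(coding):
        if not is_coding:
            continue                        # non-coding region: never modified
        bit = next(remaining, None)
        if bit is None:
            break                           # all bits embedded
        if bit == 0:
            adjusted[i] = target            # '0': squeeze the two characters together
        # '1': spacing is left unchanged
    return adjusted


gaps = [6, 5, 7, 6, 8]
coding = [True, True, False, True, True]
print(embed_bits(gaps, coding, bits=[1, 0, 0, 1]))
```

Every gap that carries a 0 ends up strictly below the smallest original spacing, which is what lets the extraction side tell the two cases apart.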
The invention discloses a text digital information embedding method based on character segmentation, which has the following technical effects:
1. Invisibility of the embedded information is improved: the adjusted character spacing differs according to the image resolution, and it is only two pixels less than the minimum character spacing before the information is embedded. Tests on 600 dpi text images show that text images with embedded information have good invisibility.
2. Robustness of the embedded information is improved: the influence of printing and scanning on the text image is reduced. After one round of printing and scanning, the tracing success rate for text images with embedded information is 100%; after copying and scanning, the tracing success rate is 90%.
3. Hiding capacity of the embedded information is improved: because text lines contain different numbers of characters, the hiding capacity varies slightly from one text image to another. In tests on a large number of document text images, the method can embed about 40-50 bits of information per 100 Chinese characters.
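To make the capacity figure concrete, the sketch below binary-codes a short payload and estimates how many Chinese characters of cover text it would need at the reported 40-50 bits per 100 characters. The UTF-8 byte encoding and the helper names `to_bits` and `chars_needed` are assumptions; the text does not specify how the information is binary-coded.

```python
def to_bits(text, encoding="utf-8"):
    """Binary-code a string as a list of bits, most significant bit first."""
    return [(byte >> shift) & 1
            for byte in text.encode(encoding)
            for shift in range(7, -1, -1)]


def chars_needed(n_bits, bits_per_100_chars):
    """Cover-text characters needed for n_bits, rounded up."""
    return -(-n_bits * 100 // bits_per_100_chars)   # ceiling division


payload = to_bits("ab")
print(len(payload))                  # number of bits to embed
print(chars_needed(len(payload), 40))  # pessimistic end of the reported range
print(chars_needed(len(payload), 50))  # optimistic end of the reported range
```

Under these assumptions a two-byte payload needs a few dozen characters of cover text, so a name and a date fit comfortably on an ordinary page.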
Fig. 2 is a schematic structural diagram of the text digital information embedding system based on character segmentation. As shown in Fig. 2, the invention also discloses a text digital information embedding system based on character segmentation, which includes:
a binarization text image obtaining module 201, configured to perform binarization processing on the text image to obtain a binarization text image;
a row height threshold value obtaining module 202, configured to segment the binarized text image to obtain row height threshold values of each row;
the row height threshold obtaining module specifically includes:
and the row height threshold value determining unit is used for determining the row height threshold value of each row according to the row projection of each row.
And the column segmentation module 203 is configured to segment rows and columns of each row according to the row height threshold of each row, so as to obtain a distance between each two adjacent characters.
The column cutting module specifically comprises:
a first inter-character width determining unit for determining a width between characters according to a column projection of each column;
a ratio determination unit between the first width and the row height threshold value, for obtaining the ratio between the width between the characters and the row height threshold value of the row corresponding to the characters;
and the column segmentation unit is used for performing column segmentation between the characters if the ratio is within a set ratio range, otherwise not performing the column segmentation, and recording the distance between the two characters subjected to the column segmentation.
A minimum distance value obtaining module 204, configured to determine a minimum distance value according to a distance between two adjacent characters;
an encoding region marking module 205, configured to mark a space between characters adjacent to the punctuation mark as a non-encoding region, and mark a space between characters of the non-punctuation mark as an encoding region;
an embedded information encoding module 206, configured to perform binary encoding on information to be embedded;
an information embedding module 207, configured to adjust, according to the character 0 or the character 1 in the binary code, a distance marked as the encoding area between each two adjacent characters in the text image: if the character 1 is embedded, the distance between two adjacent characters is kept unchanged, and if the character 0 is embedded, the distance between two adjacent characters is adjusted to be smaller than the minimum distance value.
Fig. 3 is a schematic flow chart of the text digital information extraction method based on character segmentation. As shown in Fig. 3, the method includes the following steps:
step 301: carrying out binarization processing on a printed scanning piece of the text embedded with the information to obtain a binarization text image of the embedded information;
step 302: segmenting the binary text image of the embedded information to obtain line height threshold values of all lines;
step 303: segmenting rows and columns of each row according to the row height threshold value of each row, and carrying out size statistics on character images segmented from the columns;
Performing column segmentation on each row according to the row height threshold of each row, and carrying out size statistics on the character images obtained by the column segmentation, specifically comprises:
determining the width between characters according to the column projection of each column;
obtaining the ratio of the width between the characters to the height threshold of the line corresponding to the characters;
and if the ratio is within the set ratio range, performing column segmentation between the characters, otherwise, not performing column segmentation, and recording the size of each character after the column segmentation, wherein the size of each character comprises the width of the character and the height of the character.
Step 304: extracting embedded information according to the size of each character image: if the ratio of the width to the height of the character image is within the set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.
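Step 304 reduces to a per-image comparison of width against height. A minimal sketch follows, assuming the set range is expressed as a single upper bound on the width-to-height ratio; the name `extract_bits` and the bound of 1.3 are hypothetical, as the text does not give the range.

```python
def extract_bits(char_sizes, max_ratio=1.3):
    """Map each segmented character image to one bit.

    char_sizes: list of (width, height) in pixels from the size statistics
    max_ratio:  assumed upper end of the set range for width / height
    """
    bits = []
    for width, height in char_sizes:
        ratio = width / height
        # Within the set range: a normal single character, read as 1.
        # Beyond it: two characters stuck together, read as 0.
        bits.append(1 if ratio <= max_ratio else 0)
    return bits


sizes = [(38, 40), (78, 40), (39, 40), (80, 41)]
print(extract_bits(sizes))
```

A practical decoder would also need to skip the non-coding regions next to punctuation so that the bit sequence lines up with the one that was embedded.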
Fig. 4 is a schematic structural diagram of the text digital information extraction system based on character segmentation. As shown in Fig. 4, the invention also discloses a text digital information extraction system based on character segmentation, which includes:
an information-embedded binary text image acquisition module 401, configured to perform binarization processing on a printed scanned part of an information-embedded text to obtain an information-embedded binary text image;
a line segmentation module 402, configured to segment the binarized text image of the embedded information to obtain line height thresholds of each line;
a character image size statistic module 403, configured to segment rows and columns of each row according to a row height threshold of each row, and perform size statistics on the character images segmented from the rows;
an embedded information extraction module 404, configured to extract embedded information according to a size of each character image: if the ratio of the width to the height of the character image is within the set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.
The character image size statistics module 403 specifically includes:
a second inter-character width determining unit for determining a width between characters according to the column projection of each column;
a ratio determination unit between the second width and the row height threshold value, configured to obtain a ratio between the width between the characters and the row height threshold value of the row corresponding to the character;
and the character size statistical unit is used for performing column segmentation between characters if the ratio is within a set ratio range, otherwise, not performing column segmentation, and recording the size of each character after the column segmentation, wherein the size of each character comprises the width of the character and the height of the character.
The following describes in detail the text digital information embedding and extracting method based on character segmentation according to the present invention.
Information embedding section:
Step 1: Binarize the input text image, then perform line segmentation on the binarized text image. From the line segmentation, obtain the line heights h1, h2, …, hM of the character images (M is the number of lines in the text image), record the spacings SP1, SP2, …, SPN between characters (N is the index of the character spacing in the current line), and find the minimum spacing value min_SP;
Line segmentation: segmentation is performed according to a threshold generated by the row projection. Specifically, the black pixels in each pixel row of the whole image are summed to judge whether that row contains characters.
Column segmentation: segmentation is performed according to the threshold (character width threshold) generated by the column projection, in the same manner as above.
Step 2: Record the character spacings and screen them by removing the spacings SPX … SPY at the punctuation marks in each line (X and Y are the positions of the punctuation marks).
Step 3: Binary-code the character information to be embedded, and adjust the spacing between adjacent characters according to the embedded binary information. The adjustment strategy is: if the hidden bit is 1, the characters are left in place; if the hidden bit is 0, the two adjacent characters each move toward the middle, so that the character spacing after moving is smaller than the minimum spacing value min_SP;
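The line segmentation used in Step 1 of the embedding section, summing the black pixels of each pixel row, can be sketched as follows. This is a minimal illustration on a tiny image stored as lists of 0 (white) and 1 (black); the function name `segment_lines` and the `min_black` threshold are assumptions.

```python
def segment_lines(image, min_black=1):
    """Split a binarized image into text lines using the row projection.

    image:     list of pixel rows, each a list of 0 (white) / 1 (black)
    min_black: assumed number of black pixels for a row to count as text
    Returns (top, bottom, height) per text line, with bottom exclusive.
    """
    projection = [sum(row) for row in image]     # black pixels per pixel row
    lines, start = [], None
    for y, count in enumerate(projection):
        if count >= min_black and start is None:
            start = y                            # a text line begins
        elif count < min_black and start is not None:
            lines.append((start, y, y - start))  # a text line ends
            start = None
    if start is not None:                        # text touches the bottom edge
        lines.append((start, len(image), len(image) - start))
    return lines


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
print(segment_lines(page))
```

The third value of each tuple plays the role of the line height h1, h2, …, hM that the later column segmentation compares gap widths against.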
an information extraction section:
step 1: acquiring a printing scanning piece embedded with an information document text, carrying out binarization processing on an image, firstly, using line segmentation to obtain each line of image line1,line2…lineM(M is the line number of the text image) and the line height of each line is h1 ,h2 …hM
Step 2: line for each line of image1,line2…lineMRespectively performing column segmentation with the segmentation distance of each row being high h1 ,h2 …hM Multiplied by a suitable scaling factor theta.
Step 3: and carrying out size statistics on the character images cut out by the column segmentation.
Step 4: the cut characters are stuck due to the change of the character spacing, the image size of the stuck characters is larger than that of the normal characters, so the image size and the line height h of each line according to each character1 ,h2 …hM To judge whether the hidden information is 0 or 1 so as to accurately extract the embedded information.
For example, let the line height of the first line be h_1. According to statistics on available font libraries, the width and the height of a character should in theory not differ by a multiple (the line height and the character height are approximately equal), whereas the width of stuck characters produced by the embedding operation is significantly larger than the width of a single character. The measured character width is therefore compared with the line height: if the width exceeds a certain multiple of the line height h_1, the binary information is taken to be 0; otherwise it is 1. The text image before information embedding is shown in Fig. 5, and the text image after information embedding is shown in Fig. 6.
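The width-versus-line-height decision described above can be sketched as a simple per-image classification. This is an illustrative sketch only: the function name and the default multiple of 1.5 are assumptions (the text says only "a certain multiple"), and the mapping from character images back to coding positions is not specified in the text and is left out.

```python
def extract_bits(char_widths, line_height, ratio_threshold=1.5):
    """Classify each segmented character image of one line as bit 1 or bit 0.

    char_widths     -- widths in pixels of the images produced by column segmentation
    line_height     -- height h of the text line in pixels
    ratio_threshold -- hypothetical multiple of the line height above which an
                       image is treated as stuck (merged) characters
    """
    bits = []
    for width in char_widths:
        if width > ratio_threshold * line_height:
            bits.append(0)   # abnormally wide image: characters were pushed together
        else:
            bits.append(1)   # normal width: spacing was left unchanged
    return bits


# Line height 20 px; the third image is roughly two characters wide.
print(extract_bits([18, 20, 41, 19], line_height=20))  # [1, 1, 0, 1]
```

In practice the threshold would be tuned to the font and to the scaling factor used during column segmentation.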
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A text digital information embedding method based on character segmentation, characterized by comprising the following steps:
performing binarization on the text image to obtain a binarized text image;
performing line segmentation on the binarized text image to obtain a line height threshold for each line;
performing column segmentation on each line according to the line height threshold of that line to obtain the spacing between every two adjacent characters;
determining a minimum spacing value according to the spacing between two adjacent characters;
marking the spacing between characters adjacent to a punctuation mark as a non-coding region, and marking the spacing between non-punctuation characters as a coding region;
binary-coding the information to be embedded;
adjusting, in the text image, each spacing between two adjacent characters that is marked as a coding region according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the spacing between the two adjacent characters is kept unchanged, and if the character 0 is embedded, the spacing between the two adjacent characters is adjusted to be smaller than the minimum spacing value.
2. The text digital information embedding method based on character segmentation according to claim 1, wherein performing line segmentation on the binarized text image to obtain the line height threshold of each line specifically comprises:
determining the line height threshold of each line according to the row projection of that line.
3. The text digital information embedding method based on character segmentation according to claim 1, wherein performing column segmentation on each line according to the line height threshold of that line to obtain the spacing between every two adjacent characters specifically comprises:
determining the width between characters according to the column projection of each column;
obtaining the ratio of the width between the characters to the line height threshold of the line containing the characters;
if the ratio is within a set ratio range, performing column segmentation between the characters, and otherwise not performing column segmentation; and recording the spacing between the two characters separated by the column segmentation.
4. A text digital information embedding system based on character segmentation, comprising:
a binarized text image obtaining module, configured to perform binarization on the text image to obtain a binarized text image;
a line height threshold obtaining module, configured to perform line segmentation on the binarized text image to obtain a line height threshold for each line;
a column segmentation module, configured to perform column segmentation on each line according to the line height threshold of that line to obtain the spacing between every two adjacent characters;
a minimum spacing value obtaining module, configured to determine a minimum spacing value according to the spacing between two adjacent characters;
a coding region marking module, configured to mark the spacing between characters adjacent to a punctuation mark as a non-coding region and to mark the spacing between non-punctuation characters as a coding region;
an embedded information coding module, configured to binary-code the information to be embedded;
an information embedding module, configured to adjust, in the text image, each spacing between two adjacent characters that is marked as a coding region according to the character 0 or the character 1 in the binary code: if the character 1 is embedded, the spacing between the two adjacent characters is kept unchanged, and if the character 0 is embedded, the spacing between the two adjacent characters is adjusted to be smaller than the minimum spacing value.
5. The text digital information embedding system based on character segmentation according to claim 4, wherein the line height threshold obtaining module specifically comprises:
a line height threshold determining unit, configured to determine the line height threshold of each line according to the row projection of that line.
6. The text digital information embedding system based on character segmentation according to claim 4, wherein the column segmentation module specifically comprises:
a first inter-character width determining unit, configured to determine the width between characters according to the column projection of each column;
a first width-to-line-height-threshold ratio determining unit, configured to obtain the ratio of the width between the characters to the line height threshold of the line containing the characters;
a column segmentation unit, configured to perform column segmentation between the characters if the ratio is within a set ratio range and otherwise not to perform column segmentation, and to record the spacing between the two characters separated by the column segmentation.
7. A text digital information extraction method based on character segmentation, which is applied to the text digital information embedding method based on character segmentation according to any one of claims 1 to 3, and comprises:
performing binarization on a printed-and-scanned copy of the text carrying the embedded information to obtain a binarized text image of the embedded information;
performing line segmentation on the binarized text image of the embedded information to obtain a line height threshold for each line;
performing column segmentation on each line according to the line height threshold of that line, and collecting size statistics on the character images cut out by the column segmentation;
extracting the embedded information according to the size of each character image: if the ratio of the width to the height of the character image is within a set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.
8. The text digital information extraction method based on character segmentation according to claim 7, wherein performing column segmentation on each line according to the line height threshold of that line and collecting size statistics on the character images cut out by the column segmentation specifically comprises:
determining the width between characters according to the column projection of each column;
obtaining the ratio of the width between the characters to the line height threshold of the line containing the characters;
if the ratio is within a set ratio range, performing column segmentation between the characters, and otherwise not performing column segmentation; and recording the size of each character after the column segmentation, wherein the size of each character comprises the width of the character and the height of the character.
9. A text digital information extraction system based on character segmentation, characterized by comprising:
an embedded-information binarized text image obtaining module, configured to perform binarization on a printed-and-scanned copy of the text carrying the embedded information to obtain a binarized text image of the embedded information;
a line segmentation module, configured to perform line segmentation on the binarized text image of the embedded information to obtain a line height threshold for each line;
a character image size statistics module, configured to perform column segmentation on each line according to the line height threshold of that line and to collect size statistics on the character images cut out by the column segmentation;
an embedded information extraction module, configured to extract the embedded information according to the size of each character image: if the ratio of the width to the height of the character image is within a set range, the embedded binary information is 1, and if the ratio of the width to the height of the character image exceeds the set range, the embedded binary information is 0.
10. The text digital information extraction system based on character segmentation according to claim 9, wherein the character image size statistics module specifically comprises:
a second inter-character width determining unit, configured to determine the width between characters according to the column projection of each column;
a second width-to-line-height-threshold ratio determining unit, configured to obtain the ratio of the width between the characters to the line height threshold of the line containing the characters;
a character size statistics unit, configured to perform column segmentation between the characters if the ratio is within a set ratio range and otherwise not to perform column segmentation, and to record the size of each character after the column segmentation, wherein the size of each character comprises the width of the character and the height of the character.
CN202110392436.XA 2021-04-13 2021-04-13 Text digital information embedding and extracting method and system based on character segmentation Active CN112990178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110392436.XA CN112990178B (en) 2021-04-13 2021-04-13 Text digital information embedding and extracting method and system based on character segmentation

Publications (2)

Publication Number Publication Date
CN112990178A true CN112990178A (en) 2021-06-18
CN112990178B CN112990178B (en) 2022-06-24

Family

ID=76338124

Country Status (1)

Country Link
CN (1) CN112990178B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761231A (en) * 2021-09-07 2021-12-07 浙江传媒学院 Text character feature-based text data attribution description and generation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5761686A (en) * 1996-06-27 1998-06-02 Xerox Corporation Embedding encoded information in an iconic version of a text image
CN101251892A (en) * 2008-03-07 2008-08-27 北大方正集团有限公司 Method and apparatus for cutting character
US20080310672A1 (en) * 2005-09-16 2008-12-18 Donglin Wang Embedding and detecting hidden information
US20150074814A1 (en) * 2013-09-10 2015-03-12 Crimsonlogic Pte Ltd Method and system for embedding data in a text document
CN107248134A (en) * 2017-04-25 2017-10-13 李晓妮 Information concealing method and device in a kind of text document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NAIXIN ZHANG ET AL.: "《Chinese NER Using Dynamic Meta-Embeddings》", 《IEEE ACCESS》 *
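LI XIANGHUI ET AL.: "Research on Methods for Improving the Information Hiding Capacity of Word Text Documents", 《Computer Technology and Development》 *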

Also Published As

Publication number Publication date
CN112990178B (en) 2022-06-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant