WO2015021737A1 - Method for converting paper file into electronic file - Google Patents
Method for converting paper file into electronic file
- Publication number
- WO2015021737A1 (PCT/CN2014/000694)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- electronic
- file
- picture file
- electronic picture
- blocks
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/158—Segmentation of character regions using character size, text spacings or pitch estimation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Definitions
- The present invention relates to the technical field of converting paper documents into electronic files, and more particularly to a method of converting a paper document into an electronic file.
- Background
- A common technique for converting paper documents into electronic files is OCR (Optical Character Recognition).
- The specific process is as follows: the paper document is scanned into an electronic picture file; the electronic picture file is divided into multiple character pictures, each containing only one character; the character in each character picture is recognized one by one, with error-correction and association functions used to reduce the error rate; and the recognition results are output in order to obtain the final electronic file.
- The core of OCR is recognizing the character pictures one by one, judging each by its outline. Because many characters have similar outlines, the recognition accuracy is not high, and the resulting electronic file is not very accurate. To improve accuracy, OCR spends a large amount of time on character recognition, searching for suspicious characters, error correction, and similar processing, so its efficiency is also low.
- Summary of the invention
- The technical problem to be solved by the present invention is to provide a method for converting a paper document into an electronic file that improves both the conversion efficiency and the degree of conformity between the content of the electronic file and that of the paper document.
- The technical solution is a method for converting a paper document into an electronic file, comprising:
- Step 1: scanning the paper document into an electronic picture file;
- Step 2: segmenting the non-blank portion of the electronic picture file by block, so that the non-blank portion is divided into a plurality of blocks, where a block is either a row or a column;
- Step 3: dividing each block into one or more character pictures;
- Step 4: determining the positional relationship between the blocks and the positional relationship between the character pictures belonging to the same block;
- Step 5: arranging all the character pictures belonging to the same block into a new block according to the positional relationship between them;
- Step 6: arranging all the new blocks according to the positional relationship between the blocks to obtain the electronic file.
- The beneficial effects of the present invention are as follows. The paper document is scanned into an electronic picture file, and the non-blank portion of the electronic picture file is segmented by block to obtain a plurality of blocks. After each block is divided into character pictures, the character pictures are rearranged into a new block according to the positional relationship between them, and the new blocks are arranged into an electronic file according to the positional relationship between the blocks.
- The present invention therefore does not need the character recognition, searching for suspicious characters, error correction, association, and similar processing of existing OCR technology. The conversion can be carried out using only the character pictures obtained by segmenting the electronic picture file, which greatly improves conversion efficiency.
- At the same time, because the electronic file is obtained by rearranging the segmented character pictures, no recognition error is introduced, the degree of conformity between the electronic file and the paper document is greatly improved, and the character accuracy can essentially reach 100%.
- On the basis of the above technical solution, the present invention can also be improved as follows:
- After step 1 and before step 2, the method further includes step 1-2: rotating the electronic picture file so that the characters in it are upright (in the forward direction).
- In step 1-2, before rotating the electronic picture file, the method further includes: deleting stains and scratches from the electronic picture file.
- In step 1-2, before deleting the stains and scratches, the method further includes: enlarging the electronic picture file.
- In step 1-2, after rotating the electronic picture file so that its characters are upright, the method further includes: cutting off the white border portions of the electronic picture file that lie within the top, bottom, left, and right margins.
- FIG. 1 is a flow chart of the method for converting a paper document into an electronic file according to the present invention;
- FIG. 2 is a schematic diagram of an electronic picture file obtained by scanning according to the present invention;
- FIG. 3 is a schematic diagram of the electronic picture file after rotation according to the present invention;
- FIG. 4 is a schematic diagram after the white border portions within the four margins of the electronic picture file have been cut off according to the present invention;
- FIG. 5 is a schematic diagram after the non-blank portion of the electronic picture file has been segmented by row according to the present invention;
- FIG. 6 is a schematic diagram after the blocks have been divided into character pictures according to the present invention.
- The present invention proposes a method of converting a paper document into an electronic file.
- FIG. 1 is a flow chart of the method. As shown in FIG. 1, the method includes:
- Step 101: scan the paper document into an electronic picture file.
- The paper document in the present invention may be any document recorded on paper, such as a book or a picture album.
- Scanning the paper document to obtain an electronic picture file is the first step in digitizing it. This step can be done with a scanner.
- Step 102: segment the non-blank portion of the electronic picture file by block, so that the non-blank portion is divided into several blocks.
- A block in the present invention is either a row or a column.
- The electronic picture file is obtained by the scanning in step 101. The characters, figures, tables, and other content of the paper document are necessarily reflected in the electronic picture file in some form (for example, as pictures), and this content corresponds to the non-blank portion of the electronic picture file.
- Besides the non-blank portion, the electronic picture file necessarily also contains blank portions, such as the white border portions within its top, bottom, left, and right margins.
- This step segments only the non-blank portion of the electronic picture file, and the result of the segmentation is several blocks.
- The segmentation results are themselves still electronic pictures.
- For example, if the non-blank portion is segmented by row, the result is a number of rows in the form of electronic pictures.
- If the content of the non-blank portion is text, the segmentation result is an electronic picture of each line of the text. If the content is a table, the segmentation distinguishes between a table with a border and a table without one.
- A table with a border is treated as a single row, so the segmentation result is an electronic picture of the whole table. The content of a table without a border is divided into blocks by row, so the segmentation result is an electronic picture of each row of the table.
- Note that a part of the electronic picture file whose content is a picture is not split: if the content of the non-blank portion is a picture, the segmentation result is still an electronic picture of that picture.
- Segmenting the non-blank portion by column works analogously. If the content is text, the segmentation result is an electronic picture of each column of the text. If the content is a table, a table with a border is treated as a single column, so the result is an electronic picture of the whole table, while the content of a table without a border is divided into blocks by column, so the result is an electronic picture of each column of the table.
- If the content is a picture, the segmentation result is still an electronic picture of that picture, the same as when segmenting by row.
- Tables with borders are distinguished because the border lines join the table into a single whole that cannot be divided into smaller rows or columns, so such a table can only be handled as a whole (one row or one column).
- Because the blank portions of the electronic picture file do not correspond to any content of the paper document, this step does not need to process them.
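As a rough illustration of segmenting by row, the sketch below splits a binarized page into row blocks wherever a completely blank pixel row separates them. The patent does not specify a segmentation algorithm; the blank-row projection approach, the 0/1 list-of-lists image representation, and the name `split_rows` are assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: 1 = ink (non-blank), 0 = blank


def split_rows(image: Image) -> List[Tuple[int, int]]:
    """Return (top, bottom) pixel-row spans that contain ink; bottom is exclusive."""
    spans = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                    # a new block begins
        elif not has_ink and start is not None:
            spans.append((start, y))     # a blank pixel row closes the block
            start = None
    if start is not None:                # block running to the bottom edge
        spans.append((start, len(image)))
    return spans


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
]
print(split_rows(page))
```

Under this approach, a table with a border comes out as one block without special handling, because its vertical border lines leave ink in every pixel row of the table.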
- Step 103: divide each block into one or more character pictures.
- The blocks obtained in step 102 are only a preliminary segmentation of the non-blank portion of the electronic picture file.
- The amount of information in each block (that is, the content corresponding to the content of the paper document) is still large, and a block sometimes also contains a good deal of blank space.
- This step therefore divides each block further, and the results are called character pictures. Because a block is divided into one or more character pictures, in most cases each character picture contains less information than the block it belongs to. It is also possible, however, that a block is divided into a single character picture, or that all the information in the block ends up in one character picture while the others contain no information. In both of these cases, that character picture carries the same amount of information as the block it belongs to.
- The character pictures in this step are still electronic pictures, and the information they contain must not be changed.
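One possible way to divide a row block into character pictures is to cut at blank pixel columns. The sketch below is an assumption, not the patent's stated procedure; the function name `split_characters` and the `min_gap` parameter (how many consecutive blank columns are needed before a cut is made) are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: 1 = ink, 0 = blank


def split_characters(block: Image, min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (left, right) column spans of character pictures; right is exclusive."""
    width = len(block[0]) if block else 0
    ink_cols = [any(row[x] for row in block) for x in range(width)]
    spans = []
    start = None
    gap = 0
    for x, ink in enumerate(ink_cols):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:           # the blank run is wide enough to cut
                spans.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:                # last picture reaches the right edge
        spans.append((start, width - gap))
    return spans


line = [[1, 1, 0, 1, 0, 0, 1]]
print(split_characters(line, min_gap=1))
print(split_characters(line, min_gap=2))
```

A small `min_gap` tends to give one picture per glyph, while a larger value keeps closely spaced glyphs together, so a character picture may hold a whole word. The spans only index into the block; the pixels themselves are never altered.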
- Step 104: determine the positional relationship between the blocks and the positional relationship between the character pictures belonging to the same block.
- This step determines the layout of the non-blank portion of the electronic picture file. Determining the positional relationship between the blocks gives the order of the rows or of the columns. Determining the positional relationship between the character pictures belonging to the same block gives the order of the character pictures within the same row.
- Step 105: arrange all the character pictures belonging to the same block into a new block according to the positional relationship between them.
- This step rearranges the character pictures to obtain a new block. The arrangement rule is the positional relationship, determined in step 104, between the character pictures belonging to the same block.
- The content of the resulting new block is therefore the same as that of the block to which the character pictures belong. Moreover, because the arrangement involves no character recognition, no character can be misread: as long as the character pictures are arranged in the correct order, the character accuracy in each new block can reach 100%.
- Because every character picture in a new block comes from one of the blocks obtained in step 102, the new blocks and the original blocks are in one-to-one correspondence.
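Steps 104 and 105 can be pictured as sorting the character pictures of a block by position and pasting them side by side. In the sketch below, the `CharPicture` type, its `x` field (left offset within the original block), the `gap` parameter, and `rebuild_block` are all hypothetical names chosen for illustration. The fixed gap is a simplification, since the source treats spaces as character pictures in their own right.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharPicture:
    x: int                   # left offset within the original block (step 104)
    pixels: List[List[int]]  # cropped picture, 1 = ink, copied unchanged


def rebuild_block(chars: List[CharPicture], gap: int = 1) -> List[List[int]]:
    """Arrange character pictures left to right into a new block (step 105)."""
    if not chars:
        return []
    ordered = sorted(chars, key=lambda c: c.x)       # positional order
    height = max(len(c.pixels) for c in ordered)
    rows: List[List[int]] = [[] for _ in range(height)]
    for i, c in enumerate(ordered):
        if i > 0:
            for r in rows:
                r.extend([0] * gap)                  # spacing between pictures
        w = len(c.pixels[0])
        for y in range(height):
            # Pictures shorter than the block are padded with blank rows.
            rows[y].extend(c.pixels[y] if y < len(c.pixels) else [0] * w)
    return rows


chars = [CharPicture(5, [[1], [1]]), CharPicture(0, [[1, 1], [0, 1]])]
print(rebuild_block(chars))
```

Only the order and spacing of the pictures are decided here. No pixel is interpreted, which is why no recognition error can arise at this stage.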
- Step 106: arrange all the new blocks according to the positional relationship between the blocks to obtain the electronic file.
- This step rearranges the new blocks produced in step 105. The arrangement rule is the positional relationship between the blocks determined in step 104. That is, the new blocks are arranged in the order in which their corresponding blocks appear in the electronic picture file, which yields an electronic file whose layout matches that of the electronic picture file and therefore also that of the paper document.
- It can be seen that, in the present invention, the paper document is scanned into an electronic picture file, and the non-blank portion of the electronic picture file is segmented by block to obtain a plurality of blocks. After each block is divided into character pictures, the character pictures are rearranged into a new block according to the positional relationship between them, and the new blocks are arranged into an electronic file according to the positional relationship between the blocks. The present invention therefore does not need the character recognition, searching for suspicious characters, error correction, association, and similar processing of existing OCR technology. The conversion can be carried out using only the character pictures obtained by segmenting the electronic picture file, which greatly improves conversion efficiency. At the same time, because the electronic file is obtained by rearranging the segmented character pictures, no recognition error is introduced, the degree of conformity between the electronic file and the paper document is greatly improved, and the character accuracy can essentially reach 100%.
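The final arrangement step can be sketched in the same spirit: new blocks are ordered by the position of the block they came from and stacked into a page. The `(top, pixels)` tuple layout, the `line_gap` parameter, and the name `assemble_page` are assumptions for this example, and it covers only row blocks.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: 1 = ink, 0 = blank


def assemble_page(blocks: List[Tuple[int, Image]], line_gap: int = 1) -> Image:
    """Stack new row blocks in the order of their original top offsets (step 106)."""
    if not blocks:
        return []
    ordered = sorted(blocks, key=lambda b: b[0])     # order fixed in step 104
    width = max(len(pixels[0]) for _, pixels in ordered)
    page: Image = []
    for i, (_, pixels) in enumerate(ordered):
        if i > 0:
            page.extend([0] * width for _ in range(line_gap))
        for row in pixels:
            page.append(row + [0] * (width - len(row)))  # pad to page width
    return page


blocks = [(10, [[1, 1]]), (2, [[1]])]
print(assemble_page(blocks))
```

A fuller version would presumably also carry each block's left offset so that indentation survives; that detail is left out to keep the sketch short.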
- After step 101 and before step 102, the method may also include step 101-102: rotating the electronic picture file so that the characters in it are in the forward direction.
- In step 101-102, "characters in the forward direction" means that, if the electronic picture file is displayed on a screen, each character appears at exactly its standard angle. For example, the standard angle of the digit "1" is parallel to the left and right edges of the screen or page. In the scanning of step 101, however, the paper document is often not placed squarely, so the scanned electronic picture file is rotated by some angle, and the digit "1" in it is no longer at its standard angle but forms an angle with the left and right edges of the electronic picture file (or screen). The electronic picture file therefore needs to be rotated before step 102 so that its characters are in the forward direction, which improves the accuracy of the segmentation in steps 102 and 103.
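The source does not say how the rotation angle is found. One common heuristic, shown here purely as an assumption, is to try candidate angles and keep the one whose horizontal projection profile is sharpest, since level text lines concentrate ink into few pixel rows. The names `ink_points` and `estimate_skew` and the one-degree candidate grid are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def ink_points(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Collect (x, y) coordinates of ink pixels (1 = ink)."""
    return [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]


def estimate_skew(points: Iterable[Tuple[float, float]],
                  candidates: Iterable[int] = range(-10, 11)) -> int:
    """Return the candidate angle in degrees whose projection profile is sharpest.

    Angles are measured in image coordinates, where y grows downward.
    """
    pts = list(points)
    best_angle, best_score = 0, -1
    for deg in candidates:
        a = math.radians(deg)
        cos_a, sin_a = math.cos(a), math.sin(a)
        # Project every ink point onto the axis perpendicular to the candidate line direction.
        profile = Counter(round(y * cos_a - x * sin_a) for x, y in pts)
        score = sum(n * n for n in profile.values())  # peaked profile scores high
        if score > best_score:
            best_angle, best_score = deg, score
    return best_angle


# Synthetic text line with slope tan(5 degrees), approximately 0.0874886635.
skewed_line = [(x, 20 + x * 0.0874886635) for x in range(100)]
print(estimate_skew(skewed_line))
```

Rotating the picture by the opposite of the estimated angle would then level the lines. The rotation itself would normally be delegated to an imaging library and is not shown.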
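- In step 101-102, before rotating the electronic picture file, the method may further include: deleting stains and scratches from the electronic picture file.
- This reduces or eliminates the influence of noise such as stains and scratches on the correctness of the conversion, and can save conversion time and improve conversion efficiency.
- Further, in step 101-102, before deleting the stains and scratches, the method may further include: enlarging the electronic picture file.
- Enlarging the electronic picture file makes stains and scratches easier to identify and improves the accuracy of that judgment.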
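How stains are detected is not described in the source. A simple stand-in, offered only as an assumption, is to erase connected ink regions smaller than a size threshold. The name `remove_specks`, the 4-connectivity rule, and the `min_size` parameter are hypothetical, and a scratch longer than the threshold would not be caught by this rule.

```python
from typing import List


def remove_specks(image: List[List[int]], min_size: int = 3) -> List[List[int]]:
    """Erase 4-connected ink components with fewer than min_size pixels."""
    h = len(image)
    w = len(image[0]) if h else 0
    out = [row[:] for row in image]              # leave the input untouched
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not out[y][x] or seen[y][x]:
                continue
            # Flood fill to collect one connected component.
            stack = [(x, y)]
            seen[y][x] = True
            component = []
            while stack:
                cx, cy = stack.pop()
                component.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and out[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            if len(component) < min_size:        # too small to be a glyph
                for cx, cy in component:
                    out[cy][cx] = 0
    return out


noisy = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(remove_specks(noisy))
```

A pixel-count threshold of this kind depends on resolution, and the threshold has to be chosen carefully so that small marks such as dots and punctuation are not erased along with the noise.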
- In addition, in step 101-102, after rotating the electronic picture file so that its characters are in the forward direction, the method may further include: cutting off the white border portions of the electronic picture file that lie within the top, bottom, left, and right margins.
- Cutting off these white border portions reduces the page area of the electronic picture file, reduces the workload of subsequent steps, and improves conversion efficiency and accuracy.
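Cutting off the white borders amounts to cropping the picture to the bounding box of its ink. The sketch below assumes the picture has already been binarized and cleaned; `crop_margins` is a hypothetical name.

```python
from typing import List


def crop_margins(image: List[List[int]]) -> List[List[int]]:
    """Crop a 0/1 image to the bounding box of its ink pixels (1 = ink)."""
    rows = [y for y, row in enumerate(image) if any(row)]
    if not rows:
        return []                                # entirely blank page
    cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    return [row[left:right] for row in image[top:bottom]]


scan = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(crop_margins(scan))
```

Doing this after noise removal matters in practice: a single stray speck near the edge would otherwise stretch the bounding box and leave most of the margin in place.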
- FIG. 2 is a schematic diagram of an electronic picture file obtained by scanning according to the present invention.
- At a glance, the content shown in FIG. 2 is rotated clockwise by some angle compared with the content of the paper document before scanning.
- The four black lines at the top, bottom, left, and right of the figure mark the boundary of the electronic picture file and have no other meaning.
- The black lines in FIGS. 3-6 have the same meaning.
- FIGS. 3-6 are schematic diagrams of the electronic picture file of FIG. 2 after some of the operations of the present invention have been performed on it.
- FIG. 3 is a schematic diagram of the electronic picture file after rotation. As shown in FIG. 3, the entire electronic picture file has been rotated counterclockwise by some angle relative to FIG. 2, so that the picture at the top (a black-background picture bearing the "Foxit Software" text and icon and the "Company Brochure" text) and the text below it are all in their forward directions.
- In FIG. 3, reference numeral 301 indicates the white border portion within the left margin of the electronic picture file, reference numeral 302 the white border portion within the right margin, reference numeral 303 the white border portion within the top margin, and reference numeral 304 the white border portion within the bottom margin.
- Cutting off the white border portions within these four margins yields the schematic diagram shown in FIG. 4.
- Segmenting the non-blank portion of the electronic picture file by row then yields FIG. 5, and further dividing the rows in FIG. 5 (including the picture at the top) as described in step 103 yields FIG. 6.
- As FIG. 6 shows, a character picture may contain only one character: for example, "Company Brochure" is divided into 15 letters and several spaces. The letters and spaces still exist as electronic pictures.
- A character picture in FIG. 6 may also contain multiple characters, such as the words "Solution" and "details".
- The picture at the top is still a single character picture in FIG. 6.
- It can be seen that the present invention has the following advantages:
- (1) The paper document is scanned into an electronic picture file, and the non-blank portion of the electronic picture file is segmented by block to obtain a plurality of blocks. After each block is divided into character pictures, the character pictures are rearranged into a new block according to the positional relationship between them, and the new blocks are arranged into an electronic file according to the positional relationship between the blocks. The present invention therefore does not need the character recognition, searching for suspicious characters, error correction, association, and similar processing of existing OCR technology. The conversion can be carried out using only the character pictures obtained by segmenting the electronic picture file, which greatly improves conversion efficiency. At the same time, because the electronic file is obtained by rearranging the segmented character pictures, no recognition error is introduced, the degree of conformity between the electronic file and the paper document is greatly improved, and the character accuracy can essentially reach 100%.
- (2) Before the electronic picture file is segmented, it is rotated so that its characters are in the forward direction, which improves the accuracy of the segmentation steps.
- (3) Before the electronic picture file is rotated, stains and scratches are deleted from it, which reduces or eliminates the influence of such noise on the correctness of the conversion, saves conversion time, and improves conversion efficiency.
- (4) Cutting off the white border portions of the electronic picture file within the top, bottom, left, and right margins reduces the page area of the electronic picture file, reduces the workload of subsequent steps, and improves conversion efficiency and accuracy.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Geometry (AREA)
- Computer Graphics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Character Input (AREA)
- Processing Or Creating Images (AREA)
- Editing Of Facsimile Originals (AREA)
Abstract
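A method for converting a paper document into an electronic file. The method includes: step 1: scanning the paper document into an electronic picture file; step 2: segmenting the non-blank portion of the electronic picture file by block, so that the non-blank portion is divided into several blocks, where a block is either a row or a column; step 3: dividing each block into one or more character pictures; step 4: determining the positional relationship between the blocks and the positional relationship between the character pictures belonging to the same block; step 5: arranging all the character pictures belonging to the same block into a new block according to the positional relationship between them; step 6: arranging all the new blocks according to the positional relationship between the blocks to obtain the electronic file. The method improves both the conversion efficiency and the degree of conformity between the content of the electronic file and that of the paper document.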
Description
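Method for converting a paper document into an electronic file
Technical field
The present invention relates to the technical field of converting paper documents into electronic files, and more particularly to a method of converting a paper document into an electronic file.
Background
The emergence of technologies such as tablet computers and e-paper readers has gradually shifted reading from paper documents to electronic files. The existing body of paper documents is vast, however, so a technology for converting paper documents into electronic files is needed to meet readers' needs.
A common technique for converting paper documents into electronic files is OCR (Optical Character Recognition). The specific process is as follows: the paper document is scanned into an electronic picture file; the electronic picture file is divided into multiple character pictures, each containing only one character; the character in each character picture is recognized one by one, with error-correction and association functions used to reduce the error rate; and the recognition results are output in order to obtain the final electronic file.
The core of OCR is recognizing the character pictures one by one, judging each by its outline. Because many characters have similar outlines, the recognition accuracy is not high, and the resulting electronic file is not very accurate. To improve accuracy, OCR spends a large amount of time on character recognition, searching for suspicious characters, error correction, and similar processing, so its efficiency is also low.
Summary of the invention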
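The technical problem to be solved by the present invention is to provide a method for converting a paper document into an electronic file that improves both the conversion efficiency and the degree of conformity between the content of the electronic file and that of the paper document.
The technical solution of the present invention to the above problem is as follows: a method for converting a paper document into an electronic file, the method comprising:
Step 1: scanning the paper document into an electronic picture file;
Step 2: segmenting the non-blank portion of the electronic picture file by block, so that the non-blank portion is divided into a plurality of blocks, where a block is either a row or a column;
Step 3: dividing each block into one or more character pictures;
Step 4: determining the positional relationship between the blocks and the positional relationship between the character pictures belonging to the same block;
Step 5: arranging all the character pictures belonging to the same block into a new block according to the positional relationship between them;
Step 6: arranging all the new blocks according to the positional relationship between the blocks to obtain the electronic file.
The beneficial effects of the present invention are as follows. The paper document is scanned into an electronic picture file, and the non-blank portion of the electronic picture file is segmented by block to obtain a plurality of blocks. After each block is divided into character pictures, the character pictures are rearranged into a new block according to the positional relationship between them, and the new blocks are arranged into an electronic file according to the positional relationship between the blocks. The present invention therefore does not need the character recognition, searching for suspicious characters, error correction, association, and similar processing of existing OCR technology. The conversion can be carried out using only the character pictures obtained by segmenting the electronic picture file, which greatly improves conversion efficiency. At the same time, because the electronic file is obtained by rearranging the segmented character pictures, no recognition error is introduced, the degree of conformity between the electronic file and the paper document is greatly improved, and the character accuracy can essentially reach 100%.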
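On the basis of the above technical solution, the present invention can also be improved as follows:
Further, after step 1 and before step 2, the method also includes step 1-2: rotating the electronic picture file so that the characters in it are in the forward direction.
Further, in step 1-2, before rotating the electronic picture file, the method also includes: deleting stains and scratches from the electronic picture file.
Further, in step 1-2, before deleting the stains and scratches from the electronic picture file, the method also includes: enlarging the electronic picture file.
Further, in step 1-2, after rotating the electronic picture file so that its characters are in the forward direction, the method also includes: cutting off the white border portions of the electronic picture file that lie within the top, bottom, left, and right margins.
Brief description of the drawings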
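FIG. 1 is a flow chart of the method for converting a paper document into an electronic file proposed by the present invention;
FIG. 2 is a schematic diagram of an electronic picture file obtained by scanning according to the present invention;
FIG. 3 is a schematic diagram of the electronic picture file after rotation according to the present invention;
FIG. 4 is a schematic diagram after the white border portions within the four margins of the electronic picture file have been cut off according to the present invention;
FIG. 5 is a schematic diagram after the non-blank portion of the electronic picture file has been segmented by row according to the present invention;
FIG. 6 is a schematic diagram after the blocks have been divided into character pictures according to the present invention.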
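Detailed description
The principles and features of the present invention are described below with reference to the drawings. The examples given serve only to explain the present invention and are not intended to limit its scope.
The present invention proposes a method of converting a paper document into an electronic file, and FIG. 1 is a flow chart of the method. As shown in FIG. 1, the method includes:
Step 101: scan the paper document into an electronic picture file.
The paper document in the present invention may be any document recorded on paper, such as a book or a picture album.
Scanning the paper document to obtain an electronic picture file is the first step in digitizing it. This step can be done with a scanner.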
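Step 102: segment the non-blank portion of the electronic picture file by block, so that the non-blank portion is divided into several blocks.
A block in the present invention is either a row or a column.
The electronic picture file is obtained by the scanning in step 101. The characters, figures, tables, and other content of the paper document are necessarily reflected in the electronic picture file in some form (for example, as pictures), and this content corresponds to the non-blank portion of the electronic picture file. Besides the non-blank portion, the electronic picture file necessarily also contains blank portions, such as the white border portions within its top, bottom, left, and right margins.
This step segments only the non-blank portion of the electronic picture file, and the result of the segmentation is several blocks. The segmentation results are themselves still electronic pictures. For example, if the non-blank portion is segmented by row, the result is a number of rows in the form of electronic pictures. Further, if the content of the non-blank portion is text, the segmentation result is an electronic picture of each line of the text. If the content is a table, the segmentation distinguishes between a table with a border and a table without one: a table with a border is treated as a single row, so the segmentation result is an electronic picture of the whole table, while the content of a table without a border is divided into blocks by row, so the segmentation result is an electronic picture of each row of the table. Note that a part of the electronic picture file whose content is a picture is not split: if the content of the non-blank portion is a picture, the segmentation result is still an electronic picture of that picture.
Segmenting the non-blank portion by column works analogously. If the content is text, the segmentation result is an electronic picture of each column of the text. If the content is a table, a table with a border is treated as a single column, so the result is an electronic picture of the whole table, while the content of a table without a border is divided into blocks by column, so the result is an electronic picture of each column of the table. If the content is a picture, the segmentation result is still an electronic picture of that picture, the same as when segmenting by row. Tables with borders are distinguished because the border lines join the table into a single whole that cannot be divided into smaller rows or columns, so such a table can only be handled as a whole (one row or one column).
Because the blank portions of the electronic picture file do not correspond to any content of the paper document, this step does not need to process them.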
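Step 103: divide each block into one or more character pictures.
The blocks obtained in step 102 are only a preliminary segmentation of the non-blank portion of the electronic picture file. In fact, the amount of information in each block (that is, the content corresponding to the content of the paper document) is still large, and a block sometimes also contains a good deal of blank space. This step therefore divides each block further, and the results are called character pictures. Because a block is divided into one or more character pictures, in most cases each character picture contains less information than the block it belongs to. It is also possible, however, that a block is divided into a single character picture, or that all the information in the block ends up in one character picture while the others contain no information. In both of these cases, that character picture carries the same amount of information as the block it belongs to.
The character pictures in this step are still electronic pictures, and the information they contain must not be changed.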
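Step 104: determine the positional relationship between the blocks and the positional relationship between the character pictures belonging to the same block.
This step determines the layout of the non-blank portion of the electronic picture file. Determining the positional relationship between the blocks gives the order of the rows or of the columns. Determining the positional relationship between the character pictures belonging to the same block gives the order of the character pictures within the same row.
Step 105: arrange all the character pictures belonging to the same block into a new block according to the positional relationship between them.
This step rearranges the character pictures to obtain a new block. The arrangement rule is the positional relationship, determined in step 104, between the character pictures belonging to the same block. The content of the resulting new block is therefore the same as that of the block to which the character pictures belong. Moreover, because the arrangement involves no character recognition, no character can be misread: as long as the character pictures are arranged in the correct order, the character accuracy in each new block can reach 100%.
Because every character picture in a new block comes from one of the blocks obtained in step 102, the new blocks and the original blocks are in one-to-one correspondence.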
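Step 106: arrange all the new blocks according to the positional relationship between the blocks to obtain the electronic file.
This step rearranges the new blocks produced in step 105. The arrangement rule is the positional relationship between the blocks determined in step 104. That is, the new blocks are arranged in the order in which their corresponding blocks appear in the electronic picture file, which yields an electronic file whose layout matches that of the electronic picture file and therefore also that of the paper document.
It can be seen that, in the present invention, the paper document is scanned into an electronic picture file, and the non-blank portion of the electronic picture file is segmented by block to obtain a plurality of blocks. After each block is divided into character pictures, the character pictures are rearranged into a new block according to the positional relationship between them, and the new blocks are arranged into an electronic file according to the positional relationship between the blocks. The present invention therefore does not need the character recognition, searching for suspicious characters, error correction, association, and similar processing of existing OCR technology. The conversion can be carried out using only the character pictures obtained by segmenting the electronic picture file, which greatly improves conversion efficiency. At the same time, because the electronic file is obtained by rearranging the segmented character pictures, no recognition error is introduced, the degree of conformity between the electronic file and the paper document is greatly improved, and the character accuracy can essentially reach 100%.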
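After step 101 and before step 102, the method may also include step 101-102: rotating the electronic picture file so that the characters in it are in the forward direction.
In step 101-102, "characters in the forward direction" means that, if the electronic picture file is displayed on a screen, each character appears at exactly its standard angle. For example, the standard angle of the digit "1" is parallel to the left and right edges of the screen or page. In the scanning of step 101, however, the paper document is often not placed squarely, so the scanned electronic picture file is rotated by some angle, and the digit "1" in it is no longer at its standard angle but forms an angle with the left and right edges of the electronic picture file (or screen). The electronic picture file therefore needs to be rotated before step 102 so that its characters are in the forward direction, which improves the accuracy of the segmentation in steps 102 and 103.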
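In step 101-102, before rotating the electronic picture file, the method may further include: deleting stains and scratches from the electronic picture file.
This reduces or eliminates the influence of noise such as stains and scratches on the correctness of the conversion, and can save conversion time and improve conversion efficiency.
Further, in step 101-102, before deleting the stains and scratches from the electronic picture file, the method may further include: enlarging the electronic picture file.
Enlarging the electronic picture file makes stains and scratches easier to identify and improves the accuracy of that judgment.
In addition, in step 101-102, after rotating the electronic picture file so that its characters are in the forward direction, the method may further include: cutting off the white border portions of the electronic picture file that lie within the top, bottom, left, and right margins.
Cutting off these white border portions reduces the page area of the electronic picture file, reduces the workload of subsequent steps, and improves conversion efficiency and accuracy.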
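FIG. 2 is a schematic diagram of an electronic picture file obtained by scanning according to the present invention. At a glance, the content shown in FIG. 2 is rotated clockwise by some angle compared with the content of the paper document before scanning. The four black lines at the top, bottom, left, and right of the figure mark the boundary of the electronic picture file and have no other meaning. The black lines in FIGS. 3-6 have the same meaning.
FIGS. 3-6 are schematic diagrams of the electronic picture file of FIG. 2 after some of the operations of the present invention have been performed on it. FIG. 3 is a schematic diagram of the electronic picture file after rotation. As shown in FIG. 3, the entire electronic picture file has been rotated counterclockwise by some angle relative to FIG. 2, so that the picture at the top (a black-background picture bearing the "Foxit Software" text and icon and the "Company Brochure" text) and the text below it are all in their forward directions. In FIG. 3, reference numeral 301 indicates the white border portion within the left margin of the electronic picture file, reference numeral 302 the white border portion within the right margin, reference numeral 303 the white border portion within the top margin, and reference numeral 304 the white border portion within the bottom margin. Cutting off the white border portions within these four margins yields the schematic diagram shown in FIG. 4. Segmenting the non-blank portion of the electronic picture file by row then yields FIG. 5, and further dividing the rows in FIG. 5 (including the picture at the top) as described in step 103 yields FIG. 6. As FIG. 6 shows, a character picture may contain only one character: for example, "Company Brochure" is divided into 15 letters and several spaces, and the letters and spaces still exist as electronic pictures. A character picture in FIG. 6 may also contain multiple characters, such as the words "Solution" and "details". The picture at the top is still a single character picture in FIG. 6.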
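It can be seen that the present invention has the following advantages:
(1) The paper document is scanned into an electronic picture file, and the non-blank portion of the electronic picture file is segmented by block to obtain a plurality of blocks. After each block is divided into character pictures, the character pictures are rearranged into a new block according to the positional relationship between them, and the new blocks are arranged into an electronic file according to the positional relationship between the blocks. The present invention therefore does not need the character recognition, searching for suspicious characters, error correction, association, and similar processing of existing OCR technology. The conversion can be carried out using only the character pictures obtained by segmenting the electronic picture file, which greatly improves conversion efficiency. At the same time, because the electronic file is obtained by rearranging the segmented character pictures, no recognition error is introduced, the degree of conformity between the electronic file and the paper document is greatly improved, and the character accuracy can essentially reach 100%.
(2) Before the electronic picture file is segmented, it is rotated so that its characters are in the forward direction, which improves the accuracy of the segmentation steps.
(3) Before the electronic picture file is rotated, stains and scratches are deleted from it, which reduces or eliminates the influence of such noise on the correctness of the conversion, saves conversion time, and improves conversion efficiency.
(4) Cutting off the white border portions of the electronic picture file within the top, bottom, left, and right margins reduces the page area of the electronic picture file, reduces the workload of subsequent steps, and improves conversion efficiency and accuracy.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.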
Claims
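1. A method for converting a paper document into an electronic file, characterized in that the method comprises:
Step 1: scanning the paper document into an electronic picture file;
Step 2: segmenting the non-blank portion of the electronic picture file by block, so that the non-blank portion is divided into a plurality of blocks, where a block is either a row or a column;
Step 3: dividing each block into one or more character pictures;
Step 4: determining the positional relationship between the blocks and the positional relationship between the character pictures belonging to the same block;
Step 5: arranging all the character pictures belonging to the same block into a new block according to the positional relationship between them;
Step 6: arranging all the new blocks according to the positional relationship between the blocks to obtain the electronic file.
2. The method according to claim 1, characterized in that, after step 1 and before step 2, the method further comprises step 1-2: rotating the electronic picture file so that the characters in it are in the forward direction.
3. The method according to claim 2, characterized in that, in step 1-2, before rotating the electronic picture file, the method further comprises: deleting stains and scratches from the electronic picture file.
4. The method according to claim 3, characterized in that, in step 1-2, before deleting the stains and scratches from the electronic picture file, the method further comprises: enlarging the electronic picture file.
5. The method according to claim 2, characterized in that, in step 1-2, after rotating the electronic picture file so that its characters are in the forward direction, the method further comprises: cutting off the white border portions of the electronic picture file that lie within the top, bottom, left, and right margins.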
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/910,011 US20160180164A1 (en) | 2013-08-12 | 2014-07-22 | Method for converting paper file into electronic file |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310349738.4A CN104376317B (zh) | 2013-08-12 | 2013-08-12 | Method for converting paper file into electronic file |
CN201310349738.4 | 2013-08-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015021737A1 (zh) | 2015-02-19 |
Family
ID=52467984
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/000694 WO2015021737A1 (zh) | Method for converting paper file into electronic file | 2013-08-12 | 2014-07-22 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20160180164A1 (zh) |
CN (1) | CN104376317B (zh) |
WO (1) | WO2015021737A1 (zh) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107145859A (zh) * | 2017-05-04 | 2017-09-08 | 北京小米移动软件有限公司 | E-book conversion processing method and device, and computer-readable storage medium |
CN108909290A (zh) * | 2018-08-03 | 2018-11-30 | 安徽赛福贝特信息技术有限公司 | Big-data intelligent tracing device |
US11087448B2 (en) * | 2019-05-30 | 2021-08-10 | Kyocera Document Solutions Inc. | Apparatus, method, and non-transitory recording medium for a document fold determination based on the change point block detection |
CN110188745A (zh) * | 2019-05-30 | 2019-08-30 | 北京爱尖子教育科技有限责任公司 | Method and system for online coding of teaching content |
CN111310747A (zh) * | 2020-02-12 | 2020-06-19 | 北京小米移动软件有限公司 | Information processing method, information processing device, and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040181754A1 (en) * | 2003-03-12 | 2004-09-16 | Kremer Karl Heinz | Manual and automatic alignment of pages |
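CN103218351A (zh) * | 2013-03-15 | 2013-07-24 | 杭州中元数据科技有限公司 | Method for producing electronic books of modern local documents |
CN103679640A (zh) * | 2012-09-24 | 2014-03-26 | 福州福昕软件开发有限公司北京分公司 | Method for improving the clarity of a PDF file converted from a paper document |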
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2579397B2 (ja) * | 1991-12-18 | 1997-02-05 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Method and apparatus for creating a layout model of a document image |
US5335290A (en) * | 1992-04-06 | 1994-08-02 | Ricoh Corporation | Segmentation of text, picture and lines of a document image |
US5852676A (en) * | 1995-04-11 | 1998-12-22 | Teraform Inc. | Method and apparatus for locating and identifying fields within a document |
EP1173003B1 (en) * | 2000-07-12 | 2009-03-25 | Canon Kabushiki Kaisha | Image processing method and image processing apparatus |
US7475061B2 (en) * | 2004-01-15 | 2009-01-06 | Microsoft Corporation | Image-based document indexing and retrieval |
JP4856925B2 (ja) * | 2005-10-07 | 2012-01-18 | 株式会社リコー | Image processing apparatus, image processing method, and image processing program |
US20070168382A1 (en) * | 2006-01-03 | 2007-07-19 | Michael Tillberg | Document analysis system for integration of paper records into a searchable electronic database |
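JP5036430B2 (ja) * | 2007-07-10 | 2012-09-26 | キヤノン株式会社 | Image processing apparatus and control method therefor |
JP4758502B2 (ja) * | 2008-12-10 | 2011-08-31 | シャープ株式会社 | Image processing apparatus, image reading apparatus, image transmitting apparatus, image forming apparatus, image processing method, program, and recording medium therefor |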
US20100208282A1 (en) * | 2009-02-18 | 2010-08-19 | Andrey Isaev | Method and apparatus for improving the quality of document images when copying documents |
US8000528B2 (en) * | 2009-12-29 | 2011-08-16 | Konica Minolta Systems Laboratory, Inc. | Method and apparatus for authenticating printed documents using multi-level image comparison based on document characteristics |
CN102243621A (zh) * | 2010-05-11 | 2011-11-16 | 项洁 | Movable-type typesetting method for image text files |
CN102456136B (zh) * | 2010-10-29 | 2013-06-05 | 方正国际软件(北京)有限公司 | Image-text segmentation method and system |
CN102467653A (zh) * | 2010-10-29 | 2012-05-23 | 方正国际软件(北京)有限公司 | Image-text recognition method and system |
CN103186911B (zh) * | 2011-12-28 | 2015-07-15 | 北大方正集团有限公司 | Method and device for processing scanned book data |
CN102930267B (zh) * | 2012-11-16 | 2015-09-23 | 上海合合信息科技发展有限公司 | Segmentation method for scanned card images |
US20160188541A1 (en) * | 2013-06-18 | 2016-06-30 | ABBYY Development, LLC | Methods and systems that convert document images to electronic documents using a trie data structure containing standard feature symbols to identify morphemes and words in the document images |
- 2013-08-12: CN application CN201310349738.4A, publication CN104376317B (zh), status: Active
- 2014-07-22: US application US14/910,011, publication US20160180164A1 (en), status: Abandoned
- 2014-07-22: WO application PCT/CN2014/000694, publication WO2015021737A1 (zh), status: Application Filing
Also Published As
Publication number | Publication date |
---|---|
US20160180164A1 (en) | 2016-06-23 |
CN104376317A (zh) | 2015-02-25 |
CN104376317B (zh) | 2018-12-14 |
Similar Documents
Publication | Title |
---|---|
US20210256253A1 (en) | Method and apparatus of image-to-document conversion based on ocr, device, and readable storage medium |
US8855413B2 (en) | Image reflow at word boundaries |
WO2015021737A1 (zh) | Method for converting a paper file into an electronic file |
US9241084B2 (en) | Scanning implemented software for time economy without rescanning (S.I.S.T.E.R.) identifying multiple documents with first scanning pass and generating multiple images with second scanning pass |
US9477898B2 (en) | Straightening out distorted perspective on images |
US8675260B2 (en) | Image processing method and apparatus, and document management server, performing character recognition on a difference image |
US8249356B1 (en) | Physical page layout analysis via tab-stop detection for optical character recognition |
WO2018233055A1 (zh) | Method and apparatus for entering insurance policy information, computer device, and storage medium |
US8995768B2 (en) | Methods and devices for processing scanned book's data |
WO2018233171A1 (zh) | Method and apparatus for entering document information, computer device, and storage medium |
RU2579899C1 (ru) | Document processing using multiple processing threads |
WO2014086287A1 (zh) | Method and device for automatic segmentation of text images, and method for automatically segmenting handwritten entries |
JP2002056398A (ja) | Document image processing apparatus, document image processing method, and storage medium |
US8208726B2 (en) | Method and system for optical character recognition using image clustering |
WO2020233023A1 (zh) | PSD file editing method implemented based on layering technology, and electronic device |
CN112949471A (zh) | Method and system for recognizing and reproducing electronic official documents based on a domestic CPU |
US8068261B2 (en) | Image reading apparatus, image reading method, and image reading program |
JP2008054147A (ja) | Image processing apparatus and image processing program |
US20080266606A1 (en) | Optimized print layout |
JP4983464B2 (ja) | Form image processing apparatus and form image processing program |
CN112800824A (zh) | Method, apparatus, and device for processing scanned files, and storage medium |
JP2002203206A (ja) | Document format identification apparatus and identification method |
CN101686305A (zh) | Image processing apparatus, image processing method, and computer-readable medium |
US9402014B2 (en) | Method for improving clarity of PDF file converted from paper file |
US20060204141A1 (en) | Method and system of converting film images to digital format for viewing |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 14836600; Country of ref document: EP; Kind code of ref document: A1 |
WWE | WIPO information: entry into national phase | Ref document number: 14910011; Country of ref document: US |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 14836600; Country of ref document: EP; Kind code of ref document: A1 |