CN117236291A

CN117236291A - Method and system for rapidly converting scanned file into vector layout file

Info

Publication number: CN117236291A
Application number: CN202311523381.7A
Authority: CN
Inventors: 李超; 朱静宇; 赵云; 张伟; 庄玉龙; 陆猛
Original assignee: Beijing Dianju Information Technology Co ltd
Current assignee: Beijing Dianju Information Technology Co ltd
Priority date: 2023-11-16
Filing date: 2023-11-16
Publication date: 2023-12-15
Anticipated expiration: 2043-11-16
Also published as: CN117236291B

Abstract

The invention relates to the technical field of data compression, and in particular to a method and a system for rapidly converting a scanned file into a vector layout file. The method comprises the following steps: obtaining a two-dimensional matrix of a vector layout file; acquiring target subregions according to the two-dimensional matrix of the vector layout file; acquiring all the position information groups in each target subregion and the starting data point of each position information group; acquiring position information sets according to all the position information groups in each target subregion; calculating the preference degree of each position information set; acquiring a compression starting point of each target subregion according to the preference degree of the position information sets and the starting data point of each position information group; and obtaining a compression result of the vector layout file according to the compression starting point of each target subregion. By measuring the repetitiveness of the position information data in the vector layout file, the method improves the compression of the vector layout file.

Description

Method and system for rapidly converting scanned file into vector layout file

Technical Field

The invention relates to the technical field of data compression, and in particular to a method and a system for rapidly converting a scanned file into a vector layout file.

Background

A layout file is a file in which all the text data of a document are combined and arranged in a fixed format. For example, publicly issued red-header documents generally contain a large amount of text and formatting information. If such documents are published as scanned files, noise and distortion appear when they are enlarged, and a complete conversion from paper document to electronic document is not achieved. Therefore, the layout file needs to be vectorized before it is stored and transmitted.

When vector data are stored, the coordinate position information of every vector element must be stored to describe its position and shape, so the vector layout file can become very large. An oversized vector file occupies a large amount of storage space. The coordinate position information of the vector file therefore needs to be compressed to save storage space, reduce transmission bandwidth, and improve transmission efficiency.

Disclosure of Invention

The invention provides a method and a system for rapidly converting a scanned file into a vector layout file, which address the following existing problem: a vector layout file occupies an excessive amount of data, which hinders the storage and transmission of vector data.

The method and system for rapidly converting a scanned file into a vector layout file adopt the following technical solution:

One embodiment of the invention provides a method for rapidly converting a scanned file into a vector layout file, which comprises the following steps:

acquiring a two-dimensional matrix of a vector layout file and target data points;

acquiring a target subarea according to the two-dimensional matrix of the vector layout file; acquiring position information data from any target data point to another target data point in the target subregion;

acquiring all the position information groups in each target subarea and the initial data point of each position information group according to the position information data from any target data point to another target data point in the target subarea; acquiring a position information set according to all the position information groups in each target subarea; calculating the preference degree of the position information set;

acquiring a compression starting point of each target subregion according to the preference degree of the position information set and the starting data point of each position information group;

and obtaining a compression result of the vector layout file according to the compression starting point of each target subarea.

Preferably, the method for obtaining the two-dimensional matrix of the vector layout file and the target data point includes the following specific steps:

scanning a paper document with a document scanner to obtain a scanned file matrix, and acquiring the specific positions of the text in the scanned file using optical character recognition; setting the data value of each data point at a text position in the scanned file matrix to 1 and recording it as a target data point; setting the data value of each data point not at a text position in the scanned file matrix to 0 and recording it as a blank data point; thereby obtaining the two-dimensional matrix of the vector layout file.

Preferably, the method for obtaining the target sub-region according to the two-dimensional matrix of the vector layout file includes the following specific steps:

in the two-dimensional matrix of the vector layout file, if two target data points are adjacent, the two target data points are classified into the same target subarea, and a plurality of target subareas are obtained.

Preferably, the method for acquiring the position information data from any target data point to another target data point in the target subregion includes the following specific steps:

the horizontal rightward direction is recorded as a reference direction; acquiring an included angle between a ray from any target data point in each target subarea to another target data point and a reference direction, and taking the included angle as a direction angle from any target data point in each target subarea to another target data point;

acquiring Euclidean distance from any target data point to another target data point in each target subregion, and obtaining the distance from any target data point to another target data point in the target subregion;

the direction angle and distance of any target data point to another target data point are recorded as the position information data of any target data point to another target data point.

Preferably, the method for acquiring all the position information groups in each target subarea and the initial data point of each position information group includes the following specific steps:

for the $i$-th target subregion, acquiring the position information data from the first target data point to all other target data points, grouping these position information data, and recording the group as the first position information group in the $i$-th target subregion; and recording the first target data point as the starting data point of the first position information group in the $i$-th target subregion;

acquiring the position information data from the second target data point to all other target data points, grouping these position information data, and recording the group as the second position information group in the $i$-th target subregion; and recording the second target data point as the starting data point of the second position information group in the $i$-th target subregion;

and so on, until the position information data from the last target data point to all other target data points are acquired, grouped, and recorded as the last position information group in the $i$-th target subregion; and recording the last target data point as the starting data point of the last position information group in the $i$-th target subregion;

all the position information groups in each target subregion, together with the starting data point of each position information group, are thereby obtained.

Preferably, acquiring a position information set according to all the position information groups in each target subregion and calculating the preference degree of the position information set comprises the following specific steps:

randomly selecting one position information group from each target subregion and placing the position information data of the selected groups into the same position information set; treating position information data in the set that are exactly identical as one kind of position information data; for each position information set, counting the frequency of occurrence of each kind of position information data in the set and the number of position information data in the set; and obtaining the preference degree of the position information set from the number of kinds of position information data in the set, the frequency of occurrence of each kind in the set, and the number of position information data in the set.

Preferably, the preference degree of the position information set is obtained with the following calculation formula:

where $Y_j$ denotes the preference degree of the $j$-th position information set; $N_j$ denotes the number of position information data in the $j$-th position information set; $p_{j,k}$ denotes the frequency of occurrence of the $k$-th kind of position information data in the $j$-th position information set; $K_j$ denotes the number of kinds of position information data in the $j$-th position information set; $\exp(\cdot)$ denotes the exponential function with the natural constant as its base; and $\log_2(\cdot)$ denotes the base-2 logarithm.

Preferably, obtaining the compression starting point of each target subregion according to the preference degree of the position information set and the starting data point of each position information group comprises the following specific steps:

selecting the position information set with the highest preference degree as the optimal position information set; recording all the position information groups forming the optimal position information set as compressed data groups; and taking the initial data point of each compressed data set as a compression starting point of the target subarea corresponding to each compressed data set.

Preferably, the method for obtaining the compression result of the vector layout file according to the compression start point of each target sub-region includes the following specific steps:

recording the compression starting point position of each target subarea; and carrying out Huffman coding operation on the position information data in the optimal position information set, constructing a coding tree, and compressing the position information data from the compression starting point of each target sub-region to other target data points to obtain the compression result of the vector layout file.

The embodiment of the invention provides a system for rapidly converting a scanned file into a vector layout file, which comprises a data acquisition module, a data dividing module, a data analysis module, a data selection module and a data compression module, wherein:

the data acquisition module is used for acquiring a two-dimensional matrix of the vector layout file;

the data dividing module is used for acquiring a target subarea according to the two-dimensional matrix of the vector layout file; acquiring position information data from any target data point to another target data point in the target subregion;

the data analysis module is used for acquiring all the position information groups in each target subarea and initial data points of each position information group; acquiring a position information set according to all the position information groups in each target subarea; calculating the preference degree of the position information set;

the data selection module is used for acquiring a compression starting point of each target subarea according to the preference degree of the position information set and the starting data point of each position information group;

and the data compression module is used for acquiring the compression result of the vector layout file according to the compression starting point of each target subarea.

The beneficial effects of the technical solution of the invention are as follows: because a vector layout file contains a large number of angles and distances, it occupies a large amount of storage space and is transmitted with low efficiency. The invention provides a method for rapidly converting a scanned file into a vector layout file that compresses the vector layout file, thereby reducing the storage space it requires and improving the efficiency of transmitting it.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of the steps of a method for rapidly converting a scanned file into a vector layout file according to the present invention;

FIG. 2 is a block diagram of a system for rapidly converting a scanned file into a vector layout file according to the present invention.

Detailed Description

To further explain the technical means adopted by the invention to achieve its intended purpose, and their effects, the specific implementation, structure, features, and effects of the method and system for rapidly converting a scanned file into a vector layout file are described in detail below. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The method and system for rapidly converting a scanned file into a vector layout file provided by the invention are described in detail below with reference to the accompanying drawings.

Referring to FIG. 1, a flowchart of the steps of a method for rapidly converting a scanned file into a vector layout file according to an embodiment of the invention is shown. The method includes the following steps:

Step S001: obtaining a two-dimensional matrix of the vector layout file and target data points.

It should be noted that when a paper document is converted into an electronic document, scaling the scanned image often introduces noise and distortion, which degrades the quality of the electronic document. The scanned paper document is therefore usually converted into a vector layout file to improve the quality of the electronic document.

It should be further noted that, because a vector layout file contains a large number of angles and distances, it occupies a large amount of storage space and is transmitted with low efficiency. This embodiment therefore provides a method for rapidly converting a scanned file into a vector layout file that compresses the vector layout file, reducing the storage space it requires and improving the efficiency of transmitting it. The first step is to obtain a two-dimensional matrix of the vector layout file.

Specifically, a paper document is scanned with a document scanner to obtain a scanned file matrix, and the specific positions of the text in the scanned file are acquired using optical character recognition. The data value of each data point at a text position in the scanned file matrix is set to 1, and the data point is recorded as a target data point; the data value of each data point not at a text position is set to 0, and the data point is recorded as a blank data point. This yields the two-dimensional matrix of the vector layout file. Optical character recognition is well-known prior art and is not described further in this embodiment.

So far, a two-dimensional matrix of the vector layout file is obtained.
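Step S001 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the optical character recognition stage has already returned text positions as inclusive bounding boxes `(top, left, bottom, right)`; that box format and the function names `build_layout_matrix` and `target_points` are hypothetical and do not come from the embodiment.

```python
from typing import List, Tuple

# Hypothetical OCR output format: inclusive (top, left, bottom, right) boxes.
Box = Tuple[int, int, int, int]


def build_layout_matrix(height: int, width: int, text_boxes: List[Box]) -> List[List[int]]:
    """Return a height x width matrix with 1 at text positions and 0 elsewhere."""
    matrix = [[0] * width for _ in range(height)]
    for top, left, bottom, right in text_boxes:
        # Clamp each box to the matrix bounds before filling it.
        for r in range(max(top, 0), min(bottom, height - 1) + 1):
            for c in range(max(left, 0), min(right, width - 1) + 1):
                matrix[r][c] = 1  # target data point
    return matrix


def target_points(matrix: List[List[int]]) -> List[Tuple[int, int]]:
    """List the (row, column) coordinates of all target data points."""
    return [(r, c) for r, row in enumerate(matrix) for c, v in enumerate(row) if v == 1]
```

Every cell left at 0 is a blank data point in the sense of this step.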

Step S002: acquiring a target subarea according to the two-dimensional matrix of the vector layout file; position information data of any target data point to another target data point in the target subregion is acquired.

It should be noted that this embodiment, as a method for rapidly converting a scanned file into a vector layout file, compresses the vector layout file in order to reduce the storage space it requires and improve the efficiency of transmitting it. The matrix of the vector layout file contains only blank data points and target data points, and the blank data points far outnumber the target data points, so the vector layout file can be compressed by compressing only the target data points in the matrix. The target subregions formed by target data points in the two-dimensional matrix of the vector layout file are therefore acquired first.

Specifically, in the matrix of the vector layout file, if two target data points are adjacent, the two target data points are classified into the same target sub-area, and if the two target data points are not adjacent, the two target data points cannot be classified into the same target sub-area, so that a plurality of target sub-areas are obtained.
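Grouping adjacent target data points amounts to connected-component labelling. The sketch below uses a breadth-first search. The embodiment does not say whether "adjacent" includes diagonal neighbours, so the `connectivity` parameter (8 by default, 4 as the alternative) is an assumption, as is the function name `target_subregions`.

```python
from collections import deque
from typing import List, Tuple

Point = Tuple[int, int]


def target_subregions(matrix: List[List[int]], connectivity: int = 8) -> List[List[Point]]:
    """Group adjacent target data points (value 1) into target subregions."""
    height = len(matrix)
    width = len(matrix[0]) if height else 0
    if connectivity == 8:
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    else:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    seen = [[False] * width for _ in range(height)]
    regions: List[List[Point]] = []
    for r in range(height):
        for c in range(width):
            if matrix[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search from an unvisited target data point.
            seen[r][c] = True
            queue = deque([(r, c)])
            region: List[Point] = []
            while queue:
                cr, cc = queue.popleft()
                region.append((cr, cc))
                for dr, dc in offsets:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < height and 0 <= nc < width
                            and matrix[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append(sorted(region))
    return regions
```

With 4-connectivity, two target data points that touch only diagonally fall into different target subregions, so the choice changes how many subregions are produced.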

It should be further noted that the vector data in the two-dimensional matrix of the vector layout file are the direction angles and distances between each data point and the other data points. This embodiment therefore needs to obtain the direction angle and distance between each target data point and the other target data points within every target subregion of the two-dimensional matrix.

Specifically, the horizontal rightward direction is taken as the reference direction. The included angle between the reference direction and the ray from any target data point to another target data point in each target subregion is calculated with the following formula:

where $\theta_{i,a,b}$ denotes the direction angle between the reference direction and the ray from the $a$-th target data point to the $b$-th target data point in the $i$-th target subregion; $x_{i,a}$ and $x_{i,b}$ denote the column numbers of the $a$-th and $b$-th target data points of the $i$-th target subregion in the matrix of the vector layout file; $y_{i,a}$ and $y_{i,b}$ denote the row numbers of the $a$-th and $b$-th target data points of the $i$-th target subregion in the matrix of the vector layout file; $\arctan(\cdot)$ denotes the arctangent function; and $\operatorname{sign}(\cdot)$ denotes the sign function.

The Euclidean distance from any target data point to another target data point in each target subregion is then acquired as the distance between the two target data points. The Pythagorean theorem on which this calculation rests is well known and is not described further in this embodiment.

The direction angle and distance from any target data point to another target data point in the target subregion are thus obtained, and are recorded together as the position information data from that target data point to the other target data point.
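A direction angle and distance of this kind can be sketched as follows. This is not the embodiment's arctangent-and-sign formula: it substitutes the standard-library `math.atan2`, which resolves the quadrant directly, and `math.hypot` for the Euclidean distance. The angle convention (degrees in [0, 360), counter-clockwise from the horizontal rightward direction, with the row axis flipped because row numbers grow downward) and the rounding to six decimals are assumptions made for illustration.

```python
import math
from typing import Tuple

Point = Tuple[int, int]  # (row, column) in the layout matrix


def position_info(a: Point, b: Point) -> Tuple[float, float]:
    """Return (direction angle in degrees, Euclidean distance) from a to b."""
    d_col = b[1] - a[1]
    d_row = a[0] - b[0]  # rows grow downward, so flip the sign to make "up" positive
    angle = math.degrees(math.atan2(d_row, d_col)) % 360.0
    distance = math.hypot(d_col, d_row)
    # Rounding lets identical position information data compare equal later on.
    return (round(angle, 6), round(distance, 6))
```

For example, `position_info((0, 0), (0, 3))` describes a point three columns to the right: the ray coincides with the reference direction and the distance is 3.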

Step S003: acquiring all the position information groups in each target subarea and the initial data point of each position information group according to the position information data from any target data point to another target data point in the target subarea; acquiring a position information set according to all the position information groups in each target subarea; the degree of preference of the set of location information is calculated.

It should be noted that this embodiment, as a method for rapidly converting a scanned file into a vector layout file, compresses the vector layout file in order to reduce the storage space it requires and improve the efficiency of transmitting it. In each target subregion obtained in step S002, the position information data from one target data point to another exhibit a certain repetitiveness, and the greater the repetitiveness, the better the compression. The preference degree of different position information data can therefore be obtained from their repetitiveness, which improves the compression of the vector layout file.

Specifically, for the $i$-th target subregion, the position information data from the first target data point to all other target data points are acquired and grouped, and the group is recorded as the first position information group in the $i$-th target subregion; the first target data point is recorded as the starting data point of the first position information group in the $i$-th target subregion;

and so on, until the position information data from the last target data point to all other target data points are acquired and grouped, and the group is recorded as the last position information group in the $i$-th target subregion; the last target data point is recorded as the starting data point of the last position information group in the $i$-th target subregion.

Similarly, all the position information groups in each target subregion are acquired, together with the starting data point of each position information group.
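The construction of position information groups can be sketched as a loop over candidate starting data points. The helper `position_info` below repeats the illustrative `atan2`-based convention (an assumption, not the embodiment's formula), and `position_info_groups` is a hypothetical name.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]
Info = Tuple[float, float]  # (direction angle in degrees, distance)


def position_info(a: Point, b: Point) -> Info:
    """Illustrative angle/distance from a to b (assumed convention, see lead-in)."""
    d_col, d_row = b[1] - a[1], a[0] - b[0]
    angle = math.degrees(math.atan2(d_row, d_col)) % 360.0
    return (round(angle, 6), round(math.hypot(d_col, d_row), 6))


def position_info_groups(region: List[Point]) -> List[Tuple[Point, List[Info]]]:
    """For each target data point, build its group of position information data.

    Returns (starting data point, group) pairs; each group holds the position
    information data from that starting point to every other point in the region.
    """
    groups = []
    for i, start in enumerate(region):
        group = [position_info(start, other) for j, other in enumerate(region) if j != i]
        groups.append((start, group))
    return groups
```

A target subregion with n target data points yields n groups of n - 1 entries each, so the cost grows quadratically with the size of the subregion.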

One position information group is then randomly selected from each target subregion, and the position information data of the selected groups are placed into the same position information set; position information data in the set that are exactly identical are treated as one kind of position information data. For each position information set, the frequency of occurrence of each kind of position information data in the set and the number of position information data in the set are counted, and the preference degree of the position information set is obtained from the number of kinds of position information data in the set, the frequency of occurrence of each kind in the set, and the number of position information data in the set. The specific calculation formula is as follows:

Since all the position information groups in a given target subregion contain the same number of position information data, every position information set contains the same number of position information data. Therefore, as the number of kinds of position information data in a position information set increases, the repetitiveness of the set decreases, that is, the preference degree of the set decreases. The information entropy of the $j$-th position information set is $H_j = -\sum_{k=1}^{K_j} p_{j,k}\log_2 p_{j,k}$, where $p_{j,k}$ is the frequency of occurrence of the $k$-th kind of position information data in the $j$-th position information set and $K_j$ is the number of kinds of position information data in that set. The smaller $H_j$ is, the lower the information entropy of the $j$-th position information set and the better the compression achieved by entropy coding, so a smaller $H_j$ corresponds to a higher preference degree. Accordingly, a position information set with fewer kinds of position information data and lower information entropy has a higher preference degree.

So far, the preference degree of all the position information sets is obtained.
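The sketch below computes the Shannon entropy of a position information set together with an illustrative preference score. The entropy is the standard base-2 definition. The scoring form `exp(-(kinds / n + entropy))` is an assumption chosen only to reproduce the stated behaviour (fewer kinds and lower entropy give a higher score) and is not the formula of the embodiment; the names `entropy_bits` and `preference_degree` are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy_bits(items: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of the empirical distribution of items."""
    total = len(items)
    counts = Counter(items)
    # Sum of p * log2(1 / p) over the kinds, written with total / c.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def preference_degree(position_set: Sequence[Hashable]) -> float:
    """Illustrative score in (0, 1]; higher means more repetitive data."""
    n = len(position_set)
    if n == 0:
        raise ValueError("position information set is empty")
    kinds = len(set(position_set))
    # ASSUMED scoring form, not the patent's formula: it penalises both the
    # share of distinct kinds and the information entropy of the set.
    return math.exp(-(kinds / n + entropy_bits(position_set)))
```

A set made of four identical entries has zero entropy and scores `exp(-0.25)`, while four mutually different entries have an entropy of 2 bits and score `exp(-3)`.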

Step S004: and acquiring a compression starting point of each target sub-region according to the preference degree of the position information set and the starting data point of each position information group.

It should be noted that step S003 yields the preference degrees of all the position information sets. The optimal position information set is obtained from these preference degrees, and the compression starting point of each target subregion is then obtained from the optimal position information set.

Specifically, selecting the position information set with the highest preference degree as the optimal position information set; recording all the position information groups forming the optimal position information set as compressed data groups; and taking the initial data point of each compressed data set as a compression starting point of the target subarea corresponding to each compressed data set.

So far, a compression starting point of each target sub-region is obtained.
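Selecting the optimal position information set can be sketched as a search over one group per target subregion. Whereas the embodiment speaks of random selection, this sketch enumerates every combination with `itertools.product`, which is feasible only for small inputs because the number of combinations grows exponentially. The score reuses an assumed form, not the embodiment's formula, and the name `choose_compression_starts` is hypothetical.

```python
import itertools
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[int, int]
Info = Tuple[float, float]
Group = Tuple[Point, List[Info]]  # (starting data point, position information data)


def preference_degree(position_set: Sequence[Info]) -> float:
    """ASSUMED score: exp(-(kinds / n + Shannon entropy in bits))."""
    n = len(position_set)
    if n == 0:
        raise ValueError("position information set is empty")
    counts = Counter(position_set)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return math.exp(-(len(counts) / n + entropy))


def choose_compression_starts(groups_per_region: List[List[Group]]) -> Tuple[List[Point], float]:
    """Pick one group per subregion so that the merged set scores highest."""
    best_score, best_choice = -1.0, None
    for choice in itertools.product(*groups_per_region):
        merged = [info for _, group in choice for info in group]
        score = preference_degree(merged)
        if score > best_score:  # strict comparison keeps the first of any ties
            best_score, best_choice = score, choice
    return [start for start, _ in best_choice], best_score
```

The returned starting data points play the role of the compression starting points, one per target subregion. A randomized variant would sample combinations instead of enumerating them.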

Step S005: and obtaining a compression result of the vector layout file according to the compression starting point of each target subarea.

It should be noted that, according to step S004, a compression start point of each target sub-region is obtained, and then each target sub-region can be compressed according to the compression start point of each target sub-region, so as to obtain a compression result of the vector layout file.

Specifically, the position of the compression starting point of each target subregion is recorded. Huffman coding is applied to the position information data in the optimal position information set: a coding tree is constructed, and the position information data from the compression starting point of each target subregion to the other target data points are compressed, yielding the compression result of the vector layout file. Huffman coding is well-known prior art and is not described further in this embodiment.
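The Huffman coding step can be sketched with the standard library's `heapq`. The sketch treats each distinct (angle, distance) pair as one symbol, which is an assumption about how the position information data are tokenized; the names `huffman_codes` and `encode` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict, Hashable, Sequence


def huffman_codes(symbols: Sequence[Hashable]) -> Dict[Hashable, str]:
    """Build a prefix-free binary code from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tie = count()  # unique tiebreaker so that dicts are never compared
    heap = [(weight, next(tie), {sym: ""}) for sym, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]


def encode(symbols: Sequence[Hashable], codes: Dict[Hashable, str]) -> str:
    """Concatenate the code words of the symbols into one bit string."""
    return "".join(codes[s] for s in symbols)
```

Frequent position information data receive short code words, which is why a repetitive optimal position information set compresses well. In addition to the bit string, a decoder would need the code table and the recorded compression starting points.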

This embodiment is completed.

Referring to FIG. 2, a block diagram of a system for rapidly converting a scanned file into a vector layout file according to an embodiment of the invention is shown. The system includes the following modules:

the data compression module is configured to obtain a compression result of the vector layout file according to the compression start point of each target sub-region.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the invention, but any modifications, equivalent substitutions, improvements, etc. within the principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for rapidly converting a scanned file into a vector layout file, the method comprising the steps of:

2. The method for rapidly converting a scanned file into a vector layout file according to claim 1, wherein obtaining the two-dimensional matrix of the vector layout file and the target data points comprises the following specific steps:

3. The method for rapidly converting a scanned file into a vector layout file according to claim 2, wherein acquiring the target subregion according to the two-dimensional matrix of the vector layout file comprises the following specific steps:

4. The method for rapidly converting a scanned file into a vector layout file according to claim 3, wherein acquiring the position information data from any target data point to another target data point in the target subregion comprises the following specific steps:

5. The method for rapidly converting a scanned file into a vector layout file according to claim 1, wherein acquiring all the position information groups in each target subregion and the starting data point of each position information group comprises the following specific steps:

6. The method for rapidly converting a scanned file into a vector layout file according to claim 1, wherein acquiring the position information set according to all the position information groups in each target subregion and calculating the preference degree of the position information set comprises the following specific steps:

7. The method for rapidly converting a scanned file into a vector layout file according to claim 6, wherein the preference degree of the position information set is obtained with the following calculation formula:

where $Y_j$ denotes the preference degree of the $j$-th position information set; $N_j$ denotes the number of position information data in the $j$-th position information set; $p_{j,k}$ denotes the frequency of occurrence of the $k$-th kind of position information data in the $j$-th position information set; $K_j$ denotes the number of kinds of position information data in the $j$-th position information set; $\exp(\cdot)$ denotes the exponential function with the natural constant as its base; and $\log_2(\cdot)$ denotes the base-2 logarithm.

8. The method for rapidly converting a scanned file into a vector layout file according to claim 1, wherein acquiring the compression starting point of each target subregion according to the preference degree of the position information set and the starting data point of each position information group comprises the following specific steps:

9. The method for rapidly converting a scanned file into a vector layout file according to claim 8, wherein obtaining the compression result of the vector layout file according to the compression starting point of each target subregion comprises the following specific steps:

10. A system for fast converting a scanned document into a vector layout document, the system comprising:

the data acquisition module is used for acquiring a two-dimensional matrix of the vector layout file and target data points;