CN115346227A

CN115346227A - Method for vectorizing electronic file based on layout file

Info

Publication number: CN115346227A
Application number: CN202211266067.0A
Authority: CN
Inventors: 黄雪琪; 王宝凤; 赵瑞婷
Original assignee: Jingchen Technology Nantong Co ltd
Current assignee: Jingchen Technology Nantong Co ltd
Priority date: 2022-10-17
Filing date: 2022-10-17
Publication date: 2022-11-15
Anticipated expiration: 2042-10-17
Also published as: CN115346227B

Abstract

The invention relates to the technical field of image recognition, and in particular to a method for vectorizing an electronic file based on a layout file. The method comprises: acquiring a scanned bitmap of a paper document; acquiring, for each character in the scanned bitmap, the vector characters of different fonts corresponding to it; acquiring binary images of the content inside the outer bounding boxes of the character and of each vector character; acquiring the distances from the center point of each bounding box to the closed edges, which yields distance sequences and distance sequence sets; sampling the distance sequence sets at several different sampling scales to obtain target sequence sets; calculating the similarity distances between the target sequences of the character and those of the vector character, and performing KM matching; obtaining a matching effect evaluation value from the similarity distances of the matched target sequences; obtaining the font matching degree between the character and each corresponding vector character; determining the replacement object for the character; and obtaining the vectorized file.

Description

Method for vectorizing electronic file based on layout file

Technical Field

The invention relates to the technical field of image recognition, in particular to a method for vectorizing an electronic file based on a layout file.

Background

With the development of informatization, electronic files are used ever more widely and are gradually replacing paper files in daily office work. In application scenarios such as archive organization, electronic filing makes it easier to store large volumes of paper material. To digitize a paper document, it must first be scanned, that is, converted into a computer image. The bitmap obtained by scanning distorts when enlarged; to avoid this, the computer image of the paper document is vectorized to obtain a vector graphic.

A vector graphic does not distort however it is enlarged, reduced, rotated or stretched. Because it is represented by the parameters of mathematical formulas, it also requires less storage, which serves the goal of converting a paper document into an electronic document for storage.

When vectorizing a scanned image of a paper document, OCR document content recognition is generally used to recognize the character information in the document, and each character is then replaced directly by a vector character according to the recognition result. For this replacement, the font and the font size of each character recognized by OCR must also be identified. The font size can be determined directly from the size of the character. Identifying the font requires comparing the character in the scanned image with the corresponding vector characters in a vector character library; once the font of the character in the scanned image has been determined, the corresponding vector character is substituted and displayed.

However, for some characters the differences between fonts are small, and the quality of a scanned image is unstable because it is easily affected by various factors. As a result, a character in the scanned image may appear similar to the bitmap converted from a vector character even though the two are actually in different fonts, which easily produces an incorrect vectorization result.

Disclosure of Invention

The invention provides a method for vectorizing an electronic file based on a layout file, which aims to solve the problem of inaccurate vectorizing result in the prior art.

The invention discloses a method for vectorizing an electronic file based on a layout file, which adopts the following technical scheme:

acquiring a scanning bitmap of a paper document;

acquiring an index label and a font size label of each character on a scanning bitmap and an outer surrounding frame of each character;

acquiring vector characters of different fonts corresponding to the characters in a vector character library according to the index tag and the character size tag, converting the vector characters into a vector character bitmap, acquiring an outer surrounding frame of each vector character in the vector character bitmap, and acquiring binary images of corresponding images in the outer surrounding frames of the characters and the outer surrounding frames of the vector characters;

acquiring all closed edges in the two binary images, acquiring the maximum distance from the central point of the outer enclosure frame to the corresponding closed edge, sequentially acquiring the distances from all the edge points to the central point of the corresponding outer enclosure frame along the clockwise direction by taking the target edge point corresponding to the maximum distance as a starting point, and acquiring a distance sequence of the closed edge and a distance sequence set corresponding to the two binary images;

sampling each distance sequence in the distance sequence set by using a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale, calculating the similarity distance between every two target sequences in the two target sequence sets under the same sampling scale, performing KM (Kuhn-Munkres) matching on the target sequences in the two target sequence sets according to the similarity distance, acquiring the similarity distance mean value of the matched target sequences, and taking the similarity distance mean value as a matching effect evaluation value;

and calculating the font matching degree of the characters and the vector characters of each corresponding font according to the similarity of the KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value, taking the vector character corresponding to the maximum font matching degree as a replacement object of the corresponding character in the vector diagram, and obtaining the vectorized file.

Preferably, the step of setting a plurality of different sampling scales comprises:

acquiring the length of a closed edge corresponding to a distance sequence with the shortest length in all distance sequences in the two distance sequence sets;

taking one tenth of the length of the closed edge corresponding to the distance sequence with the shortest length as the maximum sampling scale;

the sampling scales are 1, 2, 3, …, in sequence up to the maximum sampling scale.

Preferably, the step of respectively sampling each distance sequence in the distance sequence set by using a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale includes:

averaging each distance in a distance sequence of the distance sequence set with the distance located one sampling-scale interval after it; taking the mean value as a target value in the target sequence;

and by analogy, obtaining a target sequence corresponding to each sampling scale, and obtaining a target sequence set corresponding to the distance sequence set.

Preferably, a dynamic time warping algorithm is used to calculate the similarity distance between every two target sequences in two target sequence sets under the same sampling scale.

Preferably, the step of obtaining the similarity distance average of the matched target sequences includes:

after matching is completed, recording the similarity distance of two unmatched target sequences as 1;

calculating the similarity distance between every two well-matched target sequences by using a dynamic time warping algorithm;

and after the matching is completed, calculating the mean value over the similarity distances of all unmatched target sequences and the similarity distances of all matched pairs of target sequences.

Preferably, the step of calculating the similarity of KM matching results of target sequences in the corresponding target sequence sets at two different sampling scales includes:

acquiring corresponding matched target sequence pairs in corresponding target sequence sets under two different sampling scales, and solving an intersection to obtain the number of the matched same target sequence pairs;

solving a union set of corresponding matched target sequence pairs in the corresponding target sequence sets under two different sampling scales to obtain the total matching number;

and taking the ratio of the number of the matched same target sequence pairs to the total number of matched targets as the similarity of KM matching results.

Preferably, the step of calculating the font matching degree between the text and the vector text of each corresponding font according to the similarity of KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value includes:

calculating the mean value of the corresponding matching effect evaluation values under every two different sampling scales;

calculating the product of the similarity of KM matching results of target sequences in the corresponding target sequence set under every two different sampling scales, the mean value of the matching effect evaluation value and the weights of the corresponding two different sampling scales, and taking the product as the initial matching degree;

and summing all the initial matching degrees corresponding to every two different sampling scales in all the sampling scales to obtain the font matching degree of each character and the vector character of each corresponding font.

Preferably, the image correction is performed on the scanned bitmap, the image correction including: and correcting the pose of the scanning bitmap of the paper document according to the scanning angle to obtain a front view of the scanning bitmap of the paper document, and taking the front view of the scanning bitmap as the scanning bitmap.

Preferably, the OCR algorithm is used to obtain the text on the scanning bitmap and obtain the outer bounding box of each text.

Preferably, the index tag of the text is a tag of the text to be searched.

The method for vectorizing the electronic file based on the layout file has the advantages that:

1. The distances between the edge points on each closed edge and the center point, in both the scanned bitmap of a character and the vector character bitmap, are used as the descriptive feature values of the closed edge. The distance sequences are then paired at multiple sampling scales, and the font of each character is identified from the pairing results, which avoids the influence of the scan's image quality on matching accuracy.

2. Similarity calculation is carried out on matching results of the distance sequence set of the closed edge of the scanning bitmap and the distance sequence set of the closed edge of the vector character bitmap, matching effects are obtained, then the matching degree of the characters and each corresponding character style is comprehensively evaluated according to the matching effects under the multi-sampling scale and the similarity of the matching results, and therefore the vector characters of the character style matched with the characters are further accurately determined, and vectorization of the characters is accurately achieved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a schematic structural diagram of an embodiment of a method for vectorizing an electronic file based on a layout file according to the present invention;

fig. 2 is a schematic structural diagram of obtaining a distance sequence in an embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

An embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention is, as shown in fig. 1,

S1. Obtaining a scanned bitmap of a paper document.

Specifically, a scanned bitmap of the paper document is captured with a digital camera. To prevent skew in the captured document from affecting subsequent image processing, the scanned bitmap is corrected in this embodiment: its pose is corrected according to the scanning angle to obtain a front view of the scanned bitmap of the paper document, and this front view is used as the scanned bitmap.

S2, acquiring an index label, a font size label and an outer surrounding frame of each character on the scanning bitmap, acquiring vector characters of different fonts corresponding to the corresponding characters in a vector character library according to the index label and the font size label, converting the vector characters into a vector character bitmap, acquiring the outer surrounding frame of each vector character in the vector character bitmap, and acquiring the outer surrounding frame of the characters and a binary image of a corresponding image in the outer surrounding frame of the vector characters.

The vector characters are data stored in the vector character library, and the characters recognized in the scanned bitmap of the paper document by an OCR algorithm carry index tags and font size tags. The vector characters of all different fonts corresponding to each character can therefore be retrieved from the vector character library.

Based on this, the index tag, font size tag and outer bounding box of each character in the scanned bitmap are obtained first. The index tag of a character is the tag used to look the character up; it is identical to the tag of the same character in the character library. For example, given the index tag and the font size tag of the i-th character, the vector characters of all different fonts corresponding to the i-th character can be retrieved from the vector character library. These vector characters are then converted into vector character bitmaps, one for the vector character of the j-th font corresponding to the i-th character, where j = 1, 2, 3, ….

Taking the i-th character as an example, the outer bounding box of the character in the scanned bitmap and the outer bounding box of the vector character of each corresponding font in the vector character bitmap are obtained, and the image inside each bounding box is binarized. Because both the scanned bitmap and the vector character bitmap consist of a character part and a background part, pixels of the character part are set to 1 and pixels of the background part are set to 0, which completes the binarization of the bitmap inside the bounding box. This yields one binary image for the image inside the bounding box of the i-th character, and one binary image for the image inside the bounding box of the vector character of the j-th font corresponding to the i-th character, where j = 1, 2, 3, ….

It should be noted that, in this embodiment, the outer bounding box of a character and the outer bounding box of a vector character both refer to the minimum bounding rectangle.
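The binarization and bounding-box step described above can be illustrated with a minimal sketch. It is not the patented implementation: the threshold of 128, the assumption of dark characters on a light background, the axis-aligned box, and the names `binarize`, `bounding_box` and `box_center` are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

Grid = List[List[int]]


def binarize(gray: Grid, threshold: int = 128) -> Grid:
    """Set dark (character) pixels to 1 and light (background) pixels to 0.

    The fixed threshold is an assumption; the text does not specify one.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def bounding_box(binary: Grid) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), inclusive, of all pixels equal to 1."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    if not rows:
        raise ValueError("no character pixels found")
    return rows[0], cols[0], rows[-1], cols[-1]


def box_center(box: Tuple[int, int, int, int]) -> Tuple[float, float]:
    """Center point (row, col) of an axis-aligned bounding box."""
    top, left, bottom, right = box
    return (top + bottom) / 2.0, (left + right) / 2.0


# A tiny grayscale patch: 255 is background, small values are ink.
gray = [
    [255, 255, 255, 255, 255],
    [255, 10, 20, 255, 255],
    [255, 30, 255, 255, 255],
    [255, 255, 255, 255, 255],
]
binary = binarize(gray)
box = bounding_box(binary)
center = box_center(box)
```

The center computed here plays the role of the reference point used in the next step.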

S3. Acquiring all closed edges in the two binary images, acquiring the maximum distance from the center point of the outer bounding box to the corresponding closed edge, sequentially acquiring the distances from all the edge points to the center point of the corresponding outer bounding box along the clockwise direction by taking the target edge point corresponding to the maximum distance as a starting point, and acquiring a distance sequence of each closed edge and a distance sequence set corresponding to each of the two binary images.

Specifically, edge detection is performed on each binary image with the Canny edge detection algorithm to obtain all closed edges in the binary image. For example, the Chinese character for "two" (二) yields two closed edges, and the character for "three" (三) yields three. To obtain the distance sequence of a closed edge, the center point of the outer bounding box corresponding to that closed edge is obtained first and taken as the reference point; the distance between the reference point and each edge point on the closed edge is then computed. Computing such distances is prior art.

As shown in Fig. 2, the pixel distance from each edge point on a closed edge to the reference point is used as the descriptive value of that closed edge. For the binary image inside the bounding box of the i-th character, the distance sequence of its k-th closed edge consists of the distances between each edge point on that closed edge and the center point, listed in order from the first edge point to the last; the length of the sequence equals the total number of edge points on the k-th closed edge. Likewise, for the binary image inside the bounding box of the vector character of the j-th font corresponding to the i-th character, the distance sequence of its k-th closed edge is formed in the same way. By analogy, the distance sequences and the corresponding distance sequence sets are obtained for the binary images inside the bounding boxes of all characters and inside the bounding boxes, in the vector character bitmap, of the vector characters of each corresponding font.
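As a rough sketch of how one distance sequence could be built, the function below takes the edge points of a single closed edge and the reference point, computes the Euclidean distances, and rotates the sequence so that it starts at the farthest edge point. It assumes the edge points have already been traced in clockwise order by some contour-following step, and that ties for the maximum are broken by taking the first occurrence; neither detail is specified in the text, and the name `distance_sequence` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def distance_sequence(edge_points: Sequence[Point], center: Point) -> List[float]:
    """Distances from each edge point to the center, starting at the farthest point.

    edge_points is assumed to be ordered clockwise along one closed edge.
    """
    dists = [math.hypot(x - center[0], y - center[1]) for x, y in edge_points]
    # Index of the edge point with the maximum distance (first one on ties).
    start = max(range(len(dists)), key=dists.__getitem__)
    # Rotate the cyclic sequence so that it begins at that point.
    return dists[start:] + dists[:start]


edge = [(1.0, 0.0), (0.0, 3.0), (-2.0, 0.0), (0.0, -1.0)]
sequence = distance_sequence(edge, (0.0, 0.0))
```

Starting every sequence at the farthest point gives the character and the vector character a comparable starting position on corresponding edges.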

S4. Sampling each distance sequence in the distance sequence set by using a plurality of different sampling scales to obtain a corresponding target sequence set under each sampling scale, calculating the similarity distance between every two target sequences of the two target sequence sets under the same sampling scale, performing KM (Kuhn-Munkres) matching on the target sequences in the two target sequence sets according to the similarity distance, acquiring the similarity distance mean value of the matched target sequences, and taking the similarity distance mean value as a matching effect evaluation value.

To measure the font matching degree between the binary image of a character and the binary image of a corresponding vector character, each distance sequence in the distance sequence set of the closed edges of the character's binary image must be paired, at multiple scales, with each distance sequence in the distance sequence set of the closed edges of the vector character's binary image.

Specifically, in this embodiment the sampling scales are set first. Because the length of each closed edge is finite, the length of the closed edge corresponding to the shortest distance sequence among all distance sequences in the two distance sequence sets is obtained, and one tenth of that length is taken as the maximum sampling scale, denoted C. The sampling scales are therefore 1, 2, 3, …, C. At a sampling scale c, sampling each distance sequence in the distance sequence set of the binary image inside the bounding box of the i-th character yields one target sequence set, and sampling each distance sequence in the distance sequence set of the binary image inside the bounding box of the vector character of the j-th font corresponding to the i-th character yields another target sequence set.
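The choice of sampling scales can be sketched in a few lines. The sketch assumes that "one tenth of the length" is rounded down to an integer and that at least one scale is always kept; the text does not state how fractional values or very short edges are handled, and `sampling_scales` is a hypothetical name.

```python
from typing import Iterable, List


def sampling_scales(sequence_lengths: Iterable[int]) -> List[int]:
    """Return the scales 1..C, where C is one tenth of the shortest sequence length.

    Rounding down and the lower bound of 1 are assumptions made for this sketch.
    """
    max_scale = max(1, min(sequence_lengths) // 10)
    return list(range(1, max_scale + 1))


# Lengths of all distance sequences in both distance sequence sets.
scales = sampling_scales([57, 120, 34])
```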

Sampling each distance sequence in the distance sequence set at a given sampling scale to obtain the target sequence set for that scale proceeds as follows: each distance in a distance sequence is averaged with the distance located one sampling-scale interval after it, and the mean is taken as a target value in the target sequence. By analogy, a target sequence is obtained for each distance sequence at each sampling scale, and the target sequences of all distance sequences in a distance sequence set form the target sequence set.
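One possible reading of this sampling step is sketched below: every distance is averaged with the distance `scale` positions later. Because the edge is closed, the sketch wraps around cyclically so that the target sequence keeps the original length; the text does not say how the end of the sequence is handled, so the wrap-around is an assumption, as are the names `sample_sequence` and `target_sets`.

```python
from typing import Dict, List, Sequence


def sample_sequence(distances: Sequence[float], scale: int) -> List[float]:
    """Average each distance with the one `scale` positions later (cyclic)."""
    n = len(distances)
    return [(distances[i] + distances[(i + scale) % n]) / 2.0 for i in range(n)]


def target_sets(
    sequences: Sequence[Sequence[float]], scales: Sequence[int]
) -> Dict[int, List[List[float]]]:
    """Build one target sequence set per sampling scale."""
    return {c: [sample_sequence(s, c) for s in sequences] for c in scales}


sets_by_scale = target_sets([[4.0, 2.0, 0.0, 2.0]], [1, 2])
```

Larger scales average points that are farther apart on the edge, so they describe the shape more coarsely than small scales do.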

Specifically, a similarity distance between every two target sequences in two target sequence sets under the same sampling scale is calculated, where the similarity distance is a DTW distance, that is, the similarity distance between every two target sequences in two target sequence sets under the same sampling scale is calculated by using a dynamic time warping algorithm, and the dynamic time warping algorithm is an algorithm in the prior art and is not described in detail in this embodiment.
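For readers unfamiliar with it, a textbook dynamic time warping distance between two numeric sequences can be written as below. The absolute difference is used as the local cost, which is a common choice but an assumption here, and the result is the raw, unnormalized distance; the embodiment works with normalized distances but does not say how they are normalized, so that step is left out.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # dp[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because warping lets one element align with several, sequences of different lengths that trace the same shape can still have a distance of zero.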

KM matching is then performed on the target sequences in the two target sequence sets according to the similarity distances, and the mean similarity distance is obtained as follows. The KM algorithm finds a maximum-weight matching in a bipartite graph; that is, the target sequences of one set are paired with the target sequences of the other set, which gives the matched pairs of target sequences and also the number of matched sequences. In this embodiment, the similarity distances between target sequences are used to match them. After matching is completed, the similarity distance of any two unmatched target sequences is recorded as 1, and the similarity distance of each matched pair of target sequences is calculated with the dynamic time warping algorithm. The mean is then taken over the similarity distances of all unmatched and all matched target sequences. Its value lies between 0 and 1, because the similarity distances in this embodiment are normalized. This mean similarity distance serves as the evaluation of the matching result: the closer it is to 0, the better the matching effect, and the closer it is to 1, the worse the matching effect. Accordingly, at sampling scale c, a matching effect evaluation value is recorded for the two target sequence sets corresponding to the binary image inside the bounding box of the i-th character and the binary image inside the bounding box of the vector character of the j-th font corresponding to the i-th character.
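The matching and its evaluation value can be sketched as follows. Two things in this sketch are assumptions rather than the patented procedure. First, the optimal one-to-one assignment is found by brute-force enumeration of permutations instead of the Kuhn-Munkres algorithm, which is only practical for the handful of closed edges a character has. Second, the matching is taken to minimize the total normalized distance, and each target sequence left without a partner is counted once with distance 1. The names `match_min_cost` and `matching_effect` are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Pairs = List[Tuple[int, int]]


def match_min_cost(cost: Sequence[Sequence[float]]) -> Pairs:
    """Return (row, col) pairs of a minimum-total-cost one-to-one matching."""
    n_rows, n_cols = len(cost), len(cost[0])
    transposed = n_rows > n_cols
    if transposed:  # make sure the smaller side is enumerated as rows
        cost = [list(col) for col in zip(*cost)]
        n_rows, n_cols = n_cols, n_rows
    best_total, best_cols = float("inf"), ()
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_total, best_cols = total, cols
    pairs = list(enumerate(best_cols))
    if transposed:  # map back to the original (row, col) orientation
        pairs = [(c, r) for r, c in pairs]
    return sorted(pairs)


def matching_effect(cost: Sequence[Sequence[float]], pairs: Pairs) -> float:
    """Mean distance over matched pairs, counting each unmatched sequence as 1."""
    unmatched = abs(len(cost) - len(cost[0]))
    matched = [cost[r][c] for r, c in pairs]
    return (sum(matched) + 1.0 * unmatched) / (len(matched) + unmatched)


# Rows: target sequences of the character; columns: those of the vector character.
cost = [[0.1, 0.8, 0.5],
        [0.7, 0.2, 0.9]]
pairs = match_min_cost(cost)
effect = matching_effect(cost, pairs)
```

In this example one vector-character sequence has no partner, so the evaluation value is pulled toward 1 even though the two matched pairs are close.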

S5. Calculating the font matching degree of the characters and the vector characters according to the similarity of the KM matching results of the target sequences in the corresponding target sequence sets under every two different sampling scales and the matching effect evaluation value, taking the vector character corresponding to the maximum font matching degree as the replacement object of the corresponding character in the vector graphic, and obtaining a vectorized file.

Specifically, the similarity of the KM matching results of the target sequences in the target sequence sets corresponding to every two different sampling scales is obtained as follows: the matched target sequence pairs in the corresponding target sequence sets under the two sampling scales are acquired and intersected, which gives the number of identical matched target sequence pairs; the union of the matched target sequence pairs under the two sampling scales is taken, which gives the total matching number; and the ratio of the number of identical matched target sequence pairs to the total matching number is taken as the similarity of the KM matching results.
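The ratio described above, the size of the intersection divided by the size of the union, is the Jaccard index of the two sets of matched pairs. A minimal sketch follows; representing each matched pair as a tuple of edge indices and returning 0.0 when both matchings are empty are choices made here for illustration, and `matching_similarity` is a hypothetical name.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]


def matching_similarity(pairs_a: Iterable[Pair], pairs_b: Iterable[Pair]) -> float:
    """Jaccard index of the matched pairs obtained at two sampling scales."""
    set_a, set_b = set(pairs_a), set(pairs_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: no matched pairs at either scale
    return len(set_a & set_b) / len(union)


# Matched (character edge, vector edge) pairs at two different sampling scales.
at_scale_1 = [(0, 0), (1, 1), (2, 2)]
at_scale_3 = [(0, 0), (1, 2), (2, 1)]
similarity = matching_similarity(at_scale_1, at_scale_3)
```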

The font matching degree of the characters and the vector characters is calculated from the similarity of the KM matching results of the target sequences in the target sequence sets corresponding to every two different sampling scales and from the matching effect evaluation values, as follows: the mean of the matching effect evaluation values under every two different sampling scales is calculated; the product of the similarity of the KM matching results under those two sampling scales, the mean of the matching effect evaluation values, and the weight of the two sampling scales is calculated and taken as the initial matching degree; and all initial matching degrees corresponding to every two different sampling scales among all sampling scales are summed to obtain the font matching degree of each character and the corresponding vector character. In this embodiment, the font matching degree of each character and the vector character of the corresponding font is calculated according to the following formula:

With the notation defined below, the calculation described in words above (a sum, over every two different sampling scales, of the product of the weight, the similarity of the KM matching results, and the mean of the two matching effect evaluation values) can be written as

$$P_{i,j} = \sum_{1 \le a < b \le C} w_{a,b}\, S_{a,b}\, \frac{E_a + E_b}{2}$$

In the formula:

$P_{i,j}$ denotes the font matching degree between the i-th character and the vector character of the j-th font corresponding to the i-th character;

$E_a$ denotes the matching effect evaluation value, at sampling scale a, of the two target sequence sets corresponding to the binary image inside the bounding box of the i-th character and the binary image inside the bounding box of the vector character of the j-th font corresponding to the i-th character;

$E_b$ denotes the matching effect evaluation value of the same two binary images at sampling scale b;

$S_{a,b}$ denotes the similarity between the KM matching result of the target sequences in the corresponding target sequence sets at sampling scale a and the KM matching result of the target sequences in the corresponding target sequence sets at sampling scale b, where the two sampling scales are different;

$C$ denotes the total number of all sampling scales;

$w_{a,b}$ denotes the weight of the pair of sampling scales a and b.

It should be noted that the matching effect evaluation value at sampling scale a and the matching effect evaluation value at sampling scale b are both computed from normalized similarity distances, so an evaluation value closer to 0 indicates a better matching effect and an evaluation value closer to 1 indicates a worse matching effect. The weight reflects the difference between sampling scale a and sampling scale b: the larger the difference between the two sampling scales and the more similar the matching results, the more this indicates that the pairing results of the binary image inside the bounding box of the i-th character and the binary image inside the bounding box of the vector character of the j-th font corresponding to the i-th character remain similar even at two widely separated sampling scales. At the same time, the closer the font matching degree is to 0, the smaller the difference between the i-th character and the vector character of the j-th font corresponding to the i-th character, and the better the two fonts match. In short, the more similar the matching results across the sampling scales and the better the matching effect of those results, the more the vector character of the j-th font corresponding to the i-th character should be the replacement for the i-th character.

Therefore, in this embodiment, the vector character corresponding to the maximum font matching degree in the font matching degrees of the characters and the vector characters of each corresponding font is used as a replacement object of the corresponding character in the vector diagram, and the vectorized file is obtained.
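The aggregation and the final selection can be sketched as below, following the wording of the description: for every two different sampling scales, the weight, the similarity of the two matching results and the mean of the two matching effect evaluation values are multiplied, and the products are summed. The form of the weight is not recoverable from the text, so the sketch assumes it is the scale difference divided by the sum of all pairwise scale differences. The function name, the dictionary layout and the font names in the example are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def font_matching_degree(
    effects: Dict[int, float],
    similarities: Dict[Tuple[int, int], float],
) -> float:
    """Sum of weight * similarity * mean evaluation value over all scale pairs.

    effects maps a sampling scale to its matching effect evaluation value;
    similarities maps a scale pair (a, b), a < b, to the matching similarity.
    """
    scale_pairs = list(combinations(sorted(effects), 2))
    total_gap = sum(b - a for a, b in scale_pairs)
    degree = 0.0
    for a, b in scale_pairs:
        weight = (b - a) / total_gap  # assumed weight: relative scale difference
        mean_effect = (effects[a] + effects[b]) / 2.0
        degree += weight * similarities[(a, b)] * mean_effect
    return degree


effects = {1: 0.2, 2: 0.4, 3: 0.6}
similarities = {(1, 2): 1.0, (1, 3): 0.5, (2, 3): 0.5}
degree = font_matching_degree(effects, similarities)

# Selecting the replacement: the font with the maximum font matching degree.
degrees = {"font_a": degree, "font_b": 0.1}
best_font = max(degrees, key=degrees.get)
```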

In the method for vectorizing an electronic file based on a layout file according to the invention, the distances between the edge points on each closed edge and the center point, in both the scanned bitmap of a character and the vector character bitmap, are used as the descriptive feature values of the closed edge. The distance sequences are then paired at multiple sampling scales, and the font of each character is identified from the pairing results, which avoids the influence of the scan's image quality on matching accuracy.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for vectorizing an electronic file based on a layout file is characterized by comprising the following steps:

acquiring a scanning bitmap of a paper document;

acquiring vector characters of different fonts corresponding to the corresponding characters in a vector character library according to the index label and the character size label, converting the vector characters into a vector character bitmap, acquiring an outer surrounding frame of each vector character in the vector character bitmap, and acquiring binary images of the outer surrounding frame of the characters and corresponding images in the outer surrounding frame of the vector characters;

2. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of setting a plurality of different sampling scales comprises:

3. The method for vectorizing the electronic file based on the layout file according to claim 1, wherein the step of sampling each distance sequence in the distance sequence set by using a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale comprises:

adding each distance in the distance sequences in the distance sequence set and the corresponding distance after the sampling scale interval to calculate an average value; taking the mean value as a target value in a target sequence;

4. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein a dynamic time warping algorithm is used to calculate the similarity distance between every two target sequences in the two target sequence sets under the same sampling scale.

5. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of obtaining the mean value of the similarity distance between the matched target sequences comprises:

6. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of calculating the similarity of the KM matching results of the target sequences in the corresponding target sequence sets at two different sampling scales comprises:

obtaining a total matching number by taking the union of the corresponding matched target sequence pairs in the corresponding target sequence sets under two different sampling scales;

and taking the ratio of the number of the matched same target sequence pairs to the total number of matched target sequence pairs as the similarity of the KM matching results.

7. The method for vectorizing an electronic document based on a layout file according to claim 1, wherein the step of calculating the font matching degree between the text and the vector text of each corresponding font according to the similarity of the KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value comprises:

8. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the scanning bitmap is subjected to image correction, and the image correction comprises: and correcting the pose of the scanning bitmap of the paper document according to the scanning angle to obtain a front view of the scanning bitmap of the paper document, and taking the front view of the scanning bitmap as the scanning bitmap.

9. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein an OCR algorithm is used to obtain the text on the scanned bitmap and obtain the outer bounding box of each text.

10. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the index tag of the text is a tag of the text to be searched.