CN115346227B

CN115346227B - Method for vectorizing electronic file based on layout file

Info

Publication number: CN115346227B
Application number: CN202211266067.0A
Authority: CN
Inventors: 黄雪琪; 王宝凤; 赵瑞婷
Original assignee: Jingchen Technology Nantong Co ltd
Current assignee: Jingchen Technology Nantong Co ltd
Priority date: 2022-10-17
Filing date: 2022-10-17
Publication date: 2023-08-08
Anticipated expiration: 2042-10-17
Also published as: CN115346227A

Abstract

The invention relates to the technical field of image recognition, and in particular to a method for vectorizing an electronic file based on a layout file. The method comprises: acquiring a scanned bitmap of a paper document; acquiring, for each character in the scanned bitmap, the vector characters of different fonts that correspond to it; acquiring binary images of the content inside the outer surrounding frames of the character and of each vector character; acquiring the distances from the center point of each surrounding frame to the closed edges, thereby obtaining distance sequences and distance sequence sets; sampling the distance sequence sets at a plurality of different sampling scales to obtain target sequence sets; calculating the similarity distances between the target sequences of the character and those of the vector character and performing KM matching; obtaining a matching effect evaluation value from the similarity distances of the matched target sequences; obtaining the font matching degree between the character and the vector character of each font; and determining the replacement object of the character to obtain a vectorized file.

Description

Method for vectorizing electronic file based on layout file

Technical Field

The invention relates to the technical field of image recognition, in particular to a method for vectorizing an electronic file based on a layout file.

Background

With the development of informatization, electronic files are used ever more widely and are gradually replacing paper files in daily office work. In scenarios such as file management, electronic archiving makes it practical to store large volumes of paper material. To digitize a paper document, the document is first scanned, that is, converted into a computer image. Scanning yields a bitmap of the document, and a bitmap is distorted when enlarged. To avoid this, the computer image of the paper document needs to be vectorized to obtain a vector image.

A vector image is not distorted by enlargement, reduction, rotation, or stretching. Because a vector image is expressed by the parameters of mathematical formulas, it also requires less storage, so the purpose of converting a paper document into an electronic document for storage can be achieved.

Vectorizing a scanned image of a paper document generally relies on OCR to recognize the text in the document; the recognized characters are then replaced directly with vector character graphics, which vectorizes the document. For each character recognized by OCR, the font and the font size must also be identified. The font size can be determined directly from the size of the character. Identifying the font requires comparing the character in the scanned image with the corresponding vector characters in a vector character library; once the font is determined, the matching vector character replaces the scanned character. However, the vector characters in an existing library are the result of vectorizing character images, whereas the characters taken from the scanned image have not yet been vectorized, so the two cannot be matched directly by comparing similarity. To solve this problem, the prior art converts the standard vector characters in the library into bitmaps, uses these bitmaps as the reference, matches them against the characters in the scanned image, and replaces each scanned character with the corresponding vector character according to the matching result.

However, the same character differs only slightly between some fonts, and the quality of a scanned image is unstable because it is easily affected by various factors. As a result, a character in the scanned image may appear similar to the bitmap converted from a vector character even though the two fonts are not actually the same, so incorrect vectorization results are easily produced.

Disclosure of Invention

The invention provides a method for vectorizing an electronic file based on a layout file, which aims to solve the problem that existing vectorization results are inaccurate.

The method for vectorizing an electronic file based on a layout file adopts the following technical scheme:

acquiring a scanning bitmap of a paper document;

acquiring an index label, a font size label and an outer surrounding frame of each character on the scanning bitmap;

according to the index label and the font size label, obtaining vector characters of different fonts corresponding to the corresponding characters in a vector character library, converting the vector characters into a vector character bitmap, obtaining an outer surrounding frame of each vector character in the vector character bitmap, and obtaining binary images of the outer surrounding frame of the characters and corresponding images in the outer surrounding frame of the vector characters;

acquiring all closed edges in the two binary images, acquiring the maximum distance from the central point of the outer surrounding frame to the corresponding closed edge, sequentially acquiring the distances from all edge points to the central point of the corresponding outer surrounding frame in the clockwise direction by taking the target edge point corresponding to the maximum distance as a starting point, and acquiring a distance sequence of the closed edge and a distance sequence set corresponding to the two binary images;

sampling each distance sequence in the distance sequence set by utilizing a plurality of different sampling scales to obtain a corresponding target sequence set under each sampling scale, calculating the similarity distance of every two target sequences between the two target sequence sets under the same sampling scale, performing KM (Kuhn-Munkres) matching on the target sequences in the two target sequence sets according to the similarity distance, and acquiring the mean similarity distance of the matched target sequences, the mean being used as a matching effect evaluation value;

and calculating the font matching degree of the characters and the vector characters of each corresponding font according to the similarity and the matching effect evaluation value of the KM matching results of the target sequences in the corresponding target sequence set under each two different sampling scales, taking the vector characters corresponding to the maximum font matching degree as the replacement objects of the corresponding characters in the vector diagram, and obtaining the vectorized file.

Preferably, the step of setting a plurality of different sampling scales comprises:

acquiring the length of a closed edge corresponding to the shortest-length distance sequence in all the distance sequences in the two distance sequence sets;

taking one tenth of the length of the closed edge corresponding to the distance sequence with the shortest length as the maximum sampling scale;

the sampling scales are, in order, 1, 2, 3, …, up to the maximum sampling scale.

Preferably, the step of sampling each distance sequence in the distance sequence set by using a plurality of different sampling scales to obtain a corresponding target sequence set under each sampling scale includes:

averaging each distance in a distance sequence of the distance sequence set with the distance located one sampling scale further along the same sequence; taking the average value as a target value in the target sequence;

and by analogy, obtaining a corresponding target sequence under each sampling scale, and obtaining a target sequence set corresponding to the distance sequence set.

Preferably, a dynamic time warping algorithm is used to calculate the similarity distance between every two target sequences between two target sequence sets at the same sampling scale.

Preferably, the step of obtaining the similarity distance average value of the matched target sequence includes:

after matching is completed, recording the similarity distance of each target sequence left unmatched as 1;

calculating the similarity distance between the two target sequences of each matched pair by using a dynamic time warping algorithm;

and calculating the mean of all these similarity distances, covering both the unmatched target sequences and the matched pairs of target sequences.

Preferably, the step of calculating the similarity of KM matching results for target sequences in the corresponding target sequence set at two different sampling scales includes:

acquiring corresponding matched target sequence pairs in corresponding target sequence sets under two different sampling scales, and solving intersection sets to obtain the number of matched identical target sequence pairs;

obtaining the total matching number by summing the corresponding matched target sequence pairs in the corresponding target sequence sets under two different sampling scales;

and taking the ratio of the number of the matched identical target sequence pairs to the total matched number as the similarity of KM matching results.

Preferably, the step of calculating the font matching degree of the text and the vector text of each corresponding font according to the similarity and the matching effect evaluation value of the KM matching results of the target sequences in the corresponding target sequence set under each two different sampling scales comprises:

calculating the average value of the corresponding matching effect evaluation values under each two different sampling scales;

calculating the product of the similarity of KM matching results of target sequences in the corresponding target sequence set under each two different sampling scales, the average value of the matching effect evaluation values and the weights of the corresponding two different sampling scales, and taking the product as the initial matching degree;

and summing all initial matching degrees corresponding to every two different sampling scales in all sampling scales to obtain the font matching degree of each character and the vector characters of each corresponding font.

Preferably, the image correction is performed on the scanned bitmap, and the image correction includes: correcting the pose of the scanning bitmap of the paper document according to the scanning angle to obtain a front view of the scanning bitmap of the paper document, wherein the front view of the scanning bitmap is used as the scanning bitmap.

Preferably, the characters on the scanning bitmap are acquired by utilizing an OCR algorithm, and the outer surrounding frame of each character is acquired.

Preferably, the index tag of the text is a tag of the text to be searched.

The method for vectorizing an electronic file based on a layout file has the following beneficial effects:

1. The distance between each edge point on a closed edge and the center point, in both the scanned bitmap of the character and the vector character bitmap, is used as the descriptive feature value of the closed edge. Matching is then carried out at multiple sampling scales based on the distance sequences, and the font of each character is identified from the matching results, so that the influence of scanned image quality on matching accuracy is avoided.

2. The matching effect is obtained by calculating similarity on the matching result between the distance sequence set of the closed edges of the scanned bitmap and the distance sequence set of the closed edges of the vector character bitmap. The matching degree between the character and each corresponding font is then evaluated comprehensively from the matching effect and from the similarity of the matching results across sampling scales, so that the vector character of the font matching the character is determined more accurately and the character is vectorized accurately.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the invention; a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of an embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention;

FIG. 2 is a schematic diagram of a distance sequence obtained in an embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

An embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention, as shown in FIG. 1, comprises the following steps.

S1, acquiring a scanning bitmap of the paper document.

Specifically, a scanned bitmap of the paper document is obtained with a digital camera. To prevent a skewed capture from affecting subsequent image processing, image correction is performed on the scanned bitmap. In this embodiment the image correction includes: correcting the pose of the scanned bitmap of the paper document according to the scanning angle to obtain a front view of the scanned bitmap, the front view then being used as the scanned bitmap.

S2, acquiring an index label, a font size label and an outer surrounding frame of each character on the scanning bitmap; acquiring, according to the index label and the font size label, the vector characters of different fonts corresponding to the character in a vector character library; converting the vector characters into a vector character bitmap; acquiring the outer surrounding frame of each vector character in the vector character bitmap; and acquiring binary images of the image in the outer surrounding frame of the character and of the image in the outer surrounding frame of each vector character.

Because each character recognized by the OCR algorithm on the scanned bitmap of the paper document carries an index label and a font size label, the vector characters of all the different fonts corresponding to that character can be retrieved from the vector character library.

Based on this, an index label, a font size label, and an outer surrounding frame are first obtained for each character on the scanned bitmap. The index label of a character is the label used to look the character up, and it is identical to the label of the same character in the character library. Taking the i-th character as an example, the character has an index label and a font size label, and these two labels are used to retrieve from the vector character library the vector characters of all the different fonts corresponding to the i-th character. The vector character of the k-th font corresponding to the i-th character is then converted into a vector character bitmap, where k = 1, 2, 3, ….

Still taking the i-th character as an example, the outer surrounding frame of the character in the scanned bitmap and the outer surrounding frame of the vector character of each corresponding font in the vector character bitmap are obtained, and the images inside these frames are binarized. Both the scanned bitmap and the vector character bitmap consist of a character part and a background part, so pixels of the character part are set to 1 and pixels of the background part are set to 0, which completes the binarization of the bitmap inside the frame. This embodiment records one binary image for the image in the surrounding frame of the i-th character, and one binary image for the image in the surrounding frame of the vector character of each font k corresponding to the i-th character, where k = 1, 2, 3, …. It should be noted that, in this embodiment, the outer surrounding frame of a character and the outer surrounding frame of a vector character both refer to the minimum bounding rectangle.
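As a non-limiting illustration, the binarization and minimum bounding rectangle described above can be sketched in plain Python. The grayscale input format, the fixed threshold of 128, and the function names `binarize` and `min_bounding_rect` are assumptions made for this sketch only; the embodiment does not specify how the threshold is chosen.

```python
from typing import List, Tuple

Gray = List[List[int]]  # rows of grayscale values, 0 (black) .. 255 (white)


def binarize(patch: Gray, threshold: int = 128) -> List[List[int]]:
    """Mark character (dark) pixels as 1 and background pixels as 0.

    The fixed threshold is an assumption of this sketch.
    """
    return [[1 if v < threshold else 0 for v in row] for row in patch]


def min_bounding_rect(binary: List[List[int]]) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), inclusive, of all pixels equal to 1."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for row in binary for c, v in enumerate(row) if v]
    if not rows:
        raise ValueError("no character pixels found")
    return min(rows), min(cols), max(rows), max(cols)


# Hypothetical 4x4 patch containing a small dark shape.
patch = [
    [255, 255, 255, 255],
    [255, 30, 40, 255],
    [255, 20, 255, 255],
    [255, 255, 255, 255],
]
binary = binarize(patch)
print(binary)
print(min_bounding_rect(binary))
```

The same two functions would be applied both to the scanned character and to the bitmap rendered from each candidate vector character, so that the two binary images are produced in the same way.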

S3, acquiring all the closed edges in the two binary images, acquiring the maximum distance from the central point of the outer bounding box to the corresponding closed edge, sequentially acquiring the distances from all the edge points to the central point of the corresponding outer bounding box in the clockwise direction by taking the target edge point corresponding to the maximum distance as a starting point, and acquiring the distance sequence of the closed edge and the distance sequence set corresponding to the two binary images.

Specifically, an edge detection algorithm is applied to each binary image to obtain all the closed edges in it. For example, the Chinese character '二' (two) yields two closed edges, and the Chinese character '三' (three) yields three closed edges. To obtain the distance sequence of one closed edge, the center point of the outer surrounding frame containing that edge is first taken as the reference point, and the distance between the reference point and each edge point on the closed edge is calculated with the distance formula, which is prior art. The edge point with the maximum distance is taken as the starting point, and the distances from all edge points to the center point of the outer surrounding frame are then collected in order along the clockwise direction. These distances serve as the description values of the closed edge and together form its distance sequence. The distance sequences of all closed edges within one surrounding frame form the distance sequence set of that frame.

As shown in FIG. 2, the pixel distance between each edge point on a closed edge and the reference point is used as the description value of that closed edge. For the binary image in the bounding box of the i-th character, the distance sequence of each closed edge therefore lists, for every edge point on that closed edge, the distance between the edge point and the center point, and its length equals the total number of edge points on the closed edge. The distance sequence of each closed edge of the binary image in the bounding box of the vector character of the k-th font corresponding to the i-th character is expressed in the same way. Similarly, the distance sequences and the corresponding distance sequence set can be obtained for the binary image in the bounding box of the vector character of every corresponding font in the vector character bitmap.
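The construction of a distance sequence can be sketched as follows. This is a minimal illustration, assuming that the edge points of one closed edge are already available in clockwise order (contour tracing is outside the sketch) and that, when several points share the maximum distance, the first one is used as the starting point; the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def distance_sequence(edge: List[Point], center: Point) -> List[float]:
    """Distances from `center` to each edge point, starting at the farthest point.

    `edge` is assumed to list the points of one closed edge in clockwise order.
    """
    dists = [math.dist(p, center) for p in edge]
    start = dists.index(max(dists))        # first point at the maximum distance
    return dists[start:] + dists[:start]   # rotate so the sequence begins there


def distance_sequence_set(edges: List[List[Point]], center: Point) -> List[List[float]]:
    """One distance sequence per closed edge inside the same surrounding frame."""
    return [distance_sequence(e, center) for e in edges]


# Hypothetical closed edge with four points around a frame center at (0, 0).
edge = [(1.0, 0.0), (0.0, 2.0), (-1.0, 0.0), (0.0, -3.0)]
seq = distance_sequence(edge, (0.0, 0.0))
print(seq)
```

Starting every sequence at the farthest edge point gives the scanned character and the vector character a common starting convention, so that their sequences can be compared directly.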

S4, sampling each distance sequence in the distance sequence set by utilizing a plurality of different sampling scales to obtain a corresponding target sequence set under each sampling scale, calculating the similarity distance of each two target sequences between two target sequence sets under the same sampling scale, performing KM (KM) matching on the target sequences in the two target sequence sets according to the similarity distances, and acquiring a similarity distance average value of the matched target sequences, wherein the similarity distance average value is used as a matching effect evaluation value.

To measure the font matching degree between the binary image of a character and the binary image of a corresponding vector character, each distance sequence in the distance sequence set of the closed edges of the character's binary image must be paired, at multiple scales, with the distance sequences in the distance sequence set of the closed edges of the vector character's binary image.

The sampling scales are set first. Because the length of each closed edge is limited, the length of the closed edge corresponding to the shortest distance sequence among all the distance sequences in the two distance sequence sets is acquired. One tenth of that length is taken as the maximum sampling scale, denoted C, so that the sampling scales are, in order, 1, 2, 3, …, C. In this embodiment, for each sampling scale, one target sequence set is obtained by sampling every distance sequence in the distance sequence set of the binary image in the bounding box of the i-th character, and another target sequence set is obtained by sampling every distance sequence in the distance sequence set of the binary image in the bounding box of the vector character of the k-th font of the i-th character.
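A minimal sketch of how the sampling scales could be derived is given below. It assumes that the length of a closed edge equals the number of edge points in its distance sequence, that "one tenth" is rounded down, and that at least one scale is always kept; none of these choices is stated in the embodiment, and the function name is hypothetical.

```python
from typing import List


def sampling_scales(set_a: List[List[float]], set_b: List[List[float]]) -> List[int]:
    """Return the sampling scales 1..C for two distance sequence sets.

    C is one tenth of the shortest sequence length (rounded down, and kept
    at 1 or more; both are assumptions of this sketch).
    """
    shortest = min(len(seq) for seq in set_a + set_b)
    max_scale = max(1, shortest // 10)
    return list(range(1, max_scale + 1))


# Hypothetical sets: sequence lengths 57 and 120 for the scanned character,
# 43 for the vector character, so the shortest length is 43 and C is 4.
scales = sampling_scales([[0.0] * 57, [0.0] * 120], [[0.0] * 43])
print(scales)
```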

Sampling each distance sequence in a distance sequence set at a given sampling scale to obtain the corresponding target sequence set comprises the following steps: each distance in a distance sequence is averaged with the distance located one sampling scale further along the same sequence, and the average is taken as a target value of the target sequence; proceeding in the same way along the sequence gives the target sequence corresponding to each distance sequence at each sampling scale; the target sequences corresponding to all the distance sequences in a distance sequence set form the target sequence set.
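The sampling step can be illustrated with the short sketch below. It assumes that indexing wraps around the end of the sequence, on the grounds that the edge is closed, so that the target sequence has the same length as the distance sequence; the embodiment does not state how the end of the sequence is handled, and the function names are hypothetical.

```python
from typing import List


def sample_sequence(distances: List[float], scale: int) -> List[float]:
    """Average each distance with the one `scale` positions further along.

    Indices wrap around (assumption: the edge is closed, so the sequence is cyclic).
    """
    n = len(distances)
    return [(distances[j] + distances[(j + scale) % n]) / 2 for j in range(n)]


def target_sequence_set(distance_set: List[List[float]], scale: int) -> List[List[float]]:
    """Target sequences of every distance sequence in a set at one sampling scale."""
    return [sample_sequence(seq, scale) for seq in distance_set]


seq = [4.0, 2.0, 6.0, 8.0]
print(sample_sequence(seq, 1))
print(sample_sequence(seq, 2))
```

Larger sampling scales average points that lie further apart on the edge, so fine detail such as scanning noise is smoothed out more strongly as the scale grows.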

Specifically, the similarity distance between every two target sequences in the two target sequence sets in the same sampling scale is calculated, and the similarity distance is the DTW distance, that is, the similarity distance between every two target sequences in the two target sequence sets in the same sampling scale is calculated by using a dynamic time warping algorithm, which is an algorithm in the prior art, and is not described in detail in this embodiment.
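For reference, a textbook form of the dynamic time warping distance is sketched below. It uses the absolute difference as the local cost and returns the raw accumulated cost; the normalization to the range 0 to 1 that the embodiment relies on is not specified in the text and is therefore left out of the sketch.

```python
import math
from typing import List


def dtw_distance(a: List[float], b: List[float]) -> float:
    """Classic dynamic time warping distance between two sequences.

    Local cost is |a[i] - b[j]|; the result is the unnormalized accumulated cost.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
print(dtw_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Dynamic time warping tolerates sequences of different lengths, which matters here because the closed edges of the scanned character and of the vector character generally contain different numbers of edge points.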

The step of performing KM matching on the target sequences in the two target sequence sets according to the similarity distance and obtaining the mean similarity distance is as follows. The KM algorithm finds the maximum-weight perfect matching of a bipartite graph. Here it pairs the target sequences of one target sequence set with those of the other, using the similarity distance between target sequences as the matching criterion, and it also gives the number of matched pairs. After matching is completed, the similarity distance of each target sequence left unmatched is recorded as 1, and the similarity distance of each matched pair of target sequences is calculated with the dynamic time warping algorithm. The mean of the similarity distances of all unmatched target sequences and all matched pairs is then calculated. To keep this mean within the range 0 to 1, all similarity distances in this embodiment are normalized similarity distances. The mean similarity distance is used as the matching effect evaluation value of the matching result: the closer it is to 0, the better the matching effect, and the closer it is to 1, the worse the matching effect. At each sampling scale, a matching effect evaluation value is recorded for the two target sequence sets corresponding to the binary image in the bounding box of the i-th character and the binary image in the bounding box of the vector character of the k-th font corresponding to the i-th character.
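The matching and evaluation step can be sketched as below. To stay short and dependency-free, the sketch replaces the KM algorithm with an exhaustive search over assignments, which returns the same optimal assignment but is only practical for a handful of closed edges. It further assumes that the matching minimizes the total similarity distance, that the distances passed in are already normalized to the range 0 to 1, and that each unmatched target sequence contributes one distance of 1 to the mean; these readings and the function name are assumptions.

```python
from itertools import permutations
from typing import List, Tuple


def match_and_evaluate(dist: List[List[float]]) -> Tuple[List[Tuple[int, int]], float]:
    """Match target sequences and return (matched pairs, evaluation value).

    dist[r][c] is the normalized similarity distance between target sequence r
    of the scanned character and target sequence c of the vector character.
    Exhaustive search stands in for the KM algorithm (small inputs only).
    """
    n, m = len(dist), len(dist[0])
    transposed = n > m
    if transposed:  # make sure rows are the smaller side
        dist = [list(col) for col in zip(*dist)]
        n, m = m, n
    best_cols, best_total = (), float("inf")
    for cols in permutations(range(m), n):
        total = sum(dist[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_cols, best_total = cols, total
    pairs = [(c, r) if transposed else (r, c) for r, c in enumerate(best_cols)]
    unmatched = m - n  # sequences on the larger side with no partner
    evaluation = (best_total + 1.0 * unmatched) / m
    return sorted(pairs), evaluation


# Hypothetical distances: 2 closed edges in the scan, 3 in the vector character.
pairs, value = match_and_evaluate([[0.1, 0.9, 0.8], [0.7, 0.2, 0.6]])
print(pairs, value)
```

In a full implementation the exhaustive search would be replaced by a polynomial-time assignment solver; the evaluation value computed from the resulting pairs would be unchanged.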

And S5, calculating the font matching degree of the characters and the vector characters according to the similarity and the matching effect evaluation value of KM matching results of the target sequences in the corresponding target sequence sets under every two different sampling scales, taking the vector character corresponding to the maximum font matching degree as a replacement object of the corresponding character in the vector diagram, and obtaining the vectorized file.

Specifically, the step of obtaining the similarity of KM matching results of target sequences in the corresponding target sequence set under each two different sampling scales includes: acquiring the number of matched identical target sequence pairs obtained by intersection of corresponding matched target sequence pairs in corresponding target sequence sets under each two different sampling scales; obtaining the total matching number of the corresponding matched target sequence pairs in the corresponding target sequence sets under each two different sampling scales; and taking the ratio of the number of the matched identical target sequence pairs to the total matched number as the similarity of KM matching results.
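The similarity of two KM matching results can be sketched as follows, reading the text literally: the numerator is the number of matched pairs common to both sampling scales, and the denominator is the sum of the numbers of matched pairs at the two scales. Representing a matched pair as a tuple of sequence indices, and the function name, are assumptions of this sketch.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # (index in the character's set, index in the vector character's set)


def km_result_similarity(pairs_a: Iterable[Pair], pairs_b: Iterable[Pair]) -> float:
    """Shared matched pairs divided by the summed number of matched pairs."""
    a, b = set(pairs_a), set(pairs_b)
    total = len(a) + len(b)
    return len(a & b) / total if total else 0.0


# Hypothetical matching results at two sampling scales.
scale_1 = [(0, 0), (1, 1), (2, 2)]
scale_2 = [(0, 0), (1, 2), (2, 1)]
print(km_result_similarity(scale_1, scale_2))
print(km_result_similarity(scale_1, scale_1))
```

Under this literal reading two identical matching results score 0.5 rather than 1, because each shared pair is counted once in the numerator and twice in the denominator.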

The step of calculating the font matching degree of the characters and the vector characters according to the similarity and the matching effect evaluation value of the KM matching results of the target sequences in the corresponding target sequence set under every two different sampling scales comprises the following steps: calculating the average value of the corresponding matching effect evaluation values under each two different sampling scales; calculating the product of the similarity of KM matching results of target sequences in the corresponding target sequence set under each two different sampling scales, the average value of the matching effect evaluation values and the weights of the corresponding two different sampling scales, and taking the product as the initial matching degree; and summing all initial matching degrees corresponding to every two different sampling scales in all sampling scales to obtain the font matching degree of each character and the corresponding vector character, wherein the font matching degree of each character and the vector character of the corresponding font is calculated according to the following formula:

In the formula, the result is the font matching degree between the i-th character and the vector character of the k-th font corresponding to the i-th character.

The two matching effect evaluation values are those of the two target sequence sets corresponding to the binary image in the bounding box of the i-th character and the binary image in the bounding box of the vector character of the k-th font corresponding to the i-th character, taken at the first and at the second sampling scale of a pair, respectively.

The similarity term is the similarity between the KM matching result of the target sequences in the corresponding target sequence sets at the first sampling scale and the KM matching result at the second sampling scale; the two sampling scales of a pair are different.

The summation covers all sampling scales, up to the total number of sampling scales.

It should be noted that the matching effect evaluation values at both sampling scales are normalized similarity distances, so the closer an evaluation value is to 0, the better the matching effect, and the closer it is to 1, the worse the matching effect. The weight term relates to the gap between the two sampling scales: the larger the gap, the more the two sampling scales differ. If the pairing results obtained at two sampling scales that differ widely are nevertheless similar, then the binary image in the bounding box of the i-th character and the binary image in the bounding box of the vector character of the k-th font corresponding to the i-th character give consistent matching results across scales. At the same time, the closer the font matching degree is to 0, the smaller the difference between the i-th character and the vector character of the k-th font corresponding to the i-th character, that is, the better they match. In short, the more similar the matching results at the different sampling scales and the better the matching effect of those results, the more the vector character of the k-th font corresponding to the i-th character should be taken as the replacement for the i-th character.
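The structure described in words above (a product of the KM-result similarity, the mean of the two matching effect evaluation values, and the weight, summed over every two different sampling scales) can be written compactly as below. All symbols are assumptions introduced for this sketch: P_{i,k} is the font matching degree of the i-th character and its k-th font, E_a and E_b are the matching effect evaluation values at sampling scales a and b, S_{a,b} is the similarity of the KM matching results at the two scales, w_{a,b} is the weight of the pair of scales, and N is the total number of sampling scales. The exact form of the weight, and whether the sum runs over ordered or unordered pairs, is not stated in the text.

```latex
P_{i,k} \;=\; \sum_{1 \le a < b \le N} w_{a,b}\, S_{a,b}\, \frac{E_{a} + E_{b}}{2}
```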

This embodiment takes, for each character, the vector character whose font matching degree with the character is the largest among all corresponding fonts as the replacement object of that character in the vector image, thereby obtaining the vectorized file.
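Putting the last step together, the sketch below accumulates the font matching degree over every two different sampling scales and then selects the font with the maximum degree, following the text literally. The weight function `scale_gap_weight` (the absolute difference of the two scales) is a hypothetical placeholder, since the text does not give the weight; the font names and all numeric values are invented for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]


def scale_gap_weight(a: int, b: int) -> float:
    """Hypothetical weight: grows with the gap between the two sampling scales."""
    return float(abs(a - b))


def font_matching_degree(
    evaluations: Dict[int, float],
    matchings: Dict[int, List[Pair]],
    weight: Callable[[int, int], float] = scale_gap_weight,
) -> float:
    """Sum of weight * KM-result similarity * mean evaluation over scale pairs.

    evaluations[s] and matchings[s] hold the matching effect evaluation value
    and the matched pairs obtained at sampling scale s.
    """
    degree = 0.0
    for a, b in combinations(sorted(evaluations), 2):
        pa, pb = set(matchings[a]), set(matchings[b])
        total = len(pa) + len(pb)
        similarity = len(pa & pb) / total if total else 0.0
        mean_eval = (evaluations[a] + evaluations[b]) / 2
        degree += weight(a, b) * similarity * mean_eval
    return degree


def pick_font(degrees: Dict[str, float]) -> str:
    """Font whose vector character has the maximum font matching degree."""
    return max(degrees, key=degrees.get)


# Invented example with two sampling scales and identical matchings.
degree = font_matching_degree(
    {1: 0.2, 2: 0.4},
    {1: [(0, 0), (1, 1)], 2: [(0, 0), (1, 1)]},
)
print(degree)
print(pick_font({"FontA": degree, "FontB": 0.05}))
```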

In the method for vectorizing an electronic file based on a layout file described above, the distance between each edge point on a closed edge and the center point, in both the scanned bitmap of the character and the vector character bitmap, is used as the descriptive feature value of the closed edge. Matching is then carried out at multiple sampling scales based on the distance sequences, and the font of each character is identified from the matching results, so that the influence of scanned image quality on matching accuracy is avoided. In addition, the matching effect is obtained by calculating similarity on the matching result between the distance sequence set of the closed edges of the scanned bitmap and the distance sequence set of the closed edges of the vector character bitmap. The matching degree between the character and each corresponding font is then evaluated comprehensively from the matching effect and from the similarity of the matching results across sampling scales, so that the vector character of the font matching the character is determined more accurately and the character is vectorized accurately.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. A method for vectorizing an electronic file based on a layout file, the method comprising:

acquiring a scanning bitmap of a paper document;

calculating the font matching degree of the characters and the vector characters of each corresponding font according to the similarity and matching effect evaluation value of KM matching results of the target sequences in the corresponding target sequence set under each two different sampling scales, taking the vector characters corresponding to the maximum font matching degree as the replacement objects of the corresponding characters in the vector diagram, and obtaining a vectorized file;

the step of sampling each distance sequence in the distance sequence set by utilizing a plurality of different sampling scales to obtain a corresponding target sequence set under each sampling scale comprises the following steps:

and by analogy, obtaining a corresponding target sequence under each sampling scale, and obtaining a target sequence set corresponding to the distance sequence set;

calculating the similarity distance of every two target sequences between two target sequence sets under the same sampling scale by using a dynamic time warping algorithm;

the step of obtaining the similarity distance average value of the matched target sequence comprises the following steps:

after the matching is completed, calculating the similarity distances of all the two unmatched target sequences and the similarity distance average value of the similarity distances of all the two matched target sequences;

the step of calculating the similarity of KM matching results of target sequences in the corresponding target sequence sets under two different sampling scales comprises the following steps:

the ratio of the number of the matched identical target sequence pairs to the total matched number is used as the similarity of KM matching results;

the step of calculating the font matching degree of the text and the vector text of each corresponding font according to the similarity and the matching effect evaluation value of the KM matching results of the target sequences in the corresponding target sequence set under each two different sampling scales comprises the following steps:

2. The method of claim 1, wherein the step of setting a plurality of different sampling scales comprises:

3. The method of vectorizing electronic documents based on layout documents of claim 1, wherein image correction is performed on the scanned bitmaps, the image correction comprising: correcting the pose of the scanning bitmap of the paper document according to the scanning angle to obtain a front view of the scanning bitmap of the paper document, wherein the front view of the scanning bitmap is used as the scanning bitmap.

4. The method for vectorizing electronic files based on layout files according to claim 1, wherein characters on a scanning bitmap are acquired by utilizing an OCR algorithm, and an outer surrounding frame of each character is acquired.

5. The method of claim 1, wherein the index tag of the text is a tag of the text to be searched.