CN115346227A - Method for vectorizing electronic file based on layout file - Google Patents
- Publication number
- CN115346227A (application CN202211266067.0A)
- Authority
- CN
- China
- Prior art keywords
- distance
- matching
- vector
- target sequence
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/20—Drawing from basic elements, e.g. lines or circles
- G06T11/203—Drawing of straight lines or curves
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/60—Editing figures and text; Combining figures or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/16—Image preprocessing
- G06V30/162—Quantising the image signal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1801—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/19007—Matching; Proximity measures
- G06V30/19093—Proximity measures, i.e. similarity or distance measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/24—Character recognition characterised by the processing or recognition method
- G06V30/242—Division of the character sequences into groups prior to recognition; Selection of dictionaries
- G06V30/244—Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font
- G06V30/245—Font recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Character Discrimination (AREA)
Abstract
The invention relates to the technical field of image recognition, and in particular to a method for vectorizing an electronic file based on a layout file. The method comprises: acquiring a scanned bitmap of a paper document; acquiring, for each character in the scanned bitmap, the vector characters of the different fonts corresponding to that character; acquiring binary images of the content inside the outer bounding boxes of the character and of each vector character; acquiring the distances from the center point of the outer bounding box to each closed edge, which gives a distance sequence for each closed edge and a distance sequence set for each binary image; sampling the distance sequence sets at a plurality of different sampling scales to obtain target sequence sets; calculating the similarity distances between the target sequences of the character and those of the vector character and performing KM matching; obtaining a matching effect evaluation value from the similarity distances of the matched target sequences; obtaining the font matching degree between the character and each corresponding vector character; and determining the replacement object of the character to obtain a vectorized file.
Description
Technical Field
The invention relates to the technical field of image recognition, in particular to a method for vectorizing an electronic file based on a layout file.
Background
With the development of informatization, electronic files are used ever more widely and are gradually replacing paper files in daily office work. In application scenarios such as archive management, electronic filing makes it easier to store large volumes of paper material. To digitize a paper document, it must first be scanned, that is, converted into a computer image. The bitmap obtained by scanning becomes distorted when enlarged; to avoid this, the computer image of the paper document is vectorized to obtain a vector graphic.
A vector graphic is an image that is not distorted however it is enlarged, reduced, rotated, or stretched. Because it is represented by the parameters of mathematical formulas, it also requires less storage, so it serves the purpose of converting paper documents into electronic documents for storage.
When a scanned image of a paper document is vectorized, OCR is generally used to recognize the characters in the document, and each character is then replaced directly by a vector character according to the recognition result. In this process the font and font size of each recognized character must also be determined. The font size can be determined directly from the size of the character, whereas the font is determined by comparing the character in the scanned image with the corresponding vector characters in a vector character library; once the font is determined, the corresponding vector character is substituted and displayed.
However, some fonts differ only slightly, and the quality of scanned images is unstable because it is affected by many factors. As a result, a character in the scanned image may look similar to the bitmap of a vector character even though the two fonts are actually different, which easily produces an incorrect vectorization result.
Disclosure of Invention
The invention provides a method for vectorizing an electronic file based on a layout file, which aims to solve the problem of inaccurate vectorization results in the prior art.
The method for vectorizing an electronic file based on a layout file adopts the following technical scheme:
acquiring a scanned bitmap of a paper document;
acquiring an index tag and a font size tag of each character on the scanned bitmap, and the outer bounding box of each character;
acquiring, from a vector character library and according to the index tag and the font size tag, the vector characters of the different fonts corresponding to each character, converting the vector characters into vector character bitmaps, acquiring the outer bounding box of each vector character in the vector character bitmap, and acquiring binary images of the content inside the outer bounding box of the character and inside the outer bounding box of the vector character;
acquiring all closed edges in the two binary images, acquiring the maximum distance from the center point of the outer bounding box to the corresponding closed edge, taking the target edge point corresponding to the maximum distance as the starting point and acquiring, in clockwise order, the distances from all edge points to the center point of the corresponding outer bounding box, thereby obtaining a distance sequence for each closed edge and a distance sequence set for each of the two binary images;
sampling each distance sequence in the distance sequence sets at a plurality of different sampling scales to obtain a target sequence set for each sampling scale, calculating the similarity distance between every two target sequences of the two target sequence sets at the same sampling scale, performing KM (Kuhn-Munkres) matching on the target sequences of the two target sequence sets according to the similarity distance, acquiring the mean similarity distance of the matched target sequences, and taking this mean as the matching effect evaluation value;
and calculating the font matching degree between the character and the vector character of each corresponding font according to the matching effect evaluation values and the similarity of the KM matching results of the target sequence sets at every two different sampling scales, taking the vector character with the maximum font matching degree as the replacement object of the corresponding character in the vector graphic, and obtaining the vectorized file.
Preferably, the step of setting a plurality of different sampling scales comprises:
acquiring the length of the closed edge corresponding to the shortest distance sequence among all distance sequences in the two distance sequence sets;
taking one tenth of the length of that closed edge as the maximum sampling scale;
the sampling scales are then 1, 2, 3, ..., up to the maximum sampling scale.
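The choice of sampling scales can be illustrated with a minimal Python sketch. It assumes that the length of a closed edge equals the number of edge points in its distance sequence and that one tenth of that length is rounded down; neither detail is stated in the text, and the function name `sampling_scales` is hypothetical.

```python
def sampling_scales(distance_sequences_a, distance_sequences_b):
    """Return the sampling scales 1..C for two distance sequence sets.

    Assumption: the closed-edge length is the number of points in the
    distance sequence, and C is one tenth of the shortest length, rounded down.
    """
    all_sequences = distance_sequences_a + distance_sequences_b
    shortest = min(len(seq) for seq in all_sequences)
    max_scale = shortest // 10
    return list(range(1, max_scale + 1))


# The shortest sequence has 57 points, so the scales run from 1 to 5.
print(sampling_scales([[0.0] * 57, [0.0] * 120], [[0.0] * 64]))  # [1, 2, 3, 4, 5]
```

Under these assumptions a closed edge with fewer than ten points yields no sampling scale at all, so very small contours would need separate handling.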
Preferably, the step of sampling each distance sequence in the distance sequence set at a plurality of different sampling scales to obtain a target sequence set for each sampling scale comprises:
averaging each distance in a distance sequence of the distance sequence set with the distance located one sampling-scale interval after it, and taking the mean as a target value of the target sequence;
and by analogy, obtaining the target sequence for each sampling scale, and obtaining the target sequence set corresponding to the distance sequence set.
Preferably, a dynamic time warping algorithm is used to calculate the similarity distance between every two target sequences in two target sequence sets under the same sampling scale.
Preferably, the step of obtaining the mean similarity distance of the matched target sequences comprises:
after matching is completed, recording the similarity distance of every two unmatched target sequences as 1;
calculating the similarity distance between every two matched target sequences by using the dynamic time warping algorithm;
and calculating the mean of the similarity distances of all unmatched pairs and all matched pairs of target sequences.
Preferably, the step of calculating the similarity of the KM matching results of the target sequence sets at two different sampling scales comprises:
acquiring the matched target sequence pairs of the target sequence sets at the two different sampling scales and taking their intersection to obtain the number of identical matched target sequence pairs;
taking the union of the matched target sequence pairs of the target sequence sets at the two different sampling scales to obtain the total matching number;
and taking the ratio of the number of identical matched target sequence pairs to the total matching number as the similarity of the KM matching results.
Preferably, the step of calculating the font matching degree between the text and the vector text of each corresponding font according to the similarity of KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value includes:
calculating the mean value of the corresponding matching effect evaluation values under every two different sampling scales;
calculating the product of the similarity of KM matching results of target sequences in the corresponding target sequence set under every two different sampling scales, the mean value of the matching effect evaluation value and the weights of the corresponding two different sampling scales, and taking the product as the initial matching degree;
and summing all the initial matching degrees corresponding to every two different sampling scales in all the sampling scales to obtain the font matching degree of each character and the vector character of each corresponding font.
Preferably, the image correction is performed on the scanned bitmap, the image correction including: and correcting the pose of the scanning bitmap of the paper document according to the scanning angle to obtain a front view of the scanning bitmap of the paper document, and taking the front view of the scanning bitmap as the scanning bitmap.
Preferably, the OCR algorithm is used to obtain the text on the scanning bitmap and obtain the outer bounding box of each text.
Preferably, the index tag of the text is a tag of the text to be searched.
The method for vectorizing the electronic file based on the layout file has the advantages that:
1. The distances between the edge points on each closed edge and the center point, in both the scanned bitmap of a character and the vector character bitmap, are used as the descriptor of that closed edge. The distance sequences are then paired at multiple sampling scales, and the font of each character is identified from the pairing results, so that the image quality of the scan does not affect the matching accuracy.
2. The similarity between the matching results of the distance sequence set of the closed edges of the scanned bitmap and those of the vector character bitmap is calculated, and the matching effect is obtained. The matching degree between the character and each corresponding font is then evaluated comprehensively from the matching effects at multiple sampling scales and the similarity of the matching results, so that the vector character whose font matches the character is determined more accurately and the character is vectorized accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic structural diagram of an embodiment of a method for vectorizing an electronic file based on a layout file according to the present invention;
Fig. 2 is a schematic structural diagram of obtaining a distance sequence in an embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
An embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention is shown in Fig. 1 and comprises the following steps.
S1, acquiring a scanned bitmap of a paper document.
Specifically, the scanned bitmap of the paper document is captured with a digital camera. To prevent skew in the captured document from affecting subsequent image processing, the scanned bitmap is corrected in this embodiment. The image correction comprises: correcting the pose of the scanned bitmap of the paper document according to the scanning angle to obtain a front view of the scanned bitmap, and taking that front view as the scanned bitmap.
S2, acquiring the index tag, font size tag, and outer bounding box of each character on the scanned bitmap; acquiring, from the vector character library and according to the index tag and the font size tag, the vector characters of the different fonts corresponding to each character; converting the vector characters into vector character bitmaps; acquiring the outer bounding box of each vector character in the vector character bitmap; and acquiring binary images of the content inside the outer bounding box of the character and inside the outer bounding box of the vector character.
Because the vector characters are data stored in the vector character library, and the characters recognized from the scanned bitmap of the paper document by the OCR algorithm carry index tags and font size tags, the vector characters of all the different fonts corresponding to each character can be retrieved from the vector character library.
Based on this, the index tag, font size tag, and outer bounding box of each character on the scanned bitmap are obtained first. The index tag of a character is the tag used to look the character up, and it is identical to the tag of the same character in the character library. Taking the $i$-th character as an example, let its index tag be $x_i$ and its font size tag be $s_i$. Using the index tag $x_i$ and font size tag $s_i$ of the $i$-th character, the vector characters of all the different fonts corresponding to the $i$-th character can be retrieved from the vector character library. The vector character of the $j$-th font corresponding to the $i$-th character is denoted $V_{i,j}$, and when these vector characters are converted into vector character bitmaps, the bitmap of the vector character of the $j$-th font corresponding to the $i$-th character is denoted $M_{i,j}$, where $j = 1, 2, 3, \ldots$
Taking the $i$-th character as an example, the outer bounding box of the character in the scanned bitmap and the outer bounding box of the vector character of each corresponding font in the vector character bitmap are obtained, and the image inside each outer bounding box is binarized. Because the scanned bitmap and the vector character bitmap both consist of a character part and a background part, pixels of the character part are set to 1 and pixels of the background part are set to 0, which completes the binarization of the bitmap inside the outer bounding box. In this embodiment, the binary image of the content of the outer bounding box of the $i$-th character is denoted $B_i$, and the binary image of the content of the outer bounding box of the vector character of the $j$-th font corresponding to the $i$-th character is denoted $B_{i,j}$, where $j = 1, 2, 3, \ldots$ It should be noted that in this embodiment the outer bounding box of a character and the outer bounding box of a vector character both refer to the minimum bounding rectangle.
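The binarization and bounding-box step can be sketched as follows. This is an illustration only: it assumes a grayscale image stored as nested lists, dark characters on a light background, a fixed threshold of 128, and an axis-aligned minimum bounding rectangle. None of these details are given in the text, and the function names are hypothetical.

```python
def binarize(gray, threshold=128):
    """Set character (dark) pixels to 1 and background (light) pixels to 0.

    Assumption: dark text on a light background and a fixed threshold.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def bounding_box(binary):
    """Return (top, left, bottom, right) of the smallest axis-aligned
    rectangle that contains every pixel equal to 1."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return rows[0], cols[0], rows[-1], cols[-1]


gray = [
    [255, 255, 255, 255],
    [255,   0,  30, 255],
    [255,  10, 255, 255],
    [255, 255, 255, 255],
]
binary = binarize(gray)
print(binary[1])             # [0, 1, 1, 0]
print(bounding_box(binary))  # (1, 1, 2, 2)
```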
And S3, acquiring all closed edges in the two binary images, acquiring the maximum distance from the central point of the outer enclosure frame to the corresponding closed edge, sequentially acquiring the distances from all the edge points to the central point of the corresponding outer enclosure frame along the clockwise direction by taking the target edge point corresponding to the maximum distance as a starting point, and acquiring a distance sequence of the closed edge and a distance sequence set corresponding to the two binary images.
Specifically, edge detection is performed on each binary image with the Canny edge detection algorithm to obtain all closed edges in the binary image. For example, the Chinese character 二 ("two") yields two closed edges and the Chinese character 三 ("three") yields three. To obtain the distance sequence of a closed edge, the center point of the outer bounding box corresponding to that closed edge is first obtained and taken as the reference point, and the distance between the reference point and each edge point on the same closed edge is then computed; computing these distances is prior art.
As shown in Fig. 2, the pixel distance from each edge point on a closed edge to the reference point is used as the descriptor of that closed edge. For the binary image $B_i$ inside the outer bounding box of the $i$-th character, the distance sequence of its $k$-th closed edge can be expressed as $D_i^k = \{d_1^k, d_2^k, \ldots, d_{n_k}^k\}$, where $n_k$ denotes the total number of edge points on the $k$-th closed edge and $d_m^k$ denotes the distance between the $m$-th edge point on the $k$-th closed edge and the center point. Likewise, for the binary image $B_{i,j}$ inside the outer bounding box of the vector character of the $j$-th font corresponding to the $i$-th character, the distance sequence of its $k$-th closed edge can be expressed as $D_{i,j}^k$. By analogy, the distance sequences and the corresponding distance sequence sets are obtained for the binary images inside the outer bounding boxes of all characters and of the vector characters of each corresponding font.
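A minimal sketch of building one distance sequence is given below. It assumes that the edge points of a closed edge have already been extracted and ordered clockwise (for example by a contour-tracing step that is not shown), and that ties for the maximum distance are broken by taking the first such point; the function name is hypothetical.

```python
import math


def distance_sequence(edge_points, center):
    """Distances from `center` to each edge point, starting at the farthest point.

    edge_points: (x, y) tuples of one closed edge, assumed to be in clockwise order.
    center: (x, y) center point of the outer bounding box.
    """
    distances = [math.hypot(x - center[0], y - center[1]) for x, y in edge_points]
    start = distances.index(max(distances))  # first farthest point if tied
    # Rotate the cyclic sequence so that it begins at the farthest point.
    return distances[start:] + distances[:start]


points = [(1, 0), (0, 3), (-1, 0), (0, -2)]
print(distance_sequence(points, (0, 0)))  # [3.0, 1.0, 2.0, 1.0]
```

Starting every sequence at the farthest edge point gives the two contours being compared a common reference position, which is what makes their sequences comparable element by element.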
S4, sampling each distance sequence in the distance sequence sets at a plurality of different sampling scales to obtain the target sequence set for each sampling scale, calculating the similarity distance between every two target sequences of the two target sequence sets at the same sampling scale, performing KM (Kuhn-Munkres) matching on the target sequences of the two target sequence sets according to the similarity distance, acquiring the mean similarity distance of the matched target sequences, and taking this mean as the matching effect evaluation value.
Because the font matching degree of the binary image of the text and the binary image corresponding to the vector text is measured, it is necessary to perform multi-scale pairing on each distance sequence in the distance sequence set of the closed edge of the binary image of the text and each distance sequence in the distance sequence set of the closed edge of the binary image corresponding to the vector text.
Specifically, in this embodiment the sampling scales are set first. Because the length of each closed edge is limited, the length of the closed edge corresponding to the shortest distance sequence among all distance sequences in the two distance sequence sets is obtained first, and one tenth of that length is taken as the maximum sampling scale, denoted $C$. The sampling scales are therefore $1, 2, 3, \ldots, C$. At sampling scale $c$, the target sequence set obtained by sampling each distance sequence in the distance sequence set of the binary image $B_i$ inside the outer bounding box of the $i$-th character is denoted $T_i^c$, and the target sequence set obtained by sampling each distance sequence in the distance sequence set of the binary image $B_{i,j}$ inside the outer bounding box of the vector character of the $j$-th font corresponding to the $i$-th character is denoted $T_{i,j}^c$.
The step of sampling each distance sequence in the distance sequence set according to a sampling scale to obtain a target sequence set corresponding to each sampling scale comprises the following steps: adding each distance in the distance sequences in the distance sequence set and the corresponding distance after the sampling scale interval to calculate an average value; taking the mean value as a target value in a target sequence; and by analogy, obtaining a target sequence corresponding to each distance sequence under each sampling scale, and obtaining a target sequence set by the target sequences corresponding to all the distance sequences in each distance sequence set.
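The sampling rule can be illustrated with a short sketch. It assumes that "the corresponding distance after the sampling scale interval" means the value `scale` positions later, and that the sequence wraps around because the edge is closed; the wrap-around and the function name are assumptions of this example.

```python
def sample_sequence(distances, scale):
    """Average each distance with the distance `scale` positions later.

    Assumption: indices wrap around, since the distances come from a closed edge.
    """
    n = len(distances)
    return [(distances[k] + distances[(k + scale) % n]) / 2.0 for k in range(n)]


print(sample_sequence([4.0, 2.0, 6.0, 8.0], 1))  # [3.0, 4.0, 7.0, 6.0]
print(sample_sequence([4.0, 2.0, 6.0, 8.0], 2))  # [5.0, 5.0, 5.0, 5.0]
```

Larger scales average points that lie farther apart on the contour, so the target sequence reflects coarser shape features and is less sensitive to pixel-level noise.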
Specifically, a similarity distance between every two target sequences in two target sequence sets under the same sampling scale is calculated, where the similarity distance is a DTW distance, that is, the similarity distance between every two target sequences in two target sequence sets under the same sampling scale is calculated by using a dynamic time warping algorithm, and the dynamic time warping algorithm is an algorithm in the prior art and is not described in detail in this embodiment.
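For reference, a textbook dynamic time warping distance can be sketched as follows. The absolute difference is used as the local cost, and the normalization to the range 0 to 1 that the embodiment relies on is not included, because the text does not say how it is done.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    # cost[i][j] is the best cumulative cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]


# A repeated value is absorbed by the warping, so the distance is zero.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # 0.0
print(dtw_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))       # 2.0
```

Because the warping tolerates sequences of different lengths, two closed edges with different numbers of edge points can still be compared directly.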
KM matching is performed on the target sequences of the two target sequence sets according to the similarity distance, and the mean similarity distance is obtained as follows. The KM algorithm finds a maximum-weight matching in a bipartite graph; here it pairs the target sequences of one target sequence set with those of the other target sequence set, and it also yields the number of matched pairs. In this embodiment the similarity distances between target sequences are used to match them. After matching is completed, the similarity distance of every two unmatched target sequences is recorded as 1, and the similarity distance of every two matched target sequences is calculated with the dynamic time warping algorithm. The mean of the similarity distances of all unmatched pairs and all matched pairs is then calculated. The similarity distances in this embodiment are normalized, so the mean lies between 0 and 1. The mean similarity distance is used as the evaluation of the matching result: the closer it is to 0, the better the matching effect, and the closer it is to 1, the worse the matching effect. Accordingly, at sampling scale $c$, the matching effect evaluation value of the two target sequence sets corresponding to the binary image $B_i$ inside the outer bounding box of the $i$-th character and the binary image $B_{i,j}$ inside the outer bounding box of the vector character of the $j$-th font corresponding to the $i$-th character is denoted $E_{i,j}^c$.
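The matching and evaluation step can be sketched as below. To stay self-contained, the sketch replaces the KM algorithm with an exhaustive search for the assignment of minimum total distance, which gives the same result on small sets but does not scale. It also assumes that each target sequence left without a partner contributes a distance of 1 to the mean; the text can be read in other ways, and the function name is hypothetical.

```python
from itertools import permutations


def match_and_evaluate(dist):
    """Match two target sequence sets and return (pairs, evaluation value).

    dist[r][c]: normalized similarity distance in [0, 1] between target
    sequence r of the character and target sequence c of the vector character.
    Assumption: brute-force search stands in for the KM algorithm.
    """
    rows, cols = len(dist), len(dist[0])
    transposed = rows > cols
    if transposed:  # let the smaller set index the rows
        dist = [list(column) for column in zip(*dist)]
        rows, cols = cols, rows
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(cols), rows):
        total = sum(dist[r][perm[r]] for r in range(rows))
        if total < best_cost:
            best_cost, best_perm = total, perm
    pairs = [(r, c) for r, c in enumerate(best_perm)]
    if transposed:
        pairs = [(c, r) for r, c in pairs]
    unmatched = cols - rows  # assumption: each leftover sequence counts as distance 1
    evaluation = (best_cost + unmatched * 1.0) / (rows + unmatched)
    return sorted(pairs), evaluation


pairs, value = match_and_evaluate([[0.1, 0.8, 0.5], [0.7, 0.2, 0.6]])
print(pairs)            # [(0, 0), (1, 1)]
print(round(value, 4))  # 0.4333
```

In the example, two sequences are matched with distances 0.1 and 0.2 and one sequence stays unmatched, so the evaluation value is (0.1 + 0.2 + 1) / 3.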
And S5, calculating the font matching degree of the characters and the vector characters according to the similarity of the KM matching results of the target sequences in the corresponding target sequence set under every two different sampling scales and the matching effect evaluation value, taking the vector character corresponding to the maximum font matching degree as a replacement object of the corresponding character in the vector diagram, and obtaining a vectorized file.
Specifically, the step of obtaining the similarity of the KM matching results of the target sequence sets at every two different sampling scales comprises: acquiring the matched target sequence pairs of the target sequence sets at the two sampling scales and taking their intersection to obtain the number of identical matched target sequence pairs; taking the union of the matched target sequence pairs at the two sampling scales to obtain the total matching number; and taking the ratio of the number of identical matched target sequence pairs to the total matching number as the similarity of the KM matching results.
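This ratio of intersection to union is the Jaccard index of the two sets of matched pairs. A minimal sketch, in which each pair is represented as a hypothetical tuple of closed-edge indices, could look like this:

```python
def matching_similarity(pairs_a, pairs_b):
    """Jaccard index of the matched pairs found at two sampling scales.

    Each pair is assumed to be (character edge index, vector character edge index).
    """
    set_a, set_b = set(pairs_a), set(pairs_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: no matches at either scale counts as no similarity
    return len(set_a & set_b) / len(union)


scale_1 = [(0, 0), (1, 1), (2, 2)]
scale_2 = [(0, 0), (1, 2), (2, 1)]
print(matching_similarity(scale_1, scale_2))  # 0.2
```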
The step of calculating the font matching degree of the characters and the vector characters according to the similarity of the KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value comprises the following steps: calculating the mean value of the corresponding matching effect evaluation values under every two different sampling scales; calculating the product of the similarity of the KM matching results of the target sequences in the corresponding target sequence set under every two different sampling scales, the mean value of the matching effect evaluation values and the weights of the two corresponding different sampling scales, and taking the product as the initial matching degree; summing all the initial matching degrees corresponding to every two different sampling scales in all the sampling scales to obtain the font matching degree of each character and the corresponding vector character, wherein the font matching degree of each character and the vector character of the corresponding font is calculated according to the following formula in the embodiment:
In the formula, the left-hand side denotes the font matching degree between a given character and the vector character of a given candidate font corresponding to that character;
the first evaluation term denotes the matching effect evaluation value, at the first of the two sampling scales, of the two target sequence sets corresponding to the binary image within the bounding box of the character and the binary image within the bounding box of the vector character of that font;
the second evaluation term denotes the matching effect evaluation value, at the second of the two sampling scales, of the same two target sequence sets;
the similarity term denotes the similarity between the KM matching results of the target sequences in the corresponding target sequence sets at the first sampling scale and the KM matching results of the target sequences in the corresponding target sequence sets at the second sampling scale [the accompanying condition on the two sampling scales is not preserved in the text];
it should be noted that the matching effect evaluation values at both sampling scales are normalized similarity distances, so a value closer to 0 indicates a better matching effect and a value closer to 1 indicates a worse one. The weight term reflects the difference between the two sampling scales: the larger the difference between the scales while the matching results remain similar, the more strongly this indicates that the pairing results for the binary image within the bounding box of the character and the binary image within the bounding box of the vector character of that font stay similar even at two widely differing sampling scales. At the same time, the closer the font matching degree is to 0, the smaller the difference between the character and the vector character of that font, that is, the better the two match. In short, the more similar the matching results across the sampling scales and the better the matching effect of those results, the more the vector character of that font should be the replacement for the character.
Therefore, in this embodiment, among the font matching degrees between a character and the vector characters of each corresponding font, the vector character with the maximum font matching degree is used as the replacement for the corresponding character in the vector diagram, and the vectorized file is thereby obtained.
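The aggregation over pairs of sampling scales can be sketched as below. This is a hedged illustration only: the text states that each term is the product of the similarity, the mean evaluation value, and a weight, but the exact form of the weight is not preserved, so `scale_gap_weight` is a hypothetical stand-in that grows with the gap between the two scales. The function names, the dictionary layout, and the font labels `font_a` and `font_b` are likewise assumptions.

```python
from itertools import combinations


def font_matching_degree(evaluations, similarities, weight):
    """Sum, over every pair of sampling scales, similarity * mean evaluation * weight.

    evaluations:  {scale: matching effect evaluation value at that scale}
    similarities: {(scale_a, scale_b): similarity of the KM results}, scale_a < scale_b
    weight:       callable (scale_a, scale_b) -> weight (form assumed by the caller)
    """
    total = 0.0
    for a, b in combinations(sorted(evaluations), 2):
        mean_eval = (evaluations[a] + evaluations[b]) / 2
        total += similarities[(a, b)] * mean_eval * weight(a, b)
    return total


def scale_gap_weight(a, b, max_scale=3):
    # Assumed weight: proportional to the difference between the two scales.
    return abs(a - b) / max_scale


evaluations = {1: 0.2, 2: 0.4, 3: 0.6}
similarities = {(1, 2): 0.5, (1, 3): 0.25, (2, 3): 0.5}
degree = font_matching_degree(evaluations, similarities, scale_gap_weight)

# Selection step as stated in the embodiment: keep the font with the maximum degree.
degrees = {"font_a": degree, "font_b": 0.35}
best_font = max(degrees, key=degrees.get)
```

Passing the weight as a callable keeps the sketch independent of whichever weighting the original formula actually uses.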
The invention relates to a method for vectorizing an electronic file based on a layout file. For both the scanned bitmap of a character and the vector character bitmap, the distances from the edge points on each closed edge to the center point are used as the descriptive feature of that closed edge; the distance sequences are then paired at multiple sampling scales, and the font of each character is identified from the pairing results, thereby avoiding the influence of the image quality of the scanned document on the matching accuracy.
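The closed-edge feature summarized above can be sketched as a distance sequence. This is a minimal sketch under stated assumptions: the edge points are assumed to be already extracted and ordered clockwise by the caller, and the function name `distance_sequence` and its argument layout are hypothetical.

```python
import math


def distance_sequence(edge_points, center):
    """Distances from each edge point to the bounding-box center.

    edge_points: (x, y) points of one closed edge, already ordered clockwise.
    center:      (x, y) center point of the bounding box.
    The sequence is rotated so that it starts at the edge point farthest
    from the center, as described for the starting point.
    """
    distances = [math.hypot(x - center[0], y - center[1]) for x, y in edge_points]
    start = distances.index(max(distances))        # first farthest edge point
    return distances[start:] + distances[:start]   # keep traversal order, new start


# A 2 x 2 square traversed clockwise in image coordinates (y grows downward).
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
seq = distance_sequence(square, center=(1, 0.5))
```

Starting every sequence at the farthest edge point gives the two bitmaps a common reference point before their sequences are compared.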
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. A method for vectorizing an electronic file based on a layout file is characterized by comprising the following steps:
acquiring a scanning bitmap of a paper document;
acquiring an index label and a font size label of each character on the scanning bitmap, and a bounding box of each character;
acquiring, from a vector character library and according to the index label and the font size label, vector characters of different fonts corresponding to each character, converting the vector characters into a vector character bitmap, acquiring a bounding box of each vector character in the vector character bitmap, and acquiring binary images of the image within the bounding box of the character and of the corresponding image within the bounding box of the vector character;
acquiring all closed edges in the two binary images, acquiring the maximum distance from the center point of the bounding box to the corresponding closed edge, taking the target edge point corresponding to the maximum distance as a starting point, sequentially acquiring the distances from all edge points to the center point of the corresponding bounding box in the clockwise direction, and obtaining a distance sequence for each closed edge and a distance sequence set for each of the two binary images;
sampling each distance sequence in the distance sequence sets at a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale, calculating the similarity distance between every two target sequences in the two target sequence sets at the same sampling scale, performing KM (Kuhn-Munkres) matching on the target sequences in the two target sequence sets according to the similarity distances, acquiring the mean similarity distance of the matched target sequences, and taking this mean as a matching effect evaluation value;
and calculating the font matching degree of the characters and the vector characters of each corresponding font according to the similarity of the KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value, taking the vector character corresponding to the maximum font matching degree as a replacement object of the corresponding character in the vector diagram, and obtaining the vectorized file.
2. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of setting a plurality of different sampling scales comprises:
acquiring the length of a closed edge corresponding to a distance sequence with the shortest length in all distance sequences in the two distance sequence sets;
taking one tenth of the length of the closed edge corresponding to the distance sequence with the shortest length as the maximum sampling scale;
the sampling scales are 1, 2, 3, …, in sequence, up to the maximum sampling scale.
3. The method for vectorizing the electronic file based on the layout file according to claim 1, wherein the step of sampling each distance sequence in the distance sequence set by using a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale comprises:
adding each distance in the distance sequences in the distance sequence set and the corresponding distance after the sampling scale interval to calculate an average value; taking the mean value as a target value in a target sequence;
and by analogy, obtaining a target sequence corresponding to each sampling scale, and obtaining a target sequence set corresponding to the distance sequence set.
4. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein a dynamic time warping algorithm is used to calculate the similarity distance between every two target sequences in the two target sequence sets under the same sampling scale.
5. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of obtaining the mean value of the similarity distance between the matched target sequences comprises:
after matching is completed, recording the similarity distance of any target sequences left unmatched as 1;
calculating the similarity distance between every two matched target sequences by using a dynamic time warping algorithm;
and calculating the mean of the similarity distances over all unmatched target sequences and all matched target sequence pairs, and taking it as the similarity distance mean value.
6. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of calculating the similarity of the KM matching results of the target sequences in the corresponding target sequence sets at two different sampling scales comprises:
acquiring corresponding matched target sequence pairs in corresponding target sequence sets under two different sampling scales, and solving an intersection to obtain the number of the matched same target sequence pairs;
obtaining a total matching number by summing corresponding matched target sequence pairs in corresponding target sequence sets under two different sampling scales;
and taking the ratio of the number of the matched same target sequence pairs to the total number of matched target sequence pairs as the similarity of the KM matching results.
7. The method for vectorizing an electronic document based on a layout file according to claim 1, wherein the step of calculating the font matching degree between the text and the vector text of each corresponding font according to the similarity of the KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value comprises:
calculating the mean value of the corresponding matching effect evaluation values under every two different sampling scales;
calculating the product of the similarity of KM matching results of target sequences in the corresponding target sequence set under every two different sampling scales, the mean value of the matching effect evaluation value and the weights of the corresponding two different sampling scales, and taking the product as the initial matching degree;
and summing all the initial matching degrees corresponding to every two different sampling scales in all the sampling scales to obtain the font matching degree of each character and the vector character of each corresponding font.
8. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the scanning bitmap is subjected to image correction, and the image correction comprises: correcting the pose of the scanning bitmap of the paper document according to the scanning angle to obtain a front view of the scanning bitmap of the paper document, and taking the front view of the scanning bitmap as the scanning bitmap.
9. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein an OCR algorithm is used to obtain the text on the scanned bitmap and obtain the outer bounding box of each text.
10. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the index tag of the text is a tag of the text to be searched.
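The sampling and similarity-distance steps of claims 2 to 4 can be sketched as follows. This is an illustrative sketch, not the claimed implementation. The function names are hypothetical; the wrap-around at the end of a sequence (treating the closed edge as cyclic), the fallback to a single scale for very short edges, and the use of the absolute difference as the local cost in dynamic time warping are all assumptions. The result is a raw, unnormalized distance.

```python
def sampling_scales(shortest_edge_length):
    """Scales 1, 2, ..., up to one tenth of the shortest closed-edge length."""
    max_scale = max(1, shortest_edge_length // 10)  # assumed fallback: at least scale 1
    return list(range(1, max_scale + 1))


def sample_sequence(distances, scale):
    """Average each distance with the distance `scale` positions later.

    Wrap-around is assumed because the distances come from a closed edge.
    """
    n = len(distances)
    return [(distances[k] + distances[(k + scale) % n]) / 2 for k in range(n)]


def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in seq_a only
                                    cost[i][j - 1],      # advance in seq_b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


scales = sampling_scales(35)
target_scale_1 = sample_sequence([1, 3, 5, 7], 1)
target_scale_2 = sample_sequence([1, 3, 5, 7], 2)
raw_distance = dtw_distance(target_scale_1, target_scale_2)
```

Dynamic time warping tolerates sequences of different lengths, which matters here because closed edges in the scanned bitmap and the vector character bitmap generally contain different numbers of edge points.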
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211266067.0A CN115346227B (en) | 2022-10-17 | 2022-10-17 | Method for vectorizing electronic file based on layout file |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211266067.0A CN115346227B (en) | 2022-10-17 | 2022-10-17 | Method for vectorizing electronic file based on layout file |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115346227A (en) | 2022-11-15 |
CN115346227B (en) | 2023-08-08 |
Family
ID=83957216
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211266067.0A Active CN115346227B (en) | 2022-10-17 | 2022-10-17 | Method for vectorizing electronic file based on layout file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115346227B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117037184A (en) * | 2023-10-10 | 2023-11-10 | 深圳牛图科技有限公司 | OCR fuzzy recognition system and method based on cloud matching |
CN117236291A (en) * | 2023-11-16 | 2023-12-15 | 北京点聚信息技术有限公司 | Method and system for rapidly converting scanned file into vector layout file |
CN117475438A (en) * | 2023-10-23 | 2024-01-30 | 北京点聚信息技术有限公司 | OCR technology-based scan file vectorization conversion method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110232724A (en) * | 2019-06-13 | 2019-09-13 | 大连民族大学 | A kind of Chinese character style image vector representation method |
CN114419635A (en) * | 2022-03-30 | 2022-04-29 | 北京点聚信息技术有限公司 | Electronic seal vector diagram identification method based on graphic identification |
CN114926839A (en) * | 2022-07-22 | 2022-08-19 | 富璟科技(深圳)有限公司 | Image identification method based on RPA and AI and electronic equipment |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117037184A (en) * | 2023-10-10 | 2023-11-10 | 深圳牛图科技有限公司 | OCR fuzzy recognition system and method based on cloud matching |
CN117475438A (en) * | 2023-10-23 | 2024-01-30 | 北京点聚信息技术有限公司 | OCR technology-based scan file vectorization conversion method |
CN117475438B (en) * | 2023-10-23 | 2024-05-24 | 北京点聚信息技术有限公司 | OCR technology-based scan file vectorization conversion method |
CN117236291A (en) * | 2023-11-16 | 2023-12-15 | 北京点聚信息技术有限公司 | Method and system for rapidly converting scanned file into vector layout file |
CN117236291B (en) * | 2023-11-16 | 2024-01-12 | 北京点聚信息技术有限公司 | Method and system for rapidly converting scanned file into vector layout file |
Also Published As
Publication number | Publication date |
---|---|
CN115346227B (en) | 2023-08-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||