Detailed Description
In order to solve the problems in the prior art, embodiments of the present invention provide a method and a system for embedding double-byte fonts into a PDF file. A target PDF file (i.e., a PDF file with embedded fonts) is generated directly while the PDF file with to-be-embedded fonts is parsed, and the font descriptions of the double-byte fonts are embedded in the process of generating the target file. Compared with the prior art, the method avoids the use of an intermediate format, thereby better ensuring the correctness of the target file and improving the efficiency of the embedding operation.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The input is an original PDF file with fonts to be embedded (hereinafter referred to as the original file), and the output is a PDF file with the fonts embedded (hereinafter referred to as the target file). In the embodiment of the present invention, the target file is generated from the original file according to the steps shown in FIG. 1:
Step 101, parsing the original PDF file, determining the fonts that the PDF file uses but does not embed, and determining from them the double-byte fonts to be embedded (such as Simplified Chinese, Traditional Chinese, Japanese and Korean fonts); obtaining font description information of each double-byte font to be embedded, wherein the font description information may comprise the encoding mode and the font name of the font;
Step 102, determining, by parsing the content streams of the original PDF file, all characters that are output using a double-byte font to be embedded; performing encoding mapping on the characters according to their font type and corresponding encoding mode to obtain the identifiers corresponding to the characters (such as character identifiers or glyph identifiers); and acquiring, from the font file of the double-byte font to be embedded, the glyph description information corresponding to the identifiers of the characters;
Step 103, organizing the obtained glyph description information into a font program (FontProgram) data stream conforming to the PDF specification, and using the font program data stream together with the obtained font description object as the font file data embedded in the PDF file, thereby generating the target PDF file.
In the above process, the determined double-byte fonts to be embedded may be all the double-byte fonts used by the original PDF file but not embedded, or a part of the double-byte fonts. The above-mentioned process can be implemented by a corresponding software system.
The detailed flow by which the software system of the embodiment of the present invention embeds into a PDF file all the double-byte fonts that the file uses but does not embed is described below with reference to FIG. 2.
To facilitate implementation of the embodiments of the present invention, the following sets may be used as data storage modules for storing intermediate data during the process:
Font set to be embedded: a simple set of font objects that contains all the double-byte fonts to be embedded. While the original file is parsed, one record is added to the set whenever a double-byte font to be embedded (existing in the form of a font object) is found; repetitions are not counted (i.e., only one record is stored when the same font object is used multiple times);
Font description set to be embedded: a simple set of font description objects that contains the font description objects of all the double-byte fonts to be embedded. Repetitions are not counted (i.e., only one record is stored when the same font description object is referenced multiple times);
Character set to be embedded: a set indexed by font object that contains all the characters of each font that are used in the original file. In this set, characters are recorded in the form of a character code (Char Code), a character identifier (CID) or a glyph identifier (Glyph ID); repetitions are not counted (i.e., only one record is stored when the same character is used multiple times).
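The three sets can be pictured with a minimal Python sketch. The class name `EmbeddingState`, its field names and the use of integer object numbers as keys are assumptions made for illustration only; the embodiment does not prescribe a concrete data layout.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EmbeddingState:
    """Hypothetical container for the three intermediate sets."""
    fonts_to_embed: set = field(default_factory=set)        # font object numbers
    descriptors_to_embed: set = field(default_factory=set)  # font description object numbers
    # index object number -> set of Char Codes, CIDs or Glyph IDs
    chars_to_embed: dict = field(default_factory=lambda: defaultdict(set))

    def record_font(self, font_num, descriptor_num):
        # Sets ignore duplicates, so repeated use adds no new record.
        self.fonts_to_embed.add(font_num)
        self.descriptors_to_embed.add(descriptor_num)

    def record_char(self, index_num, char_id):
        self.chars_to_embed[index_num].add(char_id)


state = EmbeddingState()
state.record_font(7, 9)
state.record_font(7, 9)                 # same font object used again
for char_id in (0x4E2D, 0x6587, 0x4E2D):
    state.record_char(7, char_id)       # 0x4E2D is stored only once
```

Using built-in sets gives the "repetitions are not counted" behaviour without any explicit duplicate check.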
As shown in FIG. 2, the process of embedding double-byte fonts in a PDF file by the software system of the embodiment of the present invention includes:
Step 201, parsing the original file, obtaining the font objects (Font objects) of all non-embedded double-byte fonts used by the original file, and storing these objects in the font set to be embedded.
In general, a Font object in a PDF exists in a PDF file in the form of a PDF dictionary object, and a double-byte Font used by the PDF file but not embedded in the PDF file can be determined by looking up the PDF dictionary object of the original PDF file. The Font object contains important information about the Font, such as the Font name and the encoding method.
Step 202, for the font object of each double-byte font in the font set to be embedded, looking up its descendant fonts (DescendantFonts), looking up in the descendant fonts the font description object (FontDescriptor) corresponding to the double-byte font to be embedded, and storing the font description objects found in the font description set to be embedded.
In general, a descendant font exists in a PDF file in the form of a PDF dictionary object; the FontDescriptor object of the corresponding font is included in the descendant font and likewise exists in the PDF file in the form of a dictionary object.
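Steps 201 and 202 can be sketched as follows. The sketch assumes that PDF objects have already been parsed into Python dictionaries keyed by object number and that indirect references are stored as plain object numbers; both are simplifications. It also assumes that double-byte fonts appear as `Type0` composite fonts and that a font counts as non-embedded when its FontDescriptor carries none of the `FontFile`, `FontFile2` or `FontFile3` entries defined by the PDF specification. The function name is hypothetical.

```python
FONT_FILE_KEYS = ("FontFile", "FontFile2", "FontFile3")


def find_unembedded_cid_fonts(objects):
    """Return (font numbers, descriptor numbers) of composite fonts
    whose descendant font has no embedded font program."""
    font_nums, descriptor_nums = set(), set()
    for num, obj in objects.items():
        if obj.get("Type") != "Font" or obj.get("Subtype") != "Type0":
            continue
        for child_num in obj.get("DescendantFonts", []):
            descriptor_num = objects[child_num].get("FontDescriptor")
            if descriptor_num is None:
                continue
            descriptor = objects[descriptor_num]
            if not any(key in descriptor for key in FONT_FILE_KEYS):
                font_nums.add(num)
                descriptor_nums.add(descriptor_num)
    return font_nums, descriptor_nums


# Illustrative object table: font 1 is not embedded, font 4 is.
objects = {
    1: {"Type": "Font", "Subtype": "Type0", "BaseFont": "SimSun",
        "Encoding": "Identity-H", "DescendantFonts": [2]},
    2: {"Type": "Font", "Subtype": "CIDFontType2", "FontDescriptor": 3},
    3: {"Type": "FontDescriptor", "FontName": "SimSun"},
    4: {"Type": "Font", "Subtype": "Type0", "BaseFont": "Embedded",
        "Encoding": "Identity-H", "DescendantFonts": [5]},
    5: {"Type": "Font", "Subtype": "CIDFontType2", "FontDescriptor": 6},
    6: {"Type": "FontDescriptor", "FontName": "Embedded", "FontFile2": 7},
    7: {"Length": 0},
}
fonts, descriptors = find_unembedded_cid_fonts(objects)
```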
Step 203, parsing all content streams in the original PDF file, and acquiring the font used by each instruction related to text output together with the codes of the output characters; for each output character that uses a double-byte font to be embedded, obtaining its character identifier or glyph identifier according to the font type and encoding mode; and storing the obtained character identifiers or glyph identifiers into the character set to be embedded, with the font description as the index.
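The collection of character codes from a content stream might look like the sketch below. It is deliberately simplified and rests on several assumptions: only hexadecimal strings are handled (literal strings in parentheses are ignored), only the `Tf`, `Tj` and `TJ` operators are interpreted, and every code is taken to be two bytes long. The function name `collect_codes` and the resource name `F1` are illustrative.

```python
import re

# Names, hex strings, array brackets, and everything else (numbers, operators).
TOKEN = re.compile(r"/[^\s<>\[\]()/]+|<[0-9A-Fa-f\s]*>|\[|\]|[^\s<>\[\]()/]+")


def collect_codes(content, tracked_fonts):
    """Map each tracked font resource name to the set of 2-byte codes it shows."""
    used = {}
    current_font = None
    operands = []
    for tok in TOKEN.findall(content):
        if not tok[0].isalpha():
            operands.append(tok)          # operand: name, number, string, bracket
            continue
        # An operator consumes the operands collected so far.
        if tok == "Tf" and operands:
            current_font = operands[0].lstrip("/")
        elif tok in ("Tj", "TJ") and current_font in tracked_fonts:
            for op in operands:
                if not op.startswith("<"):
                    continue              # skip kerning numbers and brackets
                digits = re.sub(r"\s+", "", op[1:-1])
                if len(digits) % 2:
                    digits += "0"         # PDF pads an odd hex string with 0
                data = bytes.fromhex(digits)
                for i in range(0, len(data) - 1, 2):
                    code = int.from_bytes(data[i:i + 2], "big")
                    used.setdefault(current_font, set()).add(code)
        operands = []
    return used


stream = "BT /F1 12 Tf <4E2D6587> Tj [<4E2D> -120 <0041>] TJ ET"
codes = collect_codes(stream, {"F1"})
```

A production parser would also need literal strings, the `'` and `"` operators, and variable-length codes as defined by the font's CMap.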
In this step, for a font of the Type1 (CID) type, encoding mapping is performed on the character to obtain a character identifier (CID); for a font of the TrueType (CID) type, encoding mapping is performed on the character to obtain the Unicode value of the character, and the glyph identifier table in the TrueType font file is then queried to obtain the corresponding glyph identifier (Glyph ID). The encoding mapping finds the character name corresponding to a character code by searching an encoding mapping table, which is an attribute contained in each font description object. The specific steps are as follows:
As shown in FIG. 3, for a font of the Type1 (CID) type, if the encoding mode is Identity-H or Identity-V, the character identifier (CID) of the character can be parsed directly from the content stream (see steps 301-302); for other encoding modes, the character code (Char Code) of the character is parsed from the content stream, and the character code is then mapped to the corresponding character identifier (CID) (see steps 301, 303 and 304). After the character identifier (CID) is obtained, the glyph description corresponding to the character can be found in the corresponding font file according to the character identifier. If the character also contains sub-characters, the character identifiers corresponding to the sub-characters need to be obtained in the same way.
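The two branches for Type1 (CID) fonts can be sketched as a single function. The mapping table is passed in as a plain dictionary, and the values in the usage example are invented for illustration; they are not taken from any real CMap. The function name is hypothetical.

```python
def codes_to_cids(encoding, codes, code_to_cid=None):
    """Map character codes from the content stream to CIDs."""
    if encoding in ("Identity-H", "Identity-V"):
        # Identity encodings: the 2-byte code in the stream already is the CID.
        return set(codes)
    if code_to_cid is None:
        raise ValueError("a code-to-CID mapping table is required for " + encoding)
    # Other encodings: look each code up; unmapped codes are skipped.
    return {code_to_cid[code] for code in codes if code in code_to_cid}


identity_cids = codes_to_cids("Identity-H", [0x0102, 0x0203])

# Hypothetical table; real values come from the CMap named by the encoding.
example_table = {0xD6D0: 100, 0xCEC4: 101}
mapped_cids = codes_to_cids("GBK-EUC-H", [0xD6D0, 0xCEC4, 0xFFFF], example_table)
```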
As shown in FIG. 4, for a font of the TrueType (CID) type, if the encoding mode is Identity-H or Identity-V, the character identifier (CID) of the character is parsed from the content stream, the Unicode value corresponding to the character identifier is obtained from the CID-to-Unicode mapping table, and the glyph identifier (Glyph ID) corresponding to the Unicode value is obtained from the Unicode-to-Glyph-ID mapping table (see steps 401, 402, 406 and 407). For the Unicode encoding mode, the Unicode value of the character is parsed from the content stream, and the glyph identifier (Glyph ID) corresponding to the Unicode value is then obtained from the Unicode-to-Glyph-ID mapping table (see steps 401, 403 and 407). For other encoding modes, the character code (Char Code) of the character is parsed from the content stream, the character identifier (CID) corresponding to the character code is obtained through the mapping table in the font description object, and the CID-to-Unicode mapping table is then queried with that CID to obtain the corresponding Unicode value and, from it, the corresponding glyph identifier (Glyph ID) (see steps 401, 404, 405, 406 and 407). After the glyph identifier (Glyph ID) is obtained, the corresponding glyph description can be found in the corresponding font file according to the glyph identifier. If the character also contains sub-characters, the glyph identifiers corresponding to the sub-characters need to be obtained in the same way.
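The three lookup chains for TrueType (CID) fonts can be sketched as below. All three tables are modelled as dictionaries with invented contents, and the string `"Unicode"` is used as a placeholder label for the Unicode encoding mode; real PDF files name specific CMaps instead. The function name is hypothetical.

```python
def to_glyph_ids(encoding, values, code_to_cid, cid_to_unicode, unicode_to_gid):
    """Resolve values parsed from the content stream to Glyph IDs."""
    gids = set()
    for value in values:
        if encoding in ("Identity-H", "Identity-V"):
            unicode_value = cid_to_unicode.get(value)       # value is a CID
        elif encoding == "Unicode":
            unicode_value = value                           # value is already Unicode
        else:
            cid = code_to_cid.get(value)                    # value is a Char Code
            unicode_value = cid_to_unicode.get(cid)
        gid = unicode_to_gid.get(unicode_value)
        if gid is not None:                                 # skip unmapped characters
            gids.add(gid)
    return gids


# Hypothetical tables for illustration only.
code_to_cid = {0xD6D0: 100}
cid_to_unicode = {100: 0x4E2D}
unicode_to_gid = {0x4E2D: 5, 0x6587: 6}

via_identity = to_glyph_ids("Identity-H", [100], code_to_cid, cid_to_unicode, unicode_to_gid)
via_unicode = to_glyph_ids("Unicode", [0x6587], code_to_cid, cid_to_unicode, unicode_to_gid)
via_other = to_glyph_ids("GBK-EUC-H", [0xD6D0], code_to_cid, cid_to_unicode, unicode_to_gid)
```

In a real TrueType file the Unicode-to-Glyph-ID table corresponds to the font's `cmap` table.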
Step 204, a font program data stream is constructed. Constructing a CFF font program data stream if the double-byte font to be embedded is of Type1(CID) Type; constructing a TrueType (CID) font program data stream if the double-byte font to be embedded is a TrueType (CID) type; if the double-byte font to be embedded includes both Type1(CID) and TrueType (CID) types, a CFF font program data stream and a TrueType (CID) font program data stream are constructed.
Step 205, one font description is selected from the font description set to be embedded and used as an index; all character identifiers or glyph identifiers corresponding to that font description are read; the corresponding glyph description information is looked up in the font file for each of them; and the glyph description information found is written into the corresponding font program data stream.
In this step, if the font corresponding to the current font description is of the Type1 (CID) type, the following steps are performed:
in the character set to be embedded, traversing each character identifier (CID) under the current font description, using the current font description as the index; looking up the corresponding glyph description information in the Type1 (CID) font file according to each character identifier (CID), and, if the character identifier (CID) includes sub-character identifiers, also looking up the sub-character glyph description information corresponding to the sub-character identifiers; then storing the glyph description information found into the previously constructed CFF data stream according to the CFF font program specification;
if the font corresponding to the current font description is of the TrueType (CID) type, the following steps are performed:
in the character set to be embedded, traversing each glyph identifier (Glyph ID) under the current font description, using the current font description as the index; looking up the corresponding glyph description information in the TrueType (CID) font file according to each glyph identifier, and, if the glyph identifier includes glyph identifiers of sub-characters, also looking up the sub-character glyph description information corresponding to those glyph identifiers; then storing the glyph description information found into the previously constructed TrueType (CID) data stream according to the TrueType (CID) font program specification.
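The traversal in step 205, including the handling of sub-characters, can be sketched as a worklist loop. The font file is modelled here as a dictionary from identifier to a pair of glyph description bytes and a list of sub-character identifiers; this layout, the function name and the sample data are assumptions for illustration and do not reflect the binary structure of CFF or TrueType files.

```python
def collect_glyphs(font_file, wanted_ids):
    """Return id -> glyph description for the wanted ids and,
    transitively, for every sub-character they refer to."""
    subset = {}
    pending = list(wanted_ids)
    while pending:
        glyph_id = pending.pop()
        if glyph_id in subset or glyph_id not in font_file:
            continue                      # already collected, or absent from the font
        description, sub_ids = font_file[glyph_id]
        subset[glyph_id] = description
        pending.extend(sub_ids)           # sub-characters are collected the same way
    return subset


# Hypothetical font: glyph 5 is composed of 20 and 21, and 21 reuses 20.
font_file = {
    5: (b"outline-5", [20, 21]),
    6: (b"outline-6", []),
    20: (b"outline-20", []),
    21: (b"outline-21", [20]),
}
subset = collect_glyphs(font_file, {5})

# The completeness check of step 206 then reduces to a membership test.
all_written = all(glyph_id in subset for glyph_id in {5})
```

Because glyph 6 is never used, it is left out of the subset, which is what keeps the embedded font program small.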
Step 206, determining whether the glyph description information for the character identifiers and glyph identifiers corresponding to all the font descriptions in the character set to be embedded has been written into the font program data streams; if yes, step 207 is executed; otherwise, the flow returns to step 205.
Step 207, writing the font program data streams into the target PDF file, and writing the font description objects recorded in the font description set to be embedded into the target PDF file according to the PDF specification.
The font description object of a Type1 (CID) font is written into the target PDF file after the necessary modification (mainly, adding a reference to the generated CFF data stream) according to the provisions of the PDF specification for embedding CFF fonts; the font description object of a TrueType (CID) font is written into the target PDF file after the necessary modification (mainly, adding a reference to the generated TrueType (CID) data stream) according to the provisions of the PDF specification for embedding TrueType (CID) fonts.
Step 208, traversing the objects in the original PDF file, and storing all objects other than the font description objects already written into the target PDF file by the above steps into the target PDF file without modification.
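Steps 207 and 208 can be sketched on the same dictionary model of PDF objects. The sketch assumes the usual PDF conventions that a CFF font program is attached through a `FontFile3` entry whose stream has the subtype `CIDFontType0C`, and a TrueType font program through a `FontFile2` entry whose stream carries a `Length1` entry. The function name, the `data` key and the object-numbering scheme are illustrative assumptions, and the stream data is left uncompressed.

```python
def build_target(objects, font_programs):
    """objects: object number -> dict.
    font_programs: descriptor object number -> (kind, font program bytes)."""
    target = {}
    next_num = max(objects) + 1
    for num, obj in objects.items():
        if num not in font_programs:
            target[num] = obj                         # step 208: copied unmodified
            continue
        kind, data = font_programs[num]
        stream = {"Length": len(data), "data": data}
        if kind == "CFF":
            stream["Subtype"] = "CIDFontType0C"
            key = "FontFile3"
        else:
            stream["Length1"] = len(data)             # decoded length of the TrueType program
            key = "FontFile2"
        target[next_num] = stream                     # step 207: font program stream
        target[num] = {**obj, key: next_num}          # step 207: modified descriptor
        next_num += 1
    return target


objects = {
    3: {"Type": "FontDescriptor", "FontName": "SimSun"},
    8: {"Type": "Page"},
}
target = build_target(objects, {3: ("TrueType", b"\x00\x01\x00\x00")})
```

The descriptor is rebuilt as a new dictionary, so the objects of the original file are never mutated.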
In step 202 of the above flow, the obtained font description information of the font to be embedded may optionally include the set of characters that the font uses in the original PDF file (for example, an identifier or a name of the character set). In the subsequent embedding process, only the glyph description information for the characters in that set then needs to be obtained from the corresponding font file and written into the target PDF file. As a result, only a minimal subset of the font is embedded, containing just the characters of that font that the original PDF file uses, which reduces the data volume of the target PDF file.
Based on the same technical concept, an embodiment of the present invention further provides a system capable of embedding a double-byte font into a PDF file. As shown in FIG. 5, the system includes a font description information determining module 501, a glyph description information acquiring module 502, and a PDF file generating module 503, wherein:
the font description information determining module 501 is configured to determine the double-byte fonts that the PDF file with to-be-embedded fonts uses but does not embed, and the font description information of those double-byte fonts;
the glyph description information acquiring module 502 is configured to determine, in the PDF file with to-be-embedded fonts, all characters that are output as text using the double-byte fonts, together with their character identifiers or glyph identifiers, and to acquire, from the font file of each double-byte font, the glyph description information corresponding to the identifiers;
the PDF file generating module 503 is configured to generate a PDF file embedded with the to-be-embedded double-byte fonts according to the acquired font description information and the acquired glyph description information.
The font description information determining module 501 may include:
a file parsing sub-module 5011, configured to parse the PDF file with to-be-embedded fonts;
a to-be-embedded font and description determining sub-module 5012, configured to determine, according to the PDF dictionary objects of the PDF file parsed by the file parsing sub-module 5011, the double-byte fonts that the PDF file uses but does not embed and the font description information of those double-byte fonts.
The glyph description information acquiring module 502 may include:
a content stream parsing sub-module 5021, configured to parse the content streams of the PDF file with to-be-embedded fonts to obtain all instructions related to text output;
a character and identifier acquiring sub-module 5022, configured to determine, according to the instructions parsed by the content stream parsing sub-module 5021, the characters that are output using the double-byte fonts to be embedded, and to acquire the character identifier or glyph identifier of each output character according to the font type and corresponding encoding mode of the character. If the font type of the output character is Type1, the character identifier is acquired; if the font type of the output character is TrueType, the glyph identifier is acquired. The process of acquiring the character identifier or glyph identifier according to the font type and the encoding mode is as described above;
a glyph description information acquiring sub-module 5023, configured to acquire, from the font file of the double-byte font, the glyph description information corresponding to the identifier.
The glyph description information acquiring module 502 may further include a font file loading sub-module 5024, configured to load the corresponding font file according to the font description information of the double-byte font determined by the to-be-embedded font and description determining sub-module 5012. When the glyph description information acquiring sub-module 5023 acquires the glyph description information, the glyph description information corresponding to the identifier is acquired from the loaded font file.
The PDF file generating module 503 may include:
a font program data stream constructing sub-module 5031, configured to construct the corresponding font program data stream according to the font type of the double-byte font to be embedded;
a font program data stream writing sub-module 5032, configured to store the acquired glyph description information into the corresponding font program data stream;
a PDF file writing sub-module 5033, configured to write the font program data stream in which the glyph description information is stored, together with the font description information of the double-byte font to be embedded, into the target PDF file, where the target PDF file is the PDF file in which the double-byte font to be embedded is embedded.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.