CN101996160A - Method and system for processing script data

Info

Publication number: CN101996160A
Application number: CN 200910090817
Authority: CN
Inventors: 丁力; 张磊; 仇睿恒; 王毅
Original assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University; Peking University Founder Group Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Peking University Founder Research and Development Center
Priority date: 2009-08-10
Filing date: 2009-08-10
Publication date: 2011-03-30
Anticipated expiration: 2029-08-10
Also published as: CN101996160B

Abstract

The invention discloses a method and system for processing script data, belonging to the field of script technology. Traditional script data is large in volume, the logic for using it is complex, and it is slow and inefficient to use. In the method and system of the invention, a direct mapping between character encoding and character pattern index is first established and recorded in a mapping table of character encoding and character pattern index, and the redundant data in the script data is deleted. When the script is used, the character pattern index of a character is first obtained by querying the mapping table of character encoding and character pattern index, and the character pattern description data of the character is then obtained from the script data according to the character pattern index. After script data is processed by the method and system of the invention, the volume of the script data is reduced and its use efficiency is improved. The invention is especially suitable for files in which character information and script data are integrated into a whole, or for script data added to file reading software.

Description

A method and system for processing character font data

Technical field

The invention belongs to the field of font technology and specifically relates to a method and system for processing character font data. It is especially suitable for documents in which text information and character font data are combined, or for character font data appended to document reading software.

Background technology

When stored and displayed, an electronic document comprises elements such as graphics, pictures, tables, formulas and multilingual text. Text is the most important element for expressing document content, and it also accounts for the largest proportion of a document. Character font data, as a resource, stores the font description data of a series of characters. When a document is displayed, the corresponding font description data in the character font data is rendered, according to the text information in the document, into an image or a group of paths and is shown on the computer screen or output to a printer.

Type1, in full PostScript Type1, is a vector font standard proposed by Adobe in 1985. Because this standard is based on the PostScript page description language (PDL), and PDL is the preferred printing description language of high-end printers, Type1 quickly became popular. However, Type1 is not an open font format, and Adobe charges high usage fees to companies that use Type1.

TrueType is a new type of mathematical glyph description technique. It describes the outline of a character with mathematical functions and contains instructions for glyph construction, color filling, numerical description functions, conditional flow control, grid-fitting control and additional hinting control. TrueType describes the outline of a glyph using quadratic B-spline curves and straight lines. Its characteristics are as follows: TrueType can be used both for printing fonts and for screen display; because glyphs are described by instructions, it is independent of resolution, and output is always produced at the resolution of the printer. Whether enlarged or reduced, characters are always smooth and free of jagged edges. Compared with PostScript fonts, however, its quality is somewhat lower. In particular, when the text is very small, it is not displayed very clearly.

OpenType, also called the Type2 font, is likewise an outline font and is more powerful than TrueType. Its most obvious benefit is that PostScript fonts can be embedded in TrueType software. It also supports multiple platforms and very large character sets, and provides copyright protection. It can be regarded as a superset of Type1 and TrueType.

The major advantage of OpenType is as follows:

1) Enhanced cross-platform functionality

2) Better support for the international character sets defined by the Unicode standard

3) Support for advanced typographic control

4) Smaller generated file size

5) Support for adding a digital signature to the character set, ensuring the integrity of the file

The OpenType standard also defines the file name extensions of OpenType files. An OpenType file containing TrueType fonts has the extension .ttf, and a file containing PostScript fonts has the extension .OTF. A font collection file containing a series of TrueType fonts has the extension .TTC.

Unicode (also known as the universal code or single code) is a character encoding used on computers. It assigns a unified and unique binary code to every character of every language, to meet the requirements of cross-language and cross-platform text conversion and processing.

To truly reproduce text content in its original form, the attribute information that the user has set for the text, such as color, font and size, must be preserved. To guarantee the same output on any system, the character font data and the text information must be integrated into a whole. The character font data then needs to be processed. The prior-art approach is to remove part of the redundant information from the character font data, mainly font description data, for example the font descriptions in the glyf table of an OpenType font. Because the other data is not processed, the font is used in exactly the same way as a complete font.

The prior art has the following shortcomings:

1. The logic for using the font is complex. Because character font data processed by the prior art is used no differently from a normal font, the mapping table to use must be determined from the encoding type of the current text. For example, an OpenType font contains multiple cmap tables; one or more searches are needed to determine the position of the font description data, and when certain mapping tables are used the code must also be converted one or more times, so the logic is complicated.

2. Some redundant data remains in the character font data. Because only part of the font description data is removed, the font still contains information that is irrelevant to the environment in which it is used, such as some of the information in the name table and the cmap table of OpenType.
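The lookup logic criticized in shortcoming 1 can be pictured with the following Python sketch. It is an illustration only and not part of the original disclosure: the in-memory layout of the cmap subtables, the function name `prior_art_lookup`, the `converters` argument and the order of candidate subtables are all assumptions made for this example.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical in-memory model: each cmap subtable is keyed by
# (platform_id, encoding_id) and maps a code to a font index.
Subtables = Dict[Tuple[int, int], Dict[int, int]]


def prior_art_lookup(code: int, encoding: str, subtables: Subtables,
                     converters: Dict[str, Callable[[int], int]]) -> Optional[int]:
    """Find a font index the slow way: choose a subtable, convert, search."""
    # Step 1: choose candidate subtables from the encoding type of the text.
    # (3, 1) = Windows Unicode BMP, (0, 3) = Unicode platform BMP,
    # (3, 0) = Windows Symbol; the candidate order is an assumption.
    preferred = {"unicode": [(3, 1), (0, 3)], "symbol": [(3, 0)]}[encoding]
    for key in preferred:
        table = subtables.get(key)
        if table is None:
            continue  # the font lacks this subtable; try the next candidate
        # Step 2: some subtables need the code converted before searching.
        convert = converters.get(encoding, lambda c: c)
        index = table.get(convert(code))
        if index is not None:
            return index
    return None
```

Every call repeats the subtable selection and the code conversion, which is the per-use overhead that a single direct mapping table is intended to avoid.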

Summary of the invention

In view of the defects in the prior art, the purpose of the invention is to provide a method and system for processing character font data. After character font data is processed by this method and system, the efficiency of using the character font data is improved.

To achieve this purpose, the invention adopts the following technical solution:

A method for processing character font data, in which the correspondence between literal code and font index is first established and recorded in a literal code and font index mapping table;

When the character font data is used, the font index corresponding to a literal code is obtained through the literal code and font index mapping table, and the font description data of the character is then obtained from the character font data according to the font index.

In the method for processing character font data described above, the literal code refers to a standard code of the text, including Unicode codes and GBK codes. If the literal code is not a Unicode code, it is converted into a Unicode code.

In the method for processing character font data described above, the process of establishing the correspondence between literal code and font index comprises the following steps:

(1) obtaining and identifying the literal code;

(2) parsing the character font data and obtaining the correspondence between literal code and font index from the character font data;

(3) generating the literal code and font index mapping table.

In the method for processing character font data described above, the process in step (2) of parsing the character font data and obtaining the correspondence between literal code and font index is as follows: one or more mapping tables from which the correspondence between literal code and font index can ultimately be obtained are found in the character font data according to the platform on which the text is used and the type of the literal code, and the font index corresponding to the literal code is then obtained from the one or more mapping tables found.

In the method for processing character font data described above, the process of establishing the correspondence between literal code and font index may also comprise the following steps:

(1) parsing the character font data and finding one or more mapping tables from which the correspondence between literal code and font index can ultimately be obtained;

(2) judging whether the literal code used by the font is a Unicode code and, if it is not, converting it into a Unicode code;

(3) extracting, from the mapping tables found in step (1), the font index corresponding to each literal code, and generating the literal code and font index mapping table, which records the correspondence between the Unicode code of each character and its font index.
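As a minimal sketch of steps (1) to (3), assuming that the mapping table found in the font has already been parsed into a Python dictionary, the literal code and font index mapping table could be generated as follows. The function name `build_mapping_table` and the dictionary representation are hypothetical and serve only as an illustration.

```python
from typing import Dict, Optional


def build_mapping_table(native_cmap: Dict[int, int],
                        to_unicode: Optional[Dict[int, int]] = None) -> Dict[int, int]:
    """Return {Unicode code: font index}.

    native_cmap: {code in the font's own encoding: font index}, i.e. the
                 result of parsing the mapping table found in step (1).
    to_unicode:  {native code: Unicode code}; None means the font already
                 uses Unicode codes and no conversion is needed (step (2)).
    """
    table: Dict[int, int] = {}
    for code, font_index in native_cmap.items():
        unicode_code = code if to_unicode is None else to_unicode[code]
        table[unicode_code] = font_index  # step (3): record the direct mapping
    return table
```

For example, `build_mapping_table({0x21: 1, 0x22: 2}, {0x21: 0x89C4, 0x22: 0x683C})` yields a table that maps 0x89C4 to 1 and 0x683C to 2.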

In the method for processing character font data described above, after the correspondence between literal code and font index is established, the redundant data in the character font data is deleted.

A system for processing character font data, comprising a font processor (2) and a font application apparatus (3);

The font processor (2) comprises a font index acquisition module (22) for parsing the character font data and obtaining the correspondence between literal code and font index, and a mapping table generation module (23) for generating the literal code and font index mapping table, which records the correspondence between literal code and font index;

The font application apparatus (3) comprises a mapping table parsing module (31) for parsing the literal code and font index mapping table and obtaining the font index corresponding to a literal code or the literal code corresponding to a font index, and a font description data acquisition module (32) for obtaining the font description data of a character from the character font data according to the font index.

In the system for processing character font data described above, the font processor (2) further comprises a literal code acquisition and identification module (21) for obtaining and identifying the literal code; when the literal code is a non-standard code, this module also converts the non-standard code into a standard code.

In the system for processing character font data described above, the font processor (2) further comprises a redundant data removing module (24) for deleting the redundant data in the character font data.

Compared with the prior art, the method and system of the invention have the following advantages:

(1) The logic for using the font is simple and fast. A mapping table recording the correspondence between literal code and font index is generated from the literal codes and font indexes, which eliminates the need to judge the literal code type and parse the cmap tables each time the font is used. In addition, the literal code and font index mapping table is simple to use and avoids the complex logic of searching the cmap tables repeatedly, so the font index is located faster and the character font data can be used more quickly.

(2) More redundant information is removed, making the character font data smaller. Based on the literal code and font index mapping table, redundant data in the character font data, such as descriptions of the usage environment, is removed; examples are the descriptive data for different platforms and different languages in the OpenType name table, and some redundant mapping tables in the cmap table.

Description of drawings

Fig. 1 is a structural block diagram of the system described in Embodiment 1;

Fig. 2 is a structural block diagram of the system described in Embodiment 2;

Fig. 3 is a flowchart of processing and using character font data with the system shown in Fig. 1;

Fig. 4 is a flowchart of processing character font data in Embodiment 1;

Fig. 5 is a flowchart of using character font data in Embodiment 1.

Embodiment

The core concept of the invention is as follows: the character font data in an existing document that combines text information with character font data, or an existing complete set of character font data, is processed to establish a direct mapping between literal code (the character encoding of the text) and font index; the mapping is recorded in a literal code and font index mapping table, and the redundant data in the character font data is then deleted. Here, character font data means the data in a font file, including all the data needed to describe the font, such as the name, the copyright, the font description data, and one or more mapping tables that record the correspondence between font description data and literal code. The font index indicates the position of the font description data within the character font data. When the font is used, the font index of a character is first obtained by querying the literal code and font index mapping table, and the font description data of the character is then obtained from the character font data according to the font index.

The invention is described below with reference to the embodiments and the accompanying drawings.

Embodiment 1

This embodiment takes as an example the processing of character font data in a document that combines literal code with character font data.

Fig. 1 shows the structure of the system of this embodiment. The system comprises a font processor 2 and a font application apparatus 3.

The font processor 2 comprises a literal code acquisition and identification module 21, a font index acquisition module 22 and a mapping table generation module 23. The literal code acquisition and identification module 21 obtains and identifies the literal code in document 1 and, when the literal code is a non-standard code, also converts it into a standard code. The font index acquisition module 22 parses the character font data in document 1 and obtains the correspondence between literal code and font index; the font index indicates the position of the font description data within the character font data. The mapping table generation module 23 generates the literal code and font index mapping table, which records the correspondence between literal code and font index.

The font application apparatus 3 comprises a mapping table parsing module 31 and a font description data acquisition module 32. The mapping table parsing module 31 parses the literal code and font index mapping table and obtains the font index corresponding to a literal code or the literal code corresponding to a font index. The font description data acquisition module 32 obtains the font description data from the character font data according to the font index.

In addition, to delete the redundant data in the character font data and reduce its data volume, the font processor 2 further comprises a redundant data removing module 24. After the literal code and font index mapping table is generated, the redundant data removing module 24 can delete the redundant data in the character font data.

Fig. 3 shows the flow of processing character font data with the system shown in Fig. 1 and of using the processed character font data. It comprises the processing of the character font data by the font processor 2 and the use of the character font data by the font application apparatus 3.

The process by which the font processor 2 processes the character font data comprises the following steps:

(1) The literal code acquisition and identification module 21 obtains and identifies the literal code in document 1.

The literal code may be a standard code, such as Unicode or GBK, or a non-standard code. A non-standard code must be converted into a standard code.

(2) The font index acquisition module 22 parses the character font data and obtains the correspondence between literal code and font index from the character font data.

First, one or more mapping tables from which the correspondence between literal code and font index can ultimately be obtained are found in the character font data according to the platform on which the text is used and the type of the literal code; the font index corresponding to the literal code is then obtained from the one or more mapping tables found.

(3) The mapping table generation module 23 generates the literal code and font index mapping table, which records the correspondence between literal code and font index.

After the literal code and font index mapping table is generated, the redundant data removing module 24 deletes the redundant data in the character font data, such as the descriptive data for different platforms and different languages in the OpenType name table, and some redundant mapping tables in the cmap table.
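The deletion of redundant data can be sketched as follows. This is an illustration under stated assumptions only: the font is modelled as a plain dictionary of already-parsed tables, the field names `platform_id` and `language_id` and the function `strip_redundant_data` are hypothetical, and which name records and cmap subtables count as redundant is a decision the text leaves to the implementer.

```python
from typing import Any, Dict, Iterable, Tuple


def strip_redundant_data(font: Dict[str, Any],
                         keep_platform: int,
                         keep_language: int,
                         keep_cmap_keys: Iterable[Tuple[int, int]] = ()) -> Dict[str, Any]:
    """Return a copy of the font without name records and cmap subtables
    that are not needed in the target environment."""
    keep = set(keep_cmap_keys)
    stripped = dict(font)  # shallow copy; the input font is left unchanged
    # Keep only the name records for the platform and language actually used.
    stripped["name"] = [rec for rec in font["name"]
                        if rec["platform_id"] == keep_platform
                        and rec["language_id"] == keep_language]
    # Keep only the listed cmap subtables; by default none are kept, on the
    # assumption that the direct mapping table has replaced them.
    stripped["cmap"] = {key: sub for key, sub in font["cmap"].items()
                        if key in keep}
    return stripped
```

Tables that carry the font description data itself, such as glyf and loca, pass through untouched in this sketch.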

The process by which the font application apparatus 3 uses the character font data processed by the font processor 2 comprises the following steps:

(4) The mapping table parsing module 31 parses the literal code and font index mapping table and obtains a literal code or a font index. The literal code and font index mapping table can be used in the following two ways:

1. A literal code is obtained, the literal code and font index mapping table is parsed, and the font index corresponding to the literal code is looked up;

2. A font index is obtained, the literal code and font index mapping table is parsed, and the literal code corresponding to the font index is looked up.

(5) The font description data acquisition module 32 obtains the font description data of the character from the character font data according to the font index.
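The two ways of using the mapping table in step (4) amount to a forward lookup and a reverse lookup. The following sketch assumes the table is held as a Python dictionary from Unicode code to font index; the function names are hypothetical.

```python
from typing import Dict, Optional


def code_to_index(table: Dict[int, int], code: int) -> Optional[int]:
    """Way 1: look up the font index that corresponds to a literal code."""
    return table.get(code)


def build_reverse_table(table: Dict[int, int]) -> Dict[int, int]:
    """Way 2: invert the table so a font index leads back to a literal code."""
    reverse: Dict[int, int] = {}
    for code, font_index in table.items():
        # If several codes share one font index, keep the first code seen.
        reverse.setdefault(font_index, code)
    return reverse
```

Keeping the first code when several codes share a font index is a design choice of this sketch; the text does not say how such cases are to be resolved.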

The processing and application of character font data described above are illustrated below, taking as an example the two characters of the word "specification" (规格) in a font embedded in a PDF document. As shown in Fig. 4, the literal code acquisition and identification module 21 first obtains and identifies the literal codes of "specification", which are 0x21 and 0x22 respectively. Because these are unknown codes, that is, non-standard codes, they must first be converted into standard codes. The Unicode codes of "specification" can be obtained from the ToUnicode table in the PDF document; they are 0x89C4 and 0x683C respectively.

Because the font is used on the Windows platform and the encoding type is Unicode, the cmap table (character map) with Platform ID=3 (Windows) and Encoding ID=1 (Unicode BMP (UCS-2)) is searched for in the character font data. The font index acquisition module 22 parses the character font data and finds that it contains a cmap table, but not one with Platform ID=3 and Encoding ID=1. In this case, the cmap table in segment mapping mode (format 4) is used, and the codes 0x21 and 0x22 of "specification" are mapped directly into the range 0xF000 to 0xF08F: code 0x21 is mapped to 0xF021 and code 0x22 is mapped to 0xF022. Looking up the segment mapping cmap table then yields the font indexes corresponding to 0xF021 and 0xF022, which are 1 and 2 respectively.
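The mapping of 0x21 and 0x22 into the 0xF000 range and the subsequent segment lookup can be worked through in a short sketch. It is a simplification made for illustration: only segments that resolve a code by adding a delta are modelled, the `Segment` type and function names are hypothetical, and the delta value 0x0FE0 is chosen here so that the result matches the font indexes 1 and 2 given in the text.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_code: int  # first code covered by the segment
    end_code: int    # last code covered by the segment
    id_delta: int    # added to the code, modulo 65536, to give the font index


def to_symbol_code(code: int) -> int:
    """Map a single-byte code into the 0xF000 range, e.g. 0x21 -> 0xF021."""
    return 0xF000 + code


def segment_lookup(code: int, segments: List[Segment]) -> int:
    """Return the font index for a code, or 0 when no segment covers it."""
    for seg in segments:
        if seg.start_code <= code <= seg.end_code:
            return (code + seg.id_delta) & 0xFFFF
    return 0  # font index 0 conventionally denotes the missing glyph


# One segment covering 0xF021..0xF022: (0xF021 + 0x0FE0) & 0xFFFF == 1.
segments = [Segment(0xF021, 0xF022, 0x0FE0)]
indexes = [segment_lookup(to_symbol_code(c), segments) for c in (0x21, 0x22)]
```

With these assumed values, `indexes` holds the font indexes 1 and 2 for the codes 0x21 and 0x22.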

The mapping table generation module 23 generates the literal code and font index mapping table from the Unicode codes and font indexes of "specification", as follows:

Unicode code	Font index
0x89C4	1
0x683C	2

After the literal code and font index mapping table is generated, the redundant data removing module 24 deletes the redundant data in the character font data.

As shown in Fig. 5, in use, the Unicode codes 0x89C4 and 0x683C of "specification" are obtained first. The mapping table parsing module 31 then parses the literal code and font index mapping table above and obtains font indexes 1 and 2, which correspond to the Unicode codes of "specification". Finally, the font description data acquisition module 32 uses font indexes 1 and 2 to obtain the corresponding font description data through the loca table (glyph location table) of the character font data.
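The use step shown in Fig. 5 can be sketched as a two-stage lookup. The sketch assumes that the loca table has been parsed into a list of byte offsets into the glyf data, with one more entry than there are font indexes so that entry i+1 marks the end of the data for index i; the function name and the sample bytes are hypothetical.

```python
from typing import Dict, Sequence


def glyph_description(code: int, table: Dict[int, int],
                      loca: Sequence[int], glyf: bytes) -> bytes:
    """Return the raw font description data for a Unicode code."""
    font_index = table[code]             # stage 1: literal code -> font index
    start = loca[font_index]             # stage 2: font index -> byte range
    end = loca[font_index + 1]
    return glyf[start:end]


# Placeholder data: three entries of 4, 6 and 6 bytes.
table = {0x89C4: 1, 0x683C: 2}
loca = [0, 4, 10, 16]
glyf = bytes(range(16))
first = glyph_description(0x89C4, table, loca, glyf)
second = glyph_description(0x683C, table, loca, glyf)
```

No encoding-type judgment or cmap parsing takes place at this stage, since the direct mapping table already yields the font index.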

Embodiment 2

This embodiment processes the character font data as a whole and appends it to document reading software, so that the font resource can be used even when the font is not otherwise available, and multiple documents can reuse the same font resource.

Fig. 2 shows the structure of the system of this embodiment. Compared with Embodiment 1, the object processed is not character font data integrated in a document but a whole set of character font data; therefore, the font processor 2 does not include the literal code acquisition and identification module.

Taking as an example the appending of the Eu-bx font to the Apabi Reader software as a resource, the character font data is processed as follows:

(1) The font index acquisition module 22 parses the Eu-bx character font data and obtains the cmap tables in the character font data. The Eu-bx font contains two cmap tables, one of which is the cmap table with Platform ID=3 and Specific ID=1.

(2) Because the Eu-bx font does not use standard Unicode codes, the Eu code must first be converted into a GBK code according to the "EUtoGBK.dat" resource and then converted into a Unicode code.

(4) The font index corresponding to each Unicode code is extracted from the cmap.

(5) The mapping table generation module 23 generates the literal code and font index mapping table.

(6) The redundant data removing module 24 removes the redundant data from the font.

(7) The character font data processed as above is appended to the Apabi Reader software.
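The code conversion in this procedure, from Eu code to GBK code and then to Unicode code, can be sketched as follows. The format and contents of the "EUtoGBK.dat" resource are not given in the text, so the dictionary `eu_to_gbk` below is a hypothetical stand-in for it; the conversion from GBK to Unicode uses Python's built-in `gbk` codec, and the function names are likewise assumptions.

```python
from typing import Dict


def gbk_to_unicode(gbk_code: int) -> int:
    """Convert one GBK code (one or two bytes) to a Unicode code."""
    length = 1 if gbk_code < 0x80 else 2
    return ord(gbk_code.to_bytes(length, "big").decode("gbk"))


def build_table_for_eu_font(eu_cmap: Dict[int, int],
                            eu_to_gbk: Dict[int, int]) -> Dict[int, int]:
    """Return {Unicode code: font index} for a font keyed by Eu codes.

    eu_cmap:   {Eu code: font index}, as extracted from the font's cmap.
    eu_to_gbk: {Eu code: GBK code}, standing in for the EUtoGBK.dat resource.
    """
    return {gbk_to_unicode(eu_to_gbk[eu]): font_index
            for eu, font_index in eu_cmap.items()}
```

The resulting dictionary plays the role of the literal code and font index mapping table generated in step (5).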

After the literal code and font index mapping table is generated, the table and a known literal code can be used to query the font index corresponding to the literal code and then obtain the font description data; the table and a known font index can also be used to query the literal code corresponding to the font index.

As can be seen from the above embodiments, the invention analyzes the literal code and the character font data to obtain the font indexes, generates the literal code and font index mapping table, and then removes the redundant data in the character font data according to the mapping table, greatly reducing the data volume of the character font data. In use, the font index can be obtained directly from the literal code and the literal code and font index mapping table, and the font description data can then be obtained. This simplifies the complicated process of obtaining font description data from a literal code and thus increases the speed at which text can be used.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims

1. A method for processing character font data, characterized in that: the method first establishes the correspondence between literal code and font index and records it in a literal code and font index mapping table;

2. The method for processing character font data as claimed in claim 1, characterized in that: the literal code refers to a standard code of the font, including Unicode codes and GBK codes.

3. The method for processing character font data as claimed in claim 2, characterized in that: if the literal code is not a Unicode code, it is converted into a Unicode code.

4. The method for processing character font data as claimed in any one of claims 1 to 3, characterized in that the process of establishing the correspondence between literal code and font index comprises the following steps:

(1) obtaining and identifying the literal code;

(3) generating the literal code and font index mapping table.

5. The method for processing character font data as claimed in claim 4, characterized in that, in step (2), the process of parsing the character font data and obtaining the correspondence between literal code and font index is as follows: one or more mapping tables from which the correspondence between literal code and font index can ultimately be obtained are found in the character font data according to the platform on which the text is used and the type of the literal code, and the font index corresponding to the literal code is then obtained from the one or more mapping tables found.

6. The method for processing character font data as claimed in any one of claims 1 to 3, characterized in that the process of establishing the correspondence between literal code and font index comprises the following steps:

7. The method for processing character font data as claimed in claim 1, characterized in that: after the correspondence between literal code and font index is established, the redundant data in the character font data is deleted.

8. A system for processing character font data, characterized in that: the system comprises a font processor (2) and a font application apparatus (3);

9. The system for processing character font data as claimed in claim 8, characterized in that: the font processor (2) further comprises a literal code acquisition and identification module (21) for obtaining and identifying the literal code.

10. The system for processing character font data as claimed in claim 9, characterized in that: when the literal code is a non-standard code, the literal code acquisition and identification module (21) also converts the non-standard code into a standard code.

11. The system for processing character font data as claimed in any one of claims 8 to 10, characterized in that: the font processor (2) further comprises a redundant data removing module (24) for deleting the redundant data in the character font data.