CN101996160B

CN101996160B - Method and system for processing script data

Info

Publication number: CN101996160B
Application number: CN 200910090817
Authority: CN
Inventors: 丁力; 张磊; 仇睿恒; 王毅
Original assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University; Peking University Founder Group Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Peking University Founder Research and Development Center
Priority date: 2009-08-10
Filing date: 2009-08-10
Publication date: 2013-01-02
Anticipated expiration: 2029-08-10
Also published as: CN101996160A

Abstract

The invention discloses a method and system for processing script data (font data), belonging to the field of font technology. Conventional script data is large, and the logic for using it is complex, slow and inefficient. In the method and system of the invention, a direct mapping between character encodings and glyph indexes (character pattern indexes) is first established and recorded in a character-encoding-to-glyph-index mapping table, and the redundant data in the script data is deleted. When the script data is used, the glyph index of a character is first obtained by querying the mapping table, and the glyph description data of the character is then obtained from the script data according to the glyph index. After script data has been processed by the method and system of the invention, its volume is reduced and it can be used more efficiently. The invention is especially suitable for documents in which text information and script data are integrated, or for script data added to document reading software.

Description

Method and system for processing font data

Technical field

The invention belongs to the field of font technology and specifically relates to a method and system for processing font data. It is particularly suitable for documents in which text information and font data are integrated, or for font data appended to document reading software.

Background technology

Electronic documents contain elements such as graphics, images, tables, formulas and multilingual text. Text is the principal element expressing document content in storage and display, and it also accounts for the largest share of a document. Font data is a resource that stores the glyph description data of a set of characters. When a document is displayed, the glyph data in the corresponding font data is rendered, according to the text information in the document, into an image or a set of paths and shown on a computer screen or output to a printer.

Type1, in full PostScript Type1, is a vector font standard introduced by Adobe in 1985. Because the standard is based on the PostScript page description language (PDL), and PDL was the preferred printing description language for high-end printers, Type1 quickly became popular. However, Type1 is not an open font format, and Adobe charged high licensing fees to companies that used it.

TrueType is a mathematical font description technology. It describes character outlines with mathematical functions and includes instructions for glyph construction, color filling, numerical description functions, conditional flow control, grid-fitting control and additional hinting control. TrueType describes glyph outlines with quadratic B-spline curves and straight lines. Its characteristics are as follows: a TrueType font can be used both for printing and for screen display; because glyphs are described by instructions, the font is resolution-independent and is always output at the resolution of the printer. Whether enlarged or reduced, characters remain smooth, with no jagged edges. Compared with PostScript fonts, however, its quality is slightly inferior; in particular, very small text is not rendered very clearly.

OpenType, also referred to as the Type2 font, is likewise an outline font format. It is more powerful than TrueType; its most obvious benefit is that PostScript fonts can be embedded in TrueType software. It also supports multiple platforms and very large character sets, and provides copyright protection. It can be regarded as a superset of Type1 and TrueType.

The main advantages of OpenType are as follows:

1) enhanced cross-platform support;

2) better support for the international character sets defined by the Unicode standard;

3) support for advanced typographic control;

4) smaller generated files;

5) support for adding digital signatures to the character set, which guarantees file integrity.

The OpenType standard also defines file name suffixes for OpenType files. OpenType files containing TrueType fonts have the suffix .ttf, and files containing PostScript fonts have the suffix .OTF. A font collection file containing a series of TrueType fonts has the suffix .TTC.

Unicode (also called Universal Code or Unicode) is a character encoding used on computers. It assigns a unified and unique binary code to every character in every language, in order to meet the requirements of cross-language, cross-platform text conversion and processing.

To display text content faithfully in its original form, the attribute information the user has set for the text, such as color, font and size, must be preserved. To guarantee the same output on any system, the font data and the text information must be integrated into a whole. The font data then needs to be processed. In the prior art, part of the redundant information is removed from the font data, mainly glyph description data, for example the glyph descriptions in the glyf table of an OpenType font. Because the other data is left unprocessed, the font is used in exactly the same way as a complete font.

The prior art has the following drawbacks:

1. The font usage logic is complex. Because font data processed by the prior art is used no differently from a normal font, the mapping table to use must be determined at use time according to the current character encoding type. For example, OpenType has multiple cmap tables; the position of the glyph description data can be determined only after one or more lookups, and some mapping tables also require the code to be converted one or more times, so the logic is complex.

2. Redundant data remains in the font data. Because only part of the glyph description data has been removed, the font still contains information that is irrelevant to the usage environment, such as some of the information in the OpenType name table and cmap table.

Summary of the invention

In view of the defects of the prior art, the purpose of the invention is to provide a method and system for processing font data. Font data processed by the method and system can be used more efficiently.

To achieve this purpose, the invention adopts the following technical solution:

A method for processing font data, in which a correspondence between character encodings and glyph indexes is established and recorded in a character-encoding-to-glyph-index mapping table;

When the font data is used, the glyph index corresponding to a character encoding is obtained through the mapping table, and the glyph description data of the character is then obtained from the font data according to that glyph index.

In the method described above, the character encoding refers to a standard encoding of the text, including Unicode and GBK. If the character encoding is not Unicode, it is converted to Unicode.

In the method described above, the process of establishing the correspondence between character encodings and glyph indexes comprises the following steps:

(1) obtaining and identifying the character encoding;

(2) parsing the font data and obtaining from it the correspondence between character encodings and glyph indexes;

(3) generating the character-encoding-to-glyph-index mapping table.

In the method described above, in step (2), the font data is parsed and the correspondence between character encodings and glyph indexes is obtained as follows: according to the platform on which the text is used and the type of character encoding, one or more mapping tables from which the correspondence can ultimately be obtained are located in the font data, and the glyph index corresponding to each character encoding is then obtained from the located mapping table or tables.

(1) parsing the font data and locating one or more mapping tables from which the correspondence between character encodings and glyph indexes can ultimately be obtained;

(2) determining whether the character encoding used by the font is Unicode and, if it is not, converting it to Unicode;

(3) extracting, from the mapping tables located in step (1), the glyph index corresponding to each character encoding and generating the character-encoding-to-glyph-index mapping table, which records the correspondence between the Unicode encoding of each character and its glyph index.
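The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the mapping table located in step (1) is assumed to be already parsed into a plain dictionary keyed by GBK code, the glyph indexes 1 and 2 are example values, and the names `gbk_to_unicode` and `build_mapping_table` are hypothetical.

```python
from typing import Callable, Dict


def gbk_to_unicode(code: int) -> int:
    """Convert a GBK code (1 or 2 bytes, given as an int) to a Unicode code point."""
    raw = code.to_bytes(2, "big") if code > 0xFF else bytes([code])
    return ord(raw.decode("gbk"))


def build_mapping_table(
    source_table: Dict[int, int],
    to_unicode: Callable[[int], int],
) -> Dict[int, int]:
    """Steps (2) and (3): convert every code to Unicode and record its glyph index."""
    return {to_unicode(code): glyph for code, glyph in source_table.items()}


# Step (1) stand-in: a parsed, GBK-keyed mapping table for U+89C4 and U+683C.
gbk_codes = [int.from_bytes(chr(cp).encode("gbk"), "big") for cp in (0x89C4, 0x683C)]
source_table = {gbk_codes[0]: 1, gbk_codes[1]: 2}

mapping_table = build_mapping_table(source_table, gbk_to_unicode)
```

Keying the resulting table by Unicode means that later lookups need no knowledge of the encoding the font originally used.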

In the method described above, after the correspondence between character encodings and glyph indexes has been established, the redundant data in the font data is deleted.

A system for processing font data, comprising a font data processing device (2) and a font data application device (3);

The font data processing device (2) comprises a glyph index acquisition module (22) for parsing the font data and obtaining the correspondence between character encodings and glyph indexes, and a mapping table generation module (23) for generating the character-encoding-to-glyph-index mapping table, which records the correspondence between character encodings and glyph indexes;

The font data application device (3) comprises a mapping table parsing module (31) for parsing the mapping table and obtaining the glyph index corresponding to a character encoding or the character encoding corresponding to a glyph index, and a glyph description data acquisition module (32) for obtaining the glyph description data of a character from the font data according to the glyph index.

In the system described above, the font data processing device (2) further comprises a character encoding acquisition and identification module (21) for obtaining and identifying the character encoding; when the character encoding is a non-standard encoding, the module also converts it to a standard encoding.

In the system described above, the font data processing device (2) further comprises a redundant data deletion module (24) for deleting the redundant data in the font data.

Compared with the prior art, the method and system of the invention have the following advantages:

(1) The font usage logic is simple and fast. A mapping table recording the correspondence between character encodings and glyph indexes is generated once, which removes the need to determine the character encoding type and parse the cmap tables every time the font is used. The mapping table is simple to use and avoids the complex logic of repeated cmap lookups, so the glyph index is located faster and the font data can be used more quickly.

(2) More redundant information is removed, making the font data smaller. Based on the mapping table, redundant data such as descriptions of the usage environment is removed from the font data, for example the descriptive data for different platforms and languages in the OpenType name table, and the redundant mapping tables in the cmap table.

Description of drawings

Fig. 1 is a structural block diagram of the system described in Embodiment 1;

Fig. 2 is a structural block diagram of the system described in Embodiment 2;

Fig. 3 is a flowchart of processing and using font data with the system shown in Fig. 1;

Fig. 4 is a flowchart of processing font data in Embodiment 1;

Fig. 5 is a flowchart of using font data in Embodiment 1.

Detailed description of the embodiments

The core idea of the invention is as follows: the font data in an existing document that integrates text information and font data, or an existing complete set of font data, is processed to establish a direct mapping between character encodings and glyph indexes, which is recorded in a character-encoding-to-glyph-index mapping table; the redundant data in the font data is then deleted. Here, font data refers to the data in a font file, including all data needed to describe the font, such as the name, copyright, glyph description data, and one or more mapping tables recording the correspondence between glyph description data and character encodings. The glyph index indicates the position of the glyph description data within the font data. When the font is used, the glyph index of a character is first obtained by querying the mapping table, and the glyph description data of the character is then obtained from the font data according to the glyph index.
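The two-step lookup at the heart of this idea can be pictured with a small Python sketch. It is illustrative only: `ProcessedFont` is a hypothetical container, the glyph description data is represented by placeholder byte strings, and the glyph index is assumed to be usable directly as a list position.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProcessedFont:
    """Hypothetical container for font data after processing."""
    code_to_glyph: Dict[int, int]  # Unicode code point -> glyph index
    glyphs: List[bytes]            # glyph description data, addressed by glyph index

    def glyph_data(self, code_point: int) -> bytes:
        glyph_index = self.code_to_glyph[code_point]  # one direct table lookup
        return self.glyphs[glyph_index]               # then fetch the description


font = ProcessedFont(
    code_to_glyph={0x89C4: 1, 0x683C: 2},
    glyphs=[b"<notdef>", b"<outline-1>", b"<outline-2>"],
)
data = font.glyph_data(0x683C)
```

No platform or encoding checks appear at lookup time, because that work was done once when the mapping table was built.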

The invention is described below with reference to the embodiments and the accompanying drawings.

Embodiment 1

This embodiment takes as an example the processing of font data in a document that integrates character encodings and font data.

Fig. 1 shows the structure of the system of this embodiment. The system comprises a font data processing device 2 and a font data application device 3.

The font data processing device 2 comprises a character encoding acquisition and identification module 21, a glyph index acquisition module 22 and a mapping table generation module 23. The character encoding acquisition and identification module 21 obtains and identifies the character encodings in document 1; when a character encoding is non-standard, it also converts the non-standard encoding to a standard encoding. The glyph index acquisition module 22 parses the font data in document 1 and obtains the correspondence between character encodings and glyph indexes, where a glyph index indicates the position of the glyph description data within the font data. The mapping table generation module 23 generates the character-encoding-to-glyph-index mapping table, which records the correspondence between character encodings and glyph indexes.

The font data application device 3 comprises a mapping table parsing module 31 and a glyph description data acquisition module 32. The mapping table parsing module 31 parses the character-encoding-to-glyph-index mapping table and obtains the glyph index corresponding to a character encoding or the character encoding corresponding to a glyph index. The glyph description data acquisition module 32 obtains the glyph description data from the font data according to the glyph index.

In addition, to delete the redundant data in the font data and reduce its data volume, the font data processing device 2 also comprises a redundant data deletion module 24. After the character-encoding-to-glyph-index mapping table has been generated, the redundant data deletion module 24 can delete the redundant data in the font data.

Fig. 3 shows the flow of processing font data with the system shown in Fig. 1 and then using the processed font data. It comprises the font data processing device 2 processing the font data and the font data application device 3 using the font data.

The font data processing device 2 processes the font data in the following steps:

(1) The character encoding acquisition and identification module 21 obtains and identifies the character encodings in document 1.

The character encoding may be a standard encoding, such as Unicode or GBK, or a non-standard encoding. A non-standard encoding must be converted to a standard encoding.

(2) font index acquisition module 22 is resolved character font data, obtains the corresponding relation of literal code and font index from character font data.

At first in character font data, find one or more mapping tables that can finally obtain literal code and font index corresponding relation according to the usage platform of literal and the type of literal code, obtain the font index corresponding with literal code according to the one or more mapping tables that find again.

(3) mapping table generation module 23 generating characters coding and font index-mapping table, this table is used for the corresponding relation between shorthand coding and the font index.

Behind generating character coding and the font index-mapping table, redundant data removing module 24 is the deletion of the redundant data in the character font data, as in the name table of OpenType to the data of description of different platform, different language, and some redundant mapping tables in the cmap table.
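The deletion step can be pictured with the following Python sketch. It rests on several assumptions that go beyond the text: the name records and cmap subtables are represented as simple tuples rather than binary tables, `strip_redundant` is a hypothetical helper, and treating every record for another platform or language as redundant is only one possible policy.

```python
from typing import Dict, List, Tuple

NameRecord = Tuple[int, int, str]                # (platformID, languageID, text)
CmapSubtable = Tuple[int, int, Dict[int, int]]   # (platformID, encodingID, mapping)


def strip_redundant(
    names: List[NameRecord],
    cmaps: List[CmapSubtable],
    platform_id: int,
    language_id: int,
) -> Tuple[List[NameRecord], List[CmapSubtable]]:
    """Keep name records for one platform and language, and cmap subtables for that platform."""
    kept_names = [r for r in names if (r[0], r[1]) == (platform_id, language_id)]
    kept_cmaps = [s for s in cmaps if s[0] == platform_id]
    return kept_names, kept_cmaps


# Example data: platform 3 is Windows, platform 1 is Macintosh;
# 0x0409 and 0x0804 are the Windows language IDs for English (US) and Chinese (PRC).
names = [(3, 0x0409, "Example"), (3, 0x0804, "示例"), (1, 0, "Example")]
cmaps = [(1, 0, {0x41: 3}), (3, 1, {0x89C4: 1, 0x683C: 2})]

kept_names, kept_cmaps = strip_redundant(names, cmaps, platform_id=3, language_id=0x0804)
```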

The font data application device 3 uses the font data processed by the font data processing device 2 in the following steps:

(4) The mapping table parsing module 31 parses the character-encoding-to-glyph-index mapping table and obtains a character encoding or a glyph index. The mapping table can be used in the following two ways:

1. obtain a character encoding, parse the mapping table, and look up the glyph index corresponding to that character encoding;

2. obtain a glyph index, parse the mapping table, and look up the character encoding corresponding to that glyph index.

(5) The glyph description data acquisition module 32 obtains the glyph description data of the character from the font data according to the glyph index.
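The two ways of using the mapping table in step (4) can be sketched in Python as follows. The table contents are example values, the function names are hypothetical, and the inverse table assumes that each glyph index is mapped from exactly one character encoding.

```python
from typing import Dict, Optional

mapping_table: Dict[int, int] = {0x89C4: 1, 0x683C: 2}  # Unicode -> glyph index

# Inverse view built once, so that the second mode is also a single lookup.
reverse_table: Dict[int, int] = {glyph: code for code, glyph in mapping_table.items()}


def glyph_for_code(code_point: int) -> Optional[int]:
    """Mode 1: character encoding -> glyph index (None if the character is absent)."""
    return mapping_table.get(code_point)


def code_for_glyph(glyph_index: int) -> Optional[int]:
    """Mode 2: glyph index -> character encoding (None if the glyph is unmapped)."""
    return reverse_table.get(glyph_index)
```

For example, `glyph_for_code(0x89C4)` gives the glyph index passed to step (5), while `code_for_glyph(2)` recovers the character encoding from a known glyph index.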

The processing and use of font data described above is illustrated below, taking as an example the two characters "规格" ("specification") in a font embedded in a PDF document. As shown in Fig. 4, the character encoding acquisition and identification module 21 first obtains and identifies the character codes of "规格", which are 0x21 and 0x22 respectively. Because this is an unknown, that is, non-standard, encoding, it must first be converted to a standard encoding. The Unicode encodings of "规格", 0x89C4 and 0x683C respectively, can be obtained from the ToUnicode table in the PDF document.

Because the font is used on the Windows platform and the encoding type is Unicode, the font data is searched for the cmap table (character map) with Platform ID=3 (Windows) and Encoding ID=1 (Unicode BMP (UCS-2)). The glyph index acquisition module 22 parses the font data and finds that it contains a cmap table, but not one with Platform ID=3 and Encoding ID=1. In this case the cmap table in segment mapping format (format 4) is used, and the codes 0x21 and 0x22 of "规格" ("specification") are mapped directly into the range 0xF000 to 0xF08F: code 0x21 is mapped to 0xF021 and code 0x22 to 0xF022. Looking up the segment-mapping cmap table yields the glyph indexes corresponding to 0xF021 and 0xF022, which are 1 and 2 respectively.
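The lookup described in this paragraph can be sketched in Python as follows. The sketch makes simplifying assumptions: only format 4 segments with idRangeOffset equal to 0 are handled, the segment values are invented so that 0xF021 and 0xF022 resolve to glyph indexes 1 and 2 as in the example, and `to_symbol_code` and `lookup_format4` are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    """One segment of a format 4 cmap subtable, restricted to idRangeOffset == 0."""
    start_code: int
    end_code: int
    id_delta: int


def to_symbol_code(code: int) -> int:
    """Map a one-byte code into the 0xF000 range, e.g. 0x21 -> 0xF021."""
    return 0xF000 + code


def lookup_format4(segments: List[Segment], code: int) -> Optional[int]:
    """Return the glyph index for `code`, or None if no segment covers it."""
    for seg in segments:
        if seg.start_code <= code <= seg.end_code:
            return (code + seg.id_delta) & 0xFFFF  # arithmetic is modulo 65536
    return None


# Hypothetical segment chosen so that 0xF021 -> 1 and 0xF022 -> 2.
segments = [Segment(start_code=0xF020, end_code=0xF08F, id_delta=0x0FE0)]
glyphs = [lookup_format4(segments, to_symbol_code(c)) for c in (0x21, 0x22)]
```

Once the glyph indexes are known, this segment search never has to be repeated: the result is stored against the Unicode encodings in the new mapping table.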

The mapping table generation module 23 generates the character-encoding-to-glyph-index mapping table from the Unicode encodings and glyph indexes of "规格" ("specification"), as follows:

Unicode encoding	Glyph index
0x89C4	1
0x683C	2

After the mapping table has been generated, the redundant data deletion module 24 deletes the redundant data in the font data.

As shown in Fig. 5, at use time the Unicode encodings 0x89C4 and 0x683C of "规格" ("specification") are obtained first. The mapping table parsing module 31 then parses the mapping table above and obtains the glyph indexes 1 and 2 corresponding to those Unicode encodings. Finally, the glyph description data acquisition module 32 uses glyph indexes 1 and 2 to obtain the corresponding glyph description data through the loca table (glyph location table) of the font data.
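How a glyph index leads to glyph description data through the loca table can be sketched in Python as follows. The sketch assumes the loca offsets have already been decoded into a list of byte offsets into the glyf table; the toy byte strings stand in for real outline data, and `glyph_bytes` is a hypothetical helper.

```python
from typing import List


def glyph_bytes(glyf: bytes, loca: List[int], glyph_index: int) -> bytes:
    """Return the raw description of one glyph.

    In a TrueType font, loca[i] is the byte offset of glyph i inside glyf and
    loca[i + 1] marks its end, so loca has one more entry than there are glyphs.
    """
    start, end = loca[glyph_index], loca[glyph_index + 1]
    return glyf[start:end]


# Toy data: three glyphs of 4, 6 and 2 bytes (contents are placeholders).
glyf = b"AAAA" + b"BBBBBB" + b"CC"
loca = [0, 4, 10, 12]

outline_1 = glyph_bytes(glyf, loca, 1)
outline_2 = glyph_bytes(glyf, loca, 2)
```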

Embodiment 2

In this embodiment the font data is processed as a whole and appended to document reading software. In this way a font resource can be used even where the font itself is not otherwise present, and multiple documents can reuse the same font resource.

Fig. 2 shows the structure of the system of this embodiment. Compared with Embodiment 1, the object processed is not font data integrated in a document but a complete set of font data; the font data processing device 2 therefore does not include a character encoding acquisition and identification module.

Taking as an example the Eu-bx font appended to the Apabi Reader software as a resource, the font data is processed as follows:

(1) font index acquisition module 22 is resolved the Eu-bx character font data, obtains the cmap table in the character font data.Two cmap tables are arranged in the Eu-bx font, a Platform ID=3 is wherein arranged, the cmap table of Specific ID=1.

(2) because the Eu-bx font uses is not the Unicode coding of standard, be the GBK coding so need according to " EUtoGBK.dat " resource with the Eu code conversion, and then convert thereof into the Unicode coding.

(4) extract font index corresponding to Unicode coding among the cmap.

(5) mapping table generation module 23 generating characters coding and font index-mapping table.

(6) redundant data removing module 24 is removed the font redundant data.

(7) will append to through the character font data after the above-mentioned processing in the Apabi Reader software.
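The code conversion and table generation in the steps above can be sketched in Python as follows. The format of "EUtoGBK.dat" is not described in the text, so the dictionary `eu_to_gbk` is a hypothetical stand-in for its contents, the Eu codes 0x0101 and 0x0102 and the glyph indexes 7 and 8 are invented for illustration, and the function names are assumptions.

```python
from typing import Dict

# Hypothetical stand-in for the contents of "EUtoGBK.dat": Eu code -> GBK code.
eu_to_gbk: Dict[int, int] = {
    0x0101: int.from_bytes(chr(0x89C4).encode("gbk"), "big"),
    0x0102: int.from_bytes(chr(0x683C).encode("gbk"), "big"),
}


def eu_to_unicode(eu_code: int) -> int:
    """Step (2): Eu code -> GBK code -> Unicode code point."""
    gbk_code = eu_to_gbk[eu_code]
    return ord(gbk_code.to_bytes(2, "big").decode("gbk"))


def process_font(eu_cmap: Dict[int, int]) -> Dict[int, int]:
    """Steps (4) and (5): build the Unicode -> glyph index table from an Eu-keyed cmap."""
    return {eu_to_unicode(code): glyph for code, glyph in eu_cmap.items()}


table = process_font({0x0101: 7, 0x0102: 8})
```

Because the conversion chain runs once during processing, the reading software only ever consults the final Unicode-keyed table.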

After the character-encoding-to-glyph-index mapping table has been generated, the table and a known character encoding can be used to look up the corresponding glyph index and then obtain the glyph description data; the table and a known glyph index can also be used to look up the corresponding character encoding.

As the above embodiments show, the invention analyzes the character encodings and the font data to obtain the glyph indexes, generates the character-encoding-to-glyph-index mapping table, and then removes the redundant data from the font data according to that mapping table, greatly reducing the data volume of the font data. In use, the glyph index can be obtained directly from the character encoding and the mapping table, and the glyph description data can then be obtained. This simplifies the otherwise complex process of obtaining glyph description data from a character encoding and thus increases the speed at which text can be used.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to cover them.

Claims

1. A method for processing font data, characterized in that: in the method, a correspondence between character encodings and glyph indexes is established and recorded in a character-encoding-to-glyph-index mapping table; the process of establishing the correspondence between character encodings and glyph indexes comprises the following steps:

(1) obtaining and identifying the character encoding;

(3) generating the character-encoding-to-glyph-index mapping table;

or comprises the following steps:

1) parsing the font data and locating one or more mapping tables from which the correspondence between character encodings and glyph indexes can ultimately be obtained;

2) determining whether the character encoding used by the font is Unicode and, if it is not, converting it to Unicode;

3) extracting, from the mapping tables located in step 1), the glyph index corresponding to each character encoding and generating the character-encoding-to-glyph-index mapping table, which records the correspondence between the Unicode encoding of each character and its glyph index.

2. The method for processing font data according to claim 1, characterized in that: the character encoding refers to a standard encoding of the font, including Unicode and GBK.

3. The method for processing font data according to claim 2, characterized in that: if the character encoding is not Unicode, it is converted to Unicode.

4. The method for processing font data according to any one of claims 1 to 3, characterized in that, in step (2), the font data is parsed and the correspondence between character encodings and glyph indexes is obtained as follows: according to the platform on which the text is used and the type of character encoding, one or more mapping tables from which the correspondence can ultimately be obtained are located in the font data, and the glyph index corresponding to each character encoding is then obtained from the located mapping table or tables.

5. The method for processing font data according to claim 1, characterized in that: after the correspondence between character encodings and glyph indexes has been established, the redundant data in the font data is deleted.

6. A system for processing font data, characterized in that: the system comprises a font data processing device for establishing a correspondence between character encodings and glyph indexes and recording the correspondence in a character-encoding-to-glyph-index mapping table; and a font data application device for, when the font data is used, obtaining the glyph index corresponding to a character encoding through the mapping table and then obtaining the glyph description data of the character from the font data according to that glyph index;

wherein the font data processing device comprises a character encoding acquisition and identification module for obtaining and identifying the character encoding; a glyph index acquisition module for parsing the font data and obtaining the correspondence between character encodings and glyph indexes; and a mapping table generation module for generating the character-encoding-to-glyph-index mapping table; or

the font data processing device comprises a correspondence mapping table acquisition module for parsing the font data and locating one or more mapping tables from which the correspondence between character encodings and glyph indexes can ultimately be obtained; a character encoding conversion module for determining whether the character encoding used by the font is Unicode and, if it is not, converting it to Unicode; and a mapping table generation module for extracting, from the mapping tables located by the correspondence mapping table acquisition module, the glyph index corresponding to each character encoding and generating the character-encoding-to-glyph-index mapping table;

the character-encoding-to-glyph-index mapping table records the correspondence between the Unicode encoding of each character and its glyph index.

7. The system for processing font data according to claim 6, characterized in that: when the character encoding is a non-standard encoding, the character encoding acquisition and identification module also converts the non-standard encoding to a standard encoding.

8. The system for processing font data according to claim 6 or 7, characterized in that: the font data processing device further comprises a redundant data deletion module for deleting the redundant data in the font data.