CN105653548A - Method and system for identifying page type of electronic document - Google Patents

Info

Publication number
CN105653548A
Authority
CN
China
Prior art keywords
page
feature word
eigenvalue
type
reference page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410645725.6A
Other languages
Chinese (zh)
Inventor
冯浩然
郭巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd filed Critical Peking University Founder Group Co Ltd
Priority to CN201410645725.6A priority Critical patent/CN105653548A/en
Publication of CN105653548A publication Critical patent/CN105653548A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for identifying a page type of an electronic document. The method includes: 1) calculating characteristic words in the same type of reference pages and the number of times that the characteristic words occur, and calculating a reference threshold; 2) calculating the number of times that characteristic words in a target page occur, and calculating a characteristic value in the target page according to the number of times that characteristic words in the target page occur; and 3) determining the type of the target page through comparison of the characteristic value and the reference threshold. The method extracts the characteristic words of the reference pages and the number of times that the characteristic words occur in advance, and then extracts those characteristic words of the target page and the number of times that the characteristic words occur, and identifies the type of the target page through this consistent statistical law. The method overcomes the problem of low efficiency in page classification of the electronic document in the prior art, and provides a method for identifying a characteristic page of a format document associated with a text; and after automatic identification, the format electronic document page can be split into a cover page, a title page, a copyright page, a copyright statement page, a catalog page, a main body page and any subset of other extended characteristic pages according to the service demands, and in this way, the format electronic document page can be applied to different demand environments.

Description

Method and system for identifying the page type of an electronic document
Technical Field
The present invention relates to the field of electronic document processing, and specifically to a method and system for identifying the feature pages of an electronic document.
Background
An electronic document is written material that people produce in the course of social activity and that is carried on chemical or magnetic media such as computer disks, magnetic disks, and optical discs. It is accessed through a computer operating system or another operating system and can be transmitted over a communication network. Electronic documents mainly include electronic files, e-mail, electronic reports, electronic drawings, and the like. They are generally divided into flow documents and fixed-layout ("format") documents. A flow document has no fixed layout; its content is arranged according to the actual size of the display medium. A fixed-layout document is an electronic document whose page rendering is fixed: its presentation is independent of the device, and the page looks the same when it is read or printed on any device. Fixed-layout documents are mainly used for publishing, distributing, and archiving finalized documents.
The defining property of a fixed-layout document is that the layout is fixed and does not reflow. In an e-book stored as a fixed-layout document, the pages are generally divided into cover pages, title pages, colophons (copyright pages), copyright statement pages, catalogue pages, and text pages; cover pages are further divided into front covers and back covers. These different types of pages are the feature pages of the document. Users sometimes need to extract feature pages such as the cover page or the colophon in order to classify e-books, compile statistics, or for other purposes.
With the rapid development of information technology, how efficiently electronic documents are organized has become an important factor when users browse and search for information. Different reading requirements and different permission settings have created a demand for identifying the feature pages of documents. For example, to introduce a book, a library system can push the book's cover page, colophon, and catalogue to users, whereas the text pages are pushed only to authorized users. In addition, when a library system compiles classified statistics on books, the statistics must be compiled separately for the different feature pages. Automatically and efficiently extracting information from electronic document pages and classifying the pages has therefore become an urgent problem for digital publishing production systems, and the accuracy of information retrieval also depends largely on the result of feature page identification. At present, most page identification is done manually on the basis of accumulated experience, but as text document resources continue to grow, the need for automatic feature page identification is becoming more evident.
Summary of the Invention
The technical problem to be solved by the present invention is that, in the prior art, the feature pages of an electronic document must be extracted manually, which is time-consuming and laborious. A method and system for identifying the feature pages of an electronic document are therefore proposed.
To solve the above technical problem, the present invention provides a method and system for identifying the feature pages of an electronic document.
The present invention provides a method for identifying the page type of an electronic document, comprising:
determining the feature words of reference pages and their occurrence counts according to the type of the reference pages;
determining a baseline threshold according to the feature words of the reference pages and their occurrence counts;
obtaining the feature words that occur in a target page;
calculating an eigenvalue of the target page according to the feature words of the reference pages, their occurrence counts, and the feature words occurring in the target page;
comparing the eigenvalue with the baseline threshold to determine the type of the target page.
The present invention also provides a system for identifying the page type of an electronic document, comprising:
a reference page determination unit, which determines the feature words of reference pages and their occurrence counts according to the type of the reference pages;
a threshold determination unit, which determines a baseline threshold according to the feature words of the reference pages and their occurrence counts;
a feature word extraction unit, which obtains the feature words that occur in a target page;
a target page eigenvalue calculation unit, which calculates an eigenvalue of the target page according to the feature words of the reference pages, their occurrence counts, and the feature words occurring in the target page;
a comparison unit, which compares the eigenvalue with the baseline threshold to determine the type of the target page.
Compared with the prior art, the above technical solution of the present invention has the following advantages:
(1) The present invention provides a method for identifying the page type of an electronic document. First, the feature words in reference pages of the same type and their occurrence counts are collected, and a baseline threshold is calculated. Then the number of times the feature words occur in a target page is counted, and an eigenvalue of the target page is calculated from these counts. The type of the target page is determined by comparing the eigenvalue with the baseline threshold. The method extracts the feature words of the reference pages and their occurrence counts in advance, then extracts the same feature words and their occurrence counts from the target page, and identifies the type of the target page through this consistent statistical regularity. The method overcomes the low efficiency of existing electronic document page classification and provides a text-based method for identifying the feature pages of fixed-layout documents. After automatic identification, the pages of a fixed-layout electronic document can be split, according to business requirements, into any subset of cover pages, title pages, colophons, copyright statement pages, catalogue pages, text pages, and other extended feature pages, and can thus be applied in different demand environments.
(2) In the method of the present invention, the ratio of each feature word's occurrence count to the total occurrence count of all feature words is first calculated and used as the weight of that feature word, so that the weights of all feature words sum to 1. The weights of the feature words that occur in the target page are then summed, and this sum is used as the eigenvalue of the target page; whether the page belongs to a given page type is judged by the size of the eigenvalue. The scheme is simple to implement, yet it objectively reflects whether the target page and the reference pages share the features of the same page type, and hence whether they are consistent, so that the type can be determined.
(3) In the method of the present invention, the baseline threshold is the sum of the weights of a subset of the feature words. Using this value allows the eigenvalue of the target page to be compared quantitatively with the features of the reference pages, which is easy to calculate and implement and gives good objectivity and accuracy.
(4) In the method of the present invention, the eigenvalue can also be calculated as the deviation between the feature word occurrence counts in the target page and those in the reference pages. The more consistent the target page is with the features of the reference pages, the smaller the deviation. The size of this deviation therefore objectively reflects whether the target page is consistent with the reference pages and can be used to assess the type of the target page.
(5) In the method of the present invention, because feature word occurrence counts differ by orders of magnitude between different reference pages, a single unified baseline threshold cannot be set. The baseline threshold in this scheme is therefore calculated from the feature word occurrence counts in the reference pages, which expresses the page attributes of the different types of reference pages more objectively and improves identification accuracy.
(6) In the method of the present invention, the size of the baseline threshold can also be set as required when it is calculated. If page types with a good match and high accuracy are desired, the threshold can be set higher; if the aim is to obtain the target pages of a page type as comprehensively as possible, the threshold can be set lower. Setting the threshold according to different needs gives the scheme better practicality and adaptability.
(7) The present invention provides a system for identifying the page type of an electronic document, comprising a reference page determination unit, a threshold determination unit, a feature word extraction unit, a target page eigenvalue calculation unit, and a comparison unit. The system overcomes the low efficiency of existing electronic document page classification. After automatic identification, the pages of a fixed-layout electronic document can be split, according to business requirements, into any subset of cover pages, title pages, colophons, copyright statement pages, catalogue pages, text pages, and other extended feature pages, making the system suitable for different demand environments.
Brief Description of the Drawings
To make the content of the present invention easier to understand, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings, in which:
Fig. 1 is a flow chart of the method for identifying the page type of an electronic document in one embodiment of the invention;
Fig. 2 and Fig. 3 are schematic diagrams of feature pages used in the method for identifying the page type of an electronic document in one embodiment of the invention;
Fig. 4 is a structural diagram of the system for identifying the page type of an electronic document in one embodiment of the invention.
Detailed Description
Specific embodiments of the electronic document page type identification method of the present invention are given below. The scheme can be used to extract pages of different types, such as catalogue pages, text pages, colophons, front cover pages, and back cover pages, from electronic journals or e-books. It can be implemented by a computer, server, or similar device holding instructions that carry out the following process.
Embodiment 1:
A method for identifying the page type of an electronic document, comprising:
S1: Determine the feature words of the reference pages and their occurrence counts according to the type of the reference pages.
First, pages of known type are selected from the digital resources as reference pages. For colophons, for example, the colophons of a large number of e-books are selected as reference pages. The words in these reference pages are then extracted to obtain representative feature words that occur frequently, and the occurrence counts of these words are recorded.
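Step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the reference pages have already been segmented into word lists, and it takes "representative" to mean simply "most frequent". The function name `extract_feature_words` and the parameter `top_k` are hypothetical.

```python
from collections import Counter

def extract_feature_words(reference_pages, top_k=3):
    """Count words across reference pages of one type and keep the most frequent.

    reference_pages: list of token lists (assumed to be already segmented).
    Returns a dict mapping each feature word to its occurrence count P_k.
    """
    counts = Counter()
    for tokens in reference_pages:
        counts.update(tokens)
    # Keeping the top_k most frequent words is an assumption of this sketch.
    return dict(counts.most_common(top_k))

# Three toy "colophon" reference pages.
pages = [
    ["ISBN", "publisher", "edition", "price"],
    ["ISBN", "publisher", "printing"],
    ["ISBN", "edition", "publisher"],
]
features = extract_feature_words(pages, top_k=3)
```

In practice a stop-word filter or a learned selection step would probably be needed so that common function words are not picked up as feature words.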
S2: Determine the baseline threshold according to the feature words of the reference pages and their occurrence counts.
Here, values can be set manually as baseline thresholds for the different page types on the basis of statistics. For example, the baseline threshold is 0.7 for colophons and 0.6 for text pages. The criterion for choosing these values is that a page whose eigenvalue exceeds the threshold has enough of the page features of that type to be considered a page of that type.
S3: Obtain the occurrence counts of the feature words in the target page.
The text in the target page is recognized and segmented into words. These words are matched against the feature words obtained in step S1, which gives the occurrence count of each feature word in the target page.
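The matching in step S3 might look like the sketch below. It assumes whitespace-delimited text and uses a simple regular expression as the tokenizer; Chinese text would need a real word segmenter instead. The function name `count_feature_words` is hypothetical.

```python
import re
from collections import Counter

def count_feature_words(page_text, feature_words):
    """Return the occurrence count of each feature word in one page.

    Tokenization by r"\w+" and case-insensitive matching are assumptions
    of this sketch, not requirements of the method.
    """
    tokens = re.findall(r"\w+", page_text.lower())
    counts = Counter(tokens)
    # Feature words that never occur get a count of 0.
    return {word: counts[word.lower()] for word in feature_words}

target_counts = count_feature_words(
    "ISBN and publisher details; ISBN again",
    ["ISBN", "publisher", "edition"],
)
```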
S4: Calculate the eigenvalue of the target page according to the feature words of the reference pages, their occurrence counts, and the occurrence of the feature words in the target page.
First, the weight of each feature word is calculated from the feature words of the reference pages and their occurrence counts:
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the $k$-th feature word, $P_k$ is the number of times the $k$-th feature word occurs, and $n$ is the number of feature words.
Then the feature words that occur in the target page are identified, and the weights calculated above for each feature word that occurs are summed to give the eigenvalue of the target page. Evidently, the set of feature words occurring in the target page is a subset of the set of feature words occurring in the reference pages. The weights of all feature words in the reference pages sum to 1, and the feature words occurring in the target page are at most all of the feature words of the reference pages, so the eigenvalue obtained by adding the weights of the feature words occurring in the target page is necessarily less than or equal to 1.
Here, whether the target page belongs to a given page type is judged by the size of the eigenvalue. The scheme is simple to implement, yet it objectively reflects whether the target page shares the features of the reference pages' page type, and hence whether the two are consistent, so that the type can be determined.
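The weight formula and the eigenvalue sum in step S4 can be sketched as follows. The function names and the toy counts are illustrative assumptions; only the arithmetic follows the text.

```python
def feature_weights(ref_counts):
    """W_k = P_k / sum(P_i): each feature word's share of all occurrences."""
    total = sum(ref_counts.values())
    return {word: count / total for word, count in ref_counts.items()}

def weighted_eigenvalue(weights, target_words):
    """Sum the weights of the feature words that occur in the target page."""
    present = set(target_words)
    return sum(weight for word, weight in weights.items() if word in present)

# Hypothetical occurrence counts from colophon reference pages.
ref_counts = {"ISBN": 50, "publisher": 30, "edition": 15, "price": 5}
weights = feature_weights(ref_counts)

# "foreword" is not a feature word, so it contributes nothing.
value = weighted_eigenvalue(weights, ["ISBN", "edition", "foreword"])
```

Each feature word contributes its weight at most once, however often it appears in the target page, which is why the eigenvalue cannot exceed 1.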
S5: Compare the eigenvalue with the baseline threshold to determine the type of the target page.
Once the eigenvalue of the target page has been obtained, the target page belongs to the type of the reference pages if the eigenvalue is greater than the baseline threshold; otherwise it does not.
By testing the target page against each page type in this way, the type of the target page can be obtained.
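Testing one target page against several page types, as described in step S5, could be sketched like this. The structure `type_models` (a mapping from type name to a weights/threshold pair) and the toy weights are assumptions made for illustration; the thresholds 0.7 and 0.6 are the example values given in step S2.

```python
def identify_page_types(target_words, type_models):
    """Return every page type whose threshold the target page exceeds.

    type_models: {type_name: (weights, baseline_threshold)}
    """
    present = set(target_words)
    matches = []
    for name, (weights, threshold) in type_models.items():
        eigenvalue = sum(w for word, w in weights.items() if word in present)
        if eigenvalue > threshold:  # strictly greater, as in step S5
            matches.append(name)
    return matches

type_models = {
    "colophon": ({"ISBN": 0.5, "publisher": 0.3, "edition": 0.2}, 0.7),
    "text": ({"chapter": 0.6, "section": 0.4}, 0.6),
}
matched = identify_page_types(["ISBN", "publisher", "chapter"], type_models)
```

Here the colophon eigenvalue is 0.8, which exceeds 0.7, while the text page eigenvalue of 0.6 does not exceed its threshold of 0.6.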
In this embodiment, the baseline thresholds for the various page types are calibrated manually on the basis of prior statistics. In other implementations, the baseline threshold can also be calculated from the feature words of the reference pages and their occurrence counts. For example, the weights of all feature words in the reference pages are sorted in descending order, and the sum of a certain number of the top-ranked weights is taken as the baseline threshold, where the number chosen is 20% to 40% of the total number of feature words. Here the baseline threshold is the sum of the weights of a subset of the feature words. Using this value allows the eigenvalue of the target page to be compared quantitatively with the features of the reference pages, which is easy to calculate and implement and gives good objectivity and accuracy.
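A computed baseline threshold of this kind can be sketched as below. The text gives only the 20% to 40% range; rounding the number of selected weights up and always selecting at least one are assumptions of this sketch, as is the function name `weight_threshold`.

```python
import math

def weight_threshold(weights, fraction=0.3):
    """Sum the largest weights, taking `fraction` of the feature words."""
    ranked = sorted(weights.values(), reverse=True)
    # Rounding up and the minimum of one word are assumptions.
    m = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:m])

weights = {"ISBN": 0.4, "publisher": 0.25, "edition": 0.15,
           "price": 0.1, "printing": 0.1}
threshold = weight_threshold(weights, fraction=0.3)  # top 2 of 5 weights
```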
From the above analysis it can be seen that the size of the baseline threshold can also be set as required when it is calculated. If page types with a good match and high accuracy are desired, the threshold can be set higher; if the aim is to obtain the target pages of a page type as comprehensively as possible, the threshold can be set lower. Setting the threshold according to different needs gives the scheme better practicality and adaptability.
In other alternative implementations, the eigenvalue of the target page is calculated with the following formula:
$$\sum_{i=1}^{n} (P_i - M_i)^2$$
where $P_i$ is the number of times the $i$-th feature word occurs in the reference pages, $M_i$ is the number of times the $i$-th feature word occurs in the target page, and $n$ is the number of feature words.
Here the eigenvalue is calculated as the deviation between the feature word occurrence counts in the target page and those in the reference pages. The more consistent the target page is with the features of the reference pages, the smaller the deviation. The size of this deviation therefore objectively reflects whether the target page is consistent with the reference pages and can be used to assess the type of the target page.
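The deviation-based eigenvalue can be sketched as a sum of squared differences. The sketch assumes the reference counts are on a scale comparable to a single page (for instance, typical per-page counts); the text does not say how counts gathered from many reference pages are normalized. The function name is hypothetical.

```python
def deviation_eigenvalue(ref_counts, target_counts):
    """Sum of (P_i - M_i)^2 over all feature words.

    Feature words missing from the target page count as M_i = 0.
    """
    return sum(
        (p - target_counts.get(word, 0)) ** 2
        for word, p in ref_counts.items()
    )

ref_counts = {"ISBN": 3, "publisher": 2, "edition": 1}
target_counts = {"ISBN": 2, "publisher": 2}
deviation = deviation_eigenvalue(ref_counts, target_counts)
```

A target page whose counts match the reference counts exactly has a deviation of 0, the smallest possible value.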
In this case, the corresponding baseline threshold is calculated as follows:
the occurrence counts of the feature words in the reference pages are sorted in ascending order;
a certain number of the top-ranked (smallest) counts are selected and summed to give the baseline threshold, where the number selected is 10% to 20% of the total number of feature words.
Because feature word occurrence counts differ by orders of magnitude between different reference pages, a single unified baseline threshold cannot be set. The baseline threshold in this scheme is therefore calculated from the feature word occurrence counts in the reference pages, which expresses the page attributes of the different types of reference pages more objectively and improves identification accuracy. In this case, the smaller the eigenvalue the better, so a target page whose eigenvalue is less than the baseline threshold has the same type as the reference pages. If an accurate page type is desired, the threshold should be set smaller; to obtain the pages of the type comprehensively, the threshold should be set correspondingly larger. The choice is made as required.
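The count-based baseline threshold and the reversed comparison can be sketched as follows. As before, rounding the number of selected counts up and selecting at least one are assumptions, and the function names are hypothetical.

```python
import math

def count_threshold(ref_counts, fraction=0.2):
    """Sum the smallest occurrence counts, taking `fraction` of the words."""
    ranked = sorted(ref_counts.values())  # ascending order
    m = max(1, math.ceil(len(ranked) * fraction))  # rounding is an assumption
    return sum(ranked[:m])

def matches_by_deviation(deviation, threshold):
    """With a deviation eigenvalue, smaller is better: match if below threshold."""
    return deviation < threshold

ref_counts = {"ISBN": 9, "publisher": 7, "edition": 4, "price": 2, "printing": 3}
threshold = count_threshold(ref_counts, fraction=0.2)  # smallest 1 of 5 counts
is_match = matches_by_deviation(1, threshold)
```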
In the method of this embodiment for identifying the page type of an electronic document, the feature words in reference pages of the same type and their occurrence counts are first collected, and a baseline threshold is determined. Then the number of times the feature words occur in the target page is counted, and the eigenvalue of the target page is calculated from these counts. The type of the target page is determined by comparing the eigenvalue with the baseline threshold. The method extracts the feature words of the reference pages and their occurrence counts in advance, then extracts the same feature words and their occurrence counts from the target page, and identifies the type of the target page through this consistent statistical regularity. The method overcomes the low efficiency of existing electronic document page classification and provides a text-based method for identifying the feature pages of fixed-layout documents. After automatic identification, the pages of a fixed-layout electronic document can be split, according to business requirements, into any subset of cover pages, title pages, colophons, copyright statement pages, catalogue pages, text pages, and other extended feature pages, and can thus be applied in different demand environments.
Embodiment 2:
This embodiment provides another method for identifying the page type of an electronic document, used to identify the feature pages in a fixed-layout electronic document. It comprises the following steps:
(1) Determine the feature words of the reference pages and their occurrence counts according to the type of the reference pages.
A series of document libraries is searched, and all required feature pages are identified manually, on the basis of experience, to serve as reference pages. Within practical limits the number of samples should be as large as possible: the larger the sample, the more reliable the resulting reference page parameters.
All of the text content can be obtained directly from each page of the electronic document.
The text of all confirmed feature pages is extracted and analyzed, and a published machine learning training method is used to derive a series of feature words and the number of times each feature word occurs.
(2) Perform feature word verification against the reference pages identified in the library to obtain the baseline threshold.
The occurrence count of each feature word from step (1) determines the weight of that word in the overall reference page identification process, and the two are directly proportional. The weight of each feature word is calculated from the feature words of the reference pages and their occurrence counts as follows:
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the $k$-th feature word, $P_k$ is the number of times the $k$-th feature word occurs, and $n$ is the number of feature words.
This gives the set of weights of all feature words.
The baseline threshold is calculated by sorting the weights of the feature words in descending order and summing the top-ranked 30% of the weights.
(3) For the target page, obtain all of its text and segment it into words, then match the resulting words against the feature words to obtain the occurrence of the feature words in the target page.
(4) Calculate the eigenvalue of the page to be classified from the feature words obtained and their corresponding weights.
Specifically, the weights corresponding to all feature words that occur in the target page are added together, and the result is the eigenvalue. Because the feature words occurring in the target page are a subset of all feature words of the reference pages, the eigenvalue is at most 1. The larger the eigenvalue, the more of the reference pages' feature words occur in the target page, and the greater the similarity to the reference pages.
(5) Compare the eigenvalue with the baseline threshold to determine the type of the target page. That is, the feature attribute of the page to be classified is judged by whether the eigenvalue reaches or exceeds a threshold. The threshold here is the baseline threshold calculated in step (2). When the eigenvalue is greater than the baseline threshold, the target page belongs to the type of the reference pages; otherwise it does not.
Classifying document pages in this way can greatly improve the efficiency and accuracy of automatic archiving of large document collections. The other approaches mentioned in Embodiment 1 can also be used to calculate the baseline threshold and the eigenvalue.
Embodiment 3:
This embodiment provides a concrete application example showing how the type of a target page is determined. It comprises the following steps:
1. Search the library and identify the required feature pages on the basis of experience, for example colophons and catalogue pages.
2. Extract the text content of all required feature pages.
3. Using a known machine learning method (such as ... method or ... method, provide the title of two kinds of known methods), train on all of the text content of all colophons, then train on all of the text content of the catalogue pages, and derive a series of feature words.
For example, the series of feature words corresponding to the colophon is:
$T = \{T_1, T_2, T_3, \ldots, T_n\}$
The occurrence count corresponding to each feature word is then obtained:
$P = \{P_1, P_2, P_3, \ldots, P_n\}$
From these, the weight of each feature word is calculated:
$W = \{W_1, W_2, W_3, \ldots, W_n\}$
The weight is calculated as:
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the $k$-th feature word, $P_k$ is the number of times the $k$-th feature word occurs, and $n$ is the number of feature words.
4. Perform feature word verification against the feature pages identified in the library, using regular expression matching or an equivalent algorithm to obtain the feature word set of the feature page:
$L = \{L_1, L_2, L_3, \ldots, L_m\}$, where $L$ is a subset of $T$ ($m \le n$); a portion of the feature words is chosen as required.
The baseline threshold is the sum of the weights in $W$ corresponding to the words in the feature word set $L$.
5. Calculate the eigenvalue of the page to be classified from the feature words obtained and their corresponding weights. Using an algorithm similar to that of step 4, obtain the feature words that occur in the page to be classified, look up the weights of these words in $W$, and sum them to give the eigenvalue $V$, which is a number no greater than 1.
6. Determine the feature attribute of the page to be classified according to whether the eigenvalue $V$ reaches or exceeds the baseline threshold. If the eigenvalue is greater than the baseline threshold, the page to be classified is judged to belong to this feature page type; otherwise it is judged not to.
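Steps 3 to 6 of this embodiment can be strung together in a short sketch. The feature words, counts, subset `L`, and page text are invented for illustration; plain substring matching via `re.search` stands in for the "regular expression matching or an equivalent algorithm" of step 4.

```python
import re

def present_feature_words(text, feature_words):
    """Feature words found in the page text, matched as literal patterns."""
    return [w for w in feature_words if re.search(re.escape(w), text)]

# Step 3: hypothetical feature words T with occurrence counts P, and weights W.
P = {"ISBN": 40, "publisher": 30, "printing": 20, "price": 10}
total = sum(P.values())
W = {word: count / total for word, count in P.items()}

# Step 4: a chosen subset L of T; the baseline threshold is its weight sum.
L = ["ISBN", "publisher"]
baseline = sum(W[word] for word in L)

# Step 5: eigenvalue V of a page to be classified.
page_text = "ISBN publisher price"
V = sum(W[word] for word in present_feature_words(page_text, list(W)))

# Step 6: the page is a colophon if V exceeds the baseline threshold.
is_colophon = V > baseline
```

With these numbers the baseline threshold is 0.7 and the eigenvalue is about 0.8, so the page is judged to be a colophon.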
Embodiment 4:
This embodiment provides another implementation, comprising the following process:
1. Search the library and identify the required feature pages on the basis of experience, for example colophons and catalogue pages, as shown in Fig. 2 and Fig. 3.
2. Extract the text content of all required feature pages.
3. Using a known machine learning method, train on all of the text content of all colophons, then train on all of the text content of the catalogue pages, and derive a series of feature words.
For example, the series of feature words corresponding to the colophon is:
$T = \{T_1, T_2, T_3, \ldots, T_n\}$
The occurrence count corresponding to each feature word is then obtained:
$P = \{P_1, P_2, P_3, \ldots, P_n\}$
4. Count the words in the page to be classified, obtain the feature words that occur in it, and count their occurrences. A feature word that does not occur has an occurrence count of 0. This gives the number of times each feature word occurs in the page to be classified:
$M = \{M_1, M_2, M_3, \ldots, M_n\}$, where $M_i$ is the number of times feature word $T_i$ occurs in the text to be classified.
The deviation between $M$ and $P$ is then calculated as the eigenvalue, using the deviation formula of Embodiment 1: $\sum_{i=1}^{n} (P_i - M_i)^2$.
5. Calculate the baseline threshold.
Sort the occurrence counts of the feature words in the feature pages in ascending order, select a certain number of the top-ranked counts, and sum them to give the baseline threshold; for example, the sum of the first 10% of the counts.
Because feature word occurrence counts differ by orders of magnitude between different reference pages, a single unified baseline threshold cannot be set. The baseline threshold in this scheme is therefore calculated from the feature word occurrence counts in the reference pages, which expresses the page attributes of the different types of reference pages more objectively and improves identification accuracy. In this case, the smaller the eigenvalue the better, so a target page whose eigenvalue is less than the baseline threshold has the same type as the reference pages. If an accurate page type is desired, the threshold should be set smaller; to obtain the pages of the type comprehensively, the threshold should be set correspondingly larger. The choice is made as required.
6. Determine the feature attribute of the page to be classified from the relationship between the eigenvalue and the baseline threshold. Because the eigenvalue is a deviation, the smaller the deviation, the higher the similarity. If the eigenvalue is less than the baseline threshold, the page to be classified is judged to belong to this feature page type; otherwise it is judged not to.
Embodiment 5:
The present invention also provides an electronic document page identification system that can implement the methods described in Embodiments 1 to 4. As shown in Fig. 4, it comprises:
a reference page determination unit, which determines the feature words of reference pages and their occurrence counts according to the type of the reference pages;
a threshold determination unit, which determines a baseline threshold according to the feature words of the reference pages and their occurrence counts;
a feature word extraction unit, which obtains the feature words that occur in a target page;
a target page eigenvalue calculation unit, which calculates an eigenvalue of the target page according to the feature words of the reference pages, their occurrence counts, and the feature words occurring in the target page;
a comparison unit, which compares the eigenvalue with the baseline threshold to determine the type of the target page.
In the present embodiment, the target page eigenvalue calculation unit includes:
Weight determination unit: calculates the weight of each feature word according to the feature words of the reference page and their occurrence counts,
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the k-th feature word, $P_k$ is the number of times the k-th feature word occurs, and $n$ is the number of feature words;
Statistical calculation unit: identifies the feature words that occur in the target page and sums the weights corresponding to each feature word that occurs, obtaining the eigenvalue of the target page.
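The weight formula and the weight summation can be sketched in Python as follows. The function names and the sample data are hypothetical, and the sketch assumes that each feature word contributes its weight once if it appears in the target page at all, regardless of how often; the text does not say whether repeated occurrences are counted more than once.

```python
def feature_word_weights(reference_counts):
    """W_k = P_k / sum(P_i) for every feature word of the reference page."""
    total = sum(reference_counts.values())
    if total == 0:
        raise ValueError("reference counts must sum to a positive number")
    return {word: count / total for word, count in reference_counts.items()}


def target_eigenvalue(weights, target_words):
    """Sum the weights of the feature words that occur in the target page."""
    present = set(target_words)  # assumption: each word is counted once
    return sum(w for word, w in weights.items() if word in present)


# Hypothetical reference counts for a catalogue-page type.
reference_counts = {"contents": 6, "chapter": 3, "page": 1}
weights = feature_word_weights(reference_counts)

# Hypothetical words found in a target page.
target_words = ["contents", "chapter", "one", "chapter"]
eigenvalue = target_eigenvalue(weights, target_words)
```

Because the weights sum to 1, the eigenvalue obtained this way always lies between 0 and 1, with larger values indicating that more of the heavily weighted feature words are present.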
In the present embodiment, the threshold value determination unit includes:
Weight calculation unit: calculates the weight of each feature word,
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the k-th feature word, $P_k$ is the number of times the k-th feature word occurs, and $n$ is the number of feature words;
Threshold calculation unit: arranges the weights in descending order and takes the sum of the first m weights as the baseline threshold, where m is 20% to 40% of n.
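A short Python sketch of this weight-based threshold follows. The function name, the sample weights, and the rounding of m to the nearest whole number (with a minimum of one) are assumptions; the text only fixes m at 20% to 40% of n.

```python
def baseline_threshold_from_weights(weights, fraction=0.3):
    """Sum the m largest weights, where m is `fraction` of the word count."""
    if not weights:
        raise ValueError("weights must not be empty")
    if not 0.2 <= fraction <= 0.4:
        raise ValueError("the text specifies m as 20% to 40% of n")
    descending = sorted(weights.values(), reverse=True)
    # Assumption: round to the nearest whole number of weights, at least one.
    m = max(1, round(fraction * len(descending)))
    return sum(descending[:m])


# Hypothetical weights for five feature words (they sum to 1).
weights = {
    "contents": 0.40, "chapter": 0.25, "section": 0.20,
    "index": 0.10, "notes": 0.05,
}
threshold = baseline_threshold_from_weights(weights, fraction=0.4)
```

With five feature words and a fraction of 0.4, m is 2, so the threshold is the sum of the two largest weights (about 0.65). A target page whose summed weights exceed this value contains roughly as much feature-word mass as the strongest feature words of the reference type.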
In alternative implementations, the target page eigenvalue calculation unit uses the following formula:
$$\sum_{i=1}^{n} (P_i - M_i)^2$$
where $P_i$ is the number of times the i-th feature word occurs in the reference page, $M_i$ is the number of times the i-th feature word occurs in the feature page, and $n$ is the number of feature words.
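The sum of squared differences can be written as a one-line Python function. The sketch below assumes that $M_i$ is the count observed in the page being identified and that a feature word missing from that page has a count of zero; the names and numbers are hypothetical.

```python
def squared_deviation(reference_counts, page_counts):
    """Sum over all feature words of (P_i - M_i) squared."""
    return sum(
        (p - page_counts.get(word, 0)) ** 2  # missing word: M_i taken as 0
        for word, p in reference_counts.items()
    )


# Hypothetical counts for a copyright-page type and one page to be identified.
reference_counts = {"copyright": 4, "isbn": 2, "publisher": 3}
page_counts = {"copyright": 3, "isbn": 2}

deviation = squared_deviation(reference_counts, page_counts)
```

Here the three terms are (4 - 3)^2 = 1, (2 - 2)^2 = 0 and (3 - 0)^2 = 9, giving a deviation of 10. A page whose counts equal the reference counts exactly yields 0, the smallest possible value.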
In this implementation, the threshold value determination unit includes:
Sorting subunit: arranges the occurrence counts of the feature words in the reference page in ascending order;
Threshold calculation subunit: takes the sum of the first m counts as the baseline threshold, where m is 10% to 20% of n.
In the present embodiment, the reference page determination unit includes:
Selection subunit: for each type, selects reference pages of the same type from a database;
Identification subunit: identifies the text content in each reference page;
Training subunit: trains on the text content to obtain the feature words and the number of times each feature word occurs.
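The text does not specify how the training subunit derives feature words. As one possible illustration only, the Python sketch below counts word frequencies across reference pages of one type and keeps the most frequent words as feature words. The function name, the `top_k` parameter, and the regular-expression tokenizer are assumptions; Chinese text would need a proper word segmenter in place of the tokenizer.

```python
import re
from collections import Counter


def extract_feature_words(reference_texts, top_k=3):
    """Count words over all reference pages and keep the top_k most frequent."""
    counts = Counter()
    for text in reference_texts:
        # Assumption: lowercase, split on runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return dict(counts.most_common(top_k))


# Hypothetical text content of two catalogue-type reference pages.
reference_texts = [
    "Contents Chapter 1 Chapter 2",
    "Contents Chapter 3 Index",
]
feature_counts = extract_feature_words(reference_texts, top_k=2)
```

For this toy input, "chapter" occurs three times and "contents" twice, so those two words and their counts become the feature words of the type. The resulting dictionary has the shape expected by the weight and threshold calculations described for the other units.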
In the present embodiment, the comparing unit operates as follows: when the eigenvalue is greater than the baseline threshold, the target page belongs to the type of the reference page; otherwise it does not.
In the corresponding alternative embodiments, the comparing unit operates as follows: when the eigenvalue is less than the baseline threshold, the target page belongs to the type of the reference page; otherwise it does not.
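The two comparison rules differ only in direction, which depends on how the eigenvalue was calculated. A small Python sketch, with a hypothetical function name and flag, makes the pairing explicit: weight sums are compared with "greater than", deviations with "less than".

```python
def belongs_to_reference_type(eigenvalue, threshold, eigenvalue_is_deviation):
    """Apply the comparison rule that matches the eigenvalue definition."""
    if eigenvalue_is_deviation:
        # Sum of squared differences: smaller means more similar.
        return eigenvalue < threshold
    # Sum of feature-word weights: larger means more similar.
    return eigenvalue > threshold


# Hypothetical values for each variant.
weight_based = belongs_to_reference_type(0.9, 0.65, eigenvalue_is_deviation=False)
deviation_based = belongs_to_reference_type(10, 11, eigenvalue_is_deviation=True)
```

Both comparisons are strict, as in the text, so an eigenvalue exactly equal to the baseline threshold is treated as not belonging to the type under either rule.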
The present embodiment provides an identification system for electronic document page types, including a reference page determination unit, a threshold value determination unit, a feature word extraction unit, a target page eigenvalue calculation unit, and a comparing unit. The system overcomes the low efficiency of existing electronic document page classification. After automatic identification, the pages of a format electronic document can be split, according to business demands, into any subset of cover pages, title pages, copyright pages, copyright statement pages, catalogue pages, body text pages, and other extended feature pages, so that the system suits different demand environments.
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious changes or variations derived from the above remain within the protection scope of the invention.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make other changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims (16)

1. A method for identifying the page type of an electronic document, characterized by comprising:
determining the feature words of a reference page and their occurrence counts according to the type of the reference page;
determining a baseline threshold according to the feature words of the reference page and their occurrence counts;
obtaining the feature words that occur in a target page;
calculating an eigenvalue of the target page according to the feature words of the reference page, their occurrence counts, and the feature words that occur in the target page;
comparing the eigenvalue with the baseline threshold to determine the type of the target page.
2. The method according to claim 1, characterized in that the process of calculating the eigenvalue of the target page comprises:
first, calculating the weight of each feature word according to the feature words of the reference page and their occurrence counts,
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the k-th feature word, $P_k$ is the number of times the k-th feature word occurs, and $n$ is the number of feature words;
then, identifying the feature words that occur in the target page and summing the weights corresponding to each feature word that occurs, obtaining the eigenvalue of the target page.
3. The method according to claim 2, characterized in that the process of determining the baseline threshold according to the feature words of the reference page and their occurrence counts comprises:
first, calculating the weight of each feature word,
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the k-th feature word, $P_k$ is the number of times the k-th feature word occurs, and $n$ is the number of feature words;
then, arranging the weights in descending order and taking the sum of the first m weights as the baseline threshold, where m is 20% to 40% of n.
4. The method according to claim 1, characterized in that, in the process of calculating the eigenvalue of the target page, the calculation formula is as follows:
$$\sum_{i=1}^{n} (P_i - M_i)^2$$
where $P_i$ is the number of times the i-th feature word occurs in the reference page, $M_i$ is the number of times the i-th feature word occurs in the feature page, and $n$ is the number of feature words.
5. The method according to claim 4, characterized in that the process of determining the baseline threshold according to the feature words of the reference page and their occurrence counts comprises:
arranging the occurrence counts of the feature words in the reference page in ascending order;
taking the sum of the first m counts as the baseline threshold, where m is 10% to 20% of n.
6. The method according to any one of claims 1-5, characterized in that the process of determining the feature words of the reference page and their occurrence counts according to the type of the reference page comprises:
for each type, selecting reference pages of the same type from a database;
identifying the text content in each reference page;
training on the text content to obtain the feature words and the number of times each feature word occurs.
7. The method according to claim 3, characterized in that the process of comparing the eigenvalue with the baseline threshold to determine the type of the target page comprises: when the eigenvalue is greater than the baseline threshold, the target page belongs to the type of the reference page; otherwise it does not.
8. The method according to claim 5, characterized in that the process of comparing the eigenvalue with the baseline threshold to determine the type of the target page comprises: when the eigenvalue is less than the baseline threshold, the target page belongs to the type of the reference page; otherwise it does not.
9. A system for identifying the page type of an electronic document, characterized by comprising:
a reference page determination unit, which determines the feature words of a reference page and their occurrence counts according to the type of the reference page;
a threshold value determination unit, which determines a baseline threshold according to the feature words of the reference page and their occurrence counts;
a feature word extraction unit, which obtains the feature words that occur in a target page;
a target page eigenvalue calculation unit, which calculates an eigenvalue of the target page according to the feature words of the reference page, their occurrence counts, and the feature words that occur in the target page;
a comparing unit, which compares the eigenvalue with the baseline threshold to determine the type of the target page.
10. The system according to claim 9, characterized in that the target page eigenvalue calculation unit comprises:
a weight determination unit, which calculates the weight of each feature word according to the feature words of the reference page and their occurrence counts,
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the k-th feature word, $P_k$ is the number of times the k-th feature word occurs, and $n$ is the number of feature words;
a statistical calculation unit, which identifies the feature words that occur in the target page and sums the weights corresponding to each feature word that occurs, obtaining the eigenvalue of the target page.
11. The system according to claim 10, characterized in that the threshold value determination unit comprises:
a weight calculation unit, which calculates the weight of each feature word,
$$W_k = \frac{P_k}{\sum_{i=1}^{n} P_i}$$
where $W_k$ is the weight of the k-th feature word, $P_k$ is the number of times the k-th feature word occurs, and $n$ is the number of feature words;
a threshold calculation unit, which arranges the weights in descending order and takes the sum of the first m weights as the baseline threshold, where m is 20% to 40% of n.
12. The system according to claim 9, characterized in that the target page eigenvalue calculation unit uses the following calculation formula:
$$\sum_{i=1}^{n} (P_i - M_i)^2$$
where $P_i$ is the number of times the i-th feature word occurs in the reference page, $M_i$ is the number of times the i-th feature word occurs in the feature page, and $n$ is the number of feature words.
13. The system according to claim 12, characterized in that the threshold value determination unit comprises:
a sorting subunit, which arranges the occurrence counts of the feature words in the reference page in ascending order;
a threshold calculation subunit, which takes the sum of the first m counts as the baseline threshold, where m is 10% to 20% of n.
14. The system according to any one of claims 9-13, characterized in that the reference page determination unit comprises:
a selection subunit, which, for each type, selects reference pages of the same type from a database;
an identification subunit, which identifies the text content in each reference page;
a training subunit, which trains on the text content to obtain the feature words and the number of times each feature word occurs.
15. The system according to claim 11, characterized in that the comparing unit operates as follows: when the eigenvalue is greater than the baseline threshold, the target page belongs to the type of the reference page; otherwise it does not.
16. The system according to claim 13, characterized in that the comparing unit operates as follows: when the eigenvalue is less than the baseline threshold, the target page belongs to the type of the reference page; otherwise it does not.
CN201410645725.6A 2014-11-12 2014-11-12 Method and system for identifying page type of electronic document Pending CN105653548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410645725.6A CN105653548A (en) 2014-11-12 2014-11-12 Method and system for identifying page type of electronic document

Publications (1)

Publication Number Publication Date
CN105653548A true CN105653548A (en) 2016-06-08

Family

ID=56478934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410645725.6A Pending CN105653548A (en) 2014-11-12 2014-11-12 Method and system for identifying page type of electronic document

Country Status (1)

Country Link
CN (1) CN105653548A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6191792B1 (en) * 1997-02-10 2001-02-20 Nippon Telegraph And Telephone Corporation Scheme for automatic data conversion definition generation according to data feature in visual multidimensional data analysis tool
CN102819597A (en) * 2012-08-13 2012-12-12 北京星网锐捷网络技术有限公司 Web page classification method and equipment
CN102831246A (en) * 2012-09-17 2012-12-19 中央民族大学 Method and device for classification of Tibetan webpage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
熊海涛: "《复杂数据分析方法及其应用研究》", 31 May 2013 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268457A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of file classification method and device based on SVM
CN108628869A (en) * 2017-03-16 2018-10-09 富士施乐实业发展(中国)有限公司 A kind of method and apparatus that category division is carried out to electronic document
CN107679196A (en) * 2017-10-10 2018-02-09 中国移动通信集团公司 A kind of multimedia recognition methods, electronic equipment and storage medium
CN111507123A (en) * 2019-01-11 2020-08-07 北京字节跳动网络技术有限公司 Method and device for placing reading materials, reading equipment, electronic equipment and medium
CN112861985A (en) * 2021-02-24 2021-05-28 郑州轻工业大学 Automatic book classification method based on artificial intelligence
CN112861985B (en) * 2021-02-24 2023-01-31 郑州轻工业大学 Automatic book classification method based on artificial intelligence
CN116383390A (en) * 2023-06-05 2023-07-04 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform
CN116383390B (en) * 2023-06-05 2023-08-08 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160608