CN107633039A - A method for cutting PDF files related to the equity-transfer topic - Google Patents
- Publication number
- CN107633039A (application CN201710823110.1A)
- Authority
- CN
- China
- Prior art keywords
- page number
- pdf document
- srt
- right transfer
- information set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method for cutting PDF files related to the equity-transfer topic, comprising the following steps: 1) obtaining publicly disclosed business documents stored in PDF format by means of distributed web crawler technology; 2) determining the language expression features, keywords and keyword titles of the PDF files related to the equity-transfer topic; 3) determining the page number information set P of the PDF file pages containing the keywords and keyword titles; 4) removing the abnormal page numbers from the set P obtained in step 3) by means of a page-number anomaly removal mechanism, yielding the pruned page number information set P_final; 5) cutting the source PDF file on the equity-transfer topic according to the set P_final obtained in step 4), thereby completing the cutting of the PDF file related to the equity-transfer topic. The method can cut PDF files related to the equity-transfer topic efficiently and accurately.
Description
Technical field
The invention belongs to the field of structuring unstructured data in big data research, and relates to a
method for cutting PDF files related to the equity-transfer topic.
Background art
Unstructured data includes files in WORD, EXCEL, PDF, TXT, audio and video formats. Converting such data
into user-friendly structured data that can be used directly for statistical analysis and applications,
including data stored in SQL or Oracle form, is both an urgent demand and a research difficulty in the
current big data application field.
Some results have already been achieved in structuring relatively short PDF files. The main idea of the
existing methods in the literature is to first convert the source PDF document, which is completely
unstructured data, into a file in XML or WORD form, which is semi-structured data, and then to convert
that file by regular-expression methods into structured data in SQL or Oracle form. When the XML or WORD
text is long, however, both routes suffer from low conversion efficiency and a high conversion error rate.
Summary of the invention
The object of the invention is to overcome the shortcomings of the above prior art by providing a method for
cutting PDF files related to the equity-transfer topic; the method can cut such files efficiently and accurately.
To achieve this object, the method for cutting PDF files related to the equity-transfer topic according to
the invention comprises the following steps:
1) obtaining business documents stored in PDF format by means of distributed web crawler technology;
2) performing, according to business-layer requirements, a business-layer analysis of the equity-transfer
topic on the PDF-format business documents obtained in step 1), and determining the language expression
features, keywords and keyword titles of the PDF files related to the equity-transfer topic;
3) using the language expression features, keywords and keyword titles determined in step 2), performing
a page-by-page regular-expression search of the source PDF file for the keywords and keyword titles, and
determining the page number information set P of the PDF file pages that contain the keywords and
keyword titles;
4) removing the abnormal page numbers from the page number information set P obtained in step 3) by means
of a page-number anomaly removal mechanism, yielding the pruned page number information set P_final;
5) cutting the source PDF file on the equity-transfer topic according to the pruned set P_final obtained
in step 4), thereby completing the cutting of the PDF file related to the equity-transfer topic.
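Step 3) can be sketched as a page-by-page regular-expression search. The sketch below assumes that the text of each page has already been extracted into a list of strings; the extraction itself is left out. The English patterns and the names `SRT_PATTERNS` and `find_keyword_pages` are hypothetical, since the patent does not disclose its actual regular expressions.

```python
import re

# Hypothetical patterns standing in for the keywords and keyword titles;
# the patent does not disclose the regular expressions it actually uses.
SRT_PATTERNS = {
    "SRT_1": re.compile(r"transacting party", re.IGNORECASE),
    "SRT_2": re.compile(r"counterparty", re.IGNORECASE),
    "SRT_3": re.compile(r"total number of shares", re.IGNORECASE),
    "SRT_4": re.compile(r"shareholding", re.IGNORECASE),
    "SRT_5": re.compile(r"transaction time", re.IGNORECASE),
}


def find_keyword_pages(page_texts, patterns=SRT_PATTERNS):
    """Return {keyword name: ascending list of 1-based page numbers that match}."""
    hits = {name: [] for name in patterns}
    for page_no, text in enumerate(page_texts, start=1):
        for name, pattern in patterns.items():
            if pattern.search(text):
                hits[name].append(page_no)
    return hits
```

Each list in the returned dictionary plays the role of one per-keyword page number set; how a real system extracts page text from the PDF is a separate design choice that this sketch does not cover.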
The set of keywords and keyword titles is SRT = {SRT_1, SRT_2, SRT_3, SRT_4, SRT_5}, where SRT_1
denotes the transacting party, SRT_2 the counterparty, SRT_3 the total number of shares of the
transferred equity, SRT_4 the shareholding of the transferred equity, and SRT_5 the transaction time.
The page number information of the PDF file related to the equity-transfer topic forms the set P, which
is built from the following sets: P_1 = {page numbers of the PDF file pages containing SRT_1};
P_2 = {page numbers of the PDF file pages containing SRT_2}; P_3 = {page numbers of the PDF file pages
containing SRT_3}; P_4 = {page numbers of the PDF file pages containing SRT_4}; P_5 = {page numbers of
the PDF file pages containing SRT_5}.
Step 4) specifically operates as follows: the abnormal page numbers in the page number information set P
obtained in step 3) are removed by means of the page-number anomaly removal mechanism, yielding the
pruned page number information set P_final.
When the difference between the page numbers of the first and second elements of P exceeds p_threshold,
i.e. |p_2 - p_1| > p_threshold, the page number of the first element is removed from P. When the
difference between the page numbers of the last and second-to-last elements of P exceeds p_threshold,
i.e. |p_m - p_(m-1)| > p_threshold, the page number of the last element is removed from P. The result
is the pruned page number information set P_final.
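The anomaly removal rule can be sketched as a small function. This is a minimal sketch under stated assumptions: P is treated as an ascending list, each of the two checks is applied once, and the trailing check runs on the list that remains after the leading check. The function name `remove_abnormal_pages` is hypothetical.

```python
def remove_abnormal_pages(pages, p_threshold):
    """Drop an isolated first and/or last page number from P (single pass)."""
    p = sorted(pages)
    # Leading check: |p_2 - p_1| > p_threshold removes p_1.
    if len(p) >= 2 and abs(p[1] - p[0]) > p_threshold:
        p = p[1:]
    # Trailing check: |p_m - p_(m-1)| > p_threshold removes p_m.
    if len(p) >= 2 and abs(p[-1] - p[-2]) > p_threshold:
        p = p[:-1]
    return p
```

Because only the two ends of the list are tested, an isolated page in the middle of P is kept; the rule as described does not address that case.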
The invention has the following advantages:
In operation, the method for cutting PDF files related to the equity-transfer topic according to the
invention first obtains the business documents stored in PDF format, then determines the language
expression features, keywords and keyword titles of the PDF files related to the equity-transfer topic,
and then determines the page number information set P of the pages containing the keywords and keyword
titles. To improve the accuracy and reliability of the set P and to reduce it, the invention removes the
abnormal page numbers from the set P obtained in step 3) by means of the page-number anomaly removal
mechanism, and then cuts the PDF file related to the equity-transfer topic according to the pruned set.
This effectively improves the precision and reliability of the cutting, and the method is generally
applicable and has a strong application basis.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the flow chart of embodiment one.
Detailed description
The invention is described in further detail below with reference to the accompanying drawings:
With reference to Fig. 1, the method for cutting PDF files related to the equity-transfer topic according to the
invention comprises the following steps:
1) obtaining publicly disclosed business documents stored in PDF format by means of distributed web
crawler technology;
2) performing, according to business-layer requirements, a business-layer analysis of the equity-transfer
topic on the publicly disclosed PDF-format business documents obtained in step 1), and determining the
language expression features, keywords and keyword titles of the PDF files related to the equity-transfer topic;
3) using the language expression features, keywords and keyword titles determined in step 2), performing
a page-by-page regular-expression search of the source PDF file for the keywords and keyword titles, and
determining the page number information set P of the PDF file pages that contain the keywords and
keyword titles;
4) removing the abnormal page numbers from the page number information set P obtained in step 3) by means
of a page-number anomaly removal mechanism, yielding the pruned page number information set P_final;
5) cutting the source PDF file on the equity-transfer topic according to the pruned set P_final obtained
in step 4), thereby completing the cutting of the PDF file related to the equity-transfer topic.
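Step 5) amounts to selecting the pages whose numbers appear in P_final. A minimal sketch follows, assuming `pages` is any sequence of page objects in document order (in practice these would come from a PDF library's reader, and the selected pages would be written out with its writer; that I/O is omitted here). The name `cut_pages` is hypothetical.

```python
def cut_pages(pages, p_final):
    """Return the pages whose 1-based page numbers are in p_final, in page order."""
    selected = []
    for page_no in sorted(set(p_final)):
        # Page numbers in P_final are 1-based, so validate before indexing.
        if not 1 <= page_no <= len(pages):
            raise ValueError(f"page {page_no} is outside 1..{len(pages)}")
        selected.append(pages[page_no - 1])
    return selected
```

For example, `cut_pages(["a", "b", "c", "d"], [4, 2])` returns `["b", "d"]`, because the selection always follows the original page order.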
The set of keywords and keyword titles related to equity transfer is denoted SRT = {SRT_1, SRT_2, SRT_3, ..., SRT_n},
where SRT_1 denotes the transacting party, SRT_2 the counterparty, SRT_3 the total number of shares of
the transferred equity, SRT_4 the total share capital of the transferred equity, and SRT_5 the
transaction time.
The page number information of the PDF file related to the equity-transfer topic forms the set P, which
is built from the following sets: P_1 = {page numbers of the PDF file pages containing the feature
keyword and keyword title SRT_1}; P_2 = {page numbers of the PDF file pages containing the feature
keyword and keyword title SRT_2}; P_3 = {page numbers of the PDF file pages containing the feature
keyword and keyword title SRT_3}; P_4 = {page numbers of the PDF file pages containing the feature
keyword and keyword title SRT_4}; P_5 = {page numbers of the PDF file pages containing the feature
keyword and keyword title SRT_5}.
Step 4) specifically operates as follows: the abnormal page numbers in the page number information set P
obtained in step 3) are removed by means of the page-number anomaly removal mechanism, yielding the
pruned page number information set P_final.
When the difference between the page numbers of the first and second elements of P exceeds p_threshold,
i.e. |p_2 - p_1| > p_threshold, the page number of the first element is removed from P. When the
difference between the page numbers of the last and second-to-last elements of P exceeds p_threshold,
i.e. |p_m - p_(m-1)| > p_threshold, the page number of the last element is removed from P. The result
is the pruned page number information set P_final.
Embodiment one
With reference to Fig. 1, the language expression features, keywords and keyword titles of the PDF file
related to equity transfer are determined from the business-layer analysis of the PDF files related to
the equity-transfer topic. The keywords and keyword titles are defined as the transacting party, the
counterparty, the total number transferred, the shareholding transferred and the transaction time.
Using these keywords and keyword titles, a regular-expression search finds the page number information
set P of the source PDF file pages on which they appear. The page number set corresponding to the
transacting party is P_1 = {15, 22, 25}; the page number set corresponding to the counterparty is
P_2 = {22, 23, 28}; the page number set corresponding to the total number transferred is
P_3 = {25, 28, 31}; the page number set corresponding to the shareholding transferred is
P_4 = {25, 31}; and the page number set corresponding to the transaction time is
P_5 = {15, 28, 31, 45}. The page number set of the source PDF document related to the equity-transfer
topic is then P = {15, 22, 25, 28, 31, 45}.
The page-number anomaly removal mechanism is then applied to P. The page numbers of the first and second
elements are 15 and 22, and their difference exceeds the given threshold p_threshold = 3, so the page
number 15 of the first element is discarded and the page number set becomes {22, 25, 28, 31, 45}. The
page numbers of the second-to-last and last elements are 31 and 45, and their difference also exceeds
p_threshold, so the page number 45 of the last element is discarded. The page number set of the source
PDF file related to the equity-transfer topic is now P_final = {22, 25, 28, 31}.
The source PDF file is cut by topic according to this page number set: pages 22, 25, 28 and 31 are cut
out of the source PDF file and watermarked, forming a new PDF file related to the equity-transfer topic.
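The arithmetic of embodiment one can be checked by looking at the gaps between consecutive page numbers of the stated set P. The snippet below is only a check of these particular values; the variable names are illustrative and the slicing shortcut is not meant as a general implementation of the removal mechanism.

```python
# Values taken from embodiment one.
P = [15, 22, 25, 28, 31, 45]
P_THRESHOLD = 3

# Differences between consecutive page numbers: 22-15, 25-22, 28-25, 31-28, 45-31.
gaps = [b - a for a, b in zip(P, P[1:])]

# Drop the first page if the leading gap is too large, and the last page
# if the trailing gap is too large.
start = 1 if gaps[0] > P_THRESHOLD else 0
end = len(P) - 1 if gaps[-1] > P_THRESHOLD else len(P)
P_final = P[start:end]

print(gaps)     # [7, 3, 3, 3, 14]
print(P_final)  # [22, 25, 28, 31]
```

The leading gap of 7 and the trailing gap of 14 both exceed the threshold of 3, while the interior gaps equal 3 and therefore do not, which is why only pages 15 and 45 are discarded.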
Claims (4)
- 1. A method for cutting PDF files related to the equity-transfer topic, characterized by comprising the following steps: 1) obtaining business documents stored in PDF format by means of distributed web crawler technology; 2) performing, according to business-layer requirements, a business-layer analysis of the equity-transfer topic on the PDF-format business documents obtained in step 1), and determining the language expression features, keywords and keyword titles of the PDF files related to the equity-transfer topic; 3) using the language expression features, keywords and keyword titles determined in step 2), performing a page-by-page regular-expression search of the source PDF file for the keywords and keyword titles, and determining the page number information set P of the PDF file pages that contain the keywords and keyword titles; 4) removing the abnormal page numbers from the page number information set P obtained in step 3) by means of a page-number anomaly removal mechanism, yielding the pruned page number information set P_final; 5) cutting the source PDF file on the equity-transfer topic according to the pruned set P_final obtained in step 4), thereby completing the cutting of the PDF file related to the equity-transfer topic.
- 2. The method for cutting PDF files related to the equity-transfer topic according to claim 1, characterized in that the set of keywords and keyword titles is SRT = {SRT_1, SRT_2, SRT_3, SRT_4, SRT_5}, where SRT_1 denotes the transacting party, SRT_2 the counterparty, SRT_3 the total number of shares of the transferred equity, SRT_4 the shareholding of the transferred equity, and SRT_5 the transaction time.
- 3. The method for cutting PDF files related to the equity-transfer topic according to claim 2, characterized in that the page number information of the PDF file related to the equity-transfer topic forms the set P, which is built from the following sets: P_1 = {page numbers of the PDF file pages containing SRT_1}; P_2 = {page numbers of the PDF file pages containing SRT_2}; P_3 = {page numbers of the PDF file pages containing SRT_3}; P_4 = {page numbers of the PDF file pages containing SRT_4}; P_5 = {page numbers of the PDF file pages containing SRT_5}.
- 4. The method for cutting PDF files related to the equity-transfer topic according to claim 1, characterized in that step 4) specifically operates as follows: the abnormal page numbers in the page number information set P obtained in step 3) are removed by means of the page-number anomaly removal mechanism, yielding the pruned page number information set P_final; when the difference between the page numbers of the first and second elements of P exceeds p_threshold, i.e. |p_2 - p_1| > p_threshold, the page number of the first element is removed from P; when the difference between the page numbers of the last and second-to-last elements of P exceeds p_threshold, i.e. |p_m - p_(m-1)| > p_threshold, the page number of the last element is removed from P, yielding the pruned page number information set P_final.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710823110.1A CN107633039A (en) | 2017-09-13 | 2017-09-13 | A method for cutting PDF files related to the equity-transfer topic |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710823110.1A CN107633039A (en) | 2017-09-13 | 2017-09-13 | A method for cutting PDF files related to the equity-transfer topic |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107633039A true CN107633039A (en) | 2018-01-26 |
Family
ID=61101203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710823110.1A Pending CN107633039A (en) | 2017-09-13 | 2017-09-13 | A method for cutting PDF files related to the equity-transfer topic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107633039A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597422A (en) * | 2020-12-30 | 2021-04-02 | 深圳市世强元件网络有限公司 | PDF file segmentation method and PDF file loading method in webpage |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05151264A (en) * | 1991-12-02 | 1993-06-18 | Fuji Electric Co Ltd | Information retrieving device |
CN102646129A (en) * | 2012-03-09 | 2012-08-22 | 武汉大学 | Topic-relative distributed web crawler system |
CN103176956A (en) * | 2011-12-21 | 2013-06-26 | 北大方正集团有限公司 | Method and device for extracting file structure |
CN105701091A (en) * | 2014-11-24 | 2016-06-22 | 北大方正集团有限公司 | Semantic-based PDF document processing method and processing device |
CN105760457A (en) * | 2016-02-05 | 2016-07-13 | 成都康赛信息技术有限公司 | Data paging optimizing method based on MongoDB |
CN106649229A (en) * | 2015-11-04 | 2017-05-10 | 北京广联达正源兴邦科技有限公司 | PDF file splitting method, PDF file splitting system and terminal |
CN106951400A (en) * | 2017-02-06 | 2017-07-14 | 北京因果树网络科技有限公司 | Information extraction method and device for PDF files |
-
2017
- 2017-09-13 CN CN201710823110.1A patent/CN107633039A/en active Pending
Non-Patent Citations (2)
Title |
---|
6到不胜寒: ""PDF定位关键字/词所在坐标及页码"", 《CSDN HTTPS://BLOG.CSDN.NET/GUO123K/ARTICLE/DETAILS/76417702》 * |
CHAMSU: ""[Python]:关于截取pdf中的某些页"", 《CSDN HTTPS://BLOG.CSDN.NET/CHAM_3/ARTICLE/DETAILS/60135490》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101950284B (en) | Chinese word segmentation method and system | |
CN102073692B (en) | Semantic retrieval system and method based on an agricultural-domain ontology library | |
CN102646125B (en) | Structured digital content extraction and reorganization method | |
CN108829858A (en) | Data query method, apparatus and computer readable storage medium | |
CN101539904B (en) | Automatic indexing method of quotations | |
CN106095762A (en) | News recommendation method and device based on an ontology model library | |
CN104537116A (en) | Book search method based on tag | |
CN101430714B (en) | Content structuring process method and system based on model | |
CN103246710A (en) | Method and device for automatically generating multimedia travel notes | |
CN103823838A (en) | Method for inputting and comparing multi-format documents | |
CN104166683A (en) | Data mining method | |
CN104915449A (en) | Faceted search system and method based on water conservancy object classification labels | |
CN107391479A (en) | Construction method for a modular achievement library | |
CN102402561A (en) | Searching method and device | |
CN112650858B (en) | Emergency assistance information acquisition method and device, computer equipment and medium | |
CN100498783C (en) | Method for supporting full text retrieval system, and searching numerical value categorical data domain meanwhile | |
CN103440343B (en) | Knowledge base construction method facing domain service target | |
CN102375863A (en) | Method and device for keyword extraction in geographic information field | |
CN107633039A (en) | A method for cutting PDF files related to the equity-transfer topic | |
CN101799890A (en) | Certificate data processing method and system | |
CN102591976A (en) | Text characteristic extracting method and document copy detection system based on sentence level | |
CN107562921A (en) | A method for cutting PDF files related to the backdoor-listing topic | |
CN107633040A (en) | A method for cutting PDF files related to the major-restructuring topic | |
CN102043802A (en) | Method for searching XML (Extensible Markup Language) key words based on structural abstract | |
Heidhues et al. | Convergence of accounting standards in Germany: biases and challenges |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180126 |