CN107633039A - A PDF file cutting method for the stock right transfer theme - Google Patents

Info

Publication number
CN107633039A
Authority
CN
China
Prior art keywords
page number
pdf document
srt
right transfer
information set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710823110.1A
Other languages
Chinese (zh)
Inventor
张贝贝
徐小艳
周帅鹏
荆姝娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-09-13
Filing date
2017-09-13
Publication date
2018-01-26

Abstract

The invention discloses a PDF file cutting method for the stock right transfer theme, comprising the following steps: 1) obtaining publicly disclosed business documents stored in PDF format by means of distributed Internet crawler technology; 2) determining the language expression features, keywords and keyword titles of PDF files related to the stock right transfer theme; 3) determining the page number information set P of the PDF file pages containing the keywords and keyword titles; 4) removing the abnormal page numbers from the page number information set P obtained in step 3) using a page number anomaly removal mechanism, to obtain the cleaned page number information set P_final; 5) cutting the source PDF file on the stock right transfer theme according to the cleaned page number information set P_final obtained in step 4), which completes the cutting of the PDF file related to the stock right transfer theme. The method can cut PDF files related to the stock right transfer theme efficiently and accurately.

Description

A PDF file cutting method for the stock right transfer theme
Technical field
The invention belongs to the field of structuring unstructured data in big data research, and relates to a PDF file cutting method for the stock right transfer theme.
Background
Unstructured data includes files in WORD, EXCEL, PDF, TXT, audio and video form. Converting such data into structured data that is user-friendly and directly usable for statistical analysis and applications, including data stored in SQL or Oracle form, is both an urgent demand and a research difficulty in the current big data application field.
Some results exist on structuring relatively short files in PDF format. The main idea of the existing methods in the literature is to first convert the source PDF document, which is completely unstructured data, into a file in XML or WORD form, which is semi-structured data, and then to convert it by regular expression methods into structured data in SQL or Oracle form. When the XML or WORD text is long, however, both approaches suffer from low conversion efficiency and a high conversion error rate.
Summary of the invention
The object of the invention is to overcome the shortcomings of the above prior art and to provide a PDF file cutting method for the stock right transfer theme that can cut PDF files related to the stock right transfer theme efficiently and accurately.
To achieve the above object, the PDF file cutting method for the stock right transfer theme of the present invention comprises the following steps:
1) obtaining business documents stored in PDF format by means of distributed Internet crawler technology;
2) performing, according to business-layer requirements, a business-layer analysis on the stock right transfer theme of the PDF business documents obtained in step 1), and determining the language expression features, keywords and keyword titles of PDF files related to the stock right transfer theme;
3) using the language expression features, keywords and keyword titles determined in step 2), performing a page-by-page regular expression search of the source PDF file for the keywords and keyword titles, and determining the page number information set P of the PDF file pages containing the keywords and keyword titles;
4) removing the abnormal page numbers from the page number information set P obtained in step 3) using a page number anomaly removal mechanism, to obtain the cleaned page number information set P_final;
5) cutting the source PDF file on the stock right transfer theme according to the cleaned page number information set P_final obtained in step 4), which completes the cutting of the PDF file related to the stock right transfer theme.
The set of keywords and keyword titles is SRT = {SRT_1, SRT_2, SRT_3, SRT_4, SRT_5}, where SRT_1 denotes the counterparty, SRT_2 the counterpart, SRT_3 the total number of shares transferred, SRT_4 the shareholding equity transferred, and SRT_5 the transaction time.
The page number information set P of the PDF file related to the stock right transfer theme is built from the following sets: P_1 = {page number values of the PDF file pages containing SRT_1}; P_2 = {page number values of the PDF file pages containing SRT_2}; P_3 = {page number values of the PDF file pages containing SRT_3}; P_4 = {page number values of the PDF file pages containing SRT_4}; P_5 = {page number values of the PDF file pages containing SRT_5}.
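The page-by-page search of step 3) and the per-keyword sets P_1 to P_5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the text of each page has already been extracted into a list of strings, and the regular expressions in SRT_PATTERNS as well as the function name find_keyword_pages are hypothetical placeholders, since the source names the five concepts but gives no patterns.

```python
import re

# Hypothetical patterns for the five keywords/keyword titles SRT_1..SRT_5.
# The source does not specify the actual regular expressions.
SRT_PATTERNS = {
    "SRT1": r"\bcounterparty\b",
    "SRT2": r"\bcounterpart\b",
    "SRT3": r"total number of shares",
    "SRT4": r"shareholding equity",
    "SRT5": r"transaction time",
}


def find_keyword_pages(pages, patterns):
    """Return, for each keyword, the set of 1-based page numbers that match it.

    `pages` is a list of page texts (assumed to be extracted beforehand).
    """
    compiled = {name: re.compile(p, re.IGNORECASE) for name, p in patterns.items()}
    result = {name: set() for name in patterns}
    for page_number, text in enumerate(pages, start=1):
        for name, rx in compiled.items():
            if rx.search(text):
                result[name].add(page_number)
    return result


# Toy page texts, invented for illustration only.
pages = [
    "Cover page",
    "The counterparty is Company A. Transaction time: 2017-09-13.",
    "The counterpart is Company B; total number of shares: 1000.",
    "Unrelated financial tables",
]
page_sets = find_keyword_pages(pages, SRT_PATTERNS)
print(page_sets)
```

Each value in the returned dictionary plays the role of one set P_i; how the sets are merged into P is left open here because the source text does not preserve that formula.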
Step 4) is carried out as follows: the abnormal page numbers in the page number information set P obtained in step 3) are removed using the page number anomaly removal mechanism, yielding the cleaned page number information set P_final.
When the difference between the page number values of the first and second elements of P exceeds p_threshold, i.e. |p_2 - p_1| > p_threshold, the page number value of the first element is removed from P. When the difference between the page number values of the last and second-to-last elements of P exceeds p_threshold, i.e. |p_m - p_(m-1)| > p_threshold, the page number value of the last element is removed from P. The result is the cleaned page number information set P_final.
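The removal rule of step 4) can be sketched in a few lines. The function name remove_abnormal_pages is an assumption, as are the decisions to sort the page numbers first and to apply each end check exactly once, which is how the text reads.

```python
def remove_abnormal_pages(page_numbers, p_threshold):
    """Drop an isolated first and/or last page number from the set P.

    The first element is removed if |p_2 - p_1| > p_threshold; the last
    element is then removed if |p_m - p_(m-1)| > p_threshold.
    Each check is applied once (an assumption based on the text).
    """
    p = sorted(set(page_numbers))
    # Head check: compare the first two page numbers.
    if len(p) >= 2 and abs(p[1] - p[0]) > p_threshold:
        p = p[1:]
    # Tail check: compare the last two page numbers of what remains.
    if len(p) >= 2 and abs(p[-1] - p[-2]) > p_threshold:
        p = p[:-1]
    return p


p_final = remove_abnormal_pages([15, 22, 25, 28, 31, 45], p_threshold=3)
print(p_final)
```

Because the comparison is strict, neighbouring pages that are exactly p_threshold apart are kept.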
The invention has the following advantages:
In operation, the PDF file cutting method for the stock right transfer theme of the present invention first obtains business documents stored in PDF format, then determines the language expression features, keywords and keyword titles of PDF files related to the stock right transfer theme, and then determines the page number information set P of the PDF file pages containing the keywords and keyword titles. To improve the accuracy and reliability of P and to reduce it, the abnormal page numbers in the set P obtained in step 3) are removed by the page number anomaly removal mechanism, and the PDF file related to the stock right transfer theme is then cut according to the cleaned page number information set. This improves the precision and reliability of the cutting, and the method is generally applicable with a solid basis for application.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the flow chart of embodiment one.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings:
With reference to Fig. 1, the PDF file cutting method for the stock right transfer theme of the present invention comprises the following steps:
1) obtaining publicly disclosed business documents stored in PDF format by means of distributed Internet crawler technology;
2) performing, according to business-layer requirements, a business-layer analysis on the stock right transfer theme of the publicly disclosed PDF business documents obtained in step 1), and determining the language expression features, keywords and keyword titles of PDF files related to the stock right transfer theme;
3) using the language expression features, keywords and keyword titles determined in step 2), performing a page-by-page regular expression search of the source PDF file for the keywords and keyword titles, and determining the page number information set P of the PDF file pages containing the keywords and keyword titles;
4) removing the abnormal page numbers from the page number information set P obtained in step 3) using a page number anomaly removal mechanism, to obtain the cleaned page number information set P_final;
5) cutting the source PDF file on the stock right transfer theme according to the cleaned page number information set P_final obtained in step 4), which completes the cutting of the PDF file related to the stock right transfer theme.
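The cutting of step 5) can be sketched with a PDF library. The source does not name one; the sketch below assumes the third-party pypdf package and assumes that the values in P_final are 1-based positions of pages in the source file. The function names to_zero_based and cut_pdf are hypothetical.

```python
def to_zero_based(page_numbers, page_count):
    """Convert 1-based page numbers to sorted, unique 0-based indices.

    Raises ValueError if a page number lies outside 1..page_count.
    """
    indices = []
    for n in sorted(set(page_numbers)):
        if not 1 <= n <= page_count:
            raise ValueError(f"page {n} is outside 1..{page_count}")
        indices.append(n - 1)
    return indices


def cut_pdf(source_path, output_path, page_numbers):
    """Write the selected pages of source_path to output_path.

    Uses pypdf (an assumption; the source does not specify a library).
    """
    from pypdf import PdfReader, PdfWriter  # third-party, imported lazily

    reader = PdfReader(source_path)
    writer = PdfWriter()
    for index in to_zero_based(page_numbers, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as fh:
        writer.write(fh)


# Index conversion for a hypothetical 50-page source file.
print(to_zero_based([31, 22, 25, 28], page_count=50))
```

A call such as cut_pdf("source.pdf", "cut.pdf", [22, 25, 28, 31]) would then produce the cut file; the file names here are placeholders.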
The set of keywords and keyword titles related to stock right transfer is denoted SRT = {SRT_1, SRT_2, SRT_3, ..., SRT_n}, where SRT_1 denotes the counterparty, SRT_2 the counterpart, SRT_3 the total number of shares transferred, SRT_4 the shareholding equity transferred, and SRT_5 the transaction time.
The page number information set P of the PDF file related to the stock right transfer theme is built from the following sets: P_1 = {page number values of the PDF file pages containing the feature keyword and keyword title SRT_1}; P_2 = {page number values of the PDF file pages containing the feature keyword and keyword title SRT_2}; P_3 = {page number values of the PDF file pages containing the feature keyword and keyword title SRT_3}; P_4 = {page number values of the PDF file pages containing the feature keyword and keyword title SRT_4}; P_5 = {page number values of the PDF file pages containing the feature keyword and keyword title SRT_5}.
Step 4) is carried out as follows: the abnormal page numbers in the page number information set P obtained in step 3) are removed using the page number anomaly removal mechanism, yielding the cleaned page number information set P_final.
When the difference between the page number values of the first and second elements of P exceeds p_threshold, i.e. |p_2 - p_1| > p_threshold, the page number value of the first element is removed from P. When the difference between the page number values of the last and second-to-last elements of P exceeds p_threshold, i.e. |p_m - p_(m-1)| > p_threshold, the page number value of the last element is removed from P. The result is the cleaned page number information set P_final.
Embodiment one
With reference to Fig. 1, a business-layer analysis of PDF files related to the stock right transfer theme determines their language expression features, keywords and keyword titles. The keywords and keyword titles are determined to be: counterparty, counterpart, total number of shares transferred, shareholding equity transferred, and transaction time.
Using these keywords and keyword titles, a regular expression search finds the page number information set P of the source PDF file pages on which they occur. The page number set corresponding to the counterparty is P_1 = {15, 22, 25}; to the counterpart, P_2 = {22, 23, 28}; to the total number of shares transferred, P_3 = {25, 28, 31}; to the shareholding equity transferred, P_4 = {25, 31}; and to the transaction time, P_5 = {15, 28, 31, 45}. The page number set of the source PDF document related to the stock right transfer theme is then P = {15, 22, 25, 28, 31, 45}.
The page number anomaly removal mechanism is then applied to P. The first and second elements are 15 and 22, and their difference exceeds the given threshold p_threshold = 3, so the first element, page number value 15, is discarded and the set becomes {22, 25, 28, 31, 45}. The second-to-last and last elements are 31 and 45, and their difference also exceeds p_threshold, so the last element, page number value 45, is discarded, giving P_final = {22, 25, 28, 31}. The source PDF file is cut on the theme according to this set: pages 22, 25, 28 and 31 are cut out of the source PDF file and watermarked, forming the new cut PDF file related to the stock right transfer theme.
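The arithmetic of this embodiment can be checked directly. The short script below is only a verification sketch; the dictionary keys and variable names are chosen here for illustration. One tentative observation: a plain union of the five per-keyword sets would also contain page 23, which the stated set P does not. The text does not say why, so the script only reports the difference and starts the threshold check from P as stated.

```python
# Per-keyword page sets as given in embodiment one.
page_sets = {
    "counterparty": {15, 22, 25},
    "counterpart": {22, 23, 28},
    "total shares transferred": {25, 28, 31},
    "shareholding equity transferred": {25, 31},
    "transaction time": {15, 28, 31, 45},
}
stated_p = [15, 22, 25, 28, 31, 45]  # P as stated in the embodiment
p_threshold = 3

# A plain union is an assumption, not the source's stated rule.
union = sorted(set().union(*page_sets.values()))
extra = sorted(set(union) - set(stated_p))

head_gap = stated_p[1] - stated_p[0]    # 22 - 15
tail_gap = stated_p[-1] - stated_p[-2]  # 45 - 31

# Both end gaps exceed the threshold, so both end pages are dropped.
p_final = stated_p[1:-1] if head_gap > p_threshold and tail_gap > p_threshold else stated_p
inner_gaps = [b - a for a, b in zip(p_final, p_final[1:])]

print("union:", union, "pages not in stated P:", extra)
print("head gap:", head_gap, "tail gap:", tail_gap)
print("P_final:", p_final, "inner gaps:", inner_gaps)
```

The remaining gaps all equal the threshold and therefore do not trigger removal, which matches the P_final given in the text.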

Claims (4)

  1. A PDF file cutting method for the stock right transfer theme, characterized by comprising the following steps:
    1) obtaining business documents stored in PDF format by means of distributed Internet crawler technology;
    2) performing, according to business-layer requirements, a business-layer analysis on the stock right transfer theme of the PDF business documents obtained in step 1), and determining the language expression features, keywords and keyword titles of PDF files related to the stock right transfer theme;
    3) using the language expression features, keywords and keyword titles determined in step 2), performing a page-by-page regular expression search of the source PDF file for the keywords and keyword titles, and determining the page number information set P of the PDF file pages containing the keywords and keyword titles;
    4) removing the abnormal page numbers from the page number information set P obtained in step 3) using a page number anomaly removal mechanism, to obtain the cleaned page number information set P_final;
    5) cutting the source PDF file on the stock right transfer theme according to the cleaned page number information set P_final obtained in step 4), which completes the cutting of the PDF file related to the stock right transfer theme.
  2. The PDF file cutting method for the stock right transfer theme according to claim 1, characterized in that the set of keywords and keyword titles is SRT = {SRT_1, SRT_2, SRT_3, SRT_4, SRT_5}, wherein SRT_1 denotes the counterparty, SRT_2 the counterpart, SRT_3 the total number of shares transferred, SRT_4 the shareholding equity transferred, and SRT_5 the transaction time.
  3. The PDF file cutting method for the stock right transfer theme according to claim 2, characterized in that the page number information set P of the PDF file related to the stock right transfer theme is built from the following sets: P_1 = {page number values of the PDF file pages containing SRT_1}; P_2 = {page number values of the PDF file pages containing SRT_2}; P_3 = {page number values of the PDF file pages containing SRT_3}; P_4 = {page number values of the PDF file pages containing SRT_4}; P_5 = {page number values of the PDF file pages containing SRT_5}.
  4. The PDF file cutting method for the stock right transfer theme according to claim 1, characterized in that step 4) is carried out as follows: the abnormal page numbers in the page number information set P obtained in step 3) are removed using the page number anomaly removal mechanism, yielding the cleaned page number information set P_final;
    when the difference between the page number values of the first and second elements of P exceeds p_threshold, i.e. |p_2 - p_1| > p_threshold, the page number value of the first element is removed from P; when the difference between the page number values of the last and second-to-last elements of P exceeds p_threshold, i.e. |p_m - p_(m-1)| > p_threshold, the page number value of the last element is removed from P, yielding the cleaned page number information set P_final.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710823110.1A CN107633039A (en) 2017-09-13 2017-09-13 A PDF file cutting method for the stock right transfer theme

Publications (1)

Publication Number Publication Date
CN107633039A (en) 2018-01-26

Family

ID=61101203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710823110.1A Pending CN107633039A (en) 2017-09-13 2017-09-13 A PDF file cutting method for the stock right transfer theme

Country Status (1)

Country Link
CN (1) CN107633039A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597422A (en) * 2020-12-30 2021-04-02 深圳市世强元件网络有限公司 PDF file segmentation method and PDF file loading method in webpage

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05151264A (en) * 1991-12-02 1993-06-18 Fuji Electric Co Ltd Information retrieving device
CN102646129A (en) * 2012-03-09 2012-08-22 武汉大学 Topic-relative distributed web crawler system
CN103176956A (en) * 2011-12-21 2013-06-26 北大方正集团有限公司 Method and device for extracting file structure
CN105701091A (en) * 2014-11-24 2016-06-22 北大方正集团有限公司 Semantic-based PDF document processing method and processing device
CN105760457A (en) * 2016-02-05 2016-07-13 成都康赛信息技术有限公司 Data paging optimizing method based on MongoDB
CN106649229A (en) * 2015-11-04 2017-05-10 北京广联达正源兴邦科技有限公司 PDF file splitting method, PDF file splitting system and terminal
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
6到不胜寒: ""PDF定位关键字/词所在坐标及页码"", 《CSDN HTTPS://BLOG.CSDN.NET/GUO123K/ARTICLE/DETAILS/76417702》 *
CHAMSU: ""[Python]:关于截取pdf中的某些页"", 《CSDN HTTPS://BLOG.CSDN.NET/CHAM_3/ARTICLE/DETAILS/60135490》 *

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180126