CN107358208A - Method and device for extracting structured information from PDF documents - Google Patents

Method and device for extracting structured information from PDF documents

Info

Publication number
CN107358208A
Authority
CN
China
Prior art keywords
title
page
content
level
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710576556.9A
Other languages
Chinese (zh)
Other versions
CN107358208B (en)
Inventor
徐龙
李德彦
杨宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co., Ltd
Original Assignee
China Science And Technology (beijing) Co Ltd
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Science And Technology (beijing) Co Ltd, Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201710576556.9A
Publication of CN107358208A
Application granted
Publication of CN107358208B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G06V30/43 Editing text-bitmaps, e.g. alignment, spacing; Semantic analysis of bitmaps of text without OCR

Abstract

An embodiment of the present application discloses a method for extracting structured information from a PDF document. The method includes: obtaining the original pages of the PDF document; extracting from the original pages at least one actual page that contains text content or titles; extracting from the actual pages the titles at each level and the text content subordinate to each title; and storing each title and its subordinate text content in a structured manner. The method extracts the titles at each level of a PDF document together with the text content subordinate to them and stores them in structured form, yielding structured information. Structured information extraction from PDF documents can thus be automated, which avoids manual reprocessing and is convenient and efficient.

Description

Method and device for extracting structured information from PDF documents
Technical field
The present application relates to the field of PDF document information extraction, and in particular to a method for extracting structured information from PDF documents. The application further relates to a device for extracting structured information from PDF documents.
Background
PDF (Portable Document Format) is a file format developed by Adobe Systems for exchanging files in a manner independent of application program, operating system, and hardware; it is a fixed-layout document format. PDF pages are relatively independent of one another and faithfully reproduce every character, color, and image of the original. However, PDF is an unstructured storage format: it does not record the logical structure of the document and has no logical elements such as paragraphs or tables.
Information in PDF documents is usually extracted with OCR (Optical Character Recognition) technology. However, the information extracted in this way is rendered as vectors, with no logical relationship (such as adjacency or ordering) between characters. The text formed by the extracted characters is merely a matrix rendered from x, y, and z coordinates plus a rotation. The format and position of such text are highly arbitrary, and manual reprocessing is needed before structured information with a clear hierarchy can be obtained.
Consequently, when existing methods are used to extract information from a PDF document, the format and position of the extracted text are arbitrary and structured information cannot be obtained conveniently. This is a problem that those skilled in the art urgently need to solve.
Summary of the invention
The present application provides a method and a device for extracting structured information from PDF documents, to solve the problem that the prior art cannot conveniently obtain structured information from PDF documents.
In a first aspect, the present application provides a method for extracting structured information from a PDF document, the method including:
Obtaining the original pages of the PDF document;
Extracting from the original pages at least one actual page that contains text content or titles;
Extracting from the actual pages the titles at each level and the text content subordinate to each title;
Storing, in a structured manner, each title and the text content subordinate to that title.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting from the original pages at least one actual page that contains text content or titles includes:
Determining whether the original pages contain a table-of-contents page, a header, and a footer, respectively;
Deleting the table-of-contents page, header, or footer from the original pages to obtain at least one actual page.
With reference to the first aspect and the above possible implementation, in a second possible implementation of the first aspect, the step of extracting from the actual pages the titles at each level and the text content subordinate to each title includes:
Extracting the first-level titles in each actual page;
Extracting the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extracting the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
Taking each first-level title, together with the content corresponding to it, as one level-1 logical page;
If no next-level title exists in the level-1 logical page, the step of storing, in a structured manner, each title and the text content subordinate to that title includes:
Storing, in a structured manner, each first-level title and the text content subordinate to it, where the text content subordinate to a first-level title is the content corresponding to that first-level title.
With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, before the step of taking each first-level title, together with the content corresponding to it, as one level-1 logical page, the method further includes:
If the current actual page contains no first-level title, merging all the content of the current actual page into the content corresponding to the previous first-level title;
If the first first-level title in the current actual page is not on the first line of that page, merging the content before that first first-level title into the content corresponding to the previous first-level title.
With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of extracting from the actual pages the titles at each level and the text content subordinate to each title further includes:
Extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to the level-(N+1) titles, where N is an integer ≥ 1.
With reference to the first aspect and the above possible implementations, in a fifth possible implementation of the first aspect, the step of extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to the level-(N+1) titles includes:
Extracting the level-(N+1) titles in each level-N logical page;
Extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extracting the content after it in that level-N logical page as the content corresponding to the current level-(N+1) title;
Taking each level-(N+1) title, together with the content corresponding to it, as one level-(N+1) logical page;
The step of storing, in a structured manner, each title and the text content subordinate to that title includes:
Storing, in a structured manner, the level-1 to level-(N+1) titles and the text content subordinate to each of them, where the text content subordinate to a level-(N+1) title is the content corresponding to that title, and the text content subordinate to a level-i title is the content corresponding to that title excluding its level-(i+1) logical pages, i = 1, 2, ..., N.
With reference to the first aspect and the above possible implementations, in a sixth possible implementation of the first aspect, the step of extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to the level-(N+1) titles includes:
Determining whether a table exists in each level-N logical page; if a table exists, cutting the table into table blocks, and extracting the level-(N+1) titles and the text content subordinate to them.
With reference to the first aspect and the above possible implementations, in a seventh possible implementation of the first aspect, the step of extracting the first-level titles in each actual page includes:
Obtaining the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;
If, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units, merging the next title line with the current title line;
Taking the text content of the line that lies above the title line and nearest to it as a first-level title in the actual page.
In a second aspect, the present application further provides a device for extracting structured information from a PDF document, including:
An acquiring unit, configured to obtain the original pages of the PDF document;
A first extraction unit, configured to extract from the original pages at least one actual page that contains text content or titles;
A second extraction unit, configured to extract from the actual pages the titles at each level and the text content subordinate to each title;
A storage unit, configured to store, in a structured manner, each title and the text content subordinate to that title.
With reference to the second aspect, in a first possible implementation of the second aspect, the first extraction unit includes:
A judging unit, configured to determine whether the original pages contain a table-of-contents page, a header, and a footer, respectively;
A deleting unit, configured to delete the table-of-contents page, header, or footer from the original pages to obtain at least one actual page.
Compared with the prior art, the method first removes from the original pages of the PDF document the parts that may interfere with structured information extraction, such as table-of-contents pages, headers, and footers, and thereby generates the actual pages. The titles at each level and the text content subordinate to them are then extracted from the actual pages and stored in a structured manner, yielding structured information. Structured information extraction from PDF documents can thus be automated, which avoids manual processing and is convenient and efficient.
Brief description of the drawings
To illustrate the technical solutions of the present application more clearly, the drawings used in the embodiments are briefly introduced below. Evidently, those of ordinary skill in the art can derive other drawings from these without creative effort.
Figs. 1 to 7 are flowcharts of an embodiment of the PDF document structured information extraction method of the present application;
Figs. 8 to 19 are schematic diagrams showing the effect of the sub-steps in an embodiment of the PDF document structured information extraction method of the present application;
Fig. 20 is a schematic structural diagram of an embodiment of the PDF document structured information extraction device of the present application.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the drawings.
Referring to Fig. 1, in a specific embodiment, the PDF document structured information extraction method includes:
S100: Obtain the original pages of the PDF document.
S200: Extract from the original pages at least one actual page that contains text content or titles.
S300: Extract from the actual pages the titles at each level and the text content subordinate to each title.
S400: Store, in a structured manner, each title and the text content subordinate to that title.
Structured information is information that, after analysis, can be decomposed into multiple interrelated components with a clear hierarchy among them. In this application, the structured information of a PDF document is the text extracted from the document in which the titles at each level and the text content subordinate to them form a clear hierarchy. The structured information can subsequently be presented in files of various formats such as HTML, Word, and TXT.
Structured storage means saving, in a single file, content that would otherwise require multiple files, organized by tree structure and level. In this application, storing each title and its subordinate text content in a structured manner means storing the titles at each level, and the content subordinate to them, according to a tree structure and hierarchy, thereby obtaining the structured information of the PDF document.
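The tree-shaped storage described above can be illustrated with a small sketch. It is not the patent's storage format: the class name TitleNode, its fields, and the choice of JSON as the single output file are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class TitleNode:
    """One title plus the text subordinate to it (hypothetical structure)."""
    level: int
    title: str
    text: List[str] = field(default_factory=list)
    children: List["TitleNode"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Recursively convert the subtree into plain dicts and lists.
        return {
            "level": self.level,
            "title": self.title,
            "text": self.text,
            "children": [child.to_dict() for child in self.children],
        }


chapter = TitleNode(1, "Chapter 1", ["Text directly under Chapter 1."])
chapter.children.append(TitleNode(2, "1.1", ["Text under section 1.1."]))

# Serialize the whole tree into one document (JSON is an assumed format).
stored = json.dumps(chapter.to_dict(), ensure_ascii=False, indent=2)
```

Each node keeps only the text that belongs to its own title; text of lower-level titles lives in the child nodes, which mirrors the distinction the description draws between subordinate text and lower-level logical pages.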
The above method first removes from the original pages of the PDF document the parts that may interfere with structured information extraction, such as table-of-contents pages, headers, and footers, and thereby generates the actual pages. The titles at each level and the text content subordinate to them are then extracted from the actual pages and stored in a structured manner, yielding structured information. Structured information extraction from PDF documents can thus be automated, which avoids manual processing and is convenient and efficient.
Steps S100-S400 are described in detail below.
In step S100, the original pages of the PDF document may be obtained from user input or from a storage medium.
Referring to Fig. 2, step S200 may specifically include steps S210 and S220.
S210: Determine whether the original pages contain a table-of-contents page, a header, and a footer.
Step S210 includes the following steps:
S211: Obtain the page number, the characters, and the total number of character lines of the current original page;
S212: Match the page number and characters of the current original page against a first preset rule to determine whether the current original page is a table-of-contents page.
In step S211, the page number, the characters, and the total number of character lines of the current original page can be obtained directly with tools such as PDFBox or iText. PDFBox is an open-source Java library for working with PDF documents; anyone can build on it to create PDF documents, manipulate existing documents, and extract their text. iText is likewise an open-source Java library for generating PDF documents; it can generate PDF or RTF documents and convert XML and HTML files to PDF.
In step S212, the first preset rule may be set in advance by a developer or a user. For example, under the first preset rule, the current original page is determined to be a table-of-contents page if any of the following holds: the current original page is the first or second page, and the lines occupied by title sequence numbers exceed 40% of the total number of character lines on the page; or the current original page is the first or second page, and the lines occupied by character strings of the form "Chinese characters, consecutive non-Chinese symbols, sequence number" exceed 40% of the total number of character lines on the page; or the current original page is the first or second page, and the characters on the page include a preset keyword.
For example, if the current original page is the first or second page and title sequence numbers such as "1.1", "1.1.1", "1、", or "2、" occupy more than 40% of its total character lines, the current original page is determined to be a table-of-contents page. Alternatively, if the current original page is the first or second page and character strings such as "If you request surrender within 10 days of signing for this contract (the cooling-off period), the company deducts only the cost fee ...... 1.4" or "Chapter 1 ...... 15" occupy more than 40% of its total character lines, the current original page is determined to be a table-of-contents page. Alternatively, referring to Fig. 8, if the current original page is the first or second page and contains preset keywords such as "Chapter 1", "first", "Co., Ltd", or "Contents", the current original page is determined to be a table-of-contents page.
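The table-of-contents rule can be sketched as follows. This is a minimal illustration, not the patent's implementation: the regular expressions, the keyword list, and the function name is_toc_page are assumptions, and a page is represented simply as a list of line strings.

```python
import re

# Assumed pattern for title sequence numbers such as "1.1", "1.1.1", "1、".
SEQ_NO = re.compile(r"^\s*\d+(\.\d+)+|^\s*\d+、")
# Assumed pattern for "text, leader symbols, number" lines.
LEADER = re.compile(r".+?[.…·]{2,}\s*[\d.]+\s*$")
# Hypothetical keyword list; the source leaves the keywords configurable.
KEYWORDS = ("Contents", "目录")


def is_toc_page(page_number, lines, threshold=0.4):
    """Return True if a page looks like a table-of-contents page."""
    if page_number not in (1, 2) or not lines:
        return False
    total = len(lines)
    seq_lines = sum(1 for line in lines if SEQ_NO.search(line))
    leader_lines = sum(1 for line in lines if LEADER.search(line))
    has_keyword = any(k in line for line in lines for k in KEYWORDS)
    return (seq_lines / total > threshold
            or leader_lines / total > threshold
            or has_keyword)
```

The three conditions are joined with "or", matching the three alternative rules in the text; the 40% threshold is the only value taken directly from the description.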
The first preset rule of step S212 may also cover headers and footers. For example, the rule for determining whether the original pages contain a header is: if the first line of characters is identical across 3 to 5 consecutive original pages, the original pages are determined to contain a header. Likewise, the rule for determining whether the original pages contain a footer is: if the last line of characters is identical across 3 to 5 consecutive original pages, the original pages are determined to contain a footer.
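A minimal sketch of the header and footer rule, assuming each page is a list of line strings. The function names and the default run length of 3 pages are illustrative choices, and deleting only lines that equal the detected header or footer is a simplification not spelled out in the source.

```python
def find_repeated_line(pages, index, run=3):
    """Return the line at `index` (0 = first, -1 = last) if it is identical
    across `run` consecutive non-empty pages, otherwise None."""
    for start in range(len(pages) - run + 1):
        window = pages[start:start + run]
        if all(window) and len({page[index] for page in window}) == 1:
            return window[0][index]
    return None


def strip_header_footer(pages, run=3):
    """Remove the detected header and footer line from every page."""
    header = find_repeated_line(pages, 0, run)
    footer = find_repeated_line(pages, -1, run)
    cleaned = []
    for page in pages:
        body = list(page)
        if header is not None and body and body[0] == header:
            body = body[1:]
        if footer is not None and body and body[-1] == footer:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

Footers that carry a changing page number would not match this exact-equality test; handling them would need a looser comparison, which the source does not describe.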
S220: Delete the table-of-contents page, header, or footer from the original pages to obtain at least one actual page.
Specifically, if the original pages include a table-of-contents page, the whole table-of-contents page is deleted; if they include a header, the header is deleted; if they include a footer, the footer is deleted. This removes the original pages, or the parts of them, that might interfere with extracting the structured text of the PDF document, yielding at least one actual page.
Before step S300, the characters on the same line of an actual page may first be merged into a line text, as shown in Fig. 9. To merge the characters of a line, the coordinates of every character on each actual page, including the X-axis and Y-axis coordinates, can be obtained in advance with a tool such as PDFBox; characters whose Y-axis coordinates are identical, or differ by no more than a preset range, are merged into one line text. The titles at each level and the text content subordinate to them are then extracted by traversing line texts. For example, the first-level titles and their subordinate text content are extracted by traversing the line texts of the actual pages; the second-level titles in a level-1 logical page and their subordinate text content are extracted by traversing that level-1 logical page.
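The merging of characters into line texts can be sketched as below. The input format (text, x, y), the tolerance value, and the function name are assumptions; in practice the coordinates would come from a PDF library such as PDFBox, which is not called here.

```python
def merge_into_lines(chars, y_tolerance=1.0):
    """Group characters into line texts.

    chars: iterable of (text, x, y) tuples.
    Returns a list of (y, line_text), ordered by ascending y.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for text, x, y in sorted(chars, key=lambda c: c[2]):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            # Same line: Y differs from the line's reference by at most
            # the preset range.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order characters from left to right by X.
    return [(y, "".join(t for _, t in sorted(row))) for y, row in lines]
```

The reference Y of a line is the Y of its first character, so slow drift across a long line is not followed; that is a deliberate simplification.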
Step S300 and the corresponding step S400 cover two cases: in one, no next-level title exists in a level-1 logical page; in the other, a next-level title does exist in a level-1 logical page.
Refer to Figs. 3, 4, and 10 to 14. Fig. 3 is the flowchart of S300-S400 in the first embodiment; Fig. 4 is the flowchart of step S311 in the first embodiment. Fig. 10 shows the effect of step S311 in the first embodiment; Fig. 11 shows the effect of step S312; Fig. 12 shows the effect of step S313; Fig. 13 shows the effect of step S314; Fig. 14 shows the effect of step S410. In the first embodiment, step S300 includes:
S311: Extract the first-level titles in each actual page;
S312: Extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
S313: If the current actual page contains no first-level title, merge all the content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of that page, merge the content before that first first-level title into the content corresponding to the previous first-level title;
S314: Take each first-level title, together with the content corresponding to it, as one level-1 logical page.
If no next-level title exists in the level-1 logical pages, the corresponding step S400 includes:
S410: Store, in a structured manner, each first-level title and the text content subordinate to it, where the text content subordinate to a first-level title is the content corresponding to that first-level title.
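Steps S311 to S314 can be sketched as a single pass over the actual pages. The function name, the dictionary layout, and the is_title callback are assumptions for illustration; how a title is recognized is left to the caller.

```python
def build_level1_logical_pages(actual_pages, is_title):
    """Split actual pages (lists of line texts) into level-1 logical pages.

    is_title: callable deciding whether a line is a first-level title.
    Returns a list of {"title": str, "content": [str, ...]}.
    """
    logical_pages = []
    for page in actual_pages:  # keep the original page order
        for line in page:
            if is_title(line):
                logical_pages.append({"title": line, "content": []})
            elif logical_pages:
                # Lines before the first title of a page, or a page with no
                # title at all, fall through to the previous title (S313).
                logical_pages[-1]["content"].append(line)
            # Lines before the very first title in the document are skipped
            # here; the source does not say how to treat them.
    return logical_pages
```

Because the open logical page is carried across page boundaries, the cross-page merging of S313 needs no separate pass in this formulation.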
In step S311, the first-level titles in an actual page can be extracted according to the font size, the font style, the text content, or the title lines in the actual page; all of these can be obtained with tools such as PDFBox or iText.
To extract first-level titles by font size, the font sizes of the line texts are compared; if the font of the current line text is the largest, the current line text is determined to be a first-level title. To extract first-level titles by font style, the font style of each line text is matched against a preset font style to determine whether the current line text is a first-level title. The font size of a line text may be taken as the size of its first character, or as the size shared by multiple characters of identical size in the line; likewise, the font style of a line text may be taken as the style of its first character, or as the style shared by multiple characters of identical style in the line. To extract first-level titles by text content, the text is matched against preset keywords; if it contains a preset keyword such as "Chapter 1", "Chapter 2", "first", or "Part I", the current line text is determined to be a first-level title.
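The font-size and keyword criteria can be combined in a short sketch. The keyword pattern, the function names, and the decision to accept a line when either criterion holds are assumptions; the source presents the criteria as alternatives and does not prescribe how to combine them.

```python
import re
from collections import Counter

# Assumed keyword pattern; the source lists keywords such as "Chapter 1".
KEYWORD = re.compile(r"^(Chapter|Part)\s+\w+")


def line_font_size(char_sizes):
    """Use the most common character size in a line as the line's size."""
    return Counter(char_sizes).most_common(1)[0][0]


def find_first_level_titles(lines):
    """lines: list of (text, [character sizes]); returns the title texts."""
    if not lines:
        return []
    sizes = [line_font_size(char_sizes) for _, char_sizes in lines]
    largest, smallest = max(sizes), min(sizes)
    titles = []
    for (text, _), size in zip(lines, sizes):
        # Only trust the size criterion when sizes actually differ.
        by_size = size == largest and largest > smallest
        if by_size or KEYWORD.match(text):
            titles.append(text)
    return titles
```

Using the most common character size corresponds to the second option in the text (the size shared by multiple characters); taking the first character's size would be the other option.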
For PDF documents in which a title line is placed beneath the first-level titles, the first-level titles can also be extracted by means of the title lines in the actual page. Referring to Fig. 4, this specifically includes:
S3111: Obtain the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;
S3112: If, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units, merge the next title line with the current title line;
S3113: Take the text of the line that lies above the title line and nearest to it as a first-level title in the actual page.
In step S3111, the Y-axis coordinates of the title lines in the actual page can be obtained with tools such as PDFBox or iText.
In step S3113, the line nearest to the title line can be found by comparing the distance between the Y-axis coordinate of each line text and the Y-axis coordinate of the current title line; the text of that line is taken as a first-level title in the actual page.
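Steps S3111 to S3113 can be sketched as follows, assuming that Y grows down the page so that a line "above" a title line has a smaller Y. The function names and input shapes are illustrative, and comparing each title line with the last kept one (rather than with its immediate predecessor) is one possible reading of the merge rule.

```python
def merge_title_lines(line_ys, min_gap=3.0):
    """Merge title lines whose Y coordinates differ by less than min_gap."""
    merged = []
    for y in sorted(line_ys):
        if merged and y - merged[-1] < min_gap:
            continue  # treated as part of the current title line
        merged.append(y)
    return merged


def titles_above_lines(text_lines, line_ys, min_gap=3.0):
    """text_lines: list of (y, text). Returns, for each merged title line,
    the text of the nearest line above it."""
    titles = []
    for rule_y in merge_title_lines(line_ys, min_gap):
        # (distance, text) for every line text lying above the title line.
        above = [(rule_y - y, text) for y, text in text_lines if y < rule_y]
        if above:
            titles.append(min(above)[1])
    return titles
```

If the PDF library reports Y growing upward instead, the comparison `y < rule_y` and the distance would need to be flipped.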
While level-1 logical pages are extracted from the actual pages, extraction proceeds page by page in the original page order, so content that belongs to a single first-level title may be split across two consecutive actual pages. Step S313 merges such content into the content corresponding to the previous first-level title, so that each level-1 logical page contains complete content. This overcomes the inability of common PDF information extraction methods to re-aggregate content split by pagination.
Refer to Figs. 5, 6, and 15 to 18. Fig. 5 is the flowchart of S300-S400 in the second embodiment; Fig. 6 is the flowchart of step S320 in the second embodiment. Fig. 15 shows the effect of step S321 in the second embodiment; Fig. 16 shows the effect of step S322; Fig. 17 shows the effect of step S323; Fig. 18 shows the effect of the part of step S420 that concerns the text content subordinate to a level-i title. In the second embodiment, step S300 includes:
S311: Extract the first-level titles in each actual page;
S312: Extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
S313: If the current actual page contains no first-level title, merge all the content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of that page, merge the content before that first first-level title into the content corresponding to the previous first-level title;
S314: Take each first-level title, together with the content corresponding to it, as one level-1 logical page;
If a next-level title exists in a level-1 logical page, the method further includes step S320: extract, from each level-N logical page, the level-(N+1) titles and the text content subordinate to the level-(N+1) titles, where N is an integer ≥ 1. This step can be performed recursively until a level-N logical page contains no level-(N+1) title. It specifically includes:
S321: Extract the level-(N+1) titles in each level-N logical page, where N is an integer ≥ 1;
S322: Extract the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extract the content after it in that level-N logical page as the content corresponding to the current level-(N+1) title;
S323: Take each level-(N+1) title, together with the content corresponding to it, as one level-(N+1) logical page.
The corresponding step S400 includes:
S420: Store, in a structured manner, the level-1 to level-(N+1) titles and the text content subordinate to each of them, where the text content subordinate to a level-(N+1) title is the content corresponding to that title, and the text content subordinate to a level-i title is the content corresponding to that title excluding its level-(i+1) logical pages, i = 1, 2, ..., N.
Note that when a level-N logical page contains level-(N+1) titles, the content corresponding to the level-N title comprises both the text content subordinate to the level-N title and the level-(N+1) logical pages. When a level-N logical page contains no level-(N+1) title, the content corresponding to the level-N title is exactly the text content subordinate to it. In other words, in this application the content corresponding to a level-N title contains the text content subordinate to that title.
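The recursion of S320 and the distinction between "content corresponding to a title" and "text content subordinate to a title" can be sketched as below. The numbering-based title_level rule and the dictionary layout are assumptions for illustration; the patent recognizes titles by font, keywords, or title lines instead.

```python
import re


def title_level(line):
    """Hypothetical rule: "1 ..." is level 1, "1.1 ..." is level 2, and so
    on; 0 means the line is not a title."""
    match = re.match(r"^(\d+(?:\.\d+)*)\s", line)
    return match.group(1).count(".") + 1 if match else 0


def split_logical_page(lines, level):
    """Split the lines of a logical page at titles of the given level.

    Returns (own_text, nodes): own_text is the text before the first title
    of this level; each node holds a title, its own text, and its children.
    """
    nodes, own_text = [], []
    for line in lines:
        if title_level(line) == level:
            nodes.append({"title": line, "content": []})
        elif nodes:
            nodes[-1]["content"].append(line)
        else:
            own_text.append(line)
    for node in nodes:
        # The content corresponding to a title is itself a logical page;
        # recurse until no title of the next level is found.
        node["text"], node["children"] = split_logical_page(
            node.pop("content"), level + 1)
    return own_text, nodes
```

Here `node["text"]` plays the role of the text content subordinate to a title, while `text` plus `children` together make up the content corresponding to that title.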
It should be noted that if multiple first-level logical pages are extracted from the actual pages, and some of them contain no next-level title while others do, then for the first-level logical pages without a next-level title, structured storage follows the structured storage step of the first embodiment, and for the first-level logical pages that do contain a next-level title, structured storage follows the structured storage step of the second embodiment. The structured information finally obtained for the PDF document contains the structured storage results of both embodiments.
For a PDF document in which some level-N logical pages contain a table and titles appear inside the table, such as the PDF document shown in Figure 19, refer to Figure 7 and Figure 19. Figure 7 is a flowchart of S300-S400 in the third embodiment, and Figure 19 is a schematic diagram of the table cutting in S320a of the third embodiment. In the third embodiment, step S320 of the foregoing PDF document structured information extraction method includes:
S320a: determine whether a table exists in each level-N logical page; if a table exists, cut the table into table blocks, then extract the level-(N+1) titles and the text content subordinate to the level-(N+1) titles.
Specifically, in step S320a, the step of "determining whether a table exists in each level-N logical page and, if a table exists, cutting the table into table blocks" may include:
S320a1: determine, according to a second preset rule, whether a level-N logical page contains a table. The second preset rule is: if, in the content corresponding to the level-N title, a line contains at least two consecutive spaces, and the position of those spaces is the same in at least three consecutive lines, a table is determined to exist in the current level-N logical page; the first line in which at least two consecutive spaces appear is taken as the start line of the table, and the last line in which at least two consecutive spaces appear is taken as the end line of the table;
S320a2: take the positions of the runs of at least two consecutive spaces in the table as the vertical cutting lines of the table, and the blank lines in the table as the horizontal cutting lines, and cut the table into table blocks;
S320a3: read the content of the table blocks in order, from left to right and from top to bottom; this content, together with the content of the current level-N logical page outside the table, serves as the content corresponding to the level-N title in the current level-N logical page.
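Steps S320a1 to S320a3 can be sketched on plain text lines as follows. This is a simplified illustration under several assumptions that the description does not state: leading indentation is not treated as a column gap, every non-blank table row is assumed to share the same column gaps, and lines of one block are joined with a single space. The function names `gap_columns`, `find_table` and `cut_table` are hypothetical.

```python
import re
from typing import List, Optional, Set, Tuple


def gap_columns(line: str) -> Set[int]:
    """Character columns covered by runs of at least two spaces."""
    cols: Set[int] = set()
    for m in re.finditer(r" {2,}", line.rstrip()):
        if m.start() > 0:  # assumption: ignore leading indentation
            cols.update(range(m.start(), m.end()))
    return cols


def find_table(lines: List[str]) -> Optional[Tuple[int, int]]:
    """S320a1: return (start_row, end_row) of the table, or None."""
    gaps = [gap_columns(line) for line in lines]
    # Three consecutive lines whose space runs overlap in position.
    found = any(gaps[i] & gaps[i + 1] & gaps[i + 2]
                for i in range(len(lines) - 2))
    if not found:
        return None
    rows = [i for i, g in enumerate(gaps) if g]
    return rows[0], rows[-1]


def cut_table(lines: List[str], start: int, end: int) -> List[str]:
    """S320a2 and S320a3: cut the table and read blocks in reading order."""
    table = lines[start:end + 1]
    body = [line for line in table if line.strip()]
    common = set.intersection(*(gap_columns(line) for line in body))
    width = max(len(line) for line in body)
    # Vertical cuts: column spans are maximal runs of non-gap columns.
    spans, col = [], 0
    while col < width:
        if col in common:
            col += 1
            continue
        begin = col
        while col < width and col not in common:
            col += 1
        spans.append((begin, col))
    # Horizontal cuts: blank lines separate bands of rows.
    blocks, band = [], []
    for line in table + [""]:  # trailing "" flushes the last band
        if line.strip():
            band.append(line)
            continue
        for begin, stop in spans:  # left to right within a band
            parts = [r[begin:stop].strip() for r in band]
            cell = " ".join(p for p in parts if p)
            if cell:
                blocks.append(cell)
        band = []
    return blocks
```

The resulting block list would then replace the table in the content of the level-N logical page before the level-(N+1) titles are extracted.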
Step S320a cuts the table in a level-N logical page and obtains the content of the table to replace the original table. This updates the content corresponding to the level-N title in the original level-N logical page and forms a new level-N logical page that replaces the original one. In the subsequent steps, namely S321-S323, which extract the level-(N+1) titles and the text content subordinate to them from the level-N logical page, "the level-N logical page" refers to the new level-N logical page.
It should be noted that when a PDF document is processed, some level-N logical pages may contain a table while others do not. In that case, for the level-N logical pages without a table, step S320 of the second embodiment is used to extract the level-(N+1) titles and the text content subordinate to them; for the level-N logical pages that contain a table with titles in it, step S320a of the third embodiment is used to extract the level-(N+1) titles and the text content subordinate to them.
Referring to Figure 20, another embodiment further provides a PDF document structured information extraction device, including:
An acquiring unit 1, configured to obtain the original pages of a PDF document;
A first extraction unit 2, configured to extract from the original pages at least one actual page containing text content or titles;
A second extraction unit 3, configured to extract the titles at each level from the actual pages, together with the text content subordinate to each title;
A storage unit 4, configured to store, in structured form, each title and the text content subordinate to it.
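The way the four units hand data to one another can be pictured as a simple pipeline. The sketch below is only an illustration: the class name `ExtractionDevice`, the representation of a page as a list of text lines, and the callable signatures are assumptions, and the actual behaviour of each unit is supplied from outside.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Page = List[str]  # assumption: a page is represented as a list of text lines


@dataclass
class ExtractionDevice:
    acquire: Callable[[str], List[Page]]                 # acquiring unit 1
    first_extract: Callable[[List[Page]], List[Page]]    # first extraction unit 2
    second_extract: Callable[[List[Page]], List[Dict[str, Any]]]  # unit 3
    store: Callable[[List[Dict[str, Any]]], Any]         # storage unit 4

    def run(self, pdf_path: str) -> Any:
        original_pages = self.acquire(pdf_path)            # original pages
        actual_pages = self.first_extract(original_pages)  # actual pages
        titled = self.second_extract(actual_pages)         # titles and text
        return self.store(titled)                          # structured storage
```

Each unit can then be developed and tested separately, for example by plugging in stub functions for the units that are not under test.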
Optionally, the first extraction unit 2 includes:
A judging unit 21, configured to judge whether the original pages contain catalogue pages, headers and footers, respectively;
A deletion unit 22, configured to delete the catalogue pages, headers or footers from the original pages to obtain at least one actual page.
The above PDF document structured information extraction device can extract the structured information of a PDF document automatically, avoiding manual processing, and is convenient and efficient. The first extraction unit 2 deletes the catalogue pages, headers and footers that would interfere with the extraction of structured information from the PDF document, which further ensures the accuracy of the extraction.
Optionally, the second extraction unit 3 includes:
A first-level title extraction unit, configured to extract the first-level titles in each actual page;
A first-level content extraction unit, configured to extract the content between the current first-level title and the next first-level title in an actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, to extract the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
A first-level logical page generation unit, configured to take each first-level title, together with the content corresponding to it, as one first-level logical page.
The storage unit 4 includes a first-level storage unit, configured to store, in structured form, each first-level title and the text content subordinate to it when no next-level title exists in the first-level logical pages, where the text content subordinate to a first-level title is the content corresponding to that first-level title.
Optionally, the second extraction unit 3 further includes a merging unit, connected to the first-level content extraction unit and to the first-level logical page generation unit. If the current actual page contains no first-level title, the merging unit merges all content of the current actual page into the content corresponding to the previous first-level title; or, if the first first-level title in the current actual page is not on the first line of that page, the merging unit merges the content before that first first-level title into the content corresponding to the previous first-level title.
Because first-level logical pages are extracted from the actual pages page by page, in the original order of the actual pages, the following situation can arise: content that should belong to the same first-level title is split across two consecutive actual pages. The merging unit merges this part of the content into the content corresponding to the previous first-level title, so that each first-level logical page contains complete content. This overcomes the problem in common PDF document information acquisition methods that content split by pagination cannot be aggregated.
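The interplay between first-level content extraction and the merging unit can be sketched as follows. The sketch assumes that each actual page is a list of text lines and that a caller-supplied predicate `is_level1_title` (a hypothetical name) recognizes first-level titles; text that precedes the very first title of the document has no previous title and is simply dropped here, which is a simplification of this sketch.

```python
from typing import Any, Callable, Dict, List


def build_level1_pages(actual_pages: List[List[str]],
                       is_level1_title: Callable[[str], bool]
                       ) -> List[Dict[str, Any]]:
    """Build first-level logical pages, merging content split by pagination."""
    logical: List[Dict[str, Any]] = []
    for page in actual_pages:
        starts = [i for i, line in enumerate(page) if is_level1_title(line)]
        first = starts[0] if starts else len(page)
        # Merging unit: lines before the first title of this page (or the
        # whole page, if it has no title) belong to the previous title.
        leading = page[:first]
        if leading and logical:
            logical[-1]["content"].extend(leading)
        # Content extraction: each title owns the lines up to the next
        # title, or up to the end of the page for the last title.
        ends = starts[1:] + [len(page)]
        for start, end in zip(starts, ends):
            logical.append({"title": page[start],
                            "content": page[start + 1:end]})
    return logical
```

With pages `[["Chapter 1", "a"], ["b", "Chapter 2", "c"]]` and a predicate that matches lines starting with "Chapter", line "b" is appended to the content of "Chapter 1" even though it sits on the second page.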
Optionally, the second extraction unit 3 further includes a level-N extraction unit, configured to extract the level-(N+1) titles from each level-N logical page, together with the text content subordinate to the level-(N+1) titles, where N is an integer ≥ 1. The level-N extraction unit runs only when a level-(N+1) title exists in the level-N logical page; when no level-(N+1) title exists in the level-N logical page, the level-N extraction unit stops running.
Optionally, the level-N extraction unit includes:
A level-(N+1) title extraction unit, configured to extract the level-(N+1) titles in each level-N logical page;
A level-(N+1) content extraction unit, configured to extract the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, to extract the content after the current level-(N+1) title in that level-N logical page as the content corresponding to the current level-(N+1) title;
A level-(N+1) logical page generation unit, configured to take each level-(N+1) title, together with the content corresponding to it, as one level-(N+1) logical page.
The storage unit 4 further includes a level-N storage unit, configured to store, in structured form, the level-1 to level-(N+1) titles and the text content subordinate to each of them, where the text content subordinate to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content subordinate to a level-i title is the content corresponding to that level-i title excluding its level-(i+1) logical pages, i = 1, 2, ..., N. The level-N storage unit runs only when a next-level title exists in a first-level logical page; if no next-level title exists in the first-level logical page, the first-level storage unit runs instead.
It should be noted that if the second extraction unit extracts multiple first-level logical pages from the actual pages, and some of them contain no next-level title while others do, then for the first-level logical pages without a next-level title, structured storage uses the first-level storage unit, and for the first-level logical pages that do contain a next-level title, structured storage uses the level-N storage unit. When a PDF document is processed, both storage units may be used, or only one of them.
Optionally, the second extraction unit 3 further includes a table cutting and acquisition unit, configured to determine whether a table exists in each level-N logical page and, if a table exists, to cut the table into table blocks and then extract the level-(N+1) titles and the text content subordinate to them. When the content corresponding to the level-N title in a level-N logical page includes a table and the level-(N+1) titles lie inside the table, the table cutting and acquisition unit can be used to cut the table directly and then extract the level-(N+1) titles and the text content subordinate to them. The table cutting and acquisition unit can sometimes be used alone, in place of the level-N extraction unit, and sometimes needs to be used together with the level-N extraction unit.
Optionally, the first-level title extraction unit may include:
A title line acquisition unit, configured to obtain the title lines in an actual page and the Y-axis coordinate of each title line in the actual page;
A title line merging unit, configured to merge the next title line with the current title line when, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units;
A first-level title acquisition unit, configured to obtain the text content of the line nearest to and above a title line as a first-level title in the actual page.
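The title line merging and first-level title acquisition described above can be sketched as follows. The sketch makes two assumptions that the description leaves open: Y coordinates grow downward from the top of the page, so "above" means a smaller Y value, and a merged pair keeps the Y coordinate of the current (upper) title line. The function names and the `(y, text)` representation of text lines are hypothetical.

```python
from typing import List, Tuple


def merge_title_lines(ys: List[float], threshold: float = 3.0) -> List[float]:
    """Merge title lines whose Y coordinates differ by less than `threshold`."""
    merged: List[float] = []
    for y in sorted(ys):
        if merged and y - merged[-1] < threshold:
            continue  # the next title line is merged into the current one
        merged.append(y)
    return merged


def level1_titles(title_line_ys: List[float],
                  text_lines: List[Tuple[float, str]]) -> List[str]:
    """Take the text line nearest above each merged title line as a title."""
    titles: List[str] = []
    for y in merge_title_lines(title_line_ys):
        # Assumption: Y grows downward, so lines above have a smaller Y.
        above = [(ty, text) for ty, text in text_lines if ty < y]
        if above:
            titles.append(max(above, key=lambda item: item[0])[1])
    return titles
```

Merging first means that a double rule drawn under a heading, for example two lines 1.5 units apart, yields one first-level title rather than two.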
Those skilled in the art can clearly understand that the techniques in the embodiments of the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
For identical or similar parts among the embodiments in this specification, the embodiments may be referred to one another. The embodiments of the invention described above do not limit the scope of protection of the invention.

Claims (10)

1. A PDF document structured information extraction method, characterized in that the method includes:
Obtaining the original pages of a PDF document;
Extracting from the original pages at least one actual page containing text content or titles;
Extracting the titles at each level from the actual pages, together with the text content subordinate to each title;
Storing, in structured form, each title and the text content subordinate to it.
2. The PDF document structured information extraction method according to claim 1, characterized in that the step of extracting from the original pages at least one actual page containing text content or titles includes:
Judging whether the original pages contain catalogue pages, headers and footers, respectively;
Deleting the catalogue pages, headers or footers from the original pages to obtain at least one actual page.
3. The PDF document structured information extraction method according to claim 1, characterized in that the step of extracting the titles at each level from the actual pages, together with the text content subordinate to each title, includes:
Extracting the first-level titles in each actual page;
Extracting the content between the current first-level title and the next first-level title in an actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extracting the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
Taking each first-level title, together with the content corresponding to it, as one first-level logical page;
If no next-level title exists in the first-level logical pages, the step of storing, in structured form, each title and the text content subordinate to it includes:
Storing, in structured form, each first-level title and the text content subordinate to it, where the text content subordinate to a first-level title is the content corresponding to that first-level title.
4. The PDF document structured information extraction method according to claim 3, characterized in that, before the step of taking each first-level title, together with the content corresponding to it, as one first-level logical page, the method further includes the following steps:
If the current actual page contains no first-level title, merging all content of the current actual page into the content corresponding to the previous first-level title;
If the first first-level title in the current actual page is not on the first line of the current actual page, merging the content before that first first-level title in the current actual page into the content corresponding to the previous first-level title.
5. The PDF document structured information extraction method according to claim 3, characterized in that the step of extracting the titles at each level from the actual pages, together with the text content subordinate to each title, further includes the following step:
Extracting the level-(N+1) titles from each level-N logical page, together with the text content subordinate to the level-(N+1) titles, where N is an integer ≥ 1.
6. The PDF document structured information extraction method according to claim 5, characterized in that the step of extracting the level-(N+1) titles from each level-N logical page, together with the text content subordinate to the level-(N+1) titles, includes:
Extracting the level-(N+1) titles in each level-N logical page;
Extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extracting the content after the current level-(N+1) title in that level-N logical page as the content corresponding to the current level-(N+1) title;
Taking each level-(N+1) title, together with the content corresponding to it, as one level-(N+1) logical page;
The step of storing, in structured form, each title and the text content subordinate to it includes:
Storing, in structured form, the level-1 to level-(N+1) titles and the text content subordinate to each of them, where the text content subordinate to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content subordinate to a level-i title is the content corresponding to that level-i title excluding its level-(i+1) logical pages, i = 1, 2, ..., N.
7. The PDF document structured information extraction method according to claim 5, characterized in that the step of extracting the level-(N+1) titles from each level-N logical page, together with the text content subordinate to the level-(N+1) titles, includes:
Determining whether a table exists in each level-N logical page; if a table exists, cutting the table into table blocks, then extracting the level-(N+1) titles and the text content subordinate to the level-(N+1) titles.
8. The PDF document structured information extraction method according to any one of claims 3-7, characterized in that the step of extracting the first-level titles in each actual page includes:
Obtaining the title lines in an actual page and the Y-axis coordinate of each title line in the actual page;
If, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units, merging the next title line with the current title line;
Obtaining the text content of the line nearest to and above a title line as a first-level title in the actual page.
9. A PDF document structured information extraction device, characterized in that it includes:
An acquiring unit, configured to obtain the original pages of a PDF document;
A first extraction unit, configured to extract from the original pages at least one actual page containing text content or titles;
A second extraction unit, configured to extract the titles at each level from the actual pages, together with the text content subordinate to each title;
A storage unit, configured to store, in structured form, each title and the text content subordinate to it.
10. The PDF document structured information extraction device according to claim 9, characterized in that the first extraction unit includes:
A judging unit, configured to judge whether the original pages contain catalogue pages, headers and footers, respectively;
A deletion unit, configured to delete the catalogue pages, headers or footers from the original pages to obtain at least one actual page.
CN201710576556.9A 2017-07-14 2017-07-14 A kind of PDF document structured message extracting method and device Active CN107358208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710576556.9A CN107358208B (en) 2017-07-14 2017-07-14 A kind of PDF document structured message extracting method and device

Publications (2)

Publication Number Publication Date
CN107358208A true CN107358208A (en) 2017-11-17
CN107358208B CN107358208B (en) 2018-07-13

Family

ID=60292655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710576556.9A Active CN107358208B (en) 2017-07-14 2017-07-14 A kind of PDF document structured message extracting method and device

Country Status (1)

Country Link
CN (1) CN107358208B (en)

Also Published As

Publication number Publication date
CN107358208B (en) 2018-07-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20190905
Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing
Patentee after: China Science and Technology (Beijing) Co., Ltd.
Address before: Room 601, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing
Co-patentee before: China Science and Technology (Beijing) Co., Ltd.
Patentee before: Beijing Shenzhou Taiyue Software Co., Ltd.
CP03 Change of name, title or address
Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province
Patentee after: Dingfu Intelligent Technology Co., Ltd
Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing
Patentee before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.
CP03 Change of name, title or address