CN106502991B - Publication treating method and apparatus - Google Patents

Publication treating method and apparatus

Info

Publication number
CN106502991B
Authority
CN
China
Prior art keywords
publication
keyword
information
classification
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610972309.6A
Other languages
Chinese (zh)
Other versions
CN106502991A (en)
Inventor
石雄
宋永刚
董良广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
People Health Electronic Audio Visual Publishing Co Ltd
Original Assignee
People Health Electronic Audio Visual Publishing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by People Health Electronic Audio Visual Publishing Co Ltd
Priority to CN201610972309.6A
Publication of CN106502991A
Application granted
Publication of CN106502991B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Abstract

The invention discloses a publication processing method and apparatus. The method comprises: obtaining a digitized publication; extracting information from the publication according to layout information of the publication, wherein the information is divided into a plurality of categories and the different categories indicate the content structure of the publication; and processing the publication according to the information. The invention solves the problem that publication digitization methods in the related art are highly limited, and thereby improves the flexibility of publication digitization.

Description

Publication treating method and apparatus
Technical field
The present invention relates to the field of publication processing, and in particular to a publication processing method and apparatus.
Background art
At present, with the development of information technology, the digital production of books and periodicals is the foundation for the transition from traditional publishing to digital publishing. In the prior art, books and periodicals are digitized by scanning the printed book, or by converting the typesetting file, into files in formats such as PDF. Files produced in this way, however, do not directly meet the needs of digital dissemination and reading. For example, a user who wants to see the main content of a book or periodical has no means of jumping quickly to the content of a chapter and can only page through the file; and a user who wants to find, across several books in one subject area, the content related to a particular topic cannot do so at all. The prior art therefore fails, in both in-depth content mining and reading experience, to promote the dissemination of knowledge content in the information age, and publishers have no mature experience to follow in digital publishing, in particular in the digital processing of books and periodicals. The prior art is consequently highly limited.
No effective solution has yet been proposed for the problem that publication digitization methods in the related art are highly limited.
Summary of the invention
The main purpose of the present invention is to provide a publication processing method and apparatus, so as to solve the problem that publication digitization methods in the related art are highly limited.
To achieve the above purpose, according to one aspect of the present invention, a publication processing method is provided. The method comprises: obtaining a digitized publication; extracting information from the publication according to layout information of the publication, wherein the information is divided into a plurality of categories and the different categories indicate the content structure of the publication; and processing the publication according to the information.
Further, the categories of the information include at least titles. Extracting the information from the publication according to the layout information of the publication comprises: extracting all titles from the publication according to the layout style of the publication. Processing the publication according to the information comprises: processing all titles of the publication to form a hierarchical table of contents.
Further, the categories of the information also include body text. Extracting the information from the publication according to the layout information of the publication comprises: extracting the body text from the publication according to the layout style of the publication. Processing the publication according to the information comprises: establishing a correspondence between the titles and the body text of the publication, or establishing a correspondence between the hierarchical table of contents and the body text.
Further, the correspondence is stored in an XML file or in a database.
Further, the layout style of the publication includes at least one of: a layout style segmented by special characters, and a layout style segmented by font style.
Further, the categories of the information include at least keywords. Extracting the information from the publication according to the layout information of the publication comprises: extracting at least one keyword from the publication. Processing the publication according to the information comprises: determining the category to which the publication belongs according to the at least one keyword, and saving the category.
Further, extracting the at least one keyword from the publication comprises: determining which words are keywords according to the frequency of the words appearing in the publication and/or the positions at which the words appear in the publication.
Further, determining the category to which the publication belongs according to the at least one keyword comprises: where there are a plurality of keywords, determining the type to which the part corresponding to the keywords belongs according to the weight corresponding to each keyword.
To achieve the above purpose, according to another aspect of the present invention, a publication processing apparatus is also provided. The apparatus comprises: an acquiring unit for obtaining a digitized publication; an extraction unit for extracting information from the publication according to layout information of the publication, wherein the information is divided into a plurality of categories and the different categories indicate the content structure of the publication; and a processing unit for processing the publication according to the information.
Further, the categories of the information include at least titles; the extraction unit is used to extract all titles from the publication according to the layout style of the publication; and the processing unit is used to process all titles of the publication to form a hierarchical table of contents.
The present invention obtains a digitized publication, extracts information from the publication according to its layout information, wherein the information is divided into a plurality of categories and the different categories indicate the content structure of the publication, and processes the publication according to the information. This solves the problem that publication digitization methods in the related art are highly limited, and thereby improves the flexibility of publication digitization.
Brief description of the drawings
The drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a publication processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a publication processing flow according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of a publication processing apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in combination with the embodiments.
So that those skilled in the art may better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are evidently only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application, without creative effort, fall within the scope of protection of this application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of this application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the application described here can be implemented. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
An embodiment of the present invention provides a publication processing method.
Fig. 1 is a flowchart of a publication processing method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: obtain a digitized publication.
Step S104: extract information from the publication according to the layout information of the publication, wherein the information is divided into a plurality of categories and the different categories indicate the content structure of the publication.
Step S106: process the publication according to the information.
After obtaining the digitized publication, this embodiment extracts information of a plurality of categories from the publication according to its layout information, thereby obtaining the content structure of the publication, and then processes the publication according to that information. Because the technical solution digitizes the paper book, extracts from the layout information of the digitized publication the categories of information that represent the publication's content, and then processes the publication, a paper publication can be processed by category, which facilitates later retrieval or classification of the publication. This solves the problem that publication digitization methods in the related art are highly limited, and thereby improves the flexibility of publication digitization.
In embodiments of the present invention, the publication may be of many types, such as a book or a periodical, and may be a paper publication or an electronic one. A paper publication is first digitized, for example by scanning it and applying optical character recognition (OCR). The digitized publication can be obtained in a variety of ways. Once it has been obtained, information is extracted from the publication according to its layout information, which may be any of several kinds of information on the page layout, such as text, icons and pictures. The information is divided into a plurality of categories that indicate the content structure of the publication. The content structure includes titles, body text, pictures, tables, figure titles and figure notes, table titles and table notes, and special formats of the page content such as superscripts and subscripts, bold, italics and inlays.
After this information has been extracted, the publication is processed according to it. Processing may consist of building a publishing resource database from the extracted information; the processed content may further be used to build a knowledge database and to recombine content, so that more functions can be realized. Optionally, processing the publication may consist of splitting its content, submitting the pieces to an indexing system, and classifying them automatically according to a medical classification-keyword table; finally the layout information, classification information and index information are integrated and exported as an XML document or stored in a database.
In an optional application scenario, a paper publication is inconvenient to carry and read, and digitizing it lets more users share it. If the paper publication is merely scanned into a PDF book, however, the user cannot select a chapter to read; the reading quality depends on the scanning resolution and the page may be unclear when enlarged; the different content structures are hard to tell apart; and content such as titles and chapter information cannot be extracted. Such a digitization method is therefore highly limited. The technical solution of the embodiment of the present invention obtains a digitized publication, extracts information from it according to the layout information, and processes the publication according to the extracted information, which yields a publication with more accurate content and improves the flexibility of publication digitization.
In an optional embodiment, the categories of the information include at least titles. Extracting information from the publication according to the layout information of the publication comprises: extracting all titles from the publication according to the layout style of the publication. Processing the publication according to the information comprises: processing all titles of the publication to form a hierarchical table of contents.
The information categories of the publication include at least its titles, which may be the titles of the individual chapters and sections. After all titles have been extracted from the publication according to its layout information, the extracted titles can be processed to form a hierarchical table of contents, which makes the publication easier to read and search.
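As a minimal sketch of how extracted titles might be formed into a hierarchical table of contents, the following Python assumes that each title has already been assigned a nesting level during extraction. The names `TocNode` and `build_toc` and the sample headings are illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class TocNode:
    """One entry in the hierarchical table of contents."""
    title: str
    level: int
    children: list["TocNode"] = field(default_factory=list)


def build_toc(headings: list[tuple[int, str]]) -> TocNode:
    """Nest (level, title) pairs, given in reading order, under a synthetic root."""
    root = TocNode(title="ROOT", level=0)
    stack = [root]  # stack[-1] is the most recent open ancestor
    for level, title in headings:
        # Close every open heading at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        node = TocNode(title=title, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Hypothetical titles extracted from a book, with their levels.
headings = [
    (1, "Chapter 1 Circulatory system"),
    (2, "1.1 Anatomy"),
    (2, "1.2 Common diseases"),
    (3, "1.2.1 Hypertension"),
    (1, "Chapter 2 Respiratory system"),
]
toc = build_toc(headings)
```

The stack keeps only the chain of currently open ancestors, so the titles need to be visited just once, in reading order.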
In an optional embodiment, the categories of the information also include body text. Extracting information from the publication according to the layout information of the publication comprises: extracting the body text from the publication according to the layout style of the publication. Processing the publication according to the information comprises: establishing a correspondence between the titles and the body text of the publication, or establishing a correspondence between the hierarchical table of contents and the body text. In addition to titles, the information categories thus include body text; establishing the correspondence between body text and titles allows the body text to be located from a title, which makes the publication easier to read and search.
In an optional embodiment, the correspondence is stored in an XML file or in a database. After the correspondence between titles and body text has been established, it can be stored in an XML file or saved in a database.
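The storage of the title-to-body-text correspondence in XML can be sketched with Python's standard `xml.etree.ElementTree` module. The patent does not specify a schema, so the element names `book`, `section`, `title` and `body`, the `id` attribute, and the sample text are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def sections_to_xml(book_title: str, sections: list[tuple[str, str]]) -> str:
    """Serialize (heading, body) pairs; element and attribute names are illustrative."""
    book = ET.Element("book", {"title": book_title})
    for number, (heading, body) in enumerate(sections, start=1):
        section = ET.SubElement(book, "section", {"id": f"s{number}"})
        ET.SubElement(section, "title").text = heading
        ET.SubElement(section, "body").text = body
    return ET.tostring(book, encoding="unicode")


xml_text = sections_to_xml(
    "Example handbook",
    [
        ("1.1 Anatomy", "The heart has four chambers."),
        ("1.2 Common diseases", "Hypertension is discussed first."),
    ],
)

# Reading the XML back recovers the body text for a given title.
parsed = ET.fromstring(xml_text)
lookup = {s.findtext("title"): s.findtext("body") for s in parsed.iter("section")}
```

Keeping each title and its body text inside one `section` element makes the correspondence explicit in the file itself; a database table with one row per section would serve the same purpose.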
In an optional embodiment, the layout style of the publication includes at least one of: a layout style segmented by special characters, and a layout style segmented by font style. The layout of the publication can be segmented according to special characters or according to font style; for example, titles can be separated from the corresponding body text of the publication, and the layout can also be segmented according to the different fonts used.
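One possible reading of segmentation by special characters and by font style is sketched below. The `Line` record, the numbering pattern and the 10.5-point body font size are assumptions chosen for the example; a real layout analyzer would take these values from the parsed file.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float
    bold: bool = False


# Hypothetical special-character pattern: "Chapter 3", "§4", "1.2" or "1.2.1" at line start.
NUMBERING = re.compile(r"^(Chapter\s+\d+|§\s*\d+|\d+(\.\d+)+)\s")


def is_title(line: Line, body_font_size: float) -> bool:
    """A line is a title if its numbering or its font style marks it as one."""
    by_special_chars = bool(NUMBERING.match(line.text))
    by_font_style = line.bold or line.font_size > body_font_size
    return by_special_chars or by_font_style


def split_titles_and_body(lines, body_font_size=10.5):
    """Group lines into (title, body) pairs; body lines attach to the latest title."""
    sections = []
    for line in lines:
        if is_title(line, body_font_size):
            sections.append((line.text, []))
        elif sections:
            sections[-1][1].append(line.text)
    return [(title, " ".join(body)) for title, body in sections]


lines = [
    Line("Chapter 1 Circulatory system", 16, bold=True),
    Line("1.1 Anatomy", 10.5),
    Line("The heart has four chambers.", 10.5),
    Line("Blood leaves through the aorta.", 10.5),
    Line("Common diseases", 12),
    Line("Hypertension is discussed first.", 10.5),
]
sections = split_titles_and_body(lines)
```

The two rules are deliberately independent: "1.1 Anatomy" is caught by its numbering alone, while "Common diseases" is caught only by its larger font.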
In an optional embodiment, the categories of the information include at least keywords. Extracting information from the publication according to the layout information of the publication comprises: extracting at least one keyword from the publication. Processing the publication according to the information comprises: determining the category to which the publication belongs according to the at least one keyword, and saving the category.
The information categories may also include keywords, in which case extracting information from the publication means extracting keywords from it. Several keywords may be extracted, and the category to which the publication belongs, for example medical publications or history publications, can be determined from one or more of them. For example, the publication can be classified by the classification code of the category entry corresponding to a keyword, and a main classification code and reference classification codes can be obtained according to the differing keyword weights, giving a more accurate classification. Optionally, the keywords may be ranked by word frequency, position, semantic content and so on.
In an optional embodiment, extracting at least one keyword from the publication comprises: determining which words are keywords according to the frequency of the words appearing in the publication and/or the positions at which the words appear in the publication.
Keywords may be determined from the frequency of the words in the publication, for example by taking the words that occur most often as the keywords of the publication. They may also be determined from the positions at which words occur, for example by taking the words that appear in the titles of the publication as its keywords.
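Keyword selection by word frequency and position could look like the sketch below. It assumes space-delimited English text (Chinese text would first need word segmentation), a small hypothetical stop-word list, and an arbitrary `title_boost` value that rewards words appearing in the title; none of these details are given in the patent.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "is", "a", "to", "for", "with", "by"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(title: str, body: str, top_n: int = 3, title_boost: float = 3.0):
    """Score = body frequency, plus title_boost for each occurrence in the title."""
    scores = Counter(tokenize(body))
    for word in tokenize(title):
        scores[word] += title_boost
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


keywords = extract_keywords(
    title="Hypertension treatment",
    body=(
        "Hypertension is treated with drugs. Drugs lower blood pressure. "
        "Blood pressure is monitored."
    ),
)
```

In this example "treatment" never occurs in the body yet still ranks second, which shows position information outweighing raw frequency.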
In an optional embodiment, determining the category to which the publication belongs according to the at least one keyword comprises: where there are a plurality of keywords, determining the type to which the publication belongs according to the weight corresponding to each keyword.
If the publication has a plurality of keywords, the type to which the part corresponding to the keywords belongs can be determined according to the weight of each keyword. For example, the weight of each keyword is determined from the number of times it occurs or from the positions at which it occurs, and the type of the publication is then determined from these weights; for instance, the keyword with the largest weight can be taken as the keyword of the publication. The categories may be medical types such as disease, drug and surgical procedure. Determining by keyword weight the type of the part corresponding to the keywords means determining the type of the body-text segment or chapter in which the keywords occur; for example, a chapter may belong to one or more of the medical types disease, drug and surgical procedure. Other, non-medical types such as history or music are also possible. Classifying the segment or chapter in which the keywords occur determines the type of each body-text segment more precisely and improves the accuracy of classification.
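The weighting described above can be illustrated as follows. This sketch makes two assumptions that the patent leaves open: a keyword's weight is its occurrence count plus a fixed bonus when it appears in a title, and the type of a section is found by summing keyword weights per category. The `KEYWORD_CATEGORY` table is hypothetical.

```python
from collections import defaultdict

# Hypothetical keyword -> category table; the patent's real table is not given.
KEYWORD_CATEGORY = {
    "hypertension": "disease",
    "stroke": "disease",
    "aspirin": "drug",
    "bypass": "surgical procedure",
}


def keyword_weight(count: int, in_title: bool, title_bonus: float = 2.0) -> float:
    """Weight from occurrence count plus a bonus for appearing in a title."""
    return count + (title_bonus if in_title else 0.0)


def rank_categories(keywords: dict[str, tuple[int, bool]]) -> list[tuple[str, float]]:
    """Sum keyword weights per category and rank categories by total weight."""
    totals = defaultdict(float)
    for word, (count, in_title) in keywords.items():
        category = KEYWORD_CATEGORY.get(word)
        if category is not None:
            totals[category] += keyword_weight(count, in_title)
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


section_keywords = {
    "hypertension": (5, True),   # 5 occurrences, also in the section title
    "aspirin": (4, False),
    "stroke": (1, False),
    "exercise": (3, False),      # not in the table, so it is ignored
}
ranking = rank_categories(section_keywords)
```

Returning the full ranking rather than only the top entry fits the statement that a chapter may belong to more than one type.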
There may be one publication or several. If there are several, then after information has been extracted from them, the information of all the publications is processed to obtain a database of information on multiple publications. For example, after information has been extracted from several medical publications, the extracted information is processed and saved in a database. A user can then retrieve, by keyword query, all content in the database that is related to the keyword, and can quickly find the required content across multiple publications, which saves time.
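A keyword query over sections drawn from several publications might be sketched with Python's standard `sqlite3` module as below. The single-table schema, the column names and the sample rows are assumptions for illustration; the patent does not describe the database layout.

```python
import sqlite3

# In-memory database with one hypothetical table of indexed sections.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE section (book TEXT, title TEXT, body TEXT, keyword TEXT)"
)
rows = [
    ("Internal Medicine", "Hypertension", "Blood pressure targets are listed.", "hypertension"),
    ("Pharmacology", "Antihypertensive drugs", "ACE inhibitors are introduced.", "hypertension"),
    ("Surgery", "Appendectomy", "Indications are summarized.", "appendicitis"),
]
conn.executemany("INSERT INTO section VALUES (?, ?, ?, ?)", rows)


def find_sections(connection, keyword):
    """Return (book, title) for every stored section indexed under the keyword."""
    cursor = connection.execute(
        "SELECT book, title FROM section WHERE keyword = ? ORDER BY book",
        (keyword,),
    )
    return cursor.fetchall()


hits = find_sections(conn, "hypertension")
```

One query returns matching sections from both the internal medicine book and the pharmacology book, which is the cross-publication lookup that a set of scanned PDF files cannot offer.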
Optionally, the publication may be a medical and health book. Its digitized content may be converted from the typesetting file used to publish the paper book, or obtained by scanning the paper book and applying OCR. The method addresses the problems that arise in the deep processing of digitized books: by establishing a mature processing workflow, it not only completes the digitization of paper books and builds a publishing resource database, but also allows a knowledge database to be built and content to be recombined from the processed content.
The publication processing method of this embodiment can process medical publications. In an optional application scenario, the method builds a medical classification system covering disease classification, symptom and sign classification, laboratory examination classification, drug classification, operation and procedure classification, and content-type classification. A medical dictionary is created that integrates Chinese and English medical terms and merges synonym entries; the classification is associated with the dictionary to form the medical "classification-keyword table". A layout analysis tool is used to mark up the structure of the book's page content, including titles, body text, pictures, tables, figure titles and figure notes, table titles and table notes, and special formats of the page content such as superscripts and subscripts, bold, italics and inlays. After the structural markup, all titles of the book are extracted to form a structured hierarchical table of contents, and the book is split into knowledge units according to its content and title structure. For each split unit, the body text and the title are submitted separately to the indexing system, which extracts keywords and an abstract. The extracted keywords are looked up in the loaded "classification-keyword table" and are ranked by weight according to the topic of the split content; the factors considered include the book title, the chapter titles at each level, and position and word-frequency information. The classification code of the category entry corresponding to each identified keyword is obtained, so that the document is classified automatically, and a main classification code and reference classification codes can be given according to the differing keyword weights. The content type is classified automatically from the submitted title features, which automates knowledge classification.
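The lookup from weighted keywords to a main classification code and reference classification codes can be sketched as follows. The synonym map, the table contents and the code values such as `DIS-010` are invented for the example; the patent does not publish its dictionary or its codes.

```python
# Hypothetical synonym map: every variant points at one preferred term.
SYNONYMS = {
    "high blood pressure": "hypertension",
    "htn": "hypertension",
    "高血压": "hypertension",
    "heart attack": "myocardial infarction",
}

# Hypothetical classification-keyword table: preferred term -> classification code.
# The codes are made up for illustration.
CATEGORY_KEYWORD_TABLE = {
    "hypertension": "DIS-010",
    "myocardial infarction": "DIS-021",
    "aspirin": "DRG-105",
}


def normalize(term: str) -> str:
    """Map a Chinese or English variant onto its preferred dictionary term."""
    term = term.strip().lower()
    return SYNONYMS.get(term, term)


def classify(weighted_keywords: list[tuple[str, float]]):
    """Return (main_code, reference_codes) from keywords ranked by weight."""
    codes = []
    for term, _weight in sorted(weighted_keywords, key=lambda kw: -kw[1]):
        code = CATEGORY_KEYWORD_TABLE.get(normalize(term))
        if code is not None and code not in codes:
            codes.append(code)
    if not codes:
        return None, []
    return codes[0], codes[1:]


main_code, reference_codes = classify(
    [("Aspirin", 2.0), ("High blood pressure", 5.0), ("HTN", 1.0), ("diet", 4.0)]
)
```

Merging synonyms before the lookup means that "High blood pressure" and "HTN" resolve to the same code, so the code is reported once, as the main classification, and not again among the reference codes.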
Through book digitization, splitting into knowledge units, and indexing and classification of the split units, a book document is converted into knowledge. Most of this process can be completed automatically, which saves labor cost and improves efficiency and processing quality.
Fig. 2 is a schematic diagram of a publication processing flow according to an embodiment of the present invention. As shown in Fig. 2, the flow comprises the following steps:
Step S201: obtain information through an acquisition interface, that is, obtain the digitized publication. A paper book can be scanned and recognized by OCR and converted into a standard double-layer PDF file, or the electronic typesetting file of the book can be converted into a standard double-layer PS/PDF file.
Step S202: process the information on the book conversion and processing platform. A book processing tool parses the PDF file, marks up and identifies the structure of components such as body text and pictures, and splits the book into knowledge units.
Step S203: submit the text of the knowledge units obtained by splitting the book to the content indexing server.
Step S204: the indexing server indexes the content with keywords and classifications according to the specialized dictionary. The content indexing server indexes keywords from the submitted text; the indexed keywords are converted from keyword to classification code in the classification-keyword table, turning keyword information into classification code information and so achieving automatic classification. A second indexing function classifies the content type according to the features of the submitted title information.
Step S205: convert the indexed keywords into codes using the medical dictionary and the classification-keyword table.
Step S206: establish the basic-data database. A database of basic medical data can be established; it is of great significance for the automated processing of medical content, and experiments confirm that the system achieves high accuracy and high working efficiency. The basic medical database may include data such as the medical dictionary, the medical classification, and the medical classification-keyword table.
Step S207: send the classification code information obtained from the keyword conversion to the content indexing server.
Step S208: pass the indexed text information to the book processing tool, the text information including keywords, classifications, automatic abstracts and the like.
Step S209: integrate the split knowledge units with the index information returned by the content indexing server. Through an interactive human-computer interface, the splitting, indexing and classification can be reviewed manually, which improves their accuracy; after approval, the formatted XML file is exported.
The publication substance treating method of the embodiment of the present invention can be used as a kind of book document digitalization processing method, Neng Goushi Existing content mechanized classification, can be converted by books, obtain exterior content data, and then books are processed, and pass through composition information Books distribution processing is carried out, further by man-computer cooperation classified Indexing, information flow is realized in text by index, conversion The subject distillation of appearance and automatic assorting process, after books are carried out fractionation index according to knowledget opic, formed have it is relatively independent, Knowledge content containing abundant description information, the XML file of export structure not only indicate books distribution information, also contain The classification information of location contents after book document fractionation.Further, blocks of knowledge content is indexed according to medical speciality dictionary Keyword represent the theme of the contents of the section, and keyword and classification have a stringent corresponding relationship, corresponding point of keyword Class can also react the theme of content, and then realize the automatic classification to medicine books.Further, professional medicine dictionary It is the preferably description of disclosure theme with categorizing system.
The embodiment using professional medicine dictionary and is divided by combining the links in books process Class --- antistop list realizes the Digital manufacturing to books, and books are formed blocks of knowledge content, are marked automatically by keyword Draw the index with medical speciality classification, and normative description is carried out to its theme, publishing house may be implemented and make the transition to digital publishing The processing of basic data in the process provides the basic metadata of content for digital applications, realizes knowledge classification automation.
It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
An embodiment of the present invention provides a publication processing apparatus, which can be used to execute the publication processing method of the embodiments of the present invention.
Fig. 3 is a schematic diagram of a publication processing apparatus according to an embodiment of the present invention. As shown in Fig. 3, the apparatus includes:
Acquiring unit 10, configured to obtain a digitized publication.
Extraction unit 20, configured to extract information from the publication according to the layout information of the publication, where the information is divided into multiple categories and the different categories indicate the content structure of the publication.
Processing unit 30, configured to process the publication according to the information.
Optionally, the categories of the information include at least titles. The extraction unit is configured to extract all titles from the publication according to the layout pattern of the publication, and the processing unit is configured to process all titles of the publication to form a hierarchical catalogue. Optionally, the processing unit may include an indexing unit, which performs knowledge-attribute classification indexing on the segments into which the publication is split.
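Title extraction and catalogue building can be sketched as follows. The example assumes one particular special-character layout pattern, namely dotted section numbers such as "1" or "1.2" at the start of a line; that pattern and the names `HEADING`, `extract_titles`, and `build_catalogue` are illustrative choices, not details from the patent. A real layout rule would also need to reject body lines that happen to begin with a number.

```python
import re

# Assumed layout pattern: a dotted section number, whitespace, then the title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


def extract_titles(lines):
    """Return (level, number, title) for every line matching the pattern."""
    titles = []
    for line in lines:
        match = HEADING.match(line.strip())
        if match:
            number, title = match.groups()
            # "1" is level 1, "1.2" is level 2, and so on.
            titles.append((number.count(".") + 1, number, title))
    return titles


def build_catalogue(titles):
    """Nest the flat title list into a hierarchical catalogue."""
    root = {"title": None, "children": []}
    stack = [(0, root)]  # (level, node) pairs along the current branch
    for level, number, title in titles:
        node = {"number": number, "title": title, "children": []}
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]
```

Keeping extraction and nesting separate means a font-style pattern could replace the regular expression without changing how the catalogue is assembled.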
In this embodiment, acquiring unit 10 obtains a digitized publication; extraction unit 20 extracts information from the publication according to the layout information of the publication, where the information is divided into multiple categories and the different categories indicate the content structure of the publication; and processing unit 30 processes the publication according to the information. This solves the problem in the related art that publication digitization methods are highly limited, and thereby makes publication digitization more flexible.
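The division of labour among the three units can be pictured with a small sketch. Everything here is a stand-in: the class names mirror units 10, 20, and 30, the "digitized publication" is plain text rather than OCR output, and the layout rule (a heading is a line starting with `#`) is an assumption made only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Publication:
    lines: List[str]
    info: Dict[str, List[str]] = field(default_factory=dict)
    catalogue: List[str] = field(default_factory=list)


class AcquiringUnit:
    """Stand-in for unit 10: wraps already-digitized text."""

    def acquire(self, raw_text: str) -> Publication:
        return Publication(lines=[l for l in raw_text.splitlines() if l.strip()])


class ExtractionUnit:
    """Stand-in for unit 20: splits lines into categories by a layout rule."""

    def extract(self, pub: Publication) -> Publication:
        pub.info = {"title": [], "body": []}
        for line in pub.lines:
            if line.startswith("#"):  # assumed heading marker
                pub.info["title"].append(line.lstrip("# ").strip())
            else:
                pub.info["body"].append(line.strip())
        return pub


class ProcessingUnit:
    """Stand-in for unit 30: turns the extracted titles into a catalogue."""

    def process(self, pub: Publication) -> Publication:
        pub.catalogue = [f"{i}. {t}" for i, t in enumerate(pub.info["title"], 1)]
        return pub


def run_pipeline(raw_text: str) -> Publication:
    pub = AcquiringUnit().acquire(raw_text)
    pub = ExtractionUnit().extract(pub)
    return ProcessingUnit().process(pub)
```

Because each unit only reads what the previous one produced, a different extraction rule or processing step can be substituted without touching the other units.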
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. Alternatively, they can each be fabricated as a separate integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. For those skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (5)

1. A publication processing method, characterized by comprising:
obtaining a digitized publication, wherein a paper publication is scanned by an optical character recognition method to obtain the digitized publication;
extracting information from the publication according to layout information of the publication, wherein the information is divided into multiple categories, and the different categories indicate the content structure of the publication;
processing the publication according to the information;
the categories of the information include at least: keyword,
extracting the information from the publication according to the layout information of the publication comprises: extracting at least one keyword from the publication;
processing the publication according to the information comprises: determining, according to the at least one keyword, the classification to which the publication belongs, and saving the classification;
extracting the at least one keyword from the publication comprises:
determining which words are keywords according to the frequency of the words appearing in the publication and the positions at which the words appear in the publication;
determining the classification to which the publication belongs according to the at least one keyword comprises:
where there are multiple keywords, determining, according to the weight corresponding to each keyword, the type to which the part corresponding to that keyword belongs;
determining the weight of each keyword according to the number of times the keywords appear or according to the positions at which the keywords appear, and then determining the type to which the publication belongs according to the weight corresponding to each keyword; determining, according to the weight corresponding to a keyword, the type to which the part corresponding to the keyword belongs, that is, the type corresponding to the text segment or chapter of the publication in which the keyword is located;
processing the publication consists of building a publishing resource database from the information extracted from the publication, or building a knowledge database and recombining content from the processed content;
wherein processing the publication includes building a publishing resource database from the information extracted from the publication, classifying the publication according to the classification code of the category entry corresponding to each keyword, obtaining a main classification code and reference classification codes according to the differences in keyword weight, and ranking the keywords according to word frequency, position, and semantic content;
the publication includes a medical publication; a medical classification system is constructed, including disease classification, symptom and sign classification, laboratory examination classification, drug classification, operation and activity classification, and content classification; a medical specialty dictionary is created, Chinese and English medical terms are integrated, and synonym entries are merged; the classifications are associated with the dictionary to form a medical classification-keyword correspondence table; automatic classification is performed according to the medical classification-keyword correspondence table; and the layout information, classification information, and index information are integrated and output as an XML document or stored in a database;
the categories of the information include at least: title,
extracting the information from the publication according to the layout information of the publication comprises: extracting all titles from the publication according to the layout pattern of the publication;
processing the publication according to the information comprises: processing all titles of the publication to form a hierarchical catalogue;
the layout pattern of the publication includes at least one of: a layout pattern segmented by special characters, and a layout pattern segmented by font style.
2. The method according to claim 1, characterized in that the categories of the information further include: body text,
extracting the information from the publication according to the layout information of the publication comprises: extracting the body text from the publication according to the layout pattern of the publication;
processing the publication according to the information comprises: establishing a correspondence between the titles of the publication and the body text, or establishing a correspondence between the hierarchical catalogue and the body text.
3. The method according to claim 2, characterized in that the correspondence is saved in an XML file or saved in a database.
4. A publication processing apparatus, characterized by comprising:
an acquiring unit, configured to obtain a digitized publication, wherein a paper publication is scanned by an optical character recognition method to obtain the digitized publication;
an extraction unit, configured to extract information from the publication according to layout information of the publication, wherein the information is divided into multiple categories, and the different categories indicate the content structure of the publication;
a processing unit, configured to process the publication according to the information;
the categories of the information include at least: keyword,
extracting the information from the publication according to the layout information of the publication comprises: extracting at least one keyword from the publication;
processing the publication according to the information comprises: determining, according to the at least one keyword, the classification to which the publication belongs, and saving the classification;
extracting the at least one keyword from the publication comprises:
determining which words are keywords according to the frequency of the words appearing in the publication and the positions at which the words appear in the publication;
determining the classification to which the publication belongs according to the at least one keyword comprises:
where there are multiple keywords, determining, according to the weight corresponding to each keyword, the type to which the part corresponding to that keyword belongs;
determining the weight of each keyword according to the number of times the keywords appear or according to the positions at which the keywords appear, and then determining the type to which the publication belongs according to the weight corresponding to each keyword; determining, according to the weight corresponding to a keyword, the type to which the part corresponding to the keyword belongs, that is, the type corresponding to the text segment or chapter of the publication in which the keyword is located;
processing the publication consists of building a publishing resource database from the information extracted from the publication, or building a knowledge database and recombining content from the processed content;
wherein processing the publication includes building a publishing resource database from the information extracted from the publication, classifying the publication according to the classification code of the category entry corresponding to each keyword, obtaining a main classification code and reference classification codes according to the differences in keyword weight, and ranking the keywords according to word frequency, position, and semantic content; the publication includes a medical publication; a medical classification system is constructed, including disease classification, symptom and sign classification, laboratory examination classification, drug classification, operation and activity classification, and content classification; a medical specialty dictionary is created, Chinese and English medical terms are integrated, and synonym entries are merged; the classifications are associated with the dictionary to form a medical classification-keyword correspondence table; automatic classification is performed according to the medical classification-keyword correspondence table; and the layout information, classification information, and index information are integrated and output as an XML document or stored in a database;
the categories of the information include at least: title,
extracting the information from the publication according to the layout information of the publication comprises: extracting all titles from the publication according to the layout pattern of the publication;
processing the publication according to the information comprises: processing all titles of the publication to form a hierarchical catalogue;
the layout pattern of the publication includes at least one of: a layout pattern segmented by special characters, and a layout pattern segmented by font style.
5. The apparatus according to claim 4, characterized in that the categories of the information include at least: title,
the extraction unit is configured to extract all titles from the publication according to the layout pattern of the publication;
the processing unit is configured to process all titles of the publication to form a hierarchical catalogue.
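
The keyword weighting and the main and reference classification codes recited in claims 1 and 4 might be sketched as below. The claims state only that weights depend on how often and where keywords appear; the position bonuses, the code values such as "D01", and the function names `keyword_weights` and `classification_codes` are hypothetical.

```python
from collections import defaultdict

# Hypothetical position bonuses: a hit in a title counts more than one in the body.
POSITION_BONUS = {"title": 3.0, "body": 1.0}

# Hypothetical keyword -> classification code table.
KEYWORD_CODE = {
    "myocardial infarction": "D01",
    "aspirin": "M03",
    "chest pain": "S02",
}


def keyword_weights(occurrences):
    """Sum a position-dependent bonus over every (keyword, position) hit,
    so the weight reflects both frequency and position."""
    weights = defaultdict(float)
    for keyword, position in occurrences:
        weights[keyword] += POSITION_BONUS[position]
    return dict(weights)


def classification_codes(weights):
    """Rank keywords by weight; the top keyword's code is the main code
    and the remaining distinct codes are reference codes."""
    ranked = sorted(weights, key=lambda kw: (-weights[kw], kw))
    codes = []
    for keyword in ranked:
        code = KEYWORD_CODE[keyword]
        if code not in codes:
            codes.append(code)
    if not codes:
        return None, []
    return codes[0], codes[1:]
```

Applying the same two functions to the hits inside a single chapter or text segment would give the type of that segment rather than of the whole publication.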
CN201610972309.6A 2016-10-28 2016-10-28 Publication treating method and apparatus Active CN106502991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610972309.6A CN106502991B (en) 2016-10-28 2016-10-28 Publication treating method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610972309.6A CN106502991B (en) 2016-10-28 2016-10-28 Publication treating method and apparatus

Publications (2)

Publication Number Publication Date
CN106502991A CN106502991A (en) 2017-03-15
CN106502991B true CN106502991B (en) 2019-07-19

Family

ID=58323170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610972309.6A Active CN106502991B (en) 2016-10-28 2016-10-28 Publication treating method and apparatus

Country Status (1)

Country Link
CN (1) CN106502991B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334725A (en) * 2017-04-24 2018-07-27 广东健凯医疗有限公司 Health data electronic disposal system and method
CN107609169A (en) * 2017-09-27 2018-01-19 合肥博力生产力促进中心有限公司 A kind of patent name back-stage management analysis system based on database
CN107748738A (en) * 2017-10-27 2018-03-02 上海京颐科技股份有限公司 The generation method and device of e-book catalogue, storage medium, computing device
CN108536683A (en) * 2018-04-18 2018-09-14 同方知网数字出版技术股份有限公司 A kind of paper fragmentation information abstracting method based on machine learning
CN111507091A (en) * 2019-01-11 2020-08-07 北大方正信息产业集团有限公司 Entry checking method, device, equipment and storage medium for publication
CN111061863B (en) * 2019-12-16 2023-09-15 新方正控股发展有限责任公司 Journal catalog display method, device and equipment
CN111046629B (en) * 2019-12-16 2022-03-01 北大方正集团有限公司 Outline display method, device and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541948A (en) * 2010-12-23 2012-07-04 北大方正集团有限公司 Method and device for extracting document structure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092931A (en) * 2012-12-31 2013-05-08 武汉传神信息技术有限公司 Multi-strategy combined document automatic classification method

Also Published As

Publication number Publication date
CN106502991A (en) 2017-03-15

Similar Documents

Publication Publication Date Title
CN106502991B (en) Publication treating method and apparatus
CN111753099B (en) Method and system for enhancing relevance of archive entity based on knowledge graph
CN110399457B (en) Intelligent question answering method and system
US8046368B2 (en) Document retrieval system and document retrieval method
US8161059B2 (en) Method and apparatus for collecting entity aliases
US6907431B2 (en) Method for determining a logical structure of a document
CN106570171A (en) Semantics-based sci-tech information processing method and system
CN110097278B (en) Intelligent sharing and fusion training system and application system for scientific and technological resources
CN111274239A (en) Test paper structuralization processing method, device and equipment
CN111814425A (en) Book automatic typesetting implementation method based on book character information
CN108197119A (en) The archives of paper quality digitizing solution of knowledge based collection of illustrative plates
CN111814485A (en) Semantic analysis method and device based on massive standard document data
WO2017193472A1 (en) Method of establishing digital dongba ancient text interpretive library
CN111104437A (en) Test data unified retrieval method and system based on object model
Li et al. Visual segmentation-based data record extraction from web documents
CN111966940B (en) Target data positioning method and device based on user request sequence
EP2544100A2 (en) Method and system for making document modules
CN109766442A (en) A kind of couple of user takes down notes the method and system classified
US20080015843A1 (en) Linguistic Image Label Incorporating Decision Relevant Perceptual, Semantic, and Relationships Data
Klampfl et al. Reconstructing the logical structure of a scientific publication using machine learning
CN114238735B (en) Intelligent internet data acquisition method
CN110765107A (en) Question type identification method and system based on digital coding
Nghiem et al. Using MathML parallel markup corpora for semantic enrichment of mathematical expressions
CN114970543A (en) Semantic analysis method for crowdsourced design resources
KR101104753B1 (en) Extraction method for hierarchical structure in text contents of structural calculation document

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant