CN100552673C - Open document isomorphic engine system - Google Patents


Info

Publication number
CN100552673C
Authority
CN
China
Prior art keywords
document
module
information
submodule
notion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007100454517A
Other languages
Chinese (zh)
Other versions
CN101114281A (en)
Inventor
刘功申
杨金升
王士林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CNB2007100454517A
Publication of CN101114281A
Publication of CN100552673C

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

An open document isomorphic engine system in the field of information security technology, wherein: the physical structure module accepts documents of various formats as input and outputs the physical structure of the document to the logical structure module; the logical structure module processes the information received from the physical structure module to obtain the logical structure of the document and passes it to the lexical and syntactic analysis module; the lexical and syntactic analysis module receives the information from the logical structure module, analyzes it to obtain the analyzed document, and passes that document to the concept extraction module; the concept extraction module processes the information from the lexical and syntactic analysis module to obtain the concepts converted from the words in the document, together with the concept attributes, and passes them to the topic representation module; the topic representation module processes the information from the concept extraction module to obtain a document topic expressed in units of concepts. The invention solves the problem that multi-format documents cannot be processed in a unified way.

Description

Open document isomorphic engine system
Technical field
The present invention relates to a system in the field of information security technology, specifically an open document isomorphic engine system (ODIE, Open Document Isomorphic Engine).
Background technology
In the content security field, every content security product based on text information must perform semantic understanding and harmful-information filtering on text. All such products face a common problem: extracting, from documents of many kinds, the plain text to be understood and filtered. Because document formats are complex and diverse, most systems sidestep this difficulty, which lowers their accuracy in practice.
Obtaining plain text currently involves two difficulties.
(1) How to handle diverse original document formats and obtain plain text from them. By degree of structure, electronic documents in practice fall into three classes: structured documents (e.g., XML), semi-structured documents (e.g., HTML, DOC, WPS, PDF), and free documents (e.g., TXT). A free document contains only text content, so obtaining its plain text is trivial. Structured and semi-structured documents contain text content plus a large amount of markup (tag) information, so obtaining plain text from them is considerably more complex. Version differences within each document format complicate the problem further. Whether diverse original document formats can be handled by one unified method is therefore a key issue.
(2) How to describe the text information in a unified way that suits various application systems, content security included. Besides content security systems, content-based information filtering, automatic text classification, information retrieval, and similar systems all need to preprocess multi-format documents. Designing a unified description that suits all of these systems is a key issue.
The goal of the open isomorphic engine is to obtain, from diverse document formats, the text content and the semantics it represents, and to offer them to higher-level systems. Isomorphizing multi-format documents frees other application systems from the difficulty of document parsing, so that they can concentrate on their own specialized techniques. Document isomorphization is foundational work for content-based information security, automatic classification, automatic indexing, automatic retrieval, and related research.
A search of the prior art found the paper "Document Logic Structure By Machine Learning", IEEE Conference on Machine Learning and Cybernetics, 2002, 12 (December 2002), which proposes the open document layer model (ODLM-Open Document Layer Module). According to the practical needs of natural language processing techniques, that paper divides the model into five layers: the physical structure layer, the logical structure layer, the lexical and syntactic analysis layer, the concept extraction layer, and the topic representation layer. Through these five layers, ODLM refines the whole process of analyzing an electronic document and describes the content of each layer, giving electronic document analysis a clear layered framework. However, it does not provide a complete system that can be put to concrete use.
The search also found "Document Logical Structure Analysis Based on Perceptive Cycles", Lecture Notes in Computer Science 3872, pp. 117-128, Springer-Verlag Berlin Heidelberg, 2006. That work identifies the logical structure of image documents (optically scanned documents) with a neural network method, but it concentrates only on logical structure analysis. Its shortcomings are: 1) its main goal is limited to analyzing document logical structure; 2) it analyzes logical structure directly from the image file, with no abstract interface (document physical structure recognition) ahead of logical structure recognition; 3) lacking this intermediate physical-structure interface, it can handle only a single document format rather than diverse formats; 4) it offers no services at the word, sentence, concept, or topic levels.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing an open document isomorphic engine system that can extract the plain text content of multi-format documents and the semantics it represents, solves the problem that multi-format documents cannot be processed in a unified way, and can be applied to semantic analysis and Internet content security analysis projects.
The invention is achieved by the following technical solution. It comprises five functional modules which, in order of information processing, are: the physical structure module, the logical structure module, the lexical and syntactic analysis module, the concept extraction module, and the topic representation module, wherein:
The physical structure module defines the physical arrangement and layout of each part of the document. It accepts documents of various formats as input and outputs the physical structure of the document to the logical structure module. It also provides normalized data for the whole system;
The logical structure module specifies the logical elements of the document and their categories. It receives the information from the physical structure module, processes it to obtain the logical structure of the document, and passes that logical structure to the lexical and syntactic analysis module;
The lexical and syntactic analysis module provides word segmentation marks, part-of-speech tags, and syntactic annotations for each sentence in the text. It receives the information from the logical structure module, analyzes it to obtain the analyzed document, and passes that document to the concept extraction module;
The concept extraction module automatically summarizes the concepts the document contains. It receives the information from the lexical and syntactic analysis module, processes it to obtain the concepts converted from the words in the document together with the concept attributes, and passes these concepts and attributes to the topic representation module;
The topic representation module computes the weight of each concept according to the user's selection and then gives a vector space model (VSM) representation of the document. It receives the information from the concept extraction module and processes it to obtain a document topic expressed in units of concepts.
The input of the physical structure module comprises electronic documents in all kinds of formats (for example, TXT, XML, HTML, scanned text documents, DOC, WPS, PDF). The physical structure that the module outputs consists of unformatted characters (for example, English letters, Chinese characters), the format information of those characters, and layout information. The physical structure identifies line-break characters, so paragraphs can be clearly distinguished. It should also indicate the language of the original document (for example, English or Chinese); if the language is Chinese, the encoding of the original document (for example, GB or BIG5) should be marked as well. Electronic documents come in many formats, which hinders information processing. In general, electronic documents contain "heterogeneous information" in "multiple formats"; the physical structure module isomorphizes this heterogeneous information, that is, represents it by one unified standard.
The physical structure module extracts the plain text and its format information from the markup and ignores junk information. The format information of the text is of two kinds: character format information, which describes a single character, and paragraph format information, which describes a paragraph.
The physical structure module comprises a paragraph normalization submodule, a format information normalization submodule, a noise removal submodule, an article feature recognition submodule, a subheading recognition submodule, a subheading error correction submodule, and a logical structure tree generation submodule, wherein:
The paragraph normalization submodule takes as input a non-standard document that misuses hard returns, removes the non-standard hard returns from the document structure, and outputs the corrected document to the format information normalization submodule;
The format information normalization submodule accepts the output of the paragraph normalization submodule, unifies at a coarse granularity, on the logical structure layer, the format information obtained by the physical structure layer, and outputs the document with normalized formats to the noise removal submodule;
The noise removal submodule accepts the output of the format information normalization submodule, removes the parts of the article that are not body text, and outputs the denoised document to the article feature recognition submodule;
The article feature recognition submodule accepts the output of the noise removal submodule, determines the logical category of each paragraph, and outputs the document with paragraph logical categories marked to the subheading recognition submodule;
The subheading recognition submodule accepts the output of the article feature recognition submodule, recognizes numbered subheadings with an automaton and unnumbered subheadings by their features, and outputs the document with both kinds of subheading clearly marked to the subheading error correction submodule;
The subheading error correction submodule accepts the output of the subheading recognition submodule, corrects the original author's clerical errors, and outputs the corrected document to the logical structure tree generation submodule;
The logical structure tree generation submodule accepts the output of the subheading error correction submodule, describes the document logical structure in the form of a logical structure tree, and outputs that tree to the logical structure module.
The main task of the logical structure module is to identify the logical category of each part of the document. The logical structure indicates the logical category of each part of the original document (for example, title, author abstract, author information, keywords, body text, headings at each level, references) and describes the whole document with a logical structure tree. Specifically, it uses machine learning to recognize the logical category of each part of the original document, identifies subheadings at each level (numbered and unnumbered), determines their levels and corrects their errors, and forms a logical structure tree that expresses the hierarchy of the original text.
Using a keyword dictionary with attribute descriptions, the lexical and syntactic analysis module analyzes and annotates the sentences in the text by combining lexical analysis with syntactic analysis. Lexical analysis yields multiple candidate word-segmentation and part-of-speech tag sequences. Syntactic analysis builds on lexical analysis, applying part-of-speech modification relations and sentence patterns to mark the constituents of the sentence (subject, predicate, object). The invention expresses the part-of-speech modification relations and sentence patterns used in syntactic analysis as probabilities and computes the probability that a syntactic analysis result is correct. That probability can in turn be used to select one sequence from the candidate lexical analysis results.
The output of the concept extraction module is the concepts converted from the words in the document and several attributes of each concept, namely the frequency with which the concept occurs in the text, its position in the text, and its distribution. Owing to social factors such as region and time, vocabulary in the broad sense is extremely varied, so it is necessary to summarize and organize words by concept; the concept extraction module performs this function. The module builds its concept base on HowNet, WordNet (the lexical network developed at Princeton University), and Tongyici Cilin (a Chinese synonym thesaurus), then uses the concept base together with a conversion algorithm to obtain the concepts the document contains and their associated attributes.
The key problems of concept extraction are the construction of the concept base and access to it. The concept base is organized as follows: a concept entry has a high degree of synonymy with its level-0, level-1, level-2, and level-3 expansion word strings; a lower degree of synonymy with its level-4 and level-5 expansion word strings; and the lowest degree of synonymy with its level-6 expansion word strings. To access the concept base quickly, hashing is used: the level-0 and level-1 expansion word strings are arranged in dictionary order, and each word string can be mapped to its corresponding concept entry.
According to the user's selection, the topic representation module computes concept weights using methods such as concept frequency, concept position, Boolean weight, TFIDF (Term Frequency Inverse Document Frequency) weight, and information-entropy-based weight (some of these methods require a document collection). It then represents the document as a vector in a vector space; dimensionality reduction is implemented by threshold control.
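One of the weighting options named above, TFIDF weighting followed by threshold-based dimensionality reduction, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `tfidf_vectors`, the natural-logarithm form of the inverse document frequency, and the sample concepts are all hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, threshold=0.0):
    """Return one {concept: weight} vector per document.

    docs: list of concept lists (hypothetical output of concept extraction).
    Weights not above `threshold` are dropped, which is the assumed form
    of "threshold control" for dimensionality reduction.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for concepts in docs:
        doc_freq.update(set(concepts))  # count each concept once per document

    vectors = []
    for concepts in docs:
        term_freq = Counter(concepts)
        total = len(concepts)
        vector = {}
        for concept, count in term_freq.items():
            weight = (count / total) * math.log(n_docs / doc_freq[concept])
            if weight > threshold:
                vector[concept] = weight
        vectors.append(vector)
    return vectors

docs = [["security", "security", "filter"], ["filter", "retrieval"]]
vectors = tfidf_vectors(docs)
```

In this toy collection the concept "filter" occurs in every document, so its inverse document frequency is zero and it is removed by the threshold.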
The invention is built on one basic theory: the open document layer model. According to the practical needs of natural language processing techniques, the open document layer model (ODLM-Open Document Layer Module) is divided into five layers: the physical structure layer, the logical structure layer, the lexical and syntactic analysis layer, the concept extraction layer, and the topic representation layer. The system application architecture with ODIE at its core is divided, from the bottom up, into three parts: the original document layer, ODIE, and the application layer. The core of ODIE analyzes and processes multi-format documents under the guidance of the ODLM model and is therefore divided into five levels that match the model. The application layer can obtain services of different quality from the ODIE engine (corresponding to the five levels of the ODLM model) to suit its different needs.
Compared with the prior art, the invention can extract the plain text content of multi-format documents and the semantics it represents. During physical structure and logical structure analysis, it fully extracts and uses format information such as font, font size, and layout, together with feature-string information, that is, the complete information. It uses concepts to represent the topic of an article; concepts are more standardized than words, so their weights can be computed more accurately. Extensibility: users can integrate new document formats into the engine to support special file formats. Service diversity: applications can obtain services at different levels from the engine as needed. The system can be applied to semantic analysis and Internet content security analysis projects (for example, anti-spam systems, Chinese automatic summarization systems, and Internet public opinion analysis and monitoring systems) and has reached a level suitable for practical use.
Description of drawings
Fig. 1 is the system architecture diagram of the present invention.
Fig. 2 is the architecture block diagram of an application embodiment of the present invention.
Fig. 3 is a schematic diagram of the document logical structure analysis process in an application embodiment of the present invention.
Embodiment
The embodiments of the invention are described in detail below with reference to the drawings. This embodiment is implemented on the basis of the technical solution of the invention and gives a detailed implementation and concrete operating process, but the scope of protection of the invention is not limited to the following embodiment.
As shown in Fig. 1, guided by the ODLM theory, the invention implements an engine suited to real environments: the open document isomorphic engine (ODIE) system. According to the practical needs of natural language processing techniques, the processing of an electronic document is divided in theory into five levels: the physical structure layer, the logical structure layer, the lexical and syntactic analysis layer, the concept extraction layer, and the topic representation layer. In the implementation, these five levels correspond respectively to the physical structure module, the logical structure module, the lexical and syntactic analysis module, the concept extraction module, and the topic representation module.
The physical structure module defines the physical arrangement and layout of each part of the document. It accepts documents of various formats as input and outputs the physical structure of the document to the logical structure module. Through its processing it also provides normalized data for the whole system;
The logical structure module specifies the logical elements of the document and their categories. It receives the information from the physical structure module, processes it to obtain the logical structure of the document, and passes that logical structure to the lexical and syntactic analysis module;
The lexical and syntactic analysis module provides word segmentation marks, part-of-speech tags, and syntactic annotations for each sentence in the text. It receives the information from the logical structure module, analyzes it to obtain the analyzed document, and passes that document to the concept extraction module;
The concept extraction module automatically summarizes the concepts the document contains. It receives the information from the lexical and syntactic analysis module, processes it to obtain the concepts converted from the words in the document together with the concept attributes, and passes these concepts and attributes to the topic representation module;
The topic representation module computes the weight of each concept according to the user's selection and then gives a vector space model (VSM) representation of the document. It receives the information from the concept extraction module and processes it to obtain a document topic expressed in units of concepts.
As shown in Fig. 2, the system application architecture with ODIE at its core is divided, from the bottom up, into three parts: the original document layer, ODIE, and the application layer. The core of ODIE analyzes and processes multi-format documents under the guidance of the ODLM model and is therefore divided into five levels that match the model. The application layer can obtain services of different quality from the ODIE engine system (corresponding to the five levels of the ODLM model) to suit its different needs.
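The five-level arrangement, in which an application can take the result of any level, can be sketched as a chain of stages. This is a hedged illustration only: the layer names in `LAYERS`, the function `run_engine`, and the toy stages are hypothetical and stand in for the real modules.

```python
from typing import Any, Callable, Dict

# Hypothetical layer names following the five levels described above.
LAYERS = ["physical", "logical", "lexical_syntactic", "concept", "topic"]

def run_engine(document: Any,
               stages: Dict[str, Callable[[Any], Any]],
               up_to: str = "topic") -> Any:
    """Run the stages in order and stop after the requested level."""
    if up_to not in LAYERS:
        raise ValueError(f"unknown layer: {up_to}")
    result = document
    for name in LAYERS:
        result = stages[name](result)  # each stage consumes the previous output
        if name == up_to:
            break
    return result

# Toy stages that only record which layers ran.
toy_stages = {name: (lambda data, n=name: data + [n]) for name in LAYERS}
partial = run_engine([], toy_stages, up_to="logical")
full = run_engine([], toy_stages)
```

An application that needs only the logical structure would request `up_to="logical"`, while one that needs the topic vector runs all five stages.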
First, physical structure analysis: so that unknown document formats can be added to the system, this part is designed with an extensible interface. This embodiment takes HTML as an example; its analysis proceeds as follows:
HTML is used to author hypertext documents whose links work across different platforms. HTML markup can express hypertext news, mail, documents, and hypermedia. The physical structure module of this embodiment extracts the plain text and its format information from this markup and ignores junk information. The format information of the text is of two kinds: character format information, which describes a single character, and paragraph format information, which describes a paragraph.
For documents in HTML format, the markup is as follows: <P> denotes a paragraph of plain text; font-size gives the size of the text in CSS form; <Li> indicates that the text is title text.
For documents in PDF format, the markup obj<XXX> stream denotes the text and font-format stream.
Documents in WORD and WPS formats use OLE (Object Linking and Embedding) encoding internally, so the interfaces supplied with the application software are needed to read the text and its font format.
Documents in TXT format are free documents, so the text can be read directly. TXT documents carry no font format information.
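For the HTML case, extracting paragraph text from markup while ignoring non-text content can be sketched with Python's standard `html.parser` module. This is a minimal sketch under assumptions: the class `ParagraphExtractor` and the function `extract_paragraphs` are hypothetical names, only <p> elements are collected, and format information such as font-size is not captured.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the plain text of each <p> element; skip <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None    # text pieces of the open paragraph, or None
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None and not self._skip_depth:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if text:
                self.paragraphs.append(text)
            self._buffer = None

def extract_paragraphs(html: str):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.paragraphs

sample = "<html><body><p>Hello <b>world</b></p><script>var x;</script><P>Second</P></body></html>"
paragraphs = extract_paragraphs(sample)
```

A parser for another format would expose the same output shape, which is one way to read the extensible interface mentioned above.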
The format information is listed below.
Character format information (C denotes a character):
font_absolute_size(C)={0,1,2,...,N}. The absolute font size of the character.
font_relative_size(C)={BIG,EQUAL,SMALL}. The font size of the character relative to the body text. References [95, 96] consider only the absolute font size, but the relative font size is sometimes more informative than the absolute value. For example, the main characteristic of a title's font size is that it is generally larger than the body text, regardless of its absolute size.
font_style(C)={0,1,2,...,N}. The font style of the character (font styles are mapped to the set of natural numbers).
font_color(C)={0,1,2,...,N}. The font color of the character (font colors are mapped to the set of natural numbers).
Paragraph format information (P denotes a paragraph):
alignment(P)={LEFT,CENTER,RIGHT}. The paragraph is left-aligned, centered, or right-aligned, respectively.
width(P)={BROAD,EQUAL,NARROW}. The three values indicate that the paragraph is wider than, equal to, or narrower than the body text.
type_of(P)={CHARACTER,TABLE,FIGURE,OTHER}. The paragraph is text, a table, a figure, or something else.
indent(P)={0,1,2,...,N}. The number of characters by which the paragraph is indented.
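The attributes listed above map naturally onto small record types. The following is a sketch, assuming Python dataclasses and enums as the representation; the class names and the helper `relative_size` are hypothetical, while the attribute names and value sets follow the list.

```python
from dataclasses import dataclass
from enum import Enum

class RelativeSize(Enum):
    BIG = "BIG"
    EQUAL = "EQUAL"
    SMALL = "SMALL"

class Alignment(Enum):
    LEFT = "LEFT"
    CENTER = "CENTER"
    RIGHT = "RIGHT"

class Width(Enum):
    BROAD = "BROAD"
    EQUAL = "EQUAL"
    NARROW = "NARROW"

class ParagraphType(Enum):
    CHARACTER = "CHARACTER"
    TABLE = "TABLE"
    FIGURE = "FIGURE"
    OTHER = "OTHER"

@dataclass(frozen=True)
class CharFormat:
    font_absolute_size: int            # 0..N
    font_relative_size: RelativeSize   # relative to the body text
    font_style: int                    # style mapped to a natural number
    font_color: int                    # color mapped to a natural number

@dataclass(frozen=True)
class ParagraphFormat:
    alignment: Alignment
    width: Width
    type_of: ParagraphType
    indent: int                        # indentation in characters

def relative_size(char_size: int, body_size: int) -> RelativeSize:
    """Derive the relative size from two absolute sizes (assumed rule)."""
    if char_size > body_size:
        return RelativeSize.BIG
    if char_size < body_size:
        return RelativeSize.SMALL
    return RelativeSize.EQUAL

title_char = CharFormat(16, relative_size(16, 12), font_style=1, font_color=0)
```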
Second, logical structure analysis: a structured document can be divided into several parts, the simplest division being "title + body text + additional information". Many studies show that words appearing in different parts and positions contribute differently to the topic. Before extracting the topic, it is therefore important to know in advance the part and position in which each word appears. The role of the logical structure module is to analyze the overall structure of the document, identifying the titles of the article (main title, subtitle, subheadings at each level) and the position of each sentence in the article (first paragraph, last paragraph, start of paragraph, end of paragraph, and so on). The text structure information obtained in this way plays a very important role in subsequent feature extraction.
Fig. 3 shows the document logical structure analysis process. The physical structure module implements logical structure analysis, which comprises the steps of paragraph normalization, format information normalization, noise removal, article feature recognition, subheading recognition, subheading error correction, and logical structure tree generation. Paragraph normalization removes non-standard or misused hard returns. Noise removal deletes parts that do not belong to the article content, such as related links and advertisements in Internet news. Article feature recognition determines the logical category of each paragraph. Subheading recognition uses an automaton to recognize numbered subheadings and certain special features to recognize unnumbered subheadings. Subheading error correction corrects the original author's clerical errors. Finally, the document logical structure is described in the form of a logical structure tree. A learning function increases the adaptability of the logical structure layer. Offline learning builds a knowledge base by processing manually annotated documents; the knowledge base is the source of rules for the logical structure layer's processing. Online learning teaches the system through a visual interface, which gives the system its adaptability. The logical structure analysis described above is implemented by the following submodules:
The paragraph normalization submodule removes non-standard or misused hard returns from the document structure. Its input is a non-standard document that misuses hard returns, and its output is the document with the misuse corrected.
The format information normalization submodule normalizes at a coarse granularity, on the logical structure layer, the format information obtained by the physical structure layer. After normalization, format information that originally applied only to characters is extended to apply to a complete sentence or paragraph. For example, if more than 80% (an adjustable threshold) of the characters in a sentence are bold, the logical structure layer treats the whole sentence as bold.
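The 80% rule in the example above can be sketched as a single function. The name `sentence_is_bold` and the per-character boolean input are assumptions for illustration; integer arithmetic is used so that exactly 80% does not count as "more than 80%".

```python
def sentence_is_bold(bold_flags, threshold_percent=80):
    """Return True when more than threshold_percent of characters are bold.

    bold_flags: one boolean per character in the sentence (hypothetical
    output of the physical structure layer).
    """
    total = len(bold_flags)
    if total == 0:
        return False
    bold_count = sum(1 for flag in bold_flags if flag)
    # Compare as integers to avoid floating-point edge cases at the threshold.
    return bold_count * 100 > threshold_percent * total

mostly_bold = [True] * 9 + [False]      # 90% bold
exactly_80 = [True] * 8 + [False] * 2   # 80% bold, not "more than" 80%
```

The same pattern would extend to other character attributes, such as font size or color, by replacing the boolean flag with a majority vote over attribute values.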
The noise removal submodule removes parts that do not belong to the article content, such as related links and advertisements in Internet news. Its input is a document containing non-text information such as advertisement links and related-news links, and its output is the document with this noise removed.
The article feature recognition submodule determines the logical category of each paragraph. Its input is a document whose logical categories are not explicitly marked, and its output is a document with the logical category of each paragraph marked, at which point it is known which parts are the title of the document and which are the body text.
The subheading recognition submodule uses an automaton to recognize numbered subheadings and certain features to recognize unnumbered subheadings. Its input is a document in which subheadings are not explicitly marked, and its output is a document in which numbered and unnumbered subheadings are clearly marked.
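Recognition of numbered subheadings can be sketched with a regular expression, which is one common way to realize a finite automaton. This is an assumption-laden illustration: the pattern `NUMBERED_HEADING` and the function `parse_numbered_heading` are hypothetical, only dotted decimal numbering such as "1.2.1" is handled, and a body sentence that happens to start with a number would also match, so a real system would combine this with format features.

```python
import re

# Dotted decimal numbering, an optional trailing dot, then the heading text.
NUMBERED_HEADING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(\S.*)$")

def parse_numbered_heading(line: str):
    """Return (number_tuple, title) for a numbered heading, else None."""
    match = NUMBERED_HEADING.match(line)
    if match is None:
        return None
    numbers = tuple(int(part) for part in match.group(1).split("."))
    return numbers, match.group(2).strip()

deep = parse_numbered_heading("1.2.1 Related work")
top = parse_numbered_heading("3. Results")
plain = parse_numbered_heading("Plain sentence without a number.")
```

The length of the number tuple gives the heading level, which later steps can use to place the heading in the hierarchy.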
The function of the subheading error correction submodule is to correct clerical errors made by the author of the original text. Its input is the output of the subheading recognition submodule; at this stage the marked subheadings may still contain errors caused by the author's clerical mistakes. For example, if the author mistakenly wrote "1.2.1" as "1.3.1", this submodule can repair the error. Its output is the document after error correction.
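One possible way to repair a numbering slip such as "1.3.1" written for "1.2.1" is sketched below. The rule used here (a heading number must be a valid successor of the previous one; otherwise it is replaced by the valid successor of the same depth) is an assumption chosen for illustration, not the correction method of the patent. The names `valid_successors` and `correct_numbers` are hypothetical.

```python
def valid_successors(prev):
    """Numbers that may legally follow `prev`, e.g. after (1, 2):
    the first child (1, 2, 1), the next sibling (1, 3), the next parent (2,)."""
    successors = [prev + (1,)]
    for depth in range(len(prev), 0, -1):
        successors.append(prev[:depth - 1] + (prev[depth - 1] + 1,))
    return successors


def correct_numbers(numbers):
    """Replace each invalid number by the valid successor of the same depth."""
    if not numbers:
        return []
    corrected = [numbers[0]]
    for observed in numbers[1:]:
        candidates = valid_successors(corrected[-1])
        if observed not in candidates:
            same_depth = [c for c in candidates if len(c) == len(observed)]
            if same_depth:
                # Assumed repair rule: trust the depth, fix the digits.
                observed = same_depth[0]
        corrected.append(observed)
    return corrected


headings = [(1,), (1, 1), (1, 2), (1, 3, 1), (1, 2, 2), (1, 3)]
print(correct_numbers(headings))
```

In this run the fourth heading, written as 1.3.1, is rewritten to 1.2.1 because it directly follows 1.2, which matches the example given in the text.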
The function of the logical structure tree generation submodule is to describe the logical structure of the document in the form of a logical structure tree. Its input is the output of the subheading error correction submodule, and its output is the logical structure tree of the document.
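A logical structure tree can be built from a flat sequence of headings with a simple stack, as in the sketch below. The `Node` class, the `build_tree` function, and the input format (a list of `(level, title)` pairs in document order) are assumptions for illustration; the patent does not define the tree's data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int                      # 0 for the document root
    children: list = field(default_factory=list)


def build_tree(doc_title, headings):
    """headings: (level, title) pairs in document order, with level >= 1."""
    root = Node(doc_title, 0)
    stack = [root]                  # path from the root to the latest heading
    for level, title in headings:
        # Climb up until the top of the stack is a strict ancestor level.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(title, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


tree = build_tree("Report", [(1, "Intro"), (2, "Background"),
                             (2, "Scope"), (1, "Method")])
print([child.title for child in tree.children])
print([child.title for child in tree.children[0].children])
```

Body paragraphs could be attached to the most recent heading node in the same loop; that part is omitted here.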
Third, lexical and syntactic analysis: automatic word segmentation is a fundamental problem in natural language processing. It includes two approaches, mechanical word segmentation and understanding-based word segmentation, and there is no strict order of precedence between the two. The lexical and syntactic analysis module of this embodiment adopts a combined lexical and syntactic analysis method, and the analysis uses a commonly used probabilistic model based on a grammar treebank.
Fourth, concept extraction: the core problems of concept extraction are the structure of the concept base and the algorithm for accessing it. The concept base of the concept extraction module in this embodiment is organized as follows: a concept entry has a high degree of synonymy with its level-0, level-1, level-2, and level-3 expansion strings; a lower degree of synonymy with its level-4 and level-5 expansion strings; and the lowest degree of synonymy with its level-6 expansion strings. To access the concept base quickly, a hashing technique is used, with the level-0 and level-1 expansion strings arranged in dictionary order, and every string can be mapped to its corresponding concept entry.
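The mapping from expansion strings to concept entries can be pictured with a hash table, as in the following sketch. A Python `dict` stands in for the hashing technique; the dictionary-order arrangement mentioned above is not modeled. The entries and the levels assigned to them are hypothetical examples, since the actual levels come from Table 1, which is not reproduced in the text.

```python
from typing import NamedTuple, Optional


class Expansion(NamedTuple):
    concept: str   # the concept entry the string maps to
    level: int     # expansion level, 0 to 6


# Hypothetical concept base; the levels are illustrative assumptions.
CONCEPT_BASE = {
    "Hong Kong": Expansion("Hong Kong", 0),
    "Hong Kong Special Administrative Region": Expansion("Hong Kong", 1),
    "New Territories": Expansion("Hong Kong", 4),
}


def lookup(string) -> Optional[Expansion]:
    """Map a string to its concept entry, or None if it is not in the base."""
    return CONCEPT_BASE.get(string)


def synonymy_band(level):
    """Synonymy degree by expansion level, following the grouping in the text."""
    if 0 <= level <= 3:
        return "high"
    if level in (4, 5):
        return "lower"
    if level == 6:
        return "lowest"
    raise ValueError("expansion level must be between 0 and 6")


entry = lookup("Hong Kong Special Administrative Region")
print(entry.concept, synonymy_band(entry.level))
```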
Table 1, the concept entry level expansion table, shows the relationship between the representative word "Hong Kong" and related strings found in articles. For example, "Hong Kong Special Administrative Region" is a string with the same meaning as "Hong Kong". The string "New Territories" stands in a hyponymy relationship to "Hong Kong", but the two cannot fully replace each other. When a related string is encountered in an article, it can be normalized into a coefficient for the representative word "Hong Kong". For example, if "Hong Kong" occurs x times in a document and "New Territories" occurs y times, the coefficient with which the concept "Hong Kong" represents the article is 1 × x + 0.5 × y. Table 1 is as follows:
Figure C20071004545100141
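The coefficient 1 × x + 0.5 × y from the example above can be computed as a weighted sum of occurrence counts. In the sketch below, the weights 1 and 0.5 are taken from that example; the function name `concept_coefficient` and the dictionary layout are assumptions, and the weights for other expansion levels are not given in the text.

```python
# Weights taken from the worked example in the text.
HONG_KONG_WEIGHTS = {
    "Hong Kong": 1.0,
    "New Territories": 0.5,
}


def concept_coefficient(counts, weights):
    """Weighted sum of the occurrence counts of the strings of one concept.

    counts: occurrences of each string in the document.
    weights: weight of each expansion string for this concept.
    """
    return sum(weights[s] * n for s, n in counts.items() if s in weights)


# "Hong Kong" appears x = 3 times and "New Territories" y = 2 times:
# 1 * 3 + 0.5 * 2 = 4.0
print(concept_coefficient({"Hong Kong": 3, "New Territories": 2},
                          HONG_KONG_WEIGHTS))
```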
Fifth, topic representation: topic representation comprises two parts, feature selection and weighting. Using the algorithm of the concept extraction module, a series of concepts can be extracted from a document, each of which represents the document to some degree. Feature selection chooses a group of concepts that can represent a document and forms them into a vector. The weighting algorithm of this embodiment uses the representative coefficient of each concept with respect to the document. For example, a document may contain the concept "Hong Kong" with a representative coefficient of 1.5, the concept "politics" with a representative coefficient of 10, and the concept "election" with a representative coefficient of 0.5; the topic of the document is then expressed as: (politics, 10; Hong Kong, 1.5; election, 0.5).
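The topic vector in the example can be reproduced by ordering concepts by their representative coefficients. The sketch below also includes an optional cut-off, loosely following the threshold-controlled dimensionality reduction mentioned in claim 5; the function name `topic_vector` and the parameter `min_coefficient` are assumptions for illustration.

```python
def topic_vector(coefficients, min_coefficient=0.0):
    """Return (concept, coefficient) pairs, largest coefficient first.

    Concepts whose coefficient is below `min_coefficient` are dropped,
    which is one simple form of threshold-based dimensionality reduction.
    """
    kept = [(c, w) for c, w in coefficients.items() if w >= min_coefficient]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


doc = {"Hong Kong": 1.5, "politics": 10, "election": 0.5}
print(topic_vector(doc))
print(topic_vector(doc, min_coefficient=1.0))
```

With no cut-off, the result lists politics, Hong Kong, and election in that order, matching the topic expression given in the text.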
The extensibility of this embodiment lies in the fact that users can integrate new document formats into the engine system in order to support the processing of special file formats. Its service diversity lies in the fact that application programs can obtain services at different levels from the engine as needed. The system of this embodiment can be applied to semantic analysis and Internet content security analysis projects (for example, anti-spam systems, Chinese automatic summarization systems, and Internet public opinion analysis and monitoring systems), and it has reached a practical level of application.

Claims (5)

1. An open type document isomorphism engine system, characterized in that it comprises a physical structure module, a logical structure module, a lexical and syntactic analysis module, a concept extraction module, and a topic representation module, wherein:
The physical structure module defines the physical arrangement and layout of each part of a document; it accepts the input of various documents and outputs the physical structure of the document to the logical structure module; the physical structure module also provides standardized data for the whole system; the physical structure module comprises a paragraph normalization submodule, a format information normalization submodule, a noise removal submodule, an article feature recognition submodule, a subheading recognition submodule, a subheading error correction submodule, and a logical structure tree generation submodule, wherein:
The input of the paragraph normalization submodule is a non-normalized document containing misused hard returns; the submodule removes the irregular use of hard returns in the document structure and outputs the document, with the hard-return misuse corrected, to the format information normalization submodule;
The format information normalization submodule accepts the input from the paragraph normalization submodule, performs coarse-grained unification, in the logical structure module, of the format information obtained by the physical structure layer, and outputs the document with normalized format to the noise removal submodule;
The noise removal submodule accepts the input from the format information normalization submodule, removes the non-body information in the article, and outputs the document with this noise removed to the article feature recognition submodule;
The article feature recognition submodule accepts the input from the noise removal submodule, determines the logical category of each paragraph, and outputs the document with the paragraph logical categories marked to the subheading recognition submodule;
The subheading recognition submodule accepts the input from the article feature recognition submodule, identifies numbered subheadings with an automaton and unnumbered subheadings with features, and outputs the document with numbered and unnumbered subheadings explicitly marked to the subheading error correction submodule;
The subheading error correction submodule accepts the input from the subheading recognition submodule, corrects the clerical errors of the author of the original text, and outputs the corrected document to the logical structure tree generation submodule;
The logical structure tree generation submodule accepts the input from the subheading error correction submodule, describes the logical structure of the document in the form of a logical structure tree, and outputs the logical structure tree of the document to the logical structure module;
The logical structure module specifies each logical element of the document and its category; it receives the information input by the physical structure module and processes this information to obtain the logical structure of the document; the logical structure module inputs the logical structure of the document to the lexical and syntactic analysis module; that is, it identifies the logical category of each part of the original document by machine learning, identifies subheadings at all levels, determines the level of each subheading and performs error correction, and forms a logical structure tree that can express the hierarchical relationships of the original text;
The lexical and syntactic analysis module provides the word segmentation marks, part-of-speech tags, and syntactic annotations of each sentence in the text; it receives the information input by the logical structure module and processes this information to obtain the analyzed document; the lexical and syntactic analysis module inputs the resulting document to the concept extraction module; that is, based on a keyword dictionary with attribute descriptions, it analyzes and annotates the sentences in the text by combining lexical analysis with syntactic analysis; the lexical analysis produces multiple candidate word segmentation and part-of-speech tagging sequences; the syntactic analysis applies part-of-speech modification relations and sentence patterns on the basis of the lexical analysis to mark the constituents of a sentence, namely the subject, predicate, and object; the sentence patterns are represented with probabilities, and the probability that a syntactic analysis result is correct is calculated; according to this probability, one sequence can in turn be selected from the candidate lexical analysis results;
The concept extraction module automatically summarizes the concepts contained in the document; it receives the information input by the lexical and syntactic analysis module and processes this information to obtain the concepts converted from the words in the document, together with the attributes of those concepts; the concept extraction module inputs the obtained concepts and concept attributes to the topic representation module; the core problems of concept extraction in this module are the structure of the concept base and access to the concept base; the concept base is organized as follows: a concept entry has a high degree of synonymy with its level-0, level-1, level-2, and level-3 expansion strings, a lower degree of synonymy with its level-4 and level-5 expansion strings, and the lowest degree of synonymy with its level-6 expansion strings; a hashing technique is used, with the level-0 and level-1 expansion strings arranged in dictionary order, and every string can be mapped to its corresponding concept entry;
The topic representation module calculates the weight of each concept according to the user's selection and then gives the vector space model representation of the document; it receives the information input by the concept extraction module and processes this information to obtain the topic of the document expressed in units of concepts.
2. The open type document isomorphism engine system according to claim 1, characterized in that the input of the physical structure module comprises electronic document information in various formats; the electronic documents contain heterogeneous information in multiple formats, and the physical structure module isomorphizes this heterogeneous information, that is, represents it with a unified standard; the physical structure of the document output by the physical structure module consists of unformatted characters, the format information corresponding to the characters, and profile information; the physical structure can identify line break characters; in addition, the physical structure also indicates the language of the original document.
3. The open type document isomorphism engine system according to claim 1 or 2, characterized in that the physical structure module extracts, on the basis of markup, the plain text and the format information corresponding to the text, and ignores junk information; the format information corresponding to the text is divided into two kinds, character format information and paragraph format information; character format information is used to describe a single character, and paragraph format information is used to describe a paragraph.
4. The open type document isomorphism engine system according to claim 1, characterized in that the output of the concept extraction module is the concepts converted from the words in the document and several attributes of each concept, namely the frequency with which the concept appears in the text, the position of the concept in the text, and the distribution of the concept; the concept extraction module constructs the concept base on the basis of HowNet, WordNet, and "Tongyici Cilin" (a Chinese synonym thesaurus); on the basis of the concept base and in combination with a conversion algorithm, it obtains the concepts contained in the document and gives the relevant attributes of the concepts.
5. The open type document isomorphism engine system according to claim 1, characterized in that the topic representation module calculates the weight of each concept using concept frequency, concept position, Boolean weight, term frequency-inverse document frequency (TF-IDF) weight, or information-entropy-based weight, and then represents the document as a vector in a vector space; dimensionality reduction is achieved by threshold control.
CNB2007100454517A 2007-08-30 2007-08-30 Open type document isomorphism engines system Expired - Fee Related CN100552673C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100454517A CN100552673C (en) 2007-08-30 2007-08-30 Open type document isomorphism engines system

Publications (2)

Publication Number Publication Date
CN101114281A CN101114281A (en) 2008-01-30
CN100552673C true CN100552673C (en) 2009-10-21

Family

ID=39022630

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100454517A Expired - Fee Related CN100552673C (en) 2007-08-30 2007-08-30 Open type document isomorphism engines system

Country Status (1)

Country Link
CN (1) CN100552673C (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20091021

Termination date: 20130830