CN101539906A

CN101539906A - System and method for automatically analyzing patent text

Info

Publication number: CN101539906A
Application number: CN200810085054A
Authority: CN
Inventors: 张国明
Original assignee: 亿维讯软件（北京）有限公司
Priority date: 2008-03-17
Filing date: 2008-03-17
Publication date: 2009-09-23

Abstract

The invention provides a system for automatically analyzing patent text, comprising an expert knowledge processor, an ontology processor, a language knowledge base, an expert knowledge base and an ontology knowledge base. The expert knowledge processor and the ontology processor work in parallel, and the expert knowledge base and the ontology knowledge base are likewise parallel to each other. The invention also provides a method for automatically analyzing patent text: with the support of the language knowledge base, the expert knowledge processor extracts content from the full-text patent data in a patent database and represents it in structured form, generating the expert knowledge base and updating it automatically; and, with the support of the language knowledge base, the ontology processor extracts ontologies from the full-text patent data in the patent database and identifies the relations between them, generating the ontology knowledge base and updating it automatically.

Description

System and method for automatically analyzing patent text

Technical field

The present invention relates to a system and method for automatically analyzing patent text (in particular the published and granted texts of invention patent applications), which can be used to improve the results of user queries.

Background art

An invention within the meaning of the Patent Law is a new technical solution proposed for a product, a method, or an improvement thereof. Because patent documents have the character of legal documents, they are formally standardized and rigorously worded; their length and complicated style, however, greatly reduce how readily a patent can be understood and how effectively its knowledge can be shared. Processing patents with natural language technology can make patents more efficient to use and promote their effective exploitation.

The format and drafting style of patent text are relatively uniform and fixed, and the terminology is comparatively standardized. Patent documents often contain fixed sentence patterns, and these sentence templates lend themselves to automatic machine processing. The standardization of patent terminology, in turn, makes knowledge discovery in patents possible.

Existing patent text analysis techniques include patent text translation, patent information extraction, patent classification and clustering, automatic patent summarization, patent generation, patent valuation, and improving patent readability. Most of these techniques are still at the experimental stage, and no mature commercial product has yet emerged.

Chinese patent No. CN99813079, entitled "Document semantic analysis and selection with knowledge generation capability", discloses a computer-based software system and method for semantically processing a natural language request entered by a user in order to identify and store the subject-action-object (SAO) structures of the language. These structures are used as keywords/phrases to search local and Web-based databases and download candidate natural language documents; the candidate document text is semantically processed into candidate document SAO structures, and only those relevant documents whose SAO structures match the stored request SAO structures are selected and stored. Further features include analyzing the relations between the SAO structures of the relevant documents, generating from these relations new SAO structures that may yield new knowledge concepts and ideas for display to the user, and generating and displaying natural language summaries from the SAO structures of the relevant documents. The SAO representation it proposes simplifies the representation of documents, helps improve retrieval precision, and allows document summaries to be generated automatically from SAO structures; its weakness is that the matching approach cannot guarantee recall.

Chinese patent application No. 200410078337.0, entitled "Method for solving problems using ontology and user query processing techniques", discloses a system, method and computer program that solve technical problems by representing and processing knowledge/data with an ontology-based approach in a semantic processing module. The basic components of the semantic processing module include a semantic knowledge base, an ontology knowledge base, and/or an expert knowledge base. The method comprises storing a structured or semi-structured description of the user query, performing semantic analysis on an unstructured query to form a formal semantic representation of the query, semantically expanding the formal semantic query, using the expanded query to search the expert knowledge base for relevant solutions, and classifying the solutions found according to their semantic relations. The system can parse and expand user query requests, and the results it returns can largely satisfy the user's needs. It nevertheless has a shortcoming: the expert knowledge base and the ontology knowledge base are its core computing resources, and if they are built manually, construction is extremely complicated and laborious, involves an enormous amount of work, and management and maintenance become a major problem.

Summary of the invention

The object of the present invention is to provide a system and method for automatically analyzing patent text. The system and method use natural language processing techniques to process full-text patent data and supply the data and knowledge required by the expert knowledge base and the ontology knowledge base, thereby reducing as far as possible the cost of acquiring and maintaining these knowledge bases.

The present invention proposes a system for automatically analyzing patent text (in particular invention patents). It mainly comprises a language processing system whose basic components include a language knowledge base 1, an expert knowledge base 2, an ontology knowledge base 3, an expert knowledge processor 10 and an ontology processor 11. From patent data, the invention obtains two specific knowledge bases, namely the expert knowledge base 2 and the ontology knowledge base 3. By processing the full patent texts in the patent database 8, it thereby provides knowledge-level support for solving (but not limited to) inventive problems or users' technical problems.

The language knowledge base 1 provides linguistic analysis of a user query and its formal semantic representation, that is, the way of solving a technical problem expressed as "Verb-Parameter-Object" (VPO). The language knowledge base 1 may include, but is not limited to, analysis rules, a lemmatization dictionary, linguistic logic, and a classification of noun phrases. It supplies the lexical knowledge and language-structure knowledge needed for linguistic analysis of patent text, and can give the formal semantic representation corresponding to a user search request. The format and drafting style of patent text are relatively uniform and fixed, and the terminology is comparatively standardized. Patent text often contains fixed sentence patterns such as "the object of the invention is X" and "the X according to claim N, characterized in that Y", where X and Y may be any word or sentence and N is any combination of digits. These sentence templates lend themselves to automatic machine processing and are an important part of the language knowledge base 1.
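The fixed sentence patterns mentioned above can be illustrated with regular expressions. The following is a minimal sketch, not the patented implementation: the English wording of the two templates, the pattern names and the function match_template are all assumptions made for illustration, and N is simplified to a single claim number.

```python
import re

# Hypothetical English renderings of the two fixed sentence patterns.
PURPOSE_PATTERN = re.compile(
    r"^The (?:object|purpose) of the (?:present )?invention is (?P<X>.+?)\.?$",
    re.IGNORECASE,
)
CLAIM_PATTERN = re.compile(
    r"^(?P<X>.+?) according to claim (?P<N>\d+), characterized in that (?P<Y>.+?)\.?$",
    re.IGNORECASE,
)


def match_template(sentence):
    """Return (template_name, slots) for the first matching pattern, else None."""
    text = sentence.strip()
    for name, pattern in (("purpose", PURPOSE_PATTERN), ("claim", CLAIM_PATTERN)):
        m = pattern.match(text)
        if m:
            return name, m.groupdict()
    return None


claim = match_template(
    "The system according to claim 1, characterized in that "
    "the ontology processor comprises a preprocessor."
)
purpose = match_template(
    "The purpose of the invention is to provide a system for analyzing patent text."
)
```

Each named group corresponds to one slot (X, Y, N) of a template, so a match directly yields the text that later processing stages would analyze further.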

The expert knowledge base 2 is a knowledge base of solutions to technical problems. It is derived from a large number of text documents, mainly patent data, and is generated by the expert knowledge processor 10. A solution in the expert knowledge base 2 can be expressed in SVPO (subject-verb-parameter-object) form, where S is the subject term, that is, the solution that achieves the technical function defined by the VPO.

The ontology knowledge base 3 contains certain knowledge about the surrounding world, represented by many words (concepts and verbs) from different domains and the semantic relations among these words, for example synonymy, the genus relation (also called the hierarchical relation), and the association relation.

The expert knowledge processor 10 and the ontology processor 11 are both components of the language processor system and work in parallel.

The expert knowledge processor 10 is a device that extracts the core content of patents and builds the structured expert knowledge base 2. The expert knowledge base 2 serves as the carrier of solutions to technical problems and provides data resources for knowledge applications at the application layer. The expert knowledge processor 10 comprises: a pre-processor, for lexical recognition and sentence splitting; a lexical processor, for part-of-speech tagging; a syntactic processor, for identifying syntactic structure; a semantic processor, for annotating the semantics expressed by each main syntactic structure, yielding patent text annotated with complex linguistic information; and a natural language synthesizer, for generating a structured knowledge entry, importing it into the expert knowledge base, and creating or updating the SVPO-based semantic index. The function of the expert knowledge processor 10 is to extract content from full-text patent data and represent it in structured form, thereby producing the required expert knowledge base 2.

The expert knowledge processor 10 works as follows. A patent text from the patent database 8 is passed, under the guidance of the language knowledge base 1, through the pre-processor 12, lexical processor 13, syntactic processor 14 and semantic processor 15 of the expert knowledge processor 10, yielding patent text annotated with complex linguistic information. The natural language synthesizer 16 then generates the required solution knowledge, imports it into the expert knowledge base 2, and creates or updates the SVPO-based semantic index.
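The staged flow described above can be sketched as a chain of small functions. This is a toy illustration under strong assumptions: the lexicon, the lemma table, and the "subject, verb, P of O" heuristic are hypothetical stand-ins for the language knowledge base and for real tagging and parsing, and none of the function names come from the patent.

```python
import re

# Hypothetical miniature language knowledge base: a POS lexicon and a lemma table.
LEXICON = {"prevents": "VERB", "of": "PREP"}
LEMMAS = {"prevents": "prevent"}


def preprocess(text):
    """Stage 1 (pre-processor): split into sentences, then into tokens."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    return [s.split() for s in sentences]


def tag(tokens):
    """Stage 2 (lexical processor): tag parts of speech; unknown words default to NOUN."""
    return [(t, LEXICON.get(t.lower(), "NOUN")) for t in tokens]


def parse(tagged):
    """Stage 3 (syntactic processor): split into subject / verb / complement."""
    verb_at = next((i for i, (_, pos) in enumerate(tagged) if pos == "VERB"), None)
    if verb_at is None:
        return None  # no verb found, so no technical function in this sentence
    words = [t for t, _ in tagged]
    return {"subject": words[:verb_at], "verb": words[verb_at],
            "complement": words[verb_at + 1:]}


def annotate(tree):
    """Stage 4 (semantic processor): read 'P of O' from the complement."""
    comp = tree["complement"]
    if "of" in comp:
        k = comp.index("of")
        parameter, obj = " ".join(comp[:k]), " ".join(comp[k + 1:])
    else:
        parameter, obj = None, " ".join(comp)
    verb = tree["verb"].lower()
    return {"S": " ".join(tree["subject"]), "V": LEMMAS.get(verb, verb),
            "P": parameter, "O": obj}


def analyze(text):
    """Stage 5 (synthesizer): one SVPO entry per sentence that contains a verb."""
    trees = (parse(tag(tokens)) for tokens in preprocess(text))
    return [annotate(tree) for tree in trees if tree is not None]


entries = analyze("Calcium sulfate prevents absorption of fat.")
```

Keeping each stage as a separate function mirrors the separation of processors 12 to 16; a realistic system would replace each toy stage with a proper tokenizer, tagger and parser.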

The ontology processor 11 is a device that automatically identifies ontologies and the relations between them and dynamically updates the ontology knowledge base 3; the ontology knowledge base 3 supports semantic expansion and knowledge organization at the application layer. The ontology processor 11 comprises: a pre-processor, for lexical recognition and sentence splitting; an ontology recognizer, for extracting ontologies; a relation recognizer, for identifying ontology relations; and an ontology updater, for automatically updating the ontology knowledge base. The function of the ontology processor 11 is to extract ontologies from full-text patent data, identify ontology relations, and automatically update the ontology knowledge base 3.

The ontology processor 11 works as follows. A patent text from the patent database 8 is passed, under the guidance of the language knowledge base 1, through the pre-processor 17, ontology recognizer 18 and relation recognizer 19 of the ontology processor 11, yielding the ontologies (concepts and verbs) contained in the text and the relations between them. The ontology updater 20 then imports the ontologies into the ontology knowledge base 3. The ontology updater 20 also detects whether an extracted ontology is already present in the ontology knowledge base and locates it there.

The patent database 8 may be a language-independent database storing a certain number of patent texts. It may be a full-text patent database or a patent claims database, and the patents may be in English or in Chinese.

The present invention also proposes a method for automatically analyzing patent text (in particular invention patents), comprising:

by means of the language knowledge base, using the expert knowledge processor to extract content from the full-text patent data in the patent database and represent it in structured form, generating the expert knowledge base and automatically updating it;

by means of the language knowledge base, using the ontology processor to extract ontologies from the full-text patent data in the patent database and identify ontology relations, generating the ontology knowledge base and automatically updating it.

The step of obtaining the expert knowledge base comprises: the pre-processor performs lexical recognition and sentence splitting; the lexical processor tags parts of speech; the syntactic processor identifies syntactic structure; the semantic processor annotates the semantics expressed by each main syntactic structure, yielding patent text annotated with complex linguistic information; and the natural language synthesizer generates a structured knowledge entry, imports it into the expert knowledge base, and creates or updates the semantic index. The semantic index is based on the subject-verb-parameter-object (SVPO) form, and solutions in the expert knowledge base are expressed in SVPO form.

The step of obtaining the ontology knowledge base comprises: the pre-processor performs lexical recognition and sentence splitting; the ontology recognizer extracts ontologies; the relation recognizer identifies ontology relations; and the ontology updater automatically updates the ontology knowledge base. The ontology updater can also detect whether an extracted ontology is present in the ontology knowledge base and locate it there.

The language knowledge base includes at least analysis rules, a lemmatization dictionary, linguistic logic and a classification of noun phrases; it supplies the lexical knowledge and language-structure knowledge needed for linguistic analysis of patent text, and can give the formal semantic representation corresponding to a user search request.

The patent database is a language-independent database storing a certain number of patent texts. It is a full-text patent database or a patent claims database.

The technical solution of the present invention makes it possible to:

1) extract content from patent text automatically and assist in generating the expert knowledge base (solutions);

2) automatically identify the ontologies and technical terms that occur in patents, determine the types of relation between ontologies and terms, and dynamically update the ontology knowledge base;

3) support important applications such as intelligent solution search, on the basis of the expert knowledge base built in 1) and the ontology knowledge base obtained in 2).

Brief description of the drawings

Fig. 1 is a diagram of the working relationship between the modules of the language processor system according to one embodiment of the present invention;

Fig. 2 shows an example fragment of the expert knowledge base according to one embodiment of the present invention;

Fig. 3 shows an example fragment of the ontology knowledge base according to one embodiment of the present invention;

Fig. 4 is the main flow chart of knowledge retrieval, a typical application of the results of implementing the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

One embodiment of the present invention provides a system and method for acquiring knowledge from a patent database. In one embodiment, the language processor system provides the expert knowledge base 2 and the ontology knowledge base 3 required by a search technique for finding accurate and complete solutions.

Fig. 1 shows, according to one embodiment of the present invention, how the expert knowledge base 2 and the ontology knowledge base 3 required for an accurate and complete search technique are provided. As shown in Fig. 1, the expert knowledge processor 10 receives a patent text from the patent database 8. Supported by the language knowledge base 1, the pre-processor 12 performs lexical recognition and sentence splitting, the lexical processor 13 then tags parts of speech, and the syntactic processor 14 identifies syntactic structure. On this basis, the semantic processor 15 annotates the semantics expressed by each main syntactic structure, yielding patent text annotated with complex linguistic information. The natural language synthesizer 16 then generates a structured knowledge entry, namely the required solution knowledge, imports it into the expert knowledge base 2, and creates or updates the SVPO-based semantic index.

In one embodiment, the patent database 8 stores a certain number of patent texts. Every patent text has a specific structure; a United States patent, for example, includes required content and headings such as "Title", "Abstract", "IssueDate" and "Claims". In addition, the patent database 8 of the present invention requires that every patent text be highly representative and that the texts differ from one another in technical field and/or solution.

In one embodiment, the language knowledge base may include analysis rules, a lemmatization dictionary, linguistic logic and a classification of noun phrases, and supplies the lexical knowledge and language-structure knowledge needed for linguistic analysis of patent text. The language-structure knowledge, whose object of analysis is patent text, describes the linguistic logic and modes of expression peculiar to patents, for example "the object of the invention is X" and "the X according to claim N, characterized in that Y", where X and Y may be any word or sentence and N is any combination of digits. The language knowledge base supports the processing of patent text.

Fig. 2 shows a fragment (example) of the expert knowledge base 2, illustrating its structure and content. A knowledge entry is generated by the processing performed by the expert knowledge processor.

Each knowledge entry in the expert knowledge base 2 represents one solution. Studies show that most inventions can be expressed in a form called a "technical function", that is, the VPO form, which captures the formal characteristics of a problem. As the semantic expression of the knowledge entry, each solution is represented by one natural language sentence comprising four fields that correspond to the basic "SVPO" function. S denotes a solution to the problem, and the problem is represented by VPO, where V denotes the verb, P the parameter and O the object. For the knowledge entry "Calcium sulfate prevents absorption of fat" shown in Fig. 2, the SVPO representation is:

SVPO: S(Calcium sulfate) V(prevent) P(absorption) O(fat).
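A knowledge entry of this kind can be modeled as a small record with four fields. The sketch below is illustrative only: the class name SVPO, its field names, and the idea of keying the semantic index on the (V, P, O) triple are assumptions, since the source says only that the index is "based on SVPO".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SVPO:
    subject: str              # S: the solution (subject term)
    verb: str                 # V: the action
    parameter: Optional[str]  # P: may be absent when only a VO form is available
    obj: str                  # O: the object acted upon

    def vpo_key(self):
        """The technical function (problem) that this solution addresses."""
        return (self.verb, self.parameter, self.obj)

    def __str__(self):
        p = f"P({self.parameter})" if self.parameter else ""
        return f"S({self.subject})V({self.verb}){p}O({self.obj})"


# Assumed index layout: VPO triple -> list of solutions for that problem.
index = {}
entry = SVPO("Calcium sulfate", "prevent", "absorption", "fat")
index.setdefault(entry.vpo_key(), []).append(entry)
```

Indexing on the VPO triple means a query such as V(prevent) P(absorption) O(fat) can be answered with a single dictionary lookup that returns every stored subject term S.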

Fig. 3 shows a fragment of the ontology knowledge base 3, illustrating its structure and content. The ontology knowledge base may be a hierarchical database of words from different domains, where a "word" denotes a concept. There are three kinds of relation between words in the ontology knowledge base: synonymy, the genus relation and the association relation.

Synonymy is the semantic relation between two words or two lexical structures that express the same meaning in a given context. It covers direct synonyms, such as "clear", "rectify", "purify" and "refine", and also syntactic synonyms, that is, different syntactic structures expressing the same (or a similar) meaning, such as "dehydrate" and "decrease relative humidity".

The genus relation, also called the parent/child (superclass/subclass) relation, is the semantic relation between two words or lexical structures that denote a parent concept and a child concept within a given group of concepts, for example "water -> channel", "water -> bay" and "physical thing -> water".

The association relation is the semantic relation between two words or lexical structures that are associated with each other. Two words or lexical structures in an association relation share the same parent, that is, they are child concepts of the same parent concept, as in "channel <-> bay".

In one embodiment, within the ontology processor 11, the ontologies and relations extracted from a patent text are submitted to the ontology updater 20, which compares the new ontologies and relations with the existing ones and thereby completes the update of the ontology knowledge base. Specifically, if the two ontologies "territorial waters" and "waterfall" are obtained from a patent text, the ontology updater 20 determines whether each is already present in the ontology base and locates it there. Once located, the hypernyms and synonyms of each are known: for example, the hypernym of both "territorial waters" and "waterfall" is "water", and a synonym of "waterfall" is "falls".
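The detect-and-locate behavior of the ontology updater can be sketched with two dictionaries. This is a simplified illustration, assuming one hypernym per term; the class Ontology and its methods locate and update are hypothetical names, not part of the patent.

```python
class Ontology:
    """Toy ontology store: one hypernym per term plus a synonym set."""

    def __init__(self):
        self.hypernym = {}   # term -> parent concept (None if unknown)
        self.synonyms = {}   # term -> set of synonymous terms

    def locate(self, term):
        """Detect whether a term is stored and, if so, return its position."""
        if term not in self.hypernym:
            return None
        return {"hypernym": self.hypernym[term],
                "synonyms": sorted(self.synonyms.get(term, set()))}

    def update(self, term, hypernym=None, synonyms=()):
        """Add a new term, or merge new relations into an existing one.

        Returns True when the term was not previously in the ontology.
        """
        is_new = term not in self.hypernym
        if is_new:
            self.hypernym[term] = hypernym
        elif hypernym is not None and self.hypernym[term] is None:
            self.hypernym[term] = hypernym  # fill in a previously unknown parent
        self.synonyms.setdefault(term, set()).update(synonyms)
        return is_new


onto = Ontology()
onto.update("water", hypernym="physical thing")
onto.update("territorial waters", hypernym="water")
onto.update("waterfall", hypernym="water", synonyms=["falls"])
```

Calling locate before update is what lets the updater distinguish a genuinely new ontology from one that only needs additional relations merged in.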

Fig. 4 is a flow chart showing how the results of one embodiment of the present invention, namely the expert knowledge base 2 and the ontology knowledge base 3, are applied to knowledge retrieval.

Fig. 4 shows the main flow of knowledge retrieval, a typical application of the results of implementing the invention. It is a structural and functional block diagram of a language processing module for solving inventive problems and users' technical problems, and illustrates a typical use of the expert knowledge base 2 and the ontology knowledge base 3.

In one embodiment, the language knowledge base may include analysis rules, a lemmatization dictionary, linguistic logic and a classification of noun phrases; it supplies the lexical knowledge and language-structure knowledge needed for linguistic analysis of patent text and can give the formal semantic representation corresponding to a user search request. With the help of the language knowledge base 1, the formal semantic representation of the user search request, verb-parameter-object (VPO), is obtained; with the help of the ontology knowledge base 3, the user query is parsed and semantically expanded, and the retrieved solutions are classified; with the help of the expert knowledge base 2, the solutions for a specific query are determined. In one embodiment, the output of the language processing module in response to the user request shown in Fig. 4 is these solutions arranged according to their semantics.

The processing of a user query shown in Fig. 4 is as follows:

Example query: How to measure thickness of ice

Structured form: V (measure) P (thickness) O (ice)

After analysis, a user query can take the VPO structure, as in the example above. This structure can be submitted to the query expansion module, which uses the ontology hierarchy to perform semantic expansion so as to retrieve as many problem-related solutions as possible.

The VPO query can be expanded in any of several ways. Accordingly, the following expansions are performed:

synonym expansion (the verb, parameter and object are expanded);

genus expansion (expansion up and down the hierarchy; only the object is expanded); and/or

association expansion (only the object is expanded).

In synonym expansion, each word of the user query is replaced by its synonyms, as in the example above:

Structured form: V (measure) P (thickness) O (ice)

Output (synonym expansion):

V(measure, detect, gage, gauge, log, measure out, meter, quantify, register)

P (no synonyms)

O(water ice)

Genus expansion replaces a term in the query with terms related to it in the hierarchy. There are two kinds of genus expansion. One is bottom-up (from the specific to the general), for example:

Structured form: V (measure) P (thickness) O (ice)

Output (genus expansion, bottom-up; parent-relation expansion of the object only):

O(dimension)

The other is top-down (from the general to the specific), for example:

Structured form: V (measure) P (thickness) O (ice)

Output (genus expansion, top-down; child-relation expansion of the object only):

O(half thickness, half-value thickness, half-thickness)

Genus-based retrieval can find solutions that are more specific, more general, or more broadly related.

Association expansion replaces a term with terms that stand in an association relation to it. For example:

Example query: How to measure thickness of ice

Structured form: V (measure) P (thickness) O (ice)

Output (association expansion of the object O only):

O(creaminess, soupiness, critical thickness, ...)
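The three kinds of expansion can be sketched over a small in-memory ontology. In this illustration the synonym lists are taken from the example output above, while the PARENT hierarchy, the helper names and the result layout are hypothetical. Associated terms are computed as siblings under the same parent, following the definition of the association relation given earlier.

```python
# Synonyms from the worked example; the hierarchy below is a hypothetical stand-in.
SYNONYMS = {
    "measure": ["detect", "gage", "gauge", "log", "measure out",
                "meter", "quantify", "register"],
    "ice": ["water ice"],
}
PARENT = {"ice": "solid", "frost": "ice", "glass": "solid"}


def children(term):
    """Top-down genus expansion: terms whose parent is `term`."""
    return sorted(t for t, p in PARENT.items() if p == term)


def siblings(term):
    """Association expansion: other terms sharing the same parent."""
    parent = PARENT.get(term)
    if parent is None:
        return []
    return sorted(t for t, p in PARENT.items() if p == parent and t != term)


def expand(query):
    """Expand a VPO query; genus and association expansion touch only O."""
    v, p, o = query["V"], query["P"], query["O"]
    return {
        "synonym": {"V": [v] + SYNONYMS.get(v, []),
                    "P": [p] + SYNONYMS.get(p, []),
                    "O": [o] + SYNONYMS.get(o, [])},
        "general": [PARENT[o]] if o in PARENT else [],  # bottom-up
        "special": children(o),                         # top-down
        "associated": siblings(o),
    }


result = expand({"V": "measure", "P": "thickness", "O": "ice"})
```

Returning the expansions grouped by kind, rather than as one flat list, preserves the information needed later to label each retrieved solution as exact, specific, general or analogous.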

The goal of solution retrieval is to search the expert knowledge base 2 for solutions using the expanded query and to list the solutions found. The search engine compares the VPO fields in the expert knowledge base 2 with the expanded query; matches between these fields retrieve the relevant solutions. Given the nature of these results, they need to be classified according to their semantic relation to the query, as follows:

(1) Exact solutions: the VO/VPO fields of these solutions match the original VO/VPO of the query exactly.

Example: V (heat) O (water)

Solution: S (coil) V (increase) P (temperature) O (water)

(2) Specific solutions: at least one of the VO/VPO fields of these solutions is a special case of the corresponding field in the query.

Example: V (measure) P (thickness) O (ice)

Solution: S (ultrasonic probe) V (measure) P (thickness) O (frost)

(3) General solutions:

Example: V (neutralize) O (hydrochloric acid)

Solution: S (alkali) V (neutralize) O (acid)

(4) Analogous solutions:

Example: V (neutralize) O (hydrochloric acid)

Solution: S (alkali) V (neutralize) O (nitric acid)

In the examples above, S stands for the "subject term", that is, the idea behind the solution to the problem.
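The four result classes can be sketched as a comparison of the object fields through the hierarchy. This is a simplified illustration under stated assumptions: the PARENT table and the function names are hypothetical, only direct parent/child links are followed, and verbs and parameters are compared literally, so a semantic equivalence such as V(heat) versus V(increase) P(temperature) in example (1) is outside what this sketch handles.

```python
# Hypothetical fragment of the genus hierarchy (child -> parent).
PARENT = {"hydrochloric acid": "acid", "nitric acid": "acid", "frost": "ice"}


def relate(query_term, solution_term):
    """Classify how a solution's object relates to the query's object."""
    if query_term == solution_term:
        return "exact"
    if PARENT.get(solution_term) == query_term:
        return "special"    # solution object is a child of the query object
    if PARENT.get(query_term) == solution_term:
        return "general"    # solution object is the parent of the query object
    query_parent = PARENT.get(query_term)
    if query_parent is not None and query_parent == PARENT.get(solution_term):
        return "analogous"  # both objects share the same parent
    return None


def classify(query, solution):
    """Return the result class of a solution for a VO/VPO query, or None."""
    if query["V"] != solution["V"]:
        return None
    if query.get("P") is not None and query["P"] != solution.get("P"):
        return None
    return relate(query["O"], solution["O"])


special = classify({"V": "measure", "P": "thickness", "O": "ice"},
                   {"S": "ultrasonic probe", "V": "measure",
                    "P": "thickness", "O": "frost"})
general = classify({"V": "neutralize", "O": "hydrochloric acid"},
                   {"S": "alkali", "V": "neutralize", "O": "acid"})
analogous = classify({"V": "neutralize", "O": "hydrochloric acid"},
                     {"S": "alkali", "V": "neutralize", "O": "nitric acid"})
```

The three calls reproduce examples (2), (3) and (4) above; sorting retrieved solutions by the returned label gives the semantically ordered output described for Fig. 4.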

The embodiments described are particular cases of implementing the invention; the scope of protection of the invention is not limited to them.

The processing, computation, determination and the like in the present invention are all operations on and transformations of data.

Embodiments of the invention include apparatus for performing these operations.

Although some embodiments of the present invention have been described above, it should be understood that these embodiments are only specific examples of implementing the invention and should not limit its scope of protection. The scope of protection should not be limited by the description, but by the claims and their equivalents. Changes that those skilled in the art make to the embodiments in light of the above description and explanation all fall within the scope of protection of the present invention.

Claims

1. A system for automatically analyzing patent text, characterized by comprising:

an expert knowledge processor, for extracting content from the full-text patent data in a patent database and representing it in structured form, generating an expert knowledge base and automatically updating the expert knowledge base;

an ontology processor, for extracting ontologies from the full-text patent data in the patent database and identifying ontology relations, generating an ontology knowledge base and automatically updating the ontology knowledge base;

a language knowledge base, for providing linguistic analysis of a user query and its formal semantic representation, and for assisting the work of the expert knowledge processor and the ontology processor;

the expert knowledge base, which is a knowledge base of solutions to technical problems, is derived from a large number of text documents, mainly patent data, and is generated by the processing of the expert knowledge processor;

the ontology knowledge base, which contains certain knowledge about the surrounding world, represented by many words from different domains and the semantic relations among these words, and is generated by the processing of the ontology processor;

wherein the expert knowledge processor and the ontology processor work in parallel, and the expert knowledge base and the ontology knowledge base are likewise parallel to each other.

2. The system according to claim 1, characterized in that the expert knowledge processor comprises:

a pre-processor, for lexical recognition and sentence splitting;

a lexical processor, for part-of-speech tagging;

a syntactic processor, for identifying syntactic structure;

a semantic processor, for annotating the semantics expressed by each main syntactic structure, thereby yielding patent text annotated with complex linguistic information;

a natural language synthesizer, for generating a structured knowledge entry, importing it into the expert knowledge base, and creating or updating the semantic index.

3. The system according to claim 2, characterized in that the semantic index is based on the subject-verb-parameter-object (SVPO) form.

4. The system according to claim 1, characterized in that the ontology processor comprises:

an ontology recognizer, for extracting ontologies;

a relation recognizer, for identifying ontology relations;

an ontology updater, for importing ontologies into the ontology knowledge base and automatically updating the ontology knowledge base.

5. The system according to claim 1, characterized in that the ontology updater can also detect whether an extracted ontology is present in the ontology knowledge base and locate it there.

6. The system according to claim 1, characterized in that the semantic relations among the words include at least synonymy, the genus relation and the association relation.

7. The system according to claim 1, characterized in that solutions in the expert knowledge base are expressed in subject-verb-parameter-object (SVPO) form.

8. The system according to claim 1, characterized in that the language knowledge base includes at least analysis rules, a lemmatization dictionary, linguistic logic and a classification of noun phrases, supplies the lexical knowledge and language-structure knowledge needed for linguistic analysis of patent text, and can give the formal semantic representation corresponding to a user search request.

9. The system according to claim 1, characterized in that the patent database is a language-independent database storing a certain number of patent texts.

10. The system according to claim 1, characterized in that the patent database is a full-text patent database or a patent claims database.

11. A method for automatically analyzing patent text, characterized by comprising the following steps:

12. The method according to claim 11, characterized in that the step of obtaining the expert knowledge base comprises:

the pre-processor performs lexical recognition and sentence splitting;

the lexical processor tags parts of speech;

the syntactic processor identifies syntactic structure;

the semantic processor annotates the semantics expressed by each main syntactic structure, thereby yielding patent text annotated with complex linguistic information;

the natural language synthesizer generates a structured knowledge entry, imports it into the expert knowledge base, and creates or updates the semantic index.

13. The method according to claim 12, characterized in that the semantic index is based on the subject-verb-parameter-object (SVPO) form.

14. The method according to claim 11, characterized in that the step of obtaining the ontology knowledge base comprises:

the pre-processor performs lexical recognition and sentence splitting;

the ontology recognizer extracts ontologies;

the relation recognizer identifies ontology relations;

the ontology updater automatically updates the ontology knowledge base.

15. The method according to claim 11, characterized in that the ontology updater can also detect whether an extracted ontology is present in the ontology knowledge base and locate it there.

16. The method according to claim 11, characterized in that solutions in the expert knowledge base are expressed in subject-verb-parameter-object (SVPO) form.

17. The method according to claim 11, characterized in that the language knowledge base includes at least analysis rules, a lemmatization dictionary, linguistic logic and a classification of noun phrases, supplies the lexical knowledge and language-structure knowledge needed for linguistic analysis of patent text, and can give the formal semantic representation corresponding to a user search request.

18. The method according to claim 11, characterized in that the patent database is a language-independent database storing a certain number of patent texts.

19. The method according to claim 11, characterized in that the patent database is a full-text patent database or a patent claims database.