CN102682109B

CN102682109B - Patent information analysis method and device

Info

Publication number: CN102682109B
Application number: CN201210142700.5A
Authority: CN
Inventors: 谢国利
Original assignee: BEIJING BIZSOLUTION INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING BIZSOLUTION INFORMATION TECHNOLOGY Co Ltd
Priority date: 2012-05-09
Filing date: 2012-05-09
Publication date: 2014-07-16
Anticipated expiration: 2032-05-09
Also published as: CN102682109A

Abstract

The invention provides a patent information analysis method and device, wherein the method comprises the steps of: selecting analyzed patent information from a database to serve as basic data, and acquiring HTML (Hypertext Markup Language)-format webpages of the patent information from a website; specific to each data item in the basic data, respectively acquiring character strings capable of uniquely locating all the data items from the acquired HTML-format webpages, and respectively formatting the character strings into regular expressions for analyzing all data items; and analyzing the patent information from the unanalyzed HTML-format webpage of the website by using the regular expressions for analyzing all the data items, and saving the patent information obtained through analysis into the database. The method and the device disclosed by the invention can be used for adaptively establishing an analysis rule of the patent information so as to ensure that the analysis rule of the patent information can be updated automatically even though the HTML formats of the webpages change, so that the patent information is analyzed correctly, manpower is saved and the efficiency is increased.

Description

A patent information parsing method and device

[Technical Field]

The present invention relates to the field of computer information technology, and in particular to a patent information parsing method and device.

[Background Art]

With the rapid development of Internet technology, the network has become the main means by which people obtain information, and patent information is no exception. Nearly all patent information worldwide is published over the Internet, which makes it easier for people to obtain patent information and thereby promotes technological innovation and development. More and more enterprise users now search for patent information on the Internet, parse it into accurate data, and save it in a local database, building their own patent information repositories for in-depth use.

When patent data published in Hypertext Markup Language (HTML) format is parsed, a user normally analyzes the HTML-format patent information and writes regular expressions that accurately locate each data item (for example, each bibliographic item of the patent), forming rules that a computer program can recognize. The computer program then uses these rules to parse the exact content of each data item out of the HTML-format patent information.

Although this way of parsing patent information is fairly efficient, website owners on the Internet often adjust the HTML format to change how their pages are displayed. Such an adjustment inevitably invalidates the regular-expression rules the user has set up, so that the data parsed in the above way is wrong or cannot be parsed at all. The user then has to re-analyze the HTML format, rewrite regular-expression rules that accurately locate each data item, and update them in the computer program. This obviously imposes a heavy workload on the user, wastes manpower, and is inefficient.

[Summary of the Invention]

The present invention provides a patent information parsing method and device, so that patent information can still be parsed automatically after the HTML format is adjusted, saving manpower and improving efficiency.

The specific technical solution is as follows:

A patent information parsing method, comprising:

S1: selecting, from a database, patent information that has already been parsed as basic data, and obtaining the HTML-format web page of said patent information from a website;

S2: for each data item in said basic data, obtaining from the obtained HTML-format web page a character string that can uniquely locate that data item, and formatting it into a regular expression for parsing that data item;

S3: using said regular expressions for parsing each data item, parsing patent information from HTML-format web pages of said website that have not yet been parsed, and storing the parsed patent information in said database.

According to a preferred embodiment of the present invention, whether the HTML format of said website has changed is detected periodically, and if a change in the HTML format is detected, execution of said step S1 is triggered; or

execution of said step S1 is triggered manually; or

execution of said step S1 is triggered periodically regardless of whether the HTML format of said website has changed.

According to a preferred embodiment of the present invention, said step S2 specifically comprises:

S21: obtaining, as the current data item, a data item in said basic data on which said step S2 has not yet been performed;

S22: determining the position of the current data item in the HTML-format web page obtained in step S1;

S23: from that position, intercepting forward and backward, respectively, a character string of a preset intercept length, filtering out the non-HTML-tag content in the intercepted character strings, and then formatting the front and rear character strings into a regular expression;

S24: checking whether the obtained regular expression can uniquely locate the current data item; if so, recording the regular expression corresponding to the current data item and going to said step S21; otherwise, increasing said intercept length and going to said step S23 again.

According to a preferred embodiment of the present invention, in said step S23, formatting the front and rear character strings into a regular expression specifically comprises:

using each character of the filtered front and rear character strings as a metacharacter in the regular expression, retaining in the regular expression the ordinary characters of the filtered-out non-HTML-tag content that are immediately adjacent to the current data item, and replacing the rest of the filtered-out content with a regular-expression wildcard.

According to a preferred embodiment of the present invention, in said step S24, checking whether the obtained regular expression can uniquely locate the current data item is specifically:

using the obtained regular expression to extract information from the HTML-format web page obtained in said step S1 or from other HTML-format web pages, and judging whether the content of the current data item is obtained uniquely; if so, the regular expression can uniquely locate the current data item.

According to a preferred embodiment of the present invention, between said step S2 and step S3, the method further comprises:

S41: selecting from said database another piece of patent information that has already been parsed, and obtaining the HTML-format web page of that other patent information from said website;

S42: using the regular expressions of each data item obtained in step S2 to extract the patent information of each data item from the HTML-format web page obtained in step S41, and judging whether the extracted patent information is consistent with the patent information stored in said database; if consistent, determining that verification has passed and continuing to execute said step S3; otherwise, indicating that the regular expression of the inconsistent data item is to be corrected.

A patent information parsing device, comprising:

a basic data acquiring unit, configured to select, from a database, patent information that has already been parsed as basic data;

a web page acquiring unit, configured to obtain from a website the HTML-format web page of the patent information selected by said basic data acquiring unit;

a rule formatting unit, configured to, for each data item in said basic data, obtain from the HTML-format web page obtained by said web page acquiring unit a character string that can uniquely locate that data item, and format it into a regular expression for parsing that data item;

an information parsing unit, configured to use said regular expressions for parsing each data item to parse patent information from HTML-format web pages of said website that have not yet been parsed, and to store the parsed patent information in said database.

According to a preferred embodiment of the present invention, the device further comprises:

a trigger control unit, configured to periodically detect whether the HTML format of said website has changed and, if a change in the HTML format is detected, trigger said basic data acquiring unit to operate; or to trigger said basic data acquiring unit to operate under manual control; or to periodically trigger said basic data acquiring unit to operate regardless of whether the HTML format of said website has changed.

According to a preferred embodiment of the present invention, said rule formatting unit specifically comprises:

a data item acquiring subunit, configured to obtain, as the current data item, a data item in said basic data that has not yet been obtained;

a positioning subunit, configured to determine the position of the current data item in the HTML-format web page obtained by said web page acquiring unit;

a formatting subunit, configured to intercept forward and backward, respectively, from the position determined by said positioning subunit a character string of a preset intercept length, filter out the non-HTML-tag content in the intercepted character strings, and then format the front and rear character strings into a regular expression;

a checking subunit, configured to check whether the regular expression obtained by said formatting subunit can uniquely locate the current data item; if so, to record the regular expression corresponding to the current data item and trigger said data item acquiring subunit; otherwise, to increase said intercept length and trigger said formatting subunit.

According to a preferred embodiment of the present invention, when formatting the front and rear character strings into a regular expression, said formatting subunit specifically uses each character of the filtered front and rear character strings as a metacharacter in the regular expression, retains in the regular expression the ordinary characters of the filtered-out non-HTML-tag content that are immediately adjacent to the current data item, and replaces the rest of the filtered-out content with a regular-expression wildcard.

According to a preferred embodiment of the present invention, said checking subunit uses the obtained regular expression to extract information from the HTML-format web page obtained by said web page acquiring unit or from other HTML-format web pages, and judges whether the content of the current data item is obtained uniquely; if so, the regular expression can uniquely locate the current data item.

According to a preferred embodiment of the present invention, the device further comprises: a rule verification unit, configured to, after said rule formatting unit has obtained the regular expressions for parsing each data item, select from said database another piece of patent information that has already been parsed, obtain the HTML-format web page of that other patent information from said website, use the regular expressions obtained by said rule formatting unit to extract the patent information of each data item from the HTML-format web page of the other patent information, and judge whether the extracted patent information is consistent with the patent information stored in said database; if consistent, to determine that verification has passed and trigger said information parsing unit; otherwise, to indicate that the regular expression of the inconsistent data item is to be corrected.

As can be seen from the above technical solutions, the present invention uses patent information in the database that has already been parsed, together with the HTML-format web page on which it appears, to determine by reverse parsing the regular expressions for parsing each data item in the current HTML-format web page, and then parses patent data from unparsed HTML-format web pages according to the determined regular expressions. In other words, the present invention can establish the parsing rules for patent information adaptively: even if the HTML format of the web pages changes, the parsing rules can be updated automatically, so that patent information is parsed correctly, manpower is saved, and efficiency is improved.

[Brief Description of the Drawings]

Fig. 1 is a flowchart of the patent information parsing method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart, provided by Embodiment 1 of the present invention, of extracting a regular expression for each data item in the basic data;

Fig. 3 is a structural diagram of the patent information parsing device provided by Embodiment 2 of the present invention.

[Detailed Description]

In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.

Because patent data is usually relatively fixed, the patent data itself does not actually change when the HTML format of a website changes. Based on this characteristic of patents, the present invention proposes using patent data for reverse parsing: the regular expressions for parsing patents are extracted automatically to form new parsing rules, and patent information is parsed according to these rules. The method is described below through Embodiment 1.

Embodiment 1

Fig. 1 is a flowchart of the patent information parsing method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method may include the following steps:

Step 101: select, from a database, patent information that has already been parsed as basic data, and obtain the HTML-format web page of this patent information from a website.

The method provided by this embodiment of the present invention in effect uses already parsed patent information to obtain the regular expressions by reverse parsing when the HTML format of the page changes. In other words, this flow may be started when the HTML format changes; once the regular expressions have been obtained, they can continue to be used for parsing patent information for as long as the HTML format does not change. Therefore, whether the HTML format of the website has changed may be detected periodically, and execution of this flow is triggered once a change is detected; execution of this flow may also be triggered manually; or execution of this flow may be triggered periodically regardless of whether the HTML format has changed.
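The description does not say how a change in the HTML format is detected. One possible approach, sketched below purely as an assumption, is to re-fetch the page of an already parsed patent and compare a fingerprint of its tag sequence with the fingerprint recorded earlier; since the patent data itself is stable, a different fingerprint suggests a layout change. The function name `tag_fingerprint` and the sample pages are hypothetical.

```python
import hashlib
import re

def tag_fingerprint(html):
    """Hash the sequence of HTML tags, ignoring the text between them."""
    tags = re.findall(r"<[^>]*>", html)
    return hashlib.sha256("".join(tags).encode("utf-8")).hexdigest()

# Hypothetical snapshots of a page: same layout with other text, and a new layout.
old_page = '<div class="t2">Application number:201020685875</div>'
same_layout = '<div class="t2">Application number:201120000001</div>'
new_layout = '<span id="no">Application number:201020685875</span>'

layout_unchanged = tag_fingerprint(old_page) == tag_fingerprint(same_layout)
layout_changed = tag_fingerprint(old_page) != tag_fingerprint(new_layout)
print(layout_unchanged, layout_changed)
```

A scheduler would call such a check at a fixed interval and start step 101 when the fingerprint differs; the other two trigger modes (manual, unconditional periodic) need no detection at all.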

In this step, the patent information of one patent may be selected from the already parsed patent information as the basic data. This basic data does not change however the web page format changes. The current HTML-format web page of this patent information is then obtained from the website so that suitable parsing rules can be derived from it, and the following steps are carried out.

To make the extracted parsing rules more accurate, and given that the information of a granted patent essentially does not change, the basic data may be chosen from granted patents in this step.

Step 102: for each data item in the basic data, obtain from the obtained HTML-format web page a character string that can uniquely locate that data item, and format it into a regular expression for parsing that data item.

The data items referred to in this embodiment of the present invention are one or more of the bibliographic items that the user cares about, for example the title of the invention, inventor information, applicant information, filing date, publication date, priority information, application number, patent number, and so on.

This step extracts a regular expression for each data item in the basic data by carrying out the flow shown in Fig. 2, which specifically includes the following steps:

Step 201: obtain, as the current data item, a data item in the basic data that has not yet been through this flow.

In this step, the data items in the basic data are obtained one by one and put through the following flow. Each time, a data item that has not yet been through this flow is taken from the basic data and the following flow is started, until all data items have been through the flow, at which point the regular expressions of all data items in the basic data have been extracted.

Step 202: determine the position of the current data item in the obtained HTML-format web page.

The position of the content of the current data item is found in the HTML-format web page; from it, the character strings before and after the current data item in the HTML-format web page can be obtained.
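As a minimal sketch of step 202, the known value of the data item can be searched for in the page text and the page split around it. The helper name `split_around` is hypothetical, and the sketch assumes that the stored value appears verbatim in the page and that its first occurrence is the right one.

```python
def split_around(html, value):
    """Return (text_before, text_after) for the first occurrence of value."""
    start = html.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in page")
    end = start + len(value)
    return html[:start], html[end:]

page = '<div class="t2">Application number:201020685875</div>'
before, after = split_around(page, "201020685875")
print(before)
print(after)
```

If a value is stored in a different form than it is displayed (dates are a typical case), it would need to be normalised before searching.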

Step 203: from this position, intercept forward and backward, respectively, a character string of a preset intercept length, filter out the non-HTML-tag content in the intercepted character strings, and then format the front and rear character strings into a regular expression.

After the position of the current data item is determined in step 202, character strings of a certain length are intercepted forward and backward, respectively. The intercept length may at first be a preset initial value, for example 50 characters. These character strings may contain HTML tags, patent information, text-formatting characters, line breaks, and so on. Only the tag information does not change from patent to patent and therefore has locating significance, so the non-HTML-tag content, which has no locating significance, needs to be filtered out.
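The worked example later in this embodiment notes that the intercepted string should normally keep its HTML tags complete. A minimal sketch of such an intercept is given below: it takes at least `min_len` characters and, if the cut would fall inside a tag, widens the window to the tag boundary. The function names are hypothetical, and the sketch assumes that `<` and `>` occur only as tag delimiters.

```python
def context_before(before, min_len=50):
    """Last min_len characters of `before`, widened so no tag is cut."""
    start = max(0, len(before) - min_len)
    lt = before.rfind("<", 0, start)
    gt = before.rfind(">", 0, start)
    if lt > gt:  # the cut falls inside a tag: back up to its "<"
        start = lt
    return before[start:]

def context_after(after, min_len=50):
    """First min_len characters of `after`, widened so no tag is cut."""
    end = min(len(after), min_len)
    lt = after.rfind("<", 0, end)
    gt = after.rfind(">", 0, end)
    if lt > gt:  # the cut falls inside a tag: extend to its ">"
        close = after.find(">", end)
        end = len(after) if close == -1 else close + 1
    return after[:end]

front = context_before('<div style="width:100%"><div class="t2">No.:', 10)
rear = context_after('</div><table width="100%" border="0">', 10)
print(front)
print(rear)
```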

The intercepted front and rear character strings are then formatted. After filtering, what remains in the two character strings is all HTML tags, which do not change from patent to patent and therefore serve as the metacharacters of the regular expression. Of the filtered-out non-HTML-tag content, the ordinary characters immediately adjacent to the current data item are retained, and the rest is replaced with a regular-expression wildcard, which yields the regular expression corresponding to this data item.
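The formatting rule just described can be sketched as follows, under the assumption that "tag" means any `<...>` run: tags are kept verbatim (escaped), the text segment touching the data item is kept verbatim, and every other text segment becomes the wildcard `[\s\S]*?`. The names `context_to_regex` and `build_rule`, the capture group around the data item, and the sample strings are illustrative choices, not taken from the patent.

```python
import re

WILDCARD = r"[\s\S]*?"
TAG = re.compile(r"(<[^>]*>)")

def context_to_regex(context, item_side):
    """Format one context string. item_side is "end" for the front string
    (the data item follows it) and "start" for the rear string."""
    parts = [p for p in TAG.split(context) if p]
    adjacent = len(parts) - 1 if item_side == "end" else 0
    out = []
    for i, part in enumerate(parts):
        if TAG.fullmatch(part) or i == adjacent:
            out.append(re.escape(part))  # tags and adjacent text stay literal
        else:
            out.append(WILDCARD)         # other text varies between patents
    return "".join(out)

def build_rule(front, rear):
    """Join front and rear regexes around a capture group for the data item."""
    return (context_to_regex(front, "end") + "(" + WILDCARD + ")"
            + context_to_regex(rear, "start"))

front = ('<div class="t1">A vehicle safety travel device</div>'
         '<div class="t2">Application number:')
rear = ('</div><table class="tb"><tr><td class="td1">Filing date:</td>'
        '<td>2010-12-23</td></tr>')
rule = build_rule(front, rear)

# A page of a different patent with the same layout (hypothetical data).
other_page = ('<div class="t1">Bicycle lamp</div>'
              '<div class="t2">Application number:201120000001</div>'
              '<table class="tb"><tr><td class="td1">Filing date:</td>'
              '<td>2011-01-05</td></tr></table>')
print(re.search(rule, other_page).group(1))
```

Because the title and the date were replaced by wildcards, the rule derived from one patent also matches the page of another patent with the same layout.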

Step 204: check whether the obtained regular expression can uniquely locate the current data item; if so, record the regular expression corresponding to the current data item and go to step 201; otherwise, increase the intercept length and go to step 203 again.

In this step, the obtained regular expression may be used to extract information from an HTML-format web page (either the currently obtained HTML-format web page or the HTML-format web page of another patent), and it is judged whether the corresponding data item is obtained uniquely. If so, the regular expression can uniquely locate this data item; otherwise it cannot.

For example, the regular expression obtained for the patent name is matched against the HTML-format web page of this patent to determine whether the patent name is extracted uniquely. If so, the patent name can be uniquely located. If other content besides the patent name is also extracted, the patent name cannot be uniquely located, and the intercept length needs to be increased and step 203 repeated to format a new regular expression.

Alternatively, the regular expression obtained for the patent name is matched against the HTML-format web pages of other patents to determine whether the patent name is extracted uniquely. If so, the patent name can be uniquely located. If other content besides the patent name is also extracted, the patent name cannot be uniquely located, and the intercept length needs to be increased and step 203 repeated to format a new regular expression.

If it is determined that the currently obtained regular expression cannot uniquely locate the current data item, the flow returns to step 203 to intercept longer stretches of HTML tags forward and backward and extract a new regular expression, until the current data item can be uniquely located.
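The check-and-widen loop of steps 203 and 204 might look like the sketch below. To keep it short, the context is escaped verbatim rather than filtered as step 203 prescribes, so it only illustrates the uniqueness test and the growing intercept length. The names, the step size, and the `max_length` guard are assumptions.

```python
import re

def build_pattern(html, value, length):
    """Escape `length` characters on each side of value, capture the middle."""
    pos = html.index(value)
    before = html[max(0, pos - length):pos]
    after = html[pos + len(value):pos + len(value) + length]
    return re.escape(before) + r"([\s\S]*?)" + re.escape(after)

def find_unique_pattern(html, value, length=50, step=10, max_length=500):
    """Grow the intercept length until the pattern extracts only `value`."""
    while length <= max_length:
        pattern = build_pattern(html, value, length)
        if re.findall(pattern, html) == [value]:
            return pattern, length
        length += step
    raise ValueError("no uniquely locating pattern found")

html = ('<td class="k">Application number:</td><td>201020685875</td>'
        '<td class="k">Filing date:</td><td>2010-12-23</td>')
# A deliberately small start length: the first attempt also matches the date.
pattern, length = find_unique_pattern(html, "201020685875", length=4, step=4)
print(length, re.findall(pattern, html))
```

The same `re.findall` test can be run against the page of another patent, as the description allows, by comparing the result with that patent's stored value.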

As an example, suppose the basic data selected from the already parsed patent information is as follows:

Patent name: A vehicle safety travel device

Application number: 201020685875

Filing date: December 23, 2010

The HTML-format web page of this patent information obtained from the website is as follows:

Suppose the patent application number/patent number is taken as the current data item first. The position of the current data item "201020685875" in the HTML-format web page is obtained, and a character string of at least 50 characters before it is taken (note that the preset intercept length used here may be a preset range of lengths, but the HTML tags usually need to be kept complete; for example, the shortest character string that has at least 50 characters and keeps the HTML tags complete), which is:

<div style="width:100%">

<div class="t1">A vehicle safety travel device</div>

<div class="t2">Application number/Patent number:

The character string of at least 50 characters after it is:

</div>

<table width="100%" border="0" cellspacing="1" cellpadding="0" class="tb">

<tr><td class="td1">Filing date:</td><td>December 23, 2010</td></tr>

After the non-HTML-tag content that has no locating significance is filtered out, the result has the form <div class="t2">"suitable regular expression"</div>. Formatting then yields the following front and rear regular expressions:

Front: <div style="width:100%"><div class="t1">[\s\S]*?</div><div class="t2">Application number/Patent number:

Rear: </div><table width="100%" border="0" cellspacing="1" cellpadding="0" class="tb">

<tr><td class="td1">[\s\S]*?</td><td>[\s\S]*?</td></tr>

In the front regular expression, the filtered-out non-HTML-tag content "Application number/Patent number:" is immediately adjacent to the data item, so the original character string is retained, while the filtered-out non-HTML-tag content "A vehicle safety travel device" is replaced with the regular-expression wildcard "[\s\S]*?".

In the rear regular expression, neither of the filtered-out non-HTML-tag strings "Filing date:" and "December 23, 2010" is adjacent to the data item, so each is replaced with the regular-expression wildcard "[\s\S]*?".
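The front and rear expressions of this example can be tried out as follows. The page text is an assumption reconstructed from the two intercepted strings (the full page is not reproduced in the description), it is written without whitespace between tags because the expressions contain none, and the capture group placed between the two expressions is an illustrative addition.

```python
import re

front = (r'<div style="width:100%"><div class="t1">[\s\S]*?</div>'
         r'<div class="t2">Application number/Patent number:')
rear = (r'</div><table width="100%" border="0" cellspacing="1" '
        r'cellpadding="0" class="tb">'
        r'<tr><td class="td1">[\s\S]*?</td><td>[\s\S]*?</td></tr>')
pattern = front + r'([\s\S]*?)' + rear

# Assumed page, assembled from the strings shown in the example.
page = ('<div style="width:100%">'
        '<div class="t1">A vehicle safety travel device</div>'
        '<div class="t2">Application number/Patent number:201020685875</div>'
        '<table width="100%" border="0" cellspacing="1" cellpadding="0" class="tb">'
        '<tr><td class="td1">Filing date:</td><td>December 23, 2010</td></tr>'
        '</table></div>')

matches = re.findall(pattern, page)
print(matches)
```

If the real page has line breaks between tags, the expressions would need whitespace handling (for example `\s*` between tags) before they match.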

The obtained regular expressions are used to check whether the position of the patent application number/patent number "201020685875" can be uniquely determined in the currently obtained HTML-format web page. If it can, the regular expression corresponding to the patent application number/patent number is recorded. If it cannot, the intercept length may be increased, longer character strings taken forward and backward, and, after filtering, formatted into a regular expression; this flow is repeated until the obtained regular expression can uniquely determine the position of the patent application number/patent number "201020685875".

The above process is repeated for each data item in the basic data one by one, yielding the regular expression corresponding to each data item, that is, the regular expression for parsing each data item, and thereby establishing a complete set of parsing rules.

Referring again to Fig. 1, step 103: use the regular expressions for parsing each data item to parse patent information from the HTML-format web pages that have not yet been parsed, and store the parsed patent information in the database.

Once the regular expressions for parsing each data item have been obtained, the patent information of each data item can be parsed from the unparsed HTML-format web pages. How to use the regular expression of a data item to parse its patent information from an HTML-format web page is prior art and is not described further here.
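Although the description treats this step as prior art, a short sketch may help: each rule is applied to an unparsed page and the resulting record is stored. The rule set, the table layout, and the in-memory SQLite database are all assumptions made for illustration; each rule is assumed to carry one capture group for the data item.

```python
import re
import sqlite3

def parse_page(rules, html):
    """Apply every rule to the page; missing items are stored as None."""
    record = {}
    for item, pattern in rules.items():
        match = re.search(pattern, html)
        record[item] = match.group(1).strip() if match else None
    return record

# Hypothetical rule set of the kind produced by step 102.
rules = {
    "name": r'<div class="t1">([\s\S]*?)</div>',
    "application_number": r'<div class="t2">Application number:([\s\S]*?)</div>',
}
page = ('<div class="t1">Bicycle lamp</div>'
        '<div class="t2">Application number:201120000001</div>')
record = parse_page(rules, page)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patents (name TEXT, application_number TEXT)")
conn.execute("INSERT INTO patents VALUES (:name, :application_number)", record)
row = conn.execute("SELECT name, application_number FROM patents").fetchone()
print(row)
```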

In the flow shown in Fig. 1, one piece of already parsed patent information may be selected as the basic data, and the set of regular expressions for parsing each data item is obtained after steps 101 and 102 are performed. More preferably, another piece of already parsed patent information may be selected from the database and its HTML-format web page obtained from the website in order to verify the obtained set of regular expressions: the set is used to parse the HTML-format web page of the other patent information, and it is judged whether the patent information obtained by parsing is consistent with this patent information in the database. If it is consistent, the set of regular expressions has passed verification; otherwise, an indication is given to correct the regular expression of the inconsistent data item. The correction may be made by manual intervention: after receiving the indication, a technician can inspect the regular expression of that data item and correct it.
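The verification against a second, already parsed patent could be sketched as below. The function returns the data items whose extracted value differs from the stored one, which is the list a technician would be asked to correct. The name `verify_rules` and the sample rules, page, and stored record are hypothetical.

```python
import re

def verify_rules(rules, html, stored):
    """Return the data items whose extracted value differs from the database."""
    failed = []
    for item, pattern in rules.items():
        match = re.search(pattern, html)
        extracted = match.group(1).strip() if match else None
        if extracted != stored.get(item):
            failed.append(item)
    return failed

rules = {
    "name": r'<div class="t1">([\s\S]*?)</div>',
    "filing_date": r'<td class="date">([\s\S]*?)</td>',  # does not fit the page
}
page = '<div class="t1">Bicycle lamp</div><td class="d">2011-01-05</td>'
stored = {"name": "Bicycle lamp", "filing_date": "2011-01-05"}

to_correct = verify_rules(rules, page, stored)
print(to_correct)
```

An empty list corresponds to "verification passed", after which step 103 can proceed.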

The above describes the method provided by the present invention; the device provided by the present invention is described in detail below through Embodiment 2.

Embodiment 2

Fig. 3 is a structural diagram of the patent information parsing device provided by Embodiment 2 of the present invention. As shown in Fig. 3, the device may include: a basic data acquiring unit 310, a web page acquiring unit 320, a rule formatting unit 330, and an information parsing unit 340.

The basic data acquiring unit 310 selects, from a database, patent information that has already been parsed as the basic data.

The patent information of one patent may be selected from the already parsed patent information as the basic data. This basic data does not change however the web page format changes. The current HTML-format web page of this patent information is then obtained from the website so that suitable parsing rules can be derived from it. To make the extracted parsing rules more accurate, and given that the information of a granted patent essentially does not change, the basic data may be chosen from granted patents.

The web page acquiring unit 320 obtains from the website the HTML-format web page of the patent information selected by the basic data acquiring unit 310.

If the device is triggered to start operating after the HTML format of the web page has changed, the HTML-format web page obtained by the web page acquiring unit 320 at this point is the web page in the changed format.

The rule formatting unit 330, for each data item in the basic data, obtains from the HTML-format web page obtained by the web page acquiring unit 320 a character string that can uniquely locate that data item, and formats it into a regular expression for parsing that data item.

Each data item in the basic data is one or more of the bibliographic items that the user cares about, for example the title of the invention, inventor information, applicant information, filing date, publication date, priority information, application number, patent number, and so on. The specific structure of the rule formatting unit 330 is described in detail later.

After the regular expressions for parsing each data item have been obtained by the above units through reverse parsing, the information parsing unit 340 uses these regular expressions to parse patent information from the HTML-format web pages of the website that have not yet been parsed, and stores the parsed patent information in the database.

In addition, the device provided by this embodiment of the present invention in effect uses already parsed patent information to obtain the regular expressions by reverse parsing when the HTML format of the page changes. In other words, this flow may be started when the HTML format changes; once the regular expressions have been obtained, they can continue to be used for parsing patent information for as long as the HTML format does not change. To realize the reverse parsing of the regular expressions, the device further includes a trigger control unit 300, configured to periodically detect whether the HTML format of the website has changed and, if a change in the HTML format is detected, trigger the basic data acquiring unit 310 to operate; or to trigger the basic data acquiring unit 310 to operate under manual control; or to periodically trigger the basic data acquiring unit 310 to operate regardless of whether the HTML format of the website has changed.

Below the concrete structure of rule schemata unit 330 is described, this rule schemata unit 330 can specifically comprise: data item is obtained subelement 331, locator unit 332, format subelement 333 and check subelement 334.

Data item is obtained subelement 331 and is obtained a data item not being acquired in basic data as current data item.

In the html format webpage that locator unit 332 obtains at webpage acquiring unit 320, determine the position of current data item.

Format subelement 333 position definite from locator unit 332 intercepts forward and backward respectively the character string of presetting intercepted length, after the non-html tag in the character string of filtration intercepting, former and later two string formats changed into regular expression.Wherein, first intercepted length can be a default initial value.

Character, the new line that in the character string of intercepting, may comprise html tag, patent information, text formatting accord with etc., the information of wherein only having labels class is along with patent is different, not change, there is location meaning, therefore need to filter the non-html tag that does not possess location meaning.

Particularly, format subelement 333 is when changing into regular expression by former and later two string formats, the concrete metacharacter of each character in former and later two character strings after filtering in regular expression, the common character that is close to current data item in the non-html tag filtering retains in regular expression, and the other guide of filtration replaces with regular expression wild symbol in regular expression.

Whether the regular expression that check subelement 334 check format subelements 333 obtain can unique location current data item, if, record the regular expression of current data item correspondence, trigger data item obtains subelement 331, continues to start the reverse resolution of next data item; Otherwise, increase intercepted length, trigger format subelement 333, intercept forward and backward the regular expression that the character string of big-length more obtains this data item.

Particularly, in the check subelement 334 html format webpage that the regular expression obtaining can be obtained to webpage acquiring unit or information extraction in other html format webpages, judge whether uniquely to obtain the content of current data item, if so, explanation can unique location current data item.

Further, this device can also comprise: regular authentication unit 350, the regular expression of each data item obtaining for proof rule formatting unit 330, be specially: in rule schemata unit 330, obtain resolving after the regular expression of each data item, from database, alternative is selected a patent information that has completed parsing, and from website, obtain and select the html format webpage of selecting patent information else, the html format webpage that utilizes regular expression that rule schemata unit 330 obtains resolving each data item to select patent information from alternative extracts the patent information of each data item, and whether the patent information that judgement is extracted is consistent with the patent information of storing in database, if consistent, determine and be verified, trigger message resolution unit 340, otherwise indication is revised the regular expression of inconsistent data item.

The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A patent information analysis method, characterized in that the method comprises:

S1: selecting, from a database, patent information that has already been parsed as basic data, and obtaining from a website the HyperText Markup Language (HTML) format webpage of said patent information;

S2: for each data item in said basic data, performing the following steps S21 to S24 to obtain a regular expression for parsing each data item:

S24: checking whether the obtained regular expression can uniquely locate the current data item; if so, recording the regular expression corresponding to the current data item and going to said step S21; otherwise, increasing said interception length and going to said step S23 again;

2. The method according to claim 1, characterized in that whether the HTML format of said website has changed is detected periodically, and if a change in the HTML format is detected, execution of said step S1 is triggered; or,

execution of said step S1 is triggered manually; or,

3. The method according to claim 1, characterized in that, in said step S23, formatting the preceding and following character strings into a regular expression specifically comprises:

4. The method according to claim 1, characterized in that, in said step S24, checking whether the obtained regular expression can uniquely locate the current data item specifically comprises:

5. The method according to claim 1, characterized in that, between said step S2 and step S3, the method further comprises:

6. A patent information analysis device, characterized in that the device comprises:

an information parsing unit, configured to use the regular expressions for parsing said data items to parse patent information from unparsed HTML-format webpages of said website, and to store the parsed patent information in said database;

said rule formatting unit specifically comprises:

7. The device according to claim 6, characterized in that the device further comprises:

8. The device according to claim 6, characterized in that, when formatting the preceding and following character strings into a regular expression, said formatting subunit escapes any characters in the filtered strings that are regular-expression metacharacters; the ordinary characters in the filtered-out non-HTML-tag content that are immediately adjacent to the current data item are retained in the regular expression, and the remaining filtered-out content is replaced in the regular expression with regular-expression wildcards.

9. The device according to claim 6, characterized in that said check subunit uses the obtained regular expression to extract information from the HTML-format webpage obtained by said webpage acquisition unit or from other HTML-format webpages, and judges whether the content of the current data item is obtained uniquely; if so, the regular expression can uniquely locate the current data item.

10. The device according to claim 6, characterized in that the device further comprises a rule verification unit, configured to: after said rule formatting unit has obtained the regular expressions for parsing each data item, select from said database another piece of patent information that has already been parsed, and obtain from said website the HTML-format webpage of that other patent information; use the regular expressions obtained by said rule formatting unit to extract the patent information of each data item from the HTML-format webpage of that other patent information; and judge whether the extracted patent information is consistent with the patent information stored in said database; if it is consistent, determine that verification has passed and trigger said information parsing unit; otherwise, indicate that the regular expression of the inconsistent data item is to be revised.