CN101770505B - Information capturing method and capturer reestablishing method and system - Google Patents
Information capturing method and capturer reestablishing method and system
- Publication number
- CN101770505B (CN 200910259007 A)
- Authority
- CN
- China
- Prior art keywords
- information extraction
- information
- extraction device
- unusual
- confidence values
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides an information capturing method and a capturer reestablishing method and system. The information capturing method is applied to capturing a piece of dynamic information and comprises the following steps: capturing corresponding reference values, through a plurality of information capturers, from a plurality of information sources that provide the dynamic information; and determining the most credible value of the dynamic information according to the captured results. The method also comprises verifying whether each information capturer can capture information normally and removing abnormal information capturers, and reestablishing new information capturers to replace the removed abnormal ones. Reliable dynamic information can therefore be captured effectively, and the operation of the information capturers can be maintained effectively.
Description
Technical field
The present invention relates to an information capturing method, an information capturing system and a computer program product that provide reliable information and have a self-reestablishing function.
Background art
With the development of the Internet, all kinds of dynamic information (for example, weather information, stock market information, etc.) can easily be downloaded via the Internet. An information capturer is a technology for capturing specific data from an information source (for example, a web page).
Information capturer technology lets users easily capture the dynamic information they need from an information source. However, once the format of the information source changes (for example, a web page is redesigned), the capturing rules of the information capturer usually must be updated to match the new format; otherwise the information capturer can no longer capture data correctly from the corresponding data source.
Because the format of an information source may be updated frequently and at irregular intervals, maintaining information capturers manually is difficult and laborious. In addition, when many kinds of dynamic information need to be captured, manually maintaining an information capturer for each kind (for example, a stock market short-swing trading information capturer, a Taipei City temperature information capturer, etc.) is especially impractical. Moreover, dynamic information captured from one particular information source may not be reliable because of unexpected factors (for example, the linked information source does not update the dynamic information in real time). Therefore, providing a mechanism that supplies reliable dynamic information and can repair or reestablish abnormal information capturers by itself is a goal that those skilled in the art strive for.
Summary of the invention
An exemplary embodiment of the present invention provides an information capturing method that can provide reliable dynamic information from a plurality of information sources, detect abnormal information capturers, and reestablish abnormal information capturers.
An exemplary embodiment of the present invention provides an information capturing system that can provide reliable dynamic information from a plurality of information sources, detect abnormal information capturers, and reestablish abnormal information capturers.
An exemplary embodiment of the present invention provides a computer program product having an information capturing program that can provide reliable dynamic information from a plurality of information sources, detect abnormal information capturers, and reestablish abnormal information capturers.
An exemplary embodiment of the present invention provides an information capturer reestablishing method that can detect abnormal information capturers and reestablish abnormal information capturers.
An exemplary embodiment of the present invention provides an information capturer reestablishing system that can detect abnormal information capturers and reestablish abnormal information capturers.
An exemplary embodiment of the present invention provides a computer program product having an information capturing program that can detect abnormal information capturers and reestablish abnormal information capturers.
An exemplary embodiment of the present invention proposes an information capturing method for capturing a piece of dynamic information. The method includes establishing a plurality of information capturers linked to a plurality of information sources that provide the dynamic information, setting a weight value for each information capturer, recording the first reference value about the dynamic information that each information capturer captures from its corresponding information source at a first time point, and determining a first credible value of the dynamic information according to the first reference values. The method also includes, at a second time point, using each information capturer to capture a second reference value about the dynamic information from its corresponding information source, and determining a second credible value of the dynamic information at the second time point according to the weight values of the information capturers and the captured second reference values. The method further includes verifying whether each information capturer is abnormal, and removing any information capturer that is verified as abnormal.
An exemplary embodiment of the present invention proposes an information capturing system for capturing a piece of dynamic information. The information capturing system includes an information capturer establishing unit, a storage unit, an information capturing and integration unit, and an information capturer verification unit. The information capturer establishing unit establishes a plurality of information capturers linked to a plurality of information sources that provide the dynamic information, and sets a weight value for each information capturer. The storage unit stores the first reference value about the dynamic information that each information capturer captures from its corresponding information source at a first time point, and a first credible value of the dynamic information. The information capturing and integration unit uses each information capturer to capture, at a second time point, a second reference value about the dynamic information from its corresponding information source, and determines a second credible value of the dynamic information at the second time point according to the weight values of the information capturers and the captured second reference values. The information capturer verification unit verifies whether each information capturer is abnormal, and removes any information capturer that is verified as abnormal.
An exemplary embodiment of the present invention proposes a computer program product having an information capturing program; after a computer loads and executes the information capturing program, the above information capturing method is carried out.
An exemplary embodiment of the present invention proposes an information capturer reestablishing method for an information capturing system, wherein the information capturing system captures a piece of dynamic information from a plurality of information sources linked by a plurality of information capturers, and each information capturer has a weight value. The reestablishing method includes, at a second time point, using each information capturer to capture a second reference value about the dynamic information from its corresponding information source, and determining a second credible value of the dynamic information at the second time point according to the weight values of the information capturers and the captured second reference values. The reestablishing method also includes establishing, according to the second credible value, a substitute information capturer for each information source that is not linked with an information capturer. The reestablishing method also includes, at a third time point, using each information capturer and each reestablished substitute information capturer to capture a third reference value about the dynamic information from its corresponding information source, and determining a third credible value of the dynamic information at the third time point according to the weight values of the information capturers and the captured third reference values. The reestablishing method further includes verifying whether the information capturers or the reestablished substitute information capturers are abnormal, and removing any information capturer or substitute information capturer that is verified as abnormal.
An exemplary embodiment of the present invention proposes an information capturer reestablishing system for an information capturing system, wherein the information capturing system captures a piece of dynamic information from a plurality of information sources linked by a plurality of information capturers, and each information capturer has a weight value. The reestablishing system includes an information capturing and integration unit, an information capturer reestablishing unit, and an information capturer verification unit. The information capturing and integration unit uses each information capturer to capture, at a second time point, a second reference value about the dynamic information from its corresponding information source, and determines a second credible value of the dynamic information at the second time point according to the weight values of the information capturers and the captured second reference values. The information capturer reestablishing unit establishes, according to the second credible value, a substitute information capturer for each information source that is not linked with an information capturer. The information capturing and integration unit further uses, at a third time point, each information capturer and each reestablished substitute information capturer to capture a third reference value about the dynamic information from its corresponding information source, and determines a third credible value of the dynamic information at the third time point according to the weight values of the information capturers and the reestablished substitute information capturers and the captured third reference values. The information capturer verification unit verifies whether the information capturers or the substitute information capturers are abnormal, and removes any information capturer or substitute information capturer that is verified as abnormal.
An exemplary embodiment of the present invention proposes a computer program product having an information capturer reestablishing program; after a computer loads and executes the information capturer reestablishing program, the above information capturer reestablishing method is carried out.
The present invention proposes an information capturing method for capturing a piece of dynamic information. The method includes establishing a plurality of information capturers linked to a plurality of information sources, and setting a weight value and an abnormal count value for each information capturer, wherein each information source provides the dynamic information. The method also includes, at a second time point, using each information capturer to capture a second reference value about the dynamic information from its corresponding information source, and determining a second credible value of the dynamic information at the second time point according to the weight values and the captured second reference values. The method also includes determining, for each information capturer, whether the second reference value it captured differs from the determined second credible value, and incrementing the abnormal count value of each information capturer whose second reference value differs from the second credible value. The method further includes determining, for each information capturer, whether its abnormal count value is less than an abnormal count threshold, verifying as abnormal each information capturer whose abnormal count value is not less than the abnormal count threshold, and removing the information capturers verified as abnormal.
An exemplary embodiment of the present invention proposes an information capturing system for capturing a piece of dynamic information. The information capturing system includes an information capturer establishing unit, a storage unit, an information capturing and integration unit, and an information capturer verification unit. The information capturer establishing unit establishes a plurality of information capturers linked to a plurality of information sources and sets a weight value and an abnormal count value for each information capturer, wherein each information source provides the dynamic information. The information capturing and integration unit uses each information capturer to capture, at a second time point, a second reference value about the dynamic information from its corresponding information source, and determines a second credible value of the dynamic information at the second time point according to the weight values and the captured second reference values. The information capturer verification unit determines, for each information capturer, whether the second reference value it captured differs from the determined second credible value, and increments the abnormal count value of each information capturer whose second reference value differs from the second credible value. In addition, the information capturer verification unit determines, for each information capturer, whether its abnormal count value is less than an abnormal count threshold, verifies as abnormal each information capturer whose abnormal count value is not less than the abnormal count threshold, and removes the information capturers verified as abnormal.
An exemplary embodiment of the present invention proposes a computer program product having an information capturing program; after a computer loads and executes the information capturing program, the above information capturing method is carried out.
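The abnormal count check described above can be illustrated with a minimal Python sketch. The names `Capturer`, `verify_and_remove` and `abnormal_threshold` are hypothetical and do not come from the patent. The sketch also assumes that the abnormal count value is never reset once it has been incremented, a point the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Capturer:
    """Hypothetical record for one information capturer."""
    source: str
    weight: float = 1.0
    abnormal_count: int = 0


def verify_and_remove(capturers, readings, credible_value, abnormal_threshold):
    """Count mismatches against the credible value and drop abnormal capturers.

    capturers: dict mapping capturer id -> Capturer
    readings:  dict mapping capturer id -> reference value captured this round
    Returns a new dict holding only the capturers that are still kept.
    """
    kept = {}
    for cid, cap in capturers.items():
        # Increment the abnormal count when the captured reference value
        # differs from the credible value (assumption: no reset on a match).
        if readings[cid] != credible_value:
            cap.abnormal_count += 1
        # A capturer whose count is NOT less than the threshold is abnormal.
        if cap.abnormal_count < abnormal_threshold:
            kept[cid] = cap
    return kept
```

With a threshold of 2, a capturer that disagrees with the credible value in two rounds is removed after the second round, while capturers that agree are kept.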
Based on the above, the exemplary embodiments of the present invention use a plurality of information capturers to capture dynamic information from a plurality of information sources, determine the credible value of the dynamic information according to the weight values of the information capturers and an update threshold, detect abnormal information capturers according to the determination results, and reestablish information capturers for the information sources where an abnormality occurred, thereby reliably updating the value of the dynamic information.
To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a schematic diagram illustrating the information capturing operation according to the first exemplary embodiment of the present invention.
Fig. 2 is a schematic block diagram illustrating the information capturing system according to the first exemplary embodiment of the present invention.
Fig. 3 illustrates an example of determining a credible value according to an exemplary embodiment of the present invention.
Fig. 4 illustrates an example of determining that an information capturer is abnormal according to an exemplary embodiment of the present invention.
Fig. 5A and Fig. 5B are flowcharts illustrating the information capturing method according to the first exemplary embodiment of the present invention.
Fig. 6 is a schematic block diagram illustrating the information capturing system according to the second exemplary embodiment of the present invention.
Fig. 7 is a flowchart illustrating the information capturing method according to the second exemplary embodiment of the present invention.
Fig. 8 is a detailed flowchart illustrating step S701 in Fig. 7.
Fig. 9 is a schematic block diagram illustrating the information capturing system according to the third exemplary embodiment of the present invention.
Fig. 10 illustrates a capturing example according to the third exemplary embodiment of the present invention.
Fig. 11 is a flowchart illustrating the information capturing method according to the third exemplary embodiment of the present invention.
Fig. 12 is a detailed flowchart illustrating step S1101 in Fig. 11.
Fig. 13 is a schematic block diagram illustrating the information capturing system according to the fourth exemplary embodiment of the present invention.
Fig. 14 illustrates an example of determining that an information capturer is abnormal according to the fourth exemplary embodiment of the present invention.
Fig. 15A and Fig. 15B are flowcharts illustrating the information capturing method according to the fourth exemplary embodiment of the present invention.
Description of reference numerals
1: user terminal
100, 600, 900: information capturing system
102, 104, 106, 108, 110, 102-1, 102-2, 102-3: information capturer
112, 114, 116, 118, 120: web page
150: the Internet
202, 202': information capturer establishing unit
204: storage unit
206: information capturing and integration unit
208, 208': information capturer verification unit
S501, S503, S505, S507, S509, S511, S513, S515, S517, S519, S521, S523, S701, S801, S803, S805, S807, S809, S811, S813, S1101, S1201, S1203, S1205, S1207: information capturing steps
602: information capturer weight value updating unit
902: information capturer reestablishing unit
S1501, S1503, S1505, S1507, S1509, S1511, S1513, S1515, S1517, S1519, S1521, S1523, S1525, S1527, S1529, S1531, S1533, S1535, S1537, S1539, S1541, S1543: information capturing steps
Embodiments
[First exemplary embodiment]
Fig. 1 is a schematic diagram illustrating the information capturing operation according to the first exemplary embodiment of the present invention.
Referring to Fig. 1, in this exemplary embodiment a user can operate the information capturing system 100 through a user terminal 1 (for example, a personal computer) to capture dynamic information about the probability of rain in Taipei City from a plurality of information sources. The information capturing system 100 establishes information capturers 102, 104, 106, 108 and 110, which capture the information about the probability of rain in Taipei City from web pages 112, 114, 116, 118 and 120 via the Internet 150. Specifically, web pages 112, 114, 116, 118 and 120 all provide dynamic information about the probability of rain in Taipei City, and the information capturing system 100 establishes one corresponding information capturer for each web page; each information capturer links to its corresponding web page to capture the dynamic information about the probability of rain in Taipei City. For example, information capturers 102, 104, 106, 108 and 110 capture the information about the probability of rain in Taipei City from web pages 112, 114, 116, 118 and 120, respectively.
It is worth mentioning that information capturers 102, 104, 106, 108 and 110 represent and parse the linked web pages 112, 114, 116, 118 and 120 as Document Object Model (DOM) trees and thereby capture the information about the probability of rain in Taipei City. However, it must be understood that the invention is not limited to this. In another embodiment of the present invention, an information capturer may also use a finite state machine or a regular expression to parse the linked data source. In addition, although the exemplary embodiments of the present invention are described in terms of capturing information about the probability of rain in Taipei City, the invention is not limited to this, and the information capturing system 100 can be applied to capturing all kinds of dynamic information from various information sources.
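As a rough illustration of the regular expression alternative mentioned above, the following Python sketch captures a rain probability from page text. The page markup, the pattern `PATTERN` and the function name `capture_rain_probability` are all hypothetical; the patent does not give a concrete capturing rule.

```python
import re

# Hypothetical capturing rule: a label followed by a percentage.
PATTERN = re.compile(r"Probability of rain:\s*(\d{1,3})\s*%")


def capture_rain_probability(html):
    """Return the captured reference value as an int, or None if the rule fails."""
    match = PATTERN.search(html)
    return int(match.group(1)) if match else None
```

A rule of this kind stops matching as soon as the page layout or wording changes, which is the situation in which an information capturer becomes abnormal and has to be reestablished.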
Fig. 2 is a schematic block diagram illustrating the information capturing system according to the first exemplary embodiment of the present invention.
Referring to Fig. 2, the information capturing system 100 includes an information capturer establishing unit 202, a storage unit 204, an information capturing and integration unit 206, and an information capturer verification unit 208.
The information capturer establishing unit 202 establishes information capturers that link to information sources. That is, the information capturer establishing unit 202 establishes a corresponding information capturer for each web page that the user wishes to link. As shown in Fig. 1, the information capturer establishing unit 202 establishes information capturers 102, 104, 106, 108 and 110 for web pages 112, 114, 116, 118 and 120, respectively.
The information capturing and integration unit 206 requests the information capturers to capture data from their corresponding information sources. For example, the information capturing and integration unit 206 instructs information capturers 102, 104, 106, 108 and 110 to capture the information about the probability of rain in Taipei City from web pages 112, 114, 116, 118 and 120. In particular, the information capturing and integration unit 206 determines the credible value of the dynamic information to be captured according to the results captured by the information capturers. As mentioned above, a web page (that is, an information source) may fail to provide correct data in real time because of how often its data is updated. Therefore, the information capturing method of this exemplary embodiment captures the same dynamic information from a plurality of information sources and determines the credible value of the dynamic information at the current time point according to the data provided by those information sources. For example, the information capturing and integration unit 206 requests information capturers 102, 104, 106, 108 and 110 to capture reference values about the probability of rain in Taipei City from web pages 112, 114, 116, 118 and 120 at a specific time, and determines the credible value of the probability of rain in Taipei City according to the reference values captured by information capturers 102, 104, 106, 108 and 110.
Specifically, in the exemplary embodiments of the present invention, the information capturer establishing unit 202 sets a weight value for each information capturer. The information capturing and integration unit 206 calculates an update weight value from the reference values that the information capturers capture at a specific time point and from their weight values, and determines whether this update weight value exceeds an update threshold in order to decide the credible value of the dynamic information at that time point.
In the exemplary embodiments of the present invention, the weight value represents the weight of an information capturer when the information is integrated; for example, the weight value is designed to be a value between 0 and 1. In this exemplary embodiment, the weight value of each information capturer is set by the user when the information capturer is established, and the weight values of the information capturers may be the same or different. The update threshold is used to determine when to update the credible value, and the user may set it to a suitable value according to the design of the weight values. The higher the update threshold is set, the more information sources must have updated their dynamic information (for example, the probability of rain in Taipei City) before the credible value is updated. In this exemplary embodiment, the update threshold is set to 1.5.
Fig. 3 illustrates an example of determining a credible value according to an exemplary embodiment of the present invention. Suppose that at time point T0 the reference values about the probability of rain in Taipei City captured by information capturers 102, 104, 106, 108 and 110 from web pages 112, 114, 116, 118 and 120 are all 10%, and that the credible value of the probability of rain in Taipei City at time point T0 is 10%. Suppose further that at time point T1 the weight values of information capturers 102, 104, 106, 108 and 110 are all 1 and the reference values they capture from web pages 112, 114, 116, 118 and 120 are 11%, 10%, 10%, 10% and 10%, respectively. Because only the reference value captured by information capturer 102 has been updated to 11%, the information capturing and integration unit 206 calculates that the update weight value for a probability of rain of 11% is 1, which is less than the update threshold. The information capturing and integration unit 206 therefore determines that the credible value of the probability of rain in Taipei City at time point T1 is 10%.
If, at time point T2, the reference values captured by information capturers 102, 104, 106, 108 and 110 from web pages 112, 114, 116, 118 and 120 are 11%, 11%, 10%, 10% and 10%, respectively, then the reference values captured by information capturer 102 and information capturer 104 have both been updated to 11%. The information capturing and integration unit 206 calculates that the update weight value for a probability of rain of 11% is 2, which is greater than the update threshold. The information capturing and integration unit 206 therefore determines that the credible value of the probability of rain in Taipei City at time point T2 is 11%.
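The Fig. 3 example can be reproduced with a short Python sketch. The function name `update_credible_value` is hypothetical. The sketch also assumes that, when several candidate values exceed the update threshold at once, the one with the largest update weight value is chosen; the text does not cover that case.

```python
from collections import defaultdict


def update_credible_value(current, readings, weights, threshold=1.5):
    """Return the credible value after one capturing round.

    current:  credible value from the previous time point
    readings: dict mapping capturer id -> reference value captured now
    weights:  dict mapping capturer id -> weight value
    """
    # Sum the weight values of the capturers reporting each new value.
    update_weight = defaultdict(float)
    for cid, value in readings.items():
        if value != current:
            update_weight[value] += weights[cid]
    if not update_weight:
        return current
    # Assumption: prefer the candidate with the largest update weight value.
    best_value, best_weight = max(update_weight.items(), key=lambda kv: kv[1])
    # Update only when the update weight value exceeds the update threshold.
    return best_value if best_weight > threshold else current
```

With all weight values equal to 1 and an update threshold of 1.5, the readings at T1 (11, 10, 10, 10, 10) leave the credible value at 10, and the readings at T2 (11, 11, 10, 10, 10) change it to 11, matching the example.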
Referring again to Fig. 2, the information capturer verification unit 208 verifies whether each information capturer is abnormal according to the reference values that the information capturer has captured, and removes any information capturer that is abnormal.
Specifically, as mentioned above, an information source may redesign its file (for example, a web page), so a previously established information capturer may no longer be able to capture the specific dynamic information with the same capturing rule and thus enters an abnormal capturing state. In this example, the information capturer verification unit 208 determines whether an information capturer is in an abnormal state according to its capturing history. For example, when the reference value that an information capturer captured at the current time point or the previous time point is different from the credible value of the current time point or the previous time point, the information capturer can be regarded as being in an abnormal state.
Fig. 4 illustrates an example of determining that an information capturer is abnormal according to an exemplary embodiment of the present invention.
Referring to Fig. 4, suppose that at time point T0 the reference values about the probability of rain in Taipei City captured by information capturers 102, 104, 106, 108 and 110 from web pages 112, 114, 116, 118 and 120 are all 10%, and that the credible value of the probability of rain in Taipei City at time point T0 is 10%. Suppose further that at time point T1 the weight values of information capturers 102, 104, 106, 108 and 110 are all 1 and the update threshold is set to 1.5. When the reference values captured by information capturers 102, 104, 106, 108 and 110 from web pages 112, 114, 116, 118 and 120 are 90%, 10%, 10%, 10% and 10%, respectively, the information capturing and integration unit 206 determines that the credible value of the probability of rain in Taipei City at time point T1 is 10%.
If, at time point T2, the reference values captured by information capturers 102, 104, 106, 108 and 110 from web pages 112, 114, 116, 118 and 120 are 90%, 11%, 11%, 10% and 10%, respectively, the information capturing and integration unit 206 determines that the credible value of the probability of rain in Taipei City at time point T2 is 11%. At this moment, the reference value "90%" that information capturer 102 captured at time point T1 is different from both the credible value "10%" of time point T1 and the credible value "11%" of time point T2, and the reference value "90%" that it captured at time point T2 is likewise different from both the credible value "10%" of time point T1 and the credible value "11%" of time point T2. The information capturing and integration unit 206 therefore determines that information capturer 102 is in an abnormal state and removes it. The information capturing steps according to the exemplary embodiment of the present invention are described in detail below with reference to the accompanying drawings.
In another embodiment of the present invention, the information extraction system 100 further comprises an output unit (not shown) that outputs the judged confidence value of the Taipei City rainfall probability.
Figs. 5A and 5B are flowcharts of the information extraction method according to the first exemplary embodiment of the present invention, where Fig. 5A shows the steps of establishing the information extraction devices and Fig. 5B shows the steps of performing information extraction.
Referring to Fig. 5A, in step S501 a plurality of information extraction devices (for example, the information extraction devices 102, 104, 106, 108 and 110) are established for a plurality of information sources (for example, web pages 112, 114, 116, 118 and 120), where the information extraction devices capture the dynamic information (for example, the Taipei City rainfall probability value) from the information sources. Next, in step S503, the weighted value of each information extraction device is set. Then, in step S505, the reference value captured by each information extraction device at the current time point is recorded, and the current confidence value of the dynamic information is recorded. Steps S501, S503 and S505 thus complete the initialization of the information extraction devices.
Next, referring to Fig. 5B, in step S507 the information extraction devices capture reference values from the information sources.
Then, in step S509, it is judged whether the reference value captured by any information extraction device has been updated. If it is judged in step S509 that none of the reference values captured by the information extraction devices has been updated, then in step S511 the confidence value of the previous time point is used as the confidence value of the current time point. If it is judged in step S509 that a reference value captured by an information extraction device has been updated, then in step S513 the update weighted value of each updated reference value is calculated from the weighted values of the information extraction devices. Specifically, step S513 calculates update weighted values only for the reference values that were updated at this time point; reference values that were not updated are excluded from the calculation.
Then, in step S515, it is judged whether the update weighted value of any updated reference value exceeds the update threshold. If none of the update weighted values exceeds the update threshold, step S511 is executed.
If it is judged in step S515 that the update weighted value of an updated reference value exceeds the update threshold, then in step S517 the updated reference value with the highest update weighted value is taken as the confidence value of the dynamic information. Once the confidence value of the dynamic information has been updated, the information extraction method of this exemplary embodiment verifies each information extraction device.
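Steps S509 to S517 can be sketched in Python as follows. This is only an illustrative sketch, not the claimed implementation: the function name `decide_confidence` and the dictionary layout are hypothetical, and the update weighted value of a reference value is assumed here to be the sum of the weighted values of the information extraction devices whose newly updated reference value equals it. The sample data are the values of Fig. 4.

```python
from collections import defaultdict


def decide_confidence(prev_refs, cur_refs, weights, prev_conf, threshold):
    """Return the confidence value for the current time point.

    prev_refs / cur_refs map a device id to the reference value captured
    at the previous / current time point (hypothetical data layout).
    """
    # S509/S513: only reference values that changed since the previous
    # time point contribute; each adds its device's weighted value
    # (summation is an assumption of this sketch).
    update_weight = defaultdict(float)
    for dev, value in cur_refs.items():
        if value != prev_refs.get(dev):
            update_weight[value] += weights[dev]

    # S511: nothing was updated, so keep the previous confidence value.
    if not update_weight:
        return prev_conf

    # S515/S517: adopt the best-supported updated value only if its
    # update weighted value exceeds the update threshold.
    best = max(update_weight, key=update_weight.get)
    return best if update_weight[best] > threshold else prev_conf


weights = {102: 1, 104: 1, 106: 1, 108: 1, 110: 1}
t0 = {102: 10, 104: 10, 106: 10, 108: 10, 110: 10}
t1 = {102: 90, 104: 10, 106: 10, 108: 10, 110: 10}
t2 = {102: 90, 104: 11, 106: 11, 108: 10, 110: 10}

conf_t1 = decide_confidence(t0, t1, weights, prev_conf=10, threshold=1.5)
conf_t2 = decide_confidence(t1, t2, weights, prev_conf=conf_t1, threshold=1.5)
print(conf_t1, conf_t2)  # 10 11
```

At time point T1 only the value 90% is new and its update weighted value of 1 does not exceed 1.5, so the confidence value stays at 10%; at time point T2 the value 11% is supported by two devices (update weighted value 2), so it becomes the confidence value.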
In step S519, it is judged whether any information extraction device remains unverified. If all information extraction devices have been verified, the flow shown in Fig. 5B ends. If an information extraction device remains unverified, then in step S521 it is judged whether that information extraction device is abnormal according to the reference values it captured at the current and previous time points and the current and previous confidence values (in the manner shown in Fig. 4). If it is judged in step S521 that the information extraction device is abnormal, then in step S523 the information extraction device is removed and the flow returns to step S519.
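The verification loop of steps S519 to S523 can be illustrated with the following Python sketch. The function names are hypothetical, and the abnormality rule is one reading of the Fig. 4 example: a device is treated as abnormal when neither of its last two reference values equals the previous or the current confidence value.

```python
def is_abnormal(ref_prev, ref_cur, conf_prev, conf_cur):
    """Fig. 4 style check (assumed reading): True when neither captured
    reference value matches the previous or current confidence value."""
    trusted = {conf_prev, conf_cur}
    return ref_prev not in trusted and ref_cur not in trusted


def verify_all(refs_prev, refs_cur, conf_prev, conf_cur):
    """S519-S523: check every device once and return the ids that are kept."""
    return [dev for dev in refs_cur
            if not is_abnormal(refs_prev[dev], refs_cur[dev], conf_prev, conf_cur)]


# Reference values of Fig. 4 at time points T1 and T2.
t1 = {102: 90, 104: 10, 106: 10, 108: 10, 110: 10}
t2 = {102: 90, 104: 11, 106: 11, 108: 10, 110: 10}

kept = verify_all(t1, t2, conf_prev=10, conf_cur=11)
print(kept)  # [104, 106, 108, 110]
```

Devices 108 and 110 still report 10% at time point T2, but that value matches the confidence value of time point T1, so under this reading only device 102 is removed.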
This exemplary embodiment uses a plurality of information extraction devices to capture the desired dynamic information from a plurality of information sources, which ensures the reliability of the captured information. In addition, because the confidence value of the dynamic information is judged by means of the update threshold, the dynamic information can be updated quickly without sacrificing its reliability.
[Second exemplary embodiment]
In the first exemplary embodiment of the present invention, the weighted value of each information extraction device stays fixed during capture once it has been initially set. However, the weighted value of each information extraction device can also be updated dynamically according to the capture result at each time point, which makes the update of the confidence value of the dynamic information more accurate and faster.
Fig. 6 is a schematic block diagram of an information extraction system according to the second exemplary embodiment of the present invention.
Referring to Fig. 6, compared with the information extraction system 100, the information extraction system 600 further comprises an information extraction device weighted value updating unit 602. Apart from the information extraction device weighted value updating unit 602, the structure and function of the other parts of the information extraction system 600 are the same as those of the information extraction system 100 and are not described again here.
The information extraction device weighted value updating unit 602 dynamically updates the weighted value of each information extraction device. For example, in this exemplary embodiment the weighted value of each information extraction device is calculated from the confidence level and the real-time degree of that information extraction device.
The confidence level of an information extraction device expresses how much the user trusts the reference values that the device captures. In this exemplary embodiment, the confidence level is designed as a value between 0 and 1. For example, when the user establishes an information extraction device for a specific information source through initialization by the information extraction device establishing unit 202, the confidence level of the information extraction device is set to 1. At each time point at which the confidence value is updated, if the reference value captured by the information extraction device is identical to the confidence value judged by the information extraction and integration unit 206, the information extraction device weighted value updating unit 602 updates the confidence level of the information extraction device by formula (1):
$$R_{n,t} = R_{n,t-1} \times \alpha + 1.0 \times (1 - \alpha) \qquad (1)$$
where $R_{n,t}$ denotes the confidence level of the n-th information extraction device at time point t, $R_{n,t-1}$ denotes the confidence level of the n-th information extraction device at time point (t-1), and α denotes the confidence level adjustment parameter. In this exemplary embodiment, the α value determines how strongly the confidence level is adjusted: the lower the α value, the larger the increase in the confidence level, and the higher the α value, the smaller the increase. For example, because a rainfall probability value has few digits, the likelihood that an information extraction device captures an identical value that is nevertheless not the specific dynamic information to be captured is relatively high. Therefore, when the information extraction device is used to capture dynamic information about the Taipei City rainfall probability, the α value can be set higher so that the confidence level does not increase too quickly. For example, the α value is set to 0.75 in this exemplary embodiment. In another exemplary embodiment of the present invention, when the information extraction device captures a stock index, which has more digits, the likelihood that the device captures an identical value that is not the specific dynamic information to be captured is relatively low, so the α value is set to 0.5 in that example. It should be understood that the above settings of the α value are only examples; the user may set the α value freely within the spirit of the present invention.
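The effect of the α value in formula (1) can be seen with a short Python sketch. The function names and the starting confidence level of 0 are chosen only for illustration; the two α values are the ones mentioned above.

```python
def update_confidence_level(r_prev, alpha):
    """Formula (1): pull the confidence level toward 1.0 by a share (1 - alpha)."""
    return r_prev * alpha + 1.0 * (1.0 - alpha)


def after_k_matches(r0, alpha, k):
    """Apply formula (1) k times, i.e. after k consecutive matching captures."""
    r = r0
    for _ in range(k):
        r = update_confidence_level(r, alpha)
    return r


for alpha in (0.75, 0.5):
    levels = [round(after_k_matches(0.0, alpha, k), 4) for k in range(1, 4)]
    print(alpha, levels)
# 0.75 [0.25, 0.4375, 0.5781]
# 0.5 [0.5, 0.75, 0.875]
```

Repeated application of formula (1) gives $R = 1 - (1 - R_0)\,\alpha^k$ after k matches, so a higher α value makes the confidence level approach 1 more slowly, which is the behavior described for the rainfall probability case.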
The real-time degree of an information extraction device expresses how promptly the reference value that the device obtains from the information source is updated. That is, the faster the reference value obtained by the information extraction device is updated, the higher its real-time degree. In this exemplary embodiment, the real-time degree is designed as a value between 0 and 1. For example, when the user establishes or re-establishes an information extraction device for a specific information source through initialization by the information extraction device establishing unit 202, the real-time degree of the information extraction device is set to 0.5. At each time point at which the confidence value is updated, the information extraction device weighted value updating unit 602 updates the real-time degree of the information extraction device according to its capture history.
For example, when the reference value that an information extraction device captured at the previous time point is identical to the confidence value of the current time point, the device is judged to be a faster-updating information extraction device and its real-time degree is updated by formula (2). When the reference value that an information extraction device captures at the current time point is identical to the confidence value of the previous time point, the device is judged to be a slower-updating information extraction device and its real-time degree is updated by formula (3). Any other information extraction device is judged to be neither faster-updating nor slower-updating, and its real-time degree is updated by formula (4):
$$T_{n,t} = T_{n,t-1} \times \beta + 1.0 \times (1 - \beta) \qquad (2)$$
$$T_{n,t} = T_{n,t-1} \times \beta + 0.0 \times (1 - \beta) \qquad (3)$$
$$T_{n,t} = T_{n,t-1} \times \beta + 0.5 \times (1 - \beta) \qquad (4)$$
where $T_{n,t}$ denotes the real-time degree of the n-th information extraction device at time point t, $T_{n,t-1}$ denotes the real-time degree of the n-th information extraction device at time point (t-1), and β denotes the real-time degree adjustment parameter. In this exemplary embodiment, the β value determines how strongly the real-time degree is adjusted: the lower the β value, the larger the adjustment, and the higher the β value, the smaller the adjustment. The β value can be set by the user to any value between 0 and 1; it is set to 0.67 in this exemplary embodiment.
For example, referring to Fig. 3, at time point T2 the information extraction device 102 is judged to be a faster-updating information extraction device and its real-time degree is updated by formula (2); the information extraction devices 106, 108 and 110 are judged to be slower-updating information extraction devices and their real-time degrees are updated by formula (3); and the information extraction device 104, which is neither faster-updating nor slower-updating, has its real-time degree updated by formula (4).
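The three-way classification behind formulas (2) to (4) can be sketched in Python as follows. The function name and the sample reference values are hypothetical (they are not the values of Fig. 3); the initial real-time degree of 0.5 and the β value of 0.67 are taken from the description above, and the order of the two comparisons follows the order in which the cases are stated.

```python
def update_realtime_degree(t_prev, ref_prev, ref_cur, conf_prev, conf_cur, beta=0.67):
    """Update the real-time degree at a time point where the confidence value changed."""
    if ref_prev == conf_cur:
        target = 1.0   # formula (2): the device already showed the new value earlier
    elif ref_cur == conf_prev:
        target = 0.0   # formula (3): the device still shows the old value
    else:
        target = 0.5   # formula (4): neither faster nor slower
    return t_prev * beta + target * (1.0 - beta)


# Hypothetical time point where the confidence value moves from 10 to 11.
conf_prev, conf_cur = 10, 11
fast = update_realtime_degree(0.5, ref_prev=11, ref_cur=11,
                              conf_prev=conf_prev, conf_cur=conf_cur)
slow = update_realtime_degree(0.5, ref_prev=10, ref_cur=10,
                              conf_prev=conf_prev, conf_cur=conf_cur)
other = update_realtime_degree(0.5, ref_prev=10, ref_cur=11,
                               conf_prev=conf_prev, conf_cur=conf_cur)
print(round(fast, 3), round(slow, 3), round(other, 3))  # 0.665 0.335 0.5
```

A device that changes in step with the confidence value keeps a real-time degree of 0.5, while devices that lead or lag the confidence value drift toward 1 or 0, respectively.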
Based on the above, once the information extraction and integration unit 206 has calculated the confidence level and the real-time degree of an information extraction device, the weighted value of the information extraction device can be determined. In this exemplary embodiment, the weighted value of each information extraction device is given by formula (5):
$$W_{n,t} = R_{n,t} \times \gamma + T_{n,t} \times (1 - \gamma) \qquad (5)$$
where $W_{n,t}$ denotes the weighted value of the n-th information extraction device at time point t. The γ value can be set by the user to any value between 0 and 1; it is set to 0.75 in this exemplary embodiment.
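As a worked instance of formula (5), consider an information extraction device that has just been initialized, so that its confidence level is 1 and its real-time degree is 0.5, with γ = 0.75 as in this exemplary embodiment. These inputs are the initial settings stated above; the arithmetic is given only as an illustration:

$$W_{n,t} = 1 \times 0.75 + 0.5 \times (1 - 0.75) = 0.75 + 0.125 = 0.875$$

For comparison, a device with the same real-time degree but a confidence level of 0 would receive $W_{n,t} = 0 \times 0.75 + 0.5 \times 0.25 = 0.125$, which shows that with γ = 0.75 the confidence level dominates the weighted value.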
It should be understood that the above way of calculating the confidence level, the real-time degree and the weighted value of each information extraction device is only an example and does not limit the present invention. Those skilled in the art may assign a suitable weight to each information extraction device in other ways without departing from the spirit of the present invention.
Fig. 7 is a flowchart of the information extraction method according to the second exemplary embodiment of the present invention. Compared with the flowchart of Fig. 5B, the flowchart of Fig. 7 further includes step S701: when it is judged in step S521 that the information extraction device is not abnormal, the weighted value of the information extraction device is updated in step S701 and the flow returns to step S519.
Fig. 8 is a detailed flowchart of step S701 in Fig. 7.
Referring to Fig. 8, in step S801 the confidence level of the information extraction device is updated using formula (1). Then, in step S803, it is judged whether the reference value that the information extraction device captured at the previous time point is identical to the confidence value of the current time point. If it is, then in step S805 the real-time degree of the information extraction device is updated using formula (2).
If the reference value that the information extraction device captured at the previous time point is not identical to the confidence value of the current time point, then in step S807 it is judged whether the reference value that the information extraction device captures at the current time point is identical to the confidence value of the previous time point. If it is, then in step S809 the real-time degree of the information extraction device is updated using formula (3). If it is not, then in step S811 the real-time degree of the information extraction device is updated using formula (4).
Finally, in step S813, the updated weighted value of the information extraction device is calculated by formula (5) from the new confidence level and real-time degree of the information extraction device, and is stored.
Accordingly, the information extraction system 600 can determine the confidence value at the next time point from the dynamically updated weighted values of the information extraction devices and the reference values captured at the next time point, which makes the update of the confidence value of the dynamic information more accurate and faster.
[Third exemplary embodiment]
In the second exemplary embodiment, an information extraction device that is judged to be abnormal is removed. In this exemplary embodiment, after an information extraction device is removed, information extraction devices are additionally rebuilt for any information source that is no longer linked to any information extraction device.
Fig. 9 is a schematic block diagram of an information extraction system according to the third exemplary embodiment of the present invention.
Referring to Fig. 9, compared with the information extraction system 600, the information extraction system 900 further comprises an information extraction device reconstruction unit 902. Apart from the information extraction device reconstruction unit 902, the structure and function of the other parts of the information extraction system 900 are the same as those of the information extraction system 600 and are not described again here.
After the information extraction device verification unit 208 removes an abnormal information extraction device, the information extraction device reconstruction unit 902 establishes substitute information extraction devices for any information source that is no longer linked to an information extraction device.
Specifically, after the information extraction device corresponding to an information source is removed, the information extraction device reconstruction unit 902 parses the file of that information source and attempts to re-establish substitute information extraction devices for it.
For example, taking Fig. 4 as an example, when the information extraction device verification unit 208 verifies at time point T2 that the information extraction device 102 is abnormal and removes it, the information extraction device reconstruction unit 902 analyzes web page 112 according to the file that the information extraction device 102 downloaded from web page 112 at time point T2 and parsed into a DOM tree, and establishes a corresponding substitute information extraction device for each field in web page 112 whose value is identical to the confidence value of time point T2 or of time point T1.
For example, if three fields in web page 112 have the value 11% at time point T2, each of these three fields may hold the reference value of the Taipei City rainfall probability, so the information extraction device reconstruction unit 902 establishes a corresponding substitute information extraction device for each of these three fields. Afterwards, the reference values captured by the substitute information extraction devices for these three fields are used, together with the reference values captured by the original information extraction devices in the information extraction system 600, to determine the confidence value of the Taipei City rainfall probability, and each substitute information extraction device is verified and has its weighted value updated in the manner described in the first and second exemplary embodiments. The initial confidence level and the initial real-time degree of a substitute information extraction device can be set to 0 and 0.5, respectively. In particular, when several information extraction devices are linked to the same information source, the weighted value of each of these information extraction devices is divided by the number of information extraction devices linked to that information source in order to balance the weight of the information source, as shown in formula (6):
$$W_{n,t} = \frac{R_{n,t} \times \gamma + T_{n,t} \times (1 - \gamma)}{N_{n,t}} \qquad (6)$$
where $N_{n,t}$ denotes, at time point t, the number of information extraction devices linked to the information source to which the n-th information extraction device is linked.
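Formula (6) can be sketched in Python as follows. The function name and the tuple layout are hypothetical; the three substitute devices use the initial confidence level 0 and real-time degree 0.5 stated above, while the values given for device 104 are merely illustrative.

```python
from collections import Counter


def balanced_weights(devices, gamma=0.75):
    """Formula (6): divide each device's weight by the number of devices
    linked to the same information source.

    devices maps a device id to (source id, confidence level, real-time degree).
    """
    per_source = Counter(src for src, _, _ in devices.values())
    return {dev: (r * gamma + t * (1.0 - gamma)) / per_source[src]
            for dev, (src, r, t) in devices.items()}


devices = {
    "102-1": (112, 0.0, 0.5),   # substitute devices rebuilt for web page 112
    "102-2": (112, 0.0, 0.5),
    "102-3": (112, 0.0, 0.5),
    "104": (114, 1.0, 0.5),     # illustrative values for an original device
}
weights = balanced_weights(devices)
page_112_total = sum(w for dev, w in weights.items() if devices[dev][0] == 112)
print(round(weights["102-1"], 4), weights["104"], round(page_112_total, 4))
# 0.0417 0.875 0.125
```

Because of the division by $N_{n,t}$, the three substitute devices together contribute no more weight for web page 112 than a single device with the same confidence level and real-time degree would.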
Fig. 10 illustrates a capture example according to the third exemplary embodiment of the present invention.
Referring to Fig. 10, the capture process shown in Fig. 10 continues the capture process of Fig. 4; the parts that are identical are not described again here.
As mentioned above, at time point T2 the information extraction device 102 is verified as abnormal and removed. Because no information extraction device is then linked to web page 112, the information extraction device reconstruction unit 902 searches the content of web page 112, finds three possible fields, and re-establishes the substitute information extraction devices 102-1, 102-2 and 102-3 (shown at time point T2').
Afterwards, if at time point T3 the information extraction devices 102-1, 102-2, 102-3, 104, 106, 108 and 110 capture reference values of 12%, 11%, 11%, 12%, 12%, 11% and 11%, respectively, for the Taipei City rainfall probability from web pages 112, 114, 116, 118 and 120, the information extraction and integration unit 206 judges in the manner described above that the confidence value of the Taipei City rainfall probability at time point T3 is 12%. Then, if at time point T4 the information extraction devices 102-1, 102-2, 102-3, 104, 106, 108 and 110 capture reference values of 13%, 11%, 11%, 13%, 13%, 12% and 12%, respectively, from web pages 112, 114, 116, 118 and 120, the information extraction and integration unit 206 judges in the manner described above that the confidence value of the Taipei City rainfall probability at time point T4 is 13%. At this point, the reference value "11%" that the information extraction devices 102-2 and 102-3 captured at time point T3 differs from both the confidence value "12%" of time point T3 and the confidence value "13%" of time point T4, and the reference value "11%" that they captured at time point T4 likewise differs from both confidence values. The information extraction and integration unit 206 therefore judges that the information extraction devices 102-2 and 102-3 are abnormal and removes them.
Therefore, this exemplary embodiment can effectively re-establish substitute information extraction devices for an information source that is no longer linked to any information extraction device, thereby realizing a self-repairing function.
Fig. 11 is a flowchart of the information extraction method according to the third exemplary embodiment of the present invention.
Referring to Fig. 11, compared with the flowchart of Fig. 7, the flowchart of Fig. 11 further includes rebuilding the information extraction device (step S1101) after step S523, after which the flow returns to step S519.
Fig. 12 is a detailed flowchart of step S1101 in Fig. 11.
Referring to Fig. 12, in step S1201 it is judged whether the information source corresponding to the removed information extraction device is no longer linked to any information extraction device. If no information extraction device is linked to this information source, then in step S1203 it is judged, from the DOM tree obtained by parsing the information source, whether there is any field that may correspond to the dynamic information to be captured. If it is judged in step S1203 that such a field exists, then in step S1205 a substitute information extraction device is established for each field that may correspond to the dynamic information to be captured. Afterwards, in step S1207, the initial confidence level and the initial real-time degree are set for each substitute information extraction device.
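Steps S1201 to S1207 can be sketched in Python as follows. This is a simplified illustration under several assumptions: the function name, the record layout and the field locators are hypothetical, and the parsed DOM tree is represented as a flat dictionary from a field locator to its text value instead of being produced by a real parser. Candidate fields are those whose value equals the current or previous confidence value, as in the example of web page 112.

```python
def rebuild_devices(source_id, fields, linked_sources, conf_cur, conf_prev):
    """Return substitute device records for one information source.

    fields maps a field locator (for example, a DOM path) to its text value.
    linked_sources is the set of sources that still have a linked device.
    """
    # S1201: rebuild only when no device is linked to the source any more.
    if source_id in linked_sources:
        return []
    # S1203: a field is a candidate if it holds the current or previous
    # confidence value.
    candidates = [path for path, value in fields.items()
                  if value in (conf_cur, conf_prev)]
    # S1205/S1207: one substitute device per candidate field, with the
    # initial confidence level 0 and real-time degree 0.5.
    return [{"source": source_id, "field": path,
             "confidence_level": 0.0, "realtime_degree": 0.5}
            for path in candidates]


# Hypothetical flattened content of web page 112 at time point T2.
page_112 = {
    "/html/body/div[1]/span": "90%",
    "/html/body/div[2]/td[1]": "11%",
    "/html/body/div[2]/td[2]": "11%",
    "/html/body/div[3]/p": "11%",
}
substitutes = rebuild_devices(112, page_112, linked_sources={114, 116, 118, 120},
                              conf_cur="11%", conf_prev="10%")
print(len(substitutes))  # 3
```

Fields that match only by coincidence are not filtered out at this stage; they are expected to be removed later by the normal verification of the substitute devices.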
[Fourth exemplary embodiment]
Fig. 13 is a schematic block diagram of an information extraction system according to the fourth exemplary embodiment of the present invention.
Referring to Fig. 13, the information extraction system 1300 comprises an information extraction device establishing unit 202', the storage unit 204, the information extraction and integration unit 206, an information extraction device verification unit 208', the information extraction device weighted value updating unit 602 and the information extraction device reconstruction unit 902.
In the information extraction system 900 of the third exemplary embodiment, the information extraction device verification unit 208 verifies whether an information extraction device is abnormal by cross-comparing the current and previous confidence values with the reference values that the device captured at the current and previous time points (in the manner shown in Fig. 4). In the information extraction system 1300 of the fourth exemplary embodiment of the present invention, however, the information extraction device establishing unit 202' additionally sets an abnormality count value for each information extraction device, and the information extraction device verification unit 208' verifies whether an information extraction device is abnormal according to the currently captured reference value, the current confidence value and the abnormality count value of the information extraction device. Except for these differences, the structures and functions of the information extraction device establishing unit 202' and the information extraction device verification unit 208' are the same as those of the information extraction device establishing unit 202 and the information extraction device verification unit 208, respectively. The differences between the fourth exemplary embodiment and the third exemplary embodiment are described in detail below.
In addition to setting the weighted value, the information extraction device establishing unit 202' sets an abnormality count value for each information extraction device that it establishes, and the information extraction device verification unit 208' verifies whether each information extraction device is abnormal by comparing the abnormality count value of the device with an abnormality count threshold. The abnormality count threshold expresses the number of capture errors that the information extraction system 1300 tolerates from an information extraction device. Here, the abnormality count threshold can be any value greater than zero. The operation of the information extraction device verification unit 208' is described in detail below with an example.
Fig. 14 illustrates an example of judging that an information extraction device is abnormal according to the fourth exemplary embodiment of the present invention. In this example, the abnormality count threshold is set to 2, and the number in parentheses represents the abnormality count value of each information extraction device at each time point.
Referring to Fig. 14, at time point T0 the information extraction devices 102, 104, 106, 108 and 110 all capture a reference value of 10% for the Taipei City rainfall probability from web pages 112, 114, 116, 118 and 120; the information extraction and integration unit 206 judges that the confidence value of the Taipei City rainfall probability at time point T0 is 10%; and the abnormality count values of the information extraction devices 102, 104, 106, 108 and 110 are all 0.
At time point T1, the information extraction devices 102, 104, 106, 108 and 110 capture reference values of 90%, 10%, 10%, 10% and 10%, respectively, for the Taipei City rainfall probability from web pages 112, 114, 116, 118 and 120, and the information extraction and integration unit 206 judges that the confidence value of the Taipei City rainfall probability at time point T1 is 10%. At time point T1, the information extraction device verification unit 208' judges that the confidence value has not changed (that is, the confidence value of time point T1 is the same as that of time point T0), so it only performs resetting of the abnormality count values of the information extraction devices. Specifically, when the information extraction device verification unit 208' judges that the reference value currently captured by an information extraction device is identical to the current confidence value, it resets the abnormality count value of that information extraction device (for example, sets it to zero). At time point T1, the reference value captured by the information extraction device 102 differs from the confidence value, so the information extraction device verification unit 208' does not reset the abnormality count value of the information extraction device 102; the reference values captured by the information extraction devices 104, 106, 108 and 110 are identical to the confidence value, so the information extraction device verification unit 208' resets the abnormality count values of the information extraction devices 104, 106, 108 and 110.
At time point T2, the information extraction devices 102, 104, 106, 108 and 110 capture reference values of 90%, 11%, 11%, 10% and 10%, respectively, for the Taipei City rainfall probability from web pages 112, 114, 116, 118 and 120, and the information extraction and integration unit 206 judges that the confidence value of the Taipei City rainfall probability at time point T2 is 11%. At time point T2, the information extraction device verification unit 208' judges that the confidence value has changed (that is, the confidence value of time point T2 differs from that of time point T1), so it performs both resetting and counting of the abnormality count values of the information extraction devices. Specifically, the information extraction device verification unit 208' checks, one device at a time, whether the reference value currently captured by the information extraction device is identical to the current confidence value. When the currently captured reference value differs from the current confidence value, the information extraction device verification unit 208' increments the abnormality count value of the information extraction device (for example, adds 1 to it); when the currently captured reference value is identical to the current confidence value, the information extraction device verification unit 208' resets the abnormality count value of the information extraction device. In addition, the information extraction device verification unit 208' judges whether the updated abnormality count value of the information extraction device has reached the abnormality count threshold; when the abnormality count value of an information extraction device is not less than the abnormality count threshold, the information extraction device verification unit 208' verifies that the information extraction device is abnormal.
At time point T2, the reference values captured by information extraction devices 102, 108 and 110 differ from the confidence value, so the information extraction device authentication unit 208' changes the abnormal count values of information extraction devices 102, 108 and 110 from "0" to "1"; the reference values captured by information extraction devices 104 and 106 are the same as the confidence value, so the information extraction device authentication unit 208' keeps the abnormal count values of information extraction devices 104 and 106 at "0". In addition, because the abnormal count values of all information extraction devices are less than the abnormal count threshold, no information extraction device is verified as abnormal.
At time point T3, information extraction devices 102, 104, 106, 108 and 110 capture reference values of 91%, 12%, 11%, 11% and 11%, respectively, for the Taipei City rainfall probability from webpages 112, 114, 116, 118 and 120, and the information extraction and integral unit 206 determines that the confidence value of the Taipei City rainfall probability at time point T3 is 11%. At time point T3, the information extraction device authentication unit 208' determines that the confidence value has not changed (that is, the confidence value at time point T3 is the same as the confidence value at time point T2), so it only performs resetting of the abnormal count values of the information extraction devices.
At time point T3, the reference values captured by information extraction devices 102 and 104 differ from the confidence value, so the information extraction device authentication unit 208' does not reset the abnormal count values of information extraction devices 102 and 104; the reference values captured by information extraction devices 106, 108 and 110 are the same as the confidence value, so the information extraction device authentication unit 208' resets the abnormal count values of information extraction devices 106, 108 and 110.
At time point T4, information extraction devices 102, 104, 106, 108 and 110 capture reference values of 91%, 12%, 12%, 11% and 11%, respectively, for the Taipei City rainfall probability from webpages 112, 114, 116, 118 and 120, and the information extraction and integral unit 206 determines that the confidence value of the Taipei City rainfall probability at time point T4 is 12%. At time point T4, the information extraction device authentication unit 208' determines that the confidence value has changed (that is, the confidence value at time point T4 differs from the confidence value at time point T3), so it performs resetting and counting of the abnormal count values of the information extraction devices and verifies which information extraction devices are abnormal.
At time point T4, the reference values captured by information extraction devices 104 and 106 are the same as the confidence value, so the information extraction device authentication unit 208' resets the abnormal count values of information extraction devices 104 and 106. The reference values captured by information extraction devices 102, 108 and 110 differ from the confidence value, so the information extraction device authentication unit 208' counts the abnormal count values of information extraction devices 102, 108 and 110 (that is, the abnormal count value of information extraction device 102 changes from "1" to "2"; the abnormal count value of information extraction device 108 changes from "0" to "1"; and the abnormal count value of information extraction device 110 changes from "0" to "1"). In addition, because the abnormal count value of information extraction device 102 is not less than the abnormal count threshold, the information extraction device authentication unit 208' determines that information extraction device 102 is abnormal.
As shown in the example of Figure 14, in this exemplary embodiment the information extraction device authentication unit 208' verifies whether an information extraction device is abnormal by maintaining (that is, resetting and counting) the abnormal count value of that information extraction device.
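The reset-and-count bookkeeping of the T1 to T4 example can be traced with a minimal Python sketch. The function name `update_abnormal_counts` is hypothetical, and two values are assumptions inferred from the example rather than stated in it: every abnormal count value starts at zero, and the abnormal count threshold is 2 (device 102 is flagged when its count reaches 2, while a count of 1 is not flagged).

```python
def update_abnormal_counts(counts, references, confidence, previous_confidence, threshold):
    """Reset or count each device's abnormal count value.

    Returns the list of devices verified as abnormal at this time point.
    """
    confidence_changed = confidence != previous_confidence
    abnormal = []
    for device, reference in references.items():
        if reference == confidence:
            # Matching the current confidence value always resets the count.
            counts[device] = 0
        elif confidence_changed:
            # Mismatches are only counted when the confidence value has changed.
            counts[device] += 1
            if counts[device] >= threshold:
                abnormal.append(device)
    return abnormal


devices = [102, 104, 106, 108, 110]
# (captured reference values in %, confidence value in %) for T1..T4
samples = [
    ([90, 10, 10, 10, 10], 10),
    ([90, 11, 11, 10, 10], 11),
    ([91, 12, 11, 11, 11], 11),
    ([91, 12, 12, 11, 11], 12),
]
THRESHOLD = 2                       # assumed abnormal count threshold
counts = {d: 0 for d in devices}    # assumed initial abnormal count values
previous_confidence = 10            # confidence value at T0
history = []
for values, confidence in samples:
    flagged = update_abnormal_counts(
        counts, dict(zip(devices, values)), confidence, previous_confidence, THRESHOLD
    )
    history.append((dict(counts), flagged))
    previous_confidence = confidence
```

Under these assumptions, `history` holds the counts after each time point, and only the last entry (T4) flags a device, namely 102.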
Figures 15A and 15B are flowcharts illustrating the information extraction method according to the fourth embodiment of the invention, wherein Figure 15A illustrates the steps of establishing the information extraction devices and Figure 15B illustrates the steps of performing information extraction.
Referring to Figure 15A, in step S1501 a plurality of information extraction devices (for example, information extraction devices 102, 104, 106, 108 and 110) are established for a plurality of corresponding information sources (for example, webpages 112, 114, 116, 118 and 120), wherein the information extraction devices can capture the dynamic information (for example, the Taipei City rainfall probability value) from the information sources. Next, in step S1503, the weighted value and the abnormal count value of each information extraction device are set. Then, in step S1505, the reference value captured by each information extraction device at the current time point is recorded, together with the current confidence value corresponding to the dynamic information. Steps S1501, S1503 and S1505 thus complete the initialization of the information extraction devices.
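As an illustration of the state initialized in steps S1501 and S1503, the following Python sketch models one information extraction device that parses its information source with a regular expression (one of the parsing options named in the claims). The class name, the field names, the sample page text and the pattern are all hypothetical; the weight of 0.2 is an arbitrary placeholder, not a value taken from the embodiment.

```python
import re
from dataclasses import dataclass


@dataclass
class InformationExtractionDevice:
    source: str              # identifier of the linked information source
    pattern: str             # hypothetical regular expression locating the value
    weight: float = 1.0      # weighted value set in step S1503
    abnormal_count: int = 0  # abnormal count value set in step S1503

    def capture(self, page_text):
        """Return the captured reference value, or None if the pattern does not match."""
        match = re.search(self.pattern, page_text)
        return int(match.group(1)) if match else None


# Hypothetical page content for webpage 112.
page = "<td>Taipei City</td><td>Rain: 10%</td>"
device = InformationExtractionDevice("webpage 112", r"Rain:\s*(\d+)%", weight=0.2)
reference_value = device.capture(page)
```

A capture that returns `None` corresponds to a device that can no longer locate the value, for example after the page layout changes.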
Next, referring to Figure 15B, in step S1507 the information extraction devices capture reference values from the plurality of information sources.
Afterwards, in step S1509 it is determined whether the reference value captured by any information extraction device has been updated. If it is determined in step S1509 that none of the reference values captured by the information extraction devices has been updated, then in step S1511 the confidence value of the previous time point is used as the confidence value of the current time point. If it is determined in step S1509 that a reference value captured by an information extraction device has been updated, then in step S1513 the update weight value of each updated reference value is calculated according to the weighted values of the information extraction devices. Specifically, step S1513 only calculates the update weight values of the reference values that were updated at this time point; reference values that were not updated are excluded from the calculation of update weight values.
Afterwards, in step S1515 it is determined whether the update weight value of any updated reference value exceeds the update threshold. If it is determined in step S1515 that none of the update weight values of the updated reference values exceeds the update threshold, step S1511 is executed.
If it is determined in step S1515 that the update weight value of an updated reference value exceeds the update threshold, then in step S1517 the updated reference value with the highest update weight value is used as the confidence value of the dynamic information. After the confidence value of the current time point is produced, the information extraction method of this exemplary embodiment resets or counts the abnormal count value of each information extraction device and verifies whether each information extraction device is abnormal according to its updated abnormal count value.
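Steps S1509 to S1517 can be sketched in Python as follows. This is a sketch under stated assumptions, not the embodiment's exact computation: the passage above does not give the formula for the update weight value, so the sketch assumes it is the sum of the weighted values of the devices that report the same updated reference value. The function name, the equal weights of 0.2 and the update threshold of 0.3 are hypothetical.

```python
def determine_confidence(previous_refs, current_refs, weights, previous_confidence, update_threshold):
    """Return the confidence value for the current time point."""
    # S1509: keep only the reference values that changed since the previous time point.
    updated = {d: v for d, v in current_refs.items() if v != previous_refs[d]}
    if not updated:
        return previous_confidence  # S1511
    # S1513 (assumed formula): sum the weights of devices reporting each updated value.
    update_weights = {}
    for device, value in updated.items():
        update_weights[value] = update_weights.get(value, 0.0) + weights[device]
    best_value, best_weight = max(update_weights.items(), key=lambda item: item[1])
    # S1515: the highest update weight value must exceed the update threshold.
    if best_weight <= update_threshold:
        return previous_confidence  # S1511
    return best_value  # S1517


weights = {102: 0.2, 104: 0.2, 106: 0.2, 108: 0.2, 110: 0.2}  # hypothetical weights
t1 = {102: 90, 104: 10, 106: 10, 108: 10, 110: 10}
t2 = {102: 90, 104: 11, 106: 11, 108: 10, 110: 10}
confidence_t2 = determine_confidence(t1, t2, weights, previous_confidence=10, update_threshold=0.3)
unchanged = determine_confidence(t2, t2, weights, previous_confidence=11, update_threshold=0.3)
```

With these hypothetical weights, the two devices that moved to 11% carry a combined update weight of 0.4, which exceeds the threshold, so the confidence value becomes 11; when nothing is updated, the previous confidence value is kept.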
In step S1519, it is determined whether any information extraction device remains to be verified. If all information extraction devices have been verified, the flow shown in Figure 15B ends. If an information extraction device remains to be verified, it is determined in step S1521 whether the confidence value has changed.
If it is determined in step S1521 that the confidence value has not changed, then in step S1523 it is determined whether the reference value currently captured by the information extraction device to be verified is the same as the current confidence value. If it is determined in step S1523 that the reference value currently captured by the information extraction device to be verified is the same as the current confidence value, then in step S1525 the abnormal count value of the information extraction device to be verified is reset (that is, the abnormal count value of the information extraction device to be verified is set to zero).
If it is determined in step S1521 that the confidence value has changed, then in step S1527 it is determined whether the reference value currently captured by the information extraction device to be verified is the same as the current confidence value. If it is determined in step S1527 that the reference value currently captured by the information extraction device to be verified is the same as the current confidence value, then in step S1529 the abnormal count value of the information extraction device to be verified is reset.
Next, in step S1531 the weighted value of the information extraction device is updated. Specifically, in step S1531 the confidence level and the real-time degree of the information extraction device are updated, and a new weighted value of the information extraction device is calculated according to the updated confidence level and real-time degree. The detailed operation of step S1531 is the same as that of step S701 (as shown in Figure 8) and is not repeated here.
If it is determined in step S1527 that the reference value currently captured by the information extraction device to be verified differs from the current confidence value, then in step S1533 the abnormal count value of the information extraction device to be verified is counted (for example, 1 is added to the abnormal count value), and in step S1535 it is determined whether the abnormal count value of the information extraction device to be verified is less than the abnormal count threshold.
If it is determined in step S1535 that the abnormal count value of the information extraction device to be verified is less than the abnormal count threshold, step S1531 is executed. If it is determined in step S1535 that the abnormal count value of the information extraction device to be verified is not less than the abnormal count threshold, then in step S1537 the information extraction device is removed. Afterwards, in step S1539 the information extraction device is rebuilt, and the flow returns to step S1519. Here, step S1537 is the same as step S523 and step S1539 is the same as step S1101, so they are not described again.
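The removal and rebuilding of steps S1537 and S1539 can be pictured with the short Python sketch below. It only shows the bookkeeping: every information source left without a device after removal receives a replacement. The function name `remove_and_rebuild`, the dictionary layout and the `build_device` callback are hypothetical; how a replacement device is actually constructed is defined in step S1101, which is outside this passage.

```python
def remove_and_rebuild(sources, devices, abnormal_sources, build_device):
    """Remove abnormal devices, then rebuild a device for every unlinked source."""
    # S1537: remove each device that was verified as abnormal.
    for source in abnormal_sources:
        devices.pop(source, None)
    # S1539: establish a replacement for each source that no longer has a device.
    for source in sources:
        if source not in devices:
            devices[source] = build_device(source)
    return devices


sources = ["webpage 112", "webpage 114"]
devices = {
    "webpage 112": {"weight": 0.2, "abnormal_count": 2},
    "webpage 114": {"weight": 0.2, "abnormal_count": 0},
}
# Hypothetical builder: a fresh device starts with a zero abnormal count value.
devices = remove_and_rebuild(
    sources, devices, ["webpage 112"], lambda source: {"weight": 0.2, "abnormal_count": 0}
)
```

After the call, the device for "webpage 112" is a fresh one with its abnormal count value back at zero, and the device for "webpage 114" is untouched.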
It is worth mentioning that the information extraction methods described in the first, second, third and fourth exemplary embodiments of the invention are carried out by the corresponding information extraction systems shown in Figure 2, Figure 6, Figure 9 and Figure 13. However, the invention is not limited thereto; the above information extraction methods may also be encoded as a software program and stored in a storage medium, such that the above information extraction steps are carried out when a terminal having a processor unit executes the software program. In addition, it should be understood that the flows described in Figure 5A, Figure 5B, Figure 7, Figure 8, Figure 11, Figure 12, Figure 15A and Figure 15B are only examples, and the invention is not limited to the execution order and steps described therein.
In summary, the invention uses a plurality of information extraction devices to capture reference values of the corresponding dynamic information from a plurality of information sources and determines the confidence value of the dynamic information according to the weighted values of the information extraction devices and the update threshold, thereby reliably determining the value of the dynamic information. In addition, the invention dynamically adjusts the weighted value of each information extraction device according to its capture history, which makes updates to the confidence value of the dynamic information more accurate and faster. Moreover, the invention provides an information extraction device rebuilding mechanism, which gives the information extraction system a self-repair capability so that it does not lose the ability to capture information when an information source is revised.
Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Those skilled in the art may make some changes and refinements without departing from the spirit and scope of the invention, so the protection scope of the invention is defined by the claims of the invention.
Claims (33)
1. An information extraction method, applicable to capturing a piece of dynamic information, comprising:
Establishing a plurality of information extraction devices linked to a plurality of information sources, wherein each of the information sources provides the dynamic information;
Setting a weighted value for each of the information extraction devices;
Recording a first reference value about the dynamic information captured at a first time point by each of the information extraction devices from the corresponding information source, and determining a first confidence value corresponding to the dynamic information according to the first reference values;
Using each of the information extraction devices to capture, at a second time point, a second reference value about the dynamic information from the corresponding information source;
Determining, according to the weighted values and the second reference values, a second confidence value corresponding to the dynamic information at the second time point; and
Verifying whether the information extraction devices are abnormal, wherein an information extraction device verified as abnormal is removed,
Wherein the step of verifying whether the information extraction devices are abnormal comprises verifying whether the information extraction devices are abnormal according to the second reference values and the second confidence value.
2. The information extraction method as claimed in claim 1, further comprising calculating the weighted value of each of the information extraction devices according to a confidence level and a real-time degree of that information extraction device.
3. The information extraction method as claimed in claim 2, further comprising dynamically updating the confidence level, the real-time degree and the weighted value of each of the information extraction devices.
4. The information extraction method as claimed in claim 1, further comprising, after removing the abnormal information extraction device, establishing at least one alternative information extraction device for each of the information sources that is not linked to an information extraction device.
5. The information extraction method as claimed in claim 4, further comprising:
Using each of the information extraction devices and the at least one alternative information extraction device to capture, at a third time point, a third reference value about the dynamic information from the corresponding information source;
Determining, according to the third reference values, a third confidence value corresponding to the dynamic information at the third time point; and
Verifying, according to the second reference values, the third reference values, the second confidence value and the third confidence value, whether the information extraction devices or the at least one alternative information extraction device are abnormal, wherein an information extraction device or alternative information extraction device verified as abnormal is removed.
6. The information extraction method as claimed in claim 1, wherein the step of establishing the information extraction devices linked to the information sources comprises parsing the linked information sources with a document object model tree, a finite state machine or a regular expression.
7. The information extraction method as claimed in claim 1, wherein the step of verifying whether the information extraction devices are abnormal comprises:
Verifying whether the information extraction devices are abnormal according to the first reference values, the second reference values, the first confidence value and the second confidence value.
8. The information extraction method as claimed in claim 1, wherein the step of verifying whether the information extraction devices are abnormal further comprises:
Determining whether the second reference value captured by each information extraction device differs from the second confidence value, and counting the abnormal count value of each information extraction device whose second reference value differs from the second confidence value; and
Determining whether the abnormal count value of each information extraction device is less than an abnormal count threshold, and verifying as abnormal each information extraction device whose abnormal count value is not less than the abnormal count threshold.
9. The information extraction method as claimed in claim 8, wherein the step of determining whether the second reference value captured by each information extraction device differs from the second confidence value and counting the abnormal count value of each information extraction device whose second reference value differs from the second confidence value is performed when the first confidence value differs from the second confidence value.
10. An information extraction system, applicable to capturing a piece of dynamic information, the information extraction system comprising:
An information extraction device establishing unit, configured to establish a plurality of information extraction devices linked to a plurality of information sources and to set a weighted value of each of the information extraction devices, wherein each of the information sources provides the dynamic information;
A storage unit, configured to store a first reference value about the dynamic information captured at a first time point by each of the information extraction devices from the corresponding information source, and a first confidence value corresponding to the dynamic information;
An information extraction and integral unit, configured to use each of the information extraction devices to capture, at a second time point, a second reference value about the dynamic information from the corresponding information source, and to determine, according to the weighted values and the second reference values, a second confidence value corresponding to the dynamic information at the second time point; and
An information extraction device authentication unit, configured to verify whether the information extraction devices are abnormal according to the second reference values and the second confidence value, wherein an information extraction device verified as abnormal is removed.
11. The information extraction system as claimed in claim 10, wherein the weighted value of each of the information extraction devices is calculated according to a confidence level and a real-time degree of that information extraction device.
12. The information extraction system as claimed in claim 11, further comprising an information extraction device weighted value updating unit, configured to dynamically update the confidence level, the real-time degree and the weighted value of each of the information extraction devices.
13. The information extraction system as claimed in claim 10, further comprising an information extraction device rebuilding unit, wherein the information extraction device rebuilding unit is configured to establish at least one alternative information extraction device for each of the information sources that is not linked to an information extraction device after the information extraction device authentication unit removes the abnormal information extraction device.
14. The information extraction system as claimed in claim 13, wherein the information extraction and integral unit is further configured to use each of the information extraction devices and the at least one alternative information extraction device to capture, at a third time point, a third reference value about the dynamic information from the corresponding information source, and to determine, according to the third reference values, a third confidence value corresponding to the dynamic information at the third time point.
15. The information extraction system as claimed in claim 14, wherein the information extraction device authentication unit is further configured to verify, according to the second reference values, the third reference values, the second confidence value and the third confidence value, whether the information extraction devices or the at least one alternative information extraction device are abnormal, wherein an information extraction device or alternative information extraction device verified as abnormal is removed.
16. The information extraction system as claimed in claim 10, wherein the information extraction devices parse the linked information sources with a document object model tree, a finite state machine or a regular expression.
17. The information extraction system as claimed in claim 10, wherein the information extraction device authentication unit verifies whether the information extraction devices are abnormal according to the first reference values, the second reference values, the first confidence value and the second confidence value.
18. The information extraction system as claimed in claim 10, wherein the information extraction device authentication unit determines whether the second reference value captured by each information extraction device differs from the second confidence value and counts the abnormal count value of each information extraction device whose second reference value differs from the second confidence value,
Wherein the information extraction device authentication unit determines whether the abnormal count value of each information extraction device is less than an abnormal count threshold, and verifies as abnormal each information extraction device whose abnormal count value is not less than the abnormal count threshold.
19. The information extraction system as claimed in claim 18, wherein the information extraction device authentication unit, when the second confidence value differs from the first confidence value, determines whether the second reference value captured by each information extraction device differs from the second confidence value and counts the abnormal count value of each information extraction device whose second reference value differs from the second confidence value.
20. An information extraction device rebuilding method, applicable to an information extraction system, wherein the information extraction system captures a piece of dynamic information from a plurality of information sources linked by a plurality of information extraction devices and each of the information extraction devices has a weighted value, the information extraction device rebuilding method comprising:
Using each of the information extraction devices to capture, at a second time point, a second reference value about the dynamic information from the corresponding information source;
Determining, according to the weighted values and the second reference values, a second confidence value corresponding to the dynamic information at the second time point;
Establishing, according to the second confidence value, at least one alternative information extraction device for each of the information sources that is not linked to an information extraction device;
Using each of the information extraction devices and the at least one alternative information extraction device to capture, at a third time point, a third reference value about the dynamic information from the corresponding information source;
Determining, according to the weighted values and the third reference values, a third confidence value corresponding to the dynamic information at the third time point; and
Verifying whether the information extraction devices or the at least one alternative information extraction device are abnormal, wherein an information extraction device or alternative information extraction device verified as abnormal is removed,
Wherein the step of verifying whether the information extraction devices or the at least one alternative information extraction device are abnormal comprises:
Verifying whether the information extraction devices or the at least one alternative information extraction device are abnormal according to the third reference values and the third confidence value.
21. The information extraction device rebuilding method as claimed in claim 20, further comprising calculating the weighted value of each of the information extraction devices according to a confidence level and a real-time degree of that information extraction device.
22. The information extraction device rebuilding method as claimed in claim 21, further comprising dynamically updating the confidence level, the real-time degree and the weighted value of each of the information extraction devices.
23. The information extraction device rebuilding method as claimed in claim 20, wherein the step of establishing the at least one alternative information extraction device comprises parsing the linked information sources with a document object model tree, a finite state machine or a regular expression.
24. The information extraction device rebuilding method as claimed in claim 20, wherein the step of verifying whether the information extraction devices or the at least one alternative information extraction device are abnormal comprises:
Verifying whether the information extraction devices or the at least one alternative information extraction device are abnormal according to the second reference values, the third reference values, the second confidence value and the third confidence value.
25. The information extraction device rebuilding method as claimed in claim 20, wherein the step of verifying whether the information extraction devices or the at least one alternative information extraction device are abnormal comprises:
Determining whether the third reference value captured by each information extraction device or alternative information extraction device differs from the third confidence value, and counting the abnormal count value of each information extraction device whose third reference value differs from the third confidence value; and
Determining whether the abnormal count value of each information extraction device or alternative information extraction device is less than an abnormal count threshold, and verifying as abnormal each information extraction device or alternative information extraction device whose abnormal count value is not less than the abnormal count threshold.
26. The information extraction device rebuilding method as claimed in claim 25, wherein the step of determining whether the third reference value captured by each information extraction device or alternative information extraction device differs from the third confidence value and counting the abnormal count value of each information extraction device whose third reference value differs from the third confidence value is performed when the second confidence value differs from the third confidence value.
27. An information extraction device rebuilding system, applicable to an information extraction system, wherein the information extraction system captures a piece of dynamic information from a plurality of information sources linked by a plurality of information extraction devices and each of the information extraction devices has a weighted value, the information extraction device rebuilding system comprising:
An information extraction and integral unit, configured to use each of the information extraction devices to capture, at a second time point, a second reference value about the dynamic information from the corresponding information source, and to determine, according to the weighted values and the second reference values, a second confidence value corresponding to the dynamic information at the second time point;
An information extraction device rebuilding unit, configured to establish, according to the second confidence value, at least one alternative information extraction device for each of the information sources that is not linked to an information extraction device, wherein the information extraction and integral unit is further configured to use each of the information extraction devices and the at least one alternative information extraction device to capture, at a third time point, a third reference value about the dynamic information from the corresponding information source, and to determine, according to the weighted values and the third reference values, a third confidence value corresponding to the dynamic information at the third time point; and
An information extraction device authentication unit, configured to verify whether the information extraction devices or the at least one alternative information extraction device are abnormal according to the third reference values and the third confidence value, wherein an information extraction device or alternative information extraction device verified as abnormal is removed.
28. The information extractor rebuilding system as claimed in claim 27, wherein the weight value of each information extractor is calculated according to a confidence level and a real-time degree of that information extractor.
29. The information extractor rebuilding system as claimed in claim 28, further comprising an information extractor weight value updating unit, configured to dynamically update the confidence level, real-time degree, and weight value of each information extractor.
30. The information extractor rebuilding system as claimed in claim 27, wherein the at least one alternative information extractor parses the linked information source using a document object model tree, a finite state machine, or a regular expression.
31. The information extractor rebuilding system as claimed in claim 27, wherein the information extractor verification unit verifies, according to the second reference values, the third reference values, the second confidence value, and the third confidence value, whether the information extractor or the at least one alternative information extractor is abnormal.
32. The information extractor rebuilding system as claimed in claim 27, wherein the information extractor verification unit judges whether the third reference value captured by the information extractor or the at least one alternative information extractor differs from the third confidence value, and counts the abnormal count of each information extractor or alternative information extractor whose third reference value differs from the third confidence value,
wherein the information extractor verification unit judges whether the abnormal count of the information extractor or the at least one alternative information extractor is less than an abnormal count threshold, and verifies as abnormal any information extractor or alternative information extractor whose abnormal count is not less than the abnormal count threshold.
33. The information extractor rebuilding system as claimed in claim 32, wherein the information extractor verification unit, when the third confidence value differs from the second confidence value, judges whether the third reference value captured by the information extractor differs from the third confidence value and counts the abnormal count of each information extractor or alternative information extractor whose third reference value differs from the third confidence value.
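The integration and verification steps recited in claims 27, 32, and 33 can be illustrated with a minimal sketch. The claims do not state how the confidence value is derived from the weight values and reference values, so the weighted vote used here is an assumption, and the function names, extractor identifiers, and sample values are all hypothetical.

```python
from collections import defaultdict


def judge_confidence_value(reference_values, weights):
    """Assumed rule: pick the reference value backed by the largest total weight."""
    totals = defaultdict(float)
    for extractor_id, value in reference_values.items():
        totals[value] += weights[extractor_id]
    return max(totals, key=totals.get)


def count_abnormal(reference_values, confidence, previous_confidence, counts):
    """Increment the abnormal count of every extractor that disagrees.

    Following claim 33, counting only happens when the confidence value
    changed between the two time points.
    """
    if confidence == previous_confidence:
        return counts
    for extractor_id, value in reference_values.items():
        if value != confidence:
            counts[extractor_id] = counts.get(extractor_id, 0) + 1
    return counts


def verify_abnormal(counts, threshold):
    """Extractors whose abnormal count is not less than the threshold."""
    return sorted(e for e, c in counts.items() if c >= threshold)


# Hypothetical extractors and weight values (illustration only).
weights = {"a": 0.5, "b": 0.3, "c": 0.2}
second_values = {"a": "10.0", "b": "10.0", "c": "10.0"}
third_values = {"a": "10.5", "b": "10.5", "c": "10.0"}

second_confidence = judge_confidence_value(second_values, weights)
third_confidence = judge_confidence_value(third_values, weights)
abnormal_counts = count_abnormal(
    third_values, third_confidence, second_confidence, {}
)
to_remove = verify_abnormal(abnormal_counts, threshold=1)
```

In this illustration extractor `c` still reports the old value after the confidence value has changed, so its abnormal count reaches the threshold and it is selected for removal. A real system would presumably use a higher threshold so that a single late update does not trigger removal.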
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200910259007 CN101770505B (en) | 2008-12-31 | 2009-12-09 | Information capturing method and capturer reestablishing method and system |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200810190377.2 | 2008-12-31 | ||
CN200810190377 | 2008-12-31 | ||
CN 200910259007 CN101770505B (en) | 2008-12-31 | 2009-12-09 | Information capturing method and capturer reestablishing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101770505A (en) | 2010-07-07 |
CN101770505B (en) | 2013-03-13 |
Family
ID=42503368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200910259007 Expired - Fee Related CN101770505B (en) | 2008-12-31 | 2009-12-09 | Information capturing method and capturer reestablishing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101770505B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107369081B (en) * | 2017-07-19 | 2021-07-27 | 无锡企业征信有限公司 | System and method for determining data validity by using dynamic influence factors of data source |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1808425A (en) * | 2005-01-21 | 2006-07-26 | 林修平 | Real-time data search system applied in communication system |
CN101197790A (en) * | 2007-12-27 | 2008-06-11 | 腾讯科技(深圳)有限公司 | Method and device for acquiring latest dynamic information of users in instant communication |
Also Published As
Publication number | Publication date |
---|---|
CN101770505A (en) | 2010-07-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130313; Termination date: 20211209 |