WO2024078070A1 - Data acquisition resource quantity control method, device, equipment and storage medium - Google Patents

Data acquisition resource quantity control method, device, equipment and storage medium

Info

Publication number
WO2024078070A1
WO2024078070A1 PCT/CN2023/106837 CN2023106837W WO2024078070A1 WO 2024078070 A1 WO2024078070 A1 WO 2024078070A1 CN 2023106837 W CN2023106837 W CN 2023106837W WO 2024078070 A1 WO2024078070 A1 WO 2024078070A1
Authority
WO
WIPO (PCT)
Prior art keywords
collection
data
period
cycle
historical
Prior art date
Application number
PCT/CN2023/106837
Other languages
English (en)
French (fr)
Inventor
盛国军
陈录城
王勇
鲁效平
王迷珍
Original Assignee
卡奥斯工业智能研究院(青岛)有限公司
卡奥斯物联科技股份有限公司
海尔数字科技(青岛)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 卡奥斯工业智能研究院(青岛)有限公司, 卡奥斯物联科技股份有限公司, 海尔数字科技(青岛)有限公司
Publication of WO2024078070A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention belongs to the field of Internet information technology, and specifically relates to a data acquisition resource quantity control method, device, equipment and storage medium.
  • the inventors have found that the related technology has at least the following technical problems: since the data of each website may change at any time, using fixed resources to obtain data from a specified website will result in the obtained data not being the latest data and having a problem of poor timeliness.
  • the present application provides a data acquisition resource quantity control method, device, equipment and storage medium to solve the problem of poor timeliness of acquired data.
  • the present invention provides a data acquisition resource quantity control method, comprising:
  • the collection object includes a website
  • the collection data includes the content in the collected website
  • the historical collection cycle is any collection cycle before the current collection cycle
  • determine the collection status of any collection object based on the preset expected collection cycle and at least one historical collection cycle
  • calculate the comprehensive heat of any collection object based on the matching data volume, data views, collection data volume, and expected collection cycle
  • determine the target number of resources for any collection object based on the collection status, historical collection cycle, expected collection cycle, comprehensive heat, and number of allocated resources of any collection object; and obtain data of any collection object by allocating resources of the target number.
  • the collection state of the collection object is determined according to the expected collection cycle and at least one historical collection cycle; the comprehensive heat of the collection object is calculated from the matching data volume, the data views, the collected data volume, and the expected collection cycle; the target number of resources is obtained according to the collection state, historical collection cycle, expected collection cycle, comprehensive heat and number of allocated resources of the collection object; and resources of the target number are allocated to obtain the data of any collection object. Since the collection status and comprehensive heat of the collection object are used to update the number of resources used to obtain the data of the collection object, the timeliness of the obtained data is improved.
  • the acquisition state of any acquisition object is determined according to a preset expected acquisition cycle and at least one historical acquisition cycle, including: subtracting the expected acquisition cycle from the average value of at least one historical acquisition cycle of any acquisition object to obtain a cycle difference; if the ratio of the cycle difference to the expected acquisition cycle is greater than or equal to a first preset value, the acquisition state of any acquisition object is determined to be a broken line state; if the ratio of the cycle difference to the expected acquisition cycle is less than or equal to a second preset value, the acquisition state of any acquisition object is determined to be an idle state; if the ratio of the cycle difference to the expected acquisition cycle is less than the first preset value and greater than the second preset value, the acquisition state of any acquisition object is determined to be a normal state.
  • the cycle difference is obtained by subtracting the expected collection cycle from the average value of a preset number of historical collection cycles of the collection object, and the ratio of the cycle difference to the expected collection cycle is compared with the first preset value and the second preset value. When the ratio is greater than or equal to the first preset value, the collection state is determined as a broken line state; when it is less than or equal to the second preset value, the collection state is determined as an idle state; when it is greater than the second preset value and less than the first preset value, the collection state is determined as a normal state.
  • the comprehensive heat of any collection object is calculated according to the matched data volume, data browsing volume, collected data volume and expected collection period, including: calculating the historical heat of any collection object according to the matched data volume, data browsing volume and collected data volume; determining a preset number of historical collection periods as a recording period; subtracting the collected data volume at the beginning of the first recording period from the collected data volume at the end of the first recording period to obtain the collected data volume of the first recording period, wherein the first recording period is the Nth recording period before the current time and N is a positive integer; subtracting the collected data volume at the beginning of the second recording period from the collected data volume at the end of the second recording period to obtain the collected data volume of the second recording period, wherein the second recording period is the (N+1)th recording period before the current time; subtracting the collected data volume of the first recording period from the collected data volume of the second recording period to obtain the amount of newly added data; dividing the amount of newly added data by the expected collection period and taking the logarithm to obtain the actual heat of any collection object; mapping the historical heat and the actual heat respectively into a preset interval in a preset manner to obtain the mapped historical heat and the mapped actual heat; and computing a weighted sum of the mapped historical heat and the mapped actual heat to obtain the comprehensive heat of any collection object.
  • the amount of new data is obtained by subtracting the amount of data collected in the first recording period from the amount of data collected in the second recording period.
  • the actual heat is obtained based on the amount of new data and the expected collection period. After mapping the actual heat and the historical heat, the comprehensive heat is determined.
  • the historical heat and actual heat of the collection object can be further considered comprehensively, so that the number of target resources obtained subsequently is more in line with the data heat, thereby increasing the timeliness of the data.
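The heat calculation described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the natural logarithm, the linear min-max mapping used as the "preset manner", the target interval [0, 1], the equal default weights, and all function and parameter names are assumptions made for illustration. The historical heat is taken as an input because its formula is not reproduced in this text.

```python
import math


def actual_heat(new_data: float, expected_period: float) -> float:
    """Actual heat = log(new_data / expected_period).

    The logarithm base is not stated in the text; natural log is assumed.
    """
    if new_data <= 0 or expected_period <= 0:
        raise ValueError("new data volume and expected period must be positive")
    return math.log(new_data / expected_period)


def map_to_interval(value: float, lo: float, hi: float,
                    target: tuple = (0.0, 1.0)) -> float:
    """Map value from [lo, hi] into the target interval.

    Linear min-max mapping with clamping is an assumed 'preset manner'.
    """
    a, b = target
    if hi == lo:
        return a
    clamped = min(max(value, lo), hi)
    return a + (clamped - lo) * (b - a) / (hi - lo)


def comprehensive_heat(hist_heat: float, act_heat: float,
                       hist_range: tuple, act_range: tuple,
                       w_hist: float = 0.5, w_act: float = 0.5) -> float:
    """Weighted sum of the mapped historical heat and mapped actual heat.

    The weights are hypothetical; the text only says 'weighted sum'.
    """
    mapped_hist = map_to_interval(hist_heat, *hist_range)
    mapped_act = map_to_interval(act_heat, *act_range)
    return w_hist * mapped_hist + w_act * mapped_act
```

Mapping both heats into the same interval before summing keeps either term from dominating purely because of its scale, which appears to be the purpose of the mapping step.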
  • the historical popularity of any collected object is calculated based on the amount of matched data, the amount of browsing, and the amount of collected data.
  • the formula used is as follows:
  • $hot_{history}$ represents the historical popularity of any collected object
  • $num_{match}$ represents the amount of matched data in the collected data
  • $read_{num}$ represents the number of data views of the collected data
  • $record_{num}$ represents the amount of collected data
  • A, B, and C all represent constants
  • log represents taking the logarithm.
  • the target number of resources for any collection object is determined according to the collection state, historical collection cycle, expected collection cycle, comprehensive heat and allocated resource number of any collection object, including: dividing the historical collection cycle of each collection object by the expected collection cycle to obtain the time limit excess ratio of each collection object; multiplying the comprehensive heat of each collection object by the time limit excess ratio to obtain the product, and taking the logarithm of the product to obtain the over-limit heat value of each collection object; determining the resource number difference according to the comprehensive heat, historical collection cycle and expected collection cycle of any collection object and the maximum and minimum values among the over-limit heat values of all collection objects; if the collection state of any collection object is a broken line state, adding the resource number difference to the allocated resource number of any collection object to obtain the target number of resources for any collection object; if the collection state of any collection object is an idle state, subtracting the resource number difference from the allocated resource number of any collection object to obtain the target number of resources for any collection object.
  • the historical collection cycle of each collection object is divided by the expected collection cycle to obtain the time limit excess ratio of each collection object; the comprehensive heat of each collection object is multiplied by the time limit excess ratio, and the logarithm of the product is taken to obtain the over-limit heat value of each collection object.
  • the resource number difference is then calculated, and it is added to or subtracted from the number of allocated resources according to the collection status of the collection object to obtain the target number of resources of any collection object. This increases the number of resources used by collection objects in the broken line state, with increases prioritized for objects with high comprehensive heat and serious timeouts, and reduces the number of resources used by collection objects in the idle state, with reductions prioritized for objects with low comprehensive heat and no timeout.
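The over-limit heat value and the state-dependent adjustment described above can be sketched as follows. This is a sketch under stated assumptions, not the claimed formula: the natural logarithm is assumed, the resource number difference is passed in as an input because its formula is not reproduced in this text, leaving the allocation unchanged in the normal state is an assumption, and the state labels and function names are hypothetical.

```python
import math


def over_limit_heat(hot_combine: float, t_real: float, t_expect: float) -> float:
    """Over-limit heat value = log(comprehensive heat * time limit excess ratio)."""
    excess_ratio = t_real / t_expect  # historical cycle / expected cycle
    product = hot_combine * excess_ratio
    if product <= 0:
        raise ValueError("comprehensive heat and cycles must be positive")
    return math.log(product)  # natural log assumed


def target_resources(state: str, allocated: int, diff: int) -> int:
    """Apply the resource number difference according to the collection state.

    'diff' is assumed to be computed elsewhere. No lower bound is applied
    because the text does not specify one.
    """
    if state == "broken_line":
        return allocated + diff
    if state == "idle":
        return allocated - diff
    return allocated  # normal state: unchanged (assumption)
```

For example, an object with comprehensive heat 2.0 whose last collection took 3 hours against an expected 2 hours has an excess ratio of 1.5 and an over-limit heat value of log(3.0).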
  • the resource number difference is determined according to the maximum and minimum values of the comprehensive heat of any collection object, the historical collection cycle, the expected collection cycle, and the over-limit heat values of all collection objects.
  • the formula used is as follows:
  • represents the difference in the number of resources
  • $V_{max}$ represents the maximum value of the over-limit heat values of all collection objects
  • $V_{min}$ represents the minimum value of the over-limit heat values of all collection objects
  • $hot_{combine}$ represents the comprehensive heat of any collection object
  • $t_{real}$ represents the historical collection cycle
  • $t_{expect}$ represents the expected collection cycle
  • D, E, F, and G all represent constants
  • log represents taking the logarithm.
  • after acquiring data of any collection object with resources of the target number, the method further includes: subtracting a new historical collection period from an expected collection period to obtain a new period difference; if the ratio of the new period difference to the expected collection period is less than a preset ratio, using the target number of resources as a fixed number of resources to acquire data of any collection object with resources of the fixed number; if the ratio of the new period difference to the expected collection period is greater than or equal to the preset ratio, and the amount of new data of any collection object within the preset number of periods is greater than or equal to the preset value, repeating the step of adjusting the target number of resources; if the ratio of the new period difference to the expected collection period is greater than or equal to the preset ratio, and the amount of new data of any collection object within the preset number of periods is less than the preset value, outputting an error report.
  • by subtracting the new historical collection period from the expected collection period, a new period difference is obtained, which reflects whether the period has lengthened or shortened.
  • if the ratio of the new period difference to the expected collection period is less than the preset ratio, the target number of resources is used as the fixed number of resources, and the fixed number of resources is used to obtain data in subsequent periods. If the ratio is greater than or equal to the preset ratio, and the amount of new data within the preset number of periods is greater than or equal to the preset value, the step of adjusting the target number of resources is repeated. If the ratio is greater than or equal to the preset ratio and the amount of newly added data is less than the preset value, an error report is output. In other words, when the target number of resources matches the collection object, it is kept for data collection; when the data of the collection object increases significantly, the target number of resources is adjusted again; when the data of the collection object increases little but the new collection period is longer than the original historical collection period, this is treated as an error and an error report is output to prompt the user to investigate manually.
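The three-way decision made after an adjustment can be sketched in Python. The sign convention is an assumption: the period difference is computed here as the new period minus the expected period, so that a longer-than-expected period gives a positive ratio, which matches the remark about the new period being longer. The function name, parameter names and returned labels are hypothetical.

```python
def post_adjustment_action(new_period: float, expected_period: float,
                           new_data: float, preset_ratio: float,
                           preset_value: float) -> str:
    """Decide what to do after collecting with the target number of resources.

    Returns one of:
      "fix"      - keep the target number as the fixed number of resources
      "readjust" - repeat the target-number adjustment step
      "error"    - output an error report for manual investigation
    """
    # Assumed sign convention: positive when the new period overruns.
    ratio = (new_period - expected_period) / expected_period
    if ratio < preset_ratio:
        return "fix"
    if new_data >= preset_value:
        return "readjust"
    return "error"
```

With an expected period of 120 seconds and a preset ratio of 10%, a new period of 125 seconds keeps the current allocation, while a new period of 150 seconds triggers either readjustment or an error report depending on how much new data arrived.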
  • the present application also provides a data collection resource quantity control device, including: a first acquisition module, used to obtain the amount of collected data within a preset time corresponding to any collection object, the amount of data in the collected data that matches the preset hotspot, and the amount of data views obtained by the collection, and read the pre-stored historical collection cycles corresponding to any collection object and the number of allocated resources for the current collection cycle, wherein the collection object includes a website, the collection data includes the content in the collected website, and the historical collection cycle is any collection cycle before the current collection cycle; a first determination module, used to determine the collection status of any collection object according to a preset expected collection cycle and at least one historical collection cycle; a calculation module, used to calculate the comprehensive heat of any collection object according to the matching data volume, data views, collection data volume and expected collection cycle; a second determination module, used to determine the target number of resources for any collection object according to the collection status, historical collection cycle, expected collection cycle, comprehensive heat and number of allocated resources of any collection object; a second acquisition module
  • the present application also provides an electronic device comprising: a processor, and a memory communicatively connected to the processor; the memory stores computer execution instructions; the processor executes the computer execution instructions stored in the memory, so that the processor executes the data acquisition resource quantity control method described in the first aspect.
  • the present application provides a computer-readable storage medium, in which computer execution instructions are stored. When the computer execution instructions are executed by a processor, they implement the data acquisition resource quantity control method described in the first aspect.
  • the data acquisition resource quantity control method, device, equipment and storage medium provided in the present application make the number of resources used more in line with the data popularity, dynamically adjust the number of resources used by each acquisition object, give priority to giving more resources to acquisition objects with high comprehensive popularity and serious timeouts, reduce the number of resources used by idle collection objects, and improve the timeliness of the obtained data.
  • FIG. 1 is a schematic diagram of an application scenario of a data acquisition resource quantity control method provided in an embodiment of the present application.
  • FIG. 2 is a flow chart of a data acquisition resource quantity control method provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the structure of a data acquisition resource quantity control device provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • the method of obtaining hot information in the related art is usually to obtain high-hot information data by using fixed resources to obtain data from a specified website or interface.
  • the related art has the following technical problems: since the data in each website or interface may change at any time, using fixed resources to obtain information data may result in the data obtained not being high-hot data, resulting in the problem of poor timeliness of the obtained data.
  • the inventors proposed the following technical concept: determine the collection status of the collection object through the historical collection cycle and expected collection cycle of the collection object, and calculate the comprehensive heat of the collection object; determine the target number of resources for the collection object based on the collection status, historical collection cycle, expected collection cycle, comprehensive heat and allocated resource number, and allocate resources equal to the target number of resources to obtain data of the collection object.
  • This application is applied to the scenario of controlling the amount of data collection resources.
  • the acquisition, storage and application of user personal information involved are in compliance with the provisions of relevant laws and regulations and do not violate public order and good customs.
  • FIG. 1 is a schematic diagram of an application scenario of a data acquisition resource quantity control method provided in an embodiment of the present application. As shown in FIG. 1, the scenario includes a first server 101 and a second server 102.
  • the first server 101 and the second server 102 can each be a single server or a cluster composed of multiple servers.
  • the connection between the first server 101 and the second server 102 can be a communication connection.
  • the first server 101 is used to obtain the data of the collection object from the second server 102, and to determine the collection status of the collection object through the historical collection cycle and the expected collection cycle of the collection object, and calculate the comprehensive heat of the collection object.
  • the target number of resources for the collection object is determined by the collection status, historical collection cycle, expected collection cycle, comprehensive heat and allocated resource number, and resources equal to the target number of resources are allocated to obtain the data of the collection object.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the data acquisition resource quantity control method.
  • the above architecture may include more or fewer components than shown in the figure, or combine certain components, or split certain components, or arrange the components differently, which can be determined according to the actual application scenario and is not limited here.
  • the components shown in FIG. 1 can be implemented in hardware, software, or a combination of software and hardware.
  • FIG. 2 is a flow chart of a data acquisition resource quantity control method provided in an embodiment of the present application.
  • the execution subject of the embodiment of the present application can be the first server 101 in FIG. 1, or a computer and/or a mobile phone, etc., and this embodiment does not impose any particular limitation on this.
  • the method includes:
  • S201 Obtain the amount of collected data within a preset time corresponding to any collection object, the amount of data in the collected data that matches the preset hotspot, and the amount of data views obtained by the collection, and read the pre-stored historical collection cycles corresponding to any collection object and the number of allocated resources for the current collection cycle, wherein the collection object includes a website, the collected data includes the content in the collected website, and the historical collection cycle is any collection cycle before the current collection cycle.
  • the collected data obtained from the collected URL will be marked with the collection time or stored in a folder with a corresponding time mark.
  • the collected data and the corresponding time can be used to obtain the data collected within the preset time.
  • the amount of data collected within the preset time is the amount of collected data.
  • the amount of data in the collected data that matches the preset hotspot can be obtained by searching all collected data in advance for data matching the preset hotspot, combining this with the time corresponding to the collected data to obtain the data matched within the preset time, and taking the number of data items matched within the preset time as the amount of matched data.
  • the collected data browsing volume can be the number of times all collected data corresponding to the collection target are browsed within the preset time.
  • the number of browsing times can be recorded in real time, and the number of browsing times at the end of the preset time is determined as the first browsing number, and the number of browsing times at the beginning of the preset time is determined as the second browsing number. The number of browsing times within the preset time is obtained by subtracting the second browsing number from the first browsing number.
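The windowed counting described above can be sketched in Python. The sketch assumes each collected item carries a collection timestamp and that the window boundaries are inclusive; neither detail is specified in the text, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def count_in_window(timestamps, window_end: datetime, window: timedelta) -> int:
    """Count collected items whose timestamp falls within the preset time.

    Inclusive boundaries are an assumption.
    """
    window_start = window_end - window
    return sum(1 for t in timestamps if window_start <= t <= window_end)


def views_in_window(views_at_start: int, views_at_end: int) -> int:
    """Views within the preset time = first browsing number (at the end)
    minus second browsing number (at the beginning)."""
    return views_at_end - views_at_start
```

The same counting function could be applied to the subset of items matching the preset hotspot to obtain the amount of matched data.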
  • the number of allocated resources for each historical collection cycle and the current collection cycle can be pre-calculated and stored in the storage unit, or recorded in the storage unit at the beginning of each historical collection cycle.
  • the number of allocated resources can be the number of resources used for collection.
  • the acquired data can be stored in a table together with the collected data, or can be stored in other formats.
  • to obtain the historical collection period corresponding to any collection object, the historical collection time corresponding to the collection object can be read from the storage unit; the time taken to collect the collection object completely once counts as one historical collection period.
  • the collected data can be the content of the collected website, for example: characters, images, videos, audio, etc. in the website.
  • the number of allocated resources in this collection period is the target number of resources calculated last time. After the target number of resources is calculated last time, it can be stored.
  • the preset hotspot can be a keyword logic expression composed of one or more words in place, time, person and event.
  • the collected data matching the preset hotspot can be the collected data that meets this keyword logic expression, or the collected data that can be queried by the keyword logic expression.
  • the amount of data matched in the collected data can be the amount of data that meets this keyword logic expression, or the amount of data that can be queried by the keyword logic expression in the collected data.
  • the collected data can be input into an independent data system for display and receive client browsing.
  • the number of browsing is the data browsing volume.
  • the collected data browsing volume can be the total browsing volume of all collected data corresponding to the collection object.
  • for example, if the most recent collection took 5 minutes, the most recent historical collection period is 5 minutes; if the third collection before the current collection period took 1 hour, the third historical collection period before the current collection period is 1 hour.
  • the preset time is, for example, one day, three days, one week, two weeks, or one month.
  • S202 Determine a collection state of any collection object according to a preset expected collection cycle and at least one historical collection cycle.
  • the expected collection period of each collection object can be different.
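  • if the average value of the historical collection periods is greater than the expected collection period, and the difference exceeds the preset value, the state of the collection object is determined to be a broken line state; if the expected collection period is greater than the average value of the historical collection periods, and the difference exceeds the preset value, the state of the collection object is determined to be an idle state.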
  • S203 Calculate the comprehensive popularity of any collection object according to the matching data volume, data browsing volume, collection data volume and expected collection cycle.
  • the amount of matched data, the amount of browsing, the amount of collected data and the expected collection period within a preset time may be input into a preset formula to obtain the comprehensive popularity of any collection object.
  • the amount of matched data, browsing volume, and collected data within a preset time can be input into a first preset formula to obtain the historical popularity of the collection object.
  • the amount of newly added data and the expected collection cycle can be input into a second preset formula to obtain the actual popularity.
  • the historical popularity and actual popularity can be input into a third preset formula to obtain the comprehensive popularity.
  • the amount of collected data is the amount of data collected within a period of time (a preset time period, at least one collection cycle or at least one recording cycle), and the amount of newly added data is the difference between the amounts of data collected between two periods of time.
  • S204 Determine the target number of resources for any collection object according to the collection status, historical collection cycle, expected collection cycle, comprehensive heat and number of allocated resources of any collection object.
  • the collection status, historical collection cycle, expected collection cycle, comprehensive heat and number of allocated resources of any collection object may be input into a preset target resource number calculation formula to obtain the target resource number.
  • Collection objects whose collection status meets the preset standards may be periodically found and their target resource numbers may be changed.
  • S205 Allocate the target number of resources to obtain data of any collection object.
  • resources having a target number of resources may be called to obtain data of any of the above-mentioned collection objects.
  • the embodiments of the present application obtain the historical collection cycle, the number of allocated resources and the amount of collected data of the collection object, the amount of data in the collected data that matches the preset hotspot and the amount of data views obtained by the collection, and determine the collection state of the collection object according to the expected collection cycle and at least one historical collection cycle, calculate the comprehensive heat of the collection object by the matched data volume, the amount of views, the amount of collected data, and the expected collection cycle, and obtain the target number of resources according to the collection state, historical collection cycle, expected collection cycle, comprehensive heat and number of allocated resources of the collection object, and allocate resources of the target number of resources to obtain the data of any collection object. Since the number of resources used to obtain the data of the collection object is updated by the collection state and comprehensive heat of the collection object, the timeliness of the obtained data is improved.
  • determining the collection state of any collection object according to a preset expected collection period and at least one historical collection period includes:
  • S2021 Subtract the expected collection period from the average value of at least one historical collection period of any collection object to obtain a period difference.
  • if only one historical collection cycle is taken, the average value is the length of that historical collection cycle. If at least two historical collection cycles are taken, for example 2, 3, or 5 historical collection cycles, the average value is obtained by averaging them. The expected collection cycle is subtracted from the average value to obtain the cycle difference.
  • the number of historical collection cycles used in this step can be preset.
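  • for example, if the average value of the historical collection periods exceeds the expected collection period by 30 seconds, the period difference is 30 seconds. If the three historical collection periods are 1 hour, 2 hours, and 1.5 hours, the average value is 1.5 hours, and the expected collection period is 2 hours, then the period difference is -0.5 hours.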
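The arithmetic of step S2021 can be checked with a short Python sketch. The function name is hypothetical, and the periods are assumed to be expressed in a single common unit (hours here).

```python
from statistics import mean


def period_difference(historical_periods, expected_period: float) -> float:
    """Average of the historical collection periods minus the expected period."""
    return mean(historical_periods) - expected_period


# Worked example from the text: periods of 1 h, 2 h and 1.5 h, expected 2 h.
diff = period_difference([1, 2, 1.5], 2)
print(diff)  # -0.5
```

A negative difference means collection finishes faster than expected, which is the direction that can lead to the idle state.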
  • the ratio of the period difference to the expected acquisition period may be obtained by dividing the period difference by the expected acquisition period.
  • the first preset value may be a decimal, a percentage, or the like.
  • If the cycle difference is 30 seconds and the expected collection cycle is 2 minutes, the ratio is 25%. If the first preset value is 20%, the collection state is determined to be a broken line state.
  • the first preset value may also be 0.19%, 24%, etc., and this application does not impose any special limitation on this.
  • the second preset value may be the first preset value multiplied by -1, or may be independent of the first preset value.
  • the ratio is -25%. If the second preset value is -20%, the collection state is determined to be an idle state.
  • the second preset value can also be other values, such as -0.17, -15%, etc., and this application does not impose any special restrictions on this.
  • the method for calculating the ratio is similar to that in S2022 and S2023, and will not be repeated here.
  • If the ratio is 2%, the first preset value is 10%, and the second preset value is -15%, then the ratio is less than the first preset value and greater than the second preset value, and the corresponding acquisition state is determined to be a normal state.
  • If the ratio is -2%, the first preset value is 5%, and the second preset value is -10%, then the ratio is less than the first preset value and greater than the second preset value, and the corresponding acquisition state is determined to be a normal state.
  • The embodiments of the present application obtain a cycle difference by subtracting the expected collection cycle from the average value of the preset number of historical collection cycles of the collection object, and compare the ratio of the cycle difference to the expected collection cycle with the first preset value and the second preset value.
  • When the ratio is greater than or equal to the first preset value, the collection state is determined as a broken line state; when it is less than or equal to the second preset value, the collection state is determined as an idle state; when it is greater than the second preset value and less than the first preset value, the collection state is determined as a normal state.
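  • The state determination described in S2021 to S2024 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `collection_state`, the state labels and the default thresholds (20% and -20%) are assumptions chosen to match the examples above.

```python
from statistics import mean


def collection_state(historical_periods, expected_period,
                     first_preset=0.20, second_preset=-0.20):
    """Classify a collection object from its historical collection periods.

    All periods must use the same time unit. The thresholds are
    illustrative defaults, not values prescribed by the application.
    """
    if not historical_periods:
        raise ValueError("at least one historical collection period is required")
    if expected_period <= 0:
        raise ValueError("expected collection period must be positive")

    # S2021: average historical period minus expected period.
    period_diff = mean(historical_periods) - expected_period
    # Ratio of the period difference to the expected period.
    ratio = period_diff / expected_period

    if ratio >= first_preset:
        return "broken_line"   # collection takes much longer than expected
    if ratio <= second_preset:
        return "idle"          # collection finishes much faster than expected
    return "normal"


# 150 s average against a 120 s expectation gives a ratio of 25%.
print(collection_state([150], 120))
# Periods of 1 h, 2 h and 1.5 h against a 2 h expectation give -25%.
print(collection_state([1, 2, 1.5], 2))
```

  • Keeping the two thresholds as separate parameters reflects the statement that the second preset value may be independent of the first.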
  • the comprehensive popularity of any collection object is calculated according to the amount of matched data, the amount of browsing, the amount of collected data and the expected collection period within a preset time, including:
  • S2031 Calculate the historical popularity of any collection object based on the amount of matched data, data browsing volume, and collected data volume.
  • This step can be to input the amount of data matched within a preset time, the amount of data viewed, and the amount of data collected into a preset formula to obtain the historical popularity of any collection object.
  • $hot_{history}$ represents the historical heat of any collection object
  • $num_{match}$ represents the amount of matched data
  • $read_{num}$ represents the amount of data browsing of the collected data
  • $record_{num}$ represents the amount of collected data
  • A, B, and C all represent constants
  • log represents taking logarithms.
  • S2032 Determine a preset number of historical collection cycles as a recording cycle.
  • the preset number may be 3, 2, 5, etc.
  • S2033 Subtract the amount of collected data at the beginning of the first recording period from the amount of collected data at the end of the first recording period to obtain the amount of collected data for the first recording period, where the first recording period is the Nth recording period before the current time, where N is a positive integer.
  • the start time may be when the collection starts, and the end time may be when the collection is completed.
  • the amount of collected data corresponding to the start time of the recording cycle may be zero or the amount of existing collected data. Data collection is performed during the period, so the amount of collected data at the end increases relative to the amount of collected data at the beginning. Therefore, the amount of collected data in the first recording period is obtained by subtracting the amount of collected data at the beginning of the first recording period from the amount of collected data at the end of the first recording period.
  • the amount of collected data in the first recording period is 100.
  • the amount of collected data at the end of the first recording period is 30 and the amount of collected data at the beginning of the first recording period is 5, then the amount of collected data in the first recording period is 25.
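  • The per-period subtraction in S2033 can be sketched as below, assuming (hypothetically) that the cumulative amount of collected data is sampled at each recording-period boundary; the helper name `collected_per_period` is not from the application.

```python
def collected_per_period(cumulative_counts):
    """Return the amount collected in each recording period.

    `cumulative_counts` holds the cumulative amount of collected data at
    successive period boundaries, oldest first. Each period's amount is
    the count at its end minus the count at its beginning.
    """
    return [end - start
            for start, end in zip(cumulative_counts, cumulative_counts[1:])]


# Boundaries at 0, 5 and 30 records: the second period collected 30 - 5 = 25.
print(collected_per_period([0, 5, 30]))
```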
  • S2034 Subtract the amount of collected data at the beginning of the second recording period from the amount of collected data at the end of the second recording period to obtain the amount of collected data for the second recording period, where the second recording period is the N+1th recording period before the current time.
  • This step is similar to the above step S2033 and will not be repeated here.
  • S2035 Subtract the amount of collected data from the first recording period from the amount of collected data from the second recording period to obtain the amount of newly added data, where the first recording period is the Nth recording period before the current time, and the second recording period is the N+1th recording period before the current time, where N is a positive integer.
  • the amount of newly added data may be the average amount of newly added data in the recording period.
  • the first recording period can be the first recording period before the current time, that is, the recording period closest to the current time, or it can be another recording period.
  • the amount of collected data can be obtained by querying the database. If the Nth recording period is the most recent recording period, then the N+1th recording period is the previous recording period of the Nth recording period.
  • the amount of data in a recording period is the sum of the amount of data in the historical collection periods, and has nothing to do with the amount of data in the current collection period.
  • S2036 Divide the amount of newly added data by the expected collection period and take the logarithm to obtain the actual heat of any collection object.
  • the average amount of new data may be divided by the expected collection period to obtain the data growth rate, and the growth rate may be taken logarithmically to obtain the actual heat.
  • the average amount of new data may be the average amount of new data in one recording period or several recording periods.
  • $hot_{real} = \log\left(\dfrac{R_{avg}}{t_{expect}}\right)$
  • $hot_{real}$ represents the actual heat
  • log represents the logarithm
  • $R_{avg}$ represents the average amount of new data
  • $t_{expect}$ represents the expected collection period.
  • This formula may be the second preset formula mentioned above.
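  • Step S2036 can be sketched as follows. The logarithm base is not stated in the text, so the natural logarithm is assumed here; the function name `actual_heat` is illustrative.

```python
import math


def actual_heat(avg_new_data, expected_period):
    """Actual heat: logarithm of the data growth rate (S2036).

    The growth rate is the average amount of new data divided by the
    expected collection period. Natural log is an assumption.
    """
    if avg_new_data <= 0 or expected_period <= 0:
        raise ValueError("both arguments must be positive to take the logarithm")
    growth_rate = avg_new_data / expected_period
    return math.log(growth_rate)


# 100 new records against an expected period of 10 time units.
print(actual_heat(100, 10))
```

  • The logarithm compresses large differences in growth rate, so a very active collection object does not dominate the later weighted sum.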
  • S2037 Map the historical heat and the actual heat into the preset intervals in a preset manner to obtain the mapped historical heat and the mapped actual heat.
  • the historical heat can be input into a preset mapping function to obtain the mapped historical heat
  • the actual heat can be input into a preset mapping function to obtain the mapped actual heat.
  • the mapping function can also be input with the minimum and maximum heat values corresponding to all acquisition targets.
  • the heat can be the historical heat or the actual heat.
  • the mapping function principle is based on the range-limiting function scale(hot, minTarget, maxTarget), which limits hot between minTarget and maxTarget, where hot represents the historical heat or the actual heat, minTarget represents the minimum value of the mapping range, and maxTarget represents the maximum value of the mapping range.
  • mapping function is as follows:
  • $hot'$ represents the mapped historical heat or the mapped actual heat
  • $hot$ represents the historical heat or the actual heat
  • $hot_{max}$ represents the maximum value among all historical heat or actual heat values
  • $hot_{min}$ represents the minimum value among all historical heat or actual heat values
  • H and I represent constants.
  • the maximum value or minimum value among the historical heat or the actual heat should correspond to the input historical heat or the actual heat.
  • This formula can be the third preset formula mentioned above.
  • H may represent the minimum value of the mapping range
  • I may represent the maximum value of the mapping range.
  • H is 1 and I is 100.
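  • The text describes the mapping only as a range-limiting function `scale(hot, minTarget, maxTarget)` with inputs `hot`, the minimum and maximum heat, and the constants H and I. One plausible reading is a linear min-max rescaling into [H, I]; the sketch below assumes that form, which is not confirmed by the application, and the handling of the degenerate case is likewise an assumption.

```python
def scale(hot, hot_min, hot_max, min_target=1.0, max_target=100.0):
    """Map `hot` linearly from [hot_min, hot_max] into [min_target, max_target].

    min_target and max_target play the role of the constants H and I.
    Assumed form: H + (hot - hot_min) / (hot_max - hot_min) * (I - H).
    """
    if hot_max == hot_min:
        # Assumption: with a single distinct heat value, use the midpoint.
        return (min_target + max_target) / 2
    # Clamp so the result never leaves the target range.
    clamped = min(max(hot, hot_min), hot_max)
    fraction = (clamped - hot_min) / (hot_max - hot_min)
    return min_target + fraction * (max_target - min_target)


print(scale(5, 0, 10))   # midpoint of the input range
print(scale(12, 0, 10))  # clamped to the upper bound
```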
  • if the collection object has no historical heat, it is mapped to a fixed range according to a pre-calibrated importance level to obtain a mapped historical heat.
  • the level of the collection object can be divided into 1 to 5, and the 5 levels can be mapped to 20 to 100 to obtain the mapping historical heat.
  • Level 1 can be mapped to 20, level 2 to 40, level 3 to 60, etc., or a preset function relationship can be used to input the level into the function to obtain the mapping historical heat.
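  • The fallback for collection objects without historical heat can be sketched as below, using the example mapping of levels 1 to 5 onto 20 to 100. The function name `level_to_heat` and the strict validation are illustrative assumptions.

```python
def level_to_heat(level):
    """Map a pre-calibrated importance level (1 to 5) to a mapped historical heat.

    Follows the example in the text: level 1 -> 20, level 2 -> 40, ..., level 5 -> 100.
    """
    if level not in (1, 2, 3, 4, 5):
        raise ValueError("importance level must be an integer from 1 to 5")
    return 20 * level


print([level_to_heat(level) for level in range(1, 6)])
```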
  • the mapped historical heat may be multiplied by the first weight coefficient to obtain the weighted historical heat
  • the mapped actual heat may be multiplied by the second weight coefficient to obtain the weighted actual heat
  • the weighted historical heat and the weighted actual heat may be added to obtain the comprehensive heat.
  • the first weight coefficient may be 0.4, 0.35, 0.3, etc.
  • the second weight coefficient may be 0.6, 0.65, 0.7, etc.
  • the sum of the first weight coefficient and the second weight coefficient may be 1.
  • the weighted sum of the mapping history heat and the mapping actual heat is used to obtain the comprehensive heat of any collection object.
  • $hot_{combine}$ represents the comprehensive heat of any collection object
  • $hot_{real}$ represents the mapped actual heat
  • $hot_{history}$ represents the mapped historical heat
  • $\alpha$ and $\beta$ represent weight coefficients.
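  • The weighted sum can be sketched as follows. The defaults of 0.4 for the historical heat and 0.6 for the actual heat are taken from the example values of the first and second weight coefficients; the function and parameter names are assumptions.

```python
def comprehensive_heat(mapped_history, mapped_real,
                       w_history=0.4, w_real=0.6):
    """Weighted sum of the mapped historical heat and the mapped actual heat.

    w_history is the first weight coefficient, w_real the second.
    The text notes that their sum may be 1.
    """
    return w_history * mapped_history + w_real * mapped_real


# Mapped historical heat 50 and mapped actual heat 100.
print(comprehensive_heat(50, 100))
```

  • Giving the actual heat the larger weight, as in the example values, makes recent data growth count more than accumulated history.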
  • the embodiments of the present application obtain the newly added data volume by subtracting the collected data volume of the first recording period from the collected data volume of the second recording period, obtain the actual heat according to the newly added data volume and the expected collection period, and after mapping the actual heat and the historical heat, determine the comprehensive heat.
  • the historical heat and actual heat of the collection object can be comprehensively considered to make the target resource number obtained subsequently more in line with the data heat, thereby increasing the timeliness of the data.
  • the target number of resources for any collection object is determined according to the collection state, historical collection cycle, expected collection cycle, comprehensive heat and number of allocated resources of any collection object, including:
  • S2041 Divide the historical collection period of each collection object by the expected collection period to obtain the time limit excess ratio of each collection object.
  • the historical collection cycle may be an average value of the historical collection cycles in S2021 above, or may be a preset Xth historical collection cycle.
  • S2042 Multiply the comprehensive heat of each collection object by the time limit-exceeding ratio to obtain a product, and take the logarithm of the product to obtain the limit-exceeding heat value of each collection object.
  • $V = \log\left(hot_{combine} \cdot \dfrac{t_{real}}{t_{expect}}\right)$
  • $V$ represents the over-limit heat value
  • $hot_{combine}$ represents the comprehensive heat of any collection object
  • $t_{real}$ represents the historical collection period
  • $t_{expect}$ represents the expected collection period
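  • Steps S2041 and S2042 can be sketched as below. The natural logarithm is assumed because the base is not stated, and the function name `over_limit_heat` is illustrative.

```python
import math


def over_limit_heat(hot_combine, t_real, t_expect):
    """Over-limit heat value V of one collection object.

    S2041: time limit excess ratio = historical period / expected period.
    S2042: V = log(comprehensive heat * excess ratio).
    """
    if hot_combine <= 0 or t_real <= 0 or t_expect <= 0:
        raise ValueError("all arguments must be positive")
    excess_ratio = t_real / t_expect
    return math.log(hot_combine * excess_ratio)


# A hot object (heat 80) whose collection takes twice the expected time.
print(over_limit_heat(80, 240, 120))
```

  • The maximum and minimum of these values over all collection objects then feed into the resource quantity difference of S2043.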
  • S2043 Determine the resource quantity difference according to the maximum and minimum values of the comprehensive heat of any collection object, the historical collection cycle, the expected collection cycle, and the over-limit heat values of all collection objects.
  • represents the difference in the number of resources
  • $V_{max}$ represents the maximum value of the over-limit heat values of all collection objects
  • $V_{min}$ represents the minimum value of the over-limit heat values of all collection objects
  • $hot_{combine}$ represents the comprehensive heat of any collection object
  • $t_{real}$ represents the historical collection cycle
  • $t_{expect}$ represents the expected collection cycle
  • D, E, F, and G all represent constants
  • log represents taking the logarithm.
  • D and E can be estimated and adjusted according to system resources and the magnitude of the objects to be captured.
  • D is 1
  • E is 10
  • F and G are 1
  • F and G can also take values that are smaller than t real or t expect , such as one percent of the smaller value of the two, or one tenth of the smaller value of the two.
  • the calculated resource number difference may be rounded.
  • the target number of resources is 9.
  • If the number of allocated resources is 9, and the difference in the number of resources is 3, then the target number of resources is 12.
  • If the acquisition state is the broken line state, the number of allocated resources is 5, and the difference in the number of resources is 1, then the target number of resources is 6.
  • the target number of resources is 5.
  • If the acquisition state is the idle state, the number of allocated resources is 9, and the difference in the number of resources is 3, then the target number of resources is 6.
  • If the acquisition state is the idle state, the number of allocated resources is 4, and the difference in the number of resources is 1, then the target number of resources is 3.
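  • The adjustment of the allocated resources by the resource number difference can be sketched as follows. The text specifies addition for the broken line state and subtraction for the idle state; leaving the allocation unchanged in the normal state is an assumption of this sketch, as are the function name and state labels.

```python
def target_resources(state, allocated, diff):
    """Target number of resources for one collection object.

    broken_line: allocated + diff (more resources for slow collection)
    idle:        allocated - diff (release resources from fast collection)
    normal:      unchanged (assumption, not stated in the text)
    """
    if state == "broken_line":
        return allocated + diff
    if state == "idle":
        return allocated - diff
    if state == "normal":
        return allocated
    raise ValueError(f"unknown collection state: {state!r}")


print(target_resources("broken_line", 9, 3))  # 9 + 3
print(target_resources("idle", 9, 3))         # 9 - 3
```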
  • the above steps S2041 to S2045 may be performed periodically.
  • the embodiments of the present application obtain the time limit ratio of each collection object by dividing the historical collection period of each collection object by the expected collection period, multiplying the comprehensive heat of each collection object by the time limit ratio to obtain the product, and taking the logarithm of the product to obtain the limit heat value of each collection object.
  • the resource number difference is calculated, and the resource number difference is added to or subtracted from the allocated resource number according to the collection state of the collection object to obtain the target resource number of any collection object. This increases the number of resources used by collection objects in the broken line state, with larger resource allocation adjustments given preferentially to targets with high comprehensive heat and serious timeouts, and reduces the number of resources used by collection objects in the idle state, with smaller resource allocation adjustments given preferentially to targets with low comprehensive heat and no timeout.
  • the following further includes:
  • the new historical collection period may be the time taken to obtain data once when using resources with the target number of resources, or may be the average time taken to obtain data multiple times when using resources with the target number of resources.
  • the target resource number is used as a fixed resource number to acquire data of any collection object using resources of the fixed resource number.
  • the preset ratio is, for example, 10%, 5%, 0.02, etc., and this application does not impose any special restrictions on this.
  • the step of adjusting the target number of resources may no longer be performed.
  • the step of adjusting the target number of resources may be the above steps S201 to S205.
  • the repetitive execution of steps S201 to S205 may be stopped.
  • the amount of newly added data of the collection object within the preset period may be the amount of newly added data of any one of the preset periods, or may be the average amount of newly added data of the preset periods.
  • the error report can be a text report or a preset prompt message.
  • the embodiment of the present application obtains the cycle difference by subtracting the expected collection cycle from the new historical collection cycle, which reflects whether the cycle has been extended or shortened.
  • If the ratio of the cycle difference to the expected collection cycle is less than the preset ratio, the target number of resources is used as the fixed number of resources, and the fixed number of resources is used to obtain data in the subsequent period. If the ratio is greater than or equal to the preset ratio, and the amount of new data in the preset cycle is greater than or equal to the preset value, the step of adjusting the target number of resources is repeated. If the ratio is greater than or equal to the preset ratio, and the amount of new data is less than the preset value, an error report is output.
  • In this way, when the target number of resources matches the collection object, the target number of resources is used for data collection.
  • When it does not match, the target number of resources is adjusted.
  • When the data of the collection object increases little but the new cycle is longer than the original historical collection cycle, it is determined to be an error, and an error report is output to prompt the user to perform manual investigation.
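  • The three-way decision after a new historical collection cycle is measured can be sketched as below. The sketch assumes the cycle difference is taken as the new historical collection cycle minus the expected collection cycle, following the convention of S2021; the function name, return labels and default thresholds are hypothetical.

```python
def follow_up_action(new_period, expected_period, new_data_amount,
                     preset_ratio=0.10, preset_value=10):
    """Decide what to do after collecting with the target number of resources.

    Returns "fix" (keep the target as a fixed resource number),
    "readjust" (repeat the target adjustment step) or
    "error_report" (slow cycle but little new data: needs manual investigation).
    """
    if expected_period <= 0:
        raise ValueError("expected collection period must be positive")
    ratio = (new_period - expected_period) / expected_period
    if ratio < preset_ratio:
        return "fix"
    if new_data_amount >= preset_value:
        return "readjust"
    return "error_report"


print(follow_up_action(125, 120, 50))   # close to the expected cycle
print(follow_up_action(150, 120, 50))   # still slow, plenty of new data
print(follow_up_action(150, 120, 2))    # slow with almost no new data
```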
  • the resource of the present application may be a thread, or bandwidth, memory, processor occupancy, etc.
  • the collection object, comprehensive heat, number of allocated resources, expected collection cycle, average amount of new data, historical collection cycle and/or task status, etc. in the present application may be stored in a table form, called a baseline table, and the target number of resources may be adjusted by periodically scanning the baseline table, such as Table 1.
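  • A baseline table with the fields listed above could be represented as in the following sketch. The class name `BaselineRow`, the field names, the status labels and the scan helper are hypothetical; the application only states which quantities may be stored and that the table is scanned periodically.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BaselineRow:
    """One row of the baseline table for a single collection object."""
    collection_object: str      # e.g. a website identifier
    comprehensive_heat: float
    allocated_resources: int    # e.g. number of threads
    expected_period: float      # same time unit as historical_period
    avg_new_data: float
    historical_period: float
    task_status: str = "normal"  # "normal", "broken_line" or "idle"


def rows_needing_adjustment(table: List[BaselineRow]) -> List[BaselineRow]:
    """Return the rows whose status calls for a new target resource number."""
    return [row for row in table if row.task_status != "normal"]


table = [
    BaselineRow("site-a", 80.0, 9, 120.0, 40.0, 150.0, "broken_line"),
    BaselineRow("site-b", 30.0, 4, 120.0, 5.0, 118.0),
]
print([row.collection_object for row in rows_needing_adjustment(table)])
```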
  • FIG. 3 is a schematic diagram of the structure of a data acquisition resource quantity control device provided in an embodiment of the present application.
  • a data acquisition resource quantity control device 300 includes: a first acquisition module 301 , a first determination module 302 , a calculation module 303 , a second determination module 304 and a second acquisition module 305 .
  • the first acquisition module 301 is used to obtain the amount of collected data within a preset time corresponding to any collection object, the amount of data in the collected data that matches the preset hotspot, and the amount of data views obtained by the collection, and read the pre-stored historical collection cycles corresponding to any collection object and the number of allocated resources in the current collection cycle, wherein the collection object includes a website, the collection data includes the content in the collected website, and the historical collection cycle is any collection cycle before the current collection cycle.
  • the first determining module 302 is used to determine the collection state of any collection object according to a preset expected collection period and at least one historical collection period.
  • the calculation module 303 is used to calculate the comprehensive popularity of any collection object according to the matching data volume, data browsing volume, collection data volume and expected collection period.
  • the second determination module 304 is used to determine the target number of resources for any collection object according to the collection state, historical collection cycle, expected collection cycle, comprehensive heat and number of allocated resources of any collection object.
  • the second acquisition module 305 is used to allocate resources of a target number of resources to acquire data of any collection object.
  • the device provided in this embodiment can be used to execute the technical solution of the above method embodiment. Its implementation principle and technical effect are similar, and this embodiment will not be repeated here.
  • the first determination module 302 is specifically used to subtract the expected acquisition period from the average value of at least one historical acquisition period of any acquisition object to obtain a period difference. If the ratio of the period difference to the expected acquisition period is greater than or equal to a first preset value, the acquisition state of any acquisition object is determined to be a broken line state. If the ratio of the period difference to the expected acquisition period is less than or equal to a second preset value, the acquisition state of any acquisition object is determined to be an idle state. If the ratio of the period difference to the expected acquisition period is less than the first preset value and greater than the second preset value, the acquisition state of any acquisition object is determined to be a normal state.
  • the device provided in this embodiment can be used to execute the technical solution of the above method embodiment. Its implementation principle and technical effect are similar, and this embodiment will not be repeated here.
  • the calculation module 303 is specifically used to calculate the historical popularity of any collection object based on the amount of matched data, the amount of data browsing, and the amount of collected data.
  • A preset number of historical collection cycles is determined as a recording cycle.
  • The amount of collected data at the beginning of the first recording cycle is subtracted from the amount of collected data at the end of the first recording cycle to obtain the amount of collected data for the first recording cycle, where the first recording cycle is the Nth recording cycle before the current time, where N is a positive integer.
  • The amount of collected data at the beginning of the second recording cycle is subtracted from the amount of collected data at the end of the second recording cycle to obtain the amount of collected data for the second recording cycle, where the second recording cycle is the N+1th recording cycle before the current time.
  • The amount of collected data for the first recording cycle is subtracted from the amount of collected data for the second recording cycle to obtain the amount of new data. The amount of new data is divided by the expected collection period and the logarithm is taken to obtain the actual heat of any collection object. The historical heat and the actual heat are mapped into the preset intervals in a preset manner to obtain the mapped historical heat and the mapped actual heat. The weighted sum of the mapped historical heat and the mapped actual heat is taken to obtain the comprehensive heat of any collection object.
  • the device provided in this embodiment can be used to execute the technical solution of the above method embodiment. Its implementation principle and technical effect are similar, and this embodiment will not be repeated here.
  • the calculation module 303 calculates the historical popularity of any collection object according to the amount of matched data, the amount of browsing, and the amount of collected data, using the following formula:
  • $hot_{history}$ represents the historical popularity of any collection object
  • $num_{match}$ represents the amount of matched data
  • $read_{num}$ represents the amount of data browsing
  • $record_{num}$ represents the amount of collected data
  • A, B, and C all represent constants
  • log represents taking the logarithm.
  • the device provided in this embodiment can be used to execute the technical solution of the above method embodiment. Its implementation principle and technical effect are similar, and this embodiment will not be repeated here.
  • the second determination module 304 is specifically used to divide the historical collection period of each collection object by the expected collection period to obtain the time limit ratio of each collection object.
  • the comprehensive heat of each collection object is multiplied by the time limit ratio to obtain the product, and the logarithm of the product is taken to obtain the limit heat value of each collection object.
  • the resource number difference is determined according to the maximum and minimum values of the comprehensive heat, historical collection period, expected collection period and limit heat values of all collection objects of any collection object. If the collection state of any collection object is a broken line state, the number of allocated resources of any collection object is added to the resource number difference to obtain the target number of resources of any collection object. If the collection state of any collection object is an idle state, the number of allocated resources of any collection object is subtracted from the resource number difference to obtain the target number of resources of any collection object.
  • the device provided in this embodiment can be used to execute the technical solution of the above method embodiment. Its implementation principle and technical effect are similar, and this embodiment will not be repeated here.
  • the second determination module 304 determines the resource number difference according to the maximum and minimum values of the comprehensive heat of any collection object, the historical collection cycle, the expected collection cycle, and the over-limit heat values of all collection objects, using the following formula:
  • represents the difference in the number of resources
  • $V_{max}$ represents the maximum value of the over-limit heat values of all collection objects
  • $V_{min}$ represents the minimum value of the over-limit heat values of all collection objects
  • $hot_{combine}$ represents the comprehensive heat of any collection object
  • $t_{real}$ represents the historical collection cycle
  • $t_{expect}$ represents the expected collection cycle
  • D, E, F, and G all represent constants
  • log represents taking the logarithm.
  • the device provided in this embodiment can be used to execute the technical solution of the above method embodiment. Its implementation principle and technical effect are similar, and this embodiment will not be repeated here.
  • the data acquisition resource quantity control device 300 further includes: a difference acquisition module 306 , a third determination module 307 , a resource adjustment module 308 and a report output module 309 .
  • the difference acquisition module 306 is used to obtain a new cycle difference by subtracting the expected acquisition cycle from the new historical acquisition cycle.
  • the third determination module 307 is used to use the target resource number as a fixed resource number to acquire data of any acquisition object using resources of the fixed resource number if the ratio of the new cycle difference to the expected acquisition cycle is less than a preset ratio.
  • the resource adjustment module 308 is used to repeat the step of adjusting the target resource number if the ratio of the new cycle difference to the expected collection cycle is greater than or equal to the preset ratio, and the amount of new data of any collection object within the preset cycle is greater than or equal to the preset value.
  • the report output module 309 is used to output an error report if the ratio of the new cycle difference to the expected collection cycle is greater than or equal to a preset ratio, and the amount of new data of any collection object within the preset cycle is less than a preset value.
  • the device provided in this embodiment can be used to execute the technical solution of the above method embodiment. Its implementation principle and technical effect are similar, and this embodiment will not be repeated here.
  • the embodiment of the present application also provides an electronic device.
  • Referring to FIG. 4, which shows a schematic diagram of the structure of an electronic device 400 suitable for implementing an embodiment of the present application, the electronic device 400 may be a terminal device or a server.
  • the terminal device may include but is not limited to mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (PMPs), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
  • the electronic device shown in FIG. 4 is only an example and should not impose any restrictions on the functions and scope of use of the embodiments of the present application.
  • the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 to a random access memory (RAM) 403.
  • The processing device 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404.
  • An input/output (I/O) interface 405 is also connected to the bus 404.
  • the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 408 including, for example, a magnetic tape, a hard disk, etc.; and communication devices 409.
  • the communication device 409 may allow the electronic device 400 to communicate with other devices wirelessly or by wire to exchange data.
  • Although FIG. 4 shows an electronic device 400 having various devices, it should be understood that it is not required to implement or have all of the devices shown. More or fewer devices may alternatively be implemented or provided.
  • an embodiment of the present application includes a computer program product, which includes a computer program carried on a computer-readable storage medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network through the communication device 409, installed from the storage device 408, or installed from the ROM 402.
  • When the computer program is executed by the processing device 401, the above functions defined in the method of the embodiment of the present application are performed.
  • the computer-readable storage medium mentioned above in the present application may be a computer-readable signal medium or a computer storage medium or any combination of the above two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries a computer-readable program code.
  • This propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device.
  • the program code contained on the computer readable storage medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the computer-readable storage medium may be included in the electronic device, or may exist independently without being installed in the electronic device.
  • the computer-readable storage medium carries one or more programs.
  • the electronic device executes the method shown in the above embodiment.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).
  • Each box in the flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing a specified logical function.
  • The functions marked in the boxes may also occur in an order different from that marked in the accompanying drawings. For example, two boxes shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the opposite order, depending on the functions involved.
  • Each box in the block diagram and/or flowchart, as well as combinations of boxes in the block diagram and/or flowchart, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or may be implemented by a combination of dedicated hardware and computer instructions.
  • The modules involved in the embodiments described in this application can be implemented by software or hardware.
  • In some cases, the name of a unit does not limit the module itself.
  • For example, the first determination module can also be described as "a module for determining the collection state of any collection object".
  • Exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
  • The present application also provides a computer-readable storage medium, which stores computer execution instructions.
  • When a processor executes the computer execution instructions, the technical solution of the data acquisition resource quantity control method in any of the above-mentioned embodiments is implemented.
  • The implementation principle and beneficial effects are similar to those of the data acquisition resource quantity control method; refer to that description, which is not repeated here.
  • A machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium may include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • The present application also provides a computer program product, including a computer program.
  • When the computer program is executed by a processor, it implements the technical solution of the data acquisition resource quantity control method in any of the above-mentioned embodiments. Its implementation principle and beneficial effects are similar to those of the data acquisition resource quantity control method; refer to that description, which is not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

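Embodiments of the present invention provide a data acquisition resource quantity control method, apparatus, device, and storage medium, belonging to the field of Internet information technology. The method includes: obtaining, for any collection object, the collected data volume within a preset time, the volume of collected data that matches a preset hotspot, and the view count of the collected data, and reading the historical collection cycles and the allocated resource count of the current collection cycle; determining the collection state of the collection object according to a preset expected collection cycle and at least one historical collection cycle; calculating the comprehensive heat of the collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle; determining the target resource count of the collection object according to its collection state, historical collection cycle, expected collection cycle, comprehensive heat, and currently allocated resource count; and allocating resources equal in number to the target resource count to obtain the data of the collection object. The present application solves the problem of poor timeliness of the obtained data.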

Description

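Data acquisition resource quantity control method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202211256657.5, filed with the China Patent Office on October 14, 2022 and entitled "Data acquisition resource quantity control method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention belongs to the field of Internet information technology, and specifically relates to a data acquisition resource quantity control method, apparatus, device, and storage medium.
Background
With the development of computer technology and the deepening informatization of the economy and culture, people want to obtain information about important events more quickly.
In the related art, in order to obtain fresh information and other data, a fixed amount of resources is usually used to fetch data from designated websites so as to obtain information data with high heat.
However, the inventors found that the related art has at least the following technical problem: because the data on each website may change at any time, fetching data from designated websites with a fixed amount of resources means that the fetched data may not be the latest data, so timeliness is poor.
Summary
The present application provides a data acquisition resource quantity control method, apparatus, device, and storage medium to solve the problem of poor timeliness of the obtained data.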
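In a first aspect, the present invention provides a data acquisition resource quantity control method, including:
obtaining, for any collection object, the collected data volume within a preset time, the volume of collected data that matches a preset hotspot, and the view count of the collected data, and reading the pre-stored historical collection cycles and the allocated resource count of the current collection cycle corresponding to the collection object, where the collection object includes a web address, the collected data includes the content collected from the web address, and a historical collection cycle is any collection cycle before the current collection cycle; determining the collection state of the collection object according to a preset expected collection cycle and at least one historical collection cycle; calculating the comprehensive heat of the collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle; determining the target resource count of the collection object according to its collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count; and allocating resources equal in number to the target resource count to obtain the data of the collection object.
The historical collection cycles, the allocated resource count, the collected data volume, the volume of collected data matching the preset hotspot, and the view count of the collected data are obtained for the collection object; the collection state is determined from the expected collection cycle and at least one historical collection cycle; the comprehensive heat is calculated from the matched data volume, the view count, the collected data volume, and the expected collection cycle; the target resource count is obtained from the collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count; and resources equal in number to the target resource count are allocated to obtain the data of the collection object. Because the number of resources used to obtain the data of the collection object is updated according to its collection state and comprehensive heat, the timeliness of the obtained data is improved.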
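In a possible implementation, determining the collection state of any collection object according to the preset expected collection cycle and at least one historical collection cycle includes: subtracting the expected collection cycle from the average of at least one historical collection cycle of the collection object to obtain a cycle difference; if the ratio of the cycle difference to the expected collection cycle is greater than or equal to a first preset value, determining the collection state of the collection object to be the breach state; if the ratio is less than or equal to a second preset value, determining the collection state to be the idle state; and if the ratio is less than the first preset value and greater than the second preset value, determining the collection state to be the normal state.
The expected collection cycle is subtracted from the average of a preset number of historical collection cycles of the collection object to obtain the cycle difference, and the ratio of the cycle difference to the expected collection cycle is compared with the first preset value and the second preset value. The collection state is determined to be the breach state when the ratio is greater than or equal to the first preset value, the idle state when it is less than or equal to the second preset value, and the normal state when it is greater than the second preset value and less than the first preset value. The collection state is thus obtained from the average of the historical collection cycles and the expected collection cycle, which makes it convenient to later change the number of resources used for collection according to the collection state.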
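In a possible implementation, calculating the comprehensive heat of any collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle includes: calculating the historical heat of the collection object according to the matched data volume, the data view count, and the collected data volume; determining a preset number of historical collection cycles as one recording cycle; subtracting the collected data volume at the start of a first recording cycle from the collected data volume at the end of the first recording cycle to obtain the collected data volume of the first recording cycle, where the first recording cycle is the N-th recording cycle before the current time and N is a positive integer; subtracting the collected data volume at the start of a second recording cycle from the collected data volume at the end of the second recording cycle to obtain the collected data volume of the second recording cycle, where the second recording cycle is the (N+1)-th recording cycle before the current time; subtracting the collected data volume of the second recording cycle from that of the first recording cycle to obtain the newly added data volume; dividing the newly added data volume by the expected collection cycle and taking the logarithm to obtain the actual heat of the collection object; mapping the historical heat and the actual heat separately into a preset interval in a preset manner to obtain the mapped historical heat and the mapped actual heat; and computing the weighted sum of the mapped historical heat and the mapped actual heat to obtain the comprehensive heat of the collection object.
The collected data volume of the second recording cycle is subtracted from that of the first recording cycle to obtain the newly added data volume, the actual heat is obtained from the newly added data volume and the expected collection cycle, and the comprehensive heat is determined after the actual heat and the historical heat are mapped. Both the historical heat and the actual heat of the collection object are therefore taken into account, so the target resource count obtained later better matches the heat of the data, which improves data timeliness.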
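In a possible implementation, the historical heat of any collection object is calculated from the matched data volume, the view count, and the collected data volume using the following formula:
where $hot_{history}$ denotes the historical heat of the collection object, $num_{match}$ denotes the volume of collected data that matches, $read_{num}$ denotes the view count of the collected data, $record_{num}$ denotes the collected data volume, A, B, and C are constants, and log denotes taking the logarithm.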
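In a possible implementation, determining the target resource count of any collection object according to its collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count includes: dividing the historical collection cycle of each collection object by its expected collection cycle to obtain the time over-limit ratio of each collection object; multiplying the comprehensive heat of each collection object by its time over-limit ratio and taking the logarithm of the product to obtain the over-limit heat value of each collection object; determining a resource count difference according to the comprehensive heat, historical collection cycle, and expected collection cycle of the collection object and the maximum and minimum of the over-limit heat values of all collection objects; if the collection state of the collection object is the breach state, adding the resource count difference to its allocated resource count to obtain its target resource count; and if the collection state is the idle state, subtracting the resource count difference from its allocated resource count to obtain its target resource count.
The historical collection cycle of each collection object is divided by the expected collection cycle to obtain the time over-limit ratio, and the comprehensive heat is multiplied by the time over-limit ratio and the logarithm of the product is taken to obtain the over-limit heat value. The resource count difference is calculated from the comprehensive heat, historical collection cycle, and expected collection cycle of the collection object and the maximum and minimum over-limit heat values of all collection objects, and is added to or subtracted from the allocated resource count according to the collection state to obtain the target resource count. Collection objects in the breach state thus receive more resources, with objects that have high comprehensive heat and severe overruns receiving larger increases first, while collection objects in the idle state receive fewer resources, with objects that have low comprehensive heat and no overrun receiving larger reductions first.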
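In a possible implementation, the resource count difference is determined from the comprehensive heat, historical collection cycle, and expected collection cycle of the collection object and the maximum and minimum of the over-limit heat values of all collection objects using the following formula:
where $\Delta$ denotes the resource count difference, $V_{max}$ denotes the maximum of the over-limit heat values of all collection objects, $V_{min}$ denotes the minimum of the over-limit heat values of all collection objects, $hot_{combine}$ denotes the comprehensive heat of the collection object, $t_{real}$ denotes the historical collection cycle, $t_{expect}$ denotes the expected collection cycle, D, E, F, and G are constants, and log denotes taking the logarithm.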
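In a possible implementation, after the data of the collection object is obtained with resources equal in number to the target resource count, the method further includes: subtracting the expected collection cycle from the new historical collection cycle to obtain a new cycle difference; if the ratio of the new cycle difference to the expected collection cycle is less than a preset ratio, taking the target resource count as a fixed resource count and using resources equal in number to the fixed resource count to obtain the data of the collection object; if the ratio is greater than or equal to the preset ratio and the newly added data volume of the collection object within a preset number of cycles is greater than or equal to a preset value, repeating the step of adjusting the target resource count; and if the ratio is greater than or equal to the preset ratio and the newly added data volume within the preset number of cycles is less than the preset value, outputting an error report.
The difference between the new historical collection cycle and the expected collection cycle reflects how much the cycle has lengthened or shortened. When the ratio of the cycle difference to the expected cycle is less than the preset ratio, the target resource count is taken as the fixed resource count and used for subsequent data acquisition. If the ratio is greater than or equal to the preset ratio and the newly added data volume within the preset number of cycles is greater than or equal to the preset value, the step of adjusting the target resource count is repeated. If the ratio is greater than or equal to the preset ratio and the newly added data volume is less than the preset value, an error report is output. In this way, the target resource count is used for collection once it matches the collection object, it is adjusted when the data of the collection object grows substantially, and when the data grows little but the new cycle takes longer than the original historical collection cycle, an error is assumed and an error report is output to prompt the user to investigate manually.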
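In a second aspect, the present application further provides a data acquisition resource quantity control apparatus, including: a first obtaining module, configured to obtain, for any collection object, the collected data volume within a preset time, the volume of collected data that matches a preset hotspot, and the view count of the collected data, and to read the pre-stored historical collection cycles and the allocated resource count of the current collection cycle corresponding to the collection object, where the collection object includes a web address, the collected data includes the content collected from the web address, and a historical collection cycle is any collection cycle before the current collection cycle; a first determination module, configured to determine the collection state of the collection object according to a preset expected collection cycle and at least one historical collection cycle; a calculation module, configured to calculate the comprehensive heat of the collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle; a second determination module, configured to determine the target resource count of the collection object according to its collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count; and a second obtaining module, configured to allocate resources equal in number to the target resource count to obtain the data of the collection object.
In a third aspect, the present application further provides an electronic device, including a processor and a memory communicatively connected to the processor; the memory stores computer execution instructions; and the processor executes the computer execution instructions stored in the memory, so that the processor performs the data acquisition resource quantity control method described in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer execution instructions which, when executed by a processor, implement the data acquisition resource quantity control method described in the first aspect.
With the above technical solutions, the data acquisition resource quantity control method, apparatus, device, and storage medium provided by the present application make the number of resources used better match the heat of the data and dynamically adjust the number of resources used by each collection object: collection objects with high comprehensive heat and severe overruns are given more resources first, and the number of resources used by collection objects in the idle state is reduced, which improves the timeliness of the obtained data.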
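Brief Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario of the data acquisition resource quantity control method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the data acquisition resource quantity control method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of the data acquisition resource quantity control apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.
Detailed Description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.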
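With the rapid development of computer technology and the steady growth of computing power, economic and cultural information can now be disseminated over the Internet, and people want to obtain hot information quickly through the Internet.
At present, methods in the related art for obtaining hot information usually use a fixed amount of resources to fetch data from designated websites or interfaces so as to obtain information data with high heat. However, the inventors found that the related art has the following technical problem: because the data in each website or interface may change at any time, using a fixed amount of resources to fetch information data means that the fetched data may not be the high-heat data, so the timeliness of the obtained data is poor.
To address this technical problem, the inventors propose the following technical concept: the collection state of a collection object is determined from its historical collection cycles and expected collection cycle, and its comprehensive heat is calculated; the target resource count of the collection object is determined from the collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count; and resources equal in number to the target resource count are allocated to obtain the data of the collection object.
The present application applies to scenarios in which the quantity of data acquisition resources is controlled. In the technical solution of the present application, the acquisition, storage, and use of any user personal information involved comply with relevant laws and regulations and do not violate public order and good morals.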
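FIG. 1 is a schematic diagram of an application scenario of the data acquisition resource quantity control method provided by an embodiment of the present application. As shown in FIG. 1, the scenario includes a first server 101 and a second server 102.
The first server 101 and the second server 102 may each be a single server or a cluster composed of multiple servers. The first server 101 and the second server 102 may be connected by a communication connection.
In a specific implementation, the first server 101 is configured to obtain the data of a collection object from the second server 102, to determine the collection state of the collection object from its historical collection cycles and expected collection cycle, to calculate its comprehensive heat, to determine its target resource count from the collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count, and to allocate resources equal in number to the target resource count to obtain the data of the collection object.
It can be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the data acquisition resource quantity control method. In other feasible implementations of the present application, the above architecture may include more or fewer components than shown, combine some components, split some components, or arrange the components differently, as determined by the actual application scenario; no limitation is imposed here. The components shown in FIG. 1 may be implemented in hardware, software, or a combination of software and hardware.
The technical solution of the present application, and how it solves the above technical problem, are described in detail below with specific embodiments. These specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present application are described below with reference to the accompanying drawings.
FIG. 2 is a schematic flowchart of the data acquisition resource quantity control method provided by an embodiment of the present application. The execution subject of this embodiment may be the first server 101 in FIG. 1, or a computer and/or a mobile phone, etc.; this embodiment imposes no particular limitation. As shown in FIG. 2, the method includes: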
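S201: Obtain, for any collection object, the collected data volume within a preset time, the volume of collected data that matches a preset hotspot, and the view count of the collected data, and read the pre-stored historical collection cycles and the allocated resource count of the current collection cycle corresponding to the collection object, where the collection object includes a web address, the collected data includes the content collected from the web address, and a historical collection cycle is any collection cycle before the current collection cycle.
The data collected from a web address is tagged with its collection time or stored in a folder with a corresponding time tag. From the collected data and the corresponding times, the data collected within the preset time can be obtained; the number of data items collected within the preset time is the collected data volume. The volume of collected data that matches the preset hotspot may be obtained by searching all collected data in advance for data that matches the preset hotspot, combining the result with the times of the collected data to obtain the matching data within the preset time, and taking the number of matching data items within the preset time as the matched data volume. The view count of the collected data may be the number of times all collected data of the collection object was viewed within the preset time. It may be obtained by recording the number of views in real time, taking the number of views at the end of the preset time as a first view count and the number of views at the start of the preset time as a second view count, and subtracting the second view count from the first view count. The historical collection cycles and the allocated resource count of the current collection cycle may be calculated in advance and stored in a storage unit, or may be recorded in the storage unit at the start of each historical collection cycle. The allocated resource count may be the number of resources used for collection.
In this step, the data obtained may all have been stored in a table when or while the data was collected, or stored in another format. Obtaining the historical collection cycles of a collection object may mean reading the historical collection times of the collection object from the storage unit; the time taken to collect the collection object completely once is one historical collection cycle. The collected data may be the content collected from the web address, for example the characters, images, video, and audio in the web address. The allocated resource count of the current collection cycle is the target resource count obtained in the previous calculation. After the target resource count is calculated, it may be stored in association with the collection object, so that reading the most recently calculated target resource count of the collection object gives the allocated resource count of the current collection cycle. The preset hotspot may be a keyword logical expression composed of one or more terms for place, time, person, and event. Collected data that matches the preset hotspot may be collected data that satisfies this keyword logical expression, or collected data that can be found by querying with it; correspondingly, the matched data volume may be the volume of data that satisfies the keyword logical expression, or the volume of collected data that can be found by querying with it. The collected data may be fed into a separate data system for display and viewed from clients; the number of views is the data view count, and the view count of the collected data may be the total number of views of all collected data of the collection object.
Examples of historical collection cycles: if the previous collection took 5 minutes, the previous historical collection cycle is 5 minutes; if the third collection cycle before the current collection cycle took 1 hour, that collection cycle is 1 hour. The preset time is, for example, one day, three days, one week, two weeks, or one month.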
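S202: Determine the collection state of the collection object according to the preset expected collection cycle and at least one historical collection cycle.
In this step, the expected collection cycle may differ for each collection object. Over a preset number of collection cycles, if the expected collection cycle is less than the average of the historical collection cycles and the difference exceeds a preset value, the state of the collection object is determined to be the breach state; if the expected collection cycle is greater than the average of the historical collection cycles and the difference exceeds a preset value, the state of the collection object is determined to be the idle state.
S203: Calculate the comprehensive heat of the collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle.
In this step, the matched data volume, view count, and collected data volume within the preset time and the expected collection cycle may be input into preset formulas to obtain the comprehensive heat of the collection object.
Specifically, the matched data volume, view count, and collected data volume within the preset time may be input into a first preset formula to obtain the historical heat of the collection object. The newly added data volume and the expected collection cycle are input into a second preset formula to obtain the actual heat. The historical heat and the actual heat are input into a third preset formula to obtain the comprehensive heat.
The collected data volume is the volume of data collected within a period of time (a preset time period, at least one collection cycle, or at least one recording cycle), and the newly added data volume is the difference between the data volumes collected in two such periods.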
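S204: Determine the target resource count of the collection object according to its collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count.
In this step, the collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count of the collection object may be input into a preset target resource count formula to obtain the target resource count. Collection objects whose collection state meets a preset criterion may be found periodically and their target resource counts changed.
S205: Allocate resources equal in number to the target resource count to obtain the data of the collection object.
In this step, a number of resources equal to the target resource count may be invoked to obtain the data of the collection object.
As can be seen from the above description, this embodiment obtains the historical collection cycles, allocated resource count, collected data volume, volume of collected data matching the preset hotspot, and view count of the collected data for the collection object; determines the collection state from the expected collection cycle and at least one historical collection cycle; calculates the comprehensive heat from the matched data volume, view count, collected data volume, and expected collection cycle; obtains the target resource count from the collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count; and allocates resources equal in number to the target resource count to obtain the data of the collection object. Because the number of resources used to obtain the data of the collection object is updated according to its collection state and comprehensive heat, the timeliness of the obtained data is improved.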
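In a possible implementation, determining the collection state of the collection object according to the preset expected collection cycle and at least one historical collection cycle in step S202 includes:
S2021: Subtract the expected collection cycle from the average of at least one historical collection cycle of the collection object to obtain a cycle difference.
In this step, if only one historical collection cycle is taken, the average is simply the length of that cycle; if at least two are taken, the average is obtained by averaging, for example, 2, 3, or 5 historical collection cycles. The expected collection cycle is subtracted from the average to obtain the cycle difference. The number of historical collection cycles used in this step may be preset.
For example, if the 2 most recent historical collection cycles are 2 minutes and 3 minutes, the average is 2 minutes 30 seconds; with an expected collection cycle of 2 minutes, the cycle difference is 30 seconds. If 3 historical collection cycles of 1 hour, 2 hours, and 1.5 hours are taken, the average is 1.5 hours; with an expected collection cycle of 2 hours, the cycle difference is -0.5 hours.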
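S2022: If the ratio of the cycle difference to the expected collection cycle is greater than or equal to a first preset value, determine the collection state of the collection object to be the breach state.
In this step, the ratio may be obtained by dividing the cycle difference by the expected collection cycle. The first preset value may be a decimal, a percentage, etc.
For example, if the cycle difference is 30 seconds and the expected collection cycle is 2 minutes, the ratio is 25%; if the first preset value is 20%, the collection state is determined to be the breach state.
The first preset value may also be 0.19, 24%, etc.; the present application imposes no particular limitation.
S2023: If the ratio of the cycle difference to the expected collection cycle is less than or equal to a second preset value, determine the collection state of the collection object to be the idle state.
In this step, the second preset value may be the first preset value multiplied by -1, or may be unrelated to the first preset value.
For example, if the cycle difference is -0.5 hours and the expected collection cycle is 2 hours, the ratio is -25%; if the second preset value is -20%, the collection state is determined to be the idle state. The second preset value may also be another value, such as -0.17 or -15%; the present application imposes no particular limitation.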
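S2024: If the ratio of the cycle difference to the expected collection cycle is less than the first preset value and greater than the second preset value, determine the collection state of the collection object to be the normal state.
In this step, the ratio is calculated as in S2022 and S2023, which is not repeated here.
For example, if the ratio is 2%, the first preset value is 10%, and the second preset value is -15%, the ratio is less than the first preset value and greater than the second preset value, so the collection state is determined to be the normal state. As another example, if the ratio is -2%, the first preset value is 5%, and the second preset value is -10%, the ratio is again between the two preset values, so the collection state is determined to be the normal state.
As can be seen from the above description, this embodiment subtracts the expected collection cycle from the average of a preset number of historical collection cycles of the collection object to obtain the cycle difference, and compares the ratio of the cycle difference to the expected collection cycle with the first preset value and the second preset value. The collection state is determined to be the breach state when the ratio is greater than or equal to the first preset value, the idle state when it is less than or equal to the second preset value, and the normal state when it is greater than the second preset value and less than the first preset value. The collection state is thus obtained from the average of the historical collection cycles and the expected collection cycle, which makes it convenient to later change the number of resources used for collection according to the collection state.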
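Steps S2021 to S2024 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `collection_state`, the state labels, and the default thresholds of 20% and -20% are assumptions taken from the worked examples above.

```python
from statistics import mean


def collection_state(historical_cycles, expected_cycle,
                     first_preset=0.20, second_preset=-0.20):
    """Classify a collection object as 'breach', 'idle', or 'normal'.

    historical_cycles and expected_cycle must use the same time unit.
    The threshold defaults are illustrative assumptions.
    """
    if not historical_cycles:
        raise ValueError("at least one historical collection cycle is required")
    if expected_cycle <= 0:
        raise ValueError("expected_cycle must be positive")

    # S2021: average of the historical cycles minus the expected cycle.
    cycle_difference = mean(historical_cycles) - expected_cycle
    ratio = cycle_difference / expected_cycle

    if ratio >= first_preset:   # S2022: collection takes too long
        return "breach"
    if ratio <= second_preset:  # S2023: collection finishes early
        return "idle"
    return "normal"             # S2024: between the two thresholds


# Worked examples from the text (seconds, then hours).
print(collection_state([120, 180], 120))   # ratio 0.25
print(collection_state([1, 2, 1.5], 2))    # ratio -0.25
```

Keeping the two thresholds as separate parameters mirrors the statement that the second preset value need not be the negative of the first.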
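In a possible implementation, calculating the comprehensive heat of the collection object according to the matched data volume, view count, and collected data volume within the preset time and the expected collection cycle in step S203 includes:
S2031: Calculate the historical heat of the collection object according to the matched data volume, the data view count, and the collected data volume.
In this step, the matched data volume, data view count, and collected data volume within the preset time may be input into a preset formula to obtain the historical heat of the collection object.
In a possible implementation, the formula used in this step is as follows:
where $hot_{history}$ denotes the historical heat of the collection object, $num_{match}$ denotes the matched data volume, $read_{num}$ denotes the view count of the collected data, $record_{num}$ denotes the collected data volume, A, B, and C are constants, and log denotes taking the logarithm. This formula may be the first preset formula mentioned above.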
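S2032: Determine a preset number of historical collection cycles as one recording cycle.
In this step, the preset number may be 3, or 2, 5, etc.
S2033: Subtract the collected data volume at the start of the first recording cycle from the collected data volume at the end of the first recording cycle to obtain the collected data volume of the first recording cycle, where the first recording cycle is the N-th recording cycle before the current time and N is a positive integer.
Here, the start may be when collection begins and the end may be when collection completes. The collected data volume at the start of a recording cycle may be zero or may be the volume of data already collected. Because data is collected during the first recording cycle, the collected data volume at the end is larger than at the start, so subtracting the volume at the start from the volume at the end gives the collected data volume of the first recording cycle.
For example, if the collected data volume is 600 items at the end of the first recording cycle and 500 items at its start, the collected data volume of the first recording cycle is 100 items. As another example, if the volume is 30 items at the end and 5 items at the start, the collected data volume of the first recording cycle is 25 items.
S2034: Subtract the collected data volume at the start of the second recording cycle from the collected data volume at the end of the second recording cycle to obtain the collected data volume of the second recording cycle, where the second recording cycle is the (N+1)-th recording cycle before the current time.
This step is similar to step S2033 and is not repeated here.
S2035: Subtract the collected data volume of the second recording cycle from that of the first recording cycle to obtain the newly added data volume, where the first recording cycle is the N-th recording cycle before the current time, the second recording cycle is the (N+1)-th recording cycle before the current time, and N is a positive integer. The newly added data volume may be the average newly added data volume of the recording cycles.
In this step, the first recording cycle may be the first recording cycle before the current time, that is, the recording cycle closest to the current time, or another recording cycle. The collected data volume may be queried from a database. If the N-th recording cycle is the most recent recording cycle, the (N+1)-th recording cycle is the one immediately before it. The data volume of a recording cycle is the sum of the data volumes of its historical collection cycles and is independent of the data volume of the current collection cycle.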
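S2036: Divide the newly added data volume by the expected collection cycle and take the logarithm to obtain the actual heat of the collection object.
In this step, the average newly added data volume may be divided by the expected collection cycle to obtain the data growth rate, and the logarithm of the growth rate is taken to obtain the actual heat. The average newly added data volume may be the average over one recording cycle or several recording cycles.
The formula used in this step is as follows:
$$hot_{real} = \log\left(\frac{R_{avg}}{t_{expect}}\right)$$
where $hot_{real}$ denotes the actual heat, log denotes taking the logarithm, $R_{avg}$ denotes the average newly added data volume, and $t_{expect}$ denotes the expected collection cycle. This formula may be the second preset formula mentioned above.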
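Steps S2033 to S2036 can be sketched as follows. The function names are hypothetical, and two points are assumptions because the text does not settle them: the logarithm is taken as the natural logarithm, and a non-positive growth rate is rejected with an exception since its logarithm is undefined.

```python
import math


def period_volume(volume_at_start, volume_at_end):
    """S2033/S2034: data collected during one recording cycle."""
    return volume_at_end - volume_at_start


def actual_heat(first_period_volume, second_period_volume, expected_cycle):
    """S2035/S2036: log of (newly added data volume / expected cycle).

    The natural logarithm is an assumption; the text only says 'log'.
    """
    new_volume = first_period_volume - second_period_volume  # S2035
    growth_rate = new_volume / expected_cycle
    if growth_rate <= 0:
        # Assumption: the text does not say how to treat this case.
        raise ValueError("growth rate must be positive to take its logarithm")
    return math.log(growth_rate)


# The two recording-cycle examples from the text: 600 - 500 and 30 - 5 items.
first = period_volume(500, 600)
second = period_volume(5, 30)
print(first, second)
print(actual_heat(first, second, expected_cycle=5))
```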
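S2037: Map the historical heat and the actual heat separately into a preset interval in a preset manner to obtain the mapped historical heat and the mapped actual heat.
In this step, the historical heat may be input into a preset mapping function to obtain the mapped historical heat, and the actual heat may be input into the preset mapping function to obtain the mapped actual heat. The minimum and maximum heat over all collection objects may also be input into the mapping function; here the heat may be either the historical heat or the actual heat. The mapping function follows the range-limiting function scale(hot, minTarget, maxTarget), which confines hot to the range between minTarget and maxTarget, where hot denotes the historical heat or the actual heat, minTarget denotes the minimum of the mapping range, and maxTarget denotes the maximum of the mapping range.
The mapping function is as follows:
where $hot'$ denotes the mapped historical heat or mapped actual heat, $hot$ denotes the historical heat or actual heat, $hot_{max}$ denotes the maximum of all historical heats or actual heats, $hot_{min}$ denotes the minimum of all historical heats or actual heats, and H and I are constants. The maximum or minimum used must correspond to the kind of heat that is input (historical or actual). This formula may be the third preset formula mentioned above.
In the above mapping formula, H may denote the minimum of the mapping range and I may denote the maximum of the mapping range. For example, H is 1 and I is 100.
In a possible implementation, if the collection object has no historical heat, it is mapped to a fixed range according to a pre-assigned importance level to obtain the mapped historical heat.
For example, collection objects may be graded into 5 levels from 1 to 5, which may be mapped to the range 20 to 100 to obtain the mapped historical heat. Level 1 may be mapped to 20, level 2 to 40, level 3 to 60, and so on; alternatively, a preset functional relationship may be used, with the level input into the function to obtain the mapped historical heat.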
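The mapping formula itself is not reproduced in this text, so the sketch below assumes ordinary min-max scaling, which fits the description of `scale(hot, minTarget, maxTarget)` and of H and I as the bounds of the mapping range. The function names, the handling of the degenerate case where all heats are equal, and the linear grade mapping are illustrative assumptions.

```python
def scale(hot, hot_min, hot_max, target_min=1.0, target_max=100.0):
    """Map hot from [hot_min, hot_max] into [target_min, target_max].

    Assumed min-max form: H + (hot - hot_min) / (hot_max - hot_min) * (I - H),
    with H = target_min and I = target_max.
    """
    if hot_max == hot_min:
        # Assumption: with no spread between objects, fall back to the lower bound.
        return target_min
    fraction = (hot - hot_min) / (hot_max - hot_min)
    return target_min + fraction * (target_max - target_min)


def mapped_heat_from_grade(grade, step=20.0):
    """Fallback when an object has no historical heat: grades 1..5 -> 20..100."""
    if not 1 <= grade <= 5:
        raise ValueError("grade must be between 1 and 5")
    return grade * step


print(scale(5.0, 0.0, 10.0))        # midpoint of the source range
print(mapped_heat_from_grade(3))    # level 3
```

The historical heat and the actual heat would each be scaled against the minimum and maximum of their own kind, as the text requires.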
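S2038: Compute the weighted sum of the mapped historical heat and the mapped actual heat to obtain the comprehensive heat of the collection object.
In this step, the mapped historical heat may be multiplied by a first weight coefficient to obtain the weighted historical heat, the mapped actual heat may be multiplied by a second weight coefficient to obtain the weighted actual heat, and the two are added to obtain the comprehensive heat.
The first weight coefficient may be 0.4, 0.35, 0.3, etc., the second weight coefficient may be 0.6, 0.65, 0.7, etc., and the two coefficients may sum to 1. The higher the comprehensive heat, the more important the data and the higher the likely real-time traffic.
In a possible implementation, the weighted sum of the mapped historical heat and the mapped actual heat is computed with the following formula:
$$hot_{combine} = \alpha \cdot hot_{real} + \beta \cdot hot_{history}$$
where $hot_{combine}$ denotes the comprehensive heat of the collection object, $hot_{real}$ denotes the mapped actual heat, $hot_{history}$ denotes the mapped historical heat, and $\alpha$ and $\beta$ are weight coefficients.
As can be seen from the above description, this embodiment subtracts the collected data volume of the second recording cycle from that of the first recording cycle to obtain the newly added data volume, obtains the actual heat from the newly added data volume and the expected collection cycle, and determines the comprehensive heat after mapping the actual heat and the historical heat. Both the historical heat and the actual heat of the collection object are therefore taken into account, so the target resource count obtained later better matches the heat of the data, which improves data timeliness.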
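The weighted sum of step S2038 is small enough to show directly. In this sketch the function name is hypothetical and the default weights 0.6 and 0.4 are simply one pair taken from the example values in the text.

```python
def comprehensive_heat(mapped_actual, mapped_history, alpha=0.6, beta=0.4):
    """hot_combine = alpha * mapped actual heat + beta * mapped historical heat."""
    return alpha * mapped_actual + beta * mapped_history


# Both inputs are assumed to have been mapped into the same interval already.
print(comprehensive_heat(80.0, 50.0))
```

Weighting the actual heat more heavily than the historical heat, as in the example coefficients, makes the result respond faster to recent growth in the data.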
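In a possible implementation, determining the target resource count of the collection object according to its collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count in step S204 includes:
S2041: Divide the historical collection cycle of each collection object by its expected collection cycle to obtain the time over-limit ratio of each collection object.
In this step, the historical collection cycle may be the average of the historical collection cycles from S2021, or a preset X-th historical collection cycle.
S2042: Multiply the comprehensive heat of each collection object by its time over-limit ratio and take the logarithm of the product to obtain the over-limit heat value of each collection object.
S2041 and S2042 may be expressed together by the following formula:
$$V = \log\left(hot_{combine} \cdot \frac{t_{real}}{t_{expect}}\right)$$
where $V$ denotes the over-limit heat value, $hot_{combine}$ denotes the comprehensive heat of the collection object, $t_{real}$ denotes the historical collection cycle, and $t_{expect}$ denotes the expected collection cycle.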
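Steps S2041 and S2042, together with the maximum and minimum needed by the next step, can be sketched as follows. The function name and the sample objects are hypothetical, and the natural logarithm is again an assumption.

```python
import math


def over_limit_heat(comprehensive_heat, real_cycle, expected_cycle):
    """V = log(hot_combine * t_real / t_expect)."""
    time_over_limit_ratio = real_cycle / expected_cycle         # S2041
    return math.log(comprehensive_heat * time_over_limit_ratio)  # S2042


# Hypothetical objects: (comprehensive heat, historical cycle, expected cycle).
objects = {
    "site-a": (50.0, 150.0, 120.0),
    "site-b": (20.0, 90.0, 120.0),
}
values = {name: over_limit_heat(*args) for name, args in objects.items()}
v_max, v_min = max(values.values()), min(values.values())
print(values, v_max, v_min)
```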
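S2043: Determine the resource count difference according to the comprehensive heat, historical collection cycle, and expected collection cycle of the collection object and the maximum and minimum of the over-limit heat values of all collection objects.
In a possible implementation, the formula used in this step is as follows:
where $\Delta$ denotes the resource count difference, $V_{max}$ denotes the maximum of the over-limit heat values of all collection objects, $V_{min}$ denotes the minimum of the over-limit heat values of all collection objects, $hot_{combine}$ denotes the comprehensive heat of the collection object, $t_{real}$ denotes the historical collection cycle, $t_{expect}$ denotes the expected collection cycle, D, E, F, and G are constants, and log denotes taking the logarithm.
D and E may be estimated and adjusted according to the system resources and the order of magnitude of the objects to be crawled; for example, D is 1 and E is 10. F and G are, for example, 1; F and G may also be values that are small relative to $t_{real}$ or $t_{expect}$, such as one hundredth or one tenth of the smaller of the two.
In a possible implementation, the resource count difference may also be rounded to an integer after it is calculated.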
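S2044: If the collection state of the collection object is the breach state, add the resource count difference to its allocated resource count to obtain its target resource count.
In this step, for example, if the collection state is the breach state, the allocated resource count is 7, and the resource count difference is 2, the target resource count is 9. As another example, if the collection state is the breach state, the allocated resource count is 9, and the resource count difference is 3, the target resource count is 12. As a further example, if the collection state is the breach state, the allocated resource count is 5, and the resource count difference is 1, the target resource count is 6.
S2045: If the collection state of the collection object is the idle state, subtract the resource count difference from its allocated resource count to obtain its target resource count.
In this step, for example, if the collection state is the idle state, the allocated resource count is 7, and the resource count difference is 2, the target resource count is 5. As another example, if the collection state is the idle state, the allocated resource count is 9, and the resource count difference is 3, the target resource count is 6. As a further example, if the collection state is the idle state, the allocated resource count is 4, and the resource count difference is 1, the target resource count is 3.
In a possible implementation, steps S2041 to S2045 may be performed periodically.
As can be seen from the above description, this embodiment divides the historical collection cycle of each collection object by the expected collection cycle to obtain the time over-limit ratio, and multiplies the comprehensive heat by the time over-limit ratio and takes the logarithm of the product to obtain the over-limit heat value. The resource count difference is calculated from the comprehensive heat, historical collection cycle, and expected collection cycle of the collection object and the maximum and minimum over-limit heat values of all collection objects, and is added to or subtracted from the allocated resource count according to the collection state to obtain the target resource count. Collection objects in the breach state thus receive more resources, with objects that have high comprehensive heat and severe overruns receiving larger increases first, while collection objects in the idle state receive fewer resources, with objects that have low comprehensive heat and no overrun receiving larger reductions first.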
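Steps S2044 and S2045 reduce to adding or subtracting the resource count difference according to the state. The sketch below takes the difference as an input because the formula for it is not reproduced in this text; the function name and state labels are hypothetical, and leaving the count unchanged in the normal state is an assumption consistent with the text defining adjustments only for the other two states.

```python
def target_resource_count(state, allocated, resource_difference):
    """Return the target resource count for one collection object."""
    difference = round(resource_difference)  # the text allows rounding to an integer
    if state == "breach":   # S2044: collection is too slow, add resources
        return allocated + difference
    if state == "idle":     # S2045: collection is faster than needed, release resources
        return allocated - difference
    return allocated        # assumption: normal state keeps the current allocation


# Worked examples from the text.
print(target_resource_count("breach", 7, 2))
print(target_resource_count("idle", 7, 2))
```

A production scheduler would probably also enforce a lower bound of one resource and an upper bound set by the pool size; the text does not specify either, so they are left out here.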
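In a possible implementation, after step S205, in which resources equal in number to the target resource count are allocated to obtain the data of the collection object, the method further includes:
S206: Subtract the expected collection cycle from the new historical collection cycle to obtain a new cycle difference.
In this step, the new historical collection cycle may be the time taken by one acquisition performed with resources equal in number to the target resource count, or the average time taken over several such acquisitions.
S207: If the ratio of the new cycle difference to the expected collection cycle is less than a preset ratio, take the target resource count as a fixed resource count, and use resources equal in number to the fixed resource count to obtain the data of the collection object.
In this step, the preset ratio is, for example, 10%, 5%, or 0.02; the present application imposes no particular limitation. Once the fixed resource count is obtained, the step of adjusting the target resource count need not be performed again.
S208: If the ratio of the new cycle difference to the expected collection cycle is greater than or equal to the preset ratio, and the newly added data volume of the collection object within a preset number of cycles is greater than or equal to a preset value, repeat the step of adjusting the target resource count.
In this step, the step of adjusting the target resource count may be steps S201 to S205 above. The repetition of steps S201 to S205 may stop when the condition of step S207 is met. The newly added data volume of the collection object within the preset number of cycles may be the newly added data volume of any one of those cycles, or the average newly added data volume over them.
S209: If the ratio of the new cycle difference to the expected collection cycle is greater than or equal to the preset ratio, and the newly added data volume of the collection object within the preset number of cycles is less than the preset value, output an error report.
This step is similar to step S208 and is not repeated here. The error report may be a text report or a preset prompt message.
As can be seen from the above description, this embodiment takes the difference between the new historical collection cycle and the expected collection cycle, which reflects how much the cycle has lengthened or shortened. When the ratio of the cycle difference to the expected cycle is less than the preset ratio, the target resource count is taken as the fixed resource count and used for subsequent data acquisition. If the ratio is greater than or equal to the preset ratio and the newly added data volume within the preset number of cycles is greater than or equal to the preset value, the step of adjusting the target resource count is repeated. If the ratio is greater than or equal to the preset ratio and the newly added data volume is less than the preset value, an error report is output. In this way, the target resource count is used for collection once it matches the collection object, it is adjusted when the data of the collection object grows substantially, and when the data grows little but the new cycle takes longer than the original historical collection cycle, an error is assumed and an error report is output to prompt the user to investigate manually.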
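The three-way decision in steps S206 to S209 can be sketched as follows. The function name, the returned labels, and the default thresholds (a preset ratio of 10% and a preset value of 10 new items) are illustrative assumptions; the ratio is compared as a signed value, as the text states it.

```python
def next_action(new_cycle, expected_cycle, new_data_volume,
                preset_ratio=0.10, preset_volume=10):
    """Decide what to do after collecting with the target resource count."""
    cycle_difference = new_cycle - expected_cycle      # S206
    ratio = cycle_difference / expected_cycle

    if ratio < preset_ratio:                           # S207
        return "fix_resource_count"
    if new_data_volume >= preset_volume:               # S208
        return "readjust"
    return "report_error"                              # S209


print(next_action(126, 120, 50))   # ratio 0.05
print(next_action(150, 120, 50))   # ratio 0.25, plenty of new data
print(next_action(150, 120, 3))    # ratio 0.25, almost no new data
```

The last case is the one the text flags for manual investigation: the cycle still overruns although little new data arrived, so more resources are unlikely to be the remedy.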
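In a possible implementation, the resources in the present application may be threads, or bandwidth, memory, processor usage, etc. The collection object, comprehensive heat, allocated resource count, expected collection cycle, average newly added data volume, historical collection cycle, and/or task state in the present application may be stored in a table, called a baseline table. The target resource count is adjusted by periodically scanning the baseline table; an example baseline table is Table 1.
Table 1: Baseline table (illustrative)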
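Because the contents of Table 1 are not reproduced in this text, the following is only a sketch of what one row of such a baseline table and a periodic scan might look like. The column names follow the fields listed in the paragraph above; the class name, field names, sample values, and the rule for selecting rows are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class BaselineRow:
    """One hypothetical row of the baseline table."""
    collection_object: str      # web address being collected
    comprehensive_heat: float
    allocated_resources: int    # e.g. number of threads
    expected_cycle: float       # seconds
    average_new_volume: float   # items per recording cycle
    historical_cycle: float     # seconds
    state: str                  # "breach", "idle", or "normal"


def rows_to_adjust(rows):
    """Periodic scan: pick the rows whose resource count should change."""
    return [row for row in rows if row.state in ("breach", "idle")]


table = [
    BaselineRow("https://example.com/news", 68.0, 7, 120.0, 75.0, 150.0, "breach"),
    BaselineRow("https://example.com/blog", 30.0, 4, 600.0, 5.0, 610.0, "normal"),
]
print([row.collection_object for row in rows_to_adjust(table)])
```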
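FIG. 3 is a schematic structural diagram of the data acquisition resource quantity control apparatus provided by an embodiment of the present application. As shown in FIG. 3, the data acquisition resource quantity control apparatus 300 includes a first obtaining module 301, a first determination module 302, a calculation module 303, a second determination module 304, and a second obtaining module 305.
The first obtaining module 301 is configured to obtain, for any collection object, the collected data volume within a preset time, the volume of collected data that matches a preset hotspot, and the view count of the collected data, and to read the pre-stored historical collection cycles and the allocated resource count of the current collection cycle corresponding to the collection object, where the collection object includes a web address, the collected data includes the content collected from the web address, and a historical collection cycle is any collection cycle before the current collection cycle.
The first determination module 302 is configured to determine the collection state of the collection object according to the preset expected collection cycle and at least one historical collection cycle.
The calculation module 303 is configured to calculate the comprehensive heat of the collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle.
The second determination module 304 is configured to determine the target resource count of the collection object according to its collection state, historical collection cycle, expected collection cycle, comprehensive heat, and allocated resource count.
The second obtaining module 305 is configured to allocate resources equal in number to the target resource count to obtain the data of the collection object.
The apparatus provided in this embodiment can be used to carry out the technical solutions of the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.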
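In a possible implementation, the first determination module 302 is specifically configured to subtract the expected collection cycle from the average of at least one historical collection cycle of the collection object to obtain a cycle difference. If the ratio of the cycle difference to the expected collection cycle is greater than or equal to the first preset value, the collection state of the collection object is determined to be the breach state. If the ratio is less than or equal to the second preset value, the collection state is determined to be the idle state. If the ratio is less than the first preset value and greater than the second preset value, the collection state is determined to be the normal state.
The apparatus provided in this embodiment can be used to carry out the technical solutions of the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.
In a possible implementation, the calculation module 303 is specifically configured to calculate the historical heat of the collection object according to the matched data volume, the data view count, and the collected data volume. A preset number of historical collection cycles is determined as one recording cycle. The collected data volume at the start of the first recording cycle is subtracted from the collected data volume at the end of the first recording cycle to obtain the collected data volume of the first recording cycle, where the first recording cycle is the N-th recording cycle before the current time and N is a positive integer. The collected data volume at the start of the second recording cycle is subtracted from the collected data volume at the end of the second recording cycle to obtain the collected data volume of the second recording cycle, where the second recording cycle is the (N+1)-th recording cycle before the current time. The collected data volume of the second recording cycle is subtracted from that of the first recording cycle to obtain the newly added data volume. The newly added data volume is divided by the expected collection cycle and the logarithm is taken to obtain the actual heat of the collection object. The historical heat and the actual heat are mapped separately into a preset interval in a preset manner to obtain the mapped historical heat and the mapped actual heat. The weighted sum of the mapped historical heat and the mapped actual heat gives the comprehensive heat of the collection object.
The apparatus provided in this embodiment can be used to carry out the technical solutions of the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.
In a possible implementation, the calculation module 303 calculates the historical heat of the collection object from the matched data volume, the view count, and the collected data volume using the following formula:
where $hot_{history}$ denotes the historical heat of the collection object, $num_{match}$ denotes the matched data volume, $read_{num}$ denotes the data view count, $record_{num}$ denotes the collected data volume, A, B, and C are constants, and log denotes taking the logarithm.
The apparatus provided in this embodiment can be used to carry out the technical solutions of the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.
In a possible implementation, the second determination module 304 is specifically configured to divide the historical collection cycle of each collection object by its expected collection cycle to obtain the time over-limit ratio of each collection object. The comprehensive heat of each collection object is multiplied by its time over-limit ratio and the logarithm of the product is taken to obtain the over-limit heat value of each collection object. The resource count difference is determined according to the comprehensive heat, historical collection cycle, and expected collection cycle of the collection object and the maximum and minimum of the over-limit heat values of all collection objects. If the collection state of the collection object is the breach state, the resource count difference is added to its allocated resource count to obtain its target resource count. If the collection state is the idle state, the resource count difference is subtracted from its allocated resource count to obtain its target resource count.
The apparatus provided in this embodiment can be used to carry out the technical solutions of the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.
In a possible implementation, the second determination module 304 determines the resource count difference from the comprehensive heat, historical collection cycle, and expected collection cycle of the collection object and the maximum and minimum of the over-limit heat values of all collection objects using the following formula:
where $\Delta$ denotes the resource count difference, $V_{max}$ denotes the maximum of the over-limit heat values of all collection objects, $V_{min}$ denotes the minimum of the over-limit heat values of all collection objects, $hot_{combine}$ denotes the comprehensive heat of the collection object, $t_{real}$ denotes the historical collection cycle, $t_{expect}$ denotes the expected collection cycle, D, E, F, and G are constants, and log denotes taking the logarithm.
The apparatus provided in this embodiment can be used to carry out the technical solutions of the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.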
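In a possible implementation, the data acquisition resource quantity control apparatus 300 further includes a difference obtaining module 306, a third determination module 307, a resource adjustment module 308, and a report output module 309.
The difference obtaining module 306 is configured to subtract the expected collection cycle from the new historical collection cycle to obtain a new cycle difference.
The third determination module 307 is configured to, if the ratio of the new cycle difference to the expected collection cycle is less than the preset ratio, take the target resource count as the fixed resource count and use resources equal in number to the fixed resource count to obtain the data of the collection object.
The resource adjustment module 308 is configured to, if the ratio of the new cycle difference to the expected collection cycle is greater than or equal to the preset ratio and the newly added data volume of the collection object within the preset number of cycles is greater than or equal to the preset value, repeat the step of adjusting the target resource count.
The report output module 309 is configured to, if the ratio of the new cycle difference to the expected collection cycle is greater than or equal to the preset ratio and the newly added data volume of the collection object within the preset number of cycles is less than the preset value, output an error report.
The apparatus provided in this embodiment can be used to carry out the technical solutions of the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.
To implement the above embodiments, an embodiment of the present application further provides an electronic device.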
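Referring to FIG. 4, a schematic structural diagram of an electronic device 400 suitable for implementing the embodiments of the present application is shown. The electronic device 400 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (Portable Android Device, PAD), a portable multimedia player (Portable Media Player, PMP), and a vehicle-mounted terminal (for example, a vehicle-mounted navigation terminal), as well as fixed terminals such as a digital TV and a desktop computer. The electronic device shown in FIG. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in FIG. 4, the electronic device 400 may include a processing device 401 (for example, a central processing unit or a graphics processing unit), which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400. The processing device 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices may be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 407 including, for example, a liquid crystal display (LCD), speaker, or vibrator; a storage device 408 including, for example, a magnetic tape or hard disk; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 4 shows the electronic device 400 with various devices, it should be understood that not all of the devices shown are required to be implemented or present. More or fewer devices may alternatively be implemented or present.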
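In particular, according to the embodiments of the present application, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present application includes a computer program product, which includes a computer program carried on a computer-readable storage medium, and the computer program contains program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network through the communication device 409, installed from the storage device 408, or installed from the ROM 402. When the computer program is executed by the processing device 401, the above functions defined in the method of the embodiments of the present application are performed.
It should be noted that the computer-readable storage medium mentioned above in the present application may be a computer-readable signal medium, a computer storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. This propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable storage medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
The computer-readable storage medium may be included in the electronic device, or may exist independently without being installed in the electronic device.
The computer-readable storage medium carries one or more programs; when the one or more programs are executed by the electronic device, the electronic device executes the method shown in the above embodiments.
Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages. The program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, via the Internet using an Internet service provider).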
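The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules involved in the embodiments described in the present application may be implemented in software or in hardware. In some cases, the name of a unit does not limit the module itself; for example, the first determination module may also be described as "a module for determining the collection state of any collection object".
The functions described above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
The present application further provides a computer-readable storage medium storing computer execution instructions. When a processor executes the computer execution instructions, the technical solution of the data acquisition resource quantity control method in any of the above embodiments is implemented. Its implementation principle and beneficial effects are similar to those of the data acquisition resource quantity control method; refer to that description, which is not repeated here.
In the context of the present application, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The present application further provides a computer program product, including a computer program. When the computer program is executed by a processor, it implements the technical solution of the data acquisition resource quantity control method in any of the above embodiments. Its implementation principle and beneficial effects are similar to those of the data acquisition resource quantity control method; refer to that description, which is not repeated here.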
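The above description is only of preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.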

Claims (10)

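  1. A data acquisition resource quantity control method, characterized by comprising:
    obtaining, for any collection object, the collected data volume within a preset time, the volume of collected data that matches a preset hotspot, and the view count of the collected data, and reading the pre-stored historical collection cycles and the allocated resource count of the current collection cycle corresponding to the collection object, wherein the collection object comprises a web address, the collected data comprises the content collected from the web address, and a historical collection cycle is any collection cycle before the current collection cycle;
    determining the collection state of the collection object according to a preset expected collection cycle and at least one of the historical collection cycles;
    calculating the comprehensive heat of the collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle;
    determining the target resource count of the collection object according to the collection state, the historical collection cycle, the expected collection cycle, the comprehensive heat, and the allocated resource count of the collection object;
    allocating resources equal in number to the target resource count to obtain the data of the collection object.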
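  2. The method according to claim 1, characterized in that determining the collection state of the collection object according to the preset expected collection cycle and at least one of the historical collection cycles comprises:
    subtracting the expected collection cycle from the average of at least one of the historical collection cycles of the collection object to obtain a cycle difference;
    if the ratio of the cycle difference to the expected collection cycle is greater than or equal to a first preset value, determining the collection state of the collection object to be a breach state;
    if the ratio of the cycle difference to the expected collection cycle is less than or equal to a second preset value, determining the collection state of the collection object to be an idle state;
    if the ratio of the cycle difference to the expected collection cycle is less than the first preset value and greater than the second preset value, determining the collection state of the collection object to be a normal state.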
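  3. The method according to claim 1 or 2, characterized in that calculating the comprehensive heat of the collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle comprises:
    calculating the historical heat of the collection object according to the matched data volume, the data view count, and the collected data volume;
    determining a preset number of the historical collection cycles as one recording cycle;
    subtracting the collected data volume at the start of a first recording cycle from the collected data volume at the end of the first recording cycle to obtain the collected data volume of the first recording cycle, wherein the first recording cycle is the N-th recording cycle before the current time and N is a positive integer;
    subtracting the collected data volume at the start of a second recording cycle from the collected data volume at the end of the second recording cycle to obtain the collected data volume of the second recording cycle, wherein the second recording cycle is the (N+1)-th recording cycle before the current time;
    subtracting the collected data volume of the second recording cycle from the collected data volume of the first recording cycle to obtain a newly added data volume;
    dividing the newly added data volume by the expected collection cycle and taking the logarithm to obtain the actual heat of the collection object;
    mapping the historical heat and the actual heat separately into a preset interval in a preset manner to obtain a mapped historical heat and a mapped actual heat;
    computing the weighted sum of the mapped historical heat and the mapped actual heat to obtain the comprehensive heat of the collection object.
  4. The method according to claim 3, characterized in that the historical heat of the collection object is calculated according to the matched data volume, the data view count, and the collected data volume using the following formula:
    where $hot_{history}$ denotes the historical heat of the collection object, $num_{match}$ denotes the matched data volume, $read_{num}$ denotes the data view count, $record_{num}$ denotes the collected data volume, A, B, and C are constants, and log denotes taking the logarithm.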
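  5. The method according to any one of claims 2 to 4, characterized in that determining the target resource count of the collection object according to the collection state, the historical collection cycle, the expected collection cycle, the comprehensive heat, and the allocated resource count of the collection object comprises:
    dividing the historical collection cycle of each collection object by the expected collection cycle to obtain the time over-limit ratio of each collection object;
    multiplying the comprehensive heat of each collection object by the time over-limit ratio to obtain a product, and taking the logarithm of the product to obtain the over-limit heat value of each collection object;
    determining a resource count difference according to the comprehensive heat, the historical collection cycle, and the expected collection cycle of the collection object and the maximum and minimum of the over-limit heat values of all collection objects;
    if the collection state of the collection object is the breach state, adding the resource count difference to the allocated resource count of the collection object to obtain the target resource count of the collection object;
    if the collection state of the collection object is the idle state, subtracting the resource count difference from the allocated resource count of the collection object to obtain the target resource count of the collection object.
  6. The method according to claim 5, characterized in that the resource count difference is determined according to the comprehensive heat, the historical collection cycle, and the expected collection cycle of the collection object and the maximum and minimum of the over-limit heat values of all collection objects using the following formula:
    where $\Delta$ denotes the resource count difference, $V_{max}$ denotes the maximum of the over-limit heat values of all collection objects, $V_{min}$ denotes the minimum of the over-limit heat values of all collection objects, $hot_{combine}$ denotes the comprehensive heat of the collection object, $t_{real}$ denotes the historical collection cycle, $t_{expect}$ denotes the expected collection cycle, D, E, F, and G are constants, and log denotes taking the logarithm.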
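  7. The method according to any one of claims 1 to 6, characterized in that, after allocating resources equal in number to the target resource count to obtain the data of the collection object, the method further comprises:
    subtracting the expected collection cycle from a new historical collection cycle to obtain a new cycle difference;
    if the ratio of the new cycle difference to the expected collection cycle is less than a preset ratio, taking the target resource count as a fixed resource count, and using resources equal in number to the fixed resource count to obtain the data of the collection object;
    if the ratio of the new cycle difference to the expected collection cycle is greater than or equal to the preset ratio, and the newly added data volume of the collection object within a preset number of cycles is greater than or equal to a preset value, repeating the step of adjusting the target resource count;
    if the ratio of the new cycle difference to the expected collection cycle is greater than or equal to the preset ratio, and the newly added data volume of the collection object within the preset number of cycles is less than the preset value, outputting an error report.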
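  8. A data acquisition resource quantity control apparatus, characterized by comprising:
    a first obtaining module, configured to obtain, for any collection object, the collected data volume within a preset time, the volume of collected data that matches a preset hotspot, and the view count of the collected data, and to read the pre-stored historical collection cycles and the allocated resource count of the current collection cycle corresponding to the collection object, wherein the collection object comprises a web address, the collected data comprises the content collected from the web address, and a historical collection cycle is any collection cycle before the current collection cycle;
    a first determination module, configured to determine the collection state of the collection object according to a preset expected collection cycle and at least one of the historical collection cycles;
    a calculation module, configured to calculate the comprehensive heat of the collection object according to the matched data volume, the data view count, the collected data volume, and the expected collection cycle;
    a second determination module, configured to determine the target resource count of the collection object according to the collection state, the historical collection cycle, the expected collection cycle, the comprehensive heat, and the allocated resource count of the collection object;
    a second obtaining module, configured to allocate resources equal in number to the target resource count to obtain the data of the collection object.
  9. An electronic device, characterized by comprising: a processor, and a memory communicatively connected to the processor;
    the memory stores computer execution instructions;
    the processor executes the computer execution instructions stored in the memory, so that the processor performs the data acquisition resource quantity control method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer execution instructions which, when executed by a processor, implement the data acquisition resource quantity control method according to any one of claims 1 to 7.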
PCT/CN2023/106837 2022-10-14 2023-07-11 数据采集资源量控制方法、装置、设备及存储介质 WO2024078070A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211256657.5 2022-10-14
CN202211256657.5A CN115329179B (zh) 2022-10-14 2022-10-14 数据采集资源量控制方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2024078070A1 true WO2024078070A1 (zh) 2024-04-18

Family

ID=83914108

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/106837 WO2024078070A1 (zh) 2022-10-14 2023-07-11 数据采集资源量控制方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN115329179B (zh)
WO (1) WO2024078070A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329179B (zh) * 2022-10-14 2023-04-28 卡奥斯工业智能研究院(青岛)有限公司 数据采集资源量控制方法、装置、设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130041881A1 (en) * 2011-08-09 2013-02-14 Microsoft Corporation Optimizing web crawling with user history
CN105912552A (zh) * 2015-12-23 2016-08-31 乐视网信息技术(北京)股份有限公司 网页视频抓取的方法及网页视频抓取的终端设备
CN109388736A (zh) * 2018-09-21 2019-02-26 真相网络科技(北京)有限公司 爬虫系统中的响应调度方法
WO2019180489A1 (en) * 2018-03-21 2019-09-26 Pratik Sharma Frequency based distributed web crawling
CN112019451A (zh) * 2019-05-29 2020-12-01 中国移动通信集团安徽有限公司 带宽分配方法、调试网元、本地缓存服务器及计算设备
CN113536085A (zh) * 2021-06-23 2021-10-22 西华大学 基于组合预测法的主题词搜索爬虫调度方法及其系统
CN115329179A (zh) * 2022-10-14 2022-11-11 卡奥斯工业智能研究院(青岛)有限公司 数据采集资源量控制方法、装置、设备及存储介质

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090287684A1 (en) * 2008-05-14 2009-11-19 Bennett James D Historical internet
TW201137776A (en) * 2009-12-23 2011-11-01 Ibm A method and system to dynamically off-loading of batch workload a computing center to external cloud services
US8856321B2 (en) * 2011-03-31 2014-10-07 International Business Machines Corporation System to improve operation of a data center with heterogeneous computing clouds
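CN102446225A (zh) * 2012-01-11 2012-05-09 深圳市爱咕科技有限公司 Real-time search method, apparatus, and system
CN104951512A (zh) * 2015-05-27 2015-09-30 中国科学院信息工程研究所 Internet-based public opinion data collection method and system
CN105677489B (zh) * 2016-03-04 2017-06-20 山东大学 System and method for dynamically setting batch interval size under a discretized stream processing model
CN106649865A (zh) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Distributed server system and data processing method
CN109948087B (zh) * 2017-12-05 2021-11-16 Oppo广东移动通信有限公司 Web page resource acquisition method, apparatus, and terminal
CN110392085A (zh) * 2018-04-23 2019-10-29 中兴通讯股份有限公司 Web page pre-download method and apparatus, storage medium, and electronic apparatus
CN111881343A (zh) * 2020-07-07 2020-11-03 Oppo广东移动通信有限公司 Information push method, apparatus, electronic device, and computer-readable storage medium
CN113660699A (zh) * 2021-06-30 2021-11-16 齐喝彩(常熟)信息科技有限公司 Intelligent cluster networking method, apparatus, and electronic device
CN114780579A (zh) * 2022-05-05 2022-07-22 卡奥斯工业智能研究院(青岛)有限公司 Industrial Internet-based data search method, apparatus, device, and storage medium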

Also Published As

Publication number Publication date
CN115329179B (zh) 2023-04-28
CN115329179A (zh) 2022-11-11

Similar Documents

Publication Publication Date Title
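CN109299348B (zh) Data query method, apparatus, electronic device, and storage medium
CN110008045B (zh) Microservice aggregation method, apparatus, device, and storage medium
CN110704751B (zh) Data processing method, apparatus, electronic device, and storage medium
WO2024078070A1 (zh) Data collection resource amount control method, apparatus, device, and storage medium
CN108965951B (zh) Advertisement playback method and apparatus
CN110516159B (zh) Information recommendation method, apparatus, electronic device, and storage medium
WO2020207174A1 (zh) Method and apparatus for generating a quantized neural network
CN110765354A (zh) Information push method, apparatus, electronic device, and storage medium
CN111985831A (zh) Cloud computing resource scheduling method, apparatus, computer device, and storage medium
CN110852720A (zh) Document processing method, apparatus, device, and storage medium
WO2019232932A1 (zh) Node processing method and apparatus, computer-readable storage medium, and electronic device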
US10366094B2 (en) Data access using aggregation
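CN111414568B (zh) Information display method, apparatus, electronic device, and storage medium
CN112102043A (zh) Item recommendation page generation method, apparatus, electronic device, and readable medium
CN114257521B (zh) Traffic prediction method, apparatus, electronic device, and storage medium
WO2022242441A1 (zh) Spreadsheet import method, apparatus, device, and medium
CN112100211B (zh) Data storage method, apparatus, electronic device, and computer-readable medium
CN113485890B (zh) Flight query system service monitoring method, apparatus, device, and storage medium
CN111680754B (zh) Image classification method, apparatus, electronic device, and computer-readable storage medium
CN110222777B (zh) Image feature processing method, apparatus, electronic device, and storage medium
CN113760178A (zh) Cached data processing method, apparatus, electronic device, and computer-readable medium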
US20220050614A1 (en) System and method for approximating replication completion time
CN111143355B (zh) Data processing method and apparatus
CN108898446B (zh) Method and apparatus for outputting information
US20170032124A1 (en) Transmission of trustworthy data