
CN111259220A - Data acquisition method and system based on big data

Info

Publication number: CN111259220A
Application number: CN202010028465.3A
Authority: CN
Inventors: 罗水芳; 邵州华; 楼未吉
Original assignee: Hangzhou Sebe Intellectual Property Service Co ltd
Current assignee: Hangzhou Sebe Intellectual Property Service Co ltd
Priority date: 2020-01-11
Filing date: 2020-01-11
Publication date: 2020-06-09
Anticipated expiration: 2040-01-11
Also published as: CN111259220B

Abstract

The invention provides a data acquisition method and system based on big data. The method comprises: capturing a URL set of a data source to be acquired by using a web crawler, and acquiring the web pages corresponding to the URLs; calculating scores of the web pages based on a PageRank algorithm and a HITS algorithm respectively, obtaining a total value for each web page, and sorting the web pages by total value; parsing and acquiring the pictures and/or text in the web pages in order of priority of their total value, and acquiring the keywords contained in the web pages; calculating the relevance between the keywords and the web pages corresponding to the keywords, sending a consensus request to consensus nodes based on the relevance, and, in response to the relevance being greater than a preset second threshold and the consensus degree being greater than a preset third threshold, storing the keywords in a blockchain of the corresponding category, storing the web pages in a database arranged on the nodes of the blockchain, and establishing a mapping relation between the web pages and the keywords. Establishing the mapping between keywords and web pages in this way makes the acquired data more accurate, more strongly related, and backed by consensus.

Description

Data acquisition method and system based on big data

Technical Field

The invention relates to the field of data acquisition, in particular to a data acquisition method and a data acquisition system based on big data.

Background

With the rapid development of science, technology and engineering over the last 20 years, a large amount of data has been generated in many fields (such as optical observation, optical monitoring, health care, sensors, user data, internet and financial companies, and supply chain systems), and the concept of big data has attracted attention again. The data is perhaps more properly described as "infinite": in applications such as optical observation and monitoring, data arrives continuously, forming a "data disaster". Compared with traditional data, big data has unique characteristics beyond surface features such as large volume. For example, big data is generally unstructured and needs to be analyzed in real time, so its development requires a completely new architecture for the acquisition, transmission, storage and analysis of large-scale data.

The concept of big data has been taken seriously by various industries since 2008. Over the last 10 years, big data has evolved from a vague idea into actual productivity. Particularly in data-centered fields of information analysis such as financial early warning, public opinion monitoring and internet user preference analysis, the mass data generated by the daily information activities of intelligent information analysis systems contains activity patterns specific to each field. These patterns can be used to analyze how data evolves into information in the corresponding field and to promote the generation of decision-supporting information. Mining mass data in order to analyze historical data and information, and to guide future decision-making on that basis, has therefore become one of the key points of current intelligence research and work in every application field. However, although big data has already been put into practical use, the intelligence community's systematic understanding of the concept is still insufficient. Its specific definition, composition, core methods and technologies vary between application settings, and no consensus has yet been formed.

In essence, big data not only means a large volume of data, but also has features that distinguish it from "massive data" and "very big data". As big data has grown in popularity, its definitions have diversified, and reaching a consensus has become very difficult.

Disclosure of Invention

The invention provides a data acquisition method and system based on big data, which are used to solve the technical problems in the prior art, caused by the diversification of big data, that data acquisition is difficult and inefficient, that resource occupancy is too high, and that consensus on the acquired data is difficult to reach.

In one aspect, the present invention provides a big data-based data acquisition method, including the following steps:

S1: capturing a URL set of a data source to be acquired by using a web crawler, and acquiring the web pages corresponding to the URLs;

S2: calculating scores of the web pages based on a PageRank algorithm and a HITS algorithm respectively, weighting the calculation results to obtain a total value for each web page, and sorting the web pages by total value;

S3: in response to the total value being greater than a preset first threshold, parsing and acquiring the pictures and/or text in the web pages in order of priority of their total value, and acquiring the keywords contained in the web pages based on a text information extraction method;

S4: calculating the relevance between the keywords and the web pages corresponding to the keywords, sending a consensus request to consensus nodes based on the relevance, and, in response to the relevance being greater than a preset second threshold and the consensus degree being greater than a preset third threshold, storing the keywords in a blockchain of the corresponding category, storing the web pages in a database arranged on the nodes of the blockchain, and establishing a mapping relation between the web pages and the keywords.

Preferably, step S1 further includes deduplicating the URLs using a Bloom filter. A Bloom filter has significant advantages in space and time: its storage space is fixed and its insertion and query times are constant; its hash functions are independent of each other, which makes parallel hardware implementation convenient; and because a Bloom filter does not need to store the elements themselves, it is advantageous in settings with strict confidentiality requirements.

Preferably, the calculation formula of the PageRank algorithm in step S2 is specifically as follows:

where $PR_j$ denotes the PageRank value of the $j$-th web page, $N$ denotes the number of web pages, $I_{i,j}$ is a zero-one variable (its value is 1 if page $i$ links to page $j$, and 0 otherwise), $n_i$ denotes the number of links from web page $i$ to other pages, and $d$ is a decay factor.
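The symbols defined above match the standard damped PageRank recurrence. Assuming that is the formula intended here (the exact form, in particular the $(1-d)/N$ term, is an assumption and is not confirmed by the text), it can be written as:

```latex
PR_j = \frac{1 - d}{N} + d \sum_{i=1}^{N} I_{i,j} \, \frac{PR_i}{n_i}
```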

Preferably, the calculation formula of the HITS algorithm in step S2 is specifically:

where the Authority value of web page $i$ is $A_i$, its Hub value is $H_i$, and $E$ represents the links from web page $j$ to web page $i$.
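Under the definitions above, the usual mutually recursive HITS updates take the following form. This is offered as an assumption about the intended formula, reading $E$ as the set of links and $(j, i) \in E$ as a link from page $j$ to page $i$:

```latex
A_i = \sum_{(j,\,i) \in E} H_j , \qquad H_i = \sum_{(i,\,j) \in E} A_j
```

In practice both vectors are typically renormalized after each update so that the values stay bounded.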

Preferably, the weight values of the PageRank algorithm and the HITS algorithm are the same and are both 50%. The results obtained by the weighted calculation of the two algorithms are more accurate.

Preferably, the text information extraction method in step S3 includes an information extraction method based on language rule templates, an information extraction method based on statistical methods, an information extraction method based on statistical machine learning, and a graph-based information extraction method. Different text information extraction methods have different application scenarios, advantages and disadvantages, and can therefore meet different extraction requirements.

Further preferably, the keywords in step S3 are obtained as follows: the keywords of the web page are determined separately by the information extraction method based on language rule templates, the information extraction method based on statistical methods, the information extraction method based on statistical machine learning and the graph-based information extraction method, and the keywords on which the results agree are taken as the keywords of the web page. Because the keywords obtained by the different extraction methods are judged together, the final keywords represent the information of the web page more accurately.
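The agreement rule described above can be sketched as a set intersection. This is a minimal illustration only: the function name `consensus_keywords` and the idea that each extractor returns a plain collection of strings are assumptions, not something the text specifies.

```python
def consensus_keywords(candidate_sets):
    """Keep only the keywords returned by every extraction method.

    candidate_sets: one iterable of keywords per extraction method
    (rule template, statistical, machine learning, graph-based).
    """
    sets = [set(candidates) for candidates in candidate_sets]
    # With no extractors at all there is nothing to agree on.
    return set.intersection(*sets) if sets else set()
```

A strict intersection is the most literal reading of "keywords with the same result"; a majority vote across extractors would be a looser alternative.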

Further preferably, the information extraction method based on statistical methods includes a TF-IDF feature calculation method and a KF-IDF feature calculation method.

Preferably, step S4 further includes storing the web page, the relevancy, the total value and the ranking result thereof in a database. Storing the web pages, the relevancy, the total value and the sorting result thereof in the database can conveniently call the corresponding web pages, the relevancy, the total value and the sorting result thereof by using the keywords.

Further preferably, in step S4, the condition that the relevance is greater than the preset second threshold and the consensus degree is greater than the preset third threshold is specifically as follows: a plurality of consensus nodes in the blockchain distributed network separately perform relevance calculations on the keywords, and, based on a Byzantine fault-tolerant consensus mechanism, when more than two thirds of the consensus nodes calculate a keyword relevance greater than the second threshold, consensus is reached and the keywords are written to the blockchain.

Preferably, the specific calculation method of the correlation degree in step S4 is as follows:

where $R_n = TF_{t_n} \cdot TR_{t_n}$, $TF_{t_n}$ is the term frequency of term $t$ in the current text, $TR_{t_n}$ represents the weight of $t$ in the current keyword set, and $n$ is the number of keywords.
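Based on the detailed description, which takes the length of the score vector as the relevance weight of the document, the relevance can plausibly be written as the Euclidean norm of the components $R_i$. The index $i$ and the name $\mathrm{relevance}$ are notational assumptions:

```latex
\mathrm{relevance}(doc, Q) = \lVert R \rVert = \sqrt{\sum_{i=1}^{n} R_i^{2}}
  = \sqrt{\sum_{i=1}^{n} \left( TF_{t_i} \cdot TR_{t_i} \right)^{2}}
```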

According to a second aspect of the invention, a computer-readable storage medium is proposed, on which one or more computer programs are stored, which when executed by a computer processor implement the above-mentioned method.

According to a third aspect of the present invention, a big data based data acquisition system is provided, the system comprising:

a web page acquisition unit, configured to capture a URL link set of a data source to be acquired by using a web crawler, and to acquire the web pages corresponding to the URLs;

an evaluation unit, configured to calculate scores of the web pages based on a PageRank algorithm and a HITS algorithm respectively, weight the calculation results to obtain a total value for each web page, and sort the web pages by total value;

a text recognition unit, configured to, in response to the total value being greater than a preset first threshold, parse and acquire the pictures and/or text in the web pages in order of priority of their total value, and acquire the keywords contained in the web pages based on a text information extraction method;

a data mapping unit, configured to calculate the relevance between the keywords and the web pages corresponding to the keywords, send a consensus request to consensus nodes based on the relevance, and, in response to the relevance being greater than a preset second threshold and the consensus degree being greater than a preset third threshold, store the keywords in a blockchain of the corresponding category, store the web pages in a database arranged on the nodes of the blockchain, and establish a mapping relation between the web pages and the keywords.

The invention provides a data acquisition method and system based on big data. The method captures the web pages corresponding to the URLs to be acquired by means of web crawler technology, calculates a total value for each web page, parses and acquires the web pages in order of priority of their total value, and acquires the keywords contained in the web pages based on a text information extraction method. By calculating the relevance between the keywords and their corresponding web pages, and by using the consensus mechanism of blockchain technology, the keywords are stored in blockchains of the corresponding categories, the web pages are stored in a database arranged on the nodes of the blockchain, and a mapping relation between the web pages and the keywords is established. The method effectively addresses the difficulty of reaching consensus that results from the increasingly diverse definitions of big data. Because web page data is collected in the form of keywords, the data collection workload is reduced; all information of the web pages corresponding to a keyword can be retrieved quickly through the mapping; resource occupancy is reduced; and the collected data is more valuable.

Drawings

The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the invention. Other embodiments and many of the intended advantages of embodiments will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow chart of a big data based data collection method according to an embodiment of the present application;

FIG. 3 is a block diagram of a big data based data collection system according to an embodiment of the present application;

FIG. 4 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 illustrates an exemplary system architecture 100 to which the big-data based data collection method of the embodiments of the present application may be applied.

As shown in FIG. 1, system architecture 100 may include a data server 101, a network 102, and a main server 103. Network 102 serves as a medium for providing a communication link between data server 101 and main server 103. Network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The main server 103 may be a server that provides various services, such as a data processing server that processes information uploaded by the data server 101. The data processing server may perform big data based data collection.

It should be noted that the big data based data collection method provided in the embodiment of the present application is generally executed by the main server 103, and accordingly, the apparatus of the big data based data collection method is generally disposed in the main server 103.

The data server and the main server may be hardware or software. When implemented as hardware, each may be a distributed server cluster consisting of a plurality of servers, or a single server. When implemented as software, each may be multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or a single piece of software or software module.

It should be understood that the number of data servers, networks, and main servers in fig. 1 is merely illustrative. There may be any number of data servers, networks, and main servers, as required by the implementation.

Fig. 2 shows a flowchart of a big data based data collection method according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:

S201: capturing a URL set of a data source to be acquired by using a web crawler, and acquiring the web pages corresponding to the URLs.

In a specific embodiment, the web crawler is provided with an initial set of URL links to crawl, which serves as the address source for the crawler's visits. The crawler collects the new URL links found in each visited web page and adds them to the URL set as further addresses to visit. Crawling is generally performed by multiple worker threads at the same time: each worker thread takes a new address from the URL set, removes it from the set, issues an HTTP request to that address, and downloads the HTML file. The absolute path of a URL begins with a scheme (e.g., http) that determines the network protocol used for the download. The crawler fetches HTTP links, and each downloaded document has an associated MIME type. HTML documents are downloaded over HTTP, and the picture links in an HTML page are handed to a picture-download thread.
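The crawl loop described above can be sketched as follows. This is a simplified, single-threaded illustration rather than the multi-threaded crawler the embodiment describes. The names `LinkExtractor` and `crawl` are hypothetical, and the `fetch` callable (which would perform the HTTP request) is supplied by the caller so the sketch stays independent of any particular HTTP library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects <a href> targets and <img src> targets as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns the HTML text or None."""
    frontier = deque(seed_urls)       # the URL set still to be visited
    seen = set(seed_urls)             # exact set here; a Bloom filter could replace it
    pages, image_queue = {}, []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()      # take an address and remove it from the set
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        image_queue.extend(parser.images)   # left for a separate picture downloader
        for link in parser.links:
            # Only follow http(s) links that have not been queued before.
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages, image_queue
```

In a multi-threaded version the frontier and the seen-set would need to be shared safely between workers, for example through `queue.Queue` and a lock.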

In a specific embodiment, the extracted URL links need to be filtered. The links between web pages form a large connected graph, and repeatedly visiting the same URL produces endless loops along its paths, so repeated visits must be avoided. URL deduplication is implemented with a Bloom filter, which uses a long binary vector and a series of random mapping (hash) functions to reduce the space occupied by the URL addresses and the time needed to compare them. Alternatively, other deduplication methods may be adopted, for example storing the URL addresses directly in a file or database and querying it for each URL to be filtered, or hashing the URL addresses with a specific hash function to reduce the space occupied by the URL set; these can also achieve the technical effect of the invention.
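A Bloom filter of the kind described above can be sketched in a few lines. The bit-array size, the number of hash functions, and the double-hashing scheme built on SHA-256 are illustrative assumptions; the text does not specify them.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit vector with k hash positions per item.

    May report false positives (a new URL reported as seen) but never
    false negatives.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd, so steps vary
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Usage is `seen = BloomFilter()`, then `if url not in seen: seen.add(url)` before queuing a URL. The trade-off is that a small fraction of unseen URLs may be skipped as false positives, in exchange for memory use that does not grow with the number of URLs.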

In a preferred embodiment, page content can also be deduplicated to avoid capturing repeated pages during crawling. A fingerprint (e.g., an MD5 value) is generated for each downloaded page, and when the fingerprints of two documents are identical, the documents are treated as the same and the duplicate is discarded, which avoids occupying a large amount of storage space.
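Content deduplication by fingerprint can be sketched as follows. The function names and the choice to keep the first URL seen for each distinct content are assumptions made for illustration.

```python
import hashlib


def fingerprint(html):
    """MD5 digest of the page text, used only as a content fingerprint."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def deduplicate_pages(pages):
    """pages: dict mapping URL -> HTML text.

    Returns a dict that keeps the first URL seen for each distinct content.
    """
    seen, unique = set(), {}
    for url, html in pages.items():
        fp = fingerprint(html)
        if fp not in seen:
            seen.add(fp)
            unique[url] = html
    return unique
```

An exact hash only catches byte-identical pages; pages that differ in a timestamp or an advertisement would need a near-duplicate technique instead.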

S202: calculating scores of the web pages based on a PageRank algorithm and a HITS algorithm respectively, weighting the calculation results to obtain a total value for each web page, and sorting the web pages by total value.

In a particular embodiment, the PageRank algorithm was originally Google's web page ranking algorithm. The PageRank algorithm attaches a weight to each target web page: pages with a large weight are displayed earlier, and pages with a small weight are displayed later. If a web page is linked to by many other web pages, the web page is considered important, i.e., its PageRank value is relatively high; if a web page with a high PageRank value links to another web page, the PageRank value of the linked web page increases accordingly. The PageRank algorithm therefore weights web pages not only by the number of pages linking to them but also by the importance of those pages, and the weight it assigns to each web page is expressed as a PR value. The calculation formula of the PageRank algorithm is specifically as follows:
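The iterative computation behind PageRank can be sketched as below. This is a minimal sketch of the standard power iteration, assuming a decay factor of 0.85 and assuming that the rank of pages without outgoing links is spread evenly over all pages. Neither choice is stated in the text.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to.

    Returns a dict page -> PR value; the values sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is redistributed evenly
        # (an assumption; other treatments of dangling pages exist).
        dangling = sum(pr[p] for p in pages if not links.get(p))
        new = {p: (1 - d) / n + d * dangling / n for p in pages}
        for i, targets in links.items():
            for j in targets:
                # Page i passes an equal share of its rank along each link.
                new[j] += d * pr[i] / len(targets)
        pr = new
    return pr
```

For example, `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})` ranks page `c` above page `b`, because `c` receives links from both `a` and `b`.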

In a particular embodiment, the HITS algorithm is representative of subset propagation algorithms. In the HITS algorithm, pages are divided into Hub pages and Authority pages: an Authority page is a high-quality page related to a certain field or topic, and a Hub page is a web page containing many links that point to high-quality Authority pages. Under the HITS algorithm, after a user enters a keyword, the algorithm calculates two values for each returned matching page, a Hub value (Hub Score) and an Authority value (Authority Score), which depend on and influence each other. The Hub value of a page is the sum of the Authority values of all pages that its outgoing links point to. The Authority value of a page is the sum of the Hub values of all pages whose links point to it. The calculation formula of the HITS algorithm is specifically as follows:
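The mutual dependence of the two values can be sketched as an iteration that alternates the two sums and renormalizes. The L2 normalization and the fixed iteration count are conventional choices assumed here, not details taken from the text.

```python
import math


def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to.

    Returns (authority, hub) dicts, each normalized to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority value: sum of the hub values of pages linking in.
        auth = {p: 0.0 for p in pages}
        for j, targets in links.items():
            for i in targets:
                auth[i] += hub[j]
        # Hub value: sum of the authority values of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the values stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub
```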

In a preferred embodiment, the PageRank algorithm and the HITS algorithm are given the same weight of 50% each, and the total value of a web page obtained from the weighted sum of the two algorithms is more representative and accurate. Alternatively, the weights of the PageRank algorithm and the HITS algorithm may be set to different proportions, for example 40% and 60%, according to the requirements of the actual application, so that the resulting value is more accurate.
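The weighted combination can be sketched as below. Two points are assumptions, since the text does not settle them: the Authority value is used as the HITS score, and each score set is rescaled by its maximum so that the two are comparable before weighting.

```python
def total_value(pagerank_scores, authority_scores, w_pr=0.5, w_hits=0.5):
    """Combine two score dicts into a ranked list of (page, total value)."""

    def normalize(scores):
        # Rescale to [0, 1] by the maximum (an assumed normalization).
        top = max(scores.values(), default=0.0) or 1.0
        return {p: v / top for p, v in scores.items()}

    pr, au = normalize(pagerank_scores), normalize(authority_scores)
    total = {
        p: w_pr * pr.get(p, 0.0) + w_hits * au.get(p, 0.0)
        for p in set(pr) | set(au)
    }
    # Highest total value first, which is the order used for later parsing.
    return sorted(total.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `total_value(pr, auth, w_pr=0.4, w_hits=0.6)` corresponds to the alternative 40%/60% weighting mentioned above.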

S203: in response to the total value being greater than a preset first threshold, parsing and acquiring the pictures and/or text in the web pages in order of priority of their total value, and acquiring the keywords contained in the web pages based on a text information extraction method. Sorting by total value yields a value ranking of the web pages, and the first threshold filters out web pages of little value, which avoids meaningless data acquisition and improves both the efficiency and the quality of data acquisition.

In a specific embodiment, the text information extraction methods specifically include: rule template methods based on natural language processing, traditional statistical methods, methods based on statistical machine learning, and graph-based methods. Information extraction based on natural language processing extracts and summarizes frequently occurring rule patterns through contextual part-of-speech analysis, syntactic analysis and dependency analysis. This approach has evolved through judging important concepts on the basis of "nouns", "compound noun terms", "text-structure-weighted terms", and so on. Information extraction based on statistical methods obtains text concepts statistically; it rests on the theory of term co-occurrence and works by finding how terms of the same kind differ in their statistical characteristics. Information extraction based on machine learning takes context into account and treats the extraction of entity relationships as a semantic classification problem, extending the semantic relationships from hierarchical to non-hierarchical relationships. Graph-based information extraction generally represents concepts as nodes of a graph and relationships between concepts as edges, and measures the distance between concepts by the number of edges between them. Graph-based information extraction integrates various information extraction methods and obtains an overall layout of the concepts in a text and their relationships, making it a more integrated and complete approach.

In a specific embodiment, information extraction based on statistical methods may use TF-IDF feature calculation. TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus. Alternatively, a KF-IDF feature calculation method can be adopted, which can also achieve the technical effect of the invention.
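TF-IDF as described above can be sketched as follows. This uses the plain textbook variant (relative term frequency times the natural logarithm of the inverse document frequency, without smoothing). The text does not say which variant is intended, so that choice is an assumption, as is the input format of pre-tokenized documents.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists.

    Returns one {term: score} dict per document.
    """
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

A term that occurs in every document gets a score of zero, which is how the method discounts words that do not distinguish one page from another. The highest-scoring terms of a page are natural keyword candidates.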

S204: calculating the relevance between the keywords and the web pages corresponding to the keywords, sending a consensus request to consensus nodes based on the relevance, and, in response to the relevance being greater than a preset second threshold and the consensus degree being greater than a preset third threshold, storing the keywords in a blockchain of the corresponding category, storing the web pages in a database arranged on the nodes of the blockchain, and establishing a mapping relation between the web pages and the keywords. The whole process has no centralized link, which ensures the security of the data.

In a specific embodiment, the relevance calculation uses a relevance function, with the Query set serving as the topic description set, to calculate the relevance between the current HTML document and the topic. Term frequency and term weight are important factors in evaluating text weight. The TF-IDF model uses the IDF (inverse document frequency) as the distinguishing weight of a word, so that words with a low document frequency are more discriminative. The TextRank value is taken as the weight of a single word in the topic-related document set. After the HTML tags, links and metadata information in the document are extracted, the term frequency is

$$TF_t = \frac{N_t}{N_d}$$

where $N_t$ is the number of occurrences of term $t$ in the current document, $N_d$ is the total number of words in the current document, and $TF_t$ is the term frequency of term $t$ in the current document. $TR_t$ denotes the weight of $t$ in the current Query, and $TF_t \cdot TR_t$ is taken as the size of the document in the $t$ direction. For the keywords $T_1, T_2, \ldots, T_k$ in the Query, the document has a coordinate $R_i = TF_{t_i} \cdot TR_{t_i}$ in each direction, so each document has a score vector $R^T = (R_1, R_2, \ldots, R_k)$. The length $\lVert R \rVert$ of $R$ is used as the relevance weight of the document, with

$$\lVert R \rVert = \sqrt{R^T R}$$

That is, for the current document doc and keyword set Q, the relevance is calculated as:

$$\mathrm{relevance}(doc, Q) = \lVert R \rVert = \sqrt{\sum_{i=1}^{k} \left( TF_{t_i} \cdot TR_{t_i} \right)^2}$$
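The relevance described above, the length of a score vector whose components are term frequency times keyword weight, can be sketched as follows. The keyword weights are passed in as a plain dictionary. How they are obtained (for example from TextRank) is outside this sketch, and the function name is hypothetical.

```python
import math
from collections import Counter


def relevance(doc_tokens, keyword_weights):
    """doc_tokens: tokens of the document after markup has been stripped.
    keyword_weights: {keyword: weight of that keyword in the Query}.

    Returns the Euclidean length of the vector R with R_i = TF_i * TR_i.
    """
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    n_d = len(doc_tokens)            # total number of words in the document
    components = [
        (counts[term] / n_d) * weight    # TF_t * TR_t for one keyword
        for term, weight in keyword_weights.items()
    ]
    return math.sqrt(sum(r * r for r in components))
```

A keyword that never occurs in the document contributes a zero component, so only keywords actually present raise the relevance.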

In a specific embodiment, the condition that the relevance is greater than the preset second threshold and the consensus degree is greater than the preset third threshold is specifically as follows: a plurality of consensus nodes in the blockchain distributed network separately perform relevance calculations on the keywords, and, based on a Byzantine fault-tolerant consensus mechanism, when more than two thirds of the consensus nodes calculate a keyword relevance greater than the second threshold, consensus is reached and the keywords are written to the blockchain. With a Byzantine fault-tolerant consensus mechanism, the nodes reach agreement faster and with lower delay, the throughput of the whole network is greatly improved, and the power-hungry proof-of-work approach is not needed, so the method is more energy-saving and environmentally friendly. Alternatively, a consensus mechanism other than Byzantine fault tolerance may be used, for example a proof-of-work mechanism, a proof-of-stake mechanism, a delegated proof-of-stake mechanism, or a voting mechanism, which can also achieve the technical effect of the invention.
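The acceptance rule described above can be sketched as a simple vote count. This shows only the two-thirds decision, not a Byzantine fault-tolerant protocol (message rounds, signatures and view changes are all omitted), and the function name and input format are assumptions.

```python
def reaches_consensus(node_relevances, second_threshold):
    """node_relevances: the relevance computed independently by each
    consensus node for the same keyword.

    Returns True when strictly more than two thirds of the nodes report
    a relevance above the second threshold.
    """
    n = len(node_relevances)
    if n == 0:
        return False
    approvals = sum(1 for r in node_relevances if r > second_threshold)
    # Integer comparison avoids floating-point issues at exactly 2/3.
    return 3 * approvals > 2 * n
```

With four nodes, three approvals are enough to write the keyword to the chain; with three nodes, all three must approve, since two out of three is not strictly more than two thirds.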

In a preferred embodiment, the method further comprises storing the web page, the relevance, the total value and its ranking result in a database. Through the mapping relation between the keywords and the web pages, all data of a web page, together with its relevance, total value and ranking result, can be retrieved, so that the data is more strongly related and the collected data is more valuable. Because the consensus node technology of the blockchain is used, the whole process has no centralized link, which ensures the security of the data.

With continued reference to FIG. 3, FIG. 3 illustrates a big data based data acquisition system according to an embodiment of the present invention. The system specifically comprises a webpage acquisition unit 301, an evaluation unit 302, a text recognition unit 303 and a data mapping unit 304.

In a specific embodiment, the web page acquisition unit 301 is configured to capture a URL link set of a data source to be acquired by using a web crawler, and to acquire the web pages corresponding to the URLs; the evaluation unit 302 is configured to calculate scores of the web pages based on a PageRank algorithm and a HITS algorithm respectively, weight the calculation results to obtain a total value for each web page, and sort the web pages by total value; the text recognition unit 303 is configured to, in response to the total value being greater than a preset first threshold, parse and acquire the pictures and/or text in the web pages in order of priority of their total value, and acquire the keywords contained in the web pages based on a text information extraction method; and the data mapping unit 304 is configured to calculate the relevance between the keywords and the web pages corresponding to the keywords, send a consensus request to consensus nodes based on the relevance, and, in response to the relevance being greater than a preset second threshold and the consensus degree being greater than a preset third threshold, store the keywords in a blockchain of the corresponding category, store the web pages in a database arranged on the nodes of the blockchain, and establish a mapping relation between the web pages and the keywords.

Referring now to FIG. 4, shown is a block diagram of a computer system 400 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU)401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as necessary, so that a computer program read from it can be installed into the storage section 408 as needed.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 401. It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present application may be implemented by software or hardware.

As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: crawl the URL set of a data source to be acquired by using a web crawler, and acquire the web pages corresponding to the URLs; calculate scores of the web pages based on the PageRank algorithm and the HITS algorithm respectively, weight the calculation results to obtain a total value for each web page, and sort the web pages by total value; in response to the total value being greater than a preset first threshold, parse and acquire the pictures and/or text in the web pages in priority order of total value, and obtain the keywords contained in each web page based on a text information extraction method; and calculate the relevance between the keywords and the web page to which they correspond, send a consensus request to the consensus nodes based on the relevance, and, in response to the relevance being greater than a preset second threshold and the consensus degree being greater than a preset third threshold, store the keywords in the blockchain of the corresponding category, store the web page in a database arranged on a node of the blockchain, and establish a mapping relation between the web page and the keywords.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A data acquisition method based on big data is characterized by comprising the following steps:

S1: crawling the URL set of a data source to be acquired by using a web crawler, and acquiring the web pages corresponding to the URLs;

S2: calculating scores of the web pages based on a PageRank algorithm and a HITS algorithm respectively, weighting the calculation results to obtain a total value for each web page, and sorting the web pages by total value;

S3: in response to the total value being greater than a preset first threshold, parsing and acquiring the pictures and/or text in the web pages in priority order of total value, and obtaining the keywords contained in each web page based on a text information extraction method;

S4: calculating the relevance between the keywords and the web page to which they correspond, sending a consensus request to the consensus nodes based on the relevance, and, in response to the relevance being greater than a preset second threshold and the consensus degree being greater than a preset third threshold, storing the keywords in the blockchain of the corresponding category, storing the web page in a database arranged on a node of the blockchain, and establishing a mapping relation between the web page and the keywords.

2. The big-data-based data acquisition method according to claim 1, wherein the step S1 further comprises deduplicating the URLs with a Bloom filter.
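By way of illustration only, URL deduplication with a Bloom filter can be sketched as follows. The class and function names, the bit-array size, the number of hash functions, and the use of SHA-256 with double hashing are all assumptions made for this sketch; the application does not specify them.

```python
import hashlib


class BloomFilter:
    """Hypothetical minimal Bloom filter; sizes are illustrative assumptions."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def dedupe_urls(urls, bloom=None):
    """Return URLs in first-seen order, skipping any the filter reports as seen."""
    if bloom is None:
        bloom = BloomFilter()
    fresh = []
    for url in urls:
        if url not in bloom:
            bloom.add(url)
            fresh.append(url)
    return fresh
```

A Bloom filter never reports a stored URL as unseen, but it can occasionally report an unseen URL as already stored, so a crawler built this way may skip a small fraction of new URLs in exchange for constant memory use.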

3. The big-data-based data acquisition method according to claim 1, wherein the calculation formula of the PageRank algorithm in the step S2 is specifically as follows:

wherein PR_j represents the PageRank value of the j-th web page, N represents the number of web pages, I_{i,j} is a zero-one variable (its value is 1 if page i references page j, and 0 otherwise), n_i represents the number of links to other pages in web page i, and d is the attenuation factor; the calculation formula of the HITS algorithm is specifically as follows:

wherein the Authority value of web page i is A_i and its Hub value is H_i, E represents the links in which web page j points to web page i, and the weights of the PageRank algorithm and the HITS algorithm are equal, each being 50%.
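The formulas referred to in this claim are not reproduced in the present text. As a hedged sketch only, the following Python fragment uses the standard textbook forms of PageRank and HITS, which are consistent with the symbol definitions given here (PR_j, N, I_{i,j}, n_i, d, A_i, H_i). The choice of the Authority score as the HITS contribution, the normalization of both scores to sum to 1 before weighting, the value d = 0.85, and all function names are assumptions, not statements of the claimed method.

```python
def _normalize(scores):
    """Scale scores so they sum to 1 (left unchanged if the sum is 0)."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()} if total else dict(scores)


def pagerank(links, d=0.85, iterations=50):
    """links[i] is the set of pages that page i links to; every target must be a key."""
    pages = sorted(links)
    n = len(pages)
    if n == 0:
        return {}
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Standard form: PR_j = (1 - d) / N + d * sum_i I_ij * PR_i / n_i
        new = {p: (1.0 - d) / n for p in pages}
        for i in pages:
            if links[i]:  # simplification: pages with no out-links pass on nothing
                share = d * pr[i] / len(links[i])
                for j in links[i]:
                    new[j] += share
        pr = new
    return pr


def hits(links, iterations=50):
    """Return (authority, hub) scores, renormalized after every iteration."""
    pages = sorted(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A_i = sum of H_j over links j -> i
        auth = {p: 0.0 for p in pages}
        for j in pages:
            for i in links[j]:
                auth[i] += hub[j]
        # H_j = sum of A_i over links j -> i
        hub = {j: sum(auth[i] for i in links[j]) for j in pages}
        auth, hub = _normalize(auth), _normalize(hub)
    return auth, hub


def total_values(links, w_pagerank=0.5, w_hits=0.5):
    """Weighted total value per page; equal 50% weights by default."""
    pr = _normalize(pagerank(links))
    auth, _ = hits(links)
    return {p: w_pagerank * pr[p] + w_hits * auth[p] for p in links}


# Toy link graph (hypothetical): a -> c, b -> c, c -> a.
links = {"a": {"c"}, "b": {"c"}, "c": {"a"}}
totals = total_values(links)
ranking = sorted(totals, key=totals.get, reverse=True)
```

In the toy graph, page c is linked from both other pages and therefore ranks first, page a ranks second, and page b, which receives no links, ranks last.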

4. The big-data-based data acquisition method according to claim 1, wherein the text information extraction method in step S3 includes an information extraction method based on a language rule template, an information extraction method based on a statistical method, an information extraction method based on statistical machine learning, and a graph-based information extraction method.

5. The big-data-based data acquisition method according to claim 4, wherein the keywords in step S3 are obtained as follows: keywords of the web page are determined separately by the information extraction method based on the language rule template, the information extraction method based on the statistical method, the information extraction method based on statistical machine learning, and the graph-based information extraction method, and the keywords on which the results agree are taken as the keywords of the web page, wherein the information extraction method based on the statistical method comprises a TF-IDF feature calculation method and a KF-IDF feature calculation method.
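As an illustrative sketch of this agreement step, the fragment below keeps only the keywords returned by every extraction method, and shows a simple TF-IDF scorer as one possible statistical method. The function names, the smoothed IDF variant, the `top_k` cut-off, and the reading of "same result" as a set intersection are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Top-k terms of one tokenized document by TF-IDF against a tokenized corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF (an assumed variant) avoids division by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return set(ranked[:top_k])


def agreed_keywords(*keyword_sets):
    """Keywords that every extraction method returned (set intersection)."""
    if not keyword_sets:
        return set()
    return set.intersection(*(set(s) for s in keyword_sets))
```

For example, if three hypothetical extractors return {"crawler", "blockchain", "page"}, {"blockchain", "crawler"}, and {"crawler", "hits", "blockchain"}, the agreed keywords are "crawler" and "blockchain".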

6. The big-data-based data acquisition method according to claim 1, wherein the database comprises one or a combination of a Redis database, a MongoDB database, and the Hadoop Distributed File System (HDFS), and the step S4 further comprises storing the web page, the relevance, the total value, and its ranking in the database.

7. The big-data-based data acquisition method according to claim 6, wherein whether the relevance is greater than the preset second threshold and the consensus degree is greater than the preset third threshold in step S4 is determined as follows: a plurality of consensus nodes in the distributed network of the blockchain each perform a different relevance calculation on the keywords, and, based on a Byzantine fault-tolerant consensus mechanism, when more than two thirds of the consensus nodes calculate a relevance of the keywords greater than the second threshold, consensus is reached and the keywords are written into the blockchain.
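The two-thirds rule can be sketched as a simple vote count. This is an illustration under stated assumptions, not an implementation of a Byzantine fault-tolerant protocol: message exchange between nodes, signatures, and view changes are all omitted, and the function names and the strict "more than two thirds" reading are hypothetical.

```python
def consensus_degree(node_relevances, second_threshold):
    """Fraction of consensus nodes whose computed relevance exceeds the threshold."""
    if not node_relevances:
        return 0.0
    approvals = sum(1 for r in node_relevances if r > second_threshold)
    return approvals / len(node_relevances)


def consensus_reached(node_relevances, second_threshold):
    """True when strictly more than two thirds of the nodes approve.

    Integer arithmetic avoids floating-point error at the 2/3 boundary.
    """
    if not node_relevances:
        return False
    approvals = sum(1 for r in node_relevances if r > second_threshold)
    return 3 * approvals > 2 * len(node_relevances)
```

With four nodes reporting relevance values 0.9, 0.8, 0.7, and 0.2 against a second threshold of 0.6, three of four nodes approve and consensus is reached; with three nodes reporting 0.9, 0.8, and 0.2, exactly two thirds approve and consensus is not reached under the strict reading.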

8. The big-data-based data acquisition method according to claim 1, wherein the relevance in step S4 is calculated by:

9. A computer-readable storage medium having one or more computer programs stored thereon, which when executed by a computer processor perform the method of any one of claims 1 to 8.

10. A big-data based data acquisition system, the system comprising:

a web page acquisition unit, configured to crawl the URL set of a data source to be acquired by using a web crawler, and to acquire the web pages corresponding to the URLs;

an evaluation unit, configured to calculate scores of the web pages based on a PageRank algorithm and a HITS algorithm respectively, weight the calculation results to obtain a total value for each web page, and sort the web pages by total value;

a text recognition unit, configured to, in response to the total value being greater than a preset first threshold, parse and acquire the pictures and/or text in the web pages in priority order of total value, and obtain the keywords contained in each web page based on a text information extraction method;

a data mapping unit, configured to calculate the relevance between the keywords and the web page to which they correspond, send a consensus request to the consensus nodes based on the relevance, and, in response to the relevance being greater than a preset second threshold and the consensus degree being greater than a preset third threshold, store the keywords in the blockchain of the corresponding category, store the web page in a database arranged on a node of the blockchain, and establish a mapping relation between the web page and the keywords.