CN106126642B - Financial warehouse receipt wind control information crawling and screening method based on stream-oriented computing - Google Patents
- Publication number
- CN106126642B (application CN201610465640.9A)
- Authority
- CN
- China
- Prior art keywords
- url
- category
- page
- similarity
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
The invention discloses a financial warehouse receipt risk-information crawling and screening method based on stream computing. Using stream-computing technology, the crawler process is decoupled into six sub-processes: URL screening, page analysis, keyword filtering, numerical filtering, feature-vector matching/filtering, and resource updating. The technical scheme addresses the poor real-time performance of traditional parallel crawlers and meets the strict real-time requirements that financial warehouse receipt risk control places on goods valuation.
Description
Technical Field
The invention belongs to the fields of web crawlers and stream computing, and particularly relates to a method for crawling and screening financial warehouse receipt risk-control information based on stream computing.
Background
Financial warehouse receipts, a new form of warehousing transaction and collateral, have been widely adopted by banks and warehousing enterprises with the spread of Internet applications. To limit risk, however, banks tend to select goods with small price fluctuations, strong liquidity, and good resistance to price drops as financing objects, such as fixed assets and precious metals. The collateral of small, medium, and micro enterprises rarely fits this profile: it usually consists of bulk commodities of many varieties whose prices closely track the current market. Constrained by technology, banks find it difficult to track the market prices of all such goods and cannot reasonably value the collateral, which has long been a latent risk in these financial transactions.
Solving the valuation problem for bulk goods first requires obtaining their market price information. However, because of the massive data volume and the difficulty of accurate information extraction, no efficient web-crawler algorithm tailored to financial warehouse receipts currently exists.
On the other hand, since goods prices on a financial warehouse receipt have limited timeliness, the real-time requirement can be very strict in some cases (for example, valuing perishable goods), so a parallel or distributed crawler technology must be adopted. Existing web-crawler technology, however, necessarily shares and updates resources (such as keyword lists and the list of crawled URLs), so any parallel approach faces: (1) single-node hot spots on shared resources; (2) network-transmission delays; and (3) the performance impact of locking during resource updates. Because of these problems, existing parallel (distributed) crawler methods do not scale linearly, their real-time performance is low, and they cannot meet the requirements of financial warehouse receipt risk control in this scenario.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a financial warehouse receipt risk-information crawling and screening method based on stream computing, addressing the poor real-time performance of traditional parallel crawlers and the strict real-time requirements that financial warehouse receipt risk control places on goods valuation.
In order to solve the problems, the invention adopts the following technical scheme:
a financial warehouse receipt risk information crawling and screening method based on stream computing comprises the following steps:
step S1, URL screening: calculating a Hash value for each URL obtained from the Spout task data source, sending the URL to the corresponding node according to that value, and screening it there against the to-be-crawled and already-crawled URL lists; if the URL appears in either list, it is discarded;
step S2, page analysis: analyzing and extracting key contents of the URL page to be crawled to obtain all key words of the page, calculating and extracting characteristic values of each key word, wherein the characteristic values of all the key words form a characteristic vector of the record;
step S3, numerical filtering: extracting numerical information in the keywords and the characteristic vectors, judging whether the numerical information is in a price confidence interval, and directly discarding the price information which is not in the confidence interval;
step S4, keyword filtering: taking the record's keywords obtained after numerical filtering, matching them against the keyword lists of the different categories, and, according to the similarity, sending the record to the node where the corresponding category resides;
step S5, feature-vector matching calculation: computing the average similarity between the record's feature vector and the feature vectors of all members of the category; if the average similarity falls below the upper limit of a second preset threshold interval, sending the record's feature vector to the nodes of other categories and computing the average similarity with those categories;
step S6, shared resource update: and updating the shared resources according to the feature vector matching calculation result.
Preferably, a page analysis technology and a Chinese word segmentation technology are adopted to analyze and extract the key content of the URL page, and all the key words of the page are obtained.
Preferably, the feature value of each keyword is computed as TF × IDF, where TF is the keyword's word frequency within the record and IDF is based on the number of records in which the keyword appears.
Preferably, step S6 is specifically: determining a corresponding node by calculating the record URL Hash value, and locking and updating a crawled URL list on the node; and determining the category of the record according to the feature vector calculation, and locking and updating the keyword list and the category feature vector of the category.
Preferably, step S4 is specifically: sending the record's keyword list to the nodes of each category, computing its similarity with each category keyword list, comparing the results with a preset first threshold interval, and processing accordingly:
if the maximum similarity is still below the lower limit of the preset first threshold interval after computation against every category keyword list, the record is considered weakly correlated with goods-price information and is discarded; otherwise, the record is sent to the node where the most similar category resides.
Preferably, in step S5, if the average feature-vector similarity exceeds the upper limit of the preset second threshold interval, the record is considered to belong to the category and the subsequent lock-update operation is performed; otherwise, the record and its feature vector are sent to the other categories for average-similarity calculation, with the following processing according to the result:
if the average feature-vector similarity with some category exceeds the upper limit of the preset second threshold interval, the record is considered to belong to that category, and the subsequent lock-update operation is performed on it;
if the maximum of the average feature-vector similarities lies between the upper and lower limits of the preset second threshold interval, a new category is generated;
if every average feature-vector similarity is below the lower limit of the preset second threshold interval, the record is considered irrelevant to goods-price information and is discarded.
Preferably, the feature vector similarity calculation uses a cosine similarity matching algorithm.
Preferably, generating new categories includes: a category keyword list and a category feature vector library.
Based on stream-computing technology, the invention decouples the crawler process into several sub-processes and achieves efficient real-time web crawling under high concurrency through task encapsulation, allocation, and flow control. For the financial warehouse receipt risk-control scenario, a distributed multi-stage filter matched to goods-price information is designed; by deploying categories across different nodes and processing data near where its resources reside, the method reduces the impact of network-transmission overhead and shared-resource updates on performance and improves processing efficiency.
The technical scheme of the invention is as follows. The web-crawler process is decomposed into six sub-processes: URL screening, page analysis, keyword filtering, numerical filtering, feature-vector matching/filtering, and resource updating. Using stream-computing technology, these sub-processes are encapsulated into different types of logical tasks. Tasks are matched across different physical nodes (complex tasks are assigned more task instances) and the data flow is controlled, maximizing overall processing efficiency and preventing local hot spots. The distributed multi-stage filter is realized through classified deployment: feature vectors and keywords are deployed to different nodes by category and, with a small amount of keyword redundancy, keyword filtering, feature-vector matching, and resource updating are confined to a single physical node as far as possible, reducing the impact of network-transmission overhead and shared-resource update locking on performance.
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
(1) the invention realizes the decoupling of the web crawler process by utilizing the streaming computing technology, and enables serial tasks with different complexities to be processed in real time by dynamic task matching and data flow direction control, thereby preventing local task overload from influencing the processing efficiency of the whole process. Meanwhile, the flexible deployment of logic tasks on different nodes is utilized to realize the integral load balance.
(2) Aiming at the application requirements of financial warehouse receipt wind control, the invention designs a plurality of filtering methods to obtain more accurate crawler results. Meanwhile, by using two rapid filtering methods of numerical filtering and keyword filtering, the page which does not meet the requirement is removed in advance, the pressure of matching calculation of the rear-end feature vector is reduced, and the overall efficiency is improved.
(3) The invention adopts a method of storing the keyword list and the feature vector according to the category and the node, limits the keyword filtering, the feature vector matching and the resource updating in one physical node, reduces the network transmission cost and the influence on the performance due to the locking of the shared resource updating, and further improves the efficiency.
Drawings
FIG. 1 is a flow chart showing a method according to the present invention;
FIG. 2 is a diagram of a streaming computing framework based deployment architecture of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
As shown in fig. 1, the financial warehouse receipt risk-information crawling and screening method based on stream computing of the present invention includes the following steps:
step 1, decomposing a web crawler task under a streaming computing framework.
Following the stream-computing framework, the whole web-crawler process is decoupled into six sub-processes: URL screening, page analysis, keyword filtering, numerical filtering, feature-vector matching/filtering, and resource updating. These six sub-processes are packaged into six types of logical sub-tasks (Bolt-type tasks) under the framework. In addition, the streaming framework requires data-source logical tasks (Spout-type tasks).
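The decomposition above can be pictured, serially, as a chain of stages in which any stage may discard a record. This is only a minimal Python sketch; in the streaming deployment each stage becomes a separate Bolt-type task, potentially on a different node, and `run_stages` is a hypothetical helper, not part of any framework:

```python
def run_stages(record, stages):
    """Pass one record through the decoupled sub-processes in order.

    A stage returns the (possibly transformed) record, or None to
    discard it -- mirroring how each filter in the pipeline may drop
    a page that fails its check.
    """
    for stage in stages:
        record = stage(record)
        if record is None:
            return None  # discarded by this filter
    return record
```

The streaming framework replaces this serial loop with per-stage tasks and explicit data-flow control between nodes.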
Multiple logical sub-tasks of the same type may exist, and they may be deployed on the same physical node. In the deployment architecture of fig. 2, two keyword-filtering tasks are deployed at nodes 4 and 5, and one URL-screening task at each of nodes 1 to 3. Different task types are generally not deployed to the same node; the page-analysis and numerical-filtering tasks in fig. 2 are drawn with dotted frames to indicate that they may share a node with the URL-screening task or be deployed separately. The keyword-filtering and feature-vector-matching tasks, however, are deployed by category and node: since a category's shared resources are stored on a single node, these two task types are placed on that same node.
Fig. 2 is only an example of different types of task allocation and partial data flow, and in an actual operation process, different types of logic tasks may be dynamically increased or decreased according to task complexity. In addition, by utilizing the control on the data flow direction in the flow frame, the data can be sent to the designated node, and the realization of the following two steps is supported:
1) calculating a URL Hash value, and sending the URL Hash value to nodes in different Hash intervals according to the Hash value;
2) and matching the information to be processed with the keyword lists of all categories to specify a correct routing node.
And 2, configuring and deploying the shared resources.
Because distributed processing makes heavy use of shared resources, the method improves efficiency by partitioning shared resources across nodes in several ways.
1) And the crawled URL queue and the URL queue to be crawled used by the URL screening task are stored in nodes according to the Hash value of the URL as shown in figure 2.
2) The keyword lists and the feature vectors are stored according to categories, as shown in fig. 2, the keyword lists and the feature vector libraries of different categories are separately configured with one node for storage, so that data distribution and parallel processing are realized.
And 3, screening the distributed URL.
Steps 1 and 2 are part of the overall system configuration; from this step onward, the steps describe the crawling and processing of each individual URL within the stream-computing process.
And calculating the Hash value of the URL obtained from the Spout task data source, determining the segment Hash interval to which the URL belongs according to the size of the Hash value, and sending the URL to the corresponding node. And finishing the screening of the URL to be crawled and the URL which is crawled on the node, if the URL belongs to one of the URLs, showing that the URL is crawled or waits to be crawled, and abandoning the URL.
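A minimal Python sketch of this screening step, assuming MD5 as the hash function and three URL-screening nodes (the patent fixes neither; both are illustrative):

```python
import hashlib

NUM_NODES = 3  # fig. 2 shows URL-screening tasks on nodes 1-3 (illustrative)

def url_hash(url: str) -> int:
    # Stable hash of the URL; MD5 is an assumption -- the patent does not
    # fix the hash function.
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16)

def route_node(url: str) -> int:
    # Map the hash value onto one of the segmented hash intervals (nodes).
    return url_hash(url) % NUM_NODES

def screen(url: str, crawled: set, to_crawl: set) -> bool:
    # Discard the URL if it was already crawled or is waiting to be crawled;
    # otherwise queue it and report it as new.
    if url in crawled or url in to_crawl:
        return False
    to_crawl.add(url)
    return True
```

Because the same URL always hashes to the same node, the dedup check never needs to consult another node's lists.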
And 4, analyzing the page.
Page analysis mainly completes the following two sub-processes:
1) and analyzing and extracting the key content of the URL page to be crawled by adopting a page analyzing technology and a Chinese word segmentation technology to obtain all key words of the page.
2) Computing and extracting a feature value for each keyword using TF × IDF, where TF is the keyword's word frequency within the record and IDF is based on the number of records in which the keyword appears. The feature values of all extracted keywords constitute the record's feature vector.
As shown in fig. 2, in order to reduce network transmission and reduce the amount of URL screening tasks, the page analysis task may be deployed in the same node as the URL screening task.
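The feature-value computation of step 4 can be sketched as follows. The standard logarithmic form idf = log(N / df) is assumed here, since the description only says IDF is based on the number of records containing the keyword:

```python
import math
from collections import Counter

def feature_vector(record_words, corpus):
    """TF*IDF feature value for each keyword of one record.

    TF is the keyword's frequency in the record, as in the description;
    for IDF the standard log(N / df) form is assumed, where df is the
    number of records (corpus entries) containing the keyword.
    """
    tf = Counter(record_words)
    n_docs = len(corpus)
    vec = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / df) if df else 0.0
        vec[word] = (count / len(record_words)) * idf
    return vec
```

A keyword that appears often in this record but in few records overall gets the largest feature value.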
And 5, filtering numerical values.
After the page analysis obtains the keywords and the characteristic vectors, extracting the numerical information in the keywords and the characteristic vectors, and judging whether the numerical information is located in a price confidence interval (such as 0.001-10000) or not. Price information that is not within the confidence interval is directly discarded.
The price confidence interval is initialized by training on samples to obtain its upper and lower limits. After the system has run for a period of time, the thresholds are checked for reasonableness against the goods categories: for example, if the upper limit is too low, prices of valuable goods are wrongly filtered out, and the upper limit must be raised.
Since numerical filtering is computationally cheap and targets goods prices specifically, it is placed in the same node as page analysis, as in the deployment diagram of fig. 2.
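A minimal sketch of the numerical filter, using the 0.001–10000 example interval from the description (real bounds would be trained from samples and tuned per goods category):

```python
def numeric_filter(values, lower=0.001, upper=10000.0):
    # Keep only values inside the price confidence interval [lower, upper].
    # The 0.001-10000 bounds are the example given in the description; real
    # limits are trained from samples and adjusted per goods category.
    return [v for v in values if lower <= v <= upper]
```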
Step 6, filtering the keywords
After numerical filtering, all subsequent processing, including the keyword-filtering task, is partitioned by goods category and node. The keyword-filtering task therefore first computes the similarity between the keywords of the record to be processed and each category's keywords, then routes the record according to the result. The specific method is as follows:
and sending the keyword list of the information to each category node, and carrying out similarity calculation with each category keyword list. That is, the number of the statistical information keywords appearing in the category keyword list accounts for the percentage of the total number of the category keywords. After the calculation result is compared with a preset first threshold value interval, corresponding processing is carried out:
1) If the maximum similarity is still below the lower limit of the first threshold interval after computation against every category keyword list, the record is considered weakly correlated with goods-price information and is discarded.
2) Otherwise, the record is sent to the node where the most similar category resides. Subsequent feature-vector matching is completed entirely on that node, with no transmission to other nodes.
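The similarity measure and routing rule of this step can be sketched as below; the concrete lower limit of the first threshold interval (`lower_bound=0.1` here) is a placeholder, since the description leaves it to configuration:

```python
def keyword_similarity(record_keywords, category_keywords):
    # Number of distinct record keywords found in the category list, as a
    # fraction of the total number of category keywords.
    hits = sum(1 for kw in set(record_keywords) if kw in category_keywords)
    return hits / len(category_keywords)

def route_category(record_keywords, categories, lower_bound=0.1):
    # Route to the most similar category, or return None to discard when
    # even the best similarity is below the first threshold's lower limit.
    best, best_sim = None, -1.0
    for name, kw_list in categories.items():
        sim = keyword_similarity(record_keywords, kw_list)
        if sim > best_sim:
            best, best_sim = name, sim
    return best if best_sim >= lower_bound else None
```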
Step 7, feature vector matching
After keyword filtering, the record is already on the node where some category resides. The similarity between the record's feature vector and all feature vectors of the category is computed using a cosine-similarity matching algorithm, and the result is processed as follows:
1) if the average value of the similarity of the feature vectors is larger than the upper limit of a preset second threshold interval, the information is considered to belong to the category, and subsequent locking updating operation is carried out;
2) and in other cases, sending the information and the feature vectors thereof to other categories for similarity calculation, and performing corresponding processing according to the obtained feature vector similarity average value:
a) and if the average value of the similarity of the feature vectors with the category is larger than the upper limit of a preset second threshold interval, the information is considered to belong to the category, and subsequent locking updating operation is carried out on the category.
b) If the maximum of the average feature-vector similarities lies between the upper and lower limits of the preset second threshold interval, a new category is generated, comprising a category keyword list and a category feature-vector library.
c) And if the average value of the similarity of the feature vectors of the category is smaller than the lower limit of the preset second threshold interval, the information is considered to be irrelevant to the price information of the goods and is discarded.
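The three-way decision of step 7 can be sketched as follows; the second-threshold-interval bounds (0.3 and 0.7 here) are placeholders:

```python
import math

def cosine(a, b):
    # Cosine similarity of two equal-length dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(vec, category_vectors, lower=0.3, upper=0.7):
    # Average cosine similarity against every member vector of the category,
    # mapped onto the second threshold interval (0.3/0.7 are placeholders):
    # above the upper limit -> member; inside the interval -> new category;
    # below the lower limit -> discard.
    avg = sum(cosine(vec, m) for m in category_vectors) / len(category_vectors)
    if avg > upper:
        return "member"
    if avg >= lower:
        return "new_category"
    return "discard"
```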
Step 8, updating the shared resource
After feature-vector matching, shared resources must be updated in the following three cases:
1) crawled URL updates: calculating the URL Hash value by adopting the same method as before, and locking the Hash interval and the node where the Hash interval is located for updating;
2) Existing-category update: according to the node (category) where the record finally landed in step 7, the category's feature-vector library and keyword list on that node are locked and updated. Since the lock covers only one category and node, the impact on parallel performance is small.
3) New-category generation: step 7 may generate a new category; this update requires no locking of existing categories. Only the new category's keyword list and feature-vector library need to be created and deployed on a new node.
The updating process and the updating method limit the locking operation in the node as much as possible, and prevent the influence on the whole performance, thereby effectively improving the whole parallel processing capability.
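A minimal sketch of per-category locked updates, using a single-process store guarded by one `threading.Lock` per category as a stand-in for the node-level locking described above (class and method names are illustrative):

```python
import threading

class CategoryStore:
    """Per-category shared resources guarded by per-category locks, so an
    update blocks only the one category (node) it touches."""

    def __init__(self):
        self._locks = {}
        self.keywords = {}   # category -> set of keywords
        self.vectors = {}    # category -> list of member feature vectors

    def _lock(self, category):
        return self._locks.setdefault(category, threading.Lock())

    def update(self, category, new_keywords, new_vector):
        with self._lock(category):  # locks only this category's resources
            self.keywords.setdefault(category, set()).update(new_keywords)
            self.vectors.setdefault(category, []).append(new_vector)
```

Updates to different categories never contend for the same lock, which is the property the deployment relies on for parallel throughput.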
Finally, it should be noted that: the above examples are only intended to illustrate the invention and do not limit the technical solutions described in the present invention; thus, while the present invention has been described in detail with reference to the foregoing examples, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted; all such modifications and variations are intended to be included herein within the scope of this disclosure and the appended claims.
Claims (6)
1. A financial warehouse receipt risk information crawling and screening method based on stream computing is characterized by comprising the following steps:
step S1, calculating a Hash value for each URL obtained from the Spout task data source, sending the URL to the corresponding node according to that value, and screening it there against the to-be-crawled and already-crawled URL lists; if the URL appears in either list, it is discarded;
step S2, analyzing and extracting the key content of the URL page to be crawled to obtain all keywords of the page, and calculating and extracting a feature value for each keyword, the feature values of all keywords forming the feature vector of the URL page;
step S3, extracting numerical information from the keywords and the feature vector, judging whether the numerical information lies within the price confidence interval, and directly discarding price information outside the interval;
step S4, taking the keywords of the URL page obtained after numerical filtering, matching them against the keyword lists of the different categories, and, according to the similarity, sending the URL page to the node where the corresponding category resides;
step S5, computing the average similarity between the feature vector of the URL page and the feature vectors of all members of the category; if the average similarity falls below the upper limit of a second preset threshold interval, sending the feature vector of the URL page to the nodes of other categories and computing the average similarity with those categories;
and step S6, updating the shared resource according to the feature vector matching calculation result.
2. The financial warehouse receipt risk-information crawling and screening method based on streaming computing as claimed in claim 1, wherein a page-parsing technology and a Chinese word-segmentation technology are adopted to parse and extract the key content of the URL page and obtain all keywords of the page.
3. The financial warehouse receipt risk-information crawling and screening method based on streaming computing as claimed in claim 1, wherein the feature value of the keyword is TF × IDF, wherein TF represents the word frequency of each keyword in the URL page, and IDF represents the number of URL pages in which the keyword appears.
4. The financial warehouse receipt risk-information crawling and screening method based on streaming computing as claimed in claim 1, wherein step S6 is specifically: determining the corresponding node by calculating the URL Hash value, and locking and updating the crawled URL list on that node; and determining the category of the URL page according to the feature-vector calculation, and locking and updating the keyword list and category feature vector of that category.
5. The financial warehouse receipt risk-information crawling and screening method based on streaming computing as claimed in claim 1, wherein step S4 is specifically: sending the keyword list of the price information to the nodes of each category, computing its similarity with each category keyword list, comparing the results with a preset first threshold interval, and processing accordingly:
if the maximum similarity is still below the lower limit of the preset first threshold interval after computation against every category keyword list, the information is considered weakly correlated with goods-price information and is discarded; otherwise, the information is sent to the node where the most similar category resides.
6. The financial warehouse receipt risk-information crawling and screening method based on streaming computing as claimed in claim 1, wherein the feature-vector similarity calculation uses a cosine-similarity matching algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610465640.9A CN106126642B (en) | 2016-06-23 | 2016-06-23 | Financial warehouse receipt wind control information crawling and screening method based on stream-oriented computing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106126642A CN106126642A (en) | 2016-11-16 |
CN106126642B true CN106126642B (en) | 2020-01-17 |
Family
ID=57268761
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610465640.9A Active CN106126642B (en) | 2016-06-23 | 2016-06-23 | Financial warehouse receipt wind control information crawling and screening method based on stream-oriented computing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106126642B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110213073B (en) * | 2018-04-20 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Data flow direction changing method, electronic device, computing node and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN201654780U (en) * | 2010-03-25 | 2010-11-24 | 成都信息工程学院 | Text categorization system focusing parallel crawler and maximal difference |
US8380750B2 (en) * | 2011-02-17 | 2013-02-19 | International Business Machines Corporation | Searching and displaying data objects residing in data management systems |
CN103310013A (en) * | 2013-07-02 | 2013-09-18 | 北京航空航天大学 | Subject-oriented web page collection system |
CN103838798A (en) * | 2012-11-27 | 2014-06-04 | 阿里巴巴集团控股有限公司 | Page classification system and method |
CN105354770A (en) * | 2015-11-16 | 2016-02-24 | 南京途牛科技有限公司 | Real-time price comparison method for route type tourism product |
- 2016-06-23: application CN201610465640.9A, patent CN106126642B/en, status Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN201654780U (en) * | 2010-03-25 | 2010-11-24 | 成都信息工程学院 | Text categorization system focusing parallel crawler and maximal difference |
US8380750B2 (en) * | 2011-02-17 | 2013-02-19 | International Business Machines Corporation | Searching and displaying data objects residing in data management systems |
CN103838798A (en) * | 2012-11-27 | 2014-06-04 | 阿里巴巴集团控股有限公司 | Page classification system and method |
CN103310013A (en) * | 2013-07-02 | 2013-09-18 | 北京航空航天大学 | Subject-oriented web page collection system |
CN105354770A (en) * | 2015-11-16 | 2016-02-24 | 南京途牛科技有限公司 | Real-time price comparison method for route type tourism product |
Non-Patent Citations (1)
Title |
---|
Research on Key Technologies of a Price-Comparison System Based on Semantic Similarity; Xia Zhiming; China Master's Theses Full-text Database, Information Science and Technology; 2016-04-15; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN106126642A (en) | 2016-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | PBCNN: packet bytes-based convolutional neural network for network intrusion detection | |
US20210241175A1 (en) | Methods and apparatus for management of a machine-learning model to adapt to changes in landscape of potentially malicious artifacts | |
CN110431560B (en) | Target person searching method, device, equipment and medium | |
CN107992887B (en) | Classifier generation method, classification device, electronic equipment and storage medium | |
CN110033026B (en) | Target detection method, device and equipment for continuous small sample images | |
Chen et al. | Pedestrian detection with deep convolutional neural network | |
Di Noia et al. | Taamr: Targeted adversarial attack against multimedia recommender systems | |
US10853431B1 (en) | Managing distribution of content items including URLs to external websites | |
Haque et al. | Fusion: An online method for multistream classification | |
US20060235812A1 (en) | Partially supervised machine learning of data classification based on local-neighborhood Laplacian Eigenmaps | |
CN114494981B (en) | Action video classification method and system based on multi-level motion modeling | |
CN114529873A (en) | Target detection method and city violation event monitoring method applying same | |
CN111930526A (en) | Load prediction method, load prediction device, computer equipment and storage medium | |
CN106126642B (en) | Financial warehouse receipt wind control information crawling and screening method based on stream-oriented computing | |
CN109815736A (en) | A kind of database desensitization method, device and desensitization equipment | |
Mohan et al. | Location based cloud resource management for analyzing real-time videos from globally distributed network cameras | |
Ding et al. | Network attack detection method based on convolutional neural network | |
Channoufi et al. | Spatially constrained mixture model with feature selection for image and video segmentation | |
CN108921673B (en) | Commodity recommendation method based on big data | |
CN110781950A (en) | Message processing method and device | |
CN110351273A (en) | A kind of methods, devices and systems of network trace reel chain attack | |
CN112749851B (en) | Big data demand prediction method based on artificial intelligence and big data cloud service center | |
Hwang et al. | Large-scale training framework for video annotation | |
Zhang et al. | Semi-supervised deep learning based network intrusion detection | |
CN113052635A (en) | Population attribute label prediction method, system, computer device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |