CN111506566A - Method for verifying internet data acquisition result - Google Patents


Info

Publication number
CN111506566A
CN111506566A (application CN202010324527.5A)
Authority
CN
China
Prior art keywords
data
checking
variable
constant
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010324527.5A
Other languages
Chinese (zh)
Inventor
戴晶
蒋圣
谢乾
王吉
杨洋
沈愉悦
徐润之
沈赟芳
汪涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunshan Byosoft Electronic Technology Co ltd
Nanjing Byosoft Co ltd
Jiangsu Zhuoyi Information Technology Co ltd
Original Assignee
Kunshan Byosoft Electronic Technology Co ltd
Nanjing Byosoft Co ltd
Jiangsu Zhuoyi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunshan Byosoft Electronic Technology Co ltd, Nanjing Byosoft Co ltd, Jiangsu Zhuoyi Information Technology Co ltd filed Critical Kunshan Byosoft Electronic Technology Co ltd
Priority to CN202010324527.5A priority Critical patent/CN111506566A/en
Publication of CN111506566A publication Critical patent/CN111506566A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F 18/24155 Bayesian classification


Abstract

The invention discloses a method for checking internet data acquisition results. The method divides collected data into three types: constants, regular variables, and irregular variables. Constants are checked with a static constant checking method, regular variables with rule-based checks, and irregular variables with dirty-data identification based on a naive Bayes algorithm.

Description

Method for verifying internet data acquisition result
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method for verifying an internet data acquisition result.
Background
With the development of internet technology, big data analysis and mining have advanced rapidly, and big data applications are closely tied to daily life: medical analysis based on big data can help patients quickly locate the causes of diseases, big data financial analysis can help traders perform quantitative analysis, and urban big data applications can help decision makers observe daily flows of people and so assist regional economic analysis. The big data analysis and mining process is mainly divided into data acquisition, data cleaning, and data modeling and analysis, of which data acquisition is of particular importance.
Internet data is an important source for data acquisition, but its data sources are unstable and their structure changes frequently. In addition, network problems, data parsing errors, and the like may occur during acquisition, reducing the accuracy of the data. The data collected from the internet must therefore be checked for accuracy.
Disclosure of Invention
The technical problem solved by the invention is as follows: in the internet data acquisition process, network problems, data parsing errors, and the like may occur, reducing the accuracy of the data.
The technical scheme adopted by the invention to solve this problem is as follows:
A method for checking internet data acquisition results divides data into three types: constants, regular variables, and irregular variables. Constants are checked with a static constant checking method, regular variables with rule-based checks, and irregular variables with dirty-data identification based on a naive Bayes algorithm. If all types of data pass their checks, the data is stored in an application database; if any type fails, the data acquisition program is inspected and updated. The method specifically comprises the following steps:
S1: collect internet data;
S2: the constant check module checks the constant data, judging its accuracy by comparing whether constants in the collected data have changed;
S3: the variable check module checks variables in the collected data, judging the accuracy of regular variables by whether they conform to the check rules; for irregular variables, a dirty-data identification model based on a naive Bayes algorithm identifies whether the collected data is accurate;
S4: if all types of data pass the check, go to step S6; if any type fails, go to step S5;
S5: inspect and update the data collection program, then return to step S1;
S6: store the data in the warehousing queue;
S7: move the data from the warehousing queue into the application database.
Further, the constant check module checks the constant data as follows:
S21: manually extract constant information that does not change frequently from the collected data and store the constants in the database after manual verification;
S22: collect data with the Scrapy framework and parse it with an XPath tool; compare the constant data in the collected records with the constants stored in the database; if they are consistent, continue to step S23, otherwise go to step S24;
S23: pass from the constant check module to the variable check module;
S24: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
Further, for regular variable data, check rules are established with regular expressions and rule-based data checks are performed.
Further, the rule-based data check comprises the following steps:
S31: manually extract the regular variables from the collected data and establish a check rule for each regular variable based on the business rules;
S32: collect data with the Scrapy framework and parse it with an XPath tool; perform the rule-based check on each regular variable; if it passes, continue to step S33, otherwise go to step S34;
S33: pass from the rule-based data check module to the irregular variable check module;
S34: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
Further, the dirty-data identification model based on the naive Bayes algorithm is established as follows:
S41: data acquisition: collect data with the Scrapy framework and parse it with an XPath tool;
S42: data preprocessing: filter html tags out of the data with a regular expression, set a minimum data length Min, and delete data shorter than Min;
S43: manually label whether each record is dirty data to obtain a sample set, and split the samples into a training set and a test set in a certain proportion;
S44: segment the data with a word segmentation tool, convert the text into word vectors, and select the n most frequent words as data features, denoted x1, x2, ..., xn;
S45: count the occurrence probability of each word under the valid-data and dirty-data categories to obtain P(xi|y), and count the proportions of dirty and valid data to obtain P(y), thereby obtaining the Bayesian model;
S46: verify the accuracy of the model with the test set, and adjust the model to improve its precision.
Further, irregular variables are checked with the naive Bayes algorithm as follows:
S51: convert the irregular variable to be checked from text into a word vector with a word segmentation tool;
S52: input the word vector into the trained Bayes model to identify whether it is dirty data;
S53: set a threshold m; when the amount of dirty data found in step S52 exceeds m, go to S55, otherwise continue to step S54;
S54: the data passes the check and enters the tail of the warehousing queue; when the queue length exceeds a threshold L, the head element is stored in the application database;
S55: delete the recently queued data from the warehousing queue, inspect the data collection program, analyze the cause of the large amount of dirty data, and update the program.
Furthermore, naive Bayes is a classification algorithm based on the Bayes rule, which is shown in equation (1):

P(y|x) = P(x|y) P(y) / P(x)    (1)

When x consists of multiple independent events x1, x2, ..., xn, the Bayes rule is shown in equation (2):

P(y|x1, x2, ..., xn) = P(y) ∏ P(xi|y) / P(x1, x2, ..., xn)    (2)

In equations (1) and (2),
P(y|x) is the posterior probability, the probability that event y occurs given that event x has occurred,
P(x) and P(y) are the probabilities that events x and y occur,
P(x|y) is the conditional probability, the probability that x occurs given that event y has occurred;
the naive Bayes algorithm computes the Bayes probability of each class for every piece of data, and the class with the highest probability is the class the data belongs to; since the value of P(x) is the same for all classes, the Bayes algorithm reduces to equation (3):

c* = argmax(c ∈ y) P(c) ∏ P(xi|c)    (3)

In equation (3), y is the set of all classes and c is a class in y.
Advantageous effects: compared with the prior art, the invention has the following advantages:
The method provides a static constant check for constant data, a rule-based check for regular variable data, and a naive-Bayes dirty-data identification method for irregular variable data; it combines the three methods to check the collected data and finally stores the verified data in the database, providing accurate and effective data for data analysis. Verified in practice on the collection and checking of stock market data, the method effectively finds erroneous data, has a high identification rate, and can be applied in real data-acquisition verification processes.
Drawings
FIG. 1 is an overall flow chart of the method for checking internet data acquisition results;
FIG. 2 is a flow chart of the constant check of the method;
FIG. 3 is a flow chart of the regular variable check of the method.
Detailed Description
The present invention is further illustrated by the following specific embodiment, which is carried out on the basis of the technical scheme of the present invention. It should be understood that this embodiment only illustrates the invention and does not limit its scope.
The method checks the accuracy of internet data acquisition results by dividing the data into three types: constants, regular variables, and irregular variables. Constants are checked with a static constant checking method, regular variables with rule-based checks, and irregular variables with naive-Bayes dirty-data identification. Combining the three methods allows a comprehensive check of the acquisition results and ensures the accuracy of the data. If all types of data pass their checks, the data is stored in the application database; if any type fails, the acquisition result is considered faulty and the data acquisition program is inspected and updated.
A constant is data that does not change frequently, such as the stock code and stock name in real-time stock data. A variable is data that changes frequently and is divided into regular and irregular variables: a regular variable has an obvious rule, such as the real-time stock price, which is a floating-point value greater than 0; an irregular variable has no obvious rule, such as textual public opinion data.
The overall flow of the invention is shown in FIG. 1. The check process is divided into a constant check module and a variable check module. The constant check module checks constants in the collected data, judging accuracy by whether they have changed. The variable check module checks variables: regular variables are judged by whether they conform to the check rules, and for irregular variables a dirty-data identification model based on a naive Bayes algorithm identifies whether the collected data is accurate. The constant is checked first, then the regular variables, and finally the irregular variables; when all types pass, the data enters the warehousing queue and is then stored in the application database. The warehousing queue mainly acts as a data buffer: when the data is wrong, warehousing can be stopped in time and the wrong data deleted, preventing erroneous data from affecting the application.
The method specifically comprises the following steps:
S1: collect internet data;
S2: the constant check module checks the constant data, judging its accuracy by comparing whether constants in the collected data have changed;
S3: the variable check module checks variables in the collected data, judging the accuracy of regular variables by whether they conform to the check rules; for irregular variables, a dirty-data identification model based on a naive Bayes algorithm identifies whether the collected data is accurate;
S4: if all types of data pass the check, go to step S6; if any type fails, go to step S5;
S5: inspect and update the data collection program, then return to step S1;
S6: store the data in the warehousing queue;
S7: move the data from the warehousing queue into the application database.
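A minimal sketch of steps S1 to S7 in Python (the language of the Scrapy-based collector described later); the function name, the check callbacks, and the queue length are illustrative assumptions, not part of the patent:

```python
from collections import deque

QUEUE_LEN = 100  # warehousing-queue length L (assumed value)

def verify_pipeline(records, check_constant, check_regular, check_irregular, store):
    """Sketch of steps S1-S7: run the three checks in order; queue data
    that passes all of them, and flush head elements to the application
    database once the warehousing queue exceeds its threshold."""
    queue = deque()
    for rec in records:                           # S1: collected internet data
        if not check_constant(rec):               # S2: constant check
            return "recheck collection program"   # S5
        if not check_regular(rec):                # S3: rule-based check
            return "recheck collection program"
        if not check_irregular(rec):              # S3: naive-Bayes check
            return "recheck collection program"
        queue.append(rec)                         # S6: warehousing queue
        if len(queue) > QUEUE_LEN:                # S7: flush head to database
            store(queue.popleft())
    return "ok"
```

Any failed check short-circuits the loop, mirroring the return to S1 via S5 in the flow chart.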
Constant check module
The constant check module checks constants in the collected data as follows:
S21: manually extract constant information that does not change frequently from the collected data and store the constants in the database after manual verification;
S22: collect data with the Scrapy framework and parse it with an XPath tool; compare the constant data in the collected records with the constants stored in the database; if they are consistent, continue to step S23, otherwise go to step S24;
S23: pass from the constant check module to the variable check module;
S24: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
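The comparison in steps S21 and S22 can be sketched as follows; the field names `code` and `name` follow the stock example used later in the text, and `check_constants` is an assumed helper name:

```python
def check_constants(record, stored_constants):
    """Compare constant fields (e.g. stock code and name) in a collected
    record against the values previously stored in the database (S22)."""
    for field in ("code", "name"):
        if record.get(field) != stored_constants.get(field):
            return False   # S24: inconsistent, inspect the collection program
    return True            # S23: proceed to the variable check module

# usage: constants verified manually and stored in advance (S21)
stored = {"code": "600000", "name": "SPDB"}
assert check_constants({"code": "600000", "name": "SPDB", "price": 7.8}, stored)
assert not check_constants({"code": "600001", "name": "SPDB"}, stored)
```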
Variable check module
Rule-based data check:
For regular variable data, check rules are established with regular expressions and rule-based data checks are performed. The specific steps are:
S31: manually extract the regular variables from the collected data and establish a check rule for each regular variable based on the business rules; for example, for the real-time price of a stock, the check rule is set to a floating-point number that must be greater than 0.
S32: collect data with the Scrapy framework and parse it with an XPath tool; perform the rule-based check on each regular variable; if it passes, continue to step S33, otherwise go to step S34;
S33: pass from the rule-based data check module to the irregular variable check module;
S34: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
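A sketch of the rule-based check of steps S31 and S32, assuming the example rule "a floating-point number greater than 0" and hypothetical field names:

```python
import re

# Check rule for price-like regular variables (S31): a positive float.
# The pattern is an assumption matching the example rule in the text.
POSITIVE_FLOAT = re.compile(r"^(0\.\d*[1-9]\d*|[1-9]\d*(\.\d+)?)$")

def check_regular(record, fields=("price", "open", "high", "low")):
    """Rule-based check (S32): every regular variable must match its rule."""
    return all(POSITIVE_FLOAT.match(str(record.get(f, ""))) for f in fields)

assert check_regular({"price": "7.8", "open": "7.5", "high": "8.0", "low": "7.2"})
assert not check_regular({"price": "-1", "open": "7.5", "high": "8.0", "low": "7.2"})
```

In practice each field would carry its own compiled rule, as Table 3 later assigns different rules to different fields.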
Dirty-data identification based on naive Bayes:
For irregular variable data such as public opinion data, the invention establishes a dirty-data identification model based on naive Bayes to identify the accuracy of the data.
Naive Bayes is a classification algorithm based on the Bayes rule in probability theory, which is shown in equation (1):

P(y|x) = P(x|y) P(y) / P(x)    (1)

When x consists of multiple independent events x1, x2, ..., xn, the Bayes rule is shown in equation (2):

P(y|x1, x2, ..., xn) = P(y) ∏ P(xi|y) / P(x1, x2, ..., xn)    (2)

In equations (1) and (2),
P(y|x) is the posterior probability, the probability that event y occurs given that event x has occurred,
P(x) and P(y) are the probabilities that events x and y occur,
P(x|y) is the conditional probability, the probability that x occurs given that event y has occurred;
the naive Bayes algorithm computes the Bayes probability of each class for every piece of data, and the class with the highest probability is the class the data belongs to; since the value of P(x) is the same for all classes, the Bayes algorithm reduces to equation (3):

c* = argmax(c ∈ y) P(c) ∏ P(xi|c)    (3)

In equation (3), y is the set of all classes and c is a class in y.
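Equation (3) can be sketched directly in Python; the log-probability form and the toy probability tables below are assumptions for illustration, not values from the patent:

```python
import math

def classify(words, priors, cond_prob):
    """Equation (3): pick the class c maximizing P(c) * prod P(x_i|c).
    Logs are summed instead of multiplying probabilities to avoid
    underflow; unseen words get a small floor probability."""
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior) + sum(
            math.log(cond_prob[c].get(w, 1e-6)) for w in words)
        if score > best_score:
            best, best_score = c, score
    return best

# invented toy model: word probabilities under "valid" and "dirty" classes
priors = {"valid": 0.9, "dirty": 0.1}
cond = {"valid": {"price": 0.3, "stock": 0.4},
        "dirty": {"error": 0.5, "timeout": 0.3}}
assert classify(["stock", "price"], priors, cond) == "valid"
assert classify(["error", "timeout"], priors, cond) == "dirty"
```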
The invention establishes the dirty-data identification model based on the naive Bayes algorithm as follows:
S41: data acquisition: collect data with the Scrapy framework and parse it with an XPath tool;
S42: data preprocessing: filter html tags such as < p > and < br > out of the data with a regular expression, set a minimum data length Min, and delete data shorter than Min;
S43: manually label whether each record is dirty data to obtain a sample set, and split the samples into a training set and a test set in a certain proportion;
S44: segment the data with a word segmentation tool, convert the text into word vectors, and select the n most frequent words as data features, denoted x1, x2, ..., xn;
S45: count the occurrence probability of each word under the valid-data and dirty-data categories to obtain P(xi|y), and count the proportions of dirty and valid data to obtain P(y), thereby obtaining the Bayesian model;
S46: verify the accuracy of the model with the test set, and adjust the model to improve its precision.
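Steps S43 to S45 amount to estimating P(y) and P(xi|y) from labelled samples. A minimal sketch, with an invented toy corpus and add-one (Laplace) smoothing, a common refinement the patent does not specify:

```python
from collections import Counter, defaultdict

def train_bayes(samples):
    """S43-S45 sketch: estimate the priors P(y) and the conditional word
    probabilities P(x_i|y) from manually labelled (words, label) pairs,
    with add-one smoothing so unseen words never get zero probability."""
    labels = Counter(label for _, label in samples)
    priors = {y: n / len(samples) for y, n in labels.items()}   # P(y)
    word_counts = defaultdict(Counter)
    for words, label in samples:
        word_counts[label].update(words)
    vocab = {w for words, _ in samples for w in words}
    cond_prob = {}
    for y, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)
        cond_prob[y] = {w: (counts[w] + 1) / total for w in vocab}  # P(x_i|y)
    return priors, cond_prob

# invented toy corpus standing in for manually labelled samples (S43)
samples = [(["stock", "price", "rise"], "valid"),
           (["stock", "open", "close"], "valid"),
           (["page", "not", "found"], "dirty")]
priors, cond = train_bayes(samples)
assert abs(priors["valid"] - 2 / 3) < 1e-9
assert cond["valid"]["stock"] > cond["dirty"]["stock"]
```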
The irregular variables are checked with the naive Bayes algorithm as follows:
S51: convert the irregular variable to be checked from text into a word vector with a word segmentation tool;
S52: input the word vector into the trained Bayes model to identify whether it is dirty data;
S53: set a threshold m; when the amount of dirty data found in step S52 exceeds m, go to S55, otherwise continue to step S54;
S54: the data passes the check and enters the tail of the warehousing queue; when the queue length exceeds a threshold L, the head element is stored in the application database;
S55: delete the recently queued data from the warehousing queue, inspect the data collection program, analyze the cause of the large amount of dirty data, and update the program.
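Steps S53 to S55 can be sketched as follows; `process_batch`, the batch interface, and the small thresholds in the usage example are assumptions (the embodiment later uses m = 10 and a queue length of 100):

```python
from collections import deque

def process_batch(records, is_dirty, store, m=10, max_len=100):
    """S53-S55 sketch: count dirty records; if the count exceeds the
    threshold m, drop the recently queued data and signal that the
    collection program must be inspected (S55); otherwise queue verified
    data (S54) and flush head elements to the application database once
    the queue is longer than L (max_len)."""
    queue = deque()
    dirty = 0
    for rec in records:
        if is_dirty(rec):
            dirty += 1
            if dirty > m:
                queue.clear()          # S55: discard recently queued data
                return queue, "inspect collection program"
        else:
            queue.append(rec)          # S54: verified data enters the tail
            while len(queue) > max_len:
                store(queue.popleft()) # head element goes to the database
    return queue, "ok"

stored = []
queue, status = process_batch(range(5), lambda r: False, stored.append,
                              m=2, max_len=3)
assert status == "ok" and len(queue) == 3 and stored == [0, 1]
```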
The invention is verified by taking the checking of stock market data as an embodiment:
In this embodiment stock market data is the acquisition object; collecting real-time stock quotes and public opinion data of listed enterprises provides data support for further quantitative financial analysis. The stock market data is mainly collected from the public, transparent stock information published on securities websites such as Eastmoney and Tonghuashun, and the public opinion data of listed enterprises is mainly collected from the latest news published on media websites such as Baidu News and Sina News. The specific fields collected are shown in Tables 1 and 2.
Table 1: internet collected data dictionary
Field(s) Type (B) Description of the invention
code string Stock code
name string Stock name
price float Real-time price of stock
news array Public opinion information
open float Price of opening dish
high float Highest price
low float Lowest price
per float Market profit rate
pb float Net rate of market
changepercent float Amplitude of fluctuation
mktcap float Total market value
nmc float Value of circulation market
Table 2: details of news field
Field(s) Type (B) Description of the invention
news_url string Original linking
title string Title
news_date string When releasedWorkshop
source string Origin of origin
abstract string Abstract
detail string Details of
The data acquisition tool used by the invention is the Scrapy framework, and the data parsing tool is XPath. Scrapy is a multithreaded, asynchronous data acquisition framework implemented in Python, mainly comprising a task scheduling module (Scheduler), a data request module (Downloader), a data acquisition module (Spiders), and a data storage module (Pipeline). The XPath tool is mainly used to parse data from structured websites.
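A stand-in for the parsing step, using the standard library's limited XPath subset rather than a full XPath engine; the HTML fragment and class names are invented for illustration:

```python
from xml.etree import ElementTree

# Minimal stand-in for the Scrapy + XPath parsing step: extract stock
# code, name, and price from a fragment of structured markup.
html = """
<div>
  <span class="code">600000</span>
  <span class="name">SPDB</span>
  <span class="price">7.85</span>
</div>
"""
root = ElementTree.fromstring(html)
# ".//span" is the ElementTree XPath subset; Scrapy selectors accept
# richer expressions such as //span[@class="price"]/text()
record = {span.get("class"): span.text for span in root.findall(".//span")}
assert record == {"code": "600000", "name": "SPDB", "price": "7.85"}
```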
The software and hardware environment used for the experiment in this embodiment is: Windows 7 Professional as the operating system, Python 3.6 as the development language, an Intel Core i7 CPU, 16 GB of memory, a PCIe SSD, and a GeForce GTX 1060 graphics card.
In this embodiment MongoDB is used as the application database, Redis stores the warehousing queue, and the Scrapy framework collects the stock market data. The constants are the stock code and stock name, the irregular variable is the public opinion information of listed enterprises, and the variables related to stock prices are regular variables, as shown in Table 3.
Table 3: stock market data field checking rule
Field(s) Type (B) Description of the invention Checking method
code String Stock code Constant check
name String Stock name Constant check
price float Real-time price of stock Regular variable check
news Array Public opinion information Random variable check
open float Price of opening dish Regular variable check
high float Highest price Regular variable check
low float Lowest price Regular variable check
per float Market profit rate Regular variable check
pb float Net rate of market Regular variable check
changepercent float Amplitude of fluctuation Regular variable check
mktcap float Total market value Regular variable check
nmc float Value of circulation market Regular variable check
In this embodiment the rules of three regular variables, the change percentage, the price-earnings ratio, and the price-to-book ratio, only require floating-point values, while the rules of the other regular variables require floating-point values greater than 0. The experiment uses the naive Bayes algorithm to build a public opinion data identification model from 2000 manually labelled samples, of which 1823 are valid positive samples and 177 are negative dirty-data samples; the identification accuracy of the model reaches 86.6%. The length of the warehousing queue is set to 100; when the proportion of identified dirty data reaches 10%, that is, when the amount of dirty data in the warehousing queue exceeds 10, the data acquisition program is considered faulty, the acquisition process is stopped, and the data and the acquisition program are verified manually.
The invention discloses a method for automatically checking the accuracy of internet data acquisition results. Taking stock market data as an example, it divides the data into constants, regular variables, and irregular variables, and proposes for these three data types a static constant checking method, a rule-based data checking method, and a naive-Bayes dirty-data identification method respectively. The three methods are combined to check the data in sequence; verified data is stored in the warehousing queue for buffering and then moved into the application database, providing data support for quantitative financial analysis. Experiments show that the method can effectively find erroneous data, has a high identification rate, and can be applied in practice to the verification of data acquisition.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A method for checking internet data acquisition results, characterized in that: the data is divided into three types, constants, regular variables, and irregular variables; constants are checked with a static constant checking method, regular variables with rule-based checks, and irregular variables with naive-Bayes dirty-data identification; if all types of data pass the check, the data is stored in an application database, and if any type fails, the data acquisition program is inspected and updated.
2. The method for verifying the internet data acquisition result according to claim 1, specifically comprising the steps of:
S1: collect internet data;
S2: the constant check module checks the constant data, judging its accuracy by comparing whether constants in the collected data have changed;
S3: the variable check module checks variables in the collected data, judging the accuracy of regular variables by whether they conform to the check rules; for irregular variables, a dirty-data identification model based on a naive Bayes algorithm identifies whether the collected data is accurate;
S4: if all types of data pass the check, go to step S6; if any type fails, go to step S5;
S5: inspect and update the data collection program, then return to step S1;
S6: store the data in the warehousing queue;
S7: move the data from the warehousing queue into the application database.
3. The method for verifying the internet data acquisition result according to claim 2, wherein the constant check module checks the constant data as follows:
S21: manually extract constant information that does not change frequently from the collected data and store the constants in the database after manual verification;
S22: collect data with the Scrapy framework and parse it with an XPath tool; compare the constant data in the collected records with the constants stored in the database; if they are consistent, continue to step S23, otherwise go to step S24;
S23: pass from the constant check module to the variable check module;
S24: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
4. The method for verifying the internet data acquisition result as recited in claim 1, wherein a regular expression is used to establish the verification rule for regular variable data, and the rule-based data verification is performed.
5. The method for verifying the internet data acquisition result as recited in claim 4, wherein the rule-based data checking comprises the following steps:
S31: manually extracting the regular variables from the acquired data, and establishing a corresponding checking rule for each regular variable based on the business rules;
S32: collecting data with the Scrapy framework, parsing it with an XPath tool, and performing the rule-based check on each regular variable; if the check passes, continuing to step S33, otherwise turning to step S34;
S33: passing from the rule-based data checking module to the irregular variable checking module;
S34: checking the data collection program, analyzing the cause of the failure, and updating the program.
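Steps S31-S32 can be sketched with Python's `re` module; the specific fields ("publish_date", "price") and their patterns are hypothetical business rules chosen for illustration:

```python
import re

# One checking rule per regular variable (S31), expressed as a regular
# expression per claim 4. These patterns are invented examples: an
# ISO-style date and a price with up to two decimal places.
check_rules = {
    "publish_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "price": re.compile(r"^\d+(\.\d{1,2})?$"),
}

def check_regular_variables(record: dict) -> bool:
    """Return True when every regular variable matches its rule (S32)."""
    return all(rule.fullmatch(str(record.get(field, "")))
               for field, rule in check_rules.items())

passed = check_regular_variables({"publish_date": "2020-04-22", "price": "9.99"})
failed = check_regular_variables({"publish_date": "22/04/2020", "price": "9.99"})
```

A record that passes moves on to the irregular-variable check; a failure, as in S34, points back at the collection program rather than at the single record.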
6. The method for verifying internet data collection results as recited in claim 2, wherein the dirty data identification model is established based on the naive Bayes algorithm through the following steps:
S41: data acquisition: collecting data with the Scrapy framework and parsing it with an XPath tool;
S42: data preprocessing: filtering HTML tags from the data with regular expressions, setting a minimum data length Min, and deleting data shorter than Min;
S43: manually labelling whether each piece of data is dirty data to obtain a sample set, and dividing the samples into a training set and a test set in a certain proportion;
S44: segmenting the data with a word segmentation tool, converting the text data into word vectors, and selecting the n words with the highest occurrence frequency as data features, recorded as x₁, x₂, …, xₙ;
S45: counting the occurrence probability of each word under the valid data and dirty data categories to obtain P(xᵢ|y), and counting the probabilities of dirty data and valid data to obtain P(y), thereby obtaining the Bayes model;
S46: verifying the accuracy of the model with the test set, and adjusting the model to improve its precision.
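The counting in steps S44-S45 can be sketched as a minimal count-based naive Bayes classifier; the tiny labelled corpus below is invented for illustration, and a real implementation would use a word segmentation tool and a much larger sample set:

```python
from collections import Counter

# Hypothetical labelled sample set (S43): short texts tagged as valid
# data or dirty data.
samples = [
    ("latest market report released today", "valid"),
    ("quarterly earnings report published", "valid"),
    ("click here free prize winner", "dirty"),
    ("winner click free offer now", "dirty"),
]

# P(y): class frequencies; P(x_i|y): per-class word frequencies (S45).
class_counts = Counter(label for _, label in samples)
word_counts = {label: Counter() for label in class_counts}
for text, label in samples:
    word_counts[label].update(text.split())

def posterior(text: str, label: str) -> float:
    """Unnormalised P(y) * prod_i P(x_i|y), with add-one smoothing so
    unseen words do not zero out the product."""
    vocab = {w for counts in word_counts.values() for w in counts}
    p = class_counts[label] / len(samples)
    total = sum(word_counts[label].values())
    for w in text.split():
        p *= (word_counts[label][w] + 1) / (total + len(vocab))
    return p

def classify(text: str) -> str:
    """Assign the class with the highest posterior score."""
    return max(class_counts, key=lambda label: posterior(text, label))
```

Step S46 would then compare `classify` against the held-out test labels and adjust n, the preprocessing, or the sample split until the accuracy is acceptable.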
7. The method for verifying internet data collection results of claim 6, wherein the irregular variables are checked based on the naive Bayes algorithm as follows:
S51: converting the irregular variable to be checked from text into a word vector with a word segmentation tool;
S52: inputting the word vector into the trained Bayes model to identify whether it is dirty data;
S53: setting a threshold m; when the amount of dirty data identified in step S52 exceeds m, turning to step S55, otherwise continuing to step S54;
S54: the checked data enters the tail of the warehousing queue, and when the queue length exceeds a threshold L, the head element is stored in the application database;
S55: deleting the data most recently stored in the warehousing queue from the queue, checking the data collection program, analyzing the cause of the large amount of dirty data, and updating the program.
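The buffering logic of steps S53-S55 can be sketched as follows. The thresholds m and L are configuration values the patent leaves open, and this simplified version rejects an over-dirty batch outright rather than unwinding recently queued records as S55 describes:

```python
from collections import deque

M_DIRTY_THRESHOLD = 3   # max tolerated dirty records per batch (m, assumed)
QUEUE_LENGTH = 5        # warehousing-queue length before flushing (L, assumed)

warehouse_queue = deque()
application_db = []     # stands in for the application database

def enqueue_batch(records, dirty_flags) -> bool:
    """Append checked records to the queue tail (S54); when the queue
    grows past L, flush head elements to the application database.
    If the batch contains more than m dirty records, refuse it (S55)."""
    if sum(dirty_flags) > M_DIRTY_THRESHOLD:
        return False  # signal: review and update the collection program
    warehouse_queue.extend(records)
    while len(warehouse_queue) > QUEUE_LENGTH:
        application_db.append(warehouse_queue.popleft())
    return True
```

The queue acts as a quarantine window: data only reaches the application database after it has survived long enough for a burst of dirty data to be detected and the suspect tail discarded.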
8. The method for verifying internet data collection results as recited in claim 2, wherein:
naive Bayes is a classification algorithm based on the Bayes rule, which is shown in equation (1):

P(y|x) = P(x|y) · P(y) / P(x)    (1)

when x consists of multiple independent events x₁, x₂, …, xₙ, the Bayes rule takes the form of equation (2):

P(y|x₁, x₂, …, xₙ) = P(y) · ∏ᵢ P(xᵢ|y) / P(x₁, x₂, …, xₙ)    (2)

in equations (1) and (2),
P(y|x) is the posterior probability, i.e. the probability that event y occurs given that event x has occurred,
P(x) and P(y) are the probabilities that events x and y occur,
P(x|y) is the conditional probability, i.e. the probability that x occurs given that event y has occurred;
the naive Bayes algorithm calculates the Bayes probability of each class for every piece of data, and the class with the highest probability is the class to which the data belongs; since the value of P(x) is the same for all classes, the algorithm reduces to equation (3):

ŷ = argmax_{c ∈ y} P(c) · ∏ᵢ P(xᵢ|c)    (3)

in equation (3), y is the set of all classes, and c is a class in y.
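Equation (3) can be illustrated numerically; the two classes and the per-class feature probabilities below are invented values, not data from the patent:

```python
import math

# Invented priors P(c) for two classes and conditional probabilities
# P(x_i|c) for three conditionally independent features.
p_class = {"c1": 0.6, "c2": 0.4}
p_feature_given_class = {
    "c1": [0.2, 0.5, 0.1],
    "c2": [0.4, 0.3, 0.5],
}

def score(c: str) -> float:
    """P(c) * prod_i P(x_i|c): the quantity maximised in equation (3)."""
    return p_class[c] * math.prod(p_feature_given_class[c])

# c1: 0.6 * 0.2 * 0.5 * 0.1 = 0.006; c2: 0.4 * 0.4 * 0.3 * 0.5 = 0.024,
# so despite the higher prior on c1, the features favour c2.
predicted = max(p_class, key=score)
```

This shows why P(x) can be dropped: it scales both scores equally and cannot change which class attains the maximum.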
CN202010324527.5A 2020-04-22 2020-04-22 Method for verifying internet data acquisition result Pending CN111506566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010324527.5A CN111506566A (en) 2020-04-22 2020-04-22 Method for verifying internet data acquisition result


Publications (1)

Publication Number Publication Date
CN111506566A true CN111506566A (en) 2020-08-07

Family

ID=71864815



Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145207A (en) * 2018-08-01 2019-01-04 广东奥博信息产业股份有限公司 A kind of information personalized recommendation method and device based on classification indicators prediction
CN109800168A (en) * 2019-01-24 2019-05-24 北京奇艺世纪科技有限公司 The test method and device of the action event data of software
CN110442709A (en) * 2019-06-24 2019-11-12 厦门美域中央信息科技有限公司 A kind of file classification method based on model-naive Bayesian

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination