CN111506566A - Method for verifying internet data acquisition result - Google Patents


Info

Publication number
CN111506566A
CN111506566A (application CN202010324527.5A)
Authority
CN
China
Prior art keywords
data
checking
variable
constant
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010324527.5A
Other languages
Chinese (zh)
Inventor
戴晶
蒋圣
谢乾
王吉
杨洋
沈愉悦
徐润之
沈赟芳
汪涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunshan Byosoft Electronic Technology Co ltd
Nanjing Byosoft Co ltd
Jiangsu Zhuoyi Information Technology Co ltd
Original Assignee
Kunshan Byosoft Electronic Technology Co ltd
Nanjing Byosoft Co ltd
Jiangsu Zhuoyi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunshan Byosoft Electronic Technology Co ltd, Nanjing Byosoft Co ltd, Jiangsu Zhuoyi Information Technology Co ltd filed Critical Kunshan Byosoft Electronic Technology Co ltd
Priority to CN202010324527.5A priority Critical patent/CN111506566A/en
Publication of CN111506566A publication Critical patent/CN111506566A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F 18/24155 Bayesian classification


Abstract

The invention discloses a method for checking internet data acquisition results. The method divides collected data into three types: constants, regular variables, and irregular variables. Constants are checked with a static constant checking method, regular variables with rule-based checks, and irregular variables with dirty-data identification based on a naive Bayes algorithm.

Description

Method for verifying internet data acquisition result
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method for verifying an internet data acquisition result.
Background
With the development of internet technology, big data analysis and mining have advanced rapidly, and big data applications are closely tied to daily life: medical analysis based on big data can help patients quickly locate the causes of diseases, big data financial analysis can help traders perform quantitative analysis, and urban big data applications can help decision makers observe daily flows of people and so assist regional economic analysis. The big data analysis and mining process is mainly divided into data acquisition, data cleaning, and data modeling and analysis, of which data acquisition is of particular importance.
Internet data is an important source for data acquisition, but its data sources are unstable and their structure changes frequently. In addition, network problems, data parsing errors, and the like may occur during acquisition, reducing the accuracy of the data. The data collected from the internet must therefore be checked for accuracy.
Disclosure of Invention
The technical problem solved by the invention is as follows: in the internet data acquisition process, network problems, data parsing errors, and the like may occur, reducing the accuracy of the data.
The technical scheme adopted by the invention to solve this problem is as follows:
A method for checking internet data acquisition results divides data into three types: constants, regular variables, and irregular variables. Constants are checked with a static constant checking method, regular variables with rule-based checks, and irregular variables with dirty-data identification based on a naive Bayes algorithm. If all types of data pass their checks, the data is stored in an application database; if any type fails, the data acquisition program is inspected and updated. The method specifically comprises the following steps:
S1: collect internet data;
S2: the constant check module checks the constant data, judging its accuracy by comparing whether constants in the collected data have changed;
S3: the variable check module checks variables in the collected data, judging the accuracy of regular variables by whether they conform to the check rules; for irregular variables, a dirty-data identification model based on a naive Bayes algorithm identifies whether the collected data is accurate;
S4: if all types of data pass the check, go to step S6; if any type fails, go to step S5;
S5: inspect and update the data collection program, then return to step S1;
S6: store the data in the warehousing queue;
S7: move the data from the warehousing queue into the application database.
Further, the constant check module checks the constant data as follows:
S21: manually extract constant information that does not change frequently from the collected data and store the constants in the database after manual verification;
S22: collect data with the Scrapy framework and parse it with an XPath tool; compare the constant data in the collected records with the constants stored in the database; if they are consistent, continue to step S23, otherwise go to step S24;
S23: pass from the constant check module to the variable check module;
S24: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
Further, for regular variable data, check rules are established with regular expressions and rule-based data checks are performed.
Further, the rule-based data check comprises the following steps:
S31: manually extract the regular variables from the collected data and establish a check rule for each regular variable based on the business rules;
S32: collect data with the Scrapy framework and parse it with an XPath tool; perform the rule-based check on each regular variable; if it passes, continue to step S33, otherwise go to step S34;
S33: pass from the rule-based data check module to the irregular variable check module;
S34: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
Further, the dirty-data identification model based on the naive Bayes algorithm is established as follows:
S41: data acquisition: collect data with the Scrapy framework and parse it with an XPath tool;
S42: data preprocessing: filter html tags out of the data with a regular expression, set a minimum data length Min, and delete data shorter than Min;
S43: manually label whether each record is dirty data to obtain a sample set, and split the samples into a training set and a test set in a certain proportion;
S44: segment the data with a word segmentation tool, convert the text into word vectors, and select the n most frequent words as data features, denoted x1, x2, ..., xn;
S45: count the occurrence probability of each word under the valid-data and dirty-data categories to obtain P(xi|y), and count the proportions of dirty and valid data to obtain P(y), thereby obtaining the Bayesian model;
S46: verify the accuracy of the model with the test set, and adjust the model to improve its precision.
Further, irregular variables are checked with the naive Bayes algorithm as follows:
S51: convert the irregular variable to be checked from text into a word vector with a word segmentation tool;
S52: input the word vector into the trained Bayes model to identify whether it is dirty data;
S53: set a threshold m; when the amount of dirty data found in step S52 exceeds m, go to S55, otherwise continue to step S54;
S54: the data passes the check and enters the tail of the warehousing queue; when the queue length exceeds a threshold L, the head element is stored in the application database;
S55: delete the recently queued data from the warehousing queue, inspect the data collection program, analyze the cause of the large amount of dirty data, and update the program.
Furthermore, naive Bayes is a classification algorithm based on the Bayes rule, which is shown in equation (1):

P(y|x) = P(x|y) P(y) / P(x)    (1)

When x consists of multiple independent events x1, x2, ..., xn, the Bayes rule is shown in equation (2):

P(y|x1, x2, ..., xn) = P(y) ∏ P(xi|y) / P(x1, x2, ..., xn)    (2)

In equations (1) and (2),
P(y|x) is the posterior probability, the probability that event y occurs given that event x has occurred,
P(x) and P(y) are the probabilities that events x and y occur,
P(x|y) is the conditional probability, the probability that x occurs given that event y has occurred;
the naive Bayes algorithm computes the Bayes probability of each class for every piece of data, and the class with the highest probability is the class the data belongs to; since the value of P(x) is the same for all classes, the Bayes algorithm reduces to equation (3):

c* = argmax(c ∈ y) P(c) ∏ P(xi|c)    (3)

In equation (3), y is the set of all classes and c is a class in y.
Advantageous effects: compared with the prior art, the invention has the following advantages:
The method provides a static constant check for constant data, a rule-based check for regular variable data, and a naive-Bayes dirty-data identification method for irregular variable data; it combines the three methods to check the collected data and finally stores the verified data in the database, providing accurate and effective data for data analysis. Verified in practice on the collection and checking of stock market data, the method effectively finds erroneous data, has a high identification rate, and can be applied in real data-acquisition verification processes.
Drawings
FIG. 1 is an overall flow chart of the method for checking internet data acquisition results;
FIG. 2 is a flow chart of the constant check of the method;
FIG. 3 is a flow chart of the regular variable check of the method.
Detailed Description
The present invention is further illustrated by the following specific embodiment, which is carried out on the basis of the technical scheme of the present invention. It should be understood that this embodiment only illustrates the invention and does not limit its scope.
The method checks the accuracy of internet data acquisition results by dividing the data into three types: constants, regular variables, and irregular variables. Constants are checked with a static constant checking method, regular variables with rule-based checks, and irregular variables with naive-Bayes dirty-data identification. Combining the three methods allows a comprehensive check of the acquisition results and ensures the accuracy of the data. If all types of data pass their checks, the data is stored in the application database; if any type fails, the acquisition result is considered faulty and the data acquisition program is inspected and updated.
A constant is data that does not change frequently, such as the stock code and stock name in real-time stock data. A variable is data that changes frequently and is divided into regular and irregular variables: a regular variable has an obvious rule, such as the real-time stock price, which is a floating-point value greater than 0; an irregular variable has no obvious rule, such as textual public opinion data.
The overall flow of the invention is shown in FIG. 1. The check process is divided into a constant check module and a variable check module. The constant check module checks constants in the collected data, judging accuracy by whether they have changed. The variable check module checks variables: regular variables are judged by whether they conform to the check rules, and for irregular variables a dirty-data identification model based on a naive Bayes algorithm identifies whether the collected data is accurate. The constant is checked first, then the regular variables, and finally the irregular variables; when all types pass, the data enters the warehousing queue and is then stored in the application database. The warehousing queue mainly acts as a data buffer: when the data is wrong, warehousing can be stopped in time and the wrong data deleted, preventing erroneous data from affecting the application.
The method specifically comprises the following steps:
S1: collect internet data;
S2: the constant check module checks the constant data, judging its accuracy by comparing whether constants in the collected data have changed;
S3: the variable check module checks variables in the collected data, judging the accuracy of regular variables by whether they conform to the check rules; for irregular variables, a dirty-data identification model based on a naive Bayes algorithm identifies whether the collected data is accurate;
S4: if all types of data pass the check, go to step S6; if any type fails, go to step S5;
S5: inspect and update the data collection program, then return to step S1;
S6: store the data in the warehousing queue;
S7: move the data from the warehousing queue into the application database.
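A minimal sketch of steps S1 to S7 in Python (the language of the Scrapy-based collector described later); the function name, the check callbacks, and the queue length are illustrative assumptions, not part of the patent:

```python
from collections import deque

QUEUE_LEN = 100  # warehousing-queue length L (assumed value)

def verify_pipeline(records, check_constant, check_regular, check_irregular, store):
    """Sketch of steps S1-S7: run the three checks in order; queue data
    that passes all of them, and flush head elements to the application
    database once the warehousing queue exceeds its threshold."""
    queue = deque()
    for rec in records:                           # S1: collected internet data
        if not check_constant(rec):               # S2: constant check
            return "recheck collection program"   # S5
        if not check_regular(rec):                # S3: rule-based check
            return "recheck collection program"
        if not check_irregular(rec):              # S3: naive-Bayes check
            return "recheck collection program"
        queue.append(rec)                         # S6: warehousing queue
        if len(queue) > QUEUE_LEN:                # S7: flush head to database
            store(queue.popleft())
    return "ok"
```

Any failed check short-circuits the loop, mirroring the return to S1 via S5 in the flow chart.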
Constant check module
The constant check module checks constants in the collected data as follows:
S21: manually extract constant information that does not change frequently from the collected data and store the constants in the database after manual verification;
S22: collect data with the Scrapy framework and parse it with an XPath tool; compare the constant data in the collected records with the constants stored in the database; if they are consistent, continue to step S23, otherwise go to step S24;
S23: pass from the constant check module to the variable check module;
S24: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
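The comparison in steps S21 and S22 can be sketched as follows; the field names `code` and `name` follow the stock example used later in the text, and `check_constants` is an assumed helper name:

```python
def check_constants(record, stored_constants):
    """Compare constant fields (e.g. stock code and name) in a collected
    record against the values previously stored in the database (S22)."""
    for field in ("code", "name"):
        if record.get(field) != stored_constants.get(field):
            return False   # S24: inconsistent, inspect the collection program
    return True            # S23: proceed to the variable check module

# usage: constants verified manually and stored in advance (S21)
stored = {"code": "600000", "name": "SPDB"}
assert check_constants({"code": "600000", "name": "SPDB", "price": 7.8}, stored)
assert not check_constants({"code": "600001", "name": "SPDB"}, stored)
```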
Variable check module
Rule-based data check:
For regular variable data, check rules are established with regular expressions and rule-based data checks are performed. The specific steps are:
S31: manually extract the regular variables from the collected data and establish a check rule for each regular variable based on the business rules; for example, for the real-time price of a stock, the check rule is set to a floating-point number that must be greater than 0.
S32: collect data with the Scrapy framework and parse it with an XPath tool; perform the rule-based check on each regular variable; if it passes, continue to step S33, otherwise go to step S34;
S33: pass from the rule-based data check module to the irregular variable check module;
S34: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
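A sketch of the rule-based check of steps S31 and S32, assuming the example rule "a floating-point number greater than 0" and hypothetical field names:

```python
import re

# Check rule for price-like regular variables (S31): a positive float.
# The pattern is an assumption matching the example rule in the text.
POSITIVE_FLOAT = re.compile(r"^(0\.\d*[1-9]\d*|[1-9]\d*(\.\d+)?)$")

def check_regular(record, fields=("price", "open", "high", "low")):
    """Rule-based check (S32): every regular variable must match its rule."""
    return all(POSITIVE_FLOAT.match(str(record.get(f, ""))) for f in fields)

assert check_regular({"price": "7.8", "open": "7.5", "high": "8.0", "low": "7.2"})
assert not check_regular({"price": "-1", "open": "7.5", "high": "8.0", "low": "7.2"})
```

In practice each field would carry its own compiled rule, as Table 3 later assigns different rules to different fields.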
Dirty-data identification based on naive Bayes:
For irregular variable data such as public opinion data, the invention establishes a dirty-data identification model based on naive Bayes to identify the accuracy of the data.
Naive Bayes is a classification algorithm based on the Bayes rule in probability theory, which is shown in equation (1):

P(y|x) = P(x|y) P(y) / P(x)    (1)

When x consists of multiple independent events x1, x2, ..., xn, the Bayes rule is shown in equation (2):

P(y|x1, x2, ..., xn) = P(y) ∏ P(xi|y) / P(x1, x2, ..., xn)    (2)

In equations (1) and (2),
P(y|x) is the posterior probability, the probability that event y occurs given that event x has occurred,
P(x) and P(y) are the probabilities that events x and y occur,
P(x|y) is the conditional probability, the probability that x occurs given that event y has occurred;
the naive Bayes algorithm computes the Bayes probability of each class for every piece of data, and the class with the highest probability is the class the data belongs to; since the value of P(x) is the same for all classes, the Bayes algorithm reduces to equation (3):

c* = argmax(c ∈ y) P(c) ∏ P(xi|c)    (3)

In equation (3), y is the set of all classes and c is a class in y.
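Equation (3) can be sketched directly in Python; the log-probability form and the toy probability tables below are assumptions for illustration, not values from the patent:

```python
import math

def classify(words, priors, cond_prob):
    """Equation (3): pick the class c maximizing P(c) * prod P(x_i|c).
    Logs are summed instead of multiplying probabilities to avoid
    underflow; unseen words get a small floor probability."""
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior) + sum(
            math.log(cond_prob[c].get(w, 1e-6)) for w in words)
        if score > best_score:
            best, best_score = c, score
    return best

# invented toy model: word probabilities under "valid" and "dirty" classes
priors = {"valid": 0.9, "dirty": 0.1}
cond = {"valid": {"price": 0.3, "stock": 0.4},
        "dirty": {"error": 0.5, "timeout": 0.3}}
assert classify(["stock", "price"], priors, cond) == "valid"
assert classify(["error", "timeout"], priors, cond) == "dirty"
```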
The invention establishes the dirty-data identification model based on the naive Bayes algorithm as follows:
S41: data acquisition: collect data with the Scrapy framework and parse it with an XPath tool;
S42: data preprocessing: filter html tags such as < p > and < br > out of the data with a regular expression, set a minimum data length Min, and delete data shorter than Min;
S43: manually label whether each record is dirty data to obtain a sample set, and split the samples into a training set and a test set in a certain proportion;
S44: segment the data with a word segmentation tool, convert the text into word vectors, and select the n most frequent words as data features, denoted x1, x2, ..., xn;
S45: count the occurrence probability of each word under the valid-data and dirty-data categories to obtain P(xi|y), and count the proportions of dirty and valid data to obtain P(y), thereby obtaining the Bayesian model;
S46: verify the accuracy of the model with the test set, and adjust the model to improve its precision.
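Steps S43 to S45 amount to estimating P(y) and P(xi|y) from labelled samples. A minimal sketch, with an invented toy corpus and add-one (Laplace) smoothing, a common refinement the patent does not specify:

```python
from collections import Counter, defaultdict

def train_bayes(samples):
    """S43-S45 sketch: estimate the priors P(y) and the conditional word
    probabilities P(x_i|y) from manually labelled (words, label) pairs,
    with add-one smoothing so unseen words never get zero probability."""
    labels = Counter(label for _, label in samples)
    priors = {y: n / len(samples) for y, n in labels.items()}   # P(y)
    word_counts = defaultdict(Counter)
    for words, label in samples:
        word_counts[label].update(words)
    vocab = {w for words, _ in samples for w in words}
    cond_prob = {}
    for y, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)
        cond_prob[y] = {w: (counts[w] + 1) / total for w in vocab}  # P(x_i|y)
    return priors, cond_prob

# invented toy corpus standing in for manually labelled samples (S43)
samples = [(["stock", "price", "rise"], "valid"),
           (["stock", "open", "close"], "valid"),
           (["page", "not", "found"], "dirty")]
priors, cond = train_bayes(samples)
assert abs(priors["valid"] - 2 / 3) < 1e-9
assert cond["valid"]["stock"] > cond["dirty"]["stock"]
```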
The irregular variables are checked with the naive Bayes algorithm as follows:
S51: convert the irregular variable to be checked from text into a word vector with a word segmentation tool;
S52: input the word vector into the trained Bayes model to identify whether it is dirty data;
S53: set a threshold m; when the amount of dirty data found in step S52 exceeds m, go to S55, otherwise continue to step S54;
S54: the data passes the check and enters the tail of the warehousing queue; when the queue length exceeds a threshold L, the head element is stored in the application database;
S55: delete the recently queued data from the warehousing queue, inspect the data collection program, analyze the cause of the large amount of dirty data, and update the program.
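Steps S53 to S55 can be sketched as follows; `process_batch`, the batch interface, and the small thresholds in the usage example are assumptions (the embodiment later uses m = 10 and a queue length of 100):

```python
from collections import deque

def process_batch(records, is_dirty, store, m=10, max_len=100):
    """S53-S55 sketch: count dirty records; if the count exceeds the
    threshold m, drop the recently queued data and signal that the
    collection program must be inspected (S55); otherwise queue verified
    data (S54) and flush head elements to the application database once
    the queue is longer than L (max_len)."""
    queue = deque()
    dirty = 0
    for rec in records:
        if is_dirty(rec):
            dirty += 1
            if dirty > m:
                queue.clear()          # S55: discard recently queued data
                return queue, "inspect collection program"
        else:
            queue.append(rec)          # S54: verified data enters the tail
            while len(queue) > max_len:
                store(queue.popleft()) # head element goes to the database
    return queue, "ok"

stored = []
queue, status = process_batch(range(5), lambda r: False, stored.append,
                              m=2, max_len=3)
assert status == "ok" and len(queue) == 3 and stored == [0, 1]
```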
The invention is verified by taking the checking of stock market data as an embodiment:
In this embodiment stock market data is the acquisition object; collecting real-time stock quotes and public opinion data of listed enterprises provides data support for further quantitative financial analysis. The stock market data is mainly collected from the public, transparent stock information published on securities websites such as Eastmoney and Tonghuashun, and the public opinion data of listed enterprises is mainly collected from the latest news published on media websites such as Baidu News and Sina News. The specific fields collected are shown in Tables 1 and 2.
Table 1: internet collected data dictionary
Field(s) Type (B) Description of the invention
code string Stock code
name string Stock name
price float Real-time price of stock
news array Public opinion information
open float Price of opening dish
high float Highest price
low float Lowest price
per float Market profit rate
pb float Net rate of market
changepercent float Amplitude of fluctuation
mktcap float Total market value
nmc float Value of circulation market
Table 2: details of news field
Field(s) Type (B) Description of the invention
news_url string Original linking
title string Title
news_date string When releasedWorkshop
source string Origin of origin
abstract string Abstract
detail string Details of
The data acquisition tool used by the invention is the Scrapy framework, and the data parsing tool is XPath. Scrapy is a multithreaded, asynchronous data acquisition framework implemented in Python, mainly comprising a task scheduling module (Scheduler), a data request module (Downloader), a data acquisition module (Spiders), and a data storage module (Pipeline). The XPath tool is mainly used to parse data from structured websites.
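A stand-in for the parsing step, using the standard library's limited XPath subset rather than a full XPath engine; the HTML fragment and class names are invented for illustration:

```python
from xml.etree import ElementTree

# Minimal stand-in for the Scrapy + XPath parsing step: extract stock
# code, name, and price from a fragment of structured markup.
html = """
<div>
  <span class="code">600000</span>
  <span class="name">SPDB</span>
  <span class="price">7.85</span>
</div>
"""
root = ElementTree.fromstring(html)
# ".//span" is the ElementTree XPath subset; Scrapy selectors accept
# richer expressions such as //span[@class="price"]/text()
record = {span.get("class"): span.text for span in root.findall(".//span")}
assert record == {"code": "600000", "name": "SPDB", "price": "7.85"}
```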
The software and hardware environment used for the experiment in this embodiment is: Windows 7 Professional as the operating system, Python 3.6 as the development language, an Intel Core i7 CPU, 16 GB of memory, a PCIe SSD, and a GeForce GTX 1060 graphics card.
In this embodiment MongoDB is used as the application database, Redis stores the warehousing queue, and the Scrapy framework collects the stock market data. The constants are the stock code and stock name, the irregular variable is the public opinion information of listed enterprises, and the variables related to stock prices are regular variables, as shown in Table 3.
Table 3: stock market data field checking rule
Field(s) Type (B) Description of the invention Checking method
code String Stock code Constant check
name String Stock name Constant check
price float Real-time price of stock Regular variable check
news Array Public opinion information Random variable check
open float Price of opening dish Regular variable check
high float Highest price Regular variable check
low float Lowest price Regular variable check
per float Market profit rate Regular variable check
pb float Net rate of market Regular variable check
changepercent float Amplitude of fluctuation Regular variable check
mktcap float Total market value Regular variable check
nmc float Value of circulation market Regular variable check
In this embodiment the rules of three regular variables, the change percentage, the price-earnings ratio, and the price-to-book ratio, only require floating-point values, while the rules of the other regular variables require floating-point values greater than 0. The experiment uses the naive Bayes algorithm to build a public opinion data identification model from 2000 manually labelled samples, of which 1823 are valid positive samples and 177 are negative dirty-data samples; the identification accuracy of the model reaches 86.6%. The length of the warehousing queue is set to 100; when the proportion of identified dirty data reaches 10%, that is, when the amount of dirty data in the warehousing queue exceeds 10, the data acquisition program is considered faulty, the acquisition process is stopped, and the data and the acquisition program are verified manually.
The invention discloses a method for automatically checking the accuracy of internet data acquisition results. Taking stock market data as an example, it divides the data into constants, regular variables, and irregular variables, and proposes for these three data types a static constant checking method, a rule-based data checking method, and a naive-Bayes dirty-data identification method respectively. The three methods are combined to check the data in sequence; verified data is stored in the warehousing queue for buffering and then moved into the application database, providing data support for quantitative financial analysis. Experiments show that the method can effectively find erroneous data, has a high identification rate, and can be applied in practice to the verification of data acquisition.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A method for checking internet data acquisition results, characterized in that: the data is divided into three types, constants, regular variables, and irregular variables; constants are checked with a static constant checking method, regular variables with rule-based checks, and irregular variables with naive-Bayes dirty-data identification; if all types of data pass the check, the data is stored in an application database, and if any type fails, the data acquisition program is inspected and updated.
2. The method for verifying the internet data acquisition result according to claim 1, specifically comprising the steps of:
S1: collect internet data;
S2: the constant check module checks the constant data, judging its accuracy by comparing whether constants in the collected data have changed;
S3: the variable check module checks variables in the collected data, judging the accuracy of regular variables by whether they conform to the check rules; for irregular variables, a dirty-data identification model based on a naive Bayes algorithm identifies whether the collected data is accurate;
S4: if all types of data pass the check, go to step S6; if any type fails, go to step S5;
S5: inspect and update the data collection program, then return to step S1;
S6: store the data in the warehousing queue;
S7: move the data from the warehousing queue into the application database.
3. The method for verifying the internet data acquisition result according to claim 2, wherein the constant check module checks the constant data as follows:
S21: manually extract constant information that does not change frequently from the collected data and store the constants in the database after manual verification;
S22: collect data with the Scrapy framework and parse it with an XPath tool; compare the constant data in the collected records with the constants stored in the database; if they are consistent, continue to step S23, otherwise go to step S24;
S23: pass from the constant check module to the variable check module;
S24: inspect the data collection program, analyze the cause of the inconsistency, and update the program.
4. The method for verifying the internet data acquisition result as recited in claim 1, wherein a regular expression is used to establish the verification rule for regular variable data, and the rule-based data verification is performed.
5. The method for verifying the internet data acquisition result as recited in claim 4, wherein the rule-based data checking comprises the following steps:
S31: manually extracting the regular variables from the acquired data, and establishing a corresponding checking rule for each regular variable based on the business rules;
S32: collecting data with the Scrapy framework, parsing it with an XPath tool, and performing the rule-based check on each regular variable; if the check passes, continuing to step S33, otherwise turning to step S34;
S33: passing from the rule-based data checking module to the irregular variable checking module;
S34: checking the data collection program, analyzing the cause of the failure, and updating the program.
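Steps S31-S32 can be sketched with Python's `re` module; the specific fields ("publish_date", "price") and their patterns are hypothetical business rules chosen for illustration:

```python
import re

# One checking rule per regular variable (S31), expressed as a regular
# expression per claim 4. These patterns are invented examples: an
# ISO-style date and a price with up to two decimal places.
check_rules = {
    "publish_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "price": re.compile(r"^\d+(\.\d{1,2})?$"),
}

def check_regular_variables(record: dict) -> bool:
    """Return True when every regular variable matches its rule (S32)."""
    return all(rule.fullmatch(str(record.get(field, "")))
               for field, rule in check_rules.items())

passed = check_regular_variables({"publish_date": "2020-04-22", "price": "9.99"})
failed = check_regular_variables({"publish_date": "22/04/2020", "price": "9.99"})
```

A record that passes moves on to the irregular-variable check; a failure, as in S34, points back at the collection program rather than at the single record.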
6. The method for verifying internet data collection results as recited in claim 2, wherein the dirty data identification model is established based on the naive Bayes algorithm through the following steps:
S41: data acquisition: collecting data with the Scrapy framework and parsing it with an XPath tool;
S42: data preprocessing: filtering HTML tags from the data with regular expressions, setting a minimum data length Min, and deleting data shorter than Min;
S43: manually labelling whether each piece of data is dirty data to obtain a sample set, and dividing the samples into a training set and a test set in a certain proportion;
S44: segmenting the data with a word segmentation tool, converting the text data into word vectors, and selecting the n words with the highest occurrence frequency as data features, recorded as x₁, x₂, …, xₙ;
S45: counting the occurrence probability of each word under the valid data and dirty data categories to obtain P(xᵢ|y), and counting the probabilities of dirty data and valid data to obtain P(y), thereby obtaining the Bayes model;
S46: verifying the accuracy of the model with the test set, and adjusting the model to improve its precision.
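The counting in steps S44-S45 can be sketched as a minimal count-based naive Bayes classifier; the tiny labelled corpus below is invented for illustration, and a real implementation would use a word segmentation tool and a much larger sample set:

```python
from collections import Counter

# Hypothetical labelled sample set (S43): short texts tagged as valid
# data or dirty data.
samples = [
    ("latest market report released today", "valid"),
    ("quarterly earnings report published", "valid"),
    ("click here free prize winner", "dirty"),
    ("winner click free offer now", "dirty"),
]

# P(y): class frequencies; P(x_i|y): per-class word frequencies (S45).
class_counts = Counter(label for _, label in samples)
word_counts = {label: Counter() for label in class_counts}
for text, label in samples:
    word_counts[label].update(text.split())

def posterior(text: str, label: str) -> float:
    """Unnormalised P(y) * prod_i P(x_i|y), with add-one smoothing so
    unseen words do not zero out the product."""
    vocab = {w for counts in word_counts.values() for w in counts}
    p = class_counts[label] / len(samples)
    total = sum(word_counts[label].values())
    for w in text.split():
        p *= (word_counts[label][w] + 1) / (total + len(vocab))
    return p

def classify(text: str) -> str:
    """Assign the class with the highest posterior score."""
    return max(class_counts, key=lambda label: posterior(text, label))
```

Step S46 would then compare `classify` against the held-out test labels and adjust n, the preprocessing, or the sample split until the accuracy is acceptable.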
7. The method for verifying internet data collection results of claim 6, wherein the irregular variables are checked based on the naive Bayes algorithm as follows:
S51: converting the irregular variable to be checked from text into a word vector with a word segmentation tool;
S52: inputting the word vector into the trained Bayes model to identify whether it is dirty data;
S53: setting a threshold m; when the amount of dirty data identified in step S52 exceeds m, turning to step S55, otherwise continuing to step S54;
S54: the checked data enters the tail of the warehousing queue, and when the queue length exceeds a threshold L, the head element is stored in the application database;
S55: deleting the data most recently stored in the warehousing queue from the queue, checking the data collection program, analyzing the cause of the large amount of dirty data, and updating the program.
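The buffering logic of steps S53-S55 can be sketched as follows. The thresholds m and L are configuration values the patent leaves open, and this simplified version rejects an over-dirty batch outright rather than unwinding recently queued records as S55 describes:

```python
from collections import deque

M_DIRTY_THRESHOLD = 3   # max tolerated dirty records per batch (m, assumed)
QUEUE_LENGTH = 5        # warehousing-queue length before flushing (L, assumed)

warehouse_queue = deque()
application_db = []     # stands in for the application database

def enqueue_batch(records, dirty_flags) -> bool:
    """Append checked records to the queue tail (S54); when the queue
    grows past L, flush head elements to the application database.
    If the batch contains more than m dirty records, refuse it (S55)."""
    if sum(dirty_flags) > M_DIRTY_THRESHOLD:
        return False  # signal: review and update the collection program
    warehouse_queue.extend(records)
    while len(warehouse_queue) > QUEUE_LENGTH:
        application_db.append(warehouse_queue.popleft())
    return True
```

The queue acts as a quarantine window: data only reaches the application database after it has survived long enough for a burst of dirty data to be detected and the suspect tail discarded.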
8. The method for verifying internet data collection results as recited in claim 2, wherein:
naive Bayes is a classification algorithm based on the Bayes rule, which is shown in equation (1):

P(y|x) = P(x|y) · P(y) / P(x)    (1)

when x consists of multiple independent events x₁, x₂, …, xₙ, the Bayes rule takes the form of equation (2):

P(y|x₁, x₂, …, xₙ) = P(y) · ∏ᵢ P(xᵢ|y) / P(x₁, x₂, …, xₙ)    (2)

in equations (1) and (2),
P(y|x) is the posterior probability, i.e. the probability that event y occurs given that event x has occurred,
P(x) and P(y) are the probabilities that events x and y occur,
P(x|y) is the conditional probability, i.e. the probability that x occurs given that event y has occurred;
the naive Bayes algorithm calculates the Bayes probability of each class for every piece of data, and the class with the highest probability is the class to which the data belongs; since the value of P(x) is the same for all classes, the algorithm reduces to equation (3):

ŷ = argmax_{c ∈ y} P(c) · ∏ᵢ P(xᵢ|c)    (3)

in equation (3), y is the set of all classes, and c is a class in y.
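Equation (3) can be illustrated numerically; the two classes and the per-class feature probabilities below are invented values, not data from the patent:

```python
import math

# Invented priors P(c) for two classes and conditional probabilities
# P(x_i|c) for three conditionally independent features.
p_class = {"c1": 0.6, "c2": 0.4}
p_feature_given_class = {
    "c1": [0.2, 0.5, 0.1],
    "c2": [0.4, 0.3, 0.5],
}

def score(c: str) -> float:
    """P(c) * prod_i P(x_i|c): the quantity maximised in equation (3)."""
    return p_class[c] * math.prod(p_feature_given_class[c])

# c1: 0.6 * 0.2 * 0.5 * 0.1 = 0.006; c2: 0.4 * 0.4 * 0.3 * 0.5 = 0.024,
# so despite the higher prior on c1, the features favour c2.
predicted = max(p_class, key=score)
```

This shows why P(x) can be dropped: it scales both scores equally and cannot change which class attains the maximum.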
CN202010324527.5A 2020-04-22 2020-04-22 Method for verifying internet data acquisition result Pending CN111506566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010324527.5A CN111506566A (en) 2020-04-22 2020-04-22 Method for verifying internet data acquisition result


Publications (1)

Publication Number Publication Date
CN111506566A true CN111506566A (en) 2020-08-07

Family

ID=71864815



Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145207A (en) * 2018-08-01 2019-01-04 广东奥博信息产业股份有限公司 A kind of information personalized recommendation method and device based on classification indicators prediction
CN109800168A (en) * 2019-01-24 2019-05-24 北京奇艺世纪科技有限公司 The test method and device of the action event data of software
CN110442709A (en) * 2019-06-24 2019-11-12 厦门美域中央信息科技有限公司 A kind of file classification method based on model-naive Bayesian

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination