CN117076810A

CN117076810A - Internet big data processing system and method based on artificial intelligence

Info

Publication number: CN117076810A
Application number: CN202311316381.XA
Authority: CN
Inventors: 刘金磊
Original assignee: Ruizhi Technology Group Co ltd
Current assignee: Ruizhi Technology Group Co ltd
Priority date: 2023-10-12
Filing date: 2023-10-12
Publication date: 2023-11-17

Abstract

The invention discloses an Internet big data processing system based on artificial intelligence, which relates to the technical field of computers and comprises the following components: an Internet big data collection module, used for collecting related data from different data sources and transmitting the data to an Internet big data preprocessing module; the Internet big data preprocessing module, used for cleaning the collected data and transmitting the data to an Internet big data storage module; the Internet big data storage module, used for efficiently storing the preprocessed data and providing a query interface for an Internet big data analysis module; and the Internet big data analysis module, used for carrying out deep analysis, mining and processing on the stored data by using artificial intelligence technology. By combining artificial intelligence technology, hidden patterns and rules in the data are discovered automatically, the capability for insight and problem solving is improved, the analysis and utilization of Internet big data is further improved, and real-time response is achieved.

Description

Internet big data processing system and method based on artificial intelligence

Technical Field

The invention relates to the technical field of computers, in particular to an Internet big data processing system and method based on artificial intelligence.

Background

The internet big data processing system refers to a system capable of effectively collecting, storing, managing, analyzing and mining massive, diverse, fast and valuable data generated on the internet. The existing internet big data processing system mainly comprises the following types:

Batch processing-based systems, such as Hadoop and Spark, perform offline batch processing on large-scale static data and support operations such as data cleaning, conversion and aggregation, as well as tasks such as machine learning and data mining. Their advantages are that they can process massive data, provide high reliability and fault tolerance, and support multiple programming languages and frameworks. Their disadvantages are slow processing speed, an inability to meet real-time or near-real-time requirements, and weak processing capability for streaming data.

Stream processing-based systems, such as Storm and Flink, perform online stream processing on continuously generated dynamic data and support operations such as data filtering, conversion and aggregation, as well as tasks such as complex event processing and real-time analysis. Their advantages are that they can process rapidly changing data, provide low latency and high throughput, and support multiple programming languages and frameworks. Their disadvantages are low processing precision, an inability to guarantee data integrity and consistency, and weak processing capability for batch data.

Hybrid processing-based systems, such as the Lambda and Kappa architectures, support both batch and stream processing and enable unified management and analysis of static and dynamic data. Their advantage is that they take the scale, speed and value of the data into account together, providing an efficient and flexible solution. Their disadvantage is that the system architecture is complex, multiple parallel processing layers need to be maintained, and data consistency and fault tolerance are difficult to guarantee.

Disclosure of Invention

The invention provides an Internet big data processing system based on artificial intelligence, which comprises: an Internet big data collection module, an Internet big data preprocessing module, an Internet big data storage module and an Internet big data analysis module;

the internet big data collection module is used for collecting related data from different data sources and transmitting the data to the internet big data preprocessing module;

the internet big data preprocessing module is used for cleaning the collected data and transmitting the data to the internet big data storage module;

the internet big data storage module is used for efficiently storing the preprocessed data and providing a query interface for the internet big data analysis module;

the Internet big data analysis module is used for carrying out deep analysis, mining and processing on the stored data by using an artificial intelligence technology.

An internet big data processing system based on artificial intelligence as described above, wherein the internet big data collection module comprises the following sub-modules:

the web crawler sub-module is used for crawling webpage data from the Internet by using a web crawler program;

the API access sub-module is used for accessing and acquiring data by calling an API;

the Internet of things equipment access submodule is used for acquiring Internet of things equipment data exposed in the Internet;

the data transmission sub-module is used for transmitting the collected internet big data to the internet big data preprocessing module.

The Internet big data processing system based on artificial intelligence, wherein the Internet big data preprocessing module comprises the following submodules:

the data de-duplication sub-module is used for identifying and removing duplicate data;

the missing value processing submodule is used for identifying and processing the missing values;

the data format conversion sub-module is used for converting data with different formats into a uniform format;

the data transmission sub-module is used for transmitting the preprocessed data to the Internet big data storage module.

An artificial intelligence based internet big data processing system as described above wherein the missing value processing sub-module uses special values NaN and NULL to represent missing values to ensure that the structure of the original data is not changed.

According to the Internet big data processing system based on the artificial intelligence, the data transmission sub-modules in the Internet big data collection module and the Internet big data preprocessing module both adopt the HTTPS security protocol and desensitize sensitive data.

An artificial intelligence based internet big data processing system as described above, wherein the internet big data storage module comprises the following sub-modules:

the data storage sub-module adopts a distributed storage structure combining a relational database and a NoSQL database to store data;

the data management submodule is used for effectively managing the stored data.

The Internet big data processing system based on artificial intelligence, as described above, wherein the data management sub-module specifically comprises the following functional points:

data partitioning and sharding is used for partitioning and sharding the Internet big data;

the data backup and disaster recovery are used for periodically carrying out data backup and establishing a disaster recovery mechanism at the same time;

data security and rights management is used for database access rights verification and management of user access rights.

In the Internet big data processing system based on artificial intelligence as described above, the deep analysis, mining and processing of the stored data comprises feature extraction and feature map establishment, with the following sub-steps:

building a training data set;

taking the data samples of the feature samples in the training data set as input values and the feature labels as output values, and performing model training to obtain a feature extraction model;

and establishing a feature map according to the output result of the feature extraction model.

In the Internet big data processing system based on artificial intelligence as described above, the feature extraction model is divided into an entity feature extraction function, a relationship feature extraction function and an attribute feature extraction function.

The invention also provides an Internet big data processing method based on artificial intelligence, which comprises the following steps:

step1, collecting data of different data sources in the Internet;

step2, preprocessing the collected data;

step3, high-efficiency storage is carried out on the preprocessed data;

step4, carrying out deep analysis, mining and processing on the stored data.

An internet big data processing method based on artificial intelligence as described above,

The beneficial effects achieved by the invention are as follows: by combining artificial intelligence technology, hidden patterns and rules in the data are discovered automatically, the capability for insight and problem solving is improved, the analysis and utilization of Internet big data is further improved, and real-time response is achieved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to these drawings for a person having ordinary skill in the art.

FIG. 1 is a schematic diagram of an Internet big data processing system based on artificial intelligence according to a first embodiment of the present invention;

FIG. 2 is a flowchart of an Internet big data processing method based on artificial intelligence according to a second embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

As shown in fig. 1, a first embodiment of the present invention provides an internet big data processing system based on artificial intelligence, including:

(1) The internet big data collection module is used for collecting related data from different data sources and transmitting the data to the internet big data preprocessing module;

the collection of internet big data refers to the acquisition and collection of large-scale data from the internet, which can come from various sources, and the internet big data collection module comprises the following submodules:

(1) Web crawler sub-module: used for capturing webpage data from the Internet with a web crawler program, which automatically browses web pages and extracts the required data according to certain rules and strategies;

(2) API access sub-module: used for accessing and acquiring data by calling APIs. The data includes that offered through the open API interfaces provided by websites and services, the various data collected and recorded by mobile applications while users use them, and user-generated content on social media platforms, such as tweets, posts and comments; such data can be acquired through open API interfaces or from dedicated data providers;

(3) Internet of Things device access sub-module: used for acquiring the data of Internet of Things devices exposed on the Internet. With the development of the Internet of Things, more and more devices and sensors are connected to the Internet, and the data acquired by these devices, such as temperature, humidity and light, can be collected and used for analyzing and monitoring environmental changes;

(4) Data transmission sub-module: used for transmitting the collected Internet big data to the Internet big data preprocessing module; it adopts the HTTPS security protocol and desensitizes sensitive data.
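The behavior of the data transmission sub-module can be illustrated with a minimal sketch. It assumes that "sensitive data" means e-mail addresses and 11-digit phone numbers and that the preprocessing module exposes an HTTPS endpoint; the patterns, the function names `desensitize` and `build_https_request`, and any URL passed in are hypothetical and are not specified by the invention.

```python
import json
import re
import urllib.request

# Hypothetical patterns: a real deployment would define its own sensitive fields.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)
PHONE_RE = re.compile(r"\b(\d{3})\d{4}(\d{4})\b")


def desensitize(text):
    """Mask e-mail local parts and the middle digits of 11-digit phone numbers."""
    text = EMAIL_RE.sub(r"\1***@\2", text)
    return PHONE_RE.sub(r"\1****\2", text)


def build_https_request(record, url):
    """Prepare (but do not send) a POST request carrying the masked record."""
    if not url.startswith("https://"):
        raise ValueError("only HTTPS endpoints are allowed")
    masked = {
        key: desensitize(value) if isinstance(value, str) else value
        for key, value in record.items()
    }
    body = json.dumps(masked).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request object would then be sent with `urllib.request.urlopen`; masking before serialization means the original values never leave the collection module.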

(2) The internet big data preprocessing module is used for cleaning the collected data, ensuring the cleanness and consistency of the data and transmitting the data to the internet big data storage module;

through data cleaning, clean and consistent data can be obtained, a reliable data base is provided for the following steps of entity extraction, attribute extraction, relation extraction and the like, and the method specifically comprises the following submodules:

(1) Data de-duplication sub-module: used for identifying and removing duplicate data. Data collection may produce duplicate records, which must be identified and removed to ensure that the entities and relationships in the knowledge graph are unique; duplicate data is identified by means of a hash algorithm.

(2) Missing value processing sub-module: used for identifying and processing missing values. During data collection some data may have missing values, that is, some attributes or relationships are empty; the special values NaN and NULL are used to represent the missing values, so that the structure of the original data is preserved and the missing values can be handled in subsequent analysis.

(3) Data format conversion sub-module: used for converting data in different formats into a unified format. Different data sources may use different data formats during data collection, such as CSV, JSON and XML; the data format conversion sub-module processes data in these formats by integrating an ETL (Extract, Transform, Load) tool, that is, a tool dedicated to data extraction, conversion and loading.

(4) Data transmission sub-module: used for transmitting the preprocessed data to the Internet big data storage module; it adopts the HTTPS security protocol and desensitizes sensitive data.
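The de-duplication and missing value sub-modules can be illustrated with a minimal sketch. It assumes that records are flat dictionaries, that SHA-256 over a canonical JSON form serves as the hash algorithm (the invention does not name one), and that Python's `math.nan` and `None` stand in for NaN and NULL; all function names are hypothetical.

```python
import hashlib
import json
import math


def record_fingerprint(record):
    """Hash a canonical JSON form so that key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first occurrence of each record, comparing by fingerprint."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record_fingerprint(record)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


def mark_missing(record, fields, numeric_fields=()):
    """Return a copy restricted to `fields`, with every expected field present.

    Missing or empty numeric fields become NaN, all others become None (NULL),
    so the structure of the record is preserved.
    """
    filled = {}
    for field in fields:
        value = record.get(field)
        if value is None or value == "":
            filled[field] = math.nan if field in numeric_fields else None
        else:
            filled[field] = value
    return filled
```

Because NaN never compares equal to itself, `math.isnan` rather than `==` is the way to test for it in later analysis steps.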

(3) The internet big data storage module is used for efficiently storing the preprocessed data and providing a query interface for the internet big data analysis module;

the storage of internet big data needs to comprehensively consider the scale, structure, access requirement and security of the data, select a proper storage system and technology, and take corresponding management and protection measures, and specifically comprises the following submodules:

(1) Data storage sub-module: adopts a distributed storage structure combining a relational database and a NoSQL database, in which the relational database stores structured data and the NoSQL database stores unstructured data. Distributed storage is adopted because the volume of Internet data is huge, which reduces the read pressure on each data server.

(2) A data management sub-module: the method is used for effectively managing the stored data and improving the query efficiency and the security of the data, and specifically comprises the following functional points:

i. Data partitioning and sharding: used for partitioning and sharding Internet big data so as to improve data processing and query efficiency. Data can be partitioned by time, geographic location, subject and so on, and sharding distributes the data across multiple nodes of the distributed storage structure, which improves parallel processing capability;

ii. Data backup and disaster recovery: used for performing regular data backups and establishing a disaster recovery mechanism to guard against data loss and system faults. The data of the data storage sub-module is regularly backed up and uploaded to the cloud, and a disaster recovery plan is prepared in advance to protect data and restore services when a disaster event occurs. For example, when data server A goes down, the data of data server A backed up in the cloud is synchronized to standby server B and service processing resumes; when data server A is restarted, use of data server A is restored.

iii. Data security and rights management: used for verifying database access rights and managing user access rights. A whitelist of database access rights is set, sensitive tables in the database are encrypted, and different access rights are set for different users. For example, an ordinary user has viewing rights only and cannot modify the data, whereas only a data manager can modify the data.
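The partitioning, sharding and rights-management functions can be illustrated with a minimal sketch. It assumes monthly time partitions, hash-based shard routing over a fixed number of nodes, and two roles matching the example above; `NUM_SHARDS`, the role names and all function names are hypothetical choices, not details given by the invention.

```python
import hashlib
from datetime import datetime

NUM_SHARDS = 4  # hypothetical number of storage nodes

# Hypothetical role table mirroring the ordinary user / data manager example.
ROLE_PERMISSIONS = {
    "ordinary_user": {"read"},
    "data_manager": {"read", "write"},
}


def partition_key(timestamp):
    """Partition by time: one partition per calendar month."""
    return timestamp.strftime("%Y-%m")


def shard_for(record_id, num_shards=NUM_SHARDS):
    """Route a record to a shard with a stable hash of its identifier."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def is_allowed(role, action):
    """Check an action against the role table; unknown roles get no rights."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

A stable hash is used instead of Python's built-in `hash()`, whose value for strings changes between interpreter runs and would send the same record to different shards.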

(4) The Internet big data analysis module is used for carrying out deep analysis, mining and processing on the stored data by using an artificial intelligence technology;

The deep analysis, mining and processing of the stored data comprises feature extraction and feature map establishment.

Feature extraction extracts entity, attribute and relationship features from structured and unstructured data. Entities are things or concepts with a unique identity, such as persons, places and organizations; attributes are characteristics that describe an entity, such as name, age and address; relationships are the connections between entities and describe how the entities are related. A feature map is then formed from the extracted features and applied in practice. This specifically comprises the following sub-steps:

i. A training data set is established, where A_1 to A_n are the feature samples. Each feature sample in turn comprises its feature tag and structured or unstructured data samples, where a is the feature tag and a_1 to a_m are the structured or unstructured data samples containing the feature tag a;

ii. The model is trained by taking the data samples of the feature samples in the training data set as input values and the feature tags as output values, to obtain a feature extraction model. In the model, x is the input value; E(x) is the entity feature extraction function; G(E) is the relationship feature extraction function, which takes the extracted entity set E as its parameter; V(E) is the attribute feature extraction function, which likewise takes the extracted entities as its parameter; and the operator joining the results denotes splicing.

In the entity feature extraction function E(x), i is the subscript of the input data sample, n is the total number of input data samples, k_j is the j-th word split from the input data sample x, m is the total number of words, xk_j is the number of statements in the data sample x that include the word k_j, X is the total number of data samples, X_i is the total number of statements in the i-th data sample, and xk_ji is the number of statements in the i-th data sample that include the word k_j; one is added to the denominator to avoid division by zero. v is the sensitivity of entity extraction: the higher the sensitivity, the fewer entities are extracted, but the more strongly they are correlated with the data sample. The ownMax function sorts the calculated results for each k_j in descending order and returns the top v% of the k_j. It should be noted that splitting a data sample into words comprises two steps, preposition processing and word segmentation; word segmentation is implemented by importing the jieba library and using its word segmentation method.

In the relationship feature extraction function G(E), Location() is a function that determines the position of an entity: Location(e_j) returns the position, within the data sample statement, of the entity with subscript j in the entity set E, and Location(e_(j-1)) returns the position of the entity with subscript j-1, where j ranges from 2 to n; x_i is the i-th statement of the input data sample x. Sub() is an interception function that intercepts the part of the statement from the head entity to the tail entity. An entity for which Location() returns 0 is a head entity, and an entity for which it returns the statement length minus the entity length is a tail entity. The return value of G(E) is a set of entity relationships, each expressed in terms of e_t, e_w and g, where e_t is the head entity, e_w is the tail entity and g is the return value of the Sub() function.

In the attribute feature extraction function V(E), e_j is the entity with subscript j in the entity set E; K is a set of attribute marks containing common attribute marks such as "is", obtained both by presetting and by learning from the training samples; and x_i is the i-th statement of the data sample x. Subv is the attribute interception function: by character matching it finds the attribute mark that follows the entity e_j in the statement x_i, intercepts the words after the attribute mark and returns them as the attribute. The return value of V(E) is a set of entity-attribute relationships, each expressed in terms of e and v_1 to v_l, where e is the entity and v_1 to v_l are the attributes of entity e.

iii. A feature map is established according to the output of the feature extraction model: the data query interface of the Internet big data storage module is called to acquire data, the data is input into the feature extraction model as samples, and an entity feature set, a relationship feature set and an attribute feature set are output;

The feature map is a data structure that represents entities, attributes and the relationships among them in the form of a graph: entities are nodes, attributes are labels on the nodes, and relationships are edges between the nodes. The output feature set data is built into a graph according to the corresponding mapping relationships and stored in the NoSQL database as a graph structure.
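The relationship extraction, attribute extraction and graph construction described in the sub-steps above can be illustrated with a minimal sketch. It is a simplification under stated assumptions: statements are English strings rather than jieba-segmented Chinese text, entities are already known, the text between two consecutive entities is taken as the relationship, and the attribute mark set contains only "is". The function names and the dictionary layout of the graph are hypothetical and do not reproduce the Location(), Sub() or Subv definitions exactly.

```python
def extract_relations(statement, entities):
    """Return (head, connecting_text, tail) for consecutive entities in a statement."""
    located = sorted((statement.find(e), e) for e in entities if e in statement)
    relations = []
    for (pos_a, head), (pos_b, tail) in zip(located, located[1:]):
        # Text between the end of the head entity and the start of the tail entity.
        between = statement[pos_a + len(head):pos_b].strip()
        if between:
            relations.append((head, between, tail))
    return relations


def extract_attribute(statement, entity, marks=("is",)):
    """Return the words after an attribute mark that directly follows the entity."""
    start = statement.find(entity)
    if start < 0:
        return None
    rest = statement[start + len(entity):].lstrip()
    for mark in marks:
        if rest.startswith(mark + " "):
            return rest[len(mark):].strip().rstrip(".")
    return None


def build_graph(relations, attributes):
    """Entities become nodes, attributes node labels, relationships edges."""
    graph = {"nodes": {}, "edges": []}
    for entity, values in attributes.items():
        graph["nodes"].setdefault(entity, []).extend(values)
    for head, relation, tail in relations:
        graph["nodes"].setdefault(head, [])
        graph["nodes"].setdefault(tail, [])
        graph["edges"].append((head, relation, tail))
    return graph
```

A plain dictionary is used here only to keep the sketch self-contained; in the system described, the same nodes and edges would be written to the NoSQL database as a graph structure.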

The processed data can be used in search engines, AI question answering and other applications. A question is input and understood, its words are segmented into keywords, and the corresponding entities and the relationship and attribute features associated with the keywords are queried to form an answer. For example, for the question "What is the capital of the United States?", the entity keyword "United States" is queried first, and the "capital" relationship of the entity "United States" is then looked up.
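The lookup described above can be illustrated with a minimal sketch. It assumes that the feature map is available as a list of (head, relationship, tail) edges and that keyword matching is a simple case-insensitive substring test; the relationship name "capital", the sample edges and the function name `answer` are hypothetical and serve only as an example.

```python
def answer(question, edges, entities):
    """Find entities named in the question, then follow matching relationship edges."""
    text = question.lower()
    matched = [entity for entity in entities if entity.lower() in text]
    results = []
    for head, relation, tail in edges:
        if relation.lower() not in text:
            continue  # the question does not mention this relationship
        if head in matched:
            results.append(tail)
        elif tail in matched:
            results.append(head)
    return results


# Hypothetical sample data for illustration.
EDGES = [
    ("United States", "capital", "Washington"),
    ("France", "capital", "Paris"),
]
ENTITIES = ["United States", "France", "Washington", "Paris"]
```

With this sample data, `answer("What is the capital of the United States?", EDGES, ENTITIES)` follows the "capital" edge from the matched entity and returns `["Washington"]`.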

Example 2

As shown in fig. 2, a second embodiment of the present invention provides an internet big data processing method based on artificial intelligence, including:

S10: collecting data from different data sources in the Internet;

the collection of internet big data refers to the acquisition and collection of large-scale data from the internet, which can come from various sources, including in particular the following acquisition modes:

(1) web crawler acquisition: capturing webpage data from the Internet by using web crawlers, wherein the crawlers can automatically browse webpages and extract required data according to certain rules and strategies;

(2) API access acquisition: data is accessed and acquired by calling APIs. The data includes that offered through the open API interfaces provided by websites and services, the various data collected and recorded by mobile applications while users use them, and user-generated content on social media platforms, such as tweets, posts and comments; such data can be acquired through open API interfaces or from dedicated data providers;

(3) Internet of Things device access acquisition: the data of Internet of Things devices exposed on the Internet is acquired. With the development of the Internet of Things, more and more devices and sensors are connected to the Internet, and the data acquired by these devices, such as temperature, humidity and light, can be collected.
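The extraction step of web crawler acquisition can be illustrated with a minimal sketch. It assumes that a page has already been downloaded as an HTML string and that the "required data" is the visible text and the outgoing links; the class and function names are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextExtractor(HTMLParser):
    """Collect absolute link targets and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return the page URL, its outgoing links and its visible text."""
    parser = LinkAndTextExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {
        "url": base_url,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }
```

A crawler would feed the returned links back into its queue of pages to visit, subject to whatever rules and strategies it applies.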

S20: preprocessing the collected data;

through data cleaning, clean and consistent data can be obtained, a reliable data base is provided for the following steps of entity extraction, attribute extraction, relation extraction and the like, and the method specifically comprises the following substeps:

(1) Data de-duplication: duplicate data is identified and removed. Data collection may produce duplicate records, which must be identified and removed to ensure that the entities and relationships in the knowledge graph are unique; duplicate data is identified by means of a hash algorithm.

(2) Missing value processing: missing values are identified and processed. During data collection some data may have missing values, that is, some attributes or relationships are empty; special values (for example, NaN or NULL) are used to represent the missing values, so that the structure of the original data is preserved and the missing values can be handled in subsequent analysis.

(3) Data format conversion: data in different formats is converted into a unified format. Different data sources may use different data formats during data collection, for example CSV, JSON and XML; data in these formats is processed by integrating an ETL (Extract, Transform, Load) tool, that is, a tool dedicated to data extraction, conversion and loading.
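The format conversion step can be illustrated with a minimal sketch. The ETL tool is not named in this description, so Python's standard library is used as a stand-in; the sketch assumes that the unified format is a list of flat dictionaries and that XML records are wrapped in a `record` element. The function names and the `record` tag are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def from_csv(text):
    """Each CSV row becomes one dictionary keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_json(text):
    """Accept either a single JSON object or a list of objects."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def from_xml(text, record_tag="record"):
    """Each <record> element becomes one dictionary of child tag to child text."""
    # Note: ElementTree is not hardened against malicious XML from untrusted sources.
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in rec} for rec in root.iter(record_tag)]


PARSERS = {"csv": from_csv, "json": from_json, "xml": from_xml}


def to_unified(text, fmt):
    """Convert text in the named format into the unified list-of-dicts form."""
    try:
        parser = PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError("unsupported format: " + fmt) from None
    return parser(text)
```

All three parsers return the same shape, so the storage step that follows does not need to know which source format a record came from.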

S30: efficiently storing the preprocessed data;

the storage of internet big data needs to comprehensively consider the scale, structure, access requirement and security of the data, select a proper storage system and technology, and take corresponding management and protection measures, in particular:

(1) Data storage: a distributed storage structure combining a relational database and a NoSQL database is adopted, in which the relational database stores structured data and the NoSQL database stores unstructured data. Distributed storage is adopted because the volume of Internet data is huge, which reduces the read pressure on each data server.

(2) Data management: the stored data is effectively managed to improve the query efficiency and the security of the data, which specifically comprises the following functional points:

S40: deep analysis, mining and processing are carried out on the stored data;

Feature extraction extracts entity, attribute and relationship features from structured and unstructured data. Entities are things or concepts with a unique identity, such as persons, places and organizations; attributes are characteristics that describe an entity, such as name, age and address; relationships are the connections between entities and describe how the entities are related. A map is then formed from the extracted features and applied in practice. This specifically comprises the following sub-steps:

A map is established according to the output of the feature extraction model: the data query interface of the Internet big data storage module is called to acquire data, the data is input into the feature extraction model as samples, and an entity feature set, a relationship feature set and an attribute feature set are output;

The map is a data structure that represents entities, attributes and the relationships among them in the form of a graph: entities are nodes, attributes are labels on the nodes, and relationships are edges between the nodes. The output feature set data is built into a graph according to the corresponding mapping relationships and stored in the NoSQL database as a graph structure.
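The entity selection behind the entity feature set, that is, ranking the words of a data sample and keeping the top v%, can be illustrated with a minimal sketch. The scoring formula itself is not reproduced in this text, so the sketch substitutes a smoothed TF-IDF score, which is an assumption and not the formula of the invention. Whitespace splitting stands in for jieba word segmentation, and the function names are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Lower-case whitespace tokenization; a stand-in for jieba segmentation."""
    words = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in words if word]


def top_percent_keywords(sample, corpus, v):
    """Rank the words of `sample` by an assumed smoothed TF-IDF; keep the top v%."""
    if not corpus:
        raise ValueError("corpus must contain at least one data sample")
    words = tokenize(sample)
    if not words:
        return []
    term_counts = Counter(words)
    corpus_sets = [set(tokenize(doc)) for doc in corpus]
    scores = {}
    for word, count in term_counts.items():
        containing = sum(1 for doc_words in corpus_sets if word in doc_words)
        # One is added to the denominator so that it can never be zero.
        idf = math.log(len(corpus) / (1 + containing))
        scores[word] = (count / len(words)) * idf
    # Descending by score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    keep = max(1, math.ceil(len(ranked) * v / 100))
    return ranked[:keep]
```

Words that appear in nearly every sample, such as articles, receive a low or negative score and fall to the bottom of the ranking, which is the effect the descending sort relies on.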

The foregoing embodiments describe the general principles of the present invention in further detail and are not to be construed as limiting the scope of the invention; any modifications, equivalents, improvements and the like made on the basis of the teachings of the invention are intended to fall within its scope.

Claims

1. An artificial intelligence based internet big data processing system, comprising: an Internet big data collection module, an Internet big data preprocessing module, an Internet big data storage module and an Internet big data analysis module;

2. The internet big data processing system based on artificial intelligence according to claim 1, wherein the internet big data collection module comprises the following sub-modules:

3. The internet big data processing system based on artificial intelligence according to claim 1, wherein the internet big data preprocessing module comprises the following sub-modules:

4. An artificial intelligence based internet big data processing system according to claim 3, wherein the missing value handling submodule uses special values NaN and NULL to represent the missing values to ensure that the structure of the original data is not changed.

5. The system of claim 1, wherein the data transmission sub-modules in the internet big data collection module and the internet big data preprocessing module both use the HTTPS security protocol and desensitize sensitive data.

6. An artificial intelligence based internet big data processing system according to claim 1, wherein the internet big data storage module comprises the following sub-modules:

the data management submodule is used for effectively managing the stored data.

7. The internet big data processing system based on artificial intelligence according to claim 6, wherein the data management submodule specifically comprises the following functional points:

8. The internet big data processing system based on artificial intelligence according to claim 1, wherein the deep analysis, mining and processing of the stored data comprises feature extraction and feature map establishment, and the method comprises the following steps:

building a training data set;

9. The system of claim 8, wherein the feature extraction model is divided into an entity feature extraction function, a relationship feature extraction function, and an attribute feature extraction function.

10. An Internet big data processing method based on artificial intelligence comprises the following steps:

step1, collecting data of different data sources in the Internet;

step2, preprocessing the collected data;

step3, high-efficiency storage is carried out on the preprocessed data;

step4, carrying out deep analysis, mining and processing on the stored data.