CN111047112A

CN111047112A - Computer internet of things data processing system

Info

Publication number: CN111047112A
Application number: CN201911377769.4A
Authority: CN
Inventors: 刘巍巍
Original assignee: Shenyang Sport University
Current assignee: Shenyang Sport University
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2020-04-21
Anticipated expiration: 2039-12-27
Also published as: CN111047112B

Abstract

The invention provides a computer Internet of things data processing system which comprises a data acquisition module, a data processing module, a data storage module, an information optimization module and a logistics distribution module, and can acquire logistics data from a plurality of heterogeneous systems in real time to efficiently process the data in a real-time or batch processing mode, thereby improving the sequential delivery capacity of goods, reducing the forwarding times of the goods at intermediate nodes, improving the transportation efficiency of the goods, and overcoming the difficulties of untimely management of complex events and the like.

Description

Computer internet of things data processing system

Technical Field

The invention belongs to the field of computer Internet of things, and particularly relates to a computer Internet of things data processing system.

Background

The computer Internet of things is changing the way the logistics industry thinks. Logistics service providers use sensor technologies such as GPS and telemetry to track and manage their cargo processes, which helps to label and connect factories, ships, machines and the like. By correlating data from different sensors and social media and analyzing it in real time, they can also use external data that carries critical information about events, such as traffic accidents and natural disasters, to forecast events and prevent delivery delays. The connectivity of "things" enables instant communication between devices over the Internet, and this highly connected ecosystem has a profound impact on the revenue of logistics operators, their business customers and the end customers. One of the main advantages of the Internet of things ecosystem is that it can merge and fuse information from logistics sensors and external sensors, such as weather sensors and traffic (GPS) sensors. The Internet of things can also be connected with social media, which provides information on events such as major traffic incidents, accidents, weather and natural disasters.

However, because the data are diverse and are collected at different rates, the accuracy and speed of collecting and processing data from different sources vary widely. At the same time, the workload of processing data in real time is very large, and traditional logistics information systems cannot cope with it. On the other hand, although predictive analysis to predict shipment delays or prescriptive analysis to optimize routes can increase delivery speed, and thus customer satisfaction, within a prescribed time, delayed delivery remains an unsolved problem. Timely delivery is a significant challenge for logistics companies because delays are sometimes caused by factors that no one can control. Delivery delays can have various effects, such as customer churn or order cancellation, which can cause significant losses. Timely delivery is therefore critical to logistics companies. In recent years, logistics enterprises have begun to investigate how to use data to predict delays. In particular, with big data technology, logistics providers are interested in analyzing event streams from external resources such as social media, covering accidents, traffic congestion and other events, in real time to predict delays. Predicting delays in real time enables companies to take action, such as optimizing flight routes in real time. Existing solutions are based on classical data processing technology. Traditional logistics information systems therefore cannot process sensor or social media data in real time, because the data arrive at high speed, and traditional data processing methods cannot process schema-less data such as text. Existing data processing methods (e.g., techniques or algorithms) are not efficient enough to process data in real time.

Considering the evaluation of data sources, most existing solutions are limited to only one data source. In addition, for the continuous improvement of real-time systems, the prior art uses static historical data sets for testing, and obviously, the current logistics requirements cannot be met only by relying on historical data. Based on the above, the invention provides a mixed framework for batch processing and real-time processing of mass data, which is based on a classification algorithm and can collect stream data in real time from a plurality of heterogeneous systems to efficiently process the data in a real-time or batch processing mode. The present invention is directed to developing a hybrid solution that enables real-time data to be processed in bulk, making logistics services possible, and there is an urgent need for computer processing to provide programs to perform analysis in real-time.

Disclosure of Invention

The invention provides a computer Internet of things data processing system which is based on a classification algorithm and can collect logistics data from a plurality of heterogeneous systems to process the data efficiently in real time.

A computer Internet of things data processing system comprises a data acquisition module, a data processing module, a data storage module, an information optimization module and a logistics distribution module, wherein the data processing module comprises a batch data processing device and a real-time data processing module, the batch data processing device is used for reading/extracting stored data and preparing the data, the batch data processing device comprises a data preparation stage and a data processing stage, the data preparation stage comprises data extraction, data cleaning, data filtering, data integration and data storage, the data processing stage classifies and processes fully prepared data, the batch data processing device directly sends the data to the real-time data processing module through a wireless/wired network, the information optimization module performs line optimization on logistics and transmits an optimized line to the logistics distribution module through the wireless/wired data, the batch data processing device processes the logistics data from the plurality of data sensors and the logistics application in batches.

Furthermore, a data extractor captures web pages linked in a specific website from the cloud server, extracts links from the crawled web pages, and stores the extracted link data information in the data storage module respectively; the query module provides a user search interface, a user inputs search words, and returns query results to the user according to the query of the user, the data filtering is to remove noise from web pages, filter out some script identifiers and useless information, store useful texts in each web page, perform word segmentation, noise removal and sorting, extract keywords of the web pages, and acquire a web page PR value calculated based on the link relation of the web pages according to the link relation among the web pages extracted in the web page capturing module and the idea of a PageRank sorting algorithm; and then, calculating similarity weight of the logistics related information and related webpage keywords by using a space vector model, increasing the weight of historical search and search keywords of a user, finally recalculating contribution values among webpages with link relations through an algorithm, and obtaining a rank ranking, wherein the contribution values are used as important reference basis of logistics service.

Further, the data filtering comprises the following steps:

(1) analyzing the link relations among the web pages in the set $Set_{web}$ to be ranked, and determining the out-links and in-links of each web page;

(2) extracting keywords from the page content of each web page in $Set_{web}$ to generate the keyword set of the web page, $S_{web\_keywords} = \{V_1, V_2, V_3, \ldots, V_i\}$;

(3) calculating the similarity between the keywords of each web page in $Set_{web}$ and K to obtain the keyword correlation factor set W(u);

(4) finding, according to the user ID, the keyword list $S_{h\_web\_keywords}$ corresponding to the user, covering logistics, traffic, weather, geographical position and the like;

(5) calculating the similarity between the keywords of each web page in $Set_{web}$ and $S_{h\_web\_keywords}$ to obtain the influence factor H(u);

(6) for each web page there are three factors, and the comprehensive score of each web page is calculated according to the formula

$$GR = (1-d) + d\left[\sum PR(v)\left(\frac{\alpha}{N_v} + \beta \cdot W(u) + \gamma \cdot H(u)\right)\right]$$

to obtain the final web page ranking GR, wherein α, β and γ respectively represent the weights of the link, the topic relevance factor and the user factor in the PR value distribution.
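The scoring in step (6) can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes that the sum runs over the pages v that link to page u, that N_v is the number of out-links of v, that d is the usual PageRank damping factor, and that W(u) and H(u) are already available as numbers. The function names `pagerank` and `gr_scores` and the example graph are hypothetical.

```python
def pagerank(links, d=0.85, iters=100):
    """Plain PageRank in the form PR(u) = (1 - d) + d * sum(PR(v) / N_v)."""
    pages = sorted(set(links) | {t for outs in links.values() for t in outs})
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {}
        for u in pages:
            # Sum the contributions of every page v that links to u.
            s = sum(pr[v] / len(outs) for v, outs in links.items() if u in outs)
            new[u] = (1 - d) + d * s
        pr = new
    return pr


def gr_scores(links, pr, W, H, alpha, beta, gamma, d=0.85):
    """GR(u) = (1 - d) + d * sum_v PR(v) * (alpha / N_v + beta * W(u) + gamma * H(u)).

    Assumption: v ranges over the pages that link to u.
    """
    if min(alpha, beta, gamma) < 0 or abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    scores = {}
    for u in pr:
        total = 0.0
        for v, outs in links.items():
            if u in outs:
                total += pr[v] * (alpha / len(outs) + beta * W[u] + gamma * H[u])
        scores[u] = (1 - d) + d * total
    return scores


# Hypothetical three-page link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(links)
W = {"a": 0.1, "b": 0.1, "c": 0.9}   # topic relevance per page (made up)
H = {"a": 0.2, "b": 0.2, "c": 0.2}   # user factor per page (made up)
gr = gr_scores(links, pr, W, H, alpha=0.5, beta=0.3, gamma=0.2)
ranking = sorted(gr, key=gr.get, reverse=True)
```

With α = 1 and β = γ = 0 the score reduces to one further step of plain PageRank, which gives a simple sanity check on the sketch.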

Further, data extraction includes collecting various structured and unstructured information sources to obtain a complete and accurate description of the region of interest, and normalizing the multi-source heterogeneous data.

Furthermore, web pages are crawled using the open-source Heritrix crawler; building on its existing open source code, a user can extend each of its components to implement custom crawling logic and acquire the required resources from the network.

Furthermore, the data acquisition module acquires multi-source heterogeneous data, wherein the multi-source heterogeneous data comprises information of a data sensor and information of logistics application, and the data sensor comprises a vehicle sensor and a weather sensor; the logistics application comprises microblog and social media.

Further, data cleaning is the detection and correction or removal of corrupted or inaccurate records in record sets and tables.

Further, two steps of data set composition are performed: in a first step, data is converted from a source to a target serialized format; the second step is to merge the converted data.

Further, the real-time data processing module groups or segments the data items and generates an aggregated data set from the objective function, which facilitates effective analysis when predicting delivery delays.

Further, the information optimization module is used for constructing a collection system with high-throughput persistent data and reliable delivery, and organizes the logistics route information into topics, wherein each topic is divided into one or more linear, ordered message sequences, and each message is identified by its index in the message sequence.

The original PageRank algorithm considers only the in-link and out-link relations of web pages and does not analyze whether the content of a web page is consistent with or similar to the topic searched by the user. It can capture high-quality web pages, but it also captures web pages that are irrelevant to the query topic or have low similarity to it; that is, it suffers from the topic drift problem.

The real-time data processing module performs event clustering in real time and obtains instant insight into the processed data, and an aggregated data set is generated from the objective function, which facilitates effective analysis when predicting delivery delays. Logistics transportation is adjusted in time according to the real-time interactive data, so as to realize informatization and standardization of logistics distribution.

The computer Internet of things data processing system optimizes logistics lines, can save a large amount of manpower and material resources, enables goods to be delivered to customers in time, improves user satisfaction, improves the sequential delivery capacity of the goods, reduces the forwarding times of the goods at intermediate nodes, improves the transportation efficiency of the goods, and overcomes the difficulties of untimely management of complex events and the like.

Drawings

FIG. 1 is a schematic diagram of a computer Internet of things data processing system of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

A computer Internet of things data processing system comprises a data acquisition module, a data processing module, a data storage module and a logistics distribution module, wherein the data processing module comprises batch data processing equipment and a real-time data processing module, the batch data processing equipment is used for reading/extracting stored data and performing data preparation, cleaning and filtering of the data are performed under the real-time data processing condition, and the batch data processing equipment directly sends the data to the real-time data processing module through a wireless network.

The data acquisition module acquires multi-source heterogeneous data, wherein the multi-source heterogeneous data comprises information of a data sensor and information of logistics application, and the data sensor comprises a vehicle sensor and a weather sensor; the logistics applications include microblogging, Twitter, social media, Facebook, and the like.

The batch data processing device carries out batch processing on logistics data from a plurality of data sensors and logistics applications, and comprises two stages: a data preparation stage and a data processing stage. The data preparation stage comprises data extraction, data cleaning, data filtering, data integration and data storage. In the data processing stage, the fully prepared data are classified. Specifically:

Data extraction: used for collecting various information sources to obtain a complete and accurate description of the region of interest and for standardizing the multi-source heterogeneous data. The data extractor uses both internal and external data. The internal data sources are typically the systems used by the user: a customer's systems include a supply chain management information system, a customer relationship management (CRM) system, a logistics management system and an account management system (AMS). These systems produce large amounts of data, which are collected by the data extractor. The extractor also obtains data from external sources such as weather sensors and social media. Both structured and unstructured data may be collected: for example, unstructured text may be collected from microblogs, and structured business process data may be collected from a logistics information system. The data extractor crawls the web pages linked from a specific website via the cloud server and extracts links from the crawled pages, and the extracted link data are stored in the data storage module. The data extractor also comprises a web page preprocessing module and a query module. The web page preprocessing module analyzes the crawled web pages, builds indexes and calculates the rank of each web page; the query module provides a user search interface, through which the user inputs search terms and receives the query results. Web pages are crawled using the open-source Heritrix crawler, which fetches web page content in a multi-threaded manner; building on its existing open source code, a user can extend each of its components to implement custom crawling logic and acquire the required resources from the network.
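The link extraction step can be sketched as follows. Heritrix itself is a Java crawler, and this sketch does not use or imitate its API; it only illustrates, with the Python standard library, how links might be pulled out of an already fetched page. The class `LinkExtractor`, the function `extract_links` and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/track?id=1">track</a> <a href="http://example.org/x">x</a>'
found = extract_links(page, "http://example.com/home")
```

The extracted links would then be handed to the data storage module and used later to build the link graph for ranking.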

Data filtering: refers to a broad set of strategies or solutions for refining a data set. The data are refined to what a group of users needs, excluding other data that may be repetitive, irrelevant or even sensitive, because data overload increases the computational cost and reduces the accuracy of data processing. During collection, the data blocks, and especially their labels, determine the direct and indirect connections between the transportation, delivery, logistics and shipping processes. For example, the message "today's stock prices are very high" will be deleted by the data filter because it does not carry any information related to the logistics flow. Data filtering consists of three parts: web page denoising, Chinese word segmentation and link analysis. Most web pages are semi-structured and contain a large amount of format information, so the first step in analyzing the content of a web page is to denoise it and filter out script identifiers and useless information. The useful text in each page is then stored and analyzed: word segmentation, denoising and sorting are performed on the text, and the keywords of the web page are extracted. Based on the link relations between the web pages extracted in the web page capturing module, and using the idea of the PageRank sorting algorithm, a web page PR value calculated from the link relations is first obtained. Then a vector space model is used to calculate the similarity weight between the logistics-related information and the keywords of the related web pages, and the weight of the user's historical searches and search keywords is increased. Finally, the contribution values among the web pages with link relations are recalculated by the algorithm to obtain a ranking, which serves as an important reference for the logistics service. The method comprises the following steps:

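(1) analyzing the link relations among the web pages in the set $Set_{web}$ to be ranked, and determining the out-links and in-links of each web page;

(2) extracting keywords from the page content of each web page in $Set_{web}$ to generate the keyword set of the web page, $S_{web\_keywords} = \{V_1, V_2, V_3, \ldots, V_i\}$;

(3) calculating the similarity between the keywords of each web page in $Set_{web}$ and K to obtain the keyword correlation factor set W(u);

(4) finding, according to the user ID, the keyword list $S_{h\_web\_keywords}$ corresponding to the user, covering logistics, traffic, weather, geographical position and the like;

(5) calculating the similarity between the keywords of each web page in $Set_{web}$ and $S_{h\_web\_keywords}$ to obtain the influence factor H(u);

(6) for each web page there are three factors, and the comprehensive score of each web page is calculated according to the formula

$$GR = (1-d) + d\left[\sum PR(v)\left(\frac{\alpha}{N_v} + \beta \cdot W(u) + \gamma \cdot H(u)\right)\right]$$

to obtain the final web page ranking GR, wherein α, β and γ respectively represent the weights of the link, the topic relevance factor and the user factor in the PR value distribution. The three parameters are all greater than 0, and to ensure convergence of the algorithm their sum is equal to 1. The weight of each item represents the importance of the corresponding factor in the distribution process, and changing the values of the three parameters affects the quality of the ranking result.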
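The denoising and keyword similarity parts of data filtering can be sketched as below. This is an illustration under stated assumptions, not the patented method: Chinese word segmentation is replaced by a simple lowercase word split, the vector space model is reduced to cosine similarity over raw term counts, and the names `TextOnly`, `denoise`, `keywords` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and skips the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def denoise(html):
    """Return the page text with markup and script content removed."""
    parser = TextOnly()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())


def keywords(text):
    """Term counts; a stand-in for word segmentation and keyword extraction."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


topic = keywords("logistics traffic weather delivery delay")
page = denoise("<p>Port traffic causes delivery delay</p><script>var x = 1;</script>")
relevant = cosine(keywords(page), topic)
irrelevant = cosine(keywords("today's stock prices are very high"), topic)
```

In this sketch a message whose similarity to the logistics keyword list is zero, such as the stock price example, would be a candidate for removal by the filter.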

Data cleaning: detecting and correcting (or removing) corrupted or inaccurate records in a record set or table.
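A minimal sketch of such a cleaning pass is shown below. The record fields (`shipment_id`, `timestamp`, `status`) and the specific rules are assumptions chosen for illustration; the source does not specify which checks are applied.

```python
def clean_records(records, required=("shipment_id", "timestamp")):
    """Remove corrupted or duplicate records and correct simple formatting errors."""
    seen = set()
    cleaned = []
    for rec in records:
        # Removal: drop records whose required fields are missing or empty.
        if any(rec.get(field) in (None, "") for field in required):
            continue
        rec = dict(rec)  # work on a copy so the input is left untouched
        # Correction: normalise the free-text status field.
        if isinstance(rec.get("status"), str):
            rec["status"] = rec["status"].strip().lower()
        # Removal: keep only the first record per (shipment, timestamp) pair.
        key = (rec["shipment_id"], rec["timestamp"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


raw = [
    {"shipment_id": "S1", "timestamp": "2019-12-27T10:00", "status": " Loaded "},
    {"shipment_id": "S1", "timestamp": "2019-12-27T10:00", "status": "loaded"},
    {"shipment_id": "", "timestamp": "2019-12-27T11:00", "status": "lost"},
]
cleaned = clean_records(raw)
```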

Data integration: the data set composition is performed in two steps. In a first step, the data is converted from a source to a target serialized format; the second step is to merge the converted data.
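The two integration steps can be sketched as follows, assuming JSON as the target serialized format and two hypothetical source formats (a CSV export and `key=value` sensor lines). The function names and field names are illustrative only.

```python
import csv
import io
import json


def from_csv(text, source):
    """Step 1 for a CSV source: one dict per row, tagged with its origin."""
    return [dict(row, source=source) for row in csv.DictReader(io.StringIO(text))]


def from_sensor_lines(text, source):
    """Step 1 for 'key=value;key=value' sensor lines."""
    records = []
    for line in text.splitlines():
        if line.strip():
            fields = dict(pair.split("=", 1) for pair in line.split(";"))
            fields["source"] = source
            records.append(fields)
    return records


def integrate(*batches):
    """Step 2: merge the converted batches and serialize them as JSON."""
    merged = [rec for batch in batches for rec in batch]
    merged.sort(key=lambda rec: rec["timestamp"])
    return json.dumps(merged)


csv_text = "shipment_id,timestamp,status\nS1,2019-12-27T10:00,loaded\n"
sensor_text = "shipment_id=S1;timestamp=2019-12-27T09:00;temp=4.5"
combined = integrate(from_csv(csv_text, "lms"), from_sensor_lines(sensor_text, "sensor"))
```

Converting every source to one common record shape first keeps the merge step independent of how many source formats exist.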

Data storage: this step is intended to process the integrated data set and store the data in memory.

The data query module mainly comprises two parts: a query agent and a user interface. After system preprocessing, the data passed to the query module consist of two parts: the indexed web page library and the inverted file. The query agent receives the query phrase entered by the user through the user interface, segments the phrase, searches the indexed web page library and the inverted file, retrieves the documents containing the query terms, and returns them to the user as the result. During the query, once the query phrase has been segmented, a vector representation of the query is obtained, taking into account both the weight of the query terms in the inverted index and the position information of the terms. The similarity between the query and each web page document is calculated with a traditional information retrieval model; this is combined with the web page ranking obtained in the web page preprocessing stage to sort the web pages into a final ranking, and the corresponding web pages are then returned to the user in ranked order.
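A simplified sketch of the inverted file lookup is given below. It assumes whitespace tokenization instead of word segmentation, omits term weights and positions, and orders the hits only by a precomputed page rank score; the names `build_inverted_index` and `search` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query, page_rank):
    """Return ids of documents containing every query term, best rank first."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    # Higher rank first; ties are broken by document id for stable output.
    return sorted(hits, key=lambda doc_id: (-page_rank.get(doc_id, 0.0), doc_id))


docs = {
    "p1": "port strike delays cargo",
    "p2": "cargo weather delay",
    "p3": "stock prices",
}
index = build_inverted_index(docs)
rank = {"p1": 0.2, "p2": 0.7, "p3": 0.1}
results = search(index, "cargo", rank)
```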

The real-time data processing module is a core component. Logistics services use different shipping modes, including air, sea and land, and a single transportation mode cannot meet all transportation requirements. This is particularly true of overseas logistics, for example when products manufactured in China are shipped to customers in different cities abroad: the shipping process must be intermodal, meaning that it will involve trucks, trains, ships, aircraft and so on. The integrated multimodal logistics process is exposed to various challenges that result in delivery delays. For example, if customs clearance at a port is delayed, the cargo may be delayed even if all other modes of transportation keep to the predetermined schedule. Uncertain events such as natural disasters, wars and strikes may affect one or more delivery modes or further steps of the integrated logistics process, and their uncertainty is a major challenge. The present invention therefore analyzes data in real time to extract factors that may cause delivery delays, drawing on continuous data streams of events that may cause such delays. The real-time data processing module works on social media and sensor events, its access speed is one hundred thousand times that of a magnetic disk, and it is designed to add missing data information so that events can be handled in time. These events are first delivered to the data storage module via distributed messages. Such uncertain events are handled with priority by the real-time data processing module rather than being left to batch processing. The real-time data processing module performs event clustering in real time and obtains instant insight into the processed data. Classification is the process of grouping or segmenting data items so that items in the same cluster are similar to one another and dissimilar to items in other clusters. The invention is based on the classification concept, and an aggregated data set is generated from the objective function, which facilitates effective analysis when predicting delivery delays.
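As a rough illustration of grouping a stream of events into an aggregated data set, the sketch below buckets incoming events by route, event kind and time window and keeps a running count. The event fields (`ts`, `route`, `kind`), the window length and the class name `EventAggregator` are assumptions; the source does not describe the grouping key or the objective function in this detail.

```python
from collections import defaultdict


class EventAggregator:
    """Groups streamed events by (route, kind, time window) and counts them."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.counts = defaultdict(int)

    def add(self, event):
        """Assign one event to its group as soon as it arrives."""
        bucket = event["ts"] // self.window
        key = (event["route"], event["kind"], bucket)
        self.counts[key] += 1
        return key

    def events_on_route(self, route, bucket):
        """Total number of events seen for a route in one time window."""
        return sum(
            count
            for (r, _kind, b), count in self.counts.items()
            if r == route and b == bucket
        )


agg = EventAggregator(window_seconds=3600)
for ev in [
    {"ts": 10, "route": "SHA-LAX", "kind": "strike"},
    {"ts": 20, "route": "SHA-LAX", "kind": "storm"},
    {"ts": 4000, "route": "SHA-LAX", "kind": "storm"},
]:
    agg.add(ev)
```

A delay predictor could then read the per-route counts of the current window instead of re-scanning the raw stream.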

Let $X = \{X_1, X_2, \ldots, X_n\}$ denote a data set with n logistics objects, where $X_i = \{X_{i1}, X_{i2}, \ldots, X_{im}\}$ denotes the m attributes of the i-th object, so that the data set is represented as an n × m matrix. The data set is classified T times, and $R_i = \{R_{i1}, R_{i2}, \ldots, R_{iT}\}$ denotes the results for the i-th object under the T classifications, so that the base classification result is represented as an n × T matrix. The data information uses pairwise constraints, which describe the relationship between two data objects and comprise two kinds of relation: must-link information, reflecting that two data objects belong to the same class, is denoted M, and cannot-link information, reflecting that two data objects do not belong to the same class, is denoted C.

In the original data feature space, the original data are expressed as an n × n matrix D, where D(i, j) represents the similarity between object i and object j. Gaussian similarity is used to calculate a similarity matrix W, where δ is a hyper-parameter; the closer two points are, the greater the similarity between them. A diagonal matrix E is then calculated, whose diagonal elements are the sums of all elements in the corresponding row (column) of W, and normalization gives the final matrix $D = E^{-1/2} W E^{-1/2}$.

In the symbolic feature space formed by the base classifications, the base classifications are represented as an n × n matrix B, where B(i, j) represents the number of times that object i and object j are assigned to the same class under the T base classification results:

$$B(i,j) = \sum_{t=1}^{T} \delta(R_{it}, R_{jt}), \qquad \delta(R_{it}, R_{jt}) = \begin{cases} 1, & R_{it} = R_{jt} \\ 0, & R_{it} \neq R_{jt} \end{cases}$$
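The construction of D and B can be sketched in plain Python. The exact Gaussian similarity formula is not reproduced in the text, so the common form W(i, j) = exp(-||X_i - X_j||² / (2δ²)) is assumed here; B is computed as the count of base classifications in which two objects share a label, as the text describes. All function names and sample values are hypothetical.

```python
import math


def gaussian_similarity(X, delta):
    """W(i, j) = exp(-||X_i - X_j||^2 / (2 * delta^2)); an assumed common form."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            W[i][j] = math.exp(-sq_dist / (2 * delta ** 2))
    return W


def normalize(W):
    """D = E^(-1/2) W E^(-1/2), where E holds the row sums of W on its diagonal."""
    n = len(W)
    e = [sum(row) for row in W]
    return [[W[i][j] / math.sqrt(e[i] * e[j]) for j in range(n)] for i in range(n)]


def co_association(R):
    """B(i, j): number of base classifications that give objects i and j the same label."""
    n, T = len(R), len(R[0])
    return [
        [sum(1 for t in range(T) if R[i][t] == R[j][t]) for j in range(n)]
        for i in range(n)
    ]


X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]   # n = 3 objects, m = 2 attributes
R = [[0, 1, 0], [0, 1, 1], [1, 0, 1]]      # labels from T = 3 base classifications
D = normalize(gaussian_similarity(X, delta=1.0))
B = co_association(R)
```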

in the supervised information feature space, the pairwise constraints are represented as an n × n matrix S. The pair-wise constraints have symmetry and transitivity for a given same data set. Calculating the similarity between the object points according to the following formula to ensure the nonnegativity of the similarity matrix S,

In this way, after the n × n matrices D, B and S have been constructed in the three feature spaces of original data, base classification and supervision information, the three similarity matrices are linearly combined to construct a new matrix $L = w_1 D + w_2 B + w_3 S$, where $w_1$, $w_2$ and $w_3$ are the weights of the original data, the base classification and the supervision information, respectively. NMF classification is then performed on L to obtain a result, and in the final result matrix the row with the maximum NMI value is selected as the class label.
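The linear combination and the NMI comparison can be sketched as follows. The NMF factorization itself is omitted. The NMI normalization by the geometric mean of the two entropies is one common convention and is an assumption here, as is the idea of scoring a candidate labeling against a reference labeling; the source does not say what the NMI is measured against. The names `combine` and `nmi` are hypothetical.

```python
import math
from collections import Counter


def combine(D, B, S, w1, w2, w3):
    """L = w1 * D + w2 * B + w3 * S, element by element."""
    n = len(D)
    return [
        [w1 * D[i][j] + w2 * B[i][j] + w3 * S[i][j] for j in range(n)]
        for i in range(n)
    ]


def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts.values())


def nmi(a, b):
    """Normalized mutual information between two labelings of the same objects."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        c / n * math.log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items()
    )
    ha, hb = entropy(ca, n), entropy(cb, n)
    if ha == 0 or hb == 0:
        # Degenerate case: at least one labeling puts everything in one class.
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)


L = combine(
    [[1, 0], [0, 1]], [[2, 2], [2, 2]], [[0, 1], [1, 0]], w1=0.5, w2=0.25, w3=0.25
)
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to label names
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # no shared information
```

NMI is invariant to renaming the class labels, which is why it is suitable for comparing labelings produced by different classification runs.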

The information optimization module optimizes logistics routes according to the NMI values, buyer information, seller information and transportation information (such as flights and train numbers). It is a publish-subscribe information system and a fast, highly scalable distributed messaging module, used for constructing a collection system with high-throughput persistent data and reliable delivery. The information on logistics routes is organized into topics, and each topic is divided into one or more linear, ordered message sequences, in which each message is identified by its index in the sequence. The information optimization module transmits the optimized route to the logistics distribution module over a wireless/wired data link, thereby achieving data interaction.
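The topic structure described above, one or more ordered message sequences in which each message is identified by its index, can be illustrated with a small in-memory sketch. It is not a distributed or persistent system and does not represent any particular messaging product; the class `Topic`, the key-based choice of sequence and the sample messages are assumptions.

```python
import zlib


class Topic:
    """A topic split into ordered, append-only message sequences."""

    def __init__(self, name, partitions=1):
        self.name = name
        self.partitions = [[] for _ in range(partitions)]

    def publish(self, key, message):
        """Append a message; returns (sequence number, index within the sequence)."""
        # The same key always maps to the same sequence, so its messages stay ordered.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1

    def read(self, partition, index):
        """Return all messages of one sequence starting at the given index."""
        return self.partitions[partition][index:]


routes = Topic("optimized-routes", partitions=2)
p0, i0 = routes.publish("S1", "Shenyang -> Dalian by rail")
p1, i1 = routes.publish("S1", "Dalian -> Busan by sea")
pending = routes.read(p0, i0)
```

Because a subscriber only needs to remember the last index it has read, it can resume from that point after an interruption.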

The logistics distribution module comprises a GPS module and a displacement sensor, the position of the goods is monitored in real time through the combination of the GPS module and the displacement sensor, and logistics conveying is adjusted in time according to interactive data in real time, so that informatization and standardization of logistics distribution products are achieved.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the scope of the technical solution of the present invention.

Claims

1. A computer Internet of things data processing system is characterized by comprising a data acquisition module, a data processing module, a data storage module, an information optimization module and a logistics distribution module, the data processing module comprises batch data processing equipment and a real-time data processing module, the batch data processing equipment is used for reading/extracting stored data and preparing data, the batch data processing equipment comprises a data preparing stage and a data processing stage, the data preparation phase comprises data extraction, data cleaning, data filtering, data integration and data storage, the data processing stage classifies and processes the prepared sufficient data, the batch data processing equipment directly sends the data to the real-time data processing module through a wireless/wired network, the information optimization module optimizes logistics lines of logistics, and the optimized lines are transmitted to the logistics distribution module through wireless/wired data.

2. The computer internet-of-things data processing system as claimed in claim 1, wherein the data extractor fetches linked web pages from a specific website from the cloud server and extracts links from the crawled web pages, the extracted link data information is stored in the data storage module respectively, and meanwhile, the data extractor comprises a web page preprocessing module and a query module, the web page preprocessing module analyzes the crawled web pages, establishes indexes, and calculates the grades of the web pages; the query module provides a user search interface, a user inputs search words, and returns query results to the user according to the query of the user, the data filtering is to remove noise from web pages, filter out some script identifiers and useless information, store useful texts in each web page, perform word segmentation, noise removal and sorting, extract keywords of the web pages, and acquire a web page PR value calculated based on the link relation of the web pages according to the link relation among the web pages extracted in the web page capturing module and the idea of a PageRank sorting algorithm; and then, calculating similarity weight of the logistics related information and related webpage keywords by using a space vector model, increasing the weight of historical search and search keywords of a user, finally recalculating contribution values among webpages with link relations through an algorithm, and obtaining a rank ranking, wherein the contribution values are used as important reference basis of logistics service.

3. The computer internet of things data processing system of claim 2, wherein the data filtering comprises the steps of:

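(6) for each web page there are three factors, and the comprehensive score of each web page is calculated according to the formula $GR = (1-d) + d\left[\sum PR(v)\left(\frac{\alpha}{N_v} + \beta \cdot W(u) + \gamma \cdot H(u)\right)\right]$；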

4. A computer internet of things data processing system as claimed in any one of claims 1 to 3, wherein data extraction includes collecting various sources of structured and unstructured data information to obtain a complete and accurate description of a region of interest and to normalize the multi-source heterogeneous data.

5. A computer internet of things data processing system according to any of claims 1 to 4, wherein the crawling of web pages is done using the Heritrix open source crawler, and on its existing open source code the user can extend its components to implement custom crawling logic and obtain the required resources from the network.

6. The computer internet of things data processing system of any one of claims 1-4, wherein the data acquisition module acquires multi-source heterogeneous data, the multi-source heterogeneous data comprising information of data sensors and information of logistics applications, the data sensors comprising vehicle sensors, weather sensors; the logistics application comprises microblog and social media.

7. A computer Internet of things data processing system as claimed in any one of claims 1 to 4, wherein data cleaning is the detection and correction or removal of corrupted or inaccurate records in record sets and tables.

8. A computer internet of things data processing system as claimed in claim 1, wherein the real-time data processing module groups or segments data items, generates an aggregate data set from the objective function, and performs an efficient analysis in predicting delivery delays.

9. The computer internet of things data processing system of claim 1, wherein the information optimization module is configured to construct a collection system with high-throughput persistent data and reliable delivery, and to organize the logistics route information into topics divided into one or more linear, ordered sequences of messages, wherein each message is identified by its index in the sequence.