CN116701396A - Data indexing method based on homogeneous map structure FeatureRDD model - Google Patents

Info

Publication number
CN116701396A
Authority
CN
China
Prior art keywords
data
database
input
prediction
relation set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310688288.5A
Other languages
Chinese (zh)
Other versions
CN116701396B (en)
Inventor
卫炜
邢雪
郭琳
赵春梅
刘宇航
项程程
李晓辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Development Center Of Ministry Of Agriculture And Rural Areas
Original Assignee
Big Data Development Center Of Ministry Of Agriculture And Rural Areas
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Development Center Of Ministry Of Agriculture And Rural Areas filed Critical Big Data Development Center Of Ministry Of Agriculture And Rural Areas
Priority to CN202310688288.5A priority Critical patent/CN116701396B/en
Publication of CN116701396A publication Critical patent/CN116701396A/en
Application granted granted Critical
Publication of CN116701396B publication Critical patent/CN116701396B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a data indexing method based on a FeatureRDD model with a homogeneous graph structure. The method comprises: establishing at least one of an atmospheric environment database, a regionally partitioned human geography database, and a ground natural geographic database; constructing attributed homogeneous information network models within and between these databases; establishing a real-time input history database and/or an access history database for the databases; constructing a FeatureRDD model comprising a filtering and analysis module and a modeling module; constructing an index prediction model based on the input and access histories of the homogeneous graph, thereby realizing a local dynamic spatio-temporal index; and selecting the input data and/or the data types required for access and displaying the required database items according to the model constructed in step S5, finally realizing parallel computation and dynamic indexing of massive vector elements.

Description

Data indexing method based on homogeneous map structure FeatureRDD model
Technical Field
The invention relates to a parallel computing method for massive vector elements, in particular to a data indexing method based on a FeatureRDD model with a homogeneous graph structure, and belongs to the field of spatial data structures and computing.
Background
The FeatureRDD model extends Spark's RDD model to support distributed spatial computing. It serves as the input and output interface for data processing and analysis and as the read/write interface for various data engines, giving data time-stamped geographic positions and time-stamped geographic attributes and enabling efficient spatio-temporal organization and parallel computation of geographic data.
However, the FeatureRDD model is only a storage scheme within a data architecture: it organizes the basic spatio-temporal attributes and the geographic attributes (longitude and latitude) of the data through ID coding. The prior art repartitions the distributed dataset model, indexes the data with quadtree and binary-tree indexes, and uses grid aggregation to build the logical architecture of the data nodes and to analyze and compute the logical relationships among them.
Data repartitioning and quadtree and binary-tree indexing inevitably add computational complexity. Moreover, because the massive history of data access is ignored, the final parallel computation still relies on a passive, static index built on the data structure alone. Stepping outside the mindset of the mass-data storage architecture and asking how the data access history can be used for dynamic indexing therefore offers a new way to improve both indexing efficiency and the efficiency of the final parallel computation.
The rural land contractual management right is the right of members of a rural collective economic organization, or of units or individuals outside it, to possess, use, and benefit from contracted rural land that is collectively owned by farmers, or owned by the state and used collectively by farmers. Through the registration and certification of rural land contractual management rights, the state has ascertained the spatial position, area, four boundaries, and other information of each contracted plot, confirmed the land contractual management rights of each contracting household, and established a national rural land contract information database. The database centrally stores geospatial information on more than 1.1 billion contracted plots, basic information on nearly 200 million contracting households and their family members, and contracts, registers, certificates, and related information. How to keep improving data operation efficiency on top of a rural land contract information database and management system that integrates imagery, graphics, and rights, and thereby manage rural land contractual management rights information efficiently, is a problem that must be faced over the long term.
In earlier work, the data were divided into three classes (sky, ground, and people), and the data access history was predicted on the basis of a heterogeneous graph structure to index the data efficiently. In real applications, however, what matters is not the direction of the access path but which data contents users access within a given period. Converting the heterogeneous graph structure into a homogeneous graph structure loses the path-direction details of the access history, but it can further improve the efficiency of dynamic indexing.
Disclosure of Invention
Based on the above problems and considerations, the main work of the invention has two aspects. First, the input spatio-temporal big data, or the data to be accessed, is divided into three parts (sky, ground, and people); after passing through the FeatureRDD model the data is organized spatio-temporally, and the input and access histories are recorded in parallel. Second, a data prediction model is built from the data input and access histories of the homogeneous graph to realize efficient, dynamic indexing and parallel computation. A reasonable and efficient distributed spatial index is thus established on a parallel computing architecture, enabling parallel processing of massive spatial data and greatly improving the efficiency of analyzing, processing, and rendering it.
Therefore, the invention provides a data indexing method based on a FeatureRDD model with a homogeneous graph structure, which comprises the following steps:
S1. Establish at least one of an atmospheric environment database, a regionally partitioned human geography database, and a ground natural geographic database. The atmospheric environment data includes, but is not limited to, meteorological data A, illumination data B, environmental noise and management data C, and production and domestic gas emission and management data D. The regionally partitioned human geography data includes, but is not limited to, basic population information data E, agricultural production management data F, industrial and enterprise production management data G, tertiary-industry production management data H, information network data I, social medical data J, social financial data K, social property and insurance data L, social education data M, judicial and administrative management data N, and ground road and traffic management data O. The ground natural geographic data includes, but is not limited to, geological and geological activity data P, hydrological data Q, surface vegetation data R, and ground construction and artificial landscape data S. In this description the three libraries together hold nineteen types of data sets; in practice, types can be added or removed as needed, so the invention is not limited to these data sets.
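The grouping of the nineteen category letters into three libraries can be written down as a small lookup table. The following is only an illustrative sketch: the letters A to S come from step S1, while the dictionary name, the library keys, and the helper function are hypothetical.

```python
# Category letters follow step S1; all Python names here are hypothetical.
LIBRARIES = {
    "atmospheric_environment": list("ABCD"),
    "regional_human_geography": list("EFGHIJKLMNO"),
    "ground_natural_geography": list("PQRS"),
}

# Flat list of every category letter, in library order.
ALL_CATEGORIES = [c for letters in LIBRARIES.values() for c in letters]


def library_of(category: str) -> str:
    """Return the name of the library that holds the given category letter."""
    for name, letters in LIBRARIES.items():
        if category in letters:
            return name
    raise KeyError(category)
```

Adding or removing a data type, as S1 allows, would amount to editing one of the three lists.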
S2. Construct attributed homogeneous information network models within and between the atmospheric environment database, the regionally partitioned human geography database, and the ground natural geographic database to form a homogeneous graph $G = \{G_1; G_2; G_3; G_4\} = \{V_1, E_1, X_1; V_2, E_2, X_2; V_3, E_3, X_3; V_4, E_4, X_4\}$. Here $V_1$, $V_2$, $V_3$ are the data sets in the atmospheric environment database, the regionally partitioned human geography database, and the ground natural geographic database respectively; $E_1$, $E_2$, $E_3$ are the relation sets between the data sets within each of those databases; and $X_1$, $X_2$, $X_3$ are the corresponding information matrices. Together these reflect the homogeneous graph structure inside each library. $V_4$, $E_4$, $X_4$ are the set of databases formed by the three databases, the corresponding database relation set, and the database information matrix, and reflect the homogeneous graph structure between the databases. Each of $E_1$, $E_2$, $E_3$, $E_4$ contains the relation sets of the first predetermined time periods, divided into a single-element relation set $A_e$ and a multi-element relation set $A_m$. Both are arranged in the time order of the successive first predetermined time periods, forming the time-containing single-element relation set $A_e(t)$ and multi-element relation set $A_m(t)$ at the current time $t$. The information matrix contains the attribute information of the various data. Within any first predetermined time period, the elements of the current relation set are sub-graphs of the homogeneous graph, and sub-graphs with the same number of data classes are collected in the same relation set.
Preferably, the first predetermined time period is 1 ms to 0.01 s.
If the data input or accessed within a first predetermined time period of 1 ms to 0.01 s contains only one of the nineteen data types, the relationship formed by the input and access operations on that type is a single-element relationship, and the time-ordered arrangement of single-element relationships over multiple first predetermined time periods is the single-element relation set. For example, if the server finds that the data input and/or accessed within one 1 ms to 0.01 s period is of type A, the current single-element relation set is {A}. If the data input and/or accessed in the next 1 ms to 0.01 s period (not necessarily contiguous in time with the previous one) is of type G, the single-element relation set over these two first predetermined time periods becomes {A, G}. In this way the single-element relation set $A_e(t)$ at any time is formed.
A multi-element relation set is formed in the same way. For example, when the input or accessed data contains several of the nineteen data types, say A and G, the current multi-element relation set is {(AG)}. If the data input and/or accessed in the next 1 ms to 0.01 s period (not necessarily contiguous in time with the previous one) is of types G and J, the current multi-element relation set becomes {(AG), (GJ)}, and so on, forming the multi-element relation set $A_m(t)$ at any time. Because the graph is homogeneous, exchanging the positions of elements within the current relation set of any predetermined time period, or of data types within an element, yields the same sub-graph: for example, {(AG)} = {(GA)} and {(AG), (GJ)} = {(GJ), (AG)} = {(GA), (GJ)}.
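The order-independence of elements described above can be modelled with unordered sets. The sketch below is a minimal illustration, not the patented implementation: each time period is represented by the string of category letters seen in it, and the function and variable names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

# One element (sub-graph): the unordered set of category letters in one period.
Element = FrozenSet[str]


def make_element(categories: Iterable[str]) -> Element:
    """Build an order-independent element, so that (AG) equals (GA)."""
    return frozenset(categories)


def split_history(periods: Iterable[Iterable[str]]) -> Tuple[List[Element], List[Element]]:
    """Split per-period observations into time-ordered single- and multi-element relation sets."""
    single: List[Element] = []
    multi: List[Element] = []
    for categories in periods:
        element = make_element(categories)
        if not element:
            continue  # nothing was input or accessed in this period
        (single if len(element) == 1 else multi).append(element)
    return single, multi


# Four periods: A alone, A with G, G alone, G with J.
single_set, multi_set = split_history(["A", "AG", "G", "GJ"])
```

The lists keep the time order needed later for the prediction graphs, while comparing them as sets reproduces the equalities {(AG), (GJ)} = {(GJ), (AG)} = {(GA), (GJ)}.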
Whenever a user accesses a database data set for some purpose, such as inputting, modifying, auditing, or querying, access history data is left behind and multiple distinct homogeneous sub-graphs can be generated. The traditional approach of spatially indexing the massive data itself is thereby replaced by a new approach that uses the search history to organize the data dynamically in space and time. This approach leaves the existing data architecture unchanged, reduces the cost of data indexing and call analysis, and improves efficiency.
An input history is likewise formed when user data requiring parallel computation is entered into the FeatureRDD model.
S3. Establish a real-time input history database and/or an access history database for the databases. The access history database records the real-time input and modification history, the data auditing history, and the data query history, which form an input and modification history database, a data auditing history database, and a data query history database respectively. Each history database is divided into intra-library and inter-library history databases so as to map the data sets and relation sets of the homogeneous graphs $G_1$; $G_2$; $G_3$; $G_4$ of step S2, finally forming a dynamic homogeneous graph that varies with time. As the times $t_1$ to $t_4$ advance, the input history data in the real-time input history database and the access history data in the access history database are continuously updated, where $E_1(t_1) = \{A_{e1}(t_1), A_{m1}(t_1)\}$, $E_2(t_2) = \{A_{e2}(t_2), A_{m2}(t_2)\}$, $E_3(t_3) = \{A_{e3}(t_3), A_{m3}(t_3)\}$, $E_4(t_4) = \{A_{e4}(t_4), A_{m4}(t_4)\}$.
S4. Construct a FeatureRDD model comprising a filtering and analysis module and a modeling module, and divide all historically input data into meta information and attribute information. The meta information stores the spatio-temporal coordinates of the input or accessed data together with the time-containing single-element relation set $A_{ei}(t)$ and the time-containing multi-element relation set $A_{mi}(t)$, $i = 1, 2, 3, 4$. The attribute information is divided into database modules according to data category (for example, nineteen categories of data give nineteen database modules); each module stores the time-containing data sets of one category (the nodes of the homogeneous graph), the time-containing information matrix, and the input history database and/or access history database of the homogeneous graph. It will be appreciated that:
That is, whether the elements in a set are all identical or only partly identical, a set in which each element is a single data type is a single-element relation set, and a set in which each element comprises several data types is a multi-element relation set.
S5. Construct an index prediction model based on the input and access history data of the homogeneous graph, realizing a local dynamic spatio-temporal index.
S6. Select the input data and/or the data types required for access, and display the required database items according to the model constructed in S5.
Step S5 specifically includes:
S5-1. Input data into the FeatureRDD model: enter the input data into the corresponding attribute information according to the data classification of S1, enter the meta information, and update the input history database and/or access history database. Since S1 describes nineteen types of data, S5-1 enters the corresponding attribute information by category for those nineteen types;
S5-2. Construct the relation-set prediction graphs;
S5-3. Establish a CNN-LSTM prediction model based on the relation-set prediction graphs.
Wherein S5-2 specifically comprises:
S5-2-1. Use the data in the input history database and/or access history database to update the time-containing single-element relation set $A_{ei}(t)$ and the time-containing multi-element relation set $A_{mi}(t)$, and compute in real time the input probability and/or access probability of each relation set, where $c \in \{A, B, \dots, S\}$ denotes the database type. For example, when the input probability is selected and the input data belongs to c = A, the intra-library records of the input and modification history database within the input history database are updated, and the input probability of A is the ratio of the number of occurrences of A in the single-element relation set to the total number of elements in the single-element relation set $A_{ei}(t)$. Likewise, if the input data belongs to c = (AG), its input probability is the ratio of the number of occurrences of (AG) in the double-element relation set to the total number of elements in that set, and so on.
When the access probability is selected and the accessed data belongs to A, the intra-library records of the input and modification history database within the access history database are updated, and the access probability of A is the ratio of the number of occurrences of A in the single-element relation set to the total number of elements in $A_{ei}(t)$. The multi-element relation sets are analyzed in the same way.
Similarly, when both the input probability and the access probability are selected, the intra-library records of the input history database and of the modification history database are updated simultaneously, and the probabilities are obtained by the same analysis.
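The probabilities described in S5-2-1 are plain relative frequencies within one relation set. A minimal sketch, assuming the relation set is held as a Python list of hashable elements (the function name and the sample data are hypothetical):

```python
from collections import Counter


def relation_probabilities(relation_set):
    """Relative frequency of each element within one relation set."""
    counts = Counter(relation_set)
    total = len(relation_set)
    return {element: n / total for element, n in counts.items()}


# Single-element relation set: A occurs twice among four elements.
single_probs = relation_probabilities(["A", "G", "A", "J"])

# Double-element relation set: (AG) and (GA) count as the same element.
double_set = [frozenset("AG"), frozenset("GA"), frozenset("GJ")]
double_probs = relation_probabilities(double_set)
```

The same function serves for input and access probabilities; only the history from which the relation set is built differs.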
S5-2-2. Define a mapping $c \to pix_c$ between the nineteen database classes c and image pixel values $pix_c$. The time-containing single-element relation set $A_{ei}(t)$ and multi-element relation set $A_{mi}(t)$ are then mapped to a time-containing single-element pixel-value relation set and a time-containing multi-element pixel-value relation set respectively. In the latter, the plus sign in $+pix_c$ denotes summing the pixel values of the databases that make up an element to give one pixel value representing that element (for example, the pixel value of the element (AG) in the double-element relation set is the sum of the pixel values of A and G). Within a second predetermined time period, the pixel values of the representative elements in the single-element and multi-element pixel-value relation sets are extracted. For each relation set of S2, the pixel values of the representative elements are arranged in the order of the first time periods to form K kinds of prediction graphs of identical pixel dimensions, where K is the number of pixel-value relation sets whose current number of elements exceeds a predetermined value, and each kind of prediction graph corresponds to one relation set. As data from first predetermined time periods accumulates over time, the number of prediction graphs of each kind grows. It is stipulated that once one prediction graph has been formed, each further prediction graph of the same kind starts with the representative element that appears immediately after the representative elements of the previous prediction graph and uses the same pixel arrangement as the previous prediction graph.
Preferably, the image pixel value $pix_c$ is at least one of a gray value and a color RGB value, or a combination thereof, and differs for different c; more preferably, the pixel values of two adjacent databases (in the order of c) differ by 5 to 11.
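The mapping from database class to pixel value, and the summation for multi-element elements, can be sketched as follows. The constant spacing of 10 is an assumption chosen from inside the preferred 5 to 11 range, scalar gray values are assumed rather than RGB, and all names are hypothetical.

```python
import string

CATEGORIES = string.ascii_uppercase[:19]  # the letters 'A' through 'S'
STEP = 10  # assumed spacing between adjacent classes, within the 5-11 range

# Assumed mapping c -> pix_c: A -> 10, B -> 20, ..., S -> 190.
PIX = {c: (i + 1) * STEP for i, c in enumerate(CATEGORIES)}


def element_pixel(element) -> int:
    """Pixel value of an element: the sum of pix_c over its data types."""
    return sum(PIX[c] for c in set(element))
```

With these assumed values, the double element (AG) maps to 10 + 70 = 80 regardless of the order in which A and G are written.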
Preferably, the predetermined value is $5 \times 10^8$.
Preferably, the duration of the second predetermined time period is an integer multiple of the first time period; more preferably, the integer is 224×224 to 500×500, that is, the prediction graph of $A_{epi}(t)$ or $A_{mpi}(t)$ has dimensions of 224×224 to 500×500 pixels. Each prediction graph thus corresponds to multiple input probabilities and/or access probabilities generated for the different c, where j is a control index whose value depends on whether the elements of the square matrix take up all the pixel values formed by the elements of the current corresponding relation set. j = 0 is specified when the matrix takes exactly as many values as the set contains (which generally occurs only when the pixel-value relation set first begins to be generated; see Example 4); otherwise j = 1, meaning the matrix takes fewer values than the set contains. Where a pixel value is absent, the corresponding element of the square matrix is a zero pixel. Once the size of the square matrix has been chosen, the sizes of all kinds of prediction graphs formed afterwards no longer change. At most one case with j = 0 can therefore occur for each kind of prediction graph; in all other cases the pixel-value relation set grows over time while the matrix size stays fixed, so not all pixel values in the set can be taken. j is thus essentially a quantity determined by the size of the square matrix, and because image sizes must be uniform to train the CNN, j = 1 is the general case as time passes. To fill the square matrix with pixel values as far as possible and reduce zero pixels, it is generally necessary to wait long enough; in practice this condition is met easily and quickly because of the massive number of frequent service requests.
Preferably, when the elements in the square matrix take the pixel values formed by all the elements of the current corresponding relation set, j = 0; otherwise, j = 1.
When c = A, the input probability and/or access probability are those of element A in the prediction graph obtained from the single-element relation set; when c = (AG) (also abbreviated c = AG), they are those of element (AG) in the prediction graph obtained from the double-element relation set. Because single-element input or access produces more elements in its relation set than the double-element relation set contains, the prediction graph formed from the double-element relation set necessarily contains zero pixel values. In the other case, for a given relation set (for example a single-element relation set), once all of its elements have been arranged, the remaining entries of the square matrix are filled with zero pixels.
It should be emphasized that for a single-element pixel-value relation set, when A is selected as the representative element, its pixel value is placed at the first pixel position of the prediction graph; for a square prediction graph, for example, the first position is the upper right corner. The pixel values of the other single elements, such as G or J, are then arranged in order. When further prediction graphs of the same kind are arranged, still taking A as the representative element, the A that appears immediately after the representative elements of the previous prediction graph becomes the representative element of the next one: it is placed at the upper right corner of the next square matrix, and the other single elements are arranged after it in sequence to form the next single-element prediction graph.
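The arrangement of a time-ordered pixel-value sequence into fixed-size prediction graphs, with zero pixels where values are missing, can be sketched as below. This is an illustration under assumptions: the fill order inside each matrix is taken to be row by row from the first cell, which may differ from the corner convention stated above, and the function name is hypothetical.

```python
def build_prediction_graphs(pixels, size):
    """Cut a time-ordered pixel sequence into size x size matrices, zero-padding the last one."""
    cells = size * size
    graphs = []
    for start in range(0, len(pixels), cells):
        chunk = list(pixels[start:start + cells])
        chunk += [0] * (cells - len(chunk))  # zero pixels where no value exists
        # Assumed fill order: row by row, starting at the first cell.
        graphs.append([chunk[r * size:(r + 1) * size] for r in range(size)])
    return graphs


# Five pixel values and 2 x 2 graphs: the second graph is mostly zero pixels.
graphs = build_prediction_graphs([10, 70, 10, 100, 70], size=2)
```

Each graph begins with the value that immediately follows the last value of the previous graph, which matches the stipulated way of forming further graphs of the same kind.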
For a single-element and/or multi-element relation set of a given kind, the number of elements must exceed the predetermined value for the set to be counted in K. For example, with the longest first predetermined time period of 0.01 s and 224×224 prediction graphs, if each prediction-graph training set is to contain ten thousand prediction graphs and K is 3, the number of single-element pixel values must exceed $5 \times 10^8$; only relation sets whose element counts meet this requirement are counted in K. A data volume that meets the training requirement can then be accumulated in less than half a year; if K is 9, a data volume that meets the training requirement can likewise be accumulated in less than half a year. In practice, each access typically selects only a small number of data sets of interest. The first predetermined time period is therefore chosen to be short, and overall, meeting the CNN training requirement takes an average data accumulation time of less than 1 year.
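The predetermined value of roughly $5 \times 10^8$ follows from the figures in the example. The short calculation below reproduces it for a single relation set, under the assumption (not stated in the source) that one element arrives in every 0.01 s period without gaps, so the resulting duration is a lower bound; the constant names are hypothetical.

```python
SIDE = 224                       # prediction graph is SIDE x SIDE pixels
GRAPHS_PER_TRAINING_SET = 10_000
PERIOD_SECONDS = 0.01            # longest first predetermined time period

# Elements needed to fill one training set of prediction graphs.
elements_needed = SIDE * SIDE * GRAPHS_PER_TRAINING_SET

# Assumption: one element per period, periods back to back.
seconds_needed = elements_needed * PERIOD_SECONDS
days_needed = seconds_needed / 86_400

print(elements_needed)        # 501760000, about 5 x 10^8
print(round(days_needed, 1))  # 58.1
```

Since the periods need not be contiguous and not every period contributes to every relation set, real accumulation times would be longer than this bound.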
It will be appreciated that the K kinds of prediction graphs correspond to K relation sets whose element compositions differ from one another; that is, each of them is one of the relation sets that collect sub-graphs with the same number of data classes.
In theory, the total number of kinds of prediction graphs can be very large, and predictive modeling would be time-consuming if all of them were considered. Therefore only relation sets with enough elements for modeling are selected to predict the required index; relation sets with too few elements to learn accurate predictions are not used. For example, if the four-element relation set currently has only 10 elements, it clearly cannot be used to model the probability of the four-element dynamic index. This selection also reduces modeling time. K is determined as needed.
For example, a single-element relation set of more than $5 \times 10^8$ elements formed with A as the representative element is selected as one of the K kinds of prediction graphs for modeling. Over time, more kinds can be modeled and the predicted index information becomes richer, forming a dynamic index. The predetermined value can also be adjusted according to the required prediction accuracy and the network structure of the prediction model. As will be seen below, when the number of elements in the four-element relation set does not reach the predetermined value, the LSTM can predict at most the three-element index.
S5-3 specifically comprises:
S5-3-1. Construct a CNN model that takes the prediction graph of each relation set as input and outputs the predicted probability value of each relation set;
S5-3-2. Establish a CNN-LSTM joint model that takes the fully connected feature vector predicted for the single-element relation set as input and outputs the input probability and/or access probability of the multi-element relation sets.
Wherein step S5-3-1 comprises:
S5-3-1-1. Through step S5-2, form the prediction graphs of the current relation sets in order (single-element, double-element, and higher), and divide them into a training set, a validation set, and a test set in a data ratio of (5-1):1:(3-1);
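The division into training, validation, and test sets can be sketched with simple slicing. The 5:1:3 ratio used here is one choice inside the stated (5-1):1:(3-1) range; keeping the graphs in their original order without shuffling is an assumption, and the function name is hypothetical.

```python
def split_by_ratio(items, train=5, val=1, test=3):
    """Split a sequence into training, validation, and test parts by the given ratio."""
    total = train + val + test
    n = len(items)
    n_train = n * train // total
    n_val = n * val // total
    # The test part takes the remainder so that no item is dropped.
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


graph_ids = list(range(90))  # stand-ins for 90 prediction graphs
train_set, val_set, test_set = split_by_ratio(graph_ids)
```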
S5-3-1-2. Input the training set of prediction graphs of the current relation set into the CNN network. The first fully connected layer forms the input vectors $P_1, P_2, \dots, P_K$, which are passed to the first softmax regression layer to classify and obtain the prediction probabilities $p_1, p_2, \dots, p_K$. The cross-entropy loss $L_{coss}$ is computed on the validation set and back-propagated to adjust the CNN network parameters; training stops when the accuracy acc is stable and $L_{coss}$ is at its minimum. After training, the CNN maps input vectors of the form $P_1, P_2, \dots, P_K$ to the prediction probabilities $p_1, p_2, \dots, p_K$, which are the probabilities that the input prediction graph belongs to the single-element or multi-element relation set with index 1, 2, …, K respectively.
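The softmax regression layer and the cross-entropy loss named in this step are standard operations; the sketch below shows them on plain Python lists for K = 3. It illustrates only these two functions, not the CNN itself, and the score values are made up for the example.

```python
import math


def softmax(scores):
    """Turn K raw scores into K probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy loss for one sample whose true class is true_index."""
    return -math.log(probs[true_index])


# Hypothetical scores for K = 3 relation sets; the true class is index 0.
probs = softmax([2.0, 0.5, 0.1])
loss = cross_entropy(probs, 0)
```

The loss shrinks toward zero as the probability assigned to the true relation set approaches 1, which is the quantity training drives down.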
Wherein step S5-3-2 comprises:
S5-3-2-1. Construct an LSTM network in which each level of hidden layer serves as the prediction unit for the elements (that is, sub-graphs) of one multi-element relation set, with the element count increasing from double-element upward from level to level. Vectors may be passed between, or jump across, hidden-layer levels. The output of each hidden layer is connected to a second fully connected layer, and a second softmax regression layer predicts the group of m most probable sub-graphs in the corresponding multi-element relation set, where m is no more than half the number of sub-graphs in that relation set.
For example, when the double-element relation set {(AG), (GA), (JG), (GJ), (AS), (AN), (SA), (NA)} contains the four kinds of sub-graph AG, GJ, AS, and AN (the other four are considered the same classes after their data types are exchanged, because the homogeneous graph relationship is undirected), m may be 2 or 1.
The output of the first hidden layer thus corresponds to the probabilities of the m most probable elements of the double-element relation set, as classified by the second softmax regression layer. For example, when A is input at the first input of the LSTM, the first output predicts over the double-element relation set through the second fully connected layer and the second softmax regression layer. If the double-element sub-graphs AG and AS are the two most probable predictions, index prediction is performed for AG and AS, realizing a dynamic index: this amounts to guessing which data types other than A the user is likely to input or access, so that index marks for that data and a display selection of database items can be provided. The predicted AG and AS are the results at the first output; they belong to the first group of prediction results and are given sequence number 1.
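Selecting the m most probable sub-graphs, with m limited to half the number of distinct sub-graphs, can be sketched as follows. The probability values are invented for illustration and the function name is hypothetical; in the method they would come from the second softmax regression layer.

```python
def top_m_subgraphs(probabilities, m):
    """Return the m most probable sub-graphs; m may not exceed half of all sub-graphs."""
    limit = len(probabilities) // 2
    if not 1 <= m <= limit:
        raise ValueError(f"m must be between 1 and {limit}")
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:m]


# Hypothetical probabilities for the four distinct double-element sub-graphs.
double_probs = {"AG": 0.4, "AS": 0.3, "GJ": 0.2, "AN": 0.1}
first_group = top_m_subgraphs(double_probs, 2)
```

With four distinct sub-graphs the limit is 2, so m = 2 or m = 1 is accepted and m = 3 is rejected, in line with the example above.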
In some cases the number of elements in the dual-element set does not reach a predetermined value, so that set is not counted among the K kinds and does not take part in LSTM training, while the number of elements in the triple-element set does reach it. The vector generated by the first hidden layer then skips directly to the third hidden layer; that is, the triple-element index of a subgraph formed by three data types is predicted directly from a single data type. Skips between hidden layers are therefore allowed so that the network can adapt to different sets of K kinds of prediction graphs.
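The skip rule can be sketched in Python as a choice of which hidden layers take part in prediction. This is a minimal illustration under stated assumptions: layers are numbered from 0 for the zero-vector start layer (so the skip from the first to the third layer described above appears as 0 → 2), a relation set takes part when its element count is at least the threshold, and the function names are hypothetical.

```python
def active_layers(element_counts, threshold):
    """Return the hidden-layer indices that take part in prediction.

    element_counts[k] is the number of elements in the relation set handled
    by hidden layer k (1 = dual-element, 2 = triple-element, ...).
    Layer 0 is the start layer and is always active.
    """
    return [0] + [k for k, n in sorted(element_counts.items()) if n >= threshold]


def transfer_path(element_counts, threshold):
    """Pairs (source layer, target layer) along which the hidden vector moves."""
    layers = active_layers(element_counts, threshold)
    return list(zip(layers, layers[1:]))


# The dual-element set is too small, so the vector skips layer 1.
counts = {1: 10, 2: 2_000_000_000}
print(transfer_path(counts, threshold=1_500_000_000))
```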
S5-3-2-2: Activate the zeroth hidden layer with a zero vector as input. The zero vector fed to the zeroth input end initializes the first output end, which is assigned [symbol] and/or [symbol]. The first input vector output by the first fully connected layer, denoted [symbol] and/or [symbol], is fed to the corresponding input end of the LSTM, and the vector generated by the zeroth hidden layer is transferred or skipped to the hidden layer fed by that input end. The probabilities of the m subgraphs of the first group (sequence number 1), [symbol] and/or [symbol], are predicted according to S5-3-2-1. Each probability is multiplied by the corresponding input vector output by the first fully connected layer, denoted [symbol] and/or [symbol], and added to the offset vector [symbol] and/or [symbol] to form the second input vectors [symbol] and/or [symbol]. These are fed in turn to the corresponding input end of the next LSTM step, where c ranges over all the corresponding multi-element types (for example, all current dual-element types such as AG and GJ). The vector in the hidden layer fed by the first input vector is again transferred or skipped to the hidden layer fed by the next input end, and the next group (sequence number 2) of m prediction results is predicted; these are the m subgraphs with the top-m probabilities in the next corresponding multi-element relation set. The procedure continues in the same way until a group of m top-probability subgraphs has been predicted for every one of the K element relation sets. Here [symbol] and/or [symbol] is obtained by feeding the prediction graph, obtained through step S5-2 from the single-element relation set or from the multi-element relation set with the smallest element count, to the trained CNN model and taking the output of the first fully connected layer; it corresponds to the probability [symbol] and/or [symbol]. [symbol] and/or [symbol] are the vectors output by the first fully connected layer when the prediction graphs obtained through step S5-2 for each multi-element relation set, in order of increasing element count, are fed to the trained CNN model.
S5-3-2-3: Validate the LSTM model built in S5-3-2-1 and S5-3-2-2 on the verification set, compute the accuracy and the loss function L', and back-propagate to adjust the LSTM network parameters. Training stops once the accuracy ACC is stable and L' is at its minimum.
Preferably, maximum likelihood estimation is used as the loss function [formula], where [formula] or [formula]. Num is the total number of prediction result groups, excluding the one zeroth output; y_ij is the corresponding real access probability in the verification set; λ is a regularization parameter; model optimization is carried out by stochastic gradient descent or a variant thereof; θ denotes the parameters of the second softmax regression layer, comprising the weight matrix W_ij' and the offset vector b_ij'; j' is the group number of the prediction result; and [symbol] is the square of the 2-norm. For each of the m prediction results of each group, the second softmax function has the form y_0ij' = softmax(W_ij' · P_oij' + b_ij'), where P_oij' is the output vector of the second fully connected layer at the output end under the sequence number of the current prediction result group. The weight matrix W_ij' and the offset vector b_ij' are not shared between the prediction results of different subgraphs; they are shared only for the same subgraph across the LSTM.
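The form softmax(W · P + b) followed by a top-m selection can be illustrated with a small, dependency-free Python sketch. It simplifies the per-subgraph weights of the text into one weight row and one offset per candidate subgraph; all numeric values, the labels, and the function names are made up for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_top_m(W, P, b, labels, m):
    """Score every candidate subgraph as W_row . P + b and keep the m most likely."""
    scores = [sum(w * p for w, p in zip(row, P)) + bias for row, bias in zip(W, b)]
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:m]


W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]  # one row per candidate
b = [0.0, 0.0, 0.0, 0.0]
P = [2.0, 1.0]  # stand-in for the second fully connected layer output
top = predict_top_m(W, P, b, ["AG", "AS", "AN", "GJ"], m=2)
print([label for label, _ in top])
```

Here the scores are 2, 1, 0, and -2, so AG and AS are selected, matching the example in which AG and AS are the two most probable dual-element subgraphs.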
S6 specifically comprises:
Input the input data into the FeatureRDD model and establish updated meta information and attribute information. Feed the prediction graph, obtained through S5-2 from the updated input history database and/or access history database, to the CNN input end of the CNN-LSTM model, predict at least one group of prediction results with the LSTM established in step S5-3, and dynamically index the prediction results. Finally, process and analyze the dynamically indexed data to realize parallel computation. The data processing and analysis comprises: filtering the data after the local spatial index is created, obtaining geographic and time ranges, clipping, spatial query, attribute summarization, grid aggregation, polygon aggregation, column extraction, and additional-column calculation.
Alternatively, select the data type to be accessed, feed the prediction graph obtained through S5-2 to the CNN input end of the CNN-LSTM model, predict at least one group of prediction results with the LSTM established in step S5-3, and display the database items of the databases corresponding to the selected and predicted results.
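Two of the listed processing operations, filtering by geographic and time range and grid aggregation, can be sketched in plain Python. The record layout (keys `x`, `y`, `t`), the bounding-box convention, and the function names are assumptions made for illustration; a real deployment would run such steps in parallel on the distributed data.

```python
from collections import Counter


def filter_records(records, bbox, t_range):
    """Keep records inside the bounding box and the time range (inclusive)."""
    min_x, min_y, max_x, max_y = bbox
    t0, t1 = t_range
    return [
        r for r in records
        if min_x <= r["x"] <= max_x and min_y <= r["y"] <= max_y and t0 <= r["t"] <= t1
    ]


def grid_aggregate(records, cell):
    """Count records per square grid cell of side length `cell`."""
    counts = Counter()
    for r in records:
        counts[(int(r["x"] // cell), int(r["y"] // cell))] += 1
    return dict(counts)


records = [
    {"x": 0.5, "y": 0.5, "t": 1},
    {"x": 1.5, "y": 0.2, "t": 2},
    {"x": 0.7, "y": 0.9, "t": 3},
    {"x": 9.0, "y": 9.0, "t": 2},
]
kept = filter_records(records, bbox=(0, 0, 2, 2), t_range=(1, 2))
print(grid_aggregate(kept, cell=1.0))
```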
The beneficial effects are that:
The scheme uses the FeatureRDD model to compute massive numbers of vector elements in parallel and to accumulate input data and/or access history continuously. This provides the dynamic index with training and verification sets large enough to optimize the CNN-LSTM and, in turn, provides accurate results for parallel computation based on the dynamic index. In particular, when the elements of the square matrix are pixel values formed from the elements of the relation sets of the most recent first preset time periods, the optimized CNN-LSTM reflects the most recent pattern of data input and/or access, so that indexing is efficient, dynamic, and immediate.
Drawings
Fig. 1 is a schematic diagram of the nineteen types of data in the three databases (sky, ground, and human) at the same time t, and of the homogeneous graph formed from these nineteen types of data, according to Example 1 of the present invention.
FIG. 2 is a flow chart of the FeatureRDD model structure including the filtering and analysis module and the CNN-LSTM modeling module constructed in example 3, and the parallel computation and dynamic indexing of data input and data access by the FeatureRDD model.
Fig. 3 is a flowchart illustrating the construction of a relationship set prediction graph according to embodiment 4.
FIG. 4 is a schematic diagram of the structure of the CNN-LSTM model and modeling and prediction flow.
Detailed Description
For simplicity of explanation, the single element and the multi-element relationship set are simply referred to as a single element set and a multi-element set, respectively, in the embodiment.
Example 1
This embodiment provides a method for establishing the sky, ground, and human databases and constructing the homogeneous graph, as shown in Fig. 1. An atmospheric environment database, a region-partitioned human geographic database, and a ground natural geographic database are established. The atmospheric environment data include, but are not limited to, meteorological data A, illumination data B, environmental noise and management data C, and production and living gas emission and management data D. The region-partitioned human geographic database includes, but is not limited to, population basic information data E, agricultural production management data F, industrial and enterprise company production management data G, tertiary industry production management data H, information network data I, social medical data J, social financial data K, social property and insurance data L, social education data M, judicial and administrative management data N, and ground road and traffic management data O. The ground natural geographic data include, but are not limited to, geological and geological activity data P, hydrologic data Q, surface vegetation data R, and ground building and artificial landscape data S. This gives a total of nineteen types of data sets.
The four types of data A-D form the data set V_1 of the atmospheric environment database; the eleven types of data E-O form the data set V_2 of the region-partitioned human geographic database of a city; and the four types of data P-S form the data set V_3 of the ground natural geographic database. E_1, E_2, and E_3 are the relation sets between the data sets within the atmospheric environment database, the region-partitioned human geographic database, and the ground natural geographic database, respectively, and X_1, X_2, and X_3 are the corresponding information matrices (not shown in the figure; they represent the relation mapping information and the data mapping information within the three databases), which reflect the structure of the homogeneous graph within each database. V_4, E_4, and X_4 are, respectively, the database set formed by the three databases V_1, V_2, and V_3, the corresponding database relation set (not shown), and the database information matrix (not shown; it represents the relation mapping information and the data mapping information between the three databases), which reflect the structure of the homogeneous graph between the databases.
E_1, E_2, E_3, and E_4 each comprise the relation sets of the first predetermined time period of 30 s, divided into a single-element set A_e and a multi-element set A_m. Both are arranged in the time order of a plurality of first predetermined time periods to form the time-dependent single-element set A_e(t) and multi-element set A_m(t) at the current time t. The information matrix contains the attribute information of the various data. The elements of the current relation set in any first predetermined time period are subgraphs of the homogeneous graph, and subgraphs with the same number of data types are collected in the same relation set.
In this embodiment, A_e(t) = {A, …, G, …, J} and A_m(t) = {(AG), …} ∪ {(AGJ), …} ∪ …; that is, the single-element set contains at least the elements A, G, and J, and the multi-element set comprises at least a dual-element set containing (AG), a triple-element set containing (AGJ), and sets of more elements.
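Collecting subgraphs with the same number of data types into the same relation set can be sketched as follows. The string encoding of subgraphs and the function name `build_relation_sets` are assumptions for illustration; exchanged orderings such as AG and GA are merged because the graph is undirected.

```python
from collections import defaultdict


def build_relation_sets(subgraphs):
    """Split observed subgraphs into a single-element set and multi-element sets.

    Returns (single, multi), where multi maps an element count (2, 3, ...)
    to the set of distinct subgraphs with that many data types.
    """
    by_size = defaultdict(set)
    for s in subgraphs:
        by_size[len(s)].add("".join(sorted(s)))  # order-independent key
    single = by_size.pop(1, set())
    multi = {k: by_size[k] for k in sorted(by_size)}
    return single, multi


single, multi = build_relation_sets(["A", "G", "J", "AG", "GA", "AGJ", "JGA"])
print(sorted(single), multi)
```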
Example 2
A real-time input history database and/or an access history database is established for the databases. The access history database records the real-time input and modification history, the data auditing history, and the data query history of the data, forming a modification history database, a data auditing history database, and a data query history database, respectively. The history databases are divided into intra-database and inter-database history databases so as to realize the data set mapping and relation set mapping of the homogeneous graphs G_1; G_2; G_3; G_4 of step S2, finally forming a dynamic homogeneous graph that varies with time, [symbol]. As the times t_1-t_4 advance (they may be the same recording instant or at least two different recording instants), the input history data in the real-time input history database and the access history data in the access history database are continuously updated, where E_1(t_1) = {A_e1(t_1), A_m1(t_1)}, E_2(t_2) = {A_e2(t_2), A_m2(t_2)}, E_3(t_3) = {A_e3(t_3), A_m3(t_3)}, and E_4(t_4) = {A_e4(t_4), A_m4(t_4)}.
Example 3
This embodiment illustrates the construction of a FeatureRDD model comprising a filtering and analysis module and a CNN-LSTM modeling module, as shown in Fig. 2. Its meta information contains the space-time coordinates and the time-dependent relation sets of Example 2: E_1(t_1) = {A_e1(t_1), A_m1(t_1)}, E_2(t_2) = {A_e2(t_2), A_m2(t_2)}, E_3(t_3) = {A_e3(t_3), A_m3(t_3)}, and E_4(t_4) = {A_e4(t_4), A_m4(t_4)}. For the three databases of Example 1 (sky, ground, and human), parallel computation on input data and data access services are provided by different servers. According to the data input or data access request, they correspond to three kinds of attribute information in the FeatureRDD model: the atmospheric environment attribute information of a city, formed by the four types of data A-D; the human geographic attribute information of a city, formed by the eleven types of data E-O; and the ground natural geographic attribute information of a city, formed by the four types of data P-S. This completes the distributed architecture of the space-time data. Fig. 2 shows the attribute information of data B, comprising a time-dependent data set (that is, the node representation of the homogeneous graph), a time-dependent information matrix, and an input history database and/or an access history database.
When a parallel-computing user or an access user performs parallel computation or data access through the input server or the selection server, the FeatureRDD model filters and analyzes the input data to obtain the parallel computation results. The input history data in the real-time input history database and the access history data in the access history database, both held in the attribute information, are called to update continuously the time-dependent relation sets of Example 2 in the meta information. The CNN-LSTM modeling module then builds the CNN-LSTM model, which predicts the input data and/or access data according to the data the user chooses to input or the data type the user chooses to access and thereby completes the dynamic index. The dynamically indexed data are then filtered and analyzed to obtain the parallel computation results.
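The split into meta information and attribute information can be pictured with a small Python data-structure sketch. The class and field names (`MetaInfo`, `AttributeModule`, `FeatureRecord`, `log_access`) are hypothetical and chosen only to mirror the description; the actual FeatureRDD layout is not specified at this level of detail.

```python
from dataclasses import dataclass, field


@dataclass
class MetaInfo:
    coords: tuple                                     # space-time coordinates (x, y, t)
    single_sets: dict = field(default_factory=dict)   # i -> A_ei(t)
    multi_sets: dict = field(default_factory=dict)    # i -> A_mi(t)


@dataclass
class AttributeModule:
    data_type: str                                    # one of the classes 'A' .. 'S'
    dataset: list = field(default_factory=list)       # time-dependent data set
    info_matrix: list = field(default_factory=list)   # time-dependent information matrix
    input_history: list = field(default_factory=list)
    access_history: list = field(default_factory=list)


@dataclass
class FeatureRecord:
    meta: MetaInfo
    attributes: dict = field(default_factory=dict)    # data type -> AttributeModule

    def log_access(self, data_type: str, timestamp: float) -> None:
        """Append an access event to the history of the given data type."""
        module = self.attributes.setdefault(data_type, AttributeModule(data_type))
        module.access_history.append(timestamp)


record = FeatureRecord(MetaInfo(coords=(116.4, 39.9, 0.0)))
record.log_access("B", 0.01)
record.log_access("B", 0.02)
print(len(record.attributes["B"].access_history))
```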
Example 4
This embodiment illustrates the construction of the relation set prediction graphs, as shown in Fig. 3. Requests for data input and data access form an updated history database in the FeatureRDD distributed data structure model, which updates the time-dependent relation sets of Example 2; the probabilities of data input and history access are counted in real time for the single-element set, the dual-element set, the triple-element set, and sets of more elements. In this embodiment the input and access probabilities for the single-element, dual-element, and triple-element sets are denoted [symbol] and [symbol], [symbol] and [symbol], where the latter is the maximum probability in the corresponding set predicted in Example 5.
Using the preset mapping of pixel gray values, c → pix_c, the time-dependent relation sets of Example 2 are converted into the prediction graph pixel sets [symbol] and [symbol]. In this embodiment both have dimensions 224×224×k, and j = 0 is taken as the case considered, which yields the prediction graphs corresponding to the single-element, dual-element, and triple-element sets (referred to as the single-element, dual-element, and triple-element prediction graphs, and so on). The pixels are arranged from left to right and from top to bottom, as shown by the arrows in Fig. 3. [symbol] and [symbol], [symbol] and [symbol] are the real values against which the predictions of Example 5 are compared; they represent the input probabilities and access probabilities counted in real time for the corresponding elements. The representative elements A, AG, and AGJ are all drawn in the upper right corner to illustrate schematically the subgraphs of the homogeneous graph formed from the access history data recorded by the FeatureRDD in the first predetermined time period. Their actual positions follow the order in which A, AG, and AGJ appear in A_epi(t) and A_mpi(t), arranged in the 224×224 square matrix. The dimension of the pixel-value relation sets grows with time while the square matrix keeps its size of 224×224, so the number of prediction graphs of each kind increases, and the corresponding input and access probabilities are no longer limited to the case j = 0.
Further single-element, dual-element, triple-element, and other prediction graphs are arranged as shown in Fig. 3. For the representative elements A, AG, and AGJ, the red boxes that appear immediately after each of them are taken as the next representative elements (for AG and AGJ, because of the homogeneous graph relation, subgraphs that differ only by exchanging any two elements are regarded as the same subgraph). These are likewise arranged from the upper right corner of the square matrix in the time order in which the elements were generated (that is, in steps of one first predetermined time period of 10 ms), forming further prediction graphs of each kind and providing a training set for the CNN.
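The way a growing, time-ordered pixel sequence yields an increasing number of fixed-size prediction graphs can be sketched in Python. The sketch assumes a row-major fill (left to right, top to bottom) starting at the first cell and zero padding of the last, partially filled graph; the padding rule, the function name, and the sample pixel values are assumptions, and the corner at which Fig. 3 starts the arrangement is not modeled.

```python
SIZE = 224


def to_prediction_graphs(pixels, size=SIZE):
    """Split a time-ordered pixel list into size x size matrices, zero-padded."""
    per_graph = size * size
    graphs = []
    for start in range(0, len(pixels), per_graph):
        chunk = pixels[start:start + per_graph]
        chunk = chunk + [0] * (per_graph - len(chunk))  # pad the last graph
        graphs.append([chunk[r * size:(r + 1) * size] for r in range(size)])
    return graphs


# A tiny 2 x 2 illustration: five pixel values need two graphs.
small = to_prediction_graphs([1, 2, 3, 4, 5], size=2)
print(small)
```

With the default size, every additional 224 × 224 = 50176 pixel values start a new prediction graph.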
In this way, the input history database and the access history database in the FeatureRDD are updated according to the input data and the data accesses, which updates the relation sets in the meta information, E_1(t_1) = {A_e1(t_1), A_m1(t_1)}, E_2(t_2) = {A_e2(t_2), A_m2(t_2)}, E_3(t_3) = {A_e3(t_3), A_m3(t_3)}, and E_4(t_4) = {A_e4(t_4), A_m4(t_4)}. At the same time, the input probability and the access probability within each element set are counted in real time, the relation sets are converted into pixel-value relation sets through the mapping of this embodiment, and the pixel values are arranged in 224×224 square matrices to give multiple prediction graphs of each kind for training the CNN.
Example 5
This embodiment illustrates the process of CNN-LSTM modeling. Taking the data access service as an example, as shown in Fig. 4, K kinds of prediction graphs are formed from the current access history data by the method of Example 4 and fed to a CNN network. The CNN output is connected to a first fully connected layer FC and then fed to a first softmax regression layer, which identifies the kind of each prediction graph. The method specifically comprises the following steps:
S5-3-1-1: For the single-element, dual-element, triple-element, and higher-element relation sets in turn, form the current prediction graphs of each element relation set through step S5-2, and divide them into a training set, a verification set, and a test set in the ratio 3:1:1.
S5-3-1-2: Feed the training set of prediction graphs for the current element relation sets to the CNN network. The first fully connected layer FC forms the input vectors P_1, P_2, …, P_K, which are passed to the first softmax regression layer to obtain the classification probabilities p_1, p_2, …, p_K. The cross-entropy loss function L_coss is computed on the verification set and back-propagated to adjust the CNN network parameters; training stops once the accuracy acc is stable and L_coss is at its minimum. That is, once the CNN has been trained, input vectors with the features of P_1, P_2, …, P_K are mapped to the prediction probabilities p_1, p_2, …, p_K, which give the probability that the input prediction graph belongs to the single-element or multi-element relation set with subscript 1, 2, …, K, respectively.
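A 3:1:1 division into training, verification, and test sets can be sketched as follows. The sketch assumes a simple ordered split without shuffling and a hypothetical function name; whether the prediction graphs are shuffled before splitting is not stated in the text.

```python
def split_3_1_1(items):
    """Split a list into training, verification, and test parts in the ratio 3:1:1."""
    n = len(items)
    n_train = n * 3 // 5
    n_val = n // 5
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


graphs = [f"graph_{i}" for i in range(10)]  # stand-ins for prediction graphs
train, val, test = split_3_1_1(graphs)
print(len(train), len(val), len(test))
```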
S5-3-2-1: Construct an LSTM network in which the first hidden layer h^(1) is the dual-element prediction unit, the second hidden layer h^(2) is the triple-element prediction unit, and so on for prediction units of more elements. All hidden layers, including the zeroth hidden layer h^(0), can pass their transferred vector on with a skip. W_e, W_h, and U are the output weights of each layer, and the input matrix E acts on the input vectors to form the input-end vectors e^(1), e^(2), and so on.
The output end of each hidden layer is connected to a second fully connected layer FC, and a second softmax regression layer predicts the single subgraph with the maximum probability (that is, m = 1) in the corresponding multi-element relation set.
When the FeatureRDD identifies the user's access data as A, the current prediction graph is formed and fed to the trained CNN, and the first FC outputs the first input vector of the LSTM, [symbol]. This vector is fed to the first input end to form the input-end vector e^(1), which is then fed to the first hidden layer h^(1). The zeroth hidden layer h^(0) is activated with a zero vector: the zero vector O is the initial input to the zeroth hidden layer, and the output of the zeroth hidden layer is transferred or skipped to the next hidden layer. In this embodiment the number of elements in the dual-element set exceeds the predetermined value (for example, 1.5×10^9), so the set takes part in prediction; the vector transferred from h^(0) therefore does not skip the first hidden layer h^(1) but is transferred to it directly through the action of W_h. If the dual-element set did not reach the predetermined value, the vector would skip to the hidden layer of a multi-element set that does reach it.
S5-3-2-2: Activate the zeroth hidden layer with a zero vector as input. The zero vector fed to the zeroth input end initializes the first output end, which is assigned [symbol]. The first input vector output by the first fully connected layer FC, denoted [symbol], is fed to the corresponding first input end of the LSTM, and the vector generated by the zeroth hidden layer h^(0) is transferred to the first hidden layer h^(1) fed by the first input end. According to S5-3-2-1, the single subgraph AG of the first group (sequence number 1) is predicted with probability [symbol]. This probability is multiplied by the corresponding input vector output by the first fully connected layer, denoted [symbol], and added to the offset vector [symbol] to form the second input vector [symbol], which is fed to the corresponding second input end of the next LSTM step. The vector in the first hidden layer h^(1) is transferred to the second hidden layer h^(2) fed by the second input end, and the next group (sequence number 2) is predicted: the single subgraph AGJ, which is likewise the subgraph with the maximum probability in the corresponding triple-element relation set. The procedure continues in the same way until all K element relation sets have been predicted. Here [symbol] is obtained by feeding the prediction graph, obtained through step S5-2 from the single-element relation set, to the CNN model trained in Example 4 (the number of elements in the single-element relation set evidently exceeds the specified value at this point), and corresponds to [symbol]. [symbol] are the input vectors output by the first fully connected layer when the prediction graphs obtained through step S5-2 for the multi-element relation sets, in order of increasing element count, are fed to the CNN model trained in this embodiment.
If m is chosen to be 2 or more, a corresponding number of second input vectors, third input vectors, and so on is generated, and each output end of the output layer gives the prediction results with the top-m probabilities.
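The construction of the next input vector, a predicted probability multiplied by the corresponding first-FC vector plus an offset vector, can be written as a small worked example in Python. The numeric values and the function names are illustrative assumptions; the case m > 1 simply repeats the same operation for each of the m predicted subgraphs.

```python
def next_input(prob, fc_vector, offset):
    """Scale the first-FC vector by the predicted probability and add the offset."""
    return [prob * v + o for v, o in zip(fc_vector, offset)]


def next_inputs(probs, fc_vectors, offsets):
    """One next-step input vector per predicted subgraph (covers m > 1)."""
    return [next_input(p, v, o) for p, v, o in zip(probs, fc_vectors, offsets)]


# m = 1: a single predicted subgraph with probability 0.5.
second = next_input(0.5, [2.0, 4.0], [0.5, -1.0])
print(second)  # [1.5, 1.0]

# m = 2: two predicted subgraphs yield two input vectors.
pair = next_inputs([0.5, 0.25], [[2.0, 4.0], [4.0, 8.0]], [[0.0, 0.0], [1.0, 1.0]])
print(pair)
```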
S5-3-2-3: Validate the LSTM model built in S5-3-2-1 and S5-3-2-2 on the verification set, compute the accuracy and the loss function L', and back-propagate to adjust the LSTM network parameters. Training stops once the accuracy ACC is stable and L' is at its minimum.
Maximum likelihood estimation is used as the loss function [formula], where [formula]. Num is the total number of prediction result groups, excluding the one zeroth output, and so comprises at least sequence number 1 and sequence number 2 in Fig. 4; y_ij is the corresponding real access probability in the verification set; λ is a regularization parameter; model optimization is carried out by stochastic gradient descent or a variant thereof; θ denotes the parameters of the second softmax regression layer, comprising the weight matrix W_ij' and the offset vector b_ij'; j' is the group number of the prediction result; and [symbol] is the square of the 2-norm. For each of the m prediction results of each group, the second softmax function has the form y_0ij' = softmax(W_ij' · P_oij' + b_ij'), where P_oij' is the output vector of the second fully connected layer at the output end under the sequence number of the current prediction result group. The weight matrix W_ij' and the offset vector b_ij' take different values for the prediction results of different subgraphs. Since m = 1 in this embodiment, only one weight matrix W_ij' and one offset vector b_ij' take part in parameter adjustment during back propagation, and they are shared across the LSTM.
Example 6
This embodiment illustrates dynamic indexing and parallel computing, including:
Input the input data into the FeatureRDD model and establish updated meta information and attribute information. Feed the prediction graph, obtained through S5-2 from the updated input history database and/or access history database, to the CNN input end of the CNN-LSTM model, predict at least one group of prediction results with the LSTM established in Example 5, and dynamically index the prediction results. Finally, process and analyze the dynamically indexed data to realize parallel computation. The data processing and analysis comprises: filtering the data after the local spatial index is created, obtaining geographic and time ranges, clipping, spatial query, attribute summarization, grid aggregation, polygon aggregation, column extraction, and additional-column calculation.
Alternatively, select the data type to be accessed, feed the prediction graph obtained in Example 4 to the CNN input end of the CNN-LSTM model, predict at least one group of prediction results with the LSTM established in Example 5, and display the database items of the databases corresponding to the selected and predicted results.
Thus, when a data access request is made through the server of Example 3, the 224×224 prediction graph corresponding to the selected data type at the current time (that is, the latest version of the prediction graph as time advances) is obtained and fed to the CNN, and the output vector of the first FC is fed to the corresponding input end of the LSTM. If A is selected, the vector is fed to the first input end and the dual-element and triple-element subgraphs are predicted in turn, such as AG and AGJ in Example 5. If AG is selected for access, the vector is fed to the second input end of the LSTM and a subgraph in the triple-element set is predicted, such as AGJ in Example 5, and so on. Accesses to AG and to GA are regarded as accesses to the same subgraph and are both counted toward the access probability of the representative element AG, so the order of access does not affect the prediction result. Likewise, the order in which A and G are accessed does not affect the purpose of the access, which is to expand the database items of A and G. The homogeneous graph architecture therefore reduces the computational load.
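Counting AG and GA toward the same access probability can be sketched with a short Python example. The access log, the string encoding of subgraphs, and the function name are assumptions for illustration; the sketch normalizes over a log that contains only dual-element accesses.

```python
from collections import Counter


def access_probabilities(access_log):
    """Relative access frequency per subgraph, ignoring the order of data types."""
    counts = Counter("".join(sorted(entry)) for entry in access_log)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


log = ["AG", "GA", "AS", "GA"]  # hypothetical dual-element accesses
probs = access_probabilities(log)
print(probs)
```

Three of the four accesses fall on the representative element AG, so only two probabilities need to be tracked instead of three.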

Claims (9)

1. A data indexing method based on a FeatureRDD model with a homogeneous graph architecture, characterized by comprising the following steps:
S1, establishing at least one of three databases: an atmospheric environment database, a region-partitioned human geographic database, and a ground natural geographic database;
S2, constructing attributed homogeneous information network models within and between the atmospheric environment database, the region-partitioned human geographic database, and the ground natural geographic database to form a homogeneous graph G = {G_1; G_2; G_3; G_4} = {V_1, E_1, X_1; V_2, E_2, X_2; V_3, E_3, X_3; V_4, E_4, X_4}, wherein V_1, V_2, and V_3 are the data sets in the atmospheric environment database, the region-partitioned human geographic database, and the ground natural geographic database, respectively; E_1, E_2, and E_3 are the relation sets between the data sets within the atmospheric environment database, the region-partitioned human geographic database, and the ground natural geographic database, respectively; X_1, X_2, and X_3 are the corresponding information matrices, reflecting the homogeneous graph structure within each database; V_4, E_4, and X_4 are, respectively, the database set consisting of the atmospheric environment database, the region-partitioned human geographic database, and the ground natural geographic database, the corresponding database relation set, and the database information matrix, reflecting the homogeneous graph structure between the databases; wherein E_1, E_2, E_3, and E_4 each comprise the relation sets of a first preset time period, divided into a single-element relation set A_e and a multi-element relation set A_m, both arranged in the time order of a plurality of the first preset time periods to form the time-dependent single-element relation set A_e(t) and multi-element relation set A_m(t) at the current time t; wherein the information matrix contains the attribute information of the various data, the elements of the current relation set in any first preset time period are subgraphs of the homogeneous graph, subgraphs with the same number of data types are collected in the same relation set, and the first preset time period is 1 ms-0.01 s;
S3, establishing a real-time input historical database and/or an access historical database of the database, wherein the access historical database records the real-time input and modification history, the data auditing history and the data inquiring history of the data, and respectively forms the modification historical database, the data auditing historical database and the data inquiring historical database; the history databases are divided into intra-library and inter-library history databases to realize the homogeneous graph G in the step S2 1 ;G 2 ;G 3 ;G 4 Data set mapping and relation set mapping of (a) to finally form a dynamic homogeneous graph varying with timeThus with t 1 -t 4 Over time, the input history data in the real-time input history database and the access history data in the access history database are continuously updated, wherein E 1 (t 1 )={A e1 (t 1 ),A m1 (t 1 )},E 2 (t 2 )={A e2 (t 2 ),A m2 (t 2 )},E 3 (t 3 )={A e3 (t 3 ),A m3 (t 3 )},E 4 (t 4 )={A e4 (t 4 ),A m4 (t 4 )};
S4, constructing a FeatureRDD model comprising a filtering and analyzing module and a modeling module, and dividing all data input by history into meta information and attribute information, wherein the meta information stores space-time coordinates where the input or access data is located and a time-containing unit element relation set A ei (t) time-lapse multi-element relationship set A mi (t), i=1, 2,3,4, and the attribute information is divided into a plurality of database modules according to the category of the data, each module storing the same A class of time-containing data sets of the mass graph, a time-containing information matrix, and an input history database and/or an access history database;
S5, constructing an index prediction model of the data based on the input and access histories of the homogeneous graph, and realizing a local dynamic space-time index;
S6, selecting input data and/or selecting the type of data required for access, and displaying the required database items according to the model constructed in S5.
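The structure described in S2 can be pictured with a minimal sketch. It assumes that an "element" of a relation set is a subgraph identified by the set of data class codes it involves, that timestamps are integer microseconds, and that the first preset time period is 1 ms (the lower bound stated in S2). The class name `HomogeneousGraph`, its fields and its methods are hypothetical and not taken from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass, field

BIN_US = 1000  # assumed first preset time period: 1 ms, expressed in microseconds


@dataclass
class HomogeneousGraph:
    """One G_i = (V_i, E_i, X_i); all field names are illustrative."""
    nodes: set = field(default_factory=set)    # V_i: data sets seen so far
    info: dict = field(default_factory=dict)   # X_i: attributes per data set
    single: dict = field(default_factory=lambda: defaultdict(list))  # A_e(t)
    multi: dict = field(default_factory=lambda: defaultdict(list))   # A_m(t)

    def add_relation(self, timestamp_us, members):
        """File one relation (a subgraph) under the time bin it falls in."""
        members = frozenset(members)
        self.nodes |= members
        bin_index = timestamp_us // BIN_US
        target = self.single if len(members) == 1 else self.multi
        target[bin_index].append(members)
        return bin_index

    def relation_set(self, bin_index):
        """E_i(t) = {A_ei(t), A_mi(t)} for one time bin."""
        return {"A_e": list(self.single[bin_index]),
                "A_m": list(self.multi[bin_index])}
```

Keeping timestamps as integers avoids floating-point surprises when assigning a record to its time bin.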
2. The method of claim 1, wherein:
the atmospheric environment data comprises, but is not limited to, meteorological data A, illumination data B, environmental noise and management data C, and production and living gas emission and management data D;
the regionally partitioned human geographic data comprises, but is not limited to, population basic information data E, agricultural production management data F, industrial and enterprise company production management data G, tertiary industry production management data H, information network data I, social medical data J, social financial data K, social property and insurance data L, social education data M, judicial and administrative management data N, and ground road and traffic management data O;
the ground natural geographic data comprises, but is not limited to, geological and geological activity data P, hydrologic data Q, ground vegetation data R, and ground architecture and artificial landscape data S;
for a total of nineteen classes of data sets.
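As a quick check of the count, the class codes of claim 2 can be written out as a lookup table. The dictionary keys and the helper `database_of` are illustrative names only.

```python
# Class codes A..S grouped by the database they belong to (claim 2).
DATABASE_CLASSES = {
    "atmospheric environment": "ABCD",
    "regionally partitioned human geography": "EFGHIJKLMNO",
    "ground natural geography": "PQRS",
}
ALL_CLASSES = "".join(DATABASE_CLASSES.values())


def database_of(code):
    """Return the database that holds a single class code."""
    if len(code) != 1:
        raise KeyError(code)
    for name, codes in DATABASE_CLASSES.items():
        if code in codes:
            return name
    raise KeyError(code)
```

The three groups contain 4, 11 and 4 codes, which gives the nineteen classes stated in the claim.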
3. The method according to claim 2, wherein S5 specifically comprises:
S5-1, inputting the input data into the FeatureRDD model, so that the input data is classified according to the data classes in S1 and entered into the corresponding attribute information, entering the meta information, and updating the input history database and/or the access history database;
S5-2, constructing relation set prediction graphs;
S5-3, establishing a CNN-LSTM prediction model based on the relation set prediction graphs.
4. A method according to claim 3, wherein S5-2 comprises:
S5-2-1, updating the time-containing single-element relation set A_ei(t) and the time-containing multi-element relation set A_mi(t) with the data in the input history database and/or the access history database, and computing in real time the input probability and/or the access probability of each relation set, where c ∈ {A, B, …, S} represents the type of database;
S5-2-2, defining a mapping c → pix_c between the nineteen database classes c and image pixel values pix_c, so that the time-containing single-element relation set A_ei(t) and the time-containing multi-element relation set A_mi(t) are mapped to a time-containing single-element pixel value relation set and a time-containing multi-element pixel value relation set respectively, where the plus sign in +pix_c denotes summing the pixel values of the databases involved in an element of the corresponding relation set to form one pixel value representing that element; within a second preset time period, extracting the representative element pixel values from the time-containing single-element pixel value relation set and the time-containing multi-element pixel value relation set, and arranging the representative element pixel values that belong to the same relation set, as defined in S2, in the order of the first preset time periods to form K prediction graphs of identical pixel dimensions, where K is the number of relation sets whose element count in the current pixel value relation set exceeds a preset value, and each prediction graph corresponds to one relation set; as data accumulate over the first preset time periods, the number of prediction graphs of each kind also increases; once one prediction graph has been formed, each further prediction graph starts from the representative element that immediately follows the last representative element of the previous prediction graph and uses the same pixel arrangement as the previous prediction graph; the preset value is 5 × 10^8.
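Step S5-2-1 does not say how the input or access probability of a relation set is estimated. One simple reading, offered here only as an assumption, is the relative frequency of each class combination in the history log. The function name `relation_set_probabilities` and the log format are hypothetical.

```python
from collections import Counter


def relation_set_probabilities(history):
    """Estimate a probability per class combination from a history log.

    `history` is an iterable of collections of class codes, one entry per
    logged input or access. Relative frequency is an assumed estimator.
    """
    counts = Counter(frozenset(entry) for entry in history)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {combo: n / total for combo, n in counts.items()}
```

Running the same function once over the input log and once over the access log would give the two kinds of probability separately.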
5. The method of claim 4, wherein the image pixel value pix_c is at least one of a gray value or a color RGB value, or a combination thereof; the pixel values differ for different c, and more preferably the pixel values corresponding to two adjacent databases differ by 5-11.
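A minimal sketch of the mapping c → pix_c and of the summed pixel value of an element follows. It assumes gray values, a fixed spacing of 10 between adjacent classes (inside the preferred 5-11 range), and a starting value of 10; none of these specific numbers come from the patent, and `PIX` and `element_pixel` are illustrative names.

```python
CLASSES = "ABCDEFGHIJKLMNOPQRS"  # the nineteen class codes of claim 2
STEP = 10                        # assumed spacing between adjacent classes

# c -> pix_c: A maps to 10, B to 20, ..., S to 190 (gray values).
PIX = {c: (i + 1) * STEP for i, c in enumerate(CLASSES)}


def element_pixel(element):
    """Sum the pixel values of the distinct classes that make up one element."""
    return sum(PIX[c] for c in set(element))
```

With this particular spacing, sums for multi-element sets can exceed 255 and can coincide with other values (A plus B gives 30, the same as C alone), so a real mapping would need a value range and spacing chosen to keep representative values distinguishable.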
6. The method of claim 4 or 5, wherein the duration of the second preset time period is an integer multiple of the first preset time period.
7. The method of claim 6, wherein the integer is 224×224 to 500×500, and the prediction graph of A_epi(t) or A_mpi(t) has dimensions of 224×224 to 500×500 pixels, so that each prediction graph corresponds to a plurality of input probabilities and/or access probabilities generated according to the different values of c, where j is a control index whose value depends on whether every element of the square matrix is occupied by a pixel value formed from elements of the current corresponding relation sets: j = 0 when the square matrix is fully occupied, in which case the corresponding relation holds with an equals sign, and j = 1 otherwise, in which case it holds with a less-than sign; positions with no pixel value are set to zero; and once the size of the square matrix has been selected, the size of every kind of prediction graph formed thereafter no longer changes.
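The square prediction graph of claim 7 can be sketched as follows. The row-major fill order, the default side length of 224 and the function name `build_prediction_graph` are assumptions; the claim only fixes the size range, the zero fill for empty positions and the control index j.

```python
def build_prediction_graph(pixels, side=224):
    """Arrange representative pixel values into a side x side grid.

    Returns (grid, j): j is 0 when every cell holds a real pixel value and
    1 when some cells had to be zero-filled. Row-major order is assumed.
    """
    capacity = side * side
    used = list(pixels)[:capacity]
    j = 0 if len(used) == capacity else 1
    padded = used + [0] * (capacity - len(used))
    grid = [padded[r * side:(r + 1) * side] for r in range(side)]
    return grid, j
```

For example, `build_prediction_graph([5, 6, 7, 8], side=3)` yields the grid `[[5, 6, 7], [8, 0, 0], [0, 0, 0]]` with j = 1.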
8. The method of claim 4, wherein S5-3 specifically comprises:
S5-3-1, constructing a CNN model that takes the prediction graph corresponding to each element relation set as its input end and the prediction probability values of each element relation set as its output end;
S5-3-2, establishing a CNN-LSTM joint model that takes the fully connected feature vector predicted for the single-element relation set as its input end and the input probability and/or access probability corresponding to the multi-element relation sets as its output end.
9. The method of claim 8,
wherein step S5-3-1 comprises:
S5-3-1-1, forming through step S5-2, in turn, the prediction graphs corresponding to each element relation set (single-element, double-element and higher), and dividing them into a training set, a verification set and a test set in a ratio of 5-1:1:3-1;
S5-3-1-2, inputting the training set of prediction graphs corresponding to the current element relation set into a CNN network, forming input vectors P_1, P_2, …, P_K through a first fully connected layer, and feeding them into a first softmax regression layer to classify the prediction probabilities p_1, p_2, …, p_K; computing the cross-entropy loss function L_coss by verification on the verification set and back-propagating to adjust the CNN network parameters; training stops when the accuracy acc is stable and L_coss is minimal; that is, once CNN learning is complete, each input vector of the form P_1, P_2, …, P_K yields the corresponding prediction probability p_1, p_2, …, p_K, which is the probability that the input prediction graph belongs to the single-element or multi-element relation set with subscript 1, 2, …, K respectively;
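The softmax regression layer and cross-entropy loss named in S5-3-1-2 are standard operations; the sketch below shows them in plain Python for a single prediction graph, independent of any deep-learning framework. The function names and the example scores are illustrative, and the CNN that would produce the scores is not shown.

```python
import math


def softmax(scores):
    """Turn K raw scores into probabilities p_1..p_K that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy loss for one sample whose true relation set is known."""
    return -math.log(probs[true_index])
```

A usage example: `softmax([2.0, 1.0, 0.1])` assigns the largest probability to the first relation set, and `cross_entropy` on that output is smallest when the true index is 0.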
wherein step S5-3-2 comprises:
S5-3-2-1, constructing an LSTM model network in which the hidden layers at each level serve as prediction units for the elements of the multi-element relation sets, the element count of the multi-element relation sets increasing in turn from double-element upward; the hidden layers at every level can pass vectors onward either directly or by skipping levels; the output end of each hidden layer is connected to a second fully connected layer, and a second softmax regression layer predicts the group of m subgraphs with the top-m probabilities in the corresponding multi-element relation set, where m is no more than half the number of subgraphs in the corresponding current multi-element relation set;
S5-3-2-2, activating the zeroth hidden layer by feeding it a zero vector: the zero vector is fed to the zeroth input end, and the resulting first output is used for initialization, for the input probability and/or the access probability; the first input vector output by the first fully connected layer is fed to the corresponding LSTM input end, and the vector generated by the zeroth hidden layer is passed, directly or by skipping, to the hidden layer of that input end; the probability of each of the first group of m subgraphs is predicted according to S5-3-2-1, multiplied by the corresponding input vector output by the first fully connected layer, and added to a bias vector to form a second input vector; the second input vectors of the m prediction results are fed in sequence to the input end of the next LSTM unit, where c' ranges over all corresponding multi-element types, and the vector in the hidden layer reached by the first input vector is passed, directly or by skipping, to the hidden layer of the next LSTM unit, which again predicts m results, namely the m subgraphs with the top-m probabilities in the corresponding multi-element relation set; and so on, until the last group of top-m subgraphs in the corresponding multi-element relation set has been predicted; wherein the first input vector is obtained by feeding the prediction graph, obtained in step S5-2 from the single-element relation set or from the multi-element relation set with the smallest element count, into the trained CNN model and taking the output of the first fully connected layer, and corresponds to the input probability and/or the access probability; each subsequent vector is likewise the output of the first fully connected layer of the trained CNN model for the prediction graph obtained in step S5-2 from each multi-element relation set of successively increasing element count;
S5-3-2-3, verifying the LSTM model with the verification set through S5-3-2-1 to S5-3-2-2, obtaining the accuracy and computing the loss function L', and back-propagating to adjust the LSTM network parameters; training stops when the accuracy ACC is stable and L' is minimal;
using maximum likelihood estimation as the loss function, wherein Num is the total number of prediction-result groups excluding the one zeroth output, y_ij is the corresponding real access probability in the verification set, λ is a regularization parameter, stochastic gradient descent or a variant thereof is used for model optimization, θ denotes the parameters of the second softmax regression layer, comprising the weight matrix W_ij' and the bias vector b_ij', j' is the group number of the prediction result, and the norm term is the square of the 2-norm; for each of the m prediction results in each group, the second softmax function has the form y_0ij' = softmax(W_ij' · P_oij' + b_ij'), where P_oij' is the output vector produced by the second fully connected layer at the output end under the current prediction-result group number; the weight matrix W_ij' and bias vector b_ij' are not shared between predictions for different subgraphs, but are shared within the LSTM between predictions for the same subgraph;
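The exact loss formula is not recoverable from this text, so the sketch below is only one plausible reading: a negative log-likelihood averaged over the prediction groups plus an L2 penalty weighted by λ, together with the top-m selection rule of S5-3-2-1. The function names, argument layout and the form of the loss are assumptions, not the patent's definition.

```python
import math


def top_m(probabilities, m):
    """Keep the m most probable subgraphs; m may not exceed half of them."""
    limit = len(probabilities) // 2
    if not 1 <= m <= limit:
        raise ValueError("m must be between 1 and half the number of subgraphs")
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:m]


def regularized_nll(predicted, observed, theta, lam):
    """Assumed loss: mean negative log-likelihood plus lam * ||theta||_2^2.

    `predicted` are model probabilities, `observed` the real probabilities
    from the verification set, `theta` a flat list of softmax parameters.
    """
    num = len(predicted)
    nll = -sum(y * math.log(p) for p, y in zip(predicted, observed)) / num
    return nll + lam * sum(w * w for w in theta)
```

Minimizing such a loss with stochastic gradient descent, as the claim mentions, would additionally require gradients with respect to θ, which are omitted here.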
S6 specifically comprises:
inputting the input data into the FeatureRDD model and establishing updated meta information and attribute information; feeding the prediction graph, obtained through S5-2 from the updated input history database and/or access history database, into the CNN input end of the CNN-LSTM model; predicting at least one group of prediction results in the LSTM established according to step S5-3, thereby realizing dynamic indexing of the prediction results; and finally performing data processing and analysis on the dynamically indexed data to realize parallel computation, wherein the data processing and analysis comprises: filtering the data after the local spatial index is created, obtaining geographic and time ranges, clipping, spatial query, attribute summarization, grid aggregation, polygon aggregation, column extraction and additional column calculation;
or selecting the data type required for access, feeding the prediction graph obtained through S5-2 into the CNN input end of the CNN-LSTM model, predicting at least one group of prediction results in the LSTM established according to step S5-3, and displaying the database items of the databases corresponding to the selected and predicted results.
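Two of the processing operations listed in S6, filtering by geographic and time range and grid aggregation, can be illustrated with plain Python. The record layout (dictionaries with `x`, `y` and `t` keys), the bounding-box convention and both function names are hypothetical; a real deployment would run such steps in parallel over a distributed dataset rather than over an in-memory list.

```python
from collections import Counter


def filter_records(records, bbox, time_range):
    """Keep records inside a bounding box and a closed time interval."""
    min_x, min_y, max_x, max_y = bbox
    t0, t1 = time_range
    return [r for r in records
            if min_x <= r["x"] <= max_x
            and min_y <= r["y"] <= max_y
            and t0 <= r["t"] <= t1]


def grid_aggregate(records, cell):
    """Count records per square grid cell of side length `cell`."""
    return Counter((int(r["x"] // cell), int(r["y"] // cell)) for r in records)
```

Filtering first and aggregating afterwards keeps the aggregation step proportional to the number of records that actually fall in the queried range.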
CN202310688288.5A 2023-06-12 2023-06-12 Data indexing method based on homogeneous map structure FeatureRDD model Active CN116701396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310688288.5A CN116701396B (en) 2023-06-12 2023-06-12 Data indexing method based on homogeneous map structure FeatureRDD model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310688288.5A CN116701396B (en) 2023-06-12 2023-06-12 Data indexing method based on homogeneous map structure FeatureRDD model

Publications (2)

Publication Number Publication Date
CN116701396A true CN116701396A (en) 2023-09-05
CN116701396B CN116701396B (en) 2023-12-29

Family

ID=87835275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310688288.5A Active CN116701396B (en) 2023-06-12 2023-06-12 Data indexing method based on homogeneous map structure FeatureRDD model

Country Status (1)

Country Link
CN (1) CN116701396B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909642A (en) * 2017-02-20 2017-06-30 中国银行股份有限公司 Database index method and system
CN111563081A (en) * 2020-04-09 2020-08-21 农业农村部规划设计研究院 Vector element parallel computing method and device, storage medium and terminal
CN111563080A (en) * 2020-04-09 2020-08-21 农业农村部规划设计研究院 Spatial data indexing and topological method, device and storage medium
KR20220099745A (en) * 2021-01-07 2022-07-14 서강대학교산학협력단 A spatial decomposition-based tree indexing and query processing methods and apparatus for geospatial blockchain data retrieval
CN115810148A (en) * 2022-11-17 2023-03-17 农业农村部大数据发展中心 Crop type image generation method and device

Also Published As

Publication number Publication date
CN116701396B (en) 2023-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant