CN114372185A - Rapid search system and method for remote sensing big data - Google Patents


Info

Publication number
CN114372185A
CN114372185A CN202210051029.7A CN202210051029A CN114372185A CN 114372185 A CN114372185 A CN 114372185A CN 202210051029 A CN202210051029 A CN 202210051029A CN 114372185 A CN114372185 A CN 114372185A
Authority
CN
China
Prior art keywords
data
information
combined
fragment
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210051029.7A
Other languages
Chinese (zh)
Other versions
CN114372185B (en)
Inventor
黄祥志
刘向东
郝梦非
周丽玲
顾冬冬
吴志钦
邓冬
戴希凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Tianhui Spatial Information Research Institute Co ltd
Original Assignee
Jiangsu Tianhui Spatial Information Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Tianhui Spatial Information Research Institute Co ltd filed Critical Jiangsu Tianhui Spatial Information Research Institute Co ltd
Priority to CN202210051029.7A priority Critical patent/CN114372185B/en
Publication of CN114372185A publication Critical patent/CN114372185A/en
Application granted granted Critical
Publication of CN114372185B publication Critical patent/CN114372185B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9532Query formulation

Abstract

The invention discloses a rapid search method for remote sensing big data, which comprises the following steps: step S100: acquiring and analyzing a retrieval request input by a user to generate a combined retrieval tag, and performing data crawling to obtain a plurality of combined data information fragments meeting the retrieval request; step S200: decomposing the plurality of combined data information fragments meeting the retrieval request into a plurality of mapping data pairs, sorting the combined data information fragments by priority and performing a primary adjustment of the order; step S300: traversing all mapping data pairs, capturing the difference data pairs, and gathering them into a difference mapping pair set in order to calculate the activity and predict the credibility of the different source data centers; step S400: adjusting the priority order of the combined data information fragments again based on the activity calculation result and the credibility prediction result; the user then retrieves and acquires data in the adjusted transit data center.

Description

Rapid search system and method for remote sensing big data
Technical Field
The invention relates to the technical field of remote sensing data search, in particular to a system and a method for quickly searching remote sensing big data.
Background
After more than half a century of development, remote sensing technology and its applications in many fields have entered a new stage. Remote sensing is increasingly bound up with the national economy, ecological protection and national defense security, for example in land resource surveys, ecological environment monitoring, agricultural monitoring and crop yield assessment, disaster forecasting and disaster assessment, and marine environment surveys; it also plays an important role in activities closely related to daily life, such as weather forecasting, air quality monitoring, electronic maps and navigation. In the 21st century, remote sensing technology has shown the new characteristics of high spatial resolution, high spectral resolution and high temporal resolution and has opened up more new application fields, so that existing spatial information resources on the Internet need to be found efficiently and conveniently. Remote sensing data with multiple spatial, spectral and temporal resolutions, acquired by different remote sensing platforms, can provide remote sensing information support for users in different application fields. Meanwhile, massive and varied spatial information resources have been opened on the Internet, including data sets opened by public institutions according to national policy, result sets opened by scientific research departments according to national requirements, open data sets shared by open source communities and public welfare organizations, and information services provided externally by commercial enterprises. One of the bottlenecks in applying high-resolution remote sensing data is that no reliable data source can be provided: satellite remote sensing data resources are scattered across data centers, and the data resources of any single data center are very limited. Before a project is started, every data center has to be searched one by one, and whether the project can be implemented can be judged only after the satellite remote sensing data coverage has been consolidated. This work becomes very tedious as the number of data companies increases and resource concentration decreases, which implicitly raises the demand for a quick and convenient index and search service that lets a user quickly learn whether satellite remote sensing data meeting business requirements exist and how to acquire them.
Disclosure of Invention
The invention aims to provide a system and a method for quickly searching remote sensing big data, which aim to solve the problems in the background technology.
In order to solve the technical problems, the invention provides the following technical scheme: a quick search method for remote sensing big data comprises the following steps:
step S100: acquiring a retrieval request input by a user and analyzing the retrieval request to generate a combined retrieval tag corresponding to the retrieval request; the combined retrieval tag has the form {main tag; secondary tag; auxiliary tag}; the quick search system generates a corresponding retrieval instruction based on the main tag in the combined retrieval tag; the quick search system performs wide-range data crawling on the data centers of the remote sensing data websites based on the retrieval instruction to obtain a plurality of combined data information fragments meeting the retrieval request;
step S200: decomposing a plurality of combined data information fragments meeting the retrieval request into a plurality of mapping data pairs between the source data center and the fragment data; sequencing the priority of each combined data information fragment to obtain an initial combined data information fragment sequence; and carrying out primary adjustment on the initial combined data information fragment sequence; sending the combined data information fragment sequence after primary adjustment to a transit data center;
step S300: traversing all mapping data pairs in the combined data information fragment sequence, capturing the difference data pairs which have the same fragment data and are mapped with different source data centers, and converging the captured difference data pairs into a difference mapping pair set; calculating and predicting the activity and the credibility of different source data centers in the difference mapping pair set;
step S400: adjusting the priority order of the combined data information fragments again based on the activity calculation result and the credibility prediction result; the user retrieves and acquires data in the adjusted transit data center; the user can generate user data feedback after using the data; the rapid search system either keeps the priority order of the combined data information fragments or adjusts it based on the user data feedback.
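The capture of difference data pairs in step S300 can be illustrated with a minimal Python sketch. It assumes that a mapping data pair is represented as a (source data center, fragment data) tuple; the function name `capture_difference_pairs` and the sample identifiers are hypothetical and do not come from the patent text.

```python
from collections import defaultdict


def capture_difference_pairs(mapping_pairs):
    """Return the pairs whose fragment data is mapped to more than one source data center."""
    centers_by_fragment = defaultdict(set)
    for center, fragment in mapping_pairs:
        centers_by_fragment[fragment].add(center)
    # Keep only pairs whose fragment data appears under different source data centers.
    return [
        (center, fragment)
        for center, fragment in mapping_pairs
        if len(centers_by_fragment[fragment]) > 1
    ]


# Hypothetical sample: tile_1 is offered by two centers, tile_2 by only one.
pairs = [("center_a", "tile_1"), ("center_b", "tile_1"), ("center_a", "tile_2")]
difference_set = capture_difference_pairs(pairs)
```

In this sample only the two pairs that share `tile_1` enter the difference mapping pair set.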
Further, step S100 includes:
step S101: correspondingly decomposing a retrieval request input by a user into retrieval condition parameters of each part, wherein the retrieval condition parameters of each part comprise a region, a time range, cloud cover, resolution, a technical source and a field to be applied; the technical source refers to different remote sensors for obtaining various remote sensing data;
step S102: taking the two retrieval condition parameters of technical source and field to be applied as the auxiliary tag of the retrieval request; taking the retrieval condition parameters of region, time range and cloud cover as the main tag of the retrieval request; taking the resolution as the secondary tag of the retrieval request; combining the main tag, the secondary tag and the auxiliary tag into a combined retrieval tag of the form {main tag; secondary tag; auxiliary tag};
step S103: sending the retrieval condition parameters of each part in JSON format, the server interface receiving them with @RequestBody; the server parses the retrieval condition parameters in the main tag, wherein the parsing comprises processing the region coordinate information into a corresponding spatial data type through a GIS algorithm, and converting the received time range information from character string form into time information in a date format;
step S104: performing data crawling in a data center of each remote sensing data website in a large range based on regional information in a main tag of a retrieval request; using the crawled data as an initial data range; further screening data in an initial data range based on the time range and the cloud cover information in the main label of the retrieval request;
The @RequestBody annotation is commonly used to handle request content whose Content-Type is not the default application/x-www-form-urlencoded, such as application/json or application/xml, and is most often used for application/json; @RequestBody accepts a JSON-formatted string, and the JSON string in the body of the retrieval request can be bound through @RequestBody either to a corresponding bean or to a corresponding string; with this arrangement, the screened data better match the retrieval request and retrieval efficiency is improved.
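The decomposition in steps S101 to S103 can be sketched as follows. The patent describes a server that receives the JSON body through @RequestBody; the Python below only illustrates how the six retrieval condition parameters might be grouped into the {main tag; secondary tag; auxiliary tag} form. The JSON field names, the `CombinedTag` class, the ISO date format and the representation of the region as a list of coordinate tuples are all assumptions made for illustration; the GIS conversion itself is not modeled.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CombinedTag:
    main: dict       # region, time range, cloud cover
    secondary: dict  # resolution
    auxiliary: dict  # technical source, field to be applied


def parse_request(body: str) -> CombinedTag:
    """Split a JSON retrieval request into main, secondary and auxiliary tags."""
    params = json.loads(body)
    # Convert the time range from strings to date objects (ISO format assumed).
    start, end = (
        datetime.strptime(text, "%Y-%m-%d").date() for text in params["time_range"]
    )
    main = {
        # Stand-in for the spatial data type produced by the GIS step.
        "region": [tuple(point) for point in params["region"]],
        "time_range": (start, end),
        "cloud_cover": params["cloud_cover"],
    }
    secondary = {"resolution": params["resolution"]}
    auxiliary = {
        "technical_source": params["technical_source"],
        "field": params["field"],
    }
    return CombinedTag(main, secondary, auxiliary)


# Hypothetical request body.
request_body = json.dumps({
    "region": [[118.0, 31.0], [119.0, 32.0]],
    "time_range": ["2021-01-01", "2021-12-31"],
    "cloud_cover": 10,
    "resolution": 2.0,
    "technical_source": "optical",
    "field": "agriculture",
})
tag = parse_request(request_body)
```

Only `tag.main` would then drive the crawl and the screening of step S104; the secondary and auxiliary tags are kept for the later ordering adjustment.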
Further, step S200 includes:
step S201: performing data center tracing on each of the plurality of combined data information fragments meeting the retrieval request to obtain a traceability set $\{A_1, A_2, \ldots, A_i, \ldots, A_n\}$ corresponding to each combined data information fragment; wherein $A_i$ represents the i-th source data center and $n$ represents the total number of source data centers; disassembling each combined data information fragment into fragment data based on its traceability set to obtain a data fragment set $\{B_1, B_2, \ldots, B_k, \ldots, B_m\}$; wherein $B_k$ represents the k-th fragment data and $m$ represents the total number of fragment data; and $n = m$;
step S202: establishing a one-to-one mapping relation between the traceability set and the data fragment set of each combined data information fragment meeting the retrieval request to obtain a plurality of mapping data pairs $\{A_i, B_k\}$, with $i = k$; counting the number of mapping data pairs in every combined data information fragment meeting the retrieval request; sorting all combined data information fragments in ascending order of their number of mapping data pairs to obtain an initial combined data information fragment sequence;
step S203: extracting combined retrieval tag information of each combined data information fragment in the initial combined data information fragment sequence, and performing primary adjustment on the initial combined data information fragment sequence based on secondary tags and auxiliary tags in the combined retrieval tags of each combined data information fragment;
The mapping data pairs obtained in the above steps make it easier to trace each obtained combined data information fragment back to its source data centers later, and the mapping data pairs of a combined data information fragment reflect the source composition of each part of its data. Different combined data information fragments contain different mapping data pairs, and each mapping data pair contains one source data center; a combined data information fragment with few mapping data pairs therefore has a simpler retrieval source, because the remote sensing data meeting the retrieval request were crawled from few sources, and the credibility risk of its source data centers is low. The combined data information fragments are sorted on this basis. The secondary tag and the auxiliary tag are then used to adjust the order of the obtained remote sensing data according to quality requirements, so that the user sees high-quality remote sensing data first.
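Steps S201 and S202 can be sketched in Python as follows, assuming that the traceability set and the data fragment set are given as two equally long lists and that each combined data information fragment is identified by a name. The function names and sample data are hypothetical.

```python
def to_mapping_pairs(source_centers, fragments):
    """Pair the i-th source data center with the i-th fragment data (n must equal m)."""
    if len(source_centers) != len(fragments):
        raise ValueError("traceability set and data fragment set differ in size")
    return list(zip(source_centers, fragments))


def initial_sequence(pairs_by_fragment):
    """Order combined fragments by ascending number of mapping data pairs."""
    # sorted() is stable, so fragments with equal counts keep their input order.
    return sorted(pairs_by_fragment, key=lambda name: len(pairs_by_fragment[name]))


# Hypothetical combined data information fragments.
pairs_by_fragment = {
    "fragment_x": to_mapping_pairs(["center_a", "center_b", "center_c"], ["t1", "t2", "t3"]),
    "fragment_y": to_mapping_pairs(["center_a"], ["t1"]),
    "fragment_z": to_mapping_pairs(["center_b", "center_c"], ["t2", "t3"]),
}
sequence = initial_sequence(pairs_by_fragment)
```

The fragment assembled from a single source comes first, matching the reasoning that fewer sources mean a simpler and less risky provenance.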
Further, step S300 includes a method for calculating the activity of the different source data centers in the difference mapping pair set:
step S301: respectively acquiring historical browsing information, historical data downloading information and historical data fragment copying information of each source data center in a difference data pair; capturing the browsing rule of each browsing record in the historical browsing information, calculating the browsing frequency, and setting a standard browsing frequency fluctuation interval; accumulating the number of browses $Q_1$ that fall within the standard browsing frequency fluctuation interval and the number of browses $Q_2$ that do not;
step S302: establishing information associations between the historical browsing information and, respectively, the historical data downloading information and the historical data fragment copying information; accumulating the number of historical data downloads $L_1$ associated with $Q_1$ and the number $L_2$ associated with $Q_2$; accumulating the number of historical data fragment copies $H_1$ associated with $Q_1$ and the number $H_2$ associated with $Q_2$; calculating the activity $A$ of each source data center:
$$A = a \times Q_1 + b \times L_1 + c \times H_1$$
wherein $a = Q_1/Q_2$; $b = L_1/L_2$; $c = H_1/H_2$.
Analyzing and calculating the activity of a source data center lays the groundwork for predicting its credibility; when the data for credibility analysis are missing or insufficient, the activity serves as reference data for supplementary analysis, which helps ensure that the data sources used in the rapid search process are sound;
In the above steps, a, b and c are weight values. $Q_1/Q_2$ is used as the weight value a: even when $Q_2$ is far greater than $Q_1$, taking the ratio between $Q_1$ and $Q_2$ as the weight of $Q_1$ lets the linear activity calculation reflect both what $Q_1$ represents and the relation between $Q_1$ and $Q_2$; the larger $Q_1/Q_2$, the larger the activity $A$, and the smaller $Q_1/Q_2$, the smaller the share of $Q_1$ in the calculation and the smaller the activity $A$. In the same way, $L_1/L_2$ is used as the weight value b of $L_1$ and $H_1/H_2$ as the weight value c of $H_1$: the larger the ratio, the larger the activity $A$; the smaller the ratio, the smaller the share of the corresponding count in the calculation and the smaller the activity $A$.
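The activity formula can be checked with a short Python sketch. The function name and the sample counts are illustrative assumptions; the text does not say how a zero value of $Q_2$, $L_2$ or $H_2$ is handled, so the sketch simply lets the division fail in that case.

```python
def activity(q1, q2, l1, l2, h1, h2):
    """Compute A = a*Q1 + b*L1 + c*H1 with a = Q1/Q2, b = L1/L2, c = H1/H2."""
    # A zero denominator raises ZeroDivisionError; the source text leaves this case open.
    a = q1 / q2
    b = l1 / l2
    c = h1 / h2
    return a * q1 + b * l1 + c * h1


# Hypothetical counts: a = 4, b = 2, c = 0.5, so A = 160 + 12 + 1.
score = activity(q1=40, q2=10, l1=6, l2=3, h1=2, h2=4)
```

Because each weight is itself a ratio of the in-interval count to the out-of-interval count, every term grows quadratically with $Q_1$, $L_1$ or $H_1$ when the denominators are held fixed.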
Further, step S300 includes a method for predicting the credibility of the different source data centers in the difference mapping pair set:
step S311: calculating the fragment data information coverage rate of the different source data centers in each mapping data pair; the coverage rate is calculated as $P_1 = F/G$; wherein $F$ represents the number of times that a given source data center in the difference mapping pair set appears in the mapping data pairs of a given combined data information fragment; $G$ represents the total number of difference mapping pairs in the difference mapping pair set;
step S312: capturing page information of the different source data centers in each mapping data pair; capturing the frequency $P_2$ with which advertisement pages appear in the page information; capturing the frequency $P_3$ with which non-standard technical terms appear in the page information; a non-standard technical term is one that conforms neither to the technical terms specific to remote sensing data generated from a large database nor to their common alternative descriptions;
step S313: predicting the credibility value $K$ of the different source data centers according to the formula shown in Figure BDA0003474350490000051 (formula image not reproduced in this text);
In the credibility prediction analysis, the fragment data information coverage rate $P_1$ of the different source data centers in the mapping data pairs is taken as the basis of the analysis: by default, a source data center with a high data information coverage rate $P_1$ is assumed to carry a lower credibility risk, because this rate reflects the activity and the data coverage specialization of the source data center; this reduces the credibility analysis work spent on non-key source data centers during data retrieval. The frequency $P_2$ of advertisement pages and the frequency $P_3$ of non-standard technical terms are data that lower the credibility of a source data center: the larger the product of $P_2$ and $P_3$, the smaller the credibility value $K$; the smaller the product of $P_2$ and $P_3$, the larger the credibility value $K$.
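The coverage rate of step S311 can be sketched in Python as follows. The sketch covers only $P_1 = F/G$, because the formula for the credibility value $K$ is not available in the text. It assumes that mapping data pairs are (source data center, fragment data) tuples; the function name and sample data are hypothetical.

```python
def coverage_rate(center, fragment_pairs, difference_set):
    """Compute P1 = F / G for one source data center and one combined fragment."""
    if not difference_set:
        raise ValueError("the difference mapping pair set is empty")
    # F: occurrences of the center in the mapping data pairs of the combined fragment.
    f = sum(1 for pair_center, _ in fragment_pairs if pair_center == center)
    # G: total number of difference mapping pairs in the set.
    g = len(difference_set)
    return f / g


# Hypothetical data: center_a appears twice in the fragment; the set holds four pairs.
fragment_pairs = [("center_a", "tile_1"), ("center_a", "tile_2"), ("center_b", "tile_3")]
difference_set = [
    ("center_a", "tile_1"), ("center_b", "tile_1"),
    ("center_a", "tile_2"), ("center_c", "tile_2"),
]
p1 = coverage_rate("center_a", fragment_pairs, difference_set)
```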
Furthermore, capturing the browsing rule of each browsing record in the historical browsing information means capturing an effective browsing record in the historical browsing information; the determination of valid browsing records includes:
the historical browsing information contains records of users sliding through pages; when the average duration of a page slide is greater than a preset average duration threshold, the record is noted as a candidate valid record;
when a sliding record is captured, the interval dwell time on each page during sliding is captured as well; when the interval dwell time is longer than the preset interval dwell time, the record is noted as a valid record; when the interval dwell time is shorter than the preset interval dwell time, the total number of pages x and the total number of words y of the sliding record are acquired; when y:x is smaller than a preset ratio threshold, the record is noted as a valid record; when y:x is not smaller than the preset ratio threshold, that part of the sliding record is deleted.
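The decision on valid browsing records can be sketched as a small Python predicate. Two assumptions go beyond the text: each sliding record carries a single interval dwell time, and a short-dwell record is kept only when its words-per-page ratio y:x is below the threshold (a short stay is plausible on sparse pages but not on dense ones). The class, field and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SlideRecord:
    avg_slide_seconds: float  # average duration of a page slide
    dwell_seconds: float      # interval dwell time during sliding
    pages: int                # total number of pages x (assumed >= 1)
    words: int                # total number of words y


def is_valid(record, min_avg_slide, min_dwell, max_words_per_page):
    """Return True when the sliding record counts as a valid browsing record."""
    # Only records that slide slowly enough become candidate valid records.
    if record.avg_slide_seconds <= min_avg_slide:
        return False
    # A long enough dwell time makes the candidate valid outright.
    if record.dwell_seconds > min_dwell:
        return True
    # Short dwell: accept only sparse pages (assumed reading of the y:x rule).
    return record.words / record.pages < max_words_per_page


thresholds = dict(min_avg_slide=2.0, min_dwell=5.0, max_words_per_page=200)
long_dwell = is_valid(SlideRecord(3.0, 8.0, 4, 4000), **thresholds)
short_sparse = is_valid(SlideRecord(3.0, 1.0, 10, 500), **thresholds)
short_dense = is_valid(SlideRecord(3.0, 1.0, 2, 4000), **thresholds)
```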
Further, the step S400 of adjusting the priority ranking of each combined data information fragment again based on the liveness calculation result and the reliability prediction result includes:
step S401: acquiring the activity calculation result and the credibility calculation result of each source data center in the difference mapping pair set; sorting the source data centers in descending order of their credibility values; when, during sorting, credibility values are equal or differ by less than the credibility deviation threshold, ordering those source data centers by their activity calculation results;
step S402: traversing all combined data information fragments meeting the retrieval request, and locking and labeling those combined data information fragments that cover source data centers in the difference mapping pair set; when the ranking interval between two labeled combined data information fragments in the initial combined data information fragment sequence is smaller than an interval threshold, obtaining the rankings of the source data centers in each labeled combined data information fragment that belong to the difference mapping pair set and calculating an average ranking value; and re-adjusting the order of the two labeled combined data information fragments based on their average ranking values.
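The ordering rule of step S401 can be sketched in Python. Comparing values "within a deviation threshold" is not transitive, so the sketch makes one explicit assumption: source data centers are first sorted by credibility, then grouped greedily with the highest credibility in each group as the reference, and each group is ordered by activity. The function name and the scores are hypothetical.

```python
def rank_centers(scores, deviation):
    """Rank centers by credibility (descending), breaking near-ties by activity.

    scores maps a center name to a (credibility, activity) tuple.
    """
    by_credibility = sorted(scores, key=lambda c: scores[c][0], reverse=True)
    ranked, group = [], []
    for center in by_credibility:
        # Close the group once the gap to its top credibility exceeds the threshold.
        if group and scores[group[0]][0] - scores[center][0] > deviation:
            ranked.extend(sorted(group, key=lambda c: scores[c][1], reverse=True))
            group = []
        group.append(center)
    ranked.extend(sorted(group, key=lambda c: scores[c][1], reverse=True))
    return ranked


# Hypothetical scores: a and b are within the threshold, so activity decides.
scores = {"a": (0.90, 10), "b": (0.89, 50), "c": (0.60, 99)}
ranking = rank_centers(scores, deviation=0.02)
```

The resulting positions could then feed the average ranking value of step S402.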
In order to better realize the method, a quick search system is also provided, and the quick search system comprises: the system comprises a retrieval request acquisition and analysis module, a data crawling module, a mapping data pair generation module, a combined data information fragment sequencing primary adjustment module, a distinguishing data pair capturing module, a transit data center generation module and a combined data information fragment sequencing secondary adjustment module;
the retrieval request acquisition and analysis module is used for acquiring a retrieval request input by a user and analyzing the retrieval request to generate a combined retrieval tag corresponding to the retrieval request;
the data crawling module is used for receiving the combined retrieval tag data from the retrieval request acquisition and analysis module; the rapid search system generates a corresponding retrieval instruction based on the main tag in the combined retrieval tag and performs wide-range data crawling in the data centers of the remote sensing data websites to obtain a plurality of combined data information fragments meeting the retrieval request;
the mapping data pair generation module is used for receiving the combined data information fragment data in the data crawling module and decomposing a plurality of combined data information fragments meeting the retrieval request into a plurality of mapping data pairs between the source data center and the fragment data;
the combined data information fragment sequencing module is used for receiving mapping data pair data in the mapping data pair generating module, and sequencing and primarily adjusting the priority of each combined data information fragment based on the mapping data pair;
the transit data center generating module is used for receiving the data in the combined data information fragment sequencing module and storing the data and the sequencing information;
the combined data information fragment sequencing initial adjustment module is used for receiving data in the transit data center generation module and sorting all combined data information fragments in ascending order of their number of mapping data pairs to obtain an initial combined data information fragment sequence; and for performing a primary adjustment of the initial combined data information fragment sequence based on the secondary tags and auxiliary tags in the combined retrieval tags;
the distinguishing data pair capturing and analyzing module is used for capturing distinguishing data pairs of all mapping data pairs in the combined data information fragment meeting the retrieval request and gathering the captured distinguishing data pairs into a distinguishing mapping pair set; calculating and predicting the activity and the credibility of different source data centers in the difference mapping pair set;
and the combined data information fragment sequencing readjustment module is used for receiving the calculation data in the distinguishing data pair capture analysis module and readjusting the sequencing of each combined data information fragment based on the calculation data.
Further, the mapping data pair generation module includes: a source tracing unit and a data disassembling unit;
the source tracing unit is used for carrying out data center source tracing on each combined data information fragment meeting the retrieval request and collecting to obtain a source tracing set corresponding to each combined data information fragment;
and the data disassembling unit is used for receiving the data in the tracing unit and disassembling the fragment data of each combined data information fragment based on the tracing set to obtain a data fragment set.
Further, the distinguishing data pair capturing and analyzing module comprises: the activity degree calculating unit and the reliability degree predicting and calculating unit;
the activity calculation unit is used for acquiring historical browsing information, historical data downloading information and historical data fragment copying information of each source data center in the difference data pairs, and for calculating the activity of each source data center based on the historical browsing information, the historical data downloading information and the historical data fragment copying information;
the credibility prediction calculation unit is used for calculating the fragment data information coverage rate of the different source data centers in each mapping data pair and for capturing and analyzing the page information of those source data centers; and for performing the credibility prediction calculation of the different source data centers based on the fragment data information coverage rate and the captured and analyzed page information.
Compared with the prior art, the invention has the following beneficial effects: it overcomes the limited data resources of a single data center and, by establishing a transit data center, the information islands between data centers; every search request can retrieve data from the transit data center, where resources are highly concentrated, and the credibility of the data is judged there; the invention removes the need to search every data center one by one before a project is started, improves search efficiency, and thereby provides a quick and convenient index and search service, so that a user can quickly learn whether satellite remote sensing data meeting the business requirements exist and can judge the source and reliability of the data obtained; the soundness and rigor of data acquisition are improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart of a method for rapidly searching remote sensing big data according to the invention;
FIG. 2 is a schematic structural diagram of a rapid search system for remote sensing big data.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-2, the present invention provides the following technical solutions:
a quick search method for remote sensing big data comprises the following steps:
step S100: acquiring a retrieval request input by a user and analyzing the retrieval request to generate a combined retrieval tag corresponding to the retrieval request; the combined retrieval tag has the form {main tag; secondary tag; auxiliary tag}; the quick search system generates a corresponding retrieval instruction based on the main tag in the combined retrieval tag; the quick search system performs wide-range data crawling on the data centers of the remote sensing data websites based on the retrieval instruction to obtain a plurality of combined data information fragments meeting the retrieval request;
wherein, step S100 includes:
step S101: correspondingly decomposing a retrieval request input by a user into retrieval condition parameters of each part, wherein the retrieval condition parameters of each part comprise a region, a time range, cloud cover, resolution, a technical source and a field to be applied; the technical source refers to different remote sensors for obtaining various remote sensing data;
step S102: taking the two retrieval condition parameters of technical source and field to be applied as the auxiliary tag of the retrieval request; taking the retrieval condition parameters of region, time range and cloud cover as the main tag of the retrieval request; taking the resolution as the secondary tag of the retrieval request; combining the main tag, the secondary tag and the auxiliary tag into a combined retrieval tag of the form {main tag; secondary tag; auxiliary tag};
step S103: sending the retrieval condition parameters of each part in JSON format, the server interface receiving them with @RequestBody; the server parses the retrieval condition parameters in the main tag, wherein the parsing comprises processing the region coordinate information into a corresponding spatial data type through a GIS algorithm, and converting the received time range information from character string form into time information in a date format;
step S104: performing data crawling in a data center of each remote sensing data website in a large range based on regional information in a main tag of a retrieval request; using the crawled data as an initial data range; further screening data in an initial data range based on the time range and the cloud cover information in the main label of the retrieval request;
step S200: decomposing a plurality of combined data information fragments meeting the retrieval request into a plurality of mapping data pairs between the source data center and the fragment data; sequencing the priority of each combined data information fragment to obtain an initial combined data information fragment sequence; and carrying out primary adjustment on the initial combined data information fragment sequence; sending the combined data information fragment sequence after primary adjustment to a transit data center;
wherein, step S200 includes:
step S201: respectively tracing the source data centers of the plurality of combined data information fragments meeting the retrieval request to obtain a tracing set $\{A_1, A_2, \ldots, A_i, \ldots, A_n\}$ corresponding to each combined data information fragment; wherein $A_i$ represents the i-th source data center and $n$ represents the total number of source data centers; disassembling each combined data information fragment into fragment data based on its tracing set to obtain a data fragment set $\{B_1, B_2, \ldots, B_k, \ldots, B_m\}$; wherein $B_k$ represents the k-th fragment data and $m$ represents the total number of fragment data; and $n = m$;
step S202: respectively establishing a one-to-one mapping between the tracing set and the data fragment set of each combined data information fragment meeting the retrieval request to obtain a plurality of mapping data pairs $\{A_i, B_k\}$ with $i = k$; counting the number of mapping data pairs in each combined data information fragment meeting the retrieval request; sorting all combined data information fragments in ascending order of their number of mapping data pairs to obtain an initial combined data information fragment sequence;
step S203: extracting combined retrieval tag information of each combined data information fragment in the initial combined data information fragment sequence, and performing primary adjustment on the initial combined data information fragment sequence based on secondary tags and auxiliary tags in the combined retrieval tags of each combined data information fragment;
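The pairing and initial ordering of steps S201 and S202 can be sketched as follows. The function names and the use of plain strings for source data centers and fragment data are hypothetical; the primary adjustment of step S203 is not shown because the text does not state its rule.

```python
def mapping_pairs(trace_set: list[str], fragment_set: list[str]) -> list[tuple[str, str]]:
    """Pair the i-th source data center with the i-th fragment data (n = m, i = k)."""
    if len(trace_set) != len(fragment_set):
        raise ValueError("tracing set and data fragment set must have equal length")
    return list(zip(trace_set, fragment_set))


def initial_sequence(fragments: dict[str, list[tuple[str, str]]]) -> list[str]:
    """Order combined fragments by ascending number of mapping data pairs (S202)."""
    # sorted() is stable, so fragments with equal counts keep their input order.
    return sorted(fragments, key=lambda name: len(fragments[name]))
```

For example, a fragment with one mapping data pair is placed ahead of a fragment with three.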
step S300: traversing all mapping data pairs in the combined data information fragment sequence, capturing the difference data pairs which have the same fragment data and are mapped with different source data centers, and converging the captured difference data pairs into a difference mapping pair set; calculating and predicting the liveness and credibility of different source data centers in the difference mapping pair set;
wherein step S300 includes a method of calculating the liveness of different source data centers within the difference mapping pair set:
step S301: respectively acquiring historical browsing information, historical data downloading information and historical data fragment copying information of each source data center in the difference data pairs; capturing the browsing rule of each browsing record in the historical browsing information, calculating the browsing frequency, and setting a standard browsing frequency fluctuation interval; accumulating the number of browsing times $Q_1$ that fall within the standard browsing frequency fluctuation interval and the number of browsing times $Q_2$ that fall outside it;
capturing the browsing rule of each browsing record in the historical browsing information refers to capturing the valid browsing records in the historical browsing information; the determination of a valid browsing record includes:
the historical browsing information contains sliding records of users on pages; when the average duration of sliding a page is greater than a preset average duration threshold, the record is marked as a candidate valid record;
during capture of the sliding record, capturing the interval dwell time of each page in the sliding process; when the interval dwell time is longer than a preset interval dwell time, marking the record as a valid record; when the interval dwell time is shorter than the preset interval dwell time, acquiring the total number of pages x and the total number of words y of the sliding record; when y:x is smaller than a preset ratio threshold, marking the record as a valid record; when y:x is not smaller than the preset ratio threshold, deleting that part of the sliding record;
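The valid-record determination can be sketched as a single predicate. The sketch treats a low words-per-page ratio as the valid case and any other ratio as the case to delete; that reading of the ratio test is an assumption. Further assumptions: one dwell value per record, records at or below the average duration threshold are not valid, and a dwell time exactly equal to the preset value falls into the ratio test. The `SlideRecord` type and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SlideRecord:
    avg_slide_duration: float  # average duration of sliding a page, in seconds
    interval_dwell: float      # interval dwell time during sliding, in seconds
    total_pages: int           # x
    total_words: int           # y


def is_valid_record(r: SlideRecord, avg_threshold: float,
                    dwell_threshold: float, ratio_threshold: float) -> bool:
    """Return True when the sliding record counts as a valid browsing record."""
    if r.avg_slide_duration <= avg_threshold:
        return False  # never becomes a candidate valid record (assumed)
    if r.interval_dwell > dwell_threshold:
        return True   # long dwell: valid without further checks
    if r.total_pages == 0:
        return False  # no pages, so the ratio y:x is undefined
    # Short dwell: valid only if there were few words per page to read.
    return r.total_words / r.total_pages < ratio_threshold
```

Records for which the predicate returns False would be dropped before the browsing frequency is calculated.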
step S302: establishing information associations between the historical browsing information and the historical data downloading information and the historical data fragment copying information respectively; accumulating the number of historical data downloads $L_1$ associated with $Q_1$ and the number of historical data downloads $L_2$ associated with $Q_2$; accumulating the number of historical data fragment copies $H_1$ associated with $Q_1$ and the number of historical data fragment copies $H_2$ associated with $Q_2$; calculating the liveness $A$ of each source data center:
$A = a \times Q_1 + b \times L_1 + c \times H_1$
wherein $a = Q_2/Q_1$; $b = L_2/L_1$; $c = H_2/H_1$;
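The liveness formula can be written directly as a small function. This is a sketch that follows the formula as stated; the function name and the rejection of non-positive denominators are assumptions, since the text does not say how zero counts are handled.

```python
def liveness(q1: float, q2: float, l1: float, l2: float,
             h1: float, h2: float) -> float:
    """Compute A = a*Q1 + b*L1 + c*H1 with a = Q2/Q1, b = L2/L1, c = H2/H1."""
    if min(q1, l1, h1) <= 0:
        raise ValueError("Q1, L1 and H1 must be positive")  # assumed guard
    a = q2 / q1
    b = l2 / l1
    c = h2 / h1
    return a * q1 + b * l1 + c * h1
```

With Q1 = 100, Q2 = 20, L1 = 40, L2 = 10, H1 = 8 and H2 = 2, the three terms are 20, 10 and 2, giving a liveness of about 32. With the coefficients defined this way, each product equals the corresponding second count.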
Step S300 includes a method for predicting credibility of different source data centers in a set of distinct mapping pairs:
step S311: calculating the fragment data information coverage rate of the different source data centers in each mapping data pair, the coverage rate being $P_1 = F/G$; wherein $F$ represents the number of times a given source data center in the difference mapping pair set appears in the mapping data pairs of a given combined data information fragment; $G$ represents the total number of difference mapping pairs in the difference mapping pair set;
step S312: capturing page information of the different source data centers in each mapping data pair; capturing the frequency $P_2$ with which advertisement pages appear in the page information; capturing the frequency $P_3$ with which non-standard technical terms appear in the page information; a non-standard technical term is one that matches neither the remote-sensing-specific technical terms generated from a large database nor their common alternative descriptions;
step S313: predicting the credibility value of the different source data centers according to the formula shown in Figure BDA0003474350490000101 (formula image not reproduced in this text);
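The capture of difference data pairs in step S300 and the coverage rate of step S311 can be sketched together. The data layout (a dictionary from fragment name to a list of `(source data center, fragment data)` tuples) and the de-duplication of identical pairs are assumptions. The credibility formula itself is not sketched, because it is only available as an image.

```python
from collections import defaultdict

Pair = tuple[str, str]  # (source data center, fragment data)


def difference_pairs(fragments: dict[str, list[Pair]]) -> list[Pair]:
    """Collect pairs whose fragment data is mapped to more than one source data center."""
    centers_by_data: dict[str, set[str]] = defaultdict(set)
    for pairs in fragments.values():
        for center, data in pairs:
            centers_by_data[data].add(center)
    seen: set[Pair] = set()
    result: list[Pair] = []
    for pairs in fragments.values():
        for pair in pairs:
            if len(centers_by_data[pair[1]]) > 1 and pair not in seen:
                seen.add(pair)  # de-duplication of identical pairs is assumed
                result.append(pair)
    return result


def coverage(center: str, fragment_pairs: list[Pair], diff_set: list[Pair]) -> float:
    """P1 = F / G for one source data center and one combined fragment."""
    g = len(diff_set)  # G: number of pairs in the difference mapping pair set
    if g == 0:
        return 0.0
    f = sum(1 for c, _ in fragment_pairs if c == center)  # F: appearances
    return f / g
```

In this layout, fragment data held by only one source data center never enters the difference mapping pair set.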
step S400: the priority sequence of each combined data information fragment is adjusted again based on the activity calculation result and the reliability prediction result; the user retrieves and acquires data in the adjusted transfer data center; the user can generate user data feedback after the data is used; the rapid searching system stores the priority sequence of each combined data information fragment or adjusts the priority sequence of each combined data information fragment based on the user data feedback;
in step S400, readjusting the priority ranking of each combined data information fragment based on the liveness calculation result and the reliability prediction result includes:
step S401: acquiring the liveness calculation result and the credibility calculation result of each source data center in the difference mapping pair set; sorting the source data centers in descending order of credibility value; when two credibility values are equal, or differ by less than a credibility deviation threshold, ordering those source data centers by their liveness calculation results;
step S402: traversing all combined data information fragments meeting the retrieval request, and locking and labeling each combined data information fragment that contains a source data center belonging to the difference mapping pair set; when the sorting interval between two labeled combined data information fragments in the initial combined data information fragment sequence is smaller than an interval threshold, obtaining the ranking of the source data centers in each labeled fragment that belong to the difference mapping pair set, and calculating an average ranking value; readjusting the order of the two labeled combined data information fragments based on the magnitude of the average ranking values;
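One possible reading of steps S401 and S402 is sketched below. Grouping near-equal credibility values relative to the highest value in the group, ordering within a group by descending liveness, and using zero-based rank positions for the average are all assumptions; the text does not fix these details.

```python
def rank_centers(scores: dict[str, tuple[float, float]], deviation: float) -> list[str]:
    """Rank source data centers; scores maps center -> (credibility, liveness)."""
    by_cred = sorted(scores, key=lambda c: scores[c][0], reverse=True)
    ranked: list[str] = []
    group: list[str] = []
    for center in by_cred:
        if group:
            gap = scores[group[0]][0] - scores[center][0]
            if gap != 0 and gap >= deviation:
                # Credibility differs enough: close the group, ordered by liveness.
                ranked.extend(sorted(group, key=lambda c: scores[c][1], reverse=True))
                group = []
        group.append(center)
    ranked.extend(sorted(group, key=lambda c: scores[c][1], reverse=True))
    return ranked


def average_rank(fragment_centers: list[str], ranked: list[str]) -> float:
    """Average zero-based rank of a labeled fragment's ranked source data centers."""
    positions = [ranked.index(c) for c in fragment_centers if c in ranked]
    if not positions:
        raise ValueError("fragment has no source data center in the ranking")
    return sum(positions) / len(positions)
```

Of two labeled fragments that sit closer than the interval threshold in the initial sequence, the one with the smaller average rank would then be placed first under this reading.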
in order to better realize the method, a quick search system is also provided, and the quick search system comprises: the system comprises a retrieval request acquisition and analysis module, a data crawling module, a mapping data pair generation module, a combined data information fragment sequencing primary adjustment module, a distinguishing data pair capturing module, a transit data center generation module and a combined data information fragment sequencing secondary adjustment module;
the retrieval request acquisition and analysis module is used for acquiring a retrieval request input by a user and analyzing the retrieval request to generate a combined retrieval tag corresponding to the retrieval request;
the data crawling module is used for receiving the combined retrieval tag data from the retrieval request acquisition and analysis module; the rapid search system generates a corresponding retrieval instruction based on the main tag in the combined retrieval tag and performs data crawling in the data centers of the remote sensing data websites in a large range to obtain a plurality of combined data information fragments meeting the retrieval request;
the mapping data pair generation module is used for receiving the combined data information fragment data in the data crawling module and decomposing a plurality of combined data information fragments meeting the retrieval request into a plurality of mapping data pairs between the source data center and the fragment data;
wherein the mapping data pair generation module comprises: a source tracing unit and a data disassembling unit;
the source tracing unit is used for carrying out data center source tracing on each combined data information fragment meeting the retrieval request and collecting to obtain a source tracing set corresponding to each combined data information fragment;
the data disassembling unit is used for receiving the data in the tracing unit and disassembling the fragment data of each combined data information fragment based on the tracing set to obtain a data fragment set;
the combined data information fragment sequencing module is used for receiving mapping data pair data in the mapping data pair generating module, and sequencing and primarily adjusting the priority of each combined data information fragment based on the mapping data pair;
the transit data center generating module is used for receiving the data in the combined data information fragment sequencing module and storing the data and the sequencing information;
the combined data information fragment sequencing initial adjustment module is used for receiving data in the transit data center generation module and sequencing all combined data information fragments from small to large based on respective mapping data logarithms to obtain an initial combined data information fragment sequence; performing primary adjustment on the initial combined data information fragment sequence based on secondary labels and auxiliary labels in the combined retrieval labels;
the distinguishing data pair capturing and analyzing module is used for capturing distinguishing data pairs of all mapping data pairs in the combined data information fragment meeting the retrieval request and gathering the captured distinguishing data pairs into a distinguishing mapping pair set; calculating and predicting the activity and the credibility of different source data centers in the difference mapping pair set;
wherein, the distinguishing data pair capturing and analyzing module comprises: the activity degree calculating unit and the reliability degree predicting and calculating unit;
the liveness calculation unit is used for acquiring historical browsing information, historical data downloading information and historical data fragment copying information of each source data center in the difference data pairs, and calculating the liveness of each source data center based on the historical browsing information, the historical data downloading information and the historical data fragment copying information;
the credibility prediction calculation unit is used for calculating the fragment data information coverage rate of the different source data centers in each mapping data pair and for capturing and analyzing their page information, and for performing the credibility prediction calculation of the different source data centers based on the fragment data information coverage rate and the captured and analyzed page information;
the combined data information fragment sequencing readjustment module is used for receiving the calculation data in the distinguishing data pair capturing and analyzing module and readjusting the sequencing of each combined data information fragment based on the calculation data.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A quick search method for remote sensing big data is characterized by comprising the following steps:
step S100: acquiring a retrieval request input by a user, analyzing the retrieval request and generating a combined retrieval tag corresponding to the retrieval request; the form of the combined search tag is such as { master tag; a secondary label; auxiliary tags }; the quick search system generates a corresponding retrieval instruction based on the main label in the combined retrieval label; the rapid searching system carries out data crawling on the data center of each remote sensing data website in a large range based on the retrieval instruction to obtain a plurality of combined data information fragments meeting the retrieval request;
step S200: decomposing the combined data information fragments meeting the retrieval request into a plurality of mapping data pairs between the source data centers and the fragment data; sequencing the priority of each combined data information fragment to obtain an initial combined data information fragment sequence; and carrying out primary adjustment on the initial combined data information fragment sequence; sending the combined data information fragment sequence after primary adjustment to a transit data center;
step S300: traversing all mapping data pairs in the combined data information fragment sequence, capturing the difference data pairs which have the same fragment data and are mapped with different source data centers, and converging the captured difference data pairs into a difference mapping pair set; calculating and predicting the liveness and credibility of different source data centers in the difference mapping pair set;
step S400: readjusting the sequence of the combined data information fragments based on the liveness calculation result and the reliability prediction result; and the user retrieves and acquires the data in the readjusted transit data center.
2. The method for rapidly searching remote sensing big data according to claim 1, characterized in that: the step S100 includes:
step S101: correspondingly decomposing the retrieval request input by the user into retrieval condition parameters of each part, wherein the retrieval condition parameters of each part comprise a region, a time range, cloud cover, resolution, a technical source and a field to be applied; the technical source refers to different remote sensors for obtaining various remote sensing data;
step S102: taking the technical source and the field to be applied as the auxiliary tag of the retrieval request; taking the region, the time range and the cloud cover as the main tag of the retrieval request; taking the resolution as the secondary tag of the retrieval request; combining the main tag, the secondary tag and the auxiliary tag into a combined retrieval tag of the form {main tag; secondary tag; auxiliary tag};
step S103: sending the retrieval condition parameters of each part in JSON format, which the server interface receives using @RequestBody; the server parses the retrieval condition parameters in the main tag, wherein the parsing comprises converting the region coordinate information into a corresponding spatial data type through a GIS algorithm, and converting the received time range information from character-string form into time information in date format;
step S104: performing data crawling in a data center of each remote sensing data website in a large range based on the regional information in the main label; using the crawled data as an initial data range; and further screening data in the initial data range based on the time range and cloud cover information in the main label.
3. The method for rapidly searching remote sensing big data according to claim 1, characterized in that: the step S200 includes:
step S201: respectively tracing the source data centers of the plurality of combined data information fragments meeting the retrieval request to obtain a tracing set $\{A_1, A_2, \ldots, A_i, \ldots, A_n\}$ corresponding to each combined data information fragment; wherein $A_i$ represents the i-th source data center and $n$ represents the total number of source data centers; disassembling each combined data information fragment into fragment data based on its tracing set to obtain a data fragment set $\{B_1, B_2, \ldots, B_k, \ldots, B_m\}$; wherein $B_k$ represents the k-th fragment data and $m$ represents the total number of fragment data; and $n = m$;
step S202: respectively establishing a one-to-one mapping between the tracing set and the data fragment set of each combined data information fragment meeting the retrieval request to obtain a plurality of mapping data pairs $\{A_i, B_k\}$ with $i = k$; counting the number of mapping data pairs in each combined data information fragment meeting the retrieval request; sorting all combined data information fragments in ascending order of their number of mapping data pairs to obtain an initial combined data information fragment sequence;
step S203: and extracting combined retrieval tag information of each combined data information fragment in the initial combined data information fragment sequence, and performing primary adjustment on the initial combined data information fragment sequence based on secondary tags and auxiliary tags in the combined retrieval tags of each combined data information fragment.
4. The method for rapidly searching remote sensing big data according to claim 1, characterized in that: the step S300 includes a method of calculating liveness for different source data centers within the set of distinct mapping pairs:
step S301: respectively acquiring historical browsing information, historical data downloading information and historical data fragment copying information of each source data center in the difference data pairs; capturing the browsing rule of each browsing record in the historical browsing information, calculating the browsing frequency, and setting a standard browsing frequency fluctuation interval; accumulating the number of browsing times $Q_1$ that fall within the standard browsing frequency fluctuation interval and the number of browsing times $Q_2$ that fall outside it;
step S302: establishing information associations between the historical browsing information and the historical data downloading information and the historical data fragment copying information respectively; accumulating the number of historical data downloads $L_1$ associated with $Q_1$ and the number of historical data downloads $L_2$ associated with $Q_2$; accumulating the number of historical data fragment copies $H_1$ associated with $Q_1$ and the number of historical data fragment copies $H_2$ associated with $Q_2$; calculating the liveness $A$ of each source data center:
$A = a \times Q_1 + b \times L_1 + c \times H_1$
wherein $a = Q_2/Q_1$; $b = L_2/L_1$; $c = H_2/H_1$.
5. The method for rapidly searching remote sensing big data according to claim 1, characterized in that: the step S300 includes a method of predicting credibility of different source data centers within the set of distinct mapping pairs:
step S311: calculating the fragment data information coverage rate of the different source data centers in each mapping data pair, the coverage rate being $P_1 = F/G$; wherein $F$ represents the number of times a given source data center in the difference mapping pair set appears in the mapping data pairs of a given combined data information fragment; $G$ represents the total number of difference mapping pairs in the difference mapping pair set;
step S312: capturing page information of the different source data centers in each mapping data pair; capturing the frequency $P_2$ with which advertisement pages appear in the page information; capturing the frequency $P_3$ with which non-standard technical terms appear in the page information; a non-standard technical term is one that matches neither the remote-sensing-specific technical terms generated from a large database nor their common alternative descriptions;
step S313: predicting the credibility value of the different source data centers according to the formula shown in Figure FDA0003474350480000031 (formula image not reproduced in this text).
6. The method for rapidly searching remote sensing big data according to claim 4, characterized in that: the step of capturing the browsing rule of each browsing record in the historical browsing information refers to capturing an effective browsing record in the historical browsing information; the judging of the effective browsing record comprises:
the historical browsing information contains sliding records of users on pages; when the average duration of sliding a page is greater than a preset average duration threshold, the record is marked as a candidate valid record;
during capture of the sliding record, capturing the interval dwell time of each page in the sliding process; when the interval dwell time is longer than a preset interval dwell time, marking the record as a valid record; when the interval dwell time is shorter than the preset interval dwell time, acquiring the total number of pages x and the total number of words y of the sliding record; when y:x is smaller than a preset ratio threshold, marking the record as a valid record; when y:x is not smaller than the preset ratio threshold, deleting that part of the sliding record.
7. The method for rapidly searching remote sensing big data according to claim 1, characterized in that: the step S400 of adjusting the priority ranking of each combined data information fragment again based on the liveness calculation result and the reliability prediction result includes:
step S401: acquiring the liveness calculation result and the credibility calculation result of each source data center in the difference mapping pair set; sorting the source data centers in descending order of credibility value; when two credibility values are equal, or differ by less than a credibility deviation threshold, ordering those source data centers by their liveness calculation results;
step S402: traversing all combined data information fragments meeting the retrieval request, and locking and labeling each combined data information fragment that contains a source data center belonging to the difference mapping pair set; when the sorting interval between two labeled combined data information fragments in the initial combined data information fragment sequence is smaller than an interval threshold, obtaining the ranking of the source data centers in each labeled fragment that belong to the difference mapping pair set, and calculating an average ranking value; and readjusting the order of the two labeled combined data information fragments based on the magnitude of the average ranking values.
8. A fast search system applied to the fast search method of the remote sensing big data of any one of claims 1 to 7, characterized in that: the quick search system includes: the system comprises a retrieval request acquisition and analysis module, a data crawling module, a mapping data pair generation module, a combined data information fragment sequencing primary adjustment module, a distinguishing data pair capturing module, a transit data center generation module and a combined data information fragment sequencing secondary adjustment module;
the retrieval request acquisition and analysis module is used for acquiring a retrieval request input by a user and analyzing the retrieval request to generate a combined retrieval tag corresponding to the retrieval request;
the data crawling module is used for receiving the combined retrieval tag data in the retrieval request acquisition and analysis module, and the rapid search system generates corresponding retrieval instructions based on main tags in the combined retrieval tags to perform data crawling in a data center of each remote sensing data website in a large range to obtain a plurality of combined data information fragments meeting the retrieval request;
the mapping data pair generation module is used for receiving the combined data information fragment data in the data crawling module and decomposing the combined data information fragments meeting the retrieval request into a plurality of mapping data pairs between the source data centers and the fragment data;
the combined data information fragment sequencing module is used for receiving mapping data pair data in the mapping data pair generating module, and sequencing and primarily adjusting the priority of each combined data information fragment based on the mapping data pair;
the transit data center generating module is used for receiving the data in the combined data information fragment sequencing module and storing the data and sequencing information;
the combined data information fragment sequencing initial adjustment module is used for receiving the data in the transit data center generation module and sequencing all combined data information fragments from small to large based on respective mapping data logarithm to obtain an initial combined data information fragment sequence; performing primary adjustment on the initial combined data information fragment sequence based on a secondary label and an auxiliary label in the combined retrieval label;
the distinguishing data pair capturing and analyzing module is used for capturing distinguishing data pairs of all mapping data pairs in the combined data information fragment meeting the retrieval request and gathering the captured distinguishing data pairs into a distinguishing mapping pair set; calculating and predicting the liveness and credibility of different source data centers in the difference mapping pair set;
and the combined data information fragment sequencing readjustment module is used for receiving the calculated data in the distinguishing data pair capture analysis module and readjusting the sequencing of each combined data information fragment based on the calculated data.
9. The system for rapidly searching remote sensing big data according to claim 8, wherein: the mapping data pair generation module comprises: a source tracing unit and a data disassembling unit;
the source tracing unit is used for carrying out data center source tracing on each combined data information fragment meeting the retrieval request and collecting to obtain a source tracing set corresponding to each combined data information fragment;
and the data disassembling unit is used for receiving the data in the tracing unit and disassembling the data of each combined data information fragment based on the tracing set to obtain a data fragment set.
10. The system for rapidly searching remote sensing big data according to claim 8, wherein: the distinguishing data pair capturing and analyzing module comprises: the activity degree calculating unit and the reliability degree predicting and calculating unit;
the liveness calculation unit is used for acquiring historical browsing information, historical data downloading information and historical data fragment copying information of each source data center in the difference data pairs, and calculating the liveness of each source data center based on the historical browsing information, the historical data downloading information and the historical data fragment copying information;
the credibility prediction calculation unit is used for calculating the fragment data information coverage rate of the different source data centers in each mapping data pair and for capturing and analyzing their page information, and for performing the credibility prediction calculation of the different source data centers based on the fragment data information coverage rate and the captured and analyzed page information.
CN202210051029.7A 2022-01-17 2022-01-17 Quick search system and method for remote sensing big data Active CN114372185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210051029.7A CN114372185B (en) 2022-01-17 2022-01-17 Quick search system and method for remote sensing big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210051029.7A CN114372185B (en) 2022-01-17 2022-01-17 Quick search system and method for remote sensing big data

Publications (2)

Publication Number Publication Date
CN114372185A true CN114372185A (en) 2022-04-19
CN114372185B CN114372185B (en) 2024-03-19

Family

ID=81143225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210051029.7A Active CN114372185B (en) 2022-01-17 2022-01-17 Quick search system and method for remote sensing big data

Country Status (1)

Country Link
CN (1) CN114372185B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050222987A1 (en) * 2004-04-02 2005-10-06 Vadon Eric R Automated detection of associations between search criteria and item categories based on collective analysis of user activity data
CN102081669A (en) * 2011-01-24 2011-06-01 哈尔滨工业大学 Hierarchical retrieval method for multi-source remote sensing resource heterogeneous databases
CN103036956A (en) * 2012-11-30 2013-04-10 航天恒星科技有限公司 Filing system and implement method of distributed configured massive data
US20160379388A1 (en) * 2014-07-16 2016-12-29 Digitalglobe, Inc. System and method for combining geographical and economic data extracted from satellite imagery for use in predictive modeling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050222987A1 (en) * 2004-04-02 2005-10-06 Vadon Eric R Automated detection of associations between search criteria and item categories based on collective analysis of user activity data
CN102081669A (en) * 2011-01-24 2011-06-01 哈尔滨工业大学 Hierarchical retrieval method for multi-source remote sensing resource heterogeneous databases
CN103036956A (en) * 2012-11-30 2013-04-10 航天恒星科技有限公司 Filing system and implement method of distributed configured massive data
US20160379388A1 (en) * 2014-07-16 2016-12-29 Digitalglobe, Inc. System and method for combining geographical and economic data extracted from satellite imagery for use in predictive modeling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
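Zuo Xianyu; Xiong Minghao; Huang Xiangzhi; Zang Wenqian; Shang Dongdong: "A one-time full-coverage retrieval mode and method for remote sensing tile data", Journal of Henan University (Natural Science Edition), no. 03, 16 May 2018 (2018-05-16) *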

Also Published As

Publication number Publication date
CN114372185B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN102541999B (en) Object-sensitive image search
Gray et al. Country-specific effects of climate variability on human migration
Tang et al. Big data in forecasting research: a literature review
Garcillán et al. Sampling procedures and species estimation: testing the effectiveness of herbarium data against vegetation sampling in an oceanic island
CN105765559A (en) Interactive case management system
US20090198681A1 (en) Real property evaluation and scoring method and system
CN106383887A (en) Environment-friendly news data acquisition and recommendation display method and system
Genave et al. An assessment of energy vulnerability in Small Island Developing States
CN113313170B (en) Full-time global training big data platform based on artificial intelligence
CN104679827A (en) Big data-based public information association method and mining engine
Zhang Application of data mining technology in digital library.
CN116384889A (en) Intelligent analysis method for information big data based on natural language processing technology
CN111143689A (en) Method for constructing recommendation engine according to user requirements and user portrait
Goncalves et al. Gathering alumni information from a web social network
JE et al. The Polar Data Catalogue: best practices for sharing and archiving Canada's polar data
CN105512224A (en) Search engine user satisfaction automatic assessment method based on cursor position sequence
Nagdive et al. Web server log analysis for unstructured data using apache flume and pig
CN114372185B (en) Quick search system and method for remote sensing big data
CN115510074B (en) Distributed data management and application system based on table
CN115379308B (en) Internet of things equipment data acquisition system based on satellite remote communication
CN115982429A (en) Knowledge management method and system based on flow control
CN111027771A (en) Scenic spot passenger flow volume estimation method, system and device and storable medium
Bai RETRACTED ARTICLE: Data cleansing method of talent management data in wireless sensor network based on data mining technology
CN111143653B (en) Credibility verification method for mass science popularization resources
CN115239060A (en) Airworthiness approval risk assessment system and method based on big data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant