CN112988863A - Elasticsearch-based efficient search engine method for heterogeneous multiple data sources - Google Patents
- Publication number
- CN112988863A
- Authority
- CN
- China
- Prior art keywords
- data
- index
- search
- database
- elasticsearch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/252—Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9538—Presentation of query results
Abstract
The invention discloses an efficient search engine method for heterogeneous multiple data sources based on Elasticsearch. It provides efficient full-text search across heterogeneous data sources, covering information held in a spatio-temporal object database (relational and non-relational databases), a distributed file system, and an information resource management system. The search engine consists of an index database, an index document structure, a data collector, and a search interface. The method comprises the following steps: first, the data collector collects the content data to be searched in the system and organizes it into corresponding index documents; then the constructed index documents are stored in the index database; finally, searches and queries are performed through the search interface. The invention can bring the data of multiple enterprise-level data sources of different types together in one system and provides support for data fusion analysis.
Description
Technical Field
The invention relates to the technical field of information search, and in particular to an efficient search engine method for heterogeneous multiple data sources based on Elasticsearch.
Background
At present, commercial search engines such as Baidu use a dedicated computer program to collect information from the internet according to a certain policy, organize and process that information, provide retrieval services to users, and display the retrieved information to them. Such a system consists of four parts: a searcher, an indexer, a retriever, and a user interface. The searcher roams the internet to discover and gather information. The indexer interprets the information gathered by the searcher, extracts index terms from it to represent the documents, and generates an index table for the document repository. The retriever quickly retrieves documents from the index base according to a user's query, evaluates the relevance of each document to the query, sorts the results to be output, and implements a user relevance feedback mechanism. The user interface accepts user queries, displays query results, and provides the user relevance feedback mechanism.
Enterprise-level users own information resources of many types. For example, a geographic information system enterprise manages a multi-granularity spatio-temporal object library. The managed content covers the spatio-temporal reference, spatial position, spatial form, association relations, composition structure, behavior, cognitive ability, attribute characteristics, and other aspects of the multi-granularity spatio-temporal objects. This information is stored in various databases such as PostgreSQL, MySQL, and MongoDB and in the distributed file system HDFS, and it also includes information resources displayed on the enterprise website and the like. The spatio-temporal object database cannot quickly query the required information across all of these resources, and this problem urgently needs to be solved.
Disclosure of Invention
In order to achieve efficient search over multiple data sources on the basis of a spatio-temporal object database system, a distributed file system, and an information resource management system, and to address the limited search capability of the database, the invention provides an efficient search engine method for heterogeneous multiple data sources based on Elasticsearch. The specific technical scheme is as follows:
an efficient search engine method for heterogeneous multiple data sources based on Elasticsearch comprises an index database, index documents, a data collector, and a search interface, wherein the data collector collects the content data to be searched in the system and organizes it into corresponding index documents, the constructed index documents are stored in the index database, and finally a user searches and queries through the search interface;
the index database, Elasticsearch, is used together with a relational or non-relational database, and large volumes of data held in Hadoop are processed by means of the real-time search and analysis functions of Elasticsearch and the Elasticsearch-Hadoop (ES-Hadoop) connector;
the index document structure adopts JSON, the index document type supported by Elasticsearch; a spatio-temporal object can be created as an index document in JSON format through Elasticsearch; the document content types supported by the search engine comprise multi-granularity spatio-temporal objects and the resources contained in the integrated development framework resource service; a document index is constructed by creating JSON objects for these two kinds of data content, each object serving as one JSON document, and the index is established;
the data collector collects the content data to be searched in the system, organizes it into corresponding index documents, and, through a timed task, actively and periodically captures the multi-granularity spatio-temporal objects and the data in the integrated development framework resource service for document storage.
The search interface receives a search request initiated by a user through a user terminal, obtains the corresponding search results from the index database according to the request, and returns them to the user terminal; the user performs search, query, and other operations through the engine interface.
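As an illustration of the one-object-per-JSON-document structure described above, the following Python sketch builds an index document from a single record. The field names (`object_id`, `attributes`, `updated_at`), the source label, and the sample record are hypothetical; the patent does not specify a document schema.

```python
import json
from datetime import datetime, timezone


def build_index_document(record, source):
    """Turn one source record into one JSON-serializable index document.

    The schema below is an assumption made for illustration only.
    """
    return {
        "source": source,
        "object_id": str(record["id"]),
        "name": record.get("name", ""),
        "attributes": record.get("attributes", {}),
        # Store the update time as an ISO 8601 string in UTC.
        "updated_at": record["updated_at"].astimezone(timezone.utc).isoformat(),
    }


# Hypothetical record standing in for one multi-granularity spatio-temporal object.
record = {
    "id": 42,
    "name": "river segment",
    "attributes": {"granularity": "fine"},
    "updated_at": datetime(2021, 2, 9, 8, 0, tzinfo=timezone.utc),
}

doc = build_index_document(record, source="spatiotemporal_objects")
print(json.dumps(doc, indent=2))
```

Keeping the document a plain dictionary means it can be serialized with `json.dumps` and sent to the index database without further conversion.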
The method comprises the following specific steps:
in the first step, the data collector collects the content data to be searched in the system and, through a timed task, actively and periodically captures data from the multi-granularity spatio-temporal object database, the distributed file system, and the resource management information system. The timed task records the time of each capture with an internal timestamp and uses it to decide whether the currently captured data should be parsed. If the update time of the data in a data source is earlier than the last capture time, no processing is performed; if the update time is later than the last capture time, the related content is captured and the method proceeds to the second step. To capture data, a separate microservice is established for each data source, a connection to the data source is established, and the data is captured through the access interface that matches that data source;
in the second step, the captured data is parsed and processed, and the content data is organized into corresponding index documents: for data captured from different data sources, index documents are generated under the index corresponding to each data source, with one database record producing one index document;
in the third step, the constructed index documents are stored in the Elasticsearch cluster of the index database. To meet the requirement of highly concurrent access, the number of clusters is more than one; Elasticsearch indexes all fields and, after processing, writes an inverted index. When data is searched, the index is searched directly;
finally, the user performs a search query through the search interface, which converts the query conditions into an Elasticsearch query request and sends it to Elasticsearch. The search interface supports keyword search, multiple search keywords, and logical operations over multiple keywords; it outputs the matched results in a defined order, lets the user select the sorting rule, and provides a search relevance feedback mechanism.
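The conversion of keyword conditions into an Elasticsearch query request in the final step could look roughly like the Python sketch below, which builds a `bool` query body in the Elasticsearch query DSL. The function name, the searched fields (`name`, `content`), and the sort field are assumptions for illustration; the patent does not describe the request it actually issues.

```python
def build_search_request(all_of=(), any_of=(), none_of=(),
                         fields=("name", "content"), sort_by=None):
    """Build a query DSL body combining keywords with AND / OR / NOT logic."""

    def match(keyword):
        # multi_match searches one keyword across several fields.
        return {"multi_match": {"query": keyword, "fields": list(fields)}}

    bool_query = {
        "must": [match(k) for k in all_of],        # AND
        "should": [match(k) for k in any_of],      # OR
        "must_not": [match(k) for k in none_of],   # NOT
    }
    if any_of:
        # Require at least one of the OR keywords to match.
        bool_query["minimum_should_match"] = 1

    body = {"query": {"bool": bool_query}}
    if sort_by:
        # Optional user-selected ordering; relevance score breaks ties.
        body["sort"] = [{sort_by: {"order": "desc"}}, "_score"]
    return body


request = build_search_request(
    all_of=["river"], any_of=["bridge", "dam"], none_of=["draft"],
    sort_by="updated_at",
)
print(request)
```

When `sort_by` is omitted the body carries no `sort` clause, so results come back in Elasticsearch's default relevance order.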
Further, the index database and the index documents use the Elasticsearch search engine.
Further, Elasticsearch performs complex search queries based on the data content; data only needs to be added to or updated in Elasticsearch.
Further, the index database also comprises Spark; Spark reads data from Elasticsearch through ES-Hadoop, and the database serves as a persistent storage component that can provide constraints, accuracy guarantees, and robustness.
Further, the timed task records the time of each capture with an internal timestamp and decides whether the currently captured data should be stored in the index.
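The timestamp check performed by the timed task can be sketched in Python as follows. This is a minimal in-memory illustration: the class name, the record layout, and the choice to treat an update time equal to the last capture time as unchanged are all assumptions, since the text only covers the "earlier" and "later" cases.

```python
from datetime import datetime, timezone


class CaptureTask:
    """Remembers the last capture time and returns only newer records."""

    def __init__(self):
        # Before the first run, every record counts as new.
        self.last_capture = datetime.min.replace(tzinfo=timezone.utc)

    def run(self, records, now):
        # Keep records updated strictly after the previous capture.
        changed = [r for r in records if r["updated_at"] > self.last_capture]
        self.last_capture = now  # internal timestamp for the next run
        return changed


def utc(hour, minute=0):
    return datetime(2021, 2, 9, hour, minute, tzinfo=timezone.utc)


records = [
    {"id": 1, "updated_at": utc(7)},
    {"id": 2, "updated_at": utc(7, 30)},
]

task = CaptureTask()
first = task.run(records, now=utc(8))     # first run: everything is captured

records[1]["updated_at"] = utc(9)         # record 2 changes after the first run
second = task.run(records, now=utc(10))   # second run: only record 2 is captured

print([r["id"] for r in first], [r["id"] for r in second])
```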
The above technical scheme has the following beneficial effects:
Elasticsearch is an open-source full-text search engine that can quickly store, search, and analyze large amounts of data. Elasticsearch allows multiple servers to work together; each server can run multiple Elasticsearch instances, and a group of instances forms a cluster. Based on Elasticsearch, the invention performs fast full-text search over enterprise-level information resources, including various types of databases (relational and non-relational), a distributed file system, an information resource management system, and the like.
Drawings
FIG. 1 is a diagram of a search engine architecture of the present invention;
FIG. 2 is a diagram of a data collector of the present invention;
FIG. 3 is a diagram of a data acquisition timing task processor according to the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-3, the efficient search engine method for heterogeneous multiple data sources based on Elasticsearch comprises an index database, index documents, a data collector, and a search interface, wherein the data collector collects the content data to be searched in the system and organizes it into corresponding index documents, the constructed index documents are then stored in the index database, and finally a user performs search queries through the search interface;
the index database, Elasticsearch, is used together with a relational or non-relational database, and large volumes of data held in Hadoop are processed by means of the real-time search and analysis functions of Elasticsearch and the Elasticsearch-Hadoop (ES-Hadoop) connector;
the index document structure adopts JSON, the index document type supported by Elasticsearch; a spatio-temporal object can be created as an index document in JSON format through Elasticsearch; the document content types supported by the search engine comprise multi-granularity spatio-temporal objects and the resources contained in the integrated development framework resource service; a document index is constructed by creating JSON objects for these two kinds of data content, each object serving as one JSON document, and the index is established;
the data collector collects the content data to be searched in the system, organizes it into corresponding index documents, and, through a timed task, actively and periodically captures the multi-granularity spatio-temporal objects and the data in the integrated development framework resource service for document storage. The search interface receives a search request initiated by a user through a user terminal, obtains the corresponding search results from the index database according to the request, and returns them to the user terminal; the user performs search, query, and other operations through the engine interface.
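One way to picture a data collector that serves several kinds of data source is a common interface with one implementation per source, as in the Python sketch below. The class names, index names, and in-memory stand-ins for a database and a file system are hypothetical; a real collector would connect to each source through that source's own access interface.

```python
from abc import ABC, abstractmethod


class SourceCollector(ABC):
    """Common interface: each data source gets its own collector."""

    index_name = ""

    @abstractmethod
    def fetch(self):
        """Return the source's records as a list of dictionaries."""


class RelationalCollector(SourceCollector):
    index_name = "relational_objects"   # assumed index name

    def __init__(self, rows):
        self._rows = rows               # stand-in for a database query result

    def fetch(self):
        return [dict(row) for row in self._rows]


class FileSystemCollector(SourceCollector):
    index_name = "files"                # assumed index name

    def __init__(self, paths):
        self._paths = paths             # stand-in for a directory listing

    def fetch(self):
        return [{"id": p, "name": p.rsplit("/", 1)[-1]} for p in self._paths]


def collect_all(collectors):
    """Gather records from every source, grouped by target index."""
    return {c.index_name: c.fetch() for c in collectors}


batches = collect_all([
    RelationalCollector([{"id": 1, "name": "river segment"}]),
    FileSystemCollector(["/data/reports/report.txt"]),
])
print(batches)
```

Grouping the records by index name keeps each source's data separate until it is turned into index documents.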
The method comprises the following specific steps:
in the first step, the data collector collects the content data to be searched in the system and, through a timed task, actively and periodically captures data from the multi-granularity spatio-temporal object database, the distributed file system, and the resource management information system. The timed task records the time of each capture with an internal timestamp and uses it to decide whether the currently captured data should be parsed. If the update time of the data in a data source is earlier than the last capture time, no processing is performed; if the update time is later than the last capture time, the related content is captured and the method proceeds to the second step. To capture data, a separate microservice is established for each data source, a connection to the data source is established, and the data is captured through the access interface that matches that data source;
in the second step, the captured data is parsed and processed, and the content data is organized into corresponding index documents: for data captured from different data sources, index documents are generated under the index corresponding to each data source, with one database record producing one index document;
in the third step, the constructed index documents are stored in the Elasticsearch cluster of the index database. To meet the requirement of highly concurrent access, the number of clusters is more than one; Elasticsearch indexes all fields and, after processing, writes an inverted index. When data is searched, the index is searched directly;
finally, the user performs a search query through the search interface, which converts the query conditions into an Elasticsearch query request and sends it to Elasticsearch. The search interface supports keyword search, multiple search keywords, and logical operations over multiple keywords; it outputs the matched results in a defined order, lets the user select the sorting rule, and provides a search relevance feedback mechanism.
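To show why searching an inverted index is direct, the toy Python sketch below maps each term to the set of documents containing it and answers a multi-keyword AND query by intersecting those sets. It is a simplified illustration only: Elasticsearch's real index adds text analysis, scoring, and on-disk structures that are not modeled here, and the sample documents are invented.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, keywords):
    """Return ids of documents that contain every keyword (logical AND)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()


documents = {
    1: "river bridge survey",
    2: "river dam",
    3: "city bridge",
}
index = build_inverted_index(documents)

print(search_all(index, ["river", "bridge"]))
print(search_all(index, ["bridge"]))
```

A lookup touches only the posting sets of the query terms, so the cost does not depend on scanning every stored document.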
The index database and the index documents use the Elasticsearch search engine.
Elasticsearch performs complex search queries based on the data content; data only needs to be added to or updated in Elasticsearch.
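Adding or updating data in Elasticsearch is commonly done in batches through its bulk API, whose request body is newline-delimited JSON: an action line followed by the document itself. The Python sketch below only builds such a body as a string; the helper name, index name, and `object_id` field are hypothetical, and sending the body to a cluster is left out.

```python
import json


def to_bulk_body(index_name, documents, id_field="object_id"):
    """Build a newline-delimited JSON body for a bulk indexing request."""
    lines = []
    for doc in documents:
        # The "index" action adds the document, or replaces it if the id exists.
        lines.append(json.dumps({"index": {"_index": index_name,
                                           "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = to_bulk_body(
    "spatiotemporal_objects",
    [{"object_id": "42", "name": "river segment"}],
)
print(body)
```

Because the `index` action overwrites an existing document with the same id, the same helper covers both the add and the update case.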
The index database also comprises Spark; Spark reads data from Elasticsearch through ES-Hadoop, and the database serves as a persistent storage component that can provide constraints, accuracy guarantees, and robustness.
The timed task records the time of each capture with an internal timestamp and decides whether the currently captured data should be stored in the index.
The above technical scheme has the following beneficial effects:
Elasticsearch is an open-source full-text search engine that can quickly store, search, and analyze large amounts of data. Elasticsearch allows multiple servers to work together; each server can run multiple Elasticsearch instances, and a group of instances forms a cluster. Based on Elasticsearch, the invention performs fast full-text search over enterprise-level information resources, including various types of databases (relational and non-relational), a distributed file system, an information resource management system, and the like.
Having thus described the basic principles and principal features of the invention, it will be appreciated by those skilled in the art that the invention is not limited by the embodiments described above, which are given by way of illustration only, but that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.
Claims (5)
1. An efficient search engine method for heterogeneous multiple data sources based on Elasticsearch, characterized by comprising an index database, index documents, a data collector, and a search interface, wherein the data collector collects the content data to be searched in the system and organizes it into corresponding index documents, the constructed index documents are stored in the index database, and finally a user searches and queries through the search interface;
the index database, Elasticsearch, is used together with a relational or non-relational database, and large volumes of data held in Hadoop are processed by means of the real-time search and analysis functions of Elasticsearch and the Elasticsearch-Hadoop (ES-Hadoop) connector;
the index document structure adopts JSON, the index document type supported by Elasticsearch; a spatio-temporal object can be created as an index document in JSON format through Elasticsearch; the document content types supported by the search engine comprise multi-granularity spatio-temporal objects and the resources contained in the integrated development framework resource service; a document index is constructed by creating JSON objects for these two kinds of data content, each object serving as one JSON document, and the index is established;
the data collector collects the content data to be searched in the system, organizes it into corresponding index documents, and, through a timed task, actively and periodically captures the multi-granularity spatio-temporal objects and the data in the integrated development framework resource service for document storage;
the search interface receives a search request initiated by a user through a user terminal, obtains the corresponding search results from the index database according to the request, and returns them to the user terminal, and the user performs search, query, and other operations through the engine interface;
the method comprises the following specific steps:
in the first step, the data collector collects the content data to be searched in the system and, through a timed task, actively and periodically captures data from the multi-granularity spatio-temporal object database, the distributed file system, and the resource management information system. The timed task records the time of each capture with an internal timestamp and uses it to decide whether the currently captured data should be parsed. If the update time of the data in a data source is earlier than the last capture time, no processing is performed; if the update time is later than the last capture time, the related content is captured and the method proceeds to the second step. To capture data, a separate microservice is established for each data source, a connection to the data source is established, and the data is captured through the access interface that matches that data source;
in the second step, the captured data is parsed and processed, and the content data is organized into corresponding index documents: for data captured from different data sources, index documents are generated under the index corresponding to each data source, with one database record producing one index document;
in the third step, the constructed index documents are stored in the Elasticsearch cluster of the index database. To meet the requirement of highly concurrent access, the number of clusters is more than one; Elasticsearch indexes all fields and, after processing, writes an inverted index. When data is searched, the index is searched directly;
finally, the user performs a search query through the search interface, which converts the query conditions into an Elasticsearch query request and sends it to Elasticsearch. The search interface supports keyword search, multiple search keywords, and logical operations over multiple keywords; it outputs the matched results in a defined order, lets the user select the sorting rule, and provides a search relevance feedback mechanism.
2. The efficient search engine method for heterogeneous multiple data sources based on Elasticsearch of claim 1, wherein the index database and the index documents use the Elasticsearch search engine.
3. The efficient search engine method for heterogeneous multiple data sources based on Elasticsearch of claim 1, wherein Elasticsearch performs complex search queries based on the data content, and data only needs to be added to or updated in Elasticsearch.
4. The efficient search engine method for heterogeneous multiple data sources based on Elasticsearch of claim 1, wherein the index database further comprises Spark, Spark reads data from Elasticsearch through ES-Hadoop, and the database serves as a persistent storage component that can provide constraints, accuracy guarantees, and robustness.
5. The efficient search engine method for heterogeneous multiple data sources based on Elasticsearch of claim 1, wherein the timed task records the time of each capture with an internal timestamp and decides whether the currently captured data should be stored in the index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110176379.1A CN112988863A (en) | 2021-02-09 | 2021-02-09 | Elasticsearch-based efficient search engine method for heterogeneous multiple data sources |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110176379.1A CN112988863A (en) | 2021-02-09 | 2021-02-09 | Elasticsearch-based efficient search engine method for heterogeneous multiple data sources |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112988863A (en) | 2021-06-18 |
Family
ID=76392457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110176379.1A Pending CN112988863A (en) | 2021-02-09 | 2021-02-09 | Elasticsearch-based efficient search engine method for heterogeneous multiple data sources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112988863A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105183884A (en) * | 2015-09-24 | 2015-12-23 | 西安未来国际信息股份有限公司 | Search engine system and method based on big data technique |
CN111382226A (en) * | 2018-12-29 | 2020-07-07 | 北京神州泰岳软件股份有限公司 | Database query retrieval method and device and electronic equipment |
CN110543517A (en) * | 2019-08-26 | 2019-12-06 | 汉纳森(厦门)数据股份有限公司 | Method, device and medium for realizing complex query of mass data based on elastic search |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113486096A (en) * | 2021-06-21 | 2021-10-08 | 上海百秋电子商务有限公司 | Multi-library timing execution report data preprocessing and query method and system |
CN113722426A (en) * | 2021-07-30 | 2021-11-30 | 福建拓尔通软件有限公司 | Government website searching method, system, equipment and medium |
CN113742519A (en) * | 2021-08-31 | 2021-12-03 | 杭州登虹科技有限公司 | Multi-object storage cloud video Timeline storage method and system |
CN113886505A (en) * | 2021-09-28 | 2022-01-04 | 西安阳易信息技术有限公司 | Management system for realizing dynamic modeling based on search engine and relational database |
CN113886505B (en) * | 2021-09-28 | 2024-04-30 | 西安阳易信息技术有限公司 | Management system for realizing dynamic modeling based on search engine and relational database |
CN114443728A (en) * | 2022-01-04 | 2022-05-06 | 广州粤建三和软件股份有限公司 | Detection report searching method and device based on elastic search |
CN114969255A (en) * | 2022-05-26 | 2022-08-30 | 山东浪潮科学研究院有限公司 | ERP document searching method and system |
CN117235309A (en) * | 2023-09-14 | 2023-12-15 | 哈尔滨哈工智慧嘉利通科技股份有限公司 | Urban management similar case recommendation method based on acquisition and elastic search technology |
CN117290384A (en) * | 2023-11-27 | 2023-12-26 | 同方赛威讯信息技术有限公司 | Graphic and text retrieval system and method based on combination of big data and computer vision |
CN117290384B (en) * | 2023-11-27 | 2024-02-02 | 同方赛威讯信息技术有限公司 | Graphic and text retrieval system and method based on combination of big data and computer vision |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112988863A (en) | Elasticsearch-based efficient search engine method for heterogeneous multiple data sources | |
US11288231B2 (en) | Reproducing datasets generated by alert-triggering search queries | |
CN105122243B (en) | Expansible analysis platform for semi-structured data | |
US20200372007A1 (en) | Trace and span sampling and analysis for instrumented software | |
US10713269B2 (en) | Determining a presentation format for search results based on a presentation recommendation machine learning model | |
CN107451149B (en) | Monitoring method and device for flow data query task | |
EP0981097A1 (en) | Search system and method for providing a fulltext search over web pages of world wide web servers | |
CN107861981B (en) | Data processing method and device | |
CN112269816B (en) | Government affair appointment correlation retrieval method | |
US11494395B2 (en) | Creating dashboards for viewing data in a data storage system based on natural language requests | |
US20190034499A1 (en) | Navigating hierarchical components based on an expansion recommendation machine learning model | |
US9846740B2 (en) | Associative search systems and methods | |
US20080147631A1 (en) | Method and system for collecting and retrieving information from web sites | |
US20190034430A1 (en) | Disambiguating a natural language request based on a disambiguation recommendation machine learning model | |
CN105760418B (en) | Method and system for performing cross-column search on relational database table | |
Hassanzadeh et al. | Helix: Online enterprise data analytics | |
WO2019048879A1 (en) | System for detecting data relationships based on sample data | |
CN107004036B (en) | Method and system for searching logs containing a large number of entries | |
CN114969036A (en) | Data retrieval method and device | |
CN113722296A (en) | Agricultural information processing method and device, electronic equipment and storage medium | |
KR101223813B1 (en) | Apparatus and Method for information search by inquiry | |
CN105159899A (en) | Searching method and searching device | |
US20190034555A1 (en) | Translating a natural language request to a domain specific language request based on multiple interpretation algorithms | |
CN112597207B (en) | Metadata management system | |
US11409738B2 (en) | Method and system for query federation based on natural language processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210618 |