Full-text information retrieval method oriented to Mongo database
Technical Field
The invention relates to the technical field of computer application, in particular to a full-text information retrieval method oriented to a Mongo database.
Background
The Mongo database is a high-performance, open-source, schema-free document database, and has the following outstanding advantages:
(1) Collection-oriented storage makes it easy to store object-type data.
(2) It is schema-free.
(3) Large file storage is supported through MongoGridFS.
(4) Replication and failover are supported.
(5) Sharding is handled automatically, which supports cloud-scale horizontal scalability.
(6) It is accessible over a network.
Lucene is a suite of open source libraries for full-text retrieval and search, supported and provided by the Apache software foundation. It provides a simple yet powerful application program interface that enables full-text indexing and searching. Lucene, as a full-text search engine, has the following outstanding advantages:
(1) Its well-designed object-oriented architecture lowers the difficulty of learning to extend Lucene and makes it easy to add new functions.
(2) It defines a text analysis interface that is independent of language and file format. The indexer creates the index file by receiving token streams, so to support a new language or file format a user only needs to implement the text analysis interface.
(3) It implements a powerful query engine by default, so a user obtains powerful query capability without writing code. Boolean operations, fuzzy queries, grouped queries and the like are implemented by default in Lucene's query implementation.
In summary, full-text search over the Mongo database can be realized by combining the Mongo database with the Lucene full-text search engine.
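The separation between analysis and indexing described in advantage (2) can be illustrated with a minimal Python sketch. It is not Lucene's API: the names `Analyzer`, `WhitespaceAnalyzer`, `CharAnalyzer` and `build_index` are hypothetical and chosen only to show that the indexer consumes tokens without knowing the language of the text.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict, Iterator, Set


class Analyzer(ABC):
    """Hypothetical text-analysis interface: turns raw text into a token stream."""

    @abstractmethod
    def tokens(self, text: str) -> Iterator[str]:
        ...


class WhitespaceAnalyzer(Analyzer):
    """Splits on whitespace and lowercases; suits space-delimited languages."""

    def tokens(self, text: str) -> Iterator[str]:
        for word in text.split():
            yield word.lower()


class CharAnalyzer(Analyzer):
    """Emits each non-space character as a token; a crude stand-in for Chinese text."""

    def tokens(self, text: str) -> Iterator[str]:
        for ch in text:
            if not ch.isspace():
                yield ch


def build_index(docs: Dict[str, str], analyzer: Analyzer) -> Dict[str, Set[str]]:
    """Build token -> document IDs; the indexer only ever sees tokens."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer.tokens(text):
            index[token].add(doc_id)
    return dict(index)
```

Under this sketch, supporting a new language means adding another `Analyzer` subclass; `build_index` itself stays unchanged.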
The patent "Optical disc library full-text retrieval system based on Lucene" (application number 201510640451.6) discloses constructing an index library, constructing an analyzer, constructing an index creator, and an indexing process for file data, but that scheme differs from index construction oriented to a Mongo database in the following respects:
(1) The Mongo database is a high-performance, open-source, schema-free document database that is well suited to storing unstructured data. Storing data on optical discs, by contrast, requires a system for managing the reading of disc data, so the Mongo database has a very large cost advantage.
(2) Full-text retrieval of big data with an optical disc library requires a very large number of discs for storage, and the discs become harder to manage as their number grows. With a Mongo database the approach is much simpler, because the problem can be solved merely by adding shards to the Mongo database.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that the built-in full-text retrieval function of the Mongo database does not support Chinese retrieval, and provides a full-text information retrieval method oriented to the Mongo database.
The purpose of the invention can be achieved by adopting the following technical scheme:
A full-text information retrieval method oriented to a Mongo database comprises the following steps:
S1, directly establishing a file index on the Mongo database, as follows:
S101, constructing a MongoGridFS file index library, and storing the file indexes generated by Lucene in a MongoGridFS file set;
S102, constructing a Lucene analyzer for performing word segmentation on the resource information;
S103, constructing a Lucene index creator for writing the index files generated by Lucene into the MongoGridFS file set;
S104, converting the resource information stored in the Mongo database into documents recognized by Lucene, and setting the content of the corresponding attribute fields (Field);
S105, performing word segmentation on the resource information, and writing the index content of the resources into the MongoGridFS file set through the index creator;
S2, retrieving resource information related to the keywords based on the Mongo database, as follows:
S201, constructing a Lucene searcher, and opening the index files in the MongoGridFS file set in read-only mode for searching;
S202, constructing a Lucene analyzer for converting the keywords to be retrieved into query conditions;
S203, performing word segmentation on the search input with the analyzer to obtain a plurality of keywords;
S204, converting the keywords into query conditions recognized by Lucene;
S205, searching with the Lucene searcher, wherein each record in the result set has only one attribute field, whose content is the unique number of the Mongo resource information data;
S206, acquiring the data details of the resource information in the Mongo database through the unique resource information number in the result set.
Further, the step S101 is specifically as follows:
S1011, building a Mongo database cluster service;
S1012, storing the acquired resource information in the Mongo database;
S1013, extending the Lucene index storage interface, and establishing a Lucene index file library using the MongoGridFS file set;
S1014, reading the resource information set data stored in the Mongo database;
S1015, writing the resource information set data into the MongoGridFS file set.
Further, in step S104, except for the unique number of the resource information data of the Mongo database, no other attribute fields store the original information.
Compared with the prior art, the invention has the following advantages and effects:
(1) The full-text information retrieval method oriented to the Mongo database solves the problem of big data storage through the good horizontal scaling capability of the Mongo database.
(2) In the full-text information retrieval method oriented to the Mongo database, the index files are established directly in the MongoGridFS file set, which prevents users from mishandling the index files.
(3) The full-text information retrieval method oriented to the Mongo database uses the Lucene search engine to create indexes for the resource information in the Mongo database, and overcomes the defect that the Mongo database cannot retrieve Chinese full-text information.
Drawings
FIG. 1 is a flowchart of a method of the present invention for directly building a file index on a Mongo database based on Lucene;
FIG. 2 is a flow chart of a method for retrieving keyword-related resource information based on a Mongo database.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
This embodiment discloses a full-text information retrieval method oriented to a Mongo database, which comprises the following steps:
S1, directly establishing a file index on the Mongo database;
S2, retrieving resource information related to the keywords based on the Mongo database.
As shown in FIG. 1, the method for realizing Chinese retrieval by directly establishing a file index on the Mongo database comprises the following steps:
Step S101, constructing a MongoGridFS file index library, and storing all file indexes generated by Lucene in a MongoGridFS file set.
Step S101 specifically comprises the following:
Step S1011, building a Mongo database cluster service.
This step builds the cluster service of the Mongo database, which addresses the distributed storage of big data.
Step S1012, storing the acquired resource information in the Mongo database.
This step extracts local or network resources into the Mongo database as the target of full-text retrieval.
Step S1013, extending the Lucene index storage interface, and establishing a Lucene index file library using the MongoGridFS file set.
Step S1014, reading the resource information set data stored in the Mongo database.
Step S1015, writing the resource information set data into the MongoGridFS file set.
Steps S1011 to S1015 complete the process of collecting the scattered file data into the Mongo database, and establish the MongoGridFS file index library for the collected data by extending the Lucene index storage interface.
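The idea of backing the index storage interface with a GridFS-style file set can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the MongoDB driver or Lucene's storage class: `ChunkedFileStore` and its methods are hypothetical names, and the layout (one metadata entry per file plus fixed-size numbered chunks) only mirrors the general way GridFS splits a file into chunks.

```python
from typing import Dict, List, Tuple


class ChunkedFileStore:
    """In-memory stand-in for a GridFS-like file set (hypothetical, not a MongoDB API).

    File lengths live in `files`; contents are split into fixed-size chunks
    stored in `chunks`, keyed by (file name, chunk number).
    """

    def __init__(self, chunk_size: int = 4) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size
        self.files: Dict[str, int] = {}                  # name -> length in bytes
        self.chunks: Dict[Tuple[str, int], bytes] = {}   # (name, n) -> chunk data

    def _chunk_count(self, name: str) -> int:
        # Ceiling division of the file length by the chunk size.
        return -(-self.files[name] // self.chunk_size)

    def write_file(self, name: str, data: bytes) -> None:
        """Create or overwrite an index file."""
        self.delete_file(name)
        for n, start in enumerate(range(0, len(data), self.chunk_size)):
            self.chunks[(name, n)] = data[start:start + self.chunk_size]
        self.files[name] = len(data)

    def read_file(self, name: str) -> bytes:
        """Reassemble a file from its chunks in chunk order."""
        if name not in self.files:
            raise FileNotFoundError(name)
        return b"".join(self.chunks[(name, n)] for n in range(self._chunk_count(name)))

    def delete_file(self, name: str) -> None:
        """Remove a file and all of its chunks; a missing file is ignored."""
        if name in self.files:
            for n in range(self._chunk_count(name)):
                del self.chunks[(name, n)]
            del self.files[name]

    def list_files(self) -> List[str]:
        return sorted(self.files)
```

An index writer that can only create, read, list and delete named files through such a store never touches the local file system, which is the property the method relies on when it keeps index files inside the database.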
Step S102, constructing a Lucene analyzer for performing word segmentation on the resource information.
Step S103, constructing a Lucene index creator for writing the index files generated by Lucene into the MongoGridFS file set.
Step S104, converting the resource information stored in the Mongo database into documents recognized by Lucene, and setting the content of the corresponding attribute fields (Field), wherein no attribute field other than the unique number of the resource information data of the Mongo database stores the original information.
Step S105, performing word segmentation on the resource information, and writing the index content of the resources into the MongoGridFS file set through the index creator.
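Steps S102 to S105 can be illustrated with a small Python sketch of an inverted index that keeps only the unique number of each resource. Several points are assumptions made for illustration: overlapping two-character tokens stand in for the word segmentation performed by the analyzer, the field names `title` and `body` are hypothetical, and JSON is used as a toy serialization that does not resemble Lucene's actual index format.

```python
import json
from typing import Dict, List


def bigrams(text: str) -> List[str]:
    """Split text into overlapping two-character tokens.

    Needs no dictionary; a text with a single character yields that character.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(resources: List[dict]) -> Dict[str, List[str]]:
    """Map each token to the sorted IDs of the resources that contain it.

    Only the `_id` value is kept; title and body are analyzed, but their
    original text is not stored in the index.
    """
    postings: Dict[str, set] = {}
    for res in resources:
        for field in ("title", "body"):  # hypothetical field names
            for token in bigrams(res.get(field, "")):
                postings.setdefault(token, set()).add(res["_id"])
    return {token: sorted(ids) for token, ids in postings.items()}


def serialize_index(index: Dict[str, List[str]]) -> bytes:
    """Encode the index as bytes, ready to be written to a file store."""
    return json.dumps(index, ensure_ascii=False, sort_keys=True).encode("utf-8")
```

Because the original text is not stored, the index stays small, and every hit has to be resolved against the database by its unique number.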
As shown in FIG. 2, the method for searching resource information related to keywords based on the Mongo database comprises the following steps:
Step S201, constructing a Lucene searcher, and opening the index files in the MongoGridFS file set in read-only mode for searching.
Step S202, constructing a Lucene analyzer for converting the keywords to be retrieved into query conditions.
Step S203, performing word segmentation on the search input with the analyzer to obtain a plurality of keywords.
Step S204, converting the keywords into query conditions recognized by Lucene.
Step S205, searching with the Lucene searcher, wherein each record in the result set has only one attribute field, whose content is the unique number of the Mongo resource information data.
Step S206, acquiring the details of the resource information data in the Mongo database through the unique resource information number in the result set.
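The retrieval flow of steps S203 to S206 can be sketched in Python as a two-stage lookup: the index returns only unique numbers, and the details are then fetched by number. The sketch makes several assumptions: the index is a plain dictionary from token to resource IDs, the query is segmented with the same two-character tokenization that was assumed at indexing time, all query tokens must match (AND semantics), and a dictionary stands in for the Mongo collection. The names `search_ids` and `fetch_details` are hypothetical.

```python
from typing import Dict, List, Optional, Set


def bigrams(text: str) -> List[str]:
    """Query-time segmentation; it must match the segmentation used for indexing."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def search_ids(index: Dict[str, List[str]], keywords: str) -> List[str]:
    """Return the IDs of resources that contain every token of the query."""
    result: Optional[Set[str]] = None
    for token in bigrams(keywords):
        ids = set(index.get(token, ()))
        result = ids if result is None else result & ids
        if not result:
            return []  # one token without matches rules out every resource
    return sorted(result) if result else []


def fetch_details(collection: Dict[str, dict], ids: List[str]) -> List[dict]:
    """Resolve IDs to full records; a dict stands in for the Mongo collection."""
    return [collection[i] for i in ids if i in collection]
```

A call such as `fetch_details(collection, search_ids(index, "全文检索"))` then mirrors steps S205 and S206: the search yields unique numbers only, and the database supplies the details.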
In summary, the invention collects resource information, first stores the collected information in the Mongo database as a list of unstructured data, then builds index files for the unstructured data with the Lucene search engine, and stores the index files in the GridFS set of the Mongo database. Keyword retrieval returns only the document number in the Mongo database, and the document information is obtained through that number. The full-text information retrieval method oriented to the Mongo database solves the problem of big data storage through the good horizontal scaling capability of the Mongo database; by establishing the index files directly in the MongoGridFS file set, it prevents users from mishandling the index files; and by using the Lucene search engine to create indexes for the resource information in the Mongo database, it overcomes the defect that the Mongo database cannot retrieve Chinese full-text information.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.