CN107818126B - Full-text information retrieval method oriented to Mongo database

Publication number: CN107818126B
Authority: CN (China)
Prior art keywords: lucene, mongo database, file, index, resource information
Legal status: Active (the listed status is an assumption, not a legal conclusion)
Application number: CN201710777316.5A
Other languages: Chinese (zh)
Other versions: CN107818126A (en)
Inventor: 颜克旺
Current Assignee: Guangzhou Huiruisitong Technology Co Ltd (the listed assignee may be inaccurate)
Original Assignee: Guangzhou Huiruisitong Information Technology Co Ltd
Priority date: 2017-09-01 (an assumption, not a legal conclusion)
Filing date: 2017-09-01
Publication date: 2020-06-05

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Abstract

The invention discloses a full-text information retrieval method oriented to a Mongo database. Resource information is collected and first stored in the Mongo database as an unstructured data list; index files are then built for the unstructured data by the Lucene search engine and stored in a GridFS set of the Mongo database. A keyword search returns only the document number in the Mongo database, and the document information is then obtained through that number. The method addresses big data storage through the horizontal scaling capability of the Mongo database; because the index files are created directly in the MongoGridFS file set, user misoperation of the index files is avoided; and because the Lucene search engine creates the indexes for the resource information in the Mongo database, the inability of the Mongo database to retrieve Chinese full-text information is overcome.

Description

Full-text information retrieval method oriented to Mongo database
Technical Field
The invention relates to the technical field of computer application, in particular to a full-text information retrieval method oriented to a Mongo database.
Background
The Mongo database is a high-performance, open-source, schema-free document database with the following notable advantages:
(1) Collection-oriented storage makes object-type data easy to store.
(2) It is schema-free.
(3) Large file storage is supported through MongoGridFS.
(4) Replication and failover are supported.
(5) Automatic sharding supports cloud-scale scalability.
(6) It is accessible over a network.
Lucene is an open-source library suite for full-text indexing and search, supported and provided by the Apache Software Foundation. It offers a simple yet powerful application programming interface for full-text indexing and searching. As a full-text search engine, Lucene has the following notable advantages:
(1) Its object-oriented architecture lowers the learning curve for extending Lucene and makes new functions easy to add.
(2) Its text analysis interface is independent of language and file format. The indexer creates the index file from the Token stream it receives, so supporting a new language or file format only requires implementing the text analysis interface.
(3) A powerful query engine is implemented by default, so users obtain strong query capability without writing their own code; Boolean operations, fuzzy queries, grouped queries and the like are available by default.
In summary, full-text search over the Mongo database can be realized by combining the Mongo database with the Lucene full-text search engine.
The patent "optical disc library full-text retrieval system based on Lucene", application number 201510640451.6, discloses constructing an index library, constructing an analyzer, constructing an index creator, and an indexing process for file data. That scheme, however, differs from index construction oriented to the Mongo database in the following ways:
(1) The Mongo database is a high-performance, open-source, schema-free document database that is well suited to storing unstructured data. Storing data on optical discs additionally requires a system to manage the reading of disc data, so the Mongo database has a considerable cost advantage.
(2) Full-text retrieval of big data with an optical disc library requires a very large number of discs, and the discs themselves must be managed as their number grows. With the Mongo database the approach is simple: the problem is solved by adding shards to the Mongo database.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that the full-text retrieval function built into the Mongo database does not support Chinese retrieval, and provides a full-text information retrieval method oriented to the Mongo database.
The purpose of the invention can be achieved by adopting the following technical scheme:
A full-text information retrieval method oriented to a Mongo database comprises the following steps:
S1. Directly establishing a file index on the Mongo database, as follows:
S101. Constructing a MongoGridFS file index library, and storing the file indexes generated by Lucene in a MongoGridFS file set;
S102. Constructing a Lucene analyzer for performing word segmentation on the resource information;
S103. Constructing a Lucene index creator for writing the index files generated by Lucene into the MongoGridFS file set;
S104. Converting the resource information stored in the Mongo database into documents recognized by Lucene, and setting the content of the corresponding attribute fields (Field);
S105. Performing word segmentation on the resource information, and writing the index content of the resources into the MongoGridFS set through the index creator;
S2. Retrieving resource information related to the keywords based on the Mongo database, as follows:
S201. Constructing a Lucene searcher, and opening the index files in the MongoGridFS file set in read-only mode for searching;
S202. Constructing a Lucene analyzer for converting the keywords to be retrieved into query conditions;
S203. Using the analyzer to segment the search input into a plurality of keywords;
S204. Converting the keywords into query conditions recognized by Lucene;
S205. Searching through the Lucene searcher, wherein each record in the result set has only one attribute field, whose content is the unique number of the Mongo resource information data;
S206. Acquiring the details of the resource information data from the Mongo database through the unique resource information number in the result set.
Further, step S101 is specifically as follows:
S1011. Building a Mongo database cluster service;
S1012. Storing the collected resource information in the Mongo database;
S1013. Extending the Lucene index storage interface, and establishing a Lucene index file library using the MongoGridFS file set;
S1014. Reading the resource information set data stored in the Mongo database;
S1015. Writing the resource information set data into the MongoGridFS file set.
Further, in step S104, apart from the unique number of the resource information data in the Mongo database, no attribute field stores the original information.
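The constraint on step S104 can be illustrated with a small sketch: every field of a record is analyzed for searching, but only the unique number is kept as a retrievable value. The Python below is a stdlib-only illustration, not Lucene's API and not the patent's implementation; the `Field` class, the `indexed` and `stored` flags, and the sample record are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical index field: name, raw value, and two independent flags."""
    name: str
    value: str
    indexed: bool  # tokens of this field are searchable
    stored: bool   # the raw value is kept in the index and returned with a hit


def to_index_document(record: Dict[str, str]) -> List[Field]:
    """Map one database record to index fields; only the unique ID is stored."""
    fields = [Field("id", str(record["_id"]), indexed=False, stored=True)]
    for name, value in record.items():
        if name != "_id":
            # Searchable, but the original text stays only in the database.
            fields.append(Field(name, str(value), indexed=True, stored=False))
    return fields


def stored_view(fields: List[Field]) -> Dict[str, str]:
    """What a search hit would carry back: stored fields only."""
    return {f.name: f.value for f in fields if f.stored}


record = {"_id": "r-001", "title": "全文检索", "body": "面向Mongo数据库的全文信息检索方法"}
doc = to_index_document(record)
print(stored_view(doc))
```

Keeping only the number in the index means the index never holds a second copy of the original text, so the record in the database remains the single source of the details fetched in step S206.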
Compared with the prior art, the invention has the following advantages and effects:
(1) The full-text information retrieval method oriented to the Mongo database addresses big data storage through the horizontal scaling capability of the Mongo database.
(2) The method creates the index files directly in the MongoGridFS file set, which avoids user misoperation of the index files.
(3) The method uses the Lucene search engine to create indexes for the resource information in the Mongo database, which overcomes the inability of the Mongo database to retrieve Chinese full-text information.
Drawings
FIG. 1 is a flowchart of a method of the present invention for directly building a file index on a Mongo database based on Lucene;
FIG. 2 is a flow chart of a method for retrieving keyword-related resource information based on a Mongo database.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
This embodiment discloses a full-text information retrieval method oriented to a Mongo database, which comprises the following steps:
S1. Directly establishing a file index on the Mongo database;
S2. Retrieving resource information related to the keywords based on the Mongo database.
As shown in FIG. 1, the method for realizing Chinese retrieval by directly establishing a file index on the Mongo database comprises the following steps:
Step S101: constructing a MongoGridFS file index library, and storing all file indexes generated by Lucene in a MongoGridFS file set.
Step S101 further comprises:
Step S1011: building a Mongo database cluster service.
This step builds the cluster service of the Mongo database, which addresses the distributed storage of big data.
Step S1012: storing the collected resource information in the Mongo database.
This step extracts local or network resources into the Mongo database as the target of full-text retrieval.
Step S1013: extending the Lucene index storage interface, and establishing a Lucene index file library using the MongoGridFS file set.
Step S1014: reading the resource information set data stored in the Mongo database.
Step S1015: writing the resource information set data into the MongoGridFS file set.
Steps S1011 to S1015 collect the scattered file data into the Mongo database and, by extending the Lucene index storage interface, establish the MongoGridFS file index library for the collected data.
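One way to picture the storage extension in steps S1013 to S1015: Lucene reads and writes its index files through a storage abstraction (in Lucene, the `Directory` class, which is presumably the interface the text refers to), and GridFS keeps each file as a metadata entry plus an ordered sequence of fixed-size chunks. The Python sketch below is an in-memory stand-in only; it uses neither the Lucene nor the MongoDB driver API, and `ChunkedFileStore`, its method names, and the 4-byte chunk size are assumptions chosen for illustration.

```python
from typing import Dict, List

CHUNK_SIZE = 4  # deliberately tiny so the example splits; real chunk sizes are far larger


class ChunkedFileStore:
    """Hypothetical in-memory stand-in for a GridFS-style file set."""

    def __init__(self, chunk_size: int = CHUNK_SIZE) -> None:
        self.chunk_size = chunk_size
        self.files: Dict[str, int] = {}           # file name -> length in bytes
        self.chunks: Dict[str, List[bytes]] = {}  # file name -> ordered chunks

    def write_file(self, name: str, data: bytes) -> None:
        """Store a named index file as a metadata entry plus fixed-size chunks."""
        self.files[name] = len(data)
        self.chunks[name] = [
            data[i:i + self.chunk_size]
            for i in range(0, len(data), self.chunk_size)
        ]

    def read_file(self, name: str) -> bytes:
        """Reassemble a file by concatenating its chunks in order."""
        return b"".join(self.chunks[name])

    def list_all(self) -> List[str]:
        """List the stored file names, as an index reader would need."""
        return sorted(self.files)

    def delete_file(self, name: str) -> None:
        """Remove both the metadata entry and the chunks of a file."""
        del self.files[name]
        del self.chunks[name]


store = ChunkedFileStore()
store.write_file("segments_1", b"index-bytes")
print(store.list_all(), len(store.chunks["segments_1"]))
```

Because the index files live inside the database rather than on a local file system, they are distributed and backed up together with the data they describe, which is the property the description relies on to keep users from mishandling index files.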
Step S102: constructing a Lucene analyzer for performing word segmentation on the resource information.
Step S103: constructing a Lucene index creator for writing the index files generated by Lucene into the MongoGridFS file set.
Step S104: converting the resource information stored in the Mongo database into documents recognized by Lucene, and setting the content of the corresponding attribute fields (Field), wherein, apart from the unique number of the resource information data in the Mongo database, no attribute field stores the original information.
Step S105: performing word segmentation on the resource information, and writing the index content of the resources into the MongoGridFS set through the index creator.
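Steps S102 to S105 can be sketched as a tokenizer feeding an inverted index. The source does not say which analyzer is used, so the Python below assumes overlapping two-character terms (bigrams) for Chinese text, a common approach that Lucene's `CJKAnalyzer` also takes. The function names, the sample records, and the JSON serialization are hypothetical; a real Lucene index uses its own binary segment format.

```python
import json
from collections import defaultdict
from typing import Dict, List, Set


def bigram_tokens(text: str) -> List[str]:
    """Split CJK runs into overlapping bigrams and other runs into lowercase words."""
    tokens: List[str] = []
    cjk: List[str] = []
    word: List[str] = []

    def flush() -> None:
        # At most one of the two buffers is non-empty at any time.
        if cjk:
            run = "".join(cjk)
            tokens.extend([run] if len(run) == 1
                          else [run[i:i + 2] for i in range(len(run) - 1)])
            cjk.clear()
        if word:
            tokens.append("".join(word).lower())
            word.clear()

    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            if word:
                flush()
            cjk.append(ch)
        elif ch.isalnum():
            if cjk:
                flush()
            word.append(ch)
        else:
            flush()
    flush()
    return tokens


def build_index(records: List[dict]) -> Dict[str, List[str]]:
    """Build term -> sorted record IDs; the original text itself is not kept."""
    postings: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for token in bigram_tokens(record["text"]):
            postings[token].add(record["_id"])
    return {term: sorted(ids) for term, ids in postings.items()}


records = [
    {"_id": "r-001", "text": "面向Mongo数据库的全文检索"},
    {"_id": "r-002", "text": "数据库索引"},
]
index = build_index(records)
# Serialized form that an index writer could hand to the file set.
index_bytes = json.dumps(index, ensure_ascii=False).encode("utf-8")
print(bigram_tokens("面向Mongo数据库"))
```

Bigrams need no dictionary, at the cost of a larger index than dictionary-based word segmentation would produce; either kind of analyzer fits the role described in step S102.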
As shown in FIG. 2, the method for retrieving resource information related to keywords based on the Mongo database comprises the following steps:
Step S201: constructing a Lucene searcher, and opening the index files in the MongoGridFS file set in read-only mode for searching.
Step S202: constructing a Lucene analyzer for converting the keywords to be retrieved into query conditions.
Step S203: using the analyzer to segment the search input into a plurality of keywords.
Step S204: converting the keywords into query conditions recognized by Lucene.
Step S205: searching through the Lucene searcher, wherein each record in the result set has only one attribute field, whose content is the unique number of the Mongo resource information data.
Step S206: acquiring the details of the resource information data from the Mongo database through the unique resource information number in the result set.
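The retrieval flow of steps S203 to S206 can be sketched as a two-stage lookup: the index answers with record numbers only, and the database supplies the details. The Python below is an illustration under stated assumptions, not Lucene's query API; `POSTINGS` stands in for an index already built from bigram terms, `COLLECTION` stands in for the Mongo collection, and combining terms with a Boolean AND is one possible query form that the source does not specify.

```python
from typing import Dict, List, Optional, Set

# Hypothetical postings as an indexing stage might have produced them: term -> record IDs.
POSTINGS: Dict[str, List[str]] = {
    "全文": ["r-001", "r-003"],
    "文检": ["r-001"],
    "检索": ["r-001", "r-002"],
}

# Stand-in for the Mongo collection that holds the full records.
COLLECTION: Dict[str, dict] = {
    "r-001": {"_id": "r-001", "title": "全文检索方法"},
    "r-002": {"_id": "r-002", "title": "信息检索"},
    "r-003": {"_id": "r-003", "title": "全文数据"},
}


def analyze(query: str) -> List[str]:
    """S203: split the query into overlapping two-character terms."""
    q = "".join(query.split())
    return [q[i:i + 2] for i in range(len(q) - 1)] or [q]


def search_ids(query: str) -> List[str]:
    """S204-S205: require every term (Boolean AND) and return only record IDs."""
    result: Optional[Set[str]] = None
    for term in analyze(query):
        ids = set(POSTINGS.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


def fetch_details(ids: List[str]) -> List[dict]:
    """S206: look the full records up by ID in the database stand-in."""
    return [COLLECTION[i] for i in ids if i in COLLECTION]


hits = search_ids("全文检索")
print(hits, fetch_details(hits))
```

Returning only IDs from the index keeps the result set small and leaves the database as the single place where record details are read.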
In summary, the invention collects resource information and first stores it in the Mongo database as an unstructured data list, then builds index files for the unstructured data with the Lucene search engine and stores the index files in a GridFS set of the Mongo database. A keyword search returns only the document number in the Mongo database, and the document information is then obtained through that number. The method addresses big data storage through the horizontal scaling capability of the Mongo database; because the index files are created directly in the MongoGridFS file set, user misoperation of the index files is avoided; and because the Lucene search engine creates the indexes for the resource information in the Mongo database, the inability of the Mongo database to retrieve Chinese full-text information is overcome.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (3)

1. A full-text information retrieval method oriented to a Mongo database, characterized by comprising the following steps:
S1. Directly establishing a file index on the Mongo database, as follows:
S101. Constructing a MongoGridFS file index library, and storing the index files generated by Lucene in a MongoGridFS file set;
S102. Constructing a Lucene analyzer for performing word segmentation on the resource information;
S103. Constructing a Lucene index creator for writing the index files generated by Lucene into the MongoGridFS file set;
S104. Converting the resource information stored in the Mongo database into documents recognized by Lucene, and setting the content of the corresponding attribute fields (Field);
S105. Performing word segmentation on the resource information, and writing the index content of the resources into the MongoGridFS set through the index creator;
S2. Retrieving resource information related to the keywords based on the Mongo database, as follows:
S201. Constructing a Lucene searcher, and opening the index files in the MongoGridFS file set in read-only mode for searching;
S202. Constructing a Lucene analyzer for converting the keywords to be retrieved into query conditions;
S203. Using the analyzer to segment the search input into a plurality of keywords;
S204. Converting the keywords into query conditions recognized by Lucene;
S205. Searching through the Lucene searcher, wherein each record in the result set has only one attribute field, whose content is the unique number of the Mongo resource information data;
S206. Acquiring the details of the resource information data from the Mongo database through the unique resource information number in the result set.
2. The full-text information retrieval method oriented to the Mongo database according to claim 1, wherein step S101 is as follows:
S1011. Building a Mongo database cluster service;
S1012. Storing the collected resource information in the Mongo database;
S1013. Extending the Lucene index storage interface, and establishing a Lucene index file library using the MongoGridFS file set;
S1014. Reading the resource information set data stored in the Mongo database;
S1015. Writing the resource information set data into the MongoGridFS file set.
3. The full-text information retrieval method oriented to the Mongo database according to claim 1, wherein in step S104, apart from the unique number of the resource information data in the Mongo database, no attribute field stores the original information.
CN201710777316.5A 2017-09-01 2017-09-01 Full-text information retrieval method oriented to Mongo database Active CN107818126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710777316.5A CN107818126B (en) 2017-09-01 2017-09-01 Full-text information retrieval method oriented to Mongo database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710777316.5A CN107818126B (en) 2017-09-01 2017-09-01 Full-text information retrieval method oriented to Mongo database

Publications (2)

Publication Number Publication Date
CN107818126A (en) 2018-03-20
CN107818126B (en) 2020-06-05

Family

ID=61601592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710777316.5A Active CN107818126B (en) 2017-09-01 2017-09-01 Full-text information retrieval method oriented to Mongo database

Country Status (1)

Country Link
CN (1) CN107818126B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9672497B1 (en) * 2013-11-04 2017-06-06 Snap-On Incorporated Methods and systems for using natural language processing and machine-learning to produce vehicle-service content
CN106227788A (en) * 2016-07-20 2016-12-14 浪潮软件集团有限公司 Database query method based on Lucene
CN106682148A (en) * 2016-12-22 2017-05-17 北京锐安科技有限公司 Method and device based on Solr data search

Also Published As

Publication number Publication date
CN107818126A (en) 2018-03-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 510000 no.2-8, North Street, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Patentee after: Guangzhou huiruisitong Technology Co.,Ltd.

Address before: 605, No.8, 2nd Street, Ping'an 2nd Road, Xianzhuang, lirendong village, Nancun Town, Panyu District, Guangzhou City, Guangdong Province 511442

Patentee before: GUANGZHOU HUIRUI SITONG INFORMATION TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20230207

Granted publication date: 20200605

PD01 Discharge of preservation of patent

Date of cancellation: 20240402

Granted publication date: 20200605