CN115080602A - Method for realizing accurate search of data assets based on NLP algorithm - Google Patents
- Publication number
- CN115080602A
- Authority
- CN
- China
- Prior art keywords
- data
- asset
- library
- index
- keywords
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2452—Query translation
- G06F16/24522—Translation of natural language queries to structured queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/243—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24534—Query rewriting; Transformation
- G06F16/24539—Query rewriting; Transformation using cached or materialised query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/248—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method for accurately searching data assets based on an NLP algorithm. The method supports natural-language retrieval of data assets and treats data as an asset, providing automatic relationship maintenance, model generation, intelligent retrieval and the like. It achieves efficient retrieval and a high hit rate for the asset directory generated from big data and for the field lineage relationships, tags, indexes and the like in that directory. It can also effectively build up the frequently queried (hot) TopN data. By using a Redis cache, results for similar keyword queries are returned quickly, which greatly shortens the query link and reduces the pressure on the relational database.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a method for realizing accurate search of data assets based on an NLP algorithm.
Background
In existing data asset search methods, asset retrieval is limited to the maintenance and retrieval of physical assets, for example retrieving data on building materials; the query link is long, and interfaces and calling relationships must be maintained manually. The prior art cannot efficiently retrieve metadata asset tags or cross-asset lineage relationships. In addition, hit rate and recall are low.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a method for accurately searching data assets based on an NLP algorithm.
To achieve this purpose, the invention adopts the following technical solution:
A method for accurately searching data assets based on an NLP algorithm comprises the following specific process:
step one, generation of asset metadata;
step two, index construction:
(1) construction of the metadata index:
an acquisition module pulls asset metadata according to a scheduled acquisition task; pulling is either full or incremental, with full pulling used for newly added tables and incremental pulling used for existing tables; null values and numeric values are then removed from the content of each field, and the remaining keywords are kept as the index of the metadata;
(2) construction of the data index:
1) for an existing table, the synchronization/construction service pulls the data incrementally; keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library; after extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into ES as the index of the data;
2) for a newly added table, the synchronization/construction service performs a full pull; keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library; after extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into ES as the index of the data;
step three, data asset retrieval:
the NLP2SQL service receives the content to be retrieved from an input entry and, after mapping against the code table, generates the SQL statement for the query; the NLP2SQL service parses the natural language through lexical analysis to generate a machine-executable SQL statement;
the NLP service searches the Redis cache; if the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data results; if everything hits, the retrieval ends directly; the returned data results include: the positions of the libraries and tables where the asset data is located and of the fields where the keywords are located, the mapping relationships between the fields and other libraries and tables, the mapping relationships between the fields and the tags, and the mapping relationships between the fields and the indexes; if the Redis cache does not hit for all of them, the remaining data results are fetched from the ES library according to the index; if the index is not hit, the keyword does not exist in the ES library and the retrieval ends; if the index is hit, the data results are returned;
the returned data results are sorted and then returned to the NLP2SQL front end for display as a list;
when the user queries the details of a list entry, the user must apply again to the data middle platform for the asset data, and Presto is used to query Hive according to the library, table and row position where the asset data is located; the query results form TopN hot-spot data, which is cached in Redis and synchronized to the ES library at set time intervals.
Further, the specific process of step one is as follows: the asset management module generates asset metadata on a schedule and generates the corresponding asset lineage relationships, code table mapping relationships and tag data; the asset metadata is stored in an asset metadata library, and the asset lineage relationships are stored in a lineage relationship library; newly added data of the lineage relationship library, the asset metadata library and the code table/term library is first cached in Redis, and Redis writes the cached data into the ES library at fixed time intervals, finally forming the asset directory.
The beneficial effects of the invention are as follows: the invention supports natural-language retrieval of data assets and treats data as an asset, providing automatic relationship maintenance, model generation, intelligent retrieval and the like. It achieves efficient retrieval and a high hit rate for the asset directory generated from big data and for the field lineage relationships, tags, indexes and the like in that directory. It can also effectively build up the frequently queried (hot) TopN data. By using a Redis cache, results for similar keyword queries are returned quickly, which greatly shortens the query link and reduces the pressure on the relational database.
Drawings
FIG. 1 is a schematic flow chart of index construction according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of data retrieval according to an embodiment of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings. It should be noted that this embodiment is based on the above technical solution and gives a detailed implementation and specific operating process, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides a method for accurately searching data assets based on an NLP algorithm. As shown in FIG. 1, the specific process is as follows:
Step one, generation of asset metadata:
The asset management module generates asset metadata on a schedule and generates the corresponding asset lineage relationships, code table mapping relationships and tag data. The asset metadata is stored in an asset metadata library, and the asset lineage relationships are stored in a lineage relationship library. Industry terms are collected and stored in a code table/term library. When the lineage relationship library, the asset metadata library or the code table/term library has new data, the new data is first cached in Redis; Redis writes the cached data into ES at fixed time intervals, finally forming the asset directory.
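The cache-then-flush flow of step one can be illustrated with a minimal Python sketch. It is not the patent's implementation: the class `WriteBehindBuffer`, its method names and the sample keys are hypothetical, and plain dictionaries stand in for Redis and the ES library so that the example runs without external services.

```python
from typing import Dict


class WriteBehindBuffer:
    """Buffers new records in a cache and flushes them to a search store in batches."""

    def __init__(self) -> None:
        self.cache: Dict[str, dict] = {}  # stand-in for Redis (assumption)
        self.store: Dict[str, dict] = {}  # stand-in for the ES library (assumption)

    def put(self, key: str, record: dict) -> None:
        # New lineage, metadata or term records land in the cache first.
        self.cache[key] = record

    def flush(self) -> int:
        # Intended to be called by a timer at a fixed interval;
        # moves everything cached so far into the store.
        flushed = len(self.cache)
        self.store.update(self.cache)
        self.cache.clear()
        return flushed


buffer = WriteBehindBuffer()
buffer.put("meta:orders.amount", {"table": "orders", "field": "amount"})
buffer.put("lineage:orders.amount", {"upstream": ["raw_orders.amt"]})
flushed_count = buffer.flush()
```

In a real deployment the flush would presumably be driven by a scheduler and would write in bulk; the sketch only shows the ordering of cache write and periodic store write.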
Step two, index construction:
(1) Construction of the metadata index:
An acquisition module pulls asset metadata according to a scheduled acquisition task. Pulling is either full or incremental: full pulling is used for newly added tables and incremental pulling for existing tables. Null values and numeric values are then removed from the content of each field, and the remaining keywords are kept as the index of the metadata.
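The null-value and numeric-value removal described above could look roughly like the following Python sketch. The function name `metadata_keywords`, the regular expression used to decide what counts as "numeric", and the sample values are all assumptions made for illustration; the source does not define them.

```python
import re
from typing import Iterable, List, Optional

# Assumption: a value is "numeric" if it is an optionally signed integer or decimal.
_NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")


def metadata_keywords(values: Iterable[Optional[str]]) -> List[str]:
    """Drop null/empty and purely numeric values; keep the rest as index keywords."""
    keywords: List[str] = []
    for value in values:
        if value is None:
            continue  # null value removal
        text = value.strip()
        if not text or _NUMERIC.match(text):
            continue  # empty or numeric value removal
        keywords.append(text)
    return keywords


sample = ["order_amount", None, "", "42", "3.14", "customer name"]
cleaned = metadata_keywords(sample)
```

Here `cleaned` keeps only `"order_amount"` and `"customer name"`, which would then serve as metadata index keywords.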
(2) Construction of the data index:
1) For an existing table, the synchronization/construction service pulls the data incrementally. Keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library. After extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into ES as the index of the data.
2) For a newly added table, the synchronization/construction service performs a full pull. Keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library. After extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into ES as the index of the data.
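As a rough illustration of the keyword extraction and de-duplication in (2), the Python sketch below matches field values against a keyword library and returns a de-duplicated dictionary. The function `build_dictionary`, the case-insensitive substring matching and the sample data are assumptions; the source does not say how keywords are matched, and Chinese text would normally need a word segmenter instead.

```python
from typing import Iterable, List, Set


def build_dictionary(field_values: Iterable[str], keyword_library: Set[str]) -> List[str]:
    """Return the de-duplicated library keywords that occur in the field values."""
    found: Set[str] = set()  # a set de-duplicates keywords seen in several rows
    for value in field_values:
        lowered = value.lower()
        for keyword in keyword_library:
            if keyword.lower() in lowered:
                found.add(keyword)
    return sorted(found)


# Keyword library = code table/term library plus the most frequently accessed keywords.
term_library = {"crude oil", "refinery"}
hot_keywords = {"pipeline"}
library = term_library | hot_keywords

rows = ["Crude oil shipment via pipeline A", "Pipeline maintenance log", "misc note"]
dictionary = build_dictionary(rows, library)
```

The resulting `dictionary` is what a synchronization/construction service would write to the search index; the write itself is omitted here.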
Step three, data asset retrieval:
As shown in FIG. 2, the NLP2SQL service receives the content to be retrieved (natural language) from an input entry and, after mapping against the code table, generates the SQL statement for the query. Here, SQL covers both structured and unstructured queries: structured queries mainly target the mapped code table library (MySQL), and unstructured queries mainly target the ES index library. The NLP2SQL service parses the natural language through lexical analysis to generate a machine-executable SQL statement.
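The source does not describe the lexical analysis or the code table format, so the following Python sketch is only one possible reading: tokens are taken from the question, those found in a code table are kept, and a parameterized SQL statement is produced. The `CODE_TABLE` contents, the table and column names, and the `%s` placeholder style are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical code table: business term -> (table, column) to search.
CODE_TABLE: Dict[str, Tuple[str, str]] = {
    "oil": ("asset_catalog", "keyword"),
    "pipeline": ("asset_catalog", "keyword"),
}


def to_sql(question: str) -> Tuple[str, List[str]]:
    """Tokenize the question, keep tokens found in the code table, emit parameterized SQL."""
    tokens = re.findall(r"[a-z0-9_]+", question.lower())  # naive lexical analysis
    terms = [t for t in tokens if t in CODE_TABLE]
    if not terms:
        raise ValueError("no known term in question")
    # Table and column come from the trusted code table, never from user input.
    table, column = CODE_TABLE[terms[0]]
    placeholders = ", ".join(["%s"] * len(terms))
    sql = (
        f"SELECT db_name, table_name, field_name FROM {table} "
        f"WHERE {column} IN ({placeholders})"
    )
    return sql, terms


sql, params = to_sql("Where is the oil pipeline data?")
```

Passing the user's terms as parameters rather than splicing them into the statement is a design choice of this sketch, made to avoid SQL injection; the source does not discuss it.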
The NLP service searches the Redis cache. If the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data results; if everything hits, the retrieval ends directly. The returned data results include: the positions of the library, table and keyword field where the asset data is located (for example, the keyword "oil" at row 100 of a table field), the mapping relationships between the field and other libraries and tables, the mapping relationships between the field and the tags, and the mapping relationships between the field and the indexes. If the Redis cache does not hit for all of them, the remaining data results are fetched from the ES library according to the index. If the index is not hit, the keyword (index) does not exist in the ES library and the retrieval ends; if the index is hit, the data results are returned.
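The two-tier lookup described above (cache first, then the index library) can be sketched in Python as follows. Dictionaries stand in for Redis and the ES library, and the function name `lookup` and the record fields are hypothetical.

```python
from typing import Dict, List, Optional


def lookup(
    keywords: List[str],
    cache: Dict[str, dict],
    index: Dict[str, dict],
) -> Dict[str, Optional[dict]]:
    """Resolve each keyword from the cache first, then from the index; None means not found."""
    results: Dict[str, Optional[dict]] = {}
    for keyword in keywords:
        if keyword in cache:
            results[keyword] = cache[keyword]      # cache hit: no index access needed
        else:
            results[keyword] = index.get(keyword)  # index hit, or None if the keyword is absent
    return results


cache = {"oil": {"db": "ods", "table": "wells", "field": "product"}}
index = {"pipeline": {"db": "ods", "table": "routes", "field": "name"}}
found = lookup(["oil", "pipeline", "uranium"], cache, index)
```

In this run `"oil"` is served from the cache, `"pipeline"` falls through to the index, and `"uranium"` is found in neither, which corresponds to the case where retrieval ends without a result.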
The returned data results are sorted and then returned to the NLP2SQL front end for display as a list.
When the user queries the details of a list entry, the user must apply again to the data middle platform for the asset data, and Presto is used to query Hive according to the library, table and row position where the asset data is located. The query results form TopN hot-spot data, which is cached in Redis and synchronized to the ES library at set time intervals.
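How the TopN hot-spot data is selected is not specified in the source. One simple possibility, shown here as a hedged Python sketch, is to count how often each asset is queried and keep the N most frequent; the function `top_n_hot` and the sample query log are hypothetical.

```python
from collections import Counter
from typing import List


def top_n_hot(query_log: List[str], n: int) -> List[str]:
    """Return the n most frequently queried keys, most frequent first."""
    return [key for key, _ in Counter(query_log).most_common(n)]


log = ["wells", "routes", "wells", "tanks", "routes", "wells"]
hot = top_n_hot(log, 2)
```

The keys in `hot` are the candidates whose detail results would be cached and later synchronized to the index library.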
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.
Claims (2)
1. A method for accurately searching data assets based on an NLP algorithm, characterized by comprising the following specific process:
step one, generation of asset metadata;
step two, index construction:
(1) construction of the metadata index:
an acquisition module pulls asset metadata according to a scheduled acquisition task; pulling is either full or incremental, with full pulling used for newly added tables and incremental pulling used for existing tables; null values and numeric values are then removed from the content of each field, and the remaining keywords are kept as the index of the metadata;
(2) construction of the data index:
1) for an existing table, the synchronization/construction service pulls the data incrementally; keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library; after extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into ES as the index of the data;
2) for a newly added table, the synchronization/construction service performs a full pull; keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library; after extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into ES as the index of the data;
step three, data asset retrieval:
the NLP2SQL service receives the content to be retrieved from an input entry and, after mapping against the code table, generates the SQL statement for the query; the NLP2SQL service parses the natural language through lexical analysis to generate a machine-executable SQL statement;
the NLP service searches the Redis cache; if the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data results; if everything hits, the retrieval ends directly; the returned data results include: the positions of the libraries and tables where the asset data is located and of the fields where the keywords are located, the mapping relationships between the fields and other libraries and tables, the mapping relationships between the fields and the tags, and the mapping relationships between the fields and the indexes; if the Redis cache does not hit for all of them, the remaining data results are fetched from the ES library according to the index; if the index is not hit, the keyword does not exist in the ES library and the retrieval ends; if the index is hit, the data results are returned;
the returned data results are sorted and then returned to the NLP2SQL front end for display as a list;
when the user queries the details of a list entry, the user must apply again to the data middle platform for the asset data, and Presto is used to query Hive according to the library, table and row position where the asset data is located; the query results form TopN hot-spot data, which is cached in Redis and synchronized to the ES library at set time intervals.
2. The method according to claim 1, wherein the specific process of step one is as follows: the asset management module generates asset metadata on a schedule and generates the corresponding asset lineage relationships, code table mapping relationships and tag data; the asset metadata is stored in an asset metadata library, and the asset lineage relationships are stored in a lineage relationship library; newly added data of the lineage relationship library, the asset metadata library and the code table/term library is first cached in Redis, and Redis writes the cached data into the ES library at fixed time intervals, finally forming the asset directory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210275470.3A CN115080602B (en) | 2022-03-21 | 2022-03-21 | Method for realizing accurate search of data assets based on NLP algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210275470.3A CN115080602B (en) | 2022-03-21 | 2022-03-21 | Method for realizing accurate search of data assets based on NLP algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115080602A (en) | 2022-09-20 |
CN115080602B (en) | 2023-05-26 |
Family
ID=83247512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210275470.3A Active CN115080602B (en) | 2022-03-21 | 2022-03-21 | Method for realizing accurate search of data assets based on NLP algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115080602B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6094649A (en) * | 1997-12-22 | 2000-07-25 | Partnet, Inc. | Keyword searches of structured databases |
AU2008259833A1 (en) * | 2007-06-01 | 2008-12-11 | Getty Images, Inc. | Method and system for searching for digital assets |
CN101789006A (en) * | 2010-01-29 | 2010-07-28 | 华东电网有限公司 | Intelligent search based quick searching method of power grid enterprise information integrating system |
CN103646032A (en) * | 2013-11-11 | 2014-03-19 | 漆桂林 | Database query method based on body and restricted natural language processing |
CN109739893A (en) * | 2018-12-28 | 2019-05-10 | 上海连尚网络科技有限公司 | A kind of metadata management method, equipment and computer-readable medium |
Also Published As
Publication number | Publication date |
---|---|
CN115080602B (en) | 2023-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102407510B1 (en) | Method, apparatus, device and medium for storing and querying data | |
US6931408B2 (en) | Method of storing, maintaining and distributing computer intelligible electronic data | |
CN107273506A (en) | A kind of method of database multi-list conjunctive query | |
CN104750681A (en) | Method and device for processing mass data | |
CN109947796B (en) | Caching method for query intermediate result set of distributed database system | |
JPH10505690A (en) | X. 500 System and Method | |
CN105608232A (en) | Bug knowledge modeling method based on graphic database | |
CN104331446A (en) | Memory map-based mass data preprocessing method | |
CN101196900A (en) | Information searching method based on metadata | |
CN109857898A (en) | A kind of method and system of mass digital audio-frequency fingerprint storage and retrieval | |
CN102339315A (en) | Index updating method and system of advertisement data | |
CN112231321B (en) | Oracle secondary index and index real-time synchronization method | |
CN106611053A (en) | Data cleaning and indexing method | |
CN106708814B (en) | Retrieval method and device based on relational database | |
CN106407360A (en) | Data processing method and device | |
CN109446358A (en) | A kind of chart database accelerator and method based on ID caching technology | |
CN113051382A (en) | Intelligent power failure question-answering method and device based on knowledge graph | |
CN109241259A (en) | Natural language querying method, apparatus and system based on ER model | |
CN104391908A (en) | Locality sensitive hashing based indexing method for multiple keywords on graphs | |
US7809674B2 (en) | Supporting B+tree indexes on primary B+tree structures with large primary keys | |
CN102201007A (en) | Large-scale data retrieving system | |
US6826563B1 (en) | Supporting bitmap indexes on primary B+tree like structures | |
CN115080602A (en) | Method for realizing accurate search of data assets based on NLP algorithm | |
Song et al. | Materialization and decomposition of dataspaces for efficient search | |
CN114218277A (en) | Efficient query method and device for relational database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||