CN115080602A - Method for realizing accurate search of data assets based on NLP algorithm - Google Patents

Info

Publication number
CN115080602A
Authority
CN
China
Prior art keywords
data
asset
library
index
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210275470.3A
Other languages
Chinese (zh)
Other versions
CN115080602B (en)
Inventor
于洋
高经郡
谢晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kejie Technology Co ltd
Original Assignee
Beijing Kejie Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kejie Technology Co ltd
Priority to CN202210275470.3A
Publication of CN115080602A
Application granted
Publication of CN115080602B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2452 Query translation
    • G06F16/24522 Translation of natural language queries to structured queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24539 Query rewriting; Transformation using cached or materialised query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for accurately searching data assets based on an NLP algorithm. The method supports natural-language retrieval of data assets and, treating data as an asset, provides automatic relationship maintenance, model generation, intelligent retrieval and the like. It offers efficient retrieval and a high hit rate for asset catalogs generated from big data and for the field lineage (blood relationship), tags, indexes and the like within those catalogs. It can also build top-N hot-spot data for frequently issued queries. By using a Redis cache, queries with similar keywords are answered quickly, which greatly shortens the query link and reduces the pressure on the relational database.

Description

Method for realizing accurate search of data assets based on NLP algorithm
Technical Field
The invention relates to the technical field of data processing, in particular to a method for realizing accurate search of data assets based on an NLP algorithm.
Background
In existing data asset search methods, asset retrieval is limited to the maintenance and retrieval of physical assets, such as data retrieval for building materials; the query link is long, and interfaces and call relationships must be maintained manually. The prior art cannot efficiently retrieve metadata asset tags or cross-asset lineage (blood relationship). In addition, hit rate and recall are low.
Disclosure of Invention
To address the shortcomings of the prior art, the invention aims to provide a method for accurately searching data assets based on an NLP algorithm.
To achieve this purpose, the invention adopts the following technical solution:
A method for accurately searching data assets based on an NLP algorithm comprises the following specific process:
First, generation of asset metadata:
Second, index construction:
(1) Metadata index construction:
an acquisition module pulls asset metadata according to a scheduled acquisition task; pulling is either full or incremental, with full pulling used for newly added tables and incremental pulling used for existing tables; null values and numeric values are then removed from the content of each field, and the remaining keywords are retained as the index of the metadata;
(2) Data index construction:
1) for an existing table, the synchronization/construction service pulls data incrementally; keywords are extracted from the fields, using the code table/term library and the most frequently accessed keywords as the keyword library; after extraction, all keywords are deduplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data;
2) for a newly added table, the synchronization/construction service performs a full pull; keywords are extracted from the fields, using the code table/term library and the most frequently accessed keywords as the keyword library; after extraction, all keywords are deduplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data;
Third, data asset retrieval:
The NLP2SQL service receives the content to be retrieved from the input entry and, after code-table mapping, generates the SQL statement for the query; the NLP2SQL service parses the natural language through lexical analysis to generate a machine-executable SQL statement;
the NLP service searches the Redis cache; if the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data result; if everything is hit, retrieval ends directly; the returned data result includes: the location of the database, table, and field where the asset data and the keyword reside, the mapping between the field and other databases and tables, the mapping between the field and tags, and the mapping between the field and indexes; if the Redis cache does not hit everything, data results continue to be fetched from the es library according to the index; if the index is not hit, the keyword does not exist in the es library and retrieval ends; if the index is hit, a data result is returned;
the returned data results are sorted and then returned to the NLP2SQL front end for display as a list;
when a user queries the details of a list entry, the user must apply to the data middle platform for the asset data again, and Presto is used to query Hive according to the database, table, and row where the asset data resides; the query results form top-N hot-spot data, which is cached in Redis and synchronized to the es library at set intervals.
Further, the specific process of step one is as follows: the asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood relationship), code-table mappings, and tag data; the asset metadata is stored in an asset metadata library, and the asset lineage is stored in a lineage library; newly added data in the lineage library, the asset metadata library, and the code table/term library is first cached in Redis, and Redis writes the cached data into the es library at fixed intervals, finally forming an asset catalog.
The beneficial effects of the invention are as follows: the invention supports natural-language retrieval of data assets and, treating data as an asset, provides automatic relationship maintenance, model generation, intelligent retrieval and the like. It offers efficient retrieval and a high hit rate for asset catalogs generated from big data and for the field lineage, tags, indexes and the like within those catalogs. It can also build top-N hot-spot data for frequently issued queries. By using a Redis cache, queries with similar keywords are answered quickly, which greatly shortens the query link and reduces the pressure on the relational database.
Drawings
FIG. 1 is a flow chart illustrating the construction of an index according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of data retrieval according to an embodiment of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings. Note that this embodiment is based on the technical solution above and provides a detailed implementation and specific operating procedure, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides a method for accurately searching data assets based on an NLP algorithm. As shown in FIG. 1, the specific process is as follows:
First, generation of asset metadata:
The asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood relationship), code-table mappings, and tag data. The asset metadata is stored in an asset metadata library, and the asset lineage is stored in a lineage library. Industry terms are collected and stored in a code table/term library. When the lineage library, the asset metadata library, or the code table/term library has new data, the new data is first cached in Redis; Redis writes the cached data into es at fixed intervals, finally forming an asset catalog.
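The buffered write path in step one can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: plain Python containers stand in for Redis and the es library, and the names `CatalogBuffer`, `add`, and `flush`, as well as the sample records, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogBuffer:
    """Stages new records, then flushes them to the asset catalog in one pass."""
    pending: list = field(default_factory=list)   # stand-in for the Redis cache
    catalog: dict = field(default_factory=dict)   # stand-in for the es asset catalog

    def add(self, source: str, record_id: str, record: dict) -> None:
        # source is a hypothetical label: "metadata", "lineage", or "terms"
        self.pending.append((source, record_id, record))

    def flush(self) -> int:
        """Write every staged record to the catalog and return how many were written.

        In a real deployment a scheduler would call this at a fixed interval.
        """
        written = len(self.pending)
        for source, record_id, record in self.pending:
            self.catalog[(source, record_id)] = record
        self.pending.clear()
        return written


buffer = CatalogBuffer()
buffer.add("metadata", "orders", {"table": "orders", "fields": ["id", "product"]})
buffer.add("lineage", "orders.product", {"upstream": "raw_orders.product_name"})
flushed = buffer.flush()
```

Staging writes and flushing them in batches keeps the three source libraries from writing to the search index on every change, which appears to be the purpose of the interval-based write described above.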
Second, index construction:
(1) Metadata index construction:
An acquisition module pulls asset metadata according to a scheduled acquisition task. Pulling is either full or incremental: full pulling is used for newly added tables and incremental pulling for existing tables. Null values and numeric values are then removed from the content of each field, and the remaining keywords are retained as the index of the metadata;
(2) Data index construction:
1) for an existing table, the synchronization/construction service pulls data incrementally; keywords are extracted from the fields, using the code table/term library and the most frequently accessed keywords as the keyword library; after extraction, all keywords are deduplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data;
2) for a newly added table, the synchronization/construction service performs a full pull; keywords are extracted from the fields, using the code table/term library and the most frequently accessed keywords as the keyword library; after extraction, all keywords are deduplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data.
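The two index-construction steps above can be sketched in a few lines. This is an illustration only: the function names, the sample rows, and the use of simple substring matching for keyword extraction are assumptions, since the text does not specify the extraction algorithm.

```python
def metadata_keywords(field_values):
    """Drop nulls, empty strings, and numeric values; keep the rest as keywords."""
    keywords = []
    for value in field_values:
        if value is None:
            continue
        text = str(value).strip()
        if not text:
            continue
        try:
            float(text)          # parses as a number, so it is removed
        except ValueError:
            keywords.append(text)
    return keywords


def build_dictionary(rows, keyword_library):
    """Collect library keywords that occur in any field value, deduplicated and sorted."""
    found = set()                # the set performs the deduplication
    for row in rows:
        for value in row.values():
            if value is None:
                continue
            text = str(value)
            for keyword in keyword_library:
                if keyword in text:   # assumed matching rule: plain substring
                    found.add(keyword)
    return sorted(found)


# Hypothetical keyword library: code table/term library plus most-accessed keywords
library = {"oil", "filter", "gas"}
rows = [{"product": "crude oil", "qty": 12}, {"product": "oil filter", "qty": None}]

meta_index = metadata_keywords(["orders", None, "42", "", "product", 3.5])
dictionary = build_dictionary(rows, library)
```

Here `meta_index` keeps only `"orders"` and `"product"`, and `dictionary` contains `"filter"` and `"oil"` once each even though `"oil"` appears in both rows. The same `build_dictionary` call serves full and incremental pulls; only the set of rows passed in differs.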
Third, data asset retrieval
As shown in FIG. 2, the NLP2SQL service receives the content to be retrieved (natural language) from the input entry and, after code-table mapping, generates the SQL statement for the query. SQL here covers both structured and unstructured query languages: structured queries mainly go to the mapped code table library (MySQL), and unstructured queries mainly go to the es index library. The NLP2SQL service parses the natural language through lexical analysis to generate a machine-executable SQL statement.
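A toy version of the code-table mapping step might look like the sketch below. It is not the NLP2SQL service itself: the regular-expression tokenizer stands in for real lexical analysis, and `CODE_TABLE`, the table name `asset_fields`, and the column name `keyword` are hypothetical. The `%s` placeholders follow the parameter style used by common Python MySQL drivers.

```python
import re

# Hypothetical code table: natural-language term -> (table, column)
CODE_TABLE = {
    "oil": ("asset_fields", "keyword"),
    "pipeline": ("asset_fields", "keyword"),
}


def tokenize(text):
    """Very rough lexical analysis: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def to_sql(text, code_table):
    """Return (sql, params) for tokens found in the code table, or None if none map."""
    matches = [(tok, code_table[tok]) for tok in tokenize(text) if tok in code_table]
    if not matches:
        return None
    table = matches[0][1][0]
    clauses, params = [], []
    for tok, (tbl, column) in matches:
        if tbl != table:
            continue             # sketch limitation: single-table queries only
        clauses.append(f"{column} = %s")
        params.append(tok)       # values are passed as parameters, not inlined
    sql = f"SELECT * FROM {table} WHERE " + " OR ".join(clauses)
    return sql, params


result = to_sql("Where is the oil pipeline data?", CODE_TABLE)
sql, params = result
```

Table and column names come only from the trusted code table, while user-supplied terms travel as bound parameters, so the generated statement stays safe to execute even for arbitrary input text.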
The NLP service searches the Redis cache. If the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data result; if everything is hit, retrieval ends directly. The returned data result includes: the location of the database, table, and key field where the asset data resides (for example, oil is in row 100 of the field form_Point), the mapping between the field and other databases and tables, the mapping between the field and tags, and the mapping between the field and indexes. If the Redis cache does not hit everything, data results continue to be fetched from the es library according to the index. If the index is not hit, the keyword (index) does not exist in the es library and retrieval ends; if the index is hit, a data result is returned.
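The cache-first lookup order described above can be sketched as follows. Dictionaries stand in for the Redis cache and the es index, and the function name `lookup`, the record fields, and the sample values are hypothetical.

```python
def lookup(keywords, cache, index):
    """Try the cache first; send only cache misses to the index.

    Returns (results, missing), where missing lists keywords found in neither store.
    """
    results, missing = {}, []
    for keyword in keywords:
        if keyword in cache:          # cache hit: no index query needed
            results[keyword] = cache[keyword]
        elif keyword in index:        # cache miss, index hit
            results[keyword] = index[keyword]
        else:                         # not indexed: retrieval ends for this keyword
            missing.append(keyword)
    return results, missing


# Hypothetical location records: database, table, field, and row of the asset data
cache = {"oil": {"db": "assets", "table": "wells", "field": "product", "row": 100}}
index = {"pipeline": {"db": "assets", "table": "routes", "field": "kind", "row": 7}}

results, missing = lookup(["oil", "pipeline", "gas"], cache, index)
```

When every keyword is served from `cache`, the index is never consulted, which corresponds to the "all hit, retrieval ends directly" branch in the text.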
The returned data results are sorted and then returned to the NLP2SQL front end for display as a list.
When a user queries the details of a list entry, the user must apply to the data middle platform for the asset data again, and Presto is used to query Hive according to the database, table, and row where the asset data resides. The query results form top-N hot-spot data, which is cached in Redis and synchronized to the es library at set intervals.
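One way to select top-N hot-spot data is to count how often each detail query is issued and keep the N most frequent results. The text does not say how "hot" is measured, so ranking by query count is an assumption here, and the class name `HotSpotTracker` and its methods are hypothetical.

```python
from collections import Counter


class HotSpotTracker:
    """Counts detail queries and exposes the N most frequently queried results."""

    def __init__(self, n):
        self.n = n
        self.counts = Counter()   # query key -> number of times queried
        self.results = {}         # query key -> latest query result

    def record(self, key, result):
        self.counts[key] += 1
        self.results[key] = result

    def top_n(self):
        """Return the hot-spot entries that would be cached, most frequent first."""
        return {key: self.results[key] for key, _ in self.counts.most_common(self.n)}


tracker = HotSpotTracker(n=2)
for key in ["oil", "gas", "oil", "pipeline", "oil", "pipeline"]:
    tracker.record(key, {"detail": f"rows for {key}"})
hot = tracker.top_n()
```

With the sample queries, `hot` holds the entries for `"oil"` (three queries) and `"pipeline"` (two queries), in that order. A periodic job could then write `hot` into the cache and copy it to the index at the configured interval.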
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims (2)

1. A method for accurately searching data assets based on an NLP algorithm, characterized by comprising the following specific process:
First, generation of asset metadata:
Second, index construction:
(1) Metadata index construction:
an acquisition module pulls asset metadata according to a scheduled acquisition task; pulling is either full or incremental, with full pulling used for newly added tables and incremental pulling used for existing tables; null values and numeric values are then removed from the content of each field, and the remaining keywords are retained as the index of the metadata;
(2) Data index construction:
1) for an existing table, the synchronization/construction service pulls data incrementally; keywords are extracted from the fields, using the code table/term library and the most frequently accessed keywords as the keyword library; after extraction, all keywords are deduplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data;
2) for a newly added table, the synchronization/construction service performs a full pull; keywords are extracted from the fields, using the code table/term library and the most frequently accessed keywords as the keyword library; after extraction, all keywords are deduplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data;
Third, data asset retrieval:
The NLP2SQL service receives the content to be retrieved from the input entry and, after code-table mapping, generates the SQL statement for the query; the NLP2SQL service parses the natural language through lexical analysis to generate a machine-executable SQL statement;
the NLP service searches the Redis cache; if the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data result; if everything is hit, retrieval ends directly; the returned data result includes: the location of the database and table where the asset data resides and of the field where the keyword resides, the mapping between the field and other databases and tables, the mapping between the field and tags, and the mapping between the field and indexes; if the Redis cache does not hit everything, data results continue to be fetched from the es library according to the index; if the index is not hit, the keyword does not exist in the es library and retrieval ends; if the index is hit, a data result is returned;
the returned data results are sorted and then returned to the NLP2SQL front end for display as a list;
when a user queries the details of a list entry, the user must apply to the data middle platform for the asset data again, and Presto is used to query Hive according to the database, table, and row where the asset data resides; the query results form top-N hot-spot data, which is cached in Redis and synchronized to the es library at set intervals.
2. The method according to claim 1, wherein the specific process of step one is as follows: the asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood relationship), code-table mappings, and tag data; the asset metadata is stored in an asset metadata library, and the asset lineage is stored in a lineage library; newly added data in the lineage library, the asset metadata library, and the code table/term library is first cached in Redis, and Redis writes the cached data into the es library at fixed intervals, finally forming an asset catalog.
CN202210275470.3A 2022-03-21 2022-03-21 Method for realizing accurate search of data assets based on NLP algorithm Active CN115080602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210275470.3A CN115080602B (en) 2022-03-21 2022-03-21 Method for realizing accurate search of data assets based on NLP algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210275470.3A CN115080602B (en) 2022-03-21 2022-03-21 Method for realizing accurate search of data assets based on NLP algorithm

Publications (2)

Publication Number Publication Date
CN115080602A (en) 2022-09-20
CN115080602B (en) 2023-05-26

Family

ID=83247512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210275470.3A Active CN115080602B (en) 2022-03-21 2022-03-21 Method for realizing accurate search of data assets based on NLP algorithm

Country Status (1)

Country Link
CN (1) CN115080602B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6094649A (en) * 1997-12-22 2000-07-25 Partnet, Inc. Keyword searches of structured databases
AU2008259833A1 (en) * 2007-06-01 2008-12-11 Getty Images, Inc. Method and system for searching for digital assets
CN101789006A (en) * 2010-01-29 2010-07-28 华东电网有限公司 Intelligent search based quick searching method of power grid enterprise information integrating system
CN103646032A (en) * 2013-11-11 2014-03-19 漆桂林 Database query method based on body and restricted natural language processing
CN109739893A (en) * 2018-12-28 2019-05-10 上海连尚网络科技有限公司 A kind of metadata management method, equipment and computer-readable medium

Also Published As

Publication number Publication date
CN115080602B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
KR102407510B1 (en) Method, apparatus, device and medium for storing and querying data
US6931408B2 (en) Method of storing, maintaining and distributing computer intelligible electronic data
CN107273506A (en) A kind of method of database multi-list conjunctive query
CN104750681A (en) Method and device for processing mass data
CN109947796B (en) Caching method for query intermediate result set of distributed database system
JPH10505690A (en) X. 500 System and Method
CN105608232A (en) Bug knowledge modeling method based on graphic database
CN104331446A (en) Memory map-based mass data preprocessing method
CN101196900A (en) Information searching method based on metadata
CN109857898A (en) A kind of method and system of mass digital audio-frequency fingerprint storage and retrieval
CN102339315A (en) Index updating method and system of advertisement data
CN112231321B (en) Oracle secondary index and index real-time synchronization method
CN106611053A (en) Data cleaning and indexing method
CN106708814B (en) Retrieval method and device based on relational database
CN106407360A (en) Data processing method and device
CN109446358A (en) A kind of chart database accelerator and method based on ID caching technology
CN113051382A (en) Intelligent power failure question-answering method and device based on knowledge graph
CN109241259A (en) Natural language querying method, apparatus and system based on ER model
CN104391908A (en) Locality sensitive hashing based indexing method for multiple keywords on graphs
US7809674B2 (en) Supporting B+tree indexes on primary B+tree structures with large primary keys
CN102201007A (en) Large-scale data retrieving system
US6826563B1 (en) Supporting bitmap indexes on primary B+tree like structures
CN115080602A (en) Method for realizing accurate search of data assets based on NLP algorithm
Song et al. Materialization and decomposition of dataspaces for efficient search
CN114218277A (en) Efficient query method and device for relational database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant