CN115080602B - Method for realizing accurate search of data assets based on NLP algorithm

Info

Publication number: CN115080602B
Authority: CN (China)
Prior art keywords: data, asset, library, index, pulling
Legal status (an assumption, not a legal conclusion): Active
Application number: CN202210275470.3A
Other languages: Chinese (zh)
Other versions: CN115080602A (en)
Inventors: 于洋, 高经郡, 谢晋
Current Assignee (the listed assignees may be inaccurate): Beijing Kejie Technology Co ltd
Original Assignee: Beijing Kejie Technology Co ltd
Priority date (an assumption, not a legal conclusion): 2022-03-21
Filing date: 2022-03-21
Application filed by: Beijing Kejie Technology Co ltd
Publication of CN115080602A: 2022-09-20
Publication of CN115080602B (application granted): 2023-05-26

Classifications

    • G06F16/24522 Translation of natural language queries to structured queries
    • G06F16/2228 Indexing structures
    • G06F16/243 Natural language query formulation
    • G06F16/24539 Query rewriting; Transformation using cached or materialised query results
    • G06F16/24552 Database cache management
    • G06F16/248 Presentation of query results
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F16/284 Relational databases
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for accurately searching data assets based on an NLP algorithm. The method supports natural-language search over data assets, treats data as an asset, and automates relationship maintenance, model generation, intelligent search, and the like. It offers efficient retrieval and a high hit rate over the asset catalog generated from big data and over the field lineage (blood relationships), labels, and indexes within that catalog. It also efficiently builds TopN data for frequently issued (hot) queries. Results for similar keyword queries are served quickly from the redis cache, which greatly shortens the query link and reduces the pressure on the relational database.

Description

Method for realizing accurate search of data assets based on NLP algorithm
Technical Field
The invention relates to the technical field of data processing, in particular to a method for realizing accurate search of data assets based on an NLP algorithm.
Background
In existing data asset search methods, the searchable asset types are fixed to the maintenance and search of physical assets, such as data searches for building materials; the query link is long, and interfaces and calling relationships must be maintained manually. For metadata asset labels and cross-asset lineage (blood relationships), the prior art does not provide efficient retrieval. In addition, hit rate and recall are low.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a method for realizing accurate search of data assets based on an NLP algorithm.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A method for accurately searching data assets based on an NLP algorithm comprises the following specific process:
1. Generation of asset metadata:
2. Construction of the index:
(1) Construction of the metadata index:
The acquisition module pulls asset metadata according to scheduled acquisition tasks. Pulling is either full or incremental: a full pull is used for a newly added table and an incremental pull for an existing table. Null values and numeric values are then removed from the content of each field, and the remaining keywords are kept as the metadata index.
(2) Construction of the data index:
1) For an existing table, the synchronization/construction service pulls the data incrementally. The code table/term library, together with the most frequently accessed keywords, serves as the keyword library used to extract keywords from the fields. After extraction, all keywords are de-duplicated to generate a dictionary, and the synchronization/construction service writes the dictionary to es as the data index.
2) For a newly added table, the synchronization/construction service performs a full pull. Keyword extraction against the same keyword library, de-duplication, dictionary generation, and writing the dictionary to es as the data index then proceed as in 1).
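The pull-mode rule and the field cleaning described in step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the set of already-known tables, and the use of float() to detect numeric values are all assumptions made for the example.

```python
from typing import Iterable, Optional


def choose_pull_mode(table: str, known_tables: set[str]) -> str:
    """Full pull for a newly added table, incremental pull for an existing one."""
    return "incremental" if table in known_tables else "full"


def is_numeric(text: str) -> bool:
    """True when the text parses as a number, e.g. '42', '-3.5', '1e6'.

    Assumption: float() is an acceptable numeric test; note that it also
    accepts 'nan' and 'inf'.
    """
    try:
        float(text)
        return True
    except ValueError:
        return False


def metadata_keywords(field_values: Iterable[Optional[str]]) -> list[str]:
    """Drop null/blank values and purely numeric values; keep the rest as keywords."""
    keywords = []
    for value in field_values:
        if value is None:
            continue
        text = value.strip()
        if not text or is_numeric(text):
            continue
        keywords.append(text)
    return keywords
```

For example, metadata_keywords(["petroleum", None, "100"]) keeps only "petroleum", which would then be stored as a metadata index entry.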
3. Data asset retrieval
The NLP2SQL service receives the content to be retrieved from the input entry point and, after passing it through the mapping code table, generates the SQL statement for the query. The NLP2SQL service parses the natural language by lexical analysis to generate machine-executable SQL statements.
The NLP service searches the redis cache first. If the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data results; if everything is hit, the search ends directly. The returned data results include: the library and table where the asset data is located and the position of the field containing the keyword; the lineage (blood-relationship) mapping between the field and other libraries and tables; the mapping between the field and labels; and the mapping between the field and indexes. If the redis cache does not hit everything, the data results must be obtained from the es library according to the index. If the index is not hit, the keyword does not exist in the es library; if the index is hit, the data results are returned once retrieval completes.
The returned data results are collated and returned to the NLP2SQL front end for display as a list.
When a user requests the details of a listed item, the asset data must be requested again from the data middle platform, which queries hive through presto using the library, table, and row position of the asset data. The query results form TopN hot-spot data, which is cached in redis and synchronized to the es library at a set interval.
Further, the specific process of step 1 is as follows: the asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood-relationship) data, code table mapping relationships, and label data. Asset metadata is stored in the asset metadata library, and asset lineage relationships are stored in the lineage library. Newly added data in the lineage library, the asset metadata library, and the code table/term library is first cached in redis, and redis writes the cached data to the es library at a fixed interval, finally forming the asset catalog.
The beneficial effects of the invention are as follows: the invention supports natural-language search over data assets, treats data as an asset, and automates relationship maintenance, model generation, intelligent search, and the like. It offers efficient retrieval and a high hit rate over the asset catalog generated from big data and over the field lineage (blood relationships), labels, and indexes within that catalog. It also efficiently builds TopN data for frequently issued (hot) queries. Results for similar keyword queries are served quickly from the redis cache, which greatly shortens the query link and reduces the pressure on the relational database.
Drawings
FIG. 1 is a schematic flow chart of index construction in an embodiment of the invention;
FIG. 2 is a schematic flow chart of data retrieval in an embodiment of the invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings. Note that although this embodiment provides a detailed implementation and a specific operating procedure based on the present technical solution, the scope of protection of the invention is not limited to this embodiment.
This embodiment provides a method for accurately searching data assets based on an NLP algorithm, as shown in FIG. 1, comprising the following specific process:
1. Generation of asset metadata:
The asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood-relationship) data, code table mapping relationships, and label data. Asset metadata is stored in the asset metadata library, and asset lineage relationships are stored in the lineage library. Industry terms are collected and stored in the code table/term library. Whenever new data appears in the lineage library, the asset metadata library, or the code table/term library, it is first cached in redis; redis writes the cached data to es at a fixed interval, finally forming the asset catalog.
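The buffer-then-flush behaviour of step 1 (new data held in the cache, then written to the search store at a fixed interval) can be sketched with in-memory stand-ins. Everything here is hypothetical: the class name, the plain dicts standing in for redis and es, and the injectable clock are assumptions chosen so the example runs without external services.

```python
import time


class WriteBehindBuffer:
    """Buffers new records and flushes them to a store once `interval` seconds have passed."""

    def __init__(self, store: dict, interval: float, clock=time.monotonic):
        self.store = store            # stand-in for the es library
        self.interval = interval      # fixed flush period in seconds
        self.clock = clock            # injectable so the flush can be tested
        self.pending = {}             # stand-in for the redis cache
        self.last_flush = clock()

    def add(self, key: str, record: dict) -> None:
        """Cache a newly added record, then flush if the interval has elapsed."""
        self.pending[key] = record
        self.flush_if_due()

    def flush_if_due(self) -> bool:
        """Write all cached records to the store when the interval has elapsed."""
        if self.clock() - self.last_flush < self.interval:
            return False
        self.store.update(self.pending)
        self.pending.clear()
        self.last_flush = self.clock()
        return True
```

A production system would more likely trigger the flush from a scheduler than from add(); the inline check is only to keep the sketch short.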
2. Construction of the index:
(1) Construction of the metadata index:
The acquisition module pulls asset metadata according to scheduled acquisition tasks. Pulling is either full or incremental: a full pull is used for a newly added table and an incremental pull for an existing table. Null values and numeric values are then removed from the content of each field, and the remaining keywords are kept as the metadata index.
(2) Construction of the data index:
1) For an existing table, the synchronization/construction service pulls the data incrementally. The code table/term library, together with the most frequently accessed keywords, serves as the keyword library used to extract keywords from the fields. After extraction, all keywords are de-duplicated to generate a dictionary, and the synchronization/construction service writes the dictionary to es as the data index.
2) For a newly added table, the synchronization/construction service performs a full pull. Keyword extraction against the same keyword library, de-duplication, dictionary generation, and writing the dictionary to es as the data index then proceed as in 1).
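The data-index construction above (build a keyword library, extract keywords from fields, de-duplicate, emit a dictionary) might look like the sketch below. The function names, the top_n cut-off for "most frequently accessed", and the case-insensitive substring match are assumptions; the source does not say how keywords are matched, and Chinese text would normally need word segmentation first.

```python
def build_keyword_library(terms, access_counts, top_n):
    """Union of code-table/term-library entries and the top_n most accessed keywords."""
    # Sort by descending access count, then alphabetically to break ties.
    hottest = sorted(access_counts, key=lambda k: (-access_counts[k], k))[:top_n]
    return set(terms) | set(hottest)


def build_dictionary(rows, keyword_library):
    """Map each keyword to the de-duplicated (row number, field) positions where it occurs."""
    dictionary = {}
    for row_number, row in enumerate(rows):
        for field, value in row.items():
            text = str(value).lower()
            for keyword in keyword_library:
                if keyword.lower() in text:
                    # A set removes duplicate positions for the same keyword.
                    dictionary.setdefault(keyword, set()).add((row_number, field))
    return {keyword: sorted(positions) for keyword, positions in sorted(dictionary.items())}
```

The resulting dictionary is what the synchronization/construction service would write to es; the write itself is left out because it depends on the client library in use.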
3. Data asset retrieval
As shown in FIG. 2, the NLP2SQL service receives the content to be retrieved (natural language) from the input entry point and, after passing it through the mapping code table, generates the SQL statement for the query. The SQL here covers both structured and unstructured queries: structured queries mainly target the mapping code table library (mysql), and unstructured queries mainly target the es index library. The NLP2SQL service parses the natural language by lexical analysis to generate machine-executable SQL statements.
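As a rough illustration of the NLP2SQL idea, the sketch below tokenizes a question, looks each token up in a mapping code table, and emits a parameterized SQL statement. The code table contents, the asset_catalog table, and its column names are all hypothetical, and real lexical analysis would be far richer than a word split.

```python
import re

# Hypothetical mapping code table: natural-language term -> (column, canonical value).
CODE_TABLE = {
    "petroleum": ("keyword", "petroleum"),
    "oil": ("keyword", "petroleum"),   # synonym mapped to one canonical term
    "finance": ("label", "finance"),
}


def tokenize(text):
    """Lexical analysis reduced to lower-cased word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def to_sql(question, code_table=CODE_TABLE, table="asset_catalog"):
    """Return (sql, params) for the mapped terms, or None when no token maps."""
    pairs = []
    for token in tokenize(question):
        pair = code_table.get(token)
        if pair is not None and pair not in pairs:
            pairs.append(pair)
    if not pairs:
        return None
    # Column names come from the trusted code table; values are bound as parameters.
    where = " AND ".join(f"{column} = ?" for column, _ in pairs)
    sql = f"SELECT library_name, table_name, field_name FROM {table} WHERE {where}"
    return sql, [value for _, value in pairs]
```

Binding the values as parameters rather than interpolating them keeps user input out of the statement text, which matters because the input is free-form natural language.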
The NLP service searches the redis cache first. If the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data results; if everything is hit, the search ends directly. The returned data results include: the position of the field where the asset data is located (for example, "petroleum" is in field form_post, row 100); the lineage (blood-relationship) mapping between the field and other libraries and tables; the mapping between the field and labels; and the mapping between the field and indexes. If the redis cache does not hit everything, the data results must be obtained from the es library according to the index. If the index is not hit, the keyword (index) does not exist in the es library; if the index is hit, the data results are returned once retrieval completes.
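The cache-first lookup with fallback to the index can be sketched as below. Plain dicts stand in for the redis cache and the es index, and the function name and the shape of the result records are assumptions made for illustration.

```python
def retrieve(keywords, cache, index):
    """Look up each keyword in the cache first, then in the index for any misses.

    Returns (results, not_found): results maps keyword -> data result, and
    not_found lists keywords present in neither the cache nor the index.
    """
    results, missing = {}, []
    for keyword in keywords:
        if keyword in cache:
            results[keyword] = cache[keyword]
        else:
            missing.append(keyword)
    if not missing:
        # Everything hit the cache, so the search ends here.
        return results, []
    not_found = []
    for keyword in missing:
        if keyword in index:
            results[keyword] = index[keyword]
        else:
            # An index miss means the keyword does not exist in the index at all.
            not_found.append(keyword)
    return results, not_found
```

With cache = {"petroleum": {"field": "form_post", "row": 100}}, a query for ["petroleum"] never touches the index, which is the shortened query link the text describes.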
The returned data results are collated and returned to the NLP2SQL front end for display as a list.
When a user requests the details of a listed item, the asset data must be requested again from the data middle platform, which queries hive through presto using the library, table, and row position of the asset data. The query results form TopN hot-spot data, which is cached in redis and synchronized to the es library at a set interval.
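One way to picture the TopN hot-spot data is a counter over detail queries, keyed by the (library, table, row) position that was looked up. The class below is a hypothetical sketch; the source does not specify how hotness is measured, so counting queries per position is an assumption.

```python
from collections import Counter


class HotSpotTracker:
    """Counts detail queries per asset position and exposes the N hottest results."""

    def __init__(self, n):
        self.n = n
        self.hits = Counter()   # position -> number of detail queries
        self.results = {}       # position -> latest query result

    def record(self, position, result):
        """Register one detail query for the given (library, table, row) position."""
        self.hits[position] += 1
        self.results[position] = result

    def top_n(self):
        """Return the results of the n most queried positions, hottest first."""
        return {pos: self.results[pos] for pos, _ in self.hits.most_common(self.n)}
```

The dictionary returned by top_n() is the payload that would be cached and later synchronized to the search store at the set interval.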
Various modifications and variations of the present invention will be apparent to those skilled in the art in light of the foregoing teachings and are intended to be included within the scope of the following claims.

Claims (1)

1. A method for accurately searching data assets based on an NLP algorithm, characterized in that the specific process is as follows:
1. Generation of asset metadata:
the asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood-relationship) data, code table mapping relationships, and label data; asset metadata is stored in the asset metadata library, and asset lineage relationships are stored in the lineage library; newly added data in the lineage library, the asset metadata library, and the code table/term library is first cached in redis, and redis writes the cached data to the es library at a fixed interval, finally forming the asset catalog;
2. Construction of the index:
(1) Construction of the metadata index:
the acquisition module pulls asset metadata according to scheduled acquisition tasks; pulling is either full or incremental, with a full pull used for a newly added table and an incremental pull used for an existing table; null values and numeric values are then removed from the content of each field, and the remaining keywords are kept as the metadata index;
(2) Construction of the data index:
1) for an existing table, the synchronization/construction service pulls the data incrementally; the code table/term library, together with the most frequently accessed keywords, serves as the keyword library used to extract keywords from the fields; after extraction, all keywords are de-duplicated to generate a dictionary, and the synchronization/construction service writes the dictionary to the es library as the data index;
2) for a newly added table, the synchronization/construction service performs a full pull; the code table/term library, together with the most frequently accessed keywords, serves as the keyword library used to extract keywords from the fields; after extraction, all keywords are de-duplicated to generate a dictionary, and the synchronization/construction service writes the dictionary to the es library as the data index;
3. Data asset retrieval
the NLP2SQL service receives the content to be retrieved from the input entry point and, after passing it through the mapping code table, generates the SQL statement for the query; the NLP2SQL service parses the natural language by lexical analysis to generate machine-executable SQL statements;
the NLP service searches the redis cache first; if the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data results; if everything is hit, the search ends directly; the returned data results include: the library and table where the asset data is located and the position of the field containing the keyword, the lineage (blood-relationship) mapping between the field and other libraries and tables, the mapping between the field and labels, and the mapping between the field and indexes; if the redis cache does not hit everything, the data results must be obtained from the es library according to the index; if the index is not hit, the keyword does not exist in the es library, and if the index is hit, the data results are returned once retrieval completes;
the returned data results are collated and returned to the NLP2SQL front end for display as a list;
when a user requests the details of a listed item, the asset data must be requested again from the data middle platform, which queries hive through presto using the library, table, and row position of the asset data; the query results form TopN hot-spot data, which is cached in redis and synchronized to the es library at a set interval.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210275470.3A CN115080602B (en) 2022-03-21 2022-03-21 Method for realizing accurate search of data assets based on NLP algorithm

Publications (2)

Publication Number Publication Date
CN115080602A (en) 2022-09-20
CN115080602B (en) 2023-05-26

Family

ID=83247512

Country Status (1)

Country Link
CN (1) CN115080602B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6094649A (en) * 1997-12-22 2000-07-25 Partnet, Inc. Keyword searches of structured databases
AU2008259833A1 (en) * 2007-06-01 2008-12-11 Getty Images, Inc. Method and system for searching for digital assets

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101789006A (en) * 2010-01-29 2010-07-28 华东电网有限公司 Intelligent search based quick searching method of power grid enterprise information integrating system
CN103646032B (en) * 2013-11-11 2017-01-04 漆桂林 A kind of based on body with the data base query method of limited natural language processing
CN109739893B (en) * 2018-12-28 2022-04-22 上海尚往网络科技有限公司 Metadata management method, equipment and computer readable medium

Also Published As

Publication number Publication date
CN115080602A (en) 2022-09-20

Similar Documents

Publication Publication Date Title
US6970882B2 (en) Unified relational database model for data mining selected model scoring results, model training results where selection is based on metadata included in mining model control table
US6931408B2 (en) Method of storing, maintaining and distributing computer intelligible electronic data
US8219563B2 (en) Indexing mechanism for efficient node-aware full-text search over XML
CN104361127A (en) Multilanguage question and answer interface fast constituting method based on domain ontology and template logics
CN102184222B (en) Quick searching method in large data volume storage
CN107169033A (en) Relation data enquiring and optimizing method with parallel framework is changed based on data pattern
CN107291807A (en) A kind of SPARQL enquiring and optimizing methods based on figure traversal
CN109947796B (en) Caching method for query intermediate result set of distributed database system
CN112000773B (en) Search engine technology-based data association relation mining method and application
CN114116716A (en) Hierarchical data retrieval method, device and equipment
CN112231321B (en) Oracle secondary index and index real-time synchronization method
CN113190687B (en) Knowledge graph determining method and device, computer equipment and storage medium
CN113051382A (en) Intelligent power failure question-answering method and device based on knowledge graph
CN109241259A (en) Natural language querying method, apparatus and system based on ER model
US20070239656A1 (en) Removal of Database Query Function Calls
CN104391908A (en) Locality sensitive hashing based indexing method for multiple keywords on graphs
CN111324631B (en) Method for automatically generating sql statement by human natural language of query data
CN102508901A (en) Content-based massive image search method and content-based massive image search system
CN113934750A (en) Data blood relationship analysis method based on compiling mode
CN115840589A (en) Publishing method supporting heterogeneous distributed database
KR20100066919A (en) Triple indexing and searching scheme for efficient information retrieval
CN109446293B (en) Parallel high-dimensional neighbor query method
CN115080602B (en) Method for realizing accurate search of data assets based on NLP algorithm
CN106021306A (en) Ontology matching based case search system
CN103365960A (en) Off-line searching method of structured data of electric power multistage dispatching management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant