CN117453971A - Vectorized data retrieval management method and device - Google Patents

Vectorized data retrieval management method and device

Info

Publication number
CN117453971A
Authority
CN
China
Prior art keywords
data
vector
unstructured
queried
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311441225.6A
Other languages
Chinese (zh)
Inventor
邓永翠
颜晓华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Lingdong Shuzhi Technology Co ltd
Original Assignee
Nanjing Lingdong Shuzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Lingdong Shuzhi Technology Co ltd filed Critical Nanjing Lingdong Shuzhi Technology Co ltd
Priority to CN202311441225.6A priority Critical patent/CN117453971A/en
Publication of CN117453971A publication Critical patent/CN117453971A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2237Vectors, bitmaps or matrices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/252Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a vectorized data retrieval management method and device. When a data platform receives data to be queried, the data to be queried is classified and its corresponding classification label is determined; vectorization conversion is performed on unstructured data in the data to be queried to generate an unstructured vector; a target data area having the classification label is selected from the vector database at the current moment; and the target vector data corresponding to the unstructured vector is retrieved from the target data area and displayed. By placing a vector database inside the data platform, both structured and unstructured data can be queried, the data transmission risk introduced by an externally connected database is reduced, and the security and accuracy of data retrieval are improved.

Description

Vectorized data retrieval management method and device
Technical Field
The invention relates to the technical field of artificial intelligence data management, in particular to a vectorization data retrieval management method and device.
Background
With the rapid development of Internet and data technologies, data volumes have grown explosively, and managing data effectively on a traditional data platform has become one of the technical problems that enterprise management must solve. Among the many functions of data management, data retrieval is one that enterprise personnel use frequently in their daily work, and it must support queries over both structured and unstructured data.
Conventional data platforms typically do not support vectorized retrieval of unstructured data such as text, pictures, or other data types. To address this shortcoming, vectorized data query management technologies have emerged in recent years. They support unstructured data queries, provide more accurate data query and processing, and greatly improve compatibility, accuracy, and operating efficiency across different types of data queries.
However, vectorized data query still has problems in practical applications, such as incompatibility with the data query of a traditional data platform. At present, most enterprises connect an independent external vector database or server to support vectorized queries without modifying the data platform. With this approach it is difficult to query structured and unstructured data together inside the data platform, and connecting an external server or database increases the risk of data leakage, which reduces the security of vectorized data retrieval management.
Disclosure of Invention
The invention provides a vectorized data retrieval management method and device to solve the following technical problems: at present, most enterprises connect an independent external vector database or server to support vectorized queries without modifying the data platform; with this approach it is difficult to query structured and unstructured data together inside the data platform, and connecting an external server or database increases the risk of data leakage, which reduces the security of vectorized data retrieval management.
The invention provides a vectorization data retrieval management method, which is applied to a data platform, wherein the data platform comprises a preset vector database, and the method comprises the following steps:
when data to be queried is received, classifying the data to be queried, and determining a classification label corresponding to the data to be queried;
performing vectorization conversion on unstructured data in the data to be queried to generate unstructured vectors;
selecting a target data area with the classification label from a vector database at the current moment;
and retrieving and displaying the target vector data corresponding to the unstructured vector in the target data area.
Optionally, the step of classifying the data to be queried when the data to be queried is received, and determining a classification label corresponding to the data to be queried, includes:
when data to be queried is received, parsing the data to be queried and judging whether a query condition exists;
if a query condition exists, selecting an initial data area from the vector database according to the query condition, and determining the initial data area as the vector database at the current moment;
invoking a hierarchical classification interface service to classify the data to be queried, and determining a classification label corresponding to the data to be queried;
if no query condition exists, invoking the hierarchical classification interface service to classify the data to be queried, and determining a classification label corresponding to the data to be queried.
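The two branches above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the vector database is modeled as a plain list of row dictionaries, the query condition as a dictionary of metadata equality filters, and `classify` as any callable standing in for the hierarchical classification interface service. All of these names are hypothetical.

```python
def resolve_scope(rows, query_condition, query_text, classify):
    """Narrow the search scope by query condition (if any), then classify.

    rows            -- hypothetical stand-in for the vector database
    query_condition -- dict of metadata filters, or None when absent
    classify        -- stand-in for the hierarchical classification service
    """
    if query_condition:
        # A query condition exists: keep only rows whose metadata matches it.
        # This initial data area is treated as the vector database at the
        # current moment for the rest of the query.
        rows = [
            row for row in rows
            if all(row["metadata"].get(key) == value
                   for key, value in query_condition.items())
        ]
    # Both branches end by calling the classification service.
    label = classify(query_text)
    return rows, label
```

Either way the classification service is called exactly once; the query condition only changes which rows the later steps search.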
Optionally, the target data area stores a plurality of vectorized data; the step of retrieving and displaying the target vector data corresponding to the unstructured vector in the target data area comprises the following steps:
judging whether the number of the vectorized data is larger than or equal to a preset number threshold value;
if not, calculating a first vector similarity between each vectorized data and the unstructured vector;
if yes, adopting a general vector index algorithm to screen at least one intermediate vector from a plurality of vectorized data according to the unstructured vector;
calculating a second vector similarity between each of the intermediate vectors and the unstructured vector;
selecting target vector data according to the first vector similarity or the second vector similarity;
and returning the target vector data to a transmitting end to which the data to be queried belongs and displaying the target vector data.
Optionally, the step of selecting the target vector data according to the first vector similarity or the second vector similarity includes:
selecting a plurality of vectorized data in descending order of the first vector similarity or the second vector similarity, and determining them as the target vector data;
or selecting a plurality of vectorized data whose first vector similarity or second vector similarity falls within a preset similarity range, and determining them as the target vector data.
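The retrieval and selection steps above can be sketched as follows. The text does not name a similarity measure or a specific index algorithm, so this sketch assumes cosine similarity and uses a deliberately crude prefilter (`coarse_filter`, ranking by a partial dot product over the first few dimensions) as a stand-in for a real vector index. The function names and default values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity (an assumed measure; the text does not name one)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_filter(query, vectors, keep, dims=2):
    """Toy stand-in for a vector index: rank by a partial dot product."""
    def partial(v):
        return sum(q * x for q, x in zip(query[:dims], v[:dims]))
    order = sorted(range(len(vectors)),
                   key=lambda i: partial(vectors[i]), reverse=True)
    return order[:keep]


def retrieve(query, vectors, count_threshold=1000, top_k=3,
             sim_range=None, candidates=50):
    """Return (similarity, index) pairs for the target vector data."""
    if len(vectors) >= count_threshold:
        # Large area: screen intermediate vectors first (second similarity).
        pool = coarse_filter(query, vectors, candidates)
    else:
        # Small area: compare against every vector (first similarity).
        pool = range(len(vectors))
    scored = [(cosine(query, vectors[i]), i) for i in pool]
    if sim_range is not None:
        low, high = sim_range
        hits = [(s, i) for s, i in scored if low <= s <= high]
        return sorted(hits, reverse=True)
    # Otherwise take the top_k entries in descending order of similarity.
    return sorted(scored, reverse=True)[:top_k]
```

Passing `sim_range=(0.8, 1.0)` selects by a preset similarity range; leaving it as `None` selects the `top_k` most similar entries.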
Optionally, the sending end to which the data to be queried belongs is a functional service application of the data platform, and each functional service application is provided with a corresponding preset similarity range, and the method further includes:
returning the target vector data to the function service application;
and updating an existing model by the function service application by using the target vector data.
Optionally, the method further comprises:
obtaining unlabeled data in the data platform;
calling a large language model service to label the unlabeled data according to a preset hierarchical classification label system, and generating labeled data;
training a hierarchical classification interface service and a vectorization interface service according to the labeled data and the unlabeled data;
invoking the vectorization interface service to perform vectorization conversion on unstructured data in the data platform, and invoking the hierarchical classification interface service to classify and identify the unstructured data in the data platform, so as to create a corresponding business data table;
and loading all the business data tables, associating them with the vectorization interface service and the hierarchical classification interface service, and constructing the vector database.
Optionally, the step of training a hierarchical classification interface service and a vectorization interface service according to the labeled data and the unlabeled data includes:
fine-tuning the parameters of a plurality of preset first classification models with the labeled data to obtain a plurality of intermediate classification models;
screening out an intermediate classification model whose accuracy is greater than a preset classification threshold, determining it as the target classification model, and deploying it as the hierarchical classification interface service;
fine-tuning the parameters of, and screening, a plurality of preset second classification models with the unlabeled data to obtain a semantic extraction model;
and deploying the semantic extraction model as the vectorization interface service.
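The screening step for the intermediate classification models can be sketched as follows. The sketch assumes each fine-tuned model has already been scored on held-out labeled data. The `Candidate` type is hypothetical, and picking the most accurate model among those that pass the threshold is an assumption: the text only requires accuracy above the preset classification threshold.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A fine-tuned intermediate classification model and its accuracy."""
    name: str
    accuracy: float  # assumed to be measured on held-out labeled data


def screen_models(candidates, threshold):
    """Return the target classification model, or None if none qualifies."""
    passing = [c for c in candidates if c.accuracy > threshold]
    if not passing:
        return None
    # Assumption: when several models pass, deploy the most accurate one.
    return max(passing, key=lambda c: c.accuracy)
```

A `None` result means no intermediate model is good enough yet, in which case the corpus and models would need further iteration before deployment.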
Optionally, the step of invoking the vectorization interface service to perform vectorization conversion on unstructured data in the data platform, and invoking the hierarchical classification interface service to classify and identify the unstructured data in the data platform, so as to create a corresponding business data table, includes:
creating an initial data table on the data platform; the initial data table includes a plurality of job category fields;
invoking the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data in the data platform;
converting the unstructured data into vectorized data through the vectorization interface service;
classifying and identifying the unstructured data, associating it with the vectorized data, and determining a classification label field corresponding to the unstructured data;
and storing the vectorized data and the classification label field into the corresponding job category fields to generate a business data table.
Optionally, the method further comprises:
in response to input login information, matching the data authority corresponding to the login information; the data authority comprises a multi-level labeling authority and a multi-level user authority;
when a user operation instruction is received and conforms to the data authority, executing the management operation corresponding to the user operation instruction on the vector database;
when an operation completion instruction is received, checking the hierarchical classification interface service and the vectorization interface service to generate a check result;
when an evaluation-failed instruction input in response to the check result is received, jumping to and executing the step of calling the large language model service to label the unlabeled data and generate labeled data;
and when an evaluation-passed instruction input in response to the check result is received, maintaining the hierarchical classification interface service and the vectorization interface service as they are at the current moment.
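The authority check in the second step above can be sketched as follows. The text does not specify how the multi-level labeling authority and multi-level user authority are encoded, so this sketch assumes integer levels and a hypothetical table of minimum levels per operation. The operation names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAuthority:
    """Hypothetical encoding: a higher number means broader rights."""
    labeling_level: int
    user_level: int


# Hypothetical minimum (labeling_level, user_level) for each operation.
REQUIRED_LEVELS = {
    "view": (0, 1),
    "relabel": (2, 1),
    "delete": (2, 3),
}


def is_permitted(authority, operation):
    """Check whether an operation instruction conforms to the authority."""
    required = REQUIRED_LEVELS.get(operation)
    if required is None:
        return False  # unknown operations are rejected
    need_labeling, need_user = required
    return (authority.labeling_level >= need_labeling
            and authority.user_level >= need_user)
```

A management operation on the vector database would run only when `is_permitted` returns `True` for the logged-in user's authority.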
Optionally, the method further comprises:
when a business update data table is received, judging whether the business update data table includes the job category fields;
if not, adding the job category fields to the business update data table, and invoking the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data in the business update data table;
if yes, invoking the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data in the business update data table;
and jumping to and executing the step of converting the unstructured data into vectorized data through the vectorization interface service.
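The update flow above can be sketched as follows. The update table is modeled as a list of row dictionaries. The field names `raw_data`, `class_label`, and `vector` are hypothetical, as are the `vectorize` and `classify` callables standing in for the two interface services.

```python
# Hypothetical names for the job category fields an update table must carry.
JOB_CATEGORY_FIELDS = ("class_label", "vector")


def ingest_update_table(rows, vectorize, classify, raw_field="raw_data"):
    """Add missing job category fields, then vectorize and classify rows."""
    updated = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for field in JOB_CATEGORY_FIELDS:
            # "If not" branch: add the field when the table lacks it.
            new_row.setdefault(field, None)
        # Both branches continue with extraction and vectorization.
        raw = new_row[raw_field]
        new_row["vector"] = vectorize(raw)
        new_row["class_label"] = classify(raw)
        updated.append(new_row)
    return updated
```

Rows that already carry the fields are reprocessed as well, which matches the "if yes" branch: it still extracts the unstructured data and jumps to the conversion step.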
The invention provides a vectorization data retrieval management device which is applied to a data platform, wherein the data platform comprises a preset vector database, and the device comprises:
the classification response module is used for classifying the data to be queried when the data to be queried is received, and determining a classification label corresponding to the data to be queried;
the vectorization conversion module is used for performing vectorization conversion on unstructured data in the data to be queried to generate unstructured vectors;
the data area selection module is used for selecting a target data area with the classification tag from a vector database at the current moment;
and the target vector data retrieval module is used for retrieving and displaying the target vector data corresponding to the unstructured vector in the target data area.
From the above technical scheme, the invention has the following advantages:
When the data platform receives data to be queried, the data to be queried is classified and its corresponding classification label is determined; vectorization conversion is performed on unstructured data in the data to be queried to generate an unstructured vector; a target data area having the classification label is selected from the vector database at the current moment; and the target vector data corresponding to the unstructured vector is retrieved from the target data area and displayed. By placing a vector database inside the data platform, both structured and unstructured data can be queried, the data transmission risk introduced by an externally connected database is reduced, and the security and accuracy of data retrieval are improved.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart illustrating a method for managing vectorized data retrieval according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for managing vectorized data retrieval according to an embodiment of the present invention;
FIG. 3 is a schematic view of rights management for a data platform according to an embodiment of the present invention;
FIG. 4 is a flowchart of a vectorized data query and application provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a vectorized data retrieval management device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a vectorized data retrieval management method and device to solve the following technical problems: at present, most enterprises connect an independent external vector database or server to support vectorized queries without modifying the data platform; with this approach it is difficult to query structured and unstructured data together inside the data platform, and connecting an external server or database increases the risk of data leakage, which reduces the security of vectorized data retrieval management.
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions in the embodiments of the present invention are described in detail below with reference to the accompanying drawings, and it is apparent that the embodiments described below are only some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a method for managing vectorized data retrieval according to an embodiment of the present invention.
The invention provides a vectorization data retrieval management method, which is applied to a data platform, wherein the data platform comprises a preset vector database, and the method comprises the following steps:
Step 101, when data to be queried is received, classifying the data to be queried and determining a classification label corresponding to the data to be queried;
The data to be queried is the data that defines a query. It may include query statements, query conditions, query keywords, query regions, and the like, and it may take the form of text, pictures, voice, or other forms.
The classification label identifies the data area in the vector database that the data to be queried requires.
In the embodiment of the application, when the data platform receives externally input data to be queried, it classifies the data by calling the hierarchical classification interface service in the data platform and determines the corresponding classification label.
Step 102, performing vectorization conversion on unstructured data in the data to be queried to generate an unstructured vector;
After the classification label of the data to be queried is obtained, if unstructured data exists in the data to be queried, the vectorization interface service can be called to perform vectorization conversion on it, which yields a semantic representation of the data to be queried and generates the unstructured vector. If the data to be queried is structured data, it is used directly as the unstructured vector.
It should be noted that the unstructured vector can be obtained by converting the data to be queried into a vector array with a fine-tuned model such as BERT or RoBERTa. Compared with traditional machine learning methods (such as word2vec and doc2vec), these models capture the context of the text more fully and yield a better semantic representation.
Step 103, selecting a target data area with a classification label from a vector database at the current moment;
A vector database is a database made up of a plurality of data areas, each carrying a classification label. Each data area stores at least one business data table, and each business data table includes field data such as a vectorized data field, a metadata field, and a classification label field.
In a specific implementation, the vector database is updated as vectorized data is continuously updated, so searching with the unstructured vector alone may become less efficient as the data volume grows. After the data to be queried that is input to the data platform has been vectorized and its classification label determined, the data platform can perform a preliminary retrieval on the vector database at the current moment by classification label and determine the target data area having that label.
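The preliminary retrieval by classification label can be sketched as follows. The text does not say how hierarchical labels are encoded, so this sketch assumes slash-separated label paths (for example `contract/sales`) and models the vector database as a dictionary from label path to rows. Both choices are hypothetical.

```python
def select_target_area(database, label):
    """Collect rows from every data area at or below the given label.

    database -- dict mapping a label path such as "contract/sales" to rows
    label    -- classification label of the data to be queried
    """
    prefix = label.rstrip("/") + "/"
    target_area = []
    for area_label, rows in database.items():
        # Match the label itself or any of its descendants in the hierarchy.
        if area_label == label or area_label.startswith(prefix):
            target_area.extend(rows)
    return target_area
```

Matching on `label + "/"` rather than on the bare prefix keeps a label such as `contract` from also selecting an unrelated area named `contractor`.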
Step 104, retrieving and displaying the target vector data corresponding to the unstructured vector in the target data area.
In this embodiment of the application, once the target data area is determined, the amount of data to search is reduced, so the similarity between the unstructured vector and each piece of vector data in the target data area can be calculated. A plurality of vector data are then selected, either those whose similarity exceeds a threshold or the top entries in descending order of similarity, taken as the target vector data, and returned for display.
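The embodiment does not name the similarity measure. Assuming cosine similarity, a common choice for semantic vectors, the threshold-based selection can be written as follows, where $\mathbf{q}$ is the unstructured vector, $\mathbf{v}_i$ is a vector in the target data area $A$, and $\tau$ is the similarity threshold. These symbols are introduced here for illustration only.

```latex
\[
\operatorname{sim}(\mathbf{q}, \mathbf{v}_i)
  = \frac{\mathbf{q} \cdot \mathbf{v}_i}
         {\lVert \mathbf{q} \rVert \, \lVert \mathbf{v}_i \rVert},
\qquad
T = \{\, \mathbf{v}_i \in A : \operatorname{sim}(\mathbf{q}, \mathbf{v}_i) > \tau \,\}
\]
```

Under the alternative rule, $T$ is instead the set of the $k$ vectors in $A$ with the largest $\operatorname{sim}(\mathbf{q}, \mathbf{v}_i)$.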
It should be noted that the target vector data may be a plain array. For ease of viewing and display, it can be restored to corresponding displayable information such as text, pictures, audio, or video.
In the embodiment of the application, when the data platform receives data to be queried, the data to be queried is classified and its corresponding classification label is determined; vectorization conversion is performed on unstructured data in the data to be queried to generate an unstructured vector; a target data area having the classification label is selected from the vector database at the current moment; and the target vector data corresponding to the unstructured vector is retrieved from the target data area and displayed. By placing a vector database inside the data platform, both structured and unstructured data can be queried, the data transmission risk introduced by an externally connected database is reduced, and the security and accuracy of data retrieval are improved.
Referring to fig. 2, fig. 2 is a flowchart illustrating steps of a vectorized data retrieval management method according to an embodiment of the present invention.
The invention provides a vectorization data retrieval management method, which is applied to a data platform, wherein the data platform comprises a preset vector database, and the method comprises the following steps:
Step 201, obtaining unlabeled data in the data platform;
Unlabeled data refers to the various types of structured or unstructured data within the enterprise system to which the data platform belongs, including but not limited to text data, picture data, or other types of data.
In the embodiment of the application, the unlabeled data is obtained through the data platform and serves as the training data for the subsequent interface services.
Step 202, calling a large language model service to label the unlabeled data according to a preset hierarchical classification label system, and generating labeled data;
After the unlabeled data is obtained, because data types and data contents differ from one enterprise system to another, a large language model service is called to label the unlabeled data according to a preset hierarchical classification label system and generate the labeled data.
It should be noted that the hierarchical classification label system for an enterprise can be constructed by performing cluster analysis on existing data with the help of industry business knowledge. Once the label system is obtained, it is supplied to a large language model service such as Llama, ChatGLM, Qwen, or OpenAI, and the unlabeled data is labeled automatically by changing the prompt text, which generates the corresponding labeled data. Unlike existing approaches that use large model inference directly, this method builds a vector database for a vertical domain and introduces prompt information rich in industry business knowledge into the large model application scenario, so that more accurate large model inference results can be obtained.
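Prompt-based labeling can be sketched as follows. The sketch only builds the prompt text and validates a reply. It makes no call to any model service, and the prompt wording, the two-level `parent/child` label format, and both function names are hypothetical.

```python
def build_labeling_prompt(label_system, text):
    """Embed a two-level label hierarchy in the prompt for one document.

    label_system -- dict mapping a parent label to its child labels
    """
    hierarchy = "\n".join(
        f"- {parent}: {', '.join(children)}"
        for parent, children in label_system.items()
    )
    return (
        "Assign exactly one label from the hierarchy below to the text.\n"
        "Answer in the form parent/child.\n"
        f"{hierarchy}\n"
        f"Text: {text}\n"
        "Label:"
    )


def parse_label(reply, label_system):
    """Accept the model reply only if it names a label in the hierarchy."""
    reply = reply.strip()
    parent, _, child = reply.partition("/")
    if child in label_system.get(parent, []):
        return reply
    return None  # reject labels outside the preset label system
```

Validating the reply against the label system keeps labels the model invents out of the labeled data.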
Step 203, training a hierarchical classification interface service and a vectorization interface service according to the labeled data and the unlabeled data;
Optionally, step 203 may include the following sub-steps:
fine-tuning the parameters of a plurality of preset first classification models with the labeled data to obtain a plurality of intermediate classification models;
screening out an intermediate classification model whose accuracy is greater than a preset classification threshold, determining it as the target classification model, and deploying it as the hierarchical classification interface service;
fine-tuning the parameters of, and screening, a plurality of preset second classification models with the unlabeled data to obtain a semantic extraction model;
and deploying the semantic extraction model as the vectorization interface service.
Because large model services are expensive to deploy or use and slow at prediction, a classification model with fewer parameters that still captures context well, such as BERT or RoBERTa, can be substituted in production. After the corpus of unlabeled data has been preliminarily labeled, BERT-series classification models are fine-tuned on the categories of the labeled data, and the corpus and models are iteratively optimized to obtain a plurality of intermediate classification models.
To further improve classification accuracy, an intermediate classification model whose accuracy is greater than a preset classification threshold can be screened out from the plurality of intermediate classification models, determined as the target classification model, and deployed as the hierarchical classification interface service.
In this embodiment, model algorithms such as BERT and RoBERTa can be used to fine-tune and screen a plurality of preset second classification models on the unlabeled data to obtain a semantic extraction model. This captures the context of the text more fully and yields its semantic representation. The semantic extraction model can be deployed as an API service, namely the vectorization interface service, for vectorizing unstructured data later.
In a specific implementation, if the semantic extraction model for the vectorization interface service is used in an industry that depends heavily on domain knowledge and has accumulated a large amount of data, an industry-specific vectorization model can be trained on the existing data for subsequent deployment as the vectorization model service. If the semantic representation of a general-purpose model already achieves the required business effect, the vectorization model need not be trained separately. Vectorization processing is accelerated with GPUs and deployed independently as an API service, separate from database queries, that is, storage and computation are separated. The vectorization model is fine-tuned with industry data so that it carries more semantic representations of industry knowledge.
Step 204, calling a vectorization interface service to vectorize and convert unstructured data in the data platform, and calling a hierarchical classification interface service to classify and identify the unstructured data in the data platform, so as to create a corresponding business data table;
Optionally, step 204 may include the sub-steps of:
creating an initial data table on a data platform; the initial data table includes a plurality of function category fields;
invoking a hierarchical classification interface service and a vectorization interface service to respectively extract unstructured data in a data platform;
converting unstructured data into vectorized data through vectorized interface service;
classifying and identifying unstructured data, correlating the unstructured data with vectorized data, and determining a classification label field corresponding to the unstructured data;
and respectively storing the vectorized data and the classification label fields into the corresponding function category fields to generate a business data table.
The function category fields are the fields that store the attributes of unstructured data and vectorized data, including, but not limited to, unstructured data fields, metadata fields, classification label fields, and vectorized data fields. Examples of unstructured data are text, pictures, audio, and video; examples of structured data are sales amount and sales quantity.
In an embodiment of the present application, at least one initial data table may be created on the data platform, together with an unstructured data field, a metadata field, a classification label field, and a vectorized data field for storing the different kinds of data, so that each initial data table includes a plurality of function category fields. The hierarchical classification interface service and the vectorization interface service are invoked to respectively extract the unstructured data in the data platform; the unstructured data is converted into vectorized data through the vectorization interface service; the unstructured data is classified and identified, associated with the vectorized data, and its classification label field is determined; and the vectorized data and the classification label fields are respectively stored into the corresponding function category fields to generate a business data table.
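The table-building flow above can be sketched as follows. This is an illustration under stated assumptions: `BusinessRow`, `toy_vectorize`, and `toy_classify` are hypothetical stand-ins, and in the described system the last two would be calls to the vectorization interface service and the hierarchical classification interface service.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessRow:
    unstructured_data: str                       # unstructured data field
    metadata: dict                               # metadata field, e.g. the data source
    class_label: str = ""                        # classification label field
    vector: list = field(default_factory=list)   # vectorized data field


def toy_vectorize(text):
    # Stand-in for the vectorization interface service (not a real embedding).
    return [float(len(text)), float(text.count(" "))]


def toy_classify(text):
    # Stand-in for the hierarchical classification interface service.
    return "finance/invoice" if "invoice" in text.lower() else "general"


def build_business_table(documents, vectorize=toy_vectorize, classify=toy_classify):
    """Store each document with its label and vector in one row."""
    table = []
    for text, metadata in documents:
        table.append(BusinessRow(text, metadata, classify(text), vectorize(text)))
    return table


documents = [
    ("Invoice 42 for office chairs", {"source": "erp"}),
    ("Meeting notes from Monday", {"source": "wiki"}),
]
table = build_business_table(documents)
```

Keeping the label and the vector in the same row as the original data is what lets later queries filter by label first and compare vectors second.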
Step 205, loading all business data tables, associating vectorization interface service and hierarchical classification interface service, and constructing a vector database;
It should be noted that existing vector databases provide only vector computing capability, whereas the vector database of this application both performs vector computation and retains the capabilities of a traditional data platform. The system is simple, portable, and flexible: it supports independent query and processing of unstructured data and of structured data, and it also supports query and processing of the two combined. It thus remains compatible with the capabilities of a traditional data platform while adding vectorized query, which improves the accuracy, variety, and running speed of data queries.
Further, the method comprises the following steps:
when a business update data table is received, judging whether the business update data table includes the function category field;
if the function category field does not exist, adding the function category field to the business update data table, and calling the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data in the business update data table;
if the function category field exists, calling the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data in the business update data table;
jumping to execute the step of converting unstructured data into vectorized data through the vectorization interface service.
On the data platform, an existing data table must accommodate unstructured data and structured data at the same time. The unstructured data is vectorized, and the resulting vectorized data is stored in the existing data table. The classification labels are used to manage the unstructured data and to make its retrieval convenient and accurate, and the metadata further describes information such as the source of the unstructured data. If an existing data table of the data platform is used, the metadata can be omitted.
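The field check for an incoming table can be sketched as follows. The dictionary layout of the table and the field names `class_label` and `vector` are assumptions made for illustration; the metadata field is left out because the text allows it to be omitted for existing tables.

```python
REQUIRED_FIELDS = ("class_label", "vector")  # hypothetical field names


def ensure_function_fields(table):
    """Add any missing function category field and return the names that were added.

    `table` is assumed to be a dict with a "columns" list and a "rows" list of dicts.
    """
    added = []
    for name in REQUIRED_FIELDS:
        if name not in table["columns"]:
            table["columns"].append(name)
            for row in table["rows"]:
                row[name] = None  # filled later by the classification and vectorization services
            added.append(name)
    return added


existing = {
    "columns": ["id", "content", "class_label"],
    "rows": [{"id": 1, "content": "contract draft", "class_label": "legal"}],
}
added = ensure_function_fields(existing)
```

Whether or not fields had to be added, the flow then continues with extraction and vectorization of the unstructured data, as in the steps above.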
In one example of the present application, the method may further comprise the steps of:
responding to the input login information, and matching the data authority corresponding to the login information; the data authority comprises a multi-level labeling authority and a multi-level user authority;
when a user operation instruction is received and conforms to the data authority, executing the management operation corresponding to the user operation instruction on the vector database;
when an operation completion instruction is received, verifying the hierarchical classification interface service and the vectorization interface service to generate an execution result;
when an evaluation-fail instruction input in response to the execution result is received, jumping to execute the step of calling the large language model service to label the unlabeled data and generate labeled data;
When an evaluation pass instruction input in response to an execution result is received, the hierarchical classification interface service and the vectorization interface service at the current time are maintained.
Data inside an enterprise differs in privacy and security levels, so a corresponding security management system, covering for example data import authority, employee data query relationships, and cross-department data query authority, should be set up for managing it.
Therefore, in the embodiment of the application, in response to input login information, the data authority corresponding to the login information can be matched according to the login information and a preset data authority table; the data authority comprises a multi-level labeling authority and a multi-level user authority. When a user operation instruction is received and conforms to the data authority, the corresponding management operation, such as deletion, review, or labeling, is executed on the vector database. When an operation completion instruction is received, the hierarchical classification interface service and the vectorization interface service are executed on the data newly added, modified, or labeled in the vector database at the current moment to determine its classification labels and vectorized data, and the execution result is generated and displayed. If an evaluation-fail instruction is input for the execution result, erroneous data or erroneous classifications may exist in the current vector database, and the flow can jump to step 202 to update the models of the hierarchical classification interface service and the vectorization interface service. If an evaluation-pass instruction is input for the execution result, the hierarchical classification interface service and the vectorization interface service at the current moment are maintained. In the vector database, existing data is corrected according to the preset hierarchical classification labels; in addition, the updated data can be pulled through the AI platform to retrain the hierarchical classification model, evaluate its accuracy, and update the deployed model service.
In a specific implementation, as shown in fig. 3, data management inside an enterprise may set different labeling rights and user rights for different users.
For labeling rights, the roles from highest to lowest authority are: management personnel, review personnel, and labeling personnel. Management personnel have all rights over the vector database, including personnel creation, rights authorization, corpus allocation, corpus labeling, and corpus review; review personnel have corpus labeling and corpus review rights; labeling personnel have only the corpus labeling right.
For user rights, the roles from highest to lowest authority are: super user, internal user, and external user. The super user has all rights over the database, including sharing, creation, authorization, and database insertion, deletion, modification, and query. The sharing right means sharing a specified vector database with other users through authorization; the creation right means that internal users and external users can be created; the authorization right means granting different rights to users. Internal users generally have the rights to insert into, modify, and query the database; external users typically have only the database query right.
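The two rights hierarchies can be sketched as a lookup table. The role and operation names below are hypothetical labels chosen for illustration; the sets mirror the rights listed in the two paragraphs above.

```python
# Labeling rights, from highest to lowest authority.
LABELING_RIGHTS = {
    "manager": {"create_personnel", "authorize", "assign_corpus", "label_corpus", "review_corpus"},
    "reviewer": {"label_corpus", "review_corpus"},
    "labeler": {"label_corpus"},
}

# User rights, from highest to lowest authority.
USER_RIGHTS = {
    "super_user": {"share", "create", "authorize", "insert", "delete", "update", "query"},
    "internal_user": {"insert", "update", "query"},
    "external_user": {"query"},
}


def is_allowed(role, operation):
    """Return True if the role's rights include the requested operation."""
    rights = LABELING_RIGHTS.get(role) or USER_RIGHTS.get(role) or set()
    return operation in rights


def execute_if_allowed(role, operation, action):
    """Run `action` only when the operation conforms to the role's data authority."""
    if not is_allowed(role, operation):
        raise PermissionError(f"{role} may not perform {operation}")
    return action()
```

For example, `execute_if_allowed("external_user", "query", run_query)` would run a hypothetical `run_query` callable, while the same role requesting `"delete"` would raise `PermissionError`.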
Step 206, classifying the data to be queried when the data to be queried is received, and determining a classification label corresponding to the data to be queried;
optionally, step 206 may include the sub-steps of:
when data to be queried is received, analyzing the data to be queried, and judging whether query conditions exist or not;
if a query condition exists, selecting an initial data area from the vector database according to the query condition, and determining the initial data area as the vector database at the current moment;
invoking a hierarchical classification interface service to classify the data to be queried, and determining a classification label corresponding to the data to be queried;
and if no query condition exists, calling the hierarchical classification interface service to classify the data to be queried, and determining a classification label corresponding to the data to be queried.
In a specific implementation, the data to be queried may carry a query condition, so when the data to be queried is received it can be parsed to determine whether a query condition exists. If one exists, an initial data area is selected from the vector database according to the query condition, such as a query statement, and is used as the vector database at the current moment; the hierarchical classification interface service is then called to classify the data to be queried and determine its classification label. If no query condition exists, the hierarchical classification interface service is called directly to classify the data to be queried and determine its classification label.
The query condition refers to descriptive data that defines the initial data area, and may include, but is not limited to, structured data, query statements, or data identifiers.
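The branching on the query condition can be sketched as follows. The query layout (a dict with `text` and an optional `condition` of field/value pairs), the row layout, and `toy_classify` are assumptions for illustration; a real query condition could equally be a query statement evaluated by the data platform.

```python
def select_initial_region(rows, condition):
    """Narrow the rows by a structured query condition; with no condition, keep all rows."""
    if not condition:
        return rows
    return [row for row in rows if all(row.get(k) == v for k, v in condition.items())]


def toy_classify(text):
    # Stand-in for the hierarchical classification interface service.
    return "contract" if "contract" in text else "invoice"


def prepare_query(query, rows, classify=toy_classify):
    """Return the data area to search and the classification label of the query."""
    region = select_initial_region(rows, query.get("condition"))
    label = classify(query["text"])
    return region, label


rows = [
    {"id": 1, "department": "sales", "class_label": "contract"},
    {"id": 2, "department": "legal", "class_label": "contract"},
    {"id": 3, "department": "sales", "class_label": "invoice"},
]
region, label = prepare_query(
    {"text": "renewal contract terms", "condition": {"department": "sales"}}, rows
)
```

The returned label is then used to pick the target data area inside `region`, which is the next step of the method.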
Step 207, performing vectorization conversion on unstructured data in the data to be queried to generate unstructured vectors;
Step 208, selecting a target data area with a classification label from a vector database at the current moment;
In the embodiment of the present application, the specific implementation process of steps 207-208 is similar to that of steps 102-103 and is not repeated here.
Step 209, retrieving and displaying the target vector data corresponding to the unstructured vector in the target data area.
Optionally, the target data area stores a plurality of vectorized data; step 209 may comprise the sub-steps of:
judging whether the number of the vectorized data is larger than or equal to a preset number threshold value;
if not, calculating a first vector similarity between each vectorized data and the unstructured vector;
if yes, adopting a general vector index algorithm to screen at least one intermediate vector from a plurality of vectorized data according to unstructured vectors;
calculating a second vector similarity between each intermediate vector and the unstructured vector;
Selecting target vector data according to the first vector similarity or the second vector similarity;
and returning the target vector data to a transmitting end to which the data to be queried belong and displaying the target vector data.
In a specific implementation, the hierarchical classification interface service can be used to obtain the classification label of the data, and the search is then performed directly in the classified data area corresponding to that label. If the classified data area holds fewer than 20,000 items, a brute-force search can be performed directly: the first vector similarity between each vectorized data item and the unstructured vector is calculated, and the target vector data is selected according to the first vector similarity. If the classified data area holds more data, a general vector index algorithm (for example, the graph-based HNSW index) is used in combination: at least one intermediate vector is screened from the plurality of vectorized data, the second vector similarity between each intermediate vector and the unstructured vector is calculated, and the target vector data is selected according to the second vector similarity, which improves query efficiency.
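The switch between brute-force search and index-assisted search can be sketched as follows. Several points are assumptions: cosine similarity is used as the similarity measure (the text does not fix one), `candidate_index` is a hypothetical callable standing in for a real approximate index such as HNSW, and falling back to brute force when no index is supplied is a choice made for this sketch.

```python
import math

BRUTE_FORCE_LIMIT = 20_000  # the data-volume threshold mentioned in the text


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, vectors, candidate_index=None, limit=BRUTE_FORCE_LIMIT):
    """Return (position, similarity) pairs sorted by descending similarity.

    Below `limit` every stored vector is scored. At or above `limit`, only the
    positions returned by `candidate_index(query)` are scored.
    """
    if len(vectors) < limit or candidate_index is None:
        positions = range(len(vectors))       # brute-force path
    else:
        positions = candidate_index(query)    # index path: intermediate vectors only
    scored = [(i, cosine(query, vectors[i])) for i in positions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = search([1.0, 0.0], vectors)
```

In practice the index path trades a small loss of recall for not scoring every vector, which is why the text reserves it for larger data areas.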
Further, the step of selecting the target vector data according to the first vector similarity or the second vector similarity includes:
selecting a plurality of vectorized data in descending order of the first vector similarity or the second vector similarity, and determining them as the target vector data;
or selecting a plurality of vectorized data whose first vector similarity or second vector similarity falls within a preset similarity range, and determining them as the target vector data.
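The two selection rules can be sketched as follows, operating on `(item, similarity)` pairs. The function names, the parameter `k`, and the sample scores are hypothetical.

```python
def select_top_k(scored, k):
    """Keep the k pairs with the highest similarity, in descending order."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def select_in_range(scored, low, high=1.0):
    """Keep the pairs whose similarity lies within the preset range [low, high]."""
    return [pair for pair in scored if low <= pair[1] <= high]


# Illustrative scores only.
scored = [("doc-a", 0.91), ("doc-b", 0.42), ("doc-c", 0.77), ("doc-d", 0.88)]
top_two = select_top_k(scored, k=2)
in_range = select_in_range(scored, low=0.75, high=0.90)
```

A range-based rule suits the case where each calling application sets its own similarity range, since the number of returned items then adapts to how many close matches exist.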
In an example of the present application, the sending end to which the data to be queried belongs is a function service application of the data platform, each function service application is provided with a corresponding preset similarity range, and the method further includes:
returning the target vector data to the function service application;
updating an existing model with the target vector data through the function service application.
Referring to fig. 4, fig. 4 is a flowchart of a vectorized data query and application in an embodiment of the present invention.
In the embodiment of the application, a user inputs the data to be queried, and the AI platform returns the vector representation of the data and the category label assigned by the hierarchical classification model. The vector and the category label of the data to be queried are input into the data platform, related data is screened according to the category and any other conditions, and the distance or similarity between the query vector and each data vector that meets the conditions is calculated. The data with the closest or most similar results is returned together with other information, and the query results that meet the conditions are obtained according to the thresholds set by the different applications. Accurate information can thus be provided quickly to downstream function service applications such as knowledge base question answering, personalized recommendation, large model applications, and sensitive word filtering, so that a function service application can use the target vector data to update existing models in the data platform, such as the models in the hierarchical classification interface service and the vectorization interface service.
In the embodiment of the application, when the data platform receives the data to be queried, the data to be queried is classified and its classification label is determined; the unstructured data in the data to be queried is vectorized to generate an unstructured vector; a target data area with the classification label is selected from the vector database at the current moment; and the target vector data corresponding to the unstructured vector is retrieved from the target data area and displayed. The method is therefore suitable for building a vector database for small-scale enterprise data on top of an existing data platform: by introducing a data vectorization service and a hybrid index, query results can be obtained quickly and accurately without changing the structure of the data platform. The method can also be applied to large model applications in different vertical fields to further improve their effect. Because the vector database is placed inside the data platform, both structured and unstructured data can be queried, the data transmission risk of connecting to an external database is reduced, and the security and accuracy of data retrieval are improved.
Referring to fig. 5, fig. 5 is a block diagram illustrating a vectorized data retrieval management apparatus according to an embodiment of the present application.
The invention provides a vectorization data retrieval management device, which is applied to a data platform, wherein the data platform comprises a preset vector database, and the device comprises:
the classification response module 501 is configured to classify data to be queried when receiving the data to be queried, and determine a classification label corresponding to the data to be queried;
the vectorization conversion module 502 is configured to vectorize and convert unstructured data in the data to be queried to generate unstructured vectors;
a data area selection module 503, configured to select a target data area with a classification tag from a vector database at the current time;
the target vector data retrieving module 504 is configured to retrieve and display target vector data corresponding to the unstructured vector in the target data area.
Optionally, the classification response module 501 is specifically configured to:
when data to be queried is received, analyzing the data to be queried, and judging whether query conditions exist or not;
if a query condition exists, selecting an initial data area from the vector database according to the query condition, and determining the initial data area as the vector database at the current moment;
invoking a hierarchical classification interface service to classify the data to be queried, and determining a classification label corresponding to the data to be queried;
and if no query condition exists, calling the hierarchical classification interface service to classify the data to be queried, and determining a classification label corresponding to the data to be queried.
Optionally, the target data area stores a plurality of vectorized data; the target vector data retrieval module 504 includes:
the quantity judging sub-module is used for judging whether the quantity of the vectorized data is larger than or equal to a preset quantity threshold value;
a first similarity calculation submodule, configured to calculate a first vector similarity between each vectorized data and the unstructured vector if not;
the intermediate vector screening sub-module is used for, if yes, screening at least one intermediate vector from the plurality of vectorized data by adopting a general vector index algorithm according to the unstructured vector;
a second similarity calculation submodule for calculating a second vector similarity between each intermediate vector and the unstructured vector;
the data screening sub-module is used for selecting target vector data according to the first vector similarity or the second vector similarity;
and the data return sub-module is used for returning the target vector data to the transmitting end to which the data to be queried belong and displaying the target vector data.
Optionally, the data screening submodule is specifically configured to:
selecting a plurality of vectorized data from large to small according to the first vector similarity or the second vector similarity, and determining the vectorized data as target vector data;
Or selecting a plurality of vectorized data with the first vector similarity or the second vector similarity in a preset similarity range to determine the vectorized data as target vector data.
Optionally, the sending end to which the data to be queried belongs is a function service application of the data platform, each function service application is provided with a corresponding preset similarity range, and the device further includes:
the application data return module is used for returning the target vector data to the function service application;
and the model updating module is used for updating the existing model by using the target vector data through the function service application.
Optionally, the apparatus further comprises:
the unlabeled data acquisition module is used for acquiring unlabeled data in the data platform;
the data labeling module is used for calling the large language model service to label the unlabeled data according to a preset hierarchical classification label system to generate labeled data;
the service generation module is used for training the hierarchical classification interface service and the vectorization interface service according to the marked data and the unmarked data;
the service calling module is used for calling the vectorization interface service to vectorize and convert unstructured data in the data platform, calling the hierarchical classification interface service to classify and identify the unstructured data in the data platform and creating a corresponding service data table;
And the vector database generating module is used for loading all the service data tables, correlating the vectorization interface service and the hierarchical classification interface service, and constructing a vector database.
Optionally, the service generating module is specifically configured to:
performing parameter fine adjustment on a plurality of preset first classification models by adopting marking data to obtain a plurality of intermediate classification models;
the intermediate classification model with screening accuracy greater than a preset classification threshold is determined as a target classification model and deployed into a hierarchical classification interface service;
performing parameter fine adjustment and screening on a plurality of preset second classification models by adopting unlabeled data to obtain semantic extraction models;
the semantic extraction model is deployed as a vectorized interface service.
Optionally, the service calling module is specifically configured to:
creating an initial data table on a data platform; the initial data table includes a plurality of function category fields;
invoking a hierarchical classification interface service and a vectorization interface service to respectively extract unstructured data in a data platform;
converting unstructured data into vectorized data through vectorized interface service;
classifying and identifying unstructured data, correlating the unstructured data with vectorized data, and determining a classification label field corresponding to the unstructured data;
and respectively storing the vectorized data and the classification label fields into the corresponding function category fields to generate a business data table.
Optionally, the apparatus further comprises:
the permission matching module is used for responding to the input login information and matching the data permission corresponding to the login information; the data authority comprises a multi-level labeling authority and a multi-level user authority;
the operation module is used for executing the management operation corresponding to the user operation instruction on the vector database when the user operation instruction is received and conforms to the data authority;
the service verification module is used for verifying the hierarchical classification interface service and the vectorization interface service when receiving an operation completion instruction, and generating a verification result;
the service updating module is used for jumping to execute the step of calling the large language model service to label the unlabeled data and generating the labeled data when receiving an evaluation failing instruction input in response to the execution result;
and the service maintaining module is used for maintaining the hierarchical classification interface service and the vectorization interface service at the current moment when receiving an evaluation passing instruction input in response to the execution result.
Optionally, the apparatus further comprises:
the data table updating module is used for judging whether the service updating data table comprises a function category field or not when the service updating data table is received;
The first judging module is used for adding a function category field on the service updating data table if the function category field does not exist, and calling the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data in the service updating data table;
the second judging module is used for calling the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data in the business update data table if the function category field exists;
and the loop module is used for skipping and executing the step of converting unstructured data into vectorized data through the vectorization interface service.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A vectorized data retrieval management method, characterized by being applied to a data platform, wherein the data platform comprises a preset vector database, the method comprising:
when data to be queried is received, classifying the data to be queried, and determining a classification label corresponding to the data to be queried;
performing vectorization conversion on unstructured data in the data to be queried to generate unstructured vectors;
selecting a target data area with the classification label from a vector database at the current moment;
and retrieving and displaying the target vector data corresponding to the unstructured vector in the target data area.
2. The method according to claim 1, wherein the step of classifying the data to be queried when the data to be queried is received, and determining a classification label corresponding to the data to be queried comprises:
when data to be queried is received, analyzing the data to be queried, and judging whether query conditions exist or not;
if a query condition exists, selecting an initial data area from the vector database according to the query condition, and determining the initial data area as the vector database at the current moment;
invoking a hierarchical classification interface service to classify the data to be queried, and determining a classification label corresponding to the data to be queried;
and if no query condition exists, calling the hierarchical classification interface service to classify the data to be queried, and determining a classification label corresponding to the data to be queried.
3. The method of claim 1, wherein the target data area stores a plurality of vectorized data; the step of retrieving and displaying the target vector data corresponding to the unstructured vector in the target data area comprises the following steps:
judging whether the number of the vectorized data is larger than or equal to a preset number threshold value;
If not, calculating a first vector similarity between each vectorized data and the unstructured vector;
if yes, adopting a general vector index algorithm to screen at least one intermediate vector from a plurality of vectorized data according to the unstructured vector;
calculating a second vector similarity between each of the intermediate vectors and the unstructured vector;
selecting target vector data according to the first vector similarity or the second vector similarity;
and returning the target vector data to a transmitting end to which the data to be queried belongs and displaying the target vector data.
4. A method according to claim 3, wherein the step of selecting the target vector data based on the first vector similarity or the second vector similarity comprises:
selecting a plurality of vectorized data from large to small according to the first vector similarity or the second vector similarity, and determining the vectorized data as target vector data;
or selecting a plurality of vectorized data with the first vector similarity or the second vector similarity in a preset similarity range to determine the vectorized data as target vector data.
5. The method of claim 4, wherein the sender to which the data to be queried belongs is a functional service application of the data platform, each of the functional service applications being provided with a corresponding preset similarity range, and the method further comprises:
Returning the target vector data to the function service application;
and updating an existing model by the function service application by using the target vector data.
6. The method according to claim 1, wherein the method further comprises:
obtaining unlabeled data in the data platform;
calling a large language model service to label the unlabeled data according to a preset hierarchical classification label system, and generating labeled data;
training a hierarchical classification interface service and a vectorization interface service according to the marked data and the unmarked data;
invoking the vectorization interface service to vectorize and convert unstructured data in the data platform, and invoking the hierarchical classification interface service to classify and identify the unstructured data in the data platform, so as to create a corresponding business data table;
and loading all the service data tables, associating the vectorization interface service and the hierarchical classification interface service, and constructing a vector database.
7. The method of claim 6, wherein the step of training a hierarchical classification interface service and a vectorization interface service based on the labeled data and the unlabeled data comprises:
performing parameter fine-tuning on a plurality of preset first classification models using the labeled data to obtain a plurality of intermediate classification models;
screening out an intermediate classification model whose accuracy is greater than a preset classification threshold, determining it as a target classification model, and deploying the target classification model as the hierarchical classification interface service;
performing parameter fine-tuning and screening on a plurality of preset second classification models using the unlabeled data to obtain a semantic extraction model;
and deploying the semantic extraction model as a vectorization interface service.
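The screening step of claim 7 can be sketched in Python. Fine-tuning itself is out of scope here; the toy callables below merely stand in for already fine-tuned intermediate classification models, and measuring accuracy on a held-out labeled set is an assumption of this sketch rather than something the claim states.

```python
from typing import Callable, Dict, List, Tuple

Classifier = Callable[[str], str]  # maps text to a predicted label


def accuracy(model: Classifier, labeled: List[Tuple[str, str]]) -> float:
    """Fraction of (text, label) pairs the model predicts correctly."""
    if not labeled:
        return 0.0
    correct = sum(1 for text, label in labeled if model(text) == label)
    return correct / len(labeled)


def screen_models(
    models: Dict[str, Classifier],
    labeled: List[Tuple[str, str]],
    threshold: float,
) -> Dict[str, float]:
    """Keep only models whose accuracy is strictly greater than the threshold."""
    scores = {name: accuracy(model, labeled) for name, model in models.items()}
    return {name: score for name, score in scores.items() if score > threshold}


# Toy stand-ins for fine-tuned intermediate classification models.
def keyword_model(text: str) -> str:
    return "finance" if "invoice" in text else "hr"


def constant_model(text: str) -> str:
    return "hr"


eval_set = [
    ("invoice 2023", "finance"),
    ("leave request", "hr"),
    ("invoice overdue", "finance"),
    ("onboarding form", "hr"),
]
selected = screen_models(
    {"keyword": keyword_model, "constant": constant_model},
    eval_set,
    threshold=0.8,
)
```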
8. The method of claim 6, wherein the step of invoking the vectorization interface service to vectorize the unstructured data in the data platform and invoking the hierarchical classification interface service to classify and identify the unstructured data in the data platform to create the corresponding business data table comprises:
creating an initial data table on the data platform; the initial data table includes a plurality of job category fields;
invoking the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data in the data platform;
converting the unstructured data into vectorized data through the vectorization interface service;
classifying and identifying the unstructured data, associating the unstructured data with the vectorized data, and determining a classification label field corresponding to the unstructured data;
and storing the vectorized data and the classification label field into the corresponding job category fields, respectively, to generate a business data table.
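The table-building steps of claim 8 can be sketched with Python's standard `sqlite3` module. The table name `business_data`, its column names, and the placeholder `vectorize` and `classify` functions are assumptions for illustration; a real deployment would call the trained interface services and would likely use a dedicated vector store.

```python
import json
import sqlite3
from typing import List


def vectorize(text: str) -> List[float]:
    """Placeholder embedding: character count and word count."""
    return [float(len(text)), float(len(text.split()))]


def classify(text: str) -> str:
    """Placeholder classifier returning a hierarchical label path."""
    return "finance/invoice" if "invoice" in text.lower() else "general/other"


def build_business_table(conn: sqlite3.Connection, documents: List[str]) -> None:
    # Initial data table with one field for the vector and one for the label.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS business_data ("
        "id INTEGER PRIMARY KEY, raw_text TEXT, vector TEXT, label TEXT)"
    )
    for text in documents:
        # Store the vectorized data and its classification label in the same row,
        # which keeps the two associated with the source text.
        conn.execute(
            "INSERT INTO business_data (raw_text, vector, label) VALUES (?, ?, ?)",
            (text, json.dumps(vectorize(text)), classify(text)),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_business_table(conn, ["Invoice 42 is overdue", "Team lunch on Friday"])
rows = conn.execute("SELECT label, vector FROM business_data ORDER BY id").fetchall()
```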
9. The method of claim 6, wherein the method further comprises:
in response to input login information, matching a data permission corresponding to the login information; the data permission comprises a multi-level labeling permission and a multi-level user permission;
when a user operation instruction is received and conforms to the data permission, executing a management operation corresponding to the user operation instruction on the vector database;
when an operation completion instruction is received, checking the hierarchical classification interface service and the vectorization interface service to generate a checking result;
when an evaluation-failed instruction input in response to the execution result is received, returning to the step of calling the large language model service to label the unlabeled data and generate labeled data;
and when an evaluation-passed instruction input in response to the execution result is received, maintaining the hierarchical classification interface service and the vectorization interface service as they are at the current moment.
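The permission check and the evaluation branch of claim 9 can be sketched as follows. The role names, the numeric levels, the operation names, and the returned action strings are all hypothetical; the claim only states that permissions are multi-level and that a failed evaluation returns to the labeling step.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class DataPermission:
    labeling_level: int  # hypothetical: higher means broader labeling rights
    user_level: int      # hypothetical: higher means broader management rights
    allowed_operations: FrozenSet[str]


PERMISSIONS: Dict[str, DataPermission] = {
    "annotator": DataPermission(1, 1, frozenset({"query", "label"})),
    "admin": DataPermission(3, 3, frozenset({"query", "label", "update", "delete"})),
}


def match_permission(login: str) -> DataPermission:
    """Return the permission for a login, or an empty permission if unknown."""
    return PERMISSIONS.get(login, DataPermission(0, 0, frozenset()))


def is_operation_allowed(login: str, operation: str) -> bool:
    """Report whether the operation conforms to the matched permission."""
    return operation in match_permission(login).allowed_operations


def after_evaluation(passed: bool) -> str:
    """Choose the next step from the evaluation instruction."""
    return "keep_current_services" if passed else "relabel_and_retrain"
```

For example, `is_operation_allowed("annotator", "delete")` evaluates to `False`, so a delete instruction from that login would not be executed against the vector database.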
10. The method of claim 8, wherein the method further comprises:
when a business update data table is received, determining whether the business update data table includes the job category field;
if not, adding the job category field to the business update data table, and invoking the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data from the business update data table;
if so, invoking the hierarchical classification interface service and the vectorization interface service to respectively extract unstructured data from the business update data table;
and jumping to the step of converting the unstructured data into vectorized data through the vectorization interface service.
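The update flow of claim 10 can be sketched in Python. The dictionary-based table layout, the field name `job_category`, and the text column name `content` are assumptions for illustration. Both branches of the claim end in the same extraction step, so the sketch adds the field when it is missing and then extracts unconditionally.

```python
from typing import Any, Dict, List

Table = Dict[str, Any]  # {"columns": [...], "rows": [{column: value}, ...]}


def ensure_category_field(table: Table, field: str = "job_category") -> bool:
    """Add the category field if it is missing; return True when it was added."""
    if field in table["columns"]:
        return False
    table["columns"].append(field)
    for row in table["rows"]:
        row[field] = None
    return True


def extract_unstructured(table: Table, text_column: str = "content") -> List[str]:
    """Collect the non-empty text values that still need vectorization."""
    return [row[text_column] for row in table["rows"] if row.get(text_column)]


def handle_update(table: Table) -> List[str]:
    """Make sure the field exists, then extract the unstructured data."""
    ensure_category_field(table)
    return extract_unstructured(table)


update = {"columns": ["content"], "rows": [{"content": "Invoice 42"}, {"content": ""}]}
texts = handle_update(update)
```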
11. A vectorized data retrieval management apparatus for use with a data platform, the data platform including a preset vector database, the apparatus comprising:
a classification response module, configured to classify data to be queried when the data to be queried is received, and to determine a classification label corresponding to the data to be queried;
a vectorization conversion module, configured to vectorize unstructured data in the data to be queried to generate an unstructured vector;
a data area selection module, configured to select a target data area having the classification label from the vector database at the current moment;
and a target vector data retrieval module, configured to retrieve and display the target vector data corresponding to the unstructured vector in the target data area.
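The four modules of claim 11 can be sketched as one retrieval path in Python. The `VectorDatabase` class, the use of cosine similarity, and the `classify` and `vectorize` callables are assumptions for illustration; the claim does not name a similarity measure or a storage layout.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorDatabase:
    """Toy store that keeps one data area per classification label."""

    def __init__(self) -> None:
        self.areas: Dict[str, List[Tuple[str, Vector]]] = {}

    def add(self, label: str, item_id: str, vector: Vector) -> None:
        self.areas.setdefault(label, []).append((item_id, vector))

    def search(self, label: str, query: Vector, k: int) -> List[str]:
        area = self.areas.get(label, [])  # data area selection by label
        ranked = sorted(
            area, key=lambda entry: cosine_similarity(query, entry[1]), reverse=True
        )
        return [item_id for item_id, _ in ranked[:k]]


def retrieve(
    db: VectorDatabase,
    text: str,
    classify: Callable[[str], str],
    vectorize: Callable[[str], Vector],
    k: int = 2,
) -> List[str]:
    """Classify the query, vectorize it, then search only the matching area."""
    return db.search(classify(text), vectorize(text), k)


db = VectorDatabase()
db.add("finance", "inv-1", [1.0, 0.0])
db.add("finance", "inv-2", [0.6, 0.8])
db.add("finance", "inv-3", [0.0, 1.0])
db.add("hr", "hr-1", [1.0, 0.0])

results = retrieve(db, "invoice query", lambda t: "finance", lambda t: [1.0, 0.0])
```

In this sketch the entry `hr-1` is never compared against the query even though its vector matches exactly, because the search is restricted to the data area selected by the classification label.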
CN202311441225.6A 2023-11-01 2023-11-01 Vectorized data retrieval management method and device Pending CN117453971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311441225.6A CN117453971A (en) 2023-11-01 2023-11-01 Vectorized data retrieval management method and device

Publications (1)

Publication Number Publication Date
CN117453971A true CN117453971A (en) 2024-01-26

Family

ID=89585033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311441225.6A Pending CN117453971A (en) 2023-11-01 2023-11-01 Vectorized data retrieval management method and device

Country Status (1)

Country Link
CN (1) CN117453971A (en)

Similar Documents

Publication Publication Date Title
CN109635171B (en) Fusion reasoning system and method for news program intelligent tags
US7788265B2 (en) Taxonomy-based object classification
CN109325148A (en) The method and apparatus for generating information
US11194797B2 (en) Automatic transformation of complex tables in documents into computer understandable structured format and providing schema-less query support data extraction
US7912816B2 (en) Adaptive archive data management
CN107085583B (en) Electronic document management method and device based on content
CN111694965A (en) Image scene retrieval system and method based on multi-mode knowledge graph
EP2973038A1 (en) Classifying resources using a deep network
US11194798B2 (en) Automatic transformation of complex tables in documents into computer understandable structured format with mapped dependencies and providing schema-less query support for searching table data
CN112131449A (en) Implementation method of cultural resource cascade query interface based on elastic search
US11308083B2 (en) Automatic transformation of complex tables in documents into computer understandable structured format and managing dependencies
CN111078835A (en) Resume evaluation method and device, computer equipment and storage medium
CN111639228A (en) Video retrieval method, device, equipment and storage medium
KR20120047622A (en) System and method for managing digital contents
CN116975340A (en) Information retrieval method, apparatus, device, program product, and storage medium
CN109271624A (en) A kind of target word determines method, apparatus and storage medium
CN114491034B (en) Text classification method and intelligent device
CN114491079A (en) Knowledge graph construction and query method, device, equipment and medium
US11625630B2 (en) Identifying intent in dialog data through variant assessment
CN117453971A (en) Vectorized data retrieval management method and device
US10572522B1 (en) Database for unstructured data
AU2019290658B2 (en) Systems and methods for identifying and linking events in structured proceedings
US20120117449A1 (en) Creating and Modifying an Image Wiki Page
CN115774797A (en) Video content retrieval method, device, equipment and computer readable storage medium
CN113220843A (en) Method, device, storage medium and equipment for determining information association relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination