CN116186298A - Information retrieval method and device - Google Patents


Info

Publication number
CN116186298A
Authority
CN
China
Prior art keywords
partition
label
vector
tag
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310151989.5A
Other languages
Chinese (zh)
Inventor
刘雨
刘啸
王凯曦
韦大平
陈政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202310151989.5A
Publication of CN116186298A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an information retrieval method and device in the field of financial technology (Fintech). The method comprises: converting unstructured data into a mixed vector, wherein the unstructured data comprises at least one of picture data, video data, audio data and natural language, and the mixed vector comprises a feature vector and an attribute tag of the feature vector; establishing an attribute partition table according to the attribute tag of the feature vector, and storing the feature vector into a first-level tag partition in the attribute partition table; partitioning the first-level tag partition step by step according to a preset tag classification condition to obtain at least one N-level tag partition, and determining the feature vectors stored under the N-level tag partition, wherein N is a positive integer greater than or equal to 1; and constructing a vector index file according to each N-level tag partition and the feature vectors stored under it, and querying, according to the vector index file, target unstructured data that matches source data. The technical scheme can improve the query efficiency of unstructured data.

Description

Information retrieval method and device
Technical Field
The present application relates to the field of financial technology (Fintech), and in particular, to an information retrieval method and apparatus.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech).
At present, the field of financial technology involves massive amounts of unstructured data. Before target data can be searched for in this data, the unstructured data must be converted into vector form and a vector index file must be constructed; data retrieval is then performed based on the vector index file.
However, because of the sheer volume of unstructured data, this approach produces a large number of index files, and retrieval efficiency is low.
Disclosure of Invention
The application provides an information retrieval method and device to solve the following technical problem in the prior art: when unstructured data is converted into vector form and vector index files are then constructed, the large volume of unstructured data leads to a large number of index files and therefore to low retrieval efficiency.
In a first aspect, the present application provides an information retrieval method, the method comprising:
converting unstructured data into a mixed vector, wherein the unstructured data comprises at least one of picture data, video data, audio data and natural language, and the mixed vector comprises a feature vector and an attribute tag of the feature vector;
establishing an attribute partition table according to the attribute tag of the feature vector, and storing the feature vector into a first-level tag partition in the attribute partition table;
according to preset tag classification conditions, carrying out step-by-step partitioning on the first-level tag partition to obtain at least one N-level tag partition, determining feature vectors stored under the N-level tag partition, wherein N is a positive integer with a value greater than or equal to 1;
constructing a vector index file according to each N-level tag partition and the feature vectors stored under the N-level tag partitions;
and inquiring to obtain target unstructured data matched with the source data according to the vector index file.
In one possible design, the creating an attribute partition table according to the attribute tag of the feature vector includes:
acquiring the category number of the attribute tags of the feature vector;
and establishing an attribute partition table corresponding to each attribute tag, and establishing at least one primary tag partition associated with the attribute tag in the attribute partition table.
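By way of illustration only, the design above can be sketched as follows. This is a minimal sketch, assuming that the attribute tags of each feature vector arrive as a category-to-value mapping; the function name `build_partition_tables` and the sample data are hypothetical and do not come from the application.

```python
from collections import defaultdict


def build_partition_tables(mixed_vectors):
    """Group feature vectors into one partition table per attribute category.

    mixed_vectors: iterable of (feature_vector, {category: tag_value}).
    Returns {category: {tag_value: [feature_vector, ...]}}.
    """
    tables = defaultdict(lambda: defaultdict(list))
    for vector, tags in mixed_vectors:
        for category, tag_value in tags.items():
            # Assumption: each distinct tag value is one first-level tag partition.
            tables[category][tag_value].append(vector)
    return {category: dict(parts) for category, parts in tables.items()}


sample = [
    ([0.1, 0.2], {"region": "province 1", "age": "young"}),
    ([0.3, 0.4], {"region": "province 2", "age": "young"}),
]
tables = build_partition_tables(sample)
```

The number of tables equals the number of attribute categories, so `len(tables)` plays the role of the "category number" in this sketch.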
In one possible design, the step-by-step partitioning the first-level tag partition according to a preset tag classification condition to obtain at least one N-level tag partition includes:
determining whether the first-level tag partition meets the preset tag classification condition;
if the first-level tag partition meets the preset tag classification condition, expanding the first-level tag partition to obtain at least one next-level tag partition, and determining whether a target tag partition meeting the tag classification condition exists among the at least one next-level tag partition;
if the target tag partition exists, continuing to expand the target tag partition until the next-level tag partition obtained by expansion does not meet the tag classification condition or the next-level tag partition obtained by expansion is the N-level tag partition.
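The step-by-step expansion described above can be sketched as a short recursion. This is only a sketch under stated assumptions: partitions are modelled as nested dictionaries, the classification condition is passed in as a callable, and the 2-million threshold and the partition names are hypothetical.

```python
def expand(partition, meets_condition, max_level, level=1):
    """Return the (level, name) leaves left after step-by-step expansion.

    partition: {"name": str, "count": int, "children": [partition, ...]}
    meets_condition: callable deciding whether a partition should be expanded.
    max_level: N, the deepest tag level allowed.
    """
    children = partition.get("children", [])
    # Stop at level N, when the condition fails, or when nothing finer exists.
    if level >= max_level or not meets_condition(partition) or not children:
        return [(level, partition["name"])]
    leaves = []
    for child in children:
        leaves.extend(expand(child, meets_condition, max_level, level + 1))
    return leaves


province = {
    "name": "province 1", "count": 5_000_000,
    "children": [
        {"name": "partition 1", "count": 3_000_000, "children": [
            {"name": "partition 1a", "count": 2_500_000},
            {"name": "partition 1b", "count": 500_000},
        ]},
        {"name": "partition 2", "count": 1_000_000},
        {"name": "partition 3", "count": 1_000_000},
    ],
}
# Hypothetical condition: expand any partition holding more than 2 million vectors.
leaves = expand(province, lambda p: p["count"] > 2_000_000, max_level=3)
```

In this example only the dense branch is expanded down to level 3, while the two sparse partitions stay at level 2.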
In one possible design, determining whether a tag classification condition is met includes:
acquiring a reference value of the tag partition;
and determining whether the tag partition meets the tag classification condition by comparing the number of feature vectors assigned to the tag partition with the reference value of the tag partition.
In one possible design, the obtaining the reference value of the tag partition includes:
acquiring a time interval and an attenuation degree configured for the tag partition, wherein the time interval comprises an upper limit value and a lower limit value;
determining a maximum number of feature vectors of the tag partition, at which the response time of vector retrieval equals the upper limit value;
determining a minimum number of feature vectors of the tag partition, at which the response time of vector retrieval equals the lower limit value;
selecting a target value from the maximum number of feature vectors and the minimum number of feature vectors;
and multiplying the target value by the attenuation degree to obtain the reference value of the tag partition.
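As a worked illustration of the reference value, consider the sketch below. The application does not state how the target value is selected or in which direction the comparison runs, so both are labelled as assumptions here; the function names and the numbers are hypothetical.

```python
def reference_value(max_count, min_count, attenuation, pick=max):
    """Reference value = selected target value x attenuation degree.

    max_count: vector count at which retrieval response time reaches the upper limit.
    min_count: vector count at which retrieval response time reaches the lower limit.
    pick: ASSUMPTION, the selection rule is not specified in the text.
    """
    target = pick(max_count, min_count)
    return target * attenuation


def meets_classification_condition(vector_count, reference):
    # ASSUMPTION: a partition is expanded when it holds more vectors than the reference.
    return vector_count > reference


ref = reference_value(max_count=2_000_000, min_count=500_000, attenuation=0.5)
dense = meets_classification_condition(5_000_000, ref)
sparse = meets_classification_condition(900_000, ref)
```

With these illustrative inputs the reference value is 1,000,000 vectors, so a partition holding 5 million vectors would be expanded and one holding 900,000 would not.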
In one possible design, the method further comprises:
and merging non-target label partitions which do not meet the label classification conditions in each level of label partitions to form a merging partition.
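A minimal sketch of the merging step, assuming partitions at one tag level are held in a dictionary and the classification condition is supplied as a callable. The name of the merged partition and the size threshold in the example are hypothetical.

```python
def merge_small_partitions(partitions, meets_condition, merged_name="merged"):
    """Keep partitions that meet the condition and fold the rest into one.

    partitions: {name: [feature_vector, ...]} for a single tag level.
    """
    kept, merged = {}, []
    for name, vectors in partitions.items():
        if meets_condition(vectors):
            kept[name] = vectors
        else:
            merged.extend(vectors)
    if merged:
        kept[merged_name] = merged
    return kept


level = {"a": [[1.0], [2.0], [3.0]], "b": [[4.0]], "c": [[5.0]]}
# Hypothetical condition: a partition stands alone only if it holds at least 3 vectors.
merged_level = merge_small_partitions(level, lambda vectors: len(vectors) >= 3)
```

Merging the sparse partitions means one index file can serve them instead of one file each.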
In one possible design, the constructing a vector index file according to each N-level tag partition and the feature vector stored under the N-level tag partition includes:
constructing a vector index file corresponding to the first-level tag partition in the attribute partition table according to the feature vector of each first-level tag partition in the attribute partition table;
selecting at least one tag partition from at least two attribute partition tables to form a multi-attribute tag partition;
and constructing a vector index file of the multi-attribute label partition according to the feature vector in the multi-attribute label partition.
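The text does not say how the selected tag partitions are combined into a multi-attribute tag partition. One plausible reading, shown here purely as an assumption, is that the combined partition holds the vectors that appear in every selected partition. The function name and the use of vector ids are hypothetical.

```python
def multi_attribute_partition(tables, selection):
    """Combine one tag partition from each selected attribute table.

    tables: {category: {tag: set_of_vector_ids}}
    selection: {category: tag}, covering at least two categories.
    ASSUMPTION: the result is the intersection of the selected partitions.
    """
    if len(selection) < 2:
        raise ValueError("select partitions from at least two attribute tables")
    id_sets = [tables[category][tag] for category, tag in selection.items()]
    return set.intersection(*id_sets)


tables = {
    "region": {"province 1": {1, 2, 3}, "province 2": {4}},
    "age": {"young": {2, 3, 4}},
}
combined = multi_attribute_partition(tables, {"region": "province 1", "age": "young"})
```

Building such a partition only for the combinations that are actually queried avoids materializing every attribute combination in advance.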
In one possible design, the querying, according to the vector index file, obtains target unstructured data matched with the source data, including:
acquiring a tag in the source data, wherein the source data comprises a tag;
searching a target index file from the vector index file according to the label in the source data;
searching and obtaining a preset number of target feature vectors according to the target index file, wherein the similarity between the target feature vectors and the source data is greater than a preset similarity threshold;
and determining target unstructured data matched with the source data according to the target feature vector.
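The query flow above can be sketched with a brute-force scan standing in for a real vector index. Cosine similarity is used here only as an example metric, since the application does not name one; the data layout and all identifiers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(index_files, source, top_k, threshold):
    """Pick the index by the source tag, then return up to top_k ids above threshold.

    index_files: {tag: {item_id: feature_vector}}
    source: {"tag": str, "vector": [float, ...]}
    """
    target_index = index_files[source["tag"]]
    scored = [(cosine(source["vector"], vec), item_id)
              for item_id, vec in target_index.items()]
    hits = sorted((pair for pair in scored if pair[0] > threshold), reverse=True)
    return [item_id for _, item_id in hits[:top_k]]


index_files = {
    "young": {"img-1": [1.0, 0.0], "img-2": [0.9, 0.1], "img-3": [0.0, 1.0]},
    "middle-aged": {"img-4": [1.0, 0.0]},
}
result = retrieve(index_files, {"tag": "young", "vector": [1.0, 0.0]},
                  top_k=2, threshold=0.5)
```

Because the tag selects a single index first, only the vectors in that partition are scanned.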
In one possible design, if the tag in the source data includes more than two tags, the searching the target index file from the vector index data file includes:
constructing a regular expression according to each label in the source data;
and searching out a target index file from the vector index data file according to the regular expression.
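A minimal sketch of the regular-expression lookup, assuming that index file names embed the tags of the partition they cover. The naming scheme and the function names are hypothetical; the application does not specify them.

```python
import re


def build_tag_pattern(tags):
    """Build a pattern that matches names containing every tag, in any order."""
    # One lookahead per tag, so all tags must be present somewhere in the name.
    lookaheads = "".join(f"(?=.*{re.escape(tag)})" for tag in tags)
    return re.compile(lookaheads + ".*")


def find_index_files(file_names, tags):
    pattern = build_tag_pattern(tags)
    return [name for name in file_names if pattern.match(name)]


files = ["province1_young.idx", "province1_middle-aged.idx", "province2_young.idx"]
matches = find_index_files(files, ["young", "province1"])
```

Using lookaheads keeps the match independent of the order in which the tags appear in the source data.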
In one possible design, the method further comprises:
monitoring access flow change at the current moment of each level of label partition, and determining whether an abnormal label partition with abnormal access flow change exists or not;
when the abnormal tag partition exists, constructing an offline data analysis model according to the feature vectors and the vector index file at the previous moment, and obtaining an online feature vector at the current moment;
calculating, according to the offline data analysis model, a reference similarity of each feature dimension and a similarity of the online feature vector, wherein the feature dimensions comprise picture data, video data, audio data and natural language, the similarity of the online feature vector represents the maximum similarity among all members of a user group under each feature dimension, and the reference similarity is used for determining whether the user group is a suspected group;
comparing the reference similarity with the similarity of the online feature vector, and, when the similarity of the online feature vector is greater than or equal to the reference similarity, performing vector retrieval analysis according to the first-level tag partition or the second-level tag partition to obtain a coarse-grained similarity from online analysis;
and comparing the coarse-grained similarity with the similarity of the online feature vector, and constructing an index file of the abnormal tag partition in real time when the similarity of the online feature vector is greater than or equal to the coarse-grained similarity.
In one possible design, the method further comprises:
performing, according to the index file of the abnormal tag partition, similarity retrieval on the user group in each feature dimension to obtain an accurate similarity;
and determining whether the user group is a suspected group by comparing the accurate similarity with the reference similarity.
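The staged comparison in the two designs above can be sketched as a short decision cascade. This is only a sketch: the three callables are hypothetical stand-ins for the coarse retrieval, the real-time index construction and the fine retrieval, and the final comparison direction is an assumption because the text only says the two values are compared.

```python
def screen_user_group(online_sim, reference_sim, coarse_search, build_index, accurate_search):
    """Return "not_escalated", "suspected" or "not_suspected" for one user group."""
    # Stage 1: compare the online similarity with the offline reference similarity.
    if online_sim < reference_sim:
        return "not_escalated"
    # Stage 2: coarse-grained retrieval on first-level or second-level tag partitions.
    coarse_sim = coarse_search()
    if online_sim < coarse_sim:
        return "not_escalated"
    # Stage 3: build the index of the abnormal partition in real time, then search it.
    index = build_index()
    accurate_sim = accurate_search(index)
    # ASSUMPTION: the group is suspected when the accurate similarity reaches the reference.
    return "suspected" if accurate_sim >= reference_sim else "not_suspected"


status = screen_user_group(
    online_sim=0.9,
    reference_sim=0.8,
    coarse_search=lambda: 0.85,
    build_index=lambda: "abnormal-partition-index",
    accurate_search=lambda index: 0.95,
)
```

The expensive real-time index construction only runs for groups that pass both cheaper comparisons.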
In a second aspect, the present application provides an information retrieval apparatus, comprising:
the data conversion module is used for converting unstructured data into a mixed vector, wherein the unstructured data comprises at least one of picture data, video data, audio data and natural language, and the mixed vector comprises a feature vector and an attribute label of the feature vector;
the partition table construction module is used for establishing an attribute partition table according to the attribute tags of the feature vectors and storing the feature vectors into a first-level tag partition in the attribute partition table;
the label classification module is used for carrying out step-by-step partition on the first-level label partition according to preset label classification conditions to obtain at least one N-level label partition, determining a feature vector stored under the N-level label partition, wherein N is a positive integer with a value greater than or equal to 1;
the index construction module is used for constructing a vector index file according to each N-level tag partition and the feature vectors stored under the N-level tag partitions;
and the data query module is used for querying and obtaining target unstructured data matched with the source data according to the vector index file.
According to the information retrieval method and device, unstructured data is converted into a mixed vector that combines a feature vector with an attribute tag, and the feature vector is stored into the corresponding attribute partition table according to the attribute tag. The first-level tag partitions can then be expanded step by step according to the tag classification condition, which keeps the number of tag partitions as small as possible and reduces the number of index files that need to be constructed. As a result, vector retrieval based on the index files is faster and more efficient, which improves the query and retrieval efficiency for unstructured data.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of a distributed vector search engine;
FIG. 2 is a schematic diagram of a distributed vector search engine provided in an embodiment of the present application;
FIG. 3 is a flow chart of an information retrieval method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a feature vector partition table according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of label classification according to an embodiment of the present application;
FIG. 6 is a schematic tree distribution diagram of feature vector region attribute data according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of label merging according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a general implementation of vector real-time retrieval provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a vector search query flow provided in an embodiment of the present application;
FIG. 10 is a schematic flow chart of dynamic construction of an index file according to an embodiment of the present application;
FIG. 11 is a schematic diagram of monitoring flow anomalies according to an embodiment of the present disclosure;
FIG. 12 is a schematic flow chart of a vector search query according to an embodiment of the present disclosure;
FIG. 13 is a schematic view of a hierarchical computing model of attribute tags provided in an embodiment of the present application;
FIG. 14 is a schematic structural diagram of an information retrieval device according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, including but not limited to combinations of embodiments, which can be made by one of ordinary skill in the art without inventive faculty, are intended to be within the scope of the present application, based on the embodiments herein.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in this application are information and data authorized by the user or fully authorized by all parties. The collection, use and processing of the related data must comply with relevant laws, regulations and standards, and a corresponding operation entry is provided for the user to grant or refuse authorization.
The following explains the terms referred to in the present application:
Unstructured data: data whose structure is irregular, that has no unified predefined data model, and that is inconvenient to represent in the two-dimensional logical tables of a database. It includes pictures, video, audio, natural language, etc. Unstructured data may be processed after being converted into vector data by various artificial intelligence (AI) or machine learning (ML) models.
Feature vector: also known as an embedding vector, a continuous vector obtained by transforming discrete variables (e.g., unstructured data such as pictures, video, audio and natural language) through embedding techniques. Mathematically, a vector is an n-dimensional array of floating-point numbers or binary data.
Vector similarity retrieval: similarity retrieval compares the target object with the data in the database and recalls the most similar results. Likewise, vector similarity retrieval returns the most similar vector data.
Mixed vector retrieval: feature vectors often need to carry additional attributes. For example, a face picture may be tagged with gender, whether glasses are worn, the capture time of the picture, and so on; a text may be tagged with language type, corpus classification, creation time, and so on. Because of these attributes, it is often desirable to perform hybrid retrieval over structured and unstructured data.
The field of financial technology involves various forms of irregular unstructured data, such as pictures, video, audio and natural language. Such data is difficult to integrate and inconvenient to search, and generally needs to be converted into feature vectors for query and retrieval. Illustratively, FIG. 1 is a schematic structural diagram of a distributed vector search engine. As shown in FIG. 1, the existing distributed implementation of vector search provides search services mainly through an index cluster 100, where the index cluster 100 includes a plurality of nodes 110 and at least one cluster manager 200, and each node 110 includes a vector index component and a query component. The vector index component stores the search matching data and provides a nearest neighbor search engine and/or an approximate nearest neighbor search engine, and the query component provides a query aggregation service. FIG. 2 is a schematic diagram of a distributed vector search engine according to an embodiment of the present application, in which the index cluster includes 3 nodes, each node stores copies of different matching vector sets, and the overall distribution balance of the index cluster is determined.
However, mixed vector data carries attribute features, so mixed vector retrieval cannot be implemented effectively with the related art: conventional vector data has no attribute features, so conventional methods simply search the content of the vector data and cannot recognize the attribute features of a mixed vector. For example, a feature vector with a certain attribute cannot be found quickly among a huge number of feature vectors, and the implementation scheme needs to be redesigned. The main reasons are as follows:
a. The data states differ, and this determines the complexity of the technical scheme. The data in the matching vector sets of the conventional method are plain feature vectors, which are stateless, whereas the mixed vectors of the present application are stateful vector data. The existing technical scheme cannot store and search stateful data, so a new data storage structure and search routing scheme must be designed.
b. The data magnitudes differ. Conventional feature vectors have no attribute tags; once attribute tags are added, the volume of vector data produced by combining the various attribute tags is far larger than that of the feature vectors alone, and building indexes over this data produces a massive number of index files. The scheme must therefore be designed to reduce the size of the Cartesian product generated by multi-attribute combination.
In addition, in the field of financial technology, the index files generated for mixed vector retrieval under the current related art are so numerous that an efficient, real-time vector retrieval system cannot be built. Specifically:
1. Constructing indexes over multiple attribute dimensions yields a huge number of Cartesian-product shards. Existing vector retrieval schemes can only expand the dimensions into a Cartesian product and generate the index files accordingly. When the dimensions are complex, for example date, region, industry and age, the number of index files formed by combining the attributes reaches the hundred-million level, and generating the index files alone takes a long time. In the field of financial technology, however, the dimensions must change in real time with business requirements because of the dynamic confrontation with black-market operations, so index files need to be generated quickly on demand.
2. Building index files simply by fixed attribute dimensions leads to unbalanced data distribution: dense and sparse shards differ greatly in size, dense shards are slow to query, and general service requirements are not met. For example, if index files are built uniformly by province and city nationwide, or uniformly by age group, queries against area A or the main age groups are inefficient. Moreover, rebuilding the index file for concentrated hot-spot data after a data change takes a long time and cannot meet the real-time requirements of the financial technology field.
3. An overly complex index consumes a huge amount of memory, yet business requirements do not allow the complexity to be compromised. For example, if a group exists in a village under region C of city S in province G, the partitions for province G do need to be expanded down to the village level, but other cities do not need to be distinguished so finely. A mechanism is needed to decide which dimensions require expanded partitions to meet business needs and which should be contracted to reduce memory usage.
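To make the scale of the Cartesian product concrete, the arithmetic can be sketched as follows. The per-attribute cardinalities are purely illustrative assumptions chosen to land at the hundred-million level; they are not figures from the application.

```python
from math import prod

# Hypothetical number of distinct tag values per attribute dimension.
cardinalities = {"date": 365, "region": 300, "industry": 100, "age": 20}

# One index file per attribute combination: the full Cartesian product.
cartesian_index_files = prod(cardinalities.values())

# For comparison: one partition per tag value if each attribute is partitioned on its own.
per_attribute_partitions = sum(cardinalities.values())
```

With these illustrative values the full product is 219,000,000 combinations, against 785 partitions when each attribute is handled separately.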
To solve the above problems and implement multi-attribute-dimension vector retrieval in the field of financial technology, the embodiments of the present application provide an information retrieval method and apparatus. After unstructured data is converted into feature vectors carrying attribute tags, each feature vector is first stored into the partition table corresponding to its attribute tag. An attribute tag classification and merging mechanism is then established, and the attribute tags in each partition table are classified into levels according to service requirements. A scoring mechanism for each level of tags and rules for merging and constructing index files are then established, so as to reduce the number of classified attribute tags and the number of index files, which improves retrieval efficiency when vector retrieval is performed based on the index files. In addition, specific tags can be expanded and partitioned precisely according to real-time information, and index files can be built in real time, so that the query and retrieval requirements of different business scenarios can be met.
The technical scheme of the present application is described in detail below through specific embodiments. It should be noted that the following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
FIG. 3 is a flow chart of an information retrieval method provided in an embodiment of the present application. As shown in FIG. 3, the method may be applied to an electronic device (for example, a computer device), and the method may specifically include the following steps:
Step S301, unstructured data is converted into a mixed vector, where the unstructured data includes at least one of picture data, video data, audio data, and natural language, and the mixed vector includes a feature vector and an attribute tag of the feature vector.
In this embodiment, unstructured data consists of irregular data (e.g., pictures, video, audio, and natural language) and attribute tags. By way of example, the picture data, video data, audio data, and natural language are related to a user; for example, the audio data may be the voice of a certain user, and the picture data may be a face image of a certain user. Taking the extraction of feature vectors from pictures as an example, there are two main methods: the first extracts image descriptors (a white-box algorithm); the second is based on a neural network (a black-box algorithm). Here, KAZE descriptors are used; they are provided by OpenCV, a cross-platform computer vision and machine learning software library, and can be used directly from the Python programming language. Through the above extraction process, the irregular data in the unstructured data is converted into feature vectors, and the unstructured data is thus converted into mixed vectors composed of feature vectors and attribute tags. The attribute tags may be, by way of example, date, region, age, industry, etc.
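As a minimal sketch of the resulting data structure, a mixed vector can be modelled as a feature vector paired with its attribute tags. The class name, the field names and the toy embedding function below are hypothetical; a real system would substitute an image descriptor or a model for the placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class MixedVector:
    feature: list                             # feature vector (floats)
    tags: dict = field(default_factory=dict)  # attribute category -> tag value


def to_mixed_vector(item, embed):
    """Convert one unstructured item into a mixed vector.

    item: {"content": ..., "tags": {...}}
    embed: any callable that returns a sequence of floats for the content.
    """
    return MixedVector(feature=list(embed(item["content"])), tags=dict(item["tags"]))


def toy_embed(text):
    # Placeholder only: stands in for a descriptor extractor or neural model.
    return [float(len(text)), float(text.count(" "))]


mv = to_mixed_vector(
    {"content": "hello world", "tags": {"region": "province 1", "age": "young"}},
    toy_embed,
)
```

Keeping the tags next to the vector is what later allows the vector to be routed to a partition table by attribute.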
Step S302, according to the attribute labels of the feature vectors, an attribute partition table is established, and the feature vectors are stored into a first-level label partition in the attribute partition table.
In this embodiment, since the unstructured data carries attributes (such as date, region, industry and age), these attributes are marked by attribute tags when the unstructured data is converted into feature vectors. For example, the attribute tag of feature vector X1 is "20 years old", and the attribute tag of feature vector X2 is "year 2023". An attribute partition table of the corresponding category may be created based on the category of the attribute tag: since the attribute tag of feature vector X1 is "20 years old", a feature vector age partition table is created, and since the attribute tag of feature vector X2 is "year 2023", a feature vector date partition table is created.
Tag partitions can also be created in the attribute partition table. Taking the feature vector age partition table as an example, first-level tag partitions for different age groups can be created based on age, for example "young", "middle-aged" and "middle-aged and elderly", and the feature vector with the attribute tag "20 years old" can be stored in the first-level tag partition "young".
For example, FIG. 4 is a schematic diagram of a feature vector partition table provided in an embodiment of the present application. A feature vector region partition table, a feature vector age partition table, and a feature vector industry partition table may be created, where each partition table has a partition table name and corresponding partition tags (partition table names and partition tags cannot be repeated). Vector data is routed into the corresponding partition table by its attribute tag. Each partition table also sets the number of copies of the index file to be generated; this number is 3 by way of example and can be adjusted according to actual conditions. This ensures that the index files generated later are stored redundantly; the redundant copies are loaded by the cluster manager, which improves the availability of the cluster. As shown in FIG. 4, the first-level tag partitions in the feature vector region partition table are "province 1", "province 2", and "province 3"; those in the feature vector age partition table are "young", "middle-aged", and "middle-aged and elderly"; and those in the feature vector industry partition table are "first industry", "second industry", and "third industry".
Step S303, according to preset label grading conditions, partitioning the first-level label partition step by step to obtain at least one N-level label partition, and determining the feature vectors stored under the N-level label partition, where N is a positive integer greater than or equal to 1.
In this embodiment, the primary label partition in each attribute partition table may be subdivided into secondary label partitions, tertiary label partitions, and so on down to N-level label partitions. For example, fig. 5 is a schematic label classification diagram provided in the embodiment of the present application. As shown in fig. 5, taking the date attribute as an example, the first-level label partition is the year, the second-level label partition is the month, the third-level label partition is the day, and the fourth-level label partition is the hour. In the feature vector date partition table, the first-level, second-level, third-level and fourth-level label partitions may be created correspondingly based on the preset label classification conditions. Illustratively, if the attribute tag of a feature vector is "2023, 1, 15", then once a fourth-level tag partition has been created in the feature vector date partition table, the feature vector is stored into the corresponding fourth-level tag partition.
In some embodiments, the grading of primary label partitions may be accomplished as follows: determining whether the primary label partition meets the preset label grading condition; if it does, expanding the primary label partition to obtain at least one next-level label partition, and determining whether a target label partition meeting the label classification condition exists among the next-level label partitions; if a target label partition exists, continuing to expand it until the next-level label partitions obtained by expansion no longer meet the label classification condition or are N-level label partitions. For example, fig. 6 is a schematic tree distribution diagram of the region attribute data of feature vectors provided in the embodiment of the present application. As shown in fig. 6, taking the first-level tag partition province 1 as an example, it may contain 5 million feature vectors. Expanding this first-level tag partition yields three second-level tag partitions, namely city 1 (containing 3 million feature vectors), city 2 (containing 1 million feature vectors) and city 3 (containing 1 million feature vectors). The second-level tag partitions may be further partitioned step by step in the same way, which is not described again here.
The preset tag classification condition may be the size of the feature vector data in the tag partition, for example, when the feature vector data in the first-level tag partition exceeds a preset threshold, the first-level tag partition may be partitioned downward. In other embodiments, the tag classification condition may be set according to the actual service requirement, for example, in order to query a feature vector for a town, the tag partition of the previous stage may be classified.
Step S304, constructing a vector index file according to each N-level tag partition and the feature vectors stored in it. In this embodiment, after a tag partition is graded and next-level tag partitions are obtained, corresponding index files are constructed both for that tag partition and for its next-level tag partitions; if a tag partition is not graded, an index file is constructed only for that tag partition, so the number of index files can be reduced. Because tag partitions are graded only as needed, a balanced distribution of the feature vectors can also be ensured.
The index file can be generated through offline training. Since the required feature vector data is already stored in the partition table, the vector retrieval tool Faiss can be selected to train on the vector sample data and generate a Faiss index file. Faiss (Facebook AI Similarity Search) is a similarity search and clustering library open-sourced by the Facebook AI team. It provides efficient similarity search and clustering for dense vectors, supports billion-scale vector search, and is currently a mature approximate nearest neighbor search library.
Step S305, querying, according to the vector index file, to obtain target unstructured data matched with the source data. In this embodiment, with continued reference to fig. 1 and fig. 2, according to the number of copies configured in step S302 and the index files generated offline, the cluster manager notifies each node to load the vector index files carrying attribute characteristics. The query component provides a query aggregation service, and the query component of each node is connected to a plurality of nodes, so as to provide a highly available query service.
According to the embodiment of the application, after the feature vector is stored into the corresponding attribute partition table according to the attribute tag, the first-level tag partition can be expanded step by step according to the tag grading condition, so that the number of tag partitions is reduced as much as possible, the number of index files to be built is reduced, and the vector retrieval efficiency and speed are improved.
In some embodiments, when establishing the attribute partition table, the method may specifically be implemented by the following steps: acquiring the category number of attribute tags of the feature vector; and establishing an attribute partition table corresponding to each attribute tag, and establishing at least one primary tag partition associated with the attribute tag in the attribute partition table. In this embodiment, referring to fig. 4, a plurality of attribute partition tables may be created in the partition table, for example, a feature vector region partition table, a feature vector age partition table, and a feature vector industry partition table. After the unstructured data is converted, a corresponding attribute partition table is established according to the category of the attribute label of the feature vector. After the attribute partition table is built, a first-level tag partition of the attribute partition table needs to be built. For each primary label partition, a corresponding index file is illustratively required to be constructed. In addition, in other embodiments, if the amount of feature vector data in the primary label partition is smaller, for example, referring to fig. 4, if the amounts of feature vector data in the primary label partitions of province 1 and province 2 are smaller, province 1 and province 2 may be combined to construct an index file.
According to the embodiment of the application, the attribute partition tables corresponding to different attribute categories are created, the feature vectors are classified and stored according to the attribute labels of the feature vectors, the feature vectors of certain attributes can be conveniently and rapidly found from a large number of feature vectors, and the timeliness of vector retrieval is further improved.
In some embodiments, whether the tag classification condition is met may be determined as follows: acquiring a reference value of the tag partition; and determining whether the tag partition meets the tag classification condition by comparing the number of feature vectors assigned to the tag partition with the reference value of the tag partition. In this embodiment, a scoring mechanism for each level of tags may be established. The mixed vectors are written into the corresponding feature vector partition tables by attribute routing; at this point the partition tables store the first-level tag partition data of each attribute, and this data is analyzed by expanding it level by level to obtain detailed data for each attribute classification. By creating a feature vector attribute data tree distribution diagram (see fig. 6), the distribution of data across the tags of each level can be seen clearly.
Further, in other embodiments, the reference value may be determined according to the following steps: acquiring a time interval and an attenuation degree configured for the tag partition, wherein the time interval comprises an upper limit value and a lower limit value; determining the maximum number of feature vectors of the tag partition when the response time of vector retrieval is the upper limit value; determining the minimum number of feature vectors of the tag partition when the response time of vector retrieval is the lower limit value; selecting a target value between the minimum number and the maximum number of feature vectors; and multiplying the target value by the attenuation degree to obtain the reference value of the tag partition.
In this embodiment, based on a combined analysis of feature vector data statistics and online query response times, a reference value Nij for deciding whether a tag partition is expanded or merged is determined. The determination logic is given by the following formulas:
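$M_i = M_{i1} + M_{i2} + \dots + M_{ij}$
$N_{ij} = K_j \times \gamma_{ij}$
[Formula image BDA0004091143240000091 is not reproduced in the text]
$S_i = S_{i1} + S_{i2} + \dots + S_{ij}$
$1 \le i \le n,\quad 1 \le j \le m,\quad K_j \in [K_{jt_1}, K_{jt_2}],\quad 0 < \gamma_{ij} \le 1$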
In the above formulas, the relevant parameters are described in detail as follows:
(1) Subscript i represents the ith feature dimension and subscript j represents the jth level of the hierarchical label;
(2) Mij represents the total distribution number of the jth level tags of the ith feature dimension, and Mi represents the total tag number of the ith feature dimension;
(3) Kjt1 and Kjt2 are the vector data amounts when the response time is the lower limit value t1 and the upper limit value t2 respectively (the amount at t1 corresponds to the minimum number of feature vectors mentioned above, and the amount at t2 corresponds to the maximum number). The service response time is acceptable within the time interval [t1, t2]. Kj takes a value in [Kjt1, Kjt2] and represents the target value of the label partition at the jth level; a value can be selected from [Kjt1, Kjt2] as the target value according to the actual situation, so that the reference value Nij can be adjusted dynamically;
(4) γij represents the attenuation degree of the jth level label of the ith feature dimension; when j = 1, γ = 1, and when j > 1, γ < 1. As the level increases, the data amount of each tag decreases, but to ensure that hot spot data under deeper-level tags can still meet the construction standard, hot spot data tags are expanded as far as possible, and the tightness of expansion can be adjusted through γ;
(5) Nij represents the reference value for merging and expanding the jth level tag of the ith feature dimension. Nij must be set reasonably: if it is too small, there are too many tags, tags are not merged, and the Cartesian product formed by the attribute dimensions is also too large; if it is too large, the tags are not fully expanded, and the service cannot expand hot spot data areas effectively in real time;
(6) Wij represents the total number of feature vectors of the jth level tag of the ith feature dimension, Sij represents the number of jth level tags of the ith feature dimension after the merging and expansion processing, and Si represents the number of tags of the ith feature dimension after the merging and expansion processing;
(7) Through label merging and expansion, the total number of hierarchical labels is reduced from Mi to Si.
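The relation Nij = Kj × γij can be illustrated with a short sketch. The function name `reference_value` and the `pick` parameter (a linear position inside [Kjt1, Kjt2]) are assumptions made for this example; the text only states that Kj is chosen from the interval according to the actual situation. The interval bounds of 1 million and 2 million are likewise illustrative, chosen so that the results reproduce the reference values used in the fig. 6 example.

```python
def reference_value(k_lower: float, k_upper: float, gamma: float, pick: float = 1.0) -> int:
    """Return N_ij = K_j * gamma_ij, with K_j chosen inside [k_lower, k_upper].

    `pick` in [0, 1] is a hypothetical knob: 0 selects the lower bound
    (data amount at response time t1), 1 selects the upper bound (t2).
    """
    if k_lower > k_upper:
        raise ValueError("k_lower must not exceed k_upper")
    if not 0 < gamma <= 1:
        raise ValueError("gamma must satisfy 0 < gamma <= 1")
    if not 0 <= pick <= 1:
        raise ValueError("pick must lie in [0, 1]")
    k_j = k_lower + pick * (k_upper - k_lower)
    return round(k_j * gamma)

# Attenuation degrees for levels 1..4, as in the worked example of the text.
gammas = {1: 1.0, 2: 0.95, 3: 0.90, 4: 0.80}
references = {level: reference_value(1_000_000, 2_000_000, g) for level, g in gammas.items()}
```

With Kj = 2 million this yields 2 million, 1.9 million, 1.8 million and 1.6 million for levels 1 to 4, so deeper levels need slightly fewer vectors to qualify for expansion.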
For example, if the population covered by product C is mainly concentrated in region D1, an index file needs to be constructed for the first-level tag partition province G (province G is located in region D1). The cities in region D1 also receive relatively high scores, so the second-level tag partitions can construct corresponding index files as well, while whether the third-level and fourth-level tag partitions construct index files is decided from their scores. Region D2, by contrast, covers a smaller population, so the provinces forming the first-level tag partitions in region D2 can be merged into one index file, and the second-level, third-level and fourth-level tag partitions do not construct index files because their initial scores are low. A similar initial scoring mechanism can be adopted for the other attribute dimensions, so that the number of index files that actually need to be constructed is greatly reduced.
For example, analyzing fig. 6 with the values γi1 = 1, Ni1 = 2 million, γi2 = 0.95, Ni2 = 1.9 million, γi3 = 0.90, Ni3 = 1.8 million, γi4 = 0.80 and Ni4 = 1.6 million: because the feature vector data amount of province 1 >= Ni1, this first-level tag partition constructs an index on its own. Expanding to the second-level tags, the feature vector data amount of city 1 >= Ni2, so city 1 can construct an index file, while city 2 and city 3 cannot. Expanding to the third-level and fourth-level tag partitions, the feature vector data amount of region 1 >= Ni3 and the feature vector data amount of town 1 >= Ni4, so index files are constructed for them and not for the other partitions. In this way index files are built only for hot spot data, while cold data is not expanded.
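The walk through fig. 6 can be sketched as a recursive expansion that only descends into partitions reaching the reference value of their level. The `Partition` class and `partitions_to_index` function are hypothetical names; the counts for province 1 and the three cities come from the text, whereas the counts below city level are assumed purely for illustration because the text does not state them.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    name: str
    count: int                       # number of feature vectors in the partition
    children: list = field(default_factory=list)

# Reference values N_i1..N_i4 from the worked example.
REFERENCE = {1: 2_000_000, 2: 1_900_000, 3: 1_800_000, 4: 1_600_000}

def partitions_to_index(node, level=1, max_level=4):
    """Names of hot partitions that get their own index file.

    A partition qualifies when its count reaches the reference value of its
    level; only qualifying partitions are expanded to the next level.
    """
    if node.count < REFERENCE[level]:
        return []
    selected = [node.name]
    if level < max_level:
        for child in node.children:
            selected += partitions_to_index(child, level + 1, max_level)
    return selected

# Counts below city level are assumptions, not values from the source.
region1 = Partition("region 1", 2_000_000,
                    [Partition("town 1", 1_700_000), Partition("town 2", 300_000)])
city1 = Partition("city 1", 3_000_000, [region1, Partition("region 2", 1_000_000)])
province1 = Partition("province 1", 5_000_000,
                      [city1, Partition("city 2", 1_000_000), Partition("city 3", 1_000_000)])

hot = partitions_to_index(province1)
```

Under these assumed counts the hot path is province 1, city 1, region 1 and town 1, and every other partition stays unexpanded.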
According to the embodiment of the application, the scoring mechanism of the label partitions is established, which label partitions can be partitioned and which label partitions cannot be partitioned is determined, so that the number of label levels of each attribute dimension is reasonable, the number of index files required to be constructed and formed is reduced, and meanwhile, the balanced distribution of cold and hot data is also solved.
Further, in other embodiments, non-target tag partitions of each level of tag partitions that do not meet the tag classification condition may be merged to form a merged partition.
In this embodiment, a label merging mechanism may be set: when some label partitions at a given level do not satisfy the tag classification condition, they can be merged to form a merged partition, for which an index file is then constructed. For example, fig. 7 is a schematic diagram of label merging provided in the embodiment of the present application, with γi1 = 1, Ni1 = 2 million, γi2 = 0.95, Ni2 = 1.9 million, γi3 = 0.90, Ni3 = 1.8 million, γi4 = 0.80 and Ni4 = 1.6 million. As shown in fig. 7, the feature vector data amount of province 2 is 1.5 million and that of province 3 is 1 million, both smaller than Ni1. The data of province 2 and province 3 may therefore be merged to form a merged partition, for which a separate index file is constructed.
According to the embodiment of the application, a plurality of label partitions that do not meet the tag classification condition are merged into one merged partition, so that only one index file needs to be constructed for them. This reduces the number of index files to be constructed while ensuring a balanced distribution of cold and hot data.
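The merging rule can be sketched as below, using the province 2 and province 3 amounts from the fig. 7 example and the province 1 amount from the fig. 6 example. `plan_indexes` is a hypothetical helper; each inner list of its result stands for one index file.

```python
def plan_indexes(counts: dict, reference: int) -> list:
    """Group partitions into index files: hot partitions alone, cold ones merged."""
    hot = [name for name, n in counts.items() if n >= reference]
    cold = [name for name, n in counts.items() if n < reference]
    plan = [[name] for name in hot]
    if cold:
        plan.append(cold)  # one merged partition for all cold partitions
    return plan

first_level = {"province 1": 5_000_000, "province 2": 1_500_000, "province 3": 1_000_000}
plan = plan_indexes(first_level, reference=2_000_000)  # N_i1 = 2 million
```

Three first-level partitions therefore need only two index files: one for province 1 and one for the merged partition of province 2 and province 3.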
In some embodiments, when constructing the vector index file, a vector index file of a single attribute partition may be constructed, and a vector index file of a multi-attribute partition may also be constructed, which may be specifically implemented by the following steps: constructing a vector index file corresponding to the first-level tag partition in the attribute partition table according to the feature vector of each first-level tag partition in the attribute partition table; selecting at least one tag partition from at least two attribute partition tables to form a multi-attribute tag partition; and constructing a vector index file of the multi-attribute label partition according to the feature vectors in the multi-attribute label partition.
In this embodiment, the required feature vector data is already stored in the attribute partition tables, and Faiss may be selected to train on the vector sample data to generate the Faiss index files. For example, fig. 8 is a schematic diagram of a general implementation of real-time vector search provided in the embodiment of the present application. As shown in fig. 8, four partition tables (a feature vector date partition table, a feature vector region partition table, a feature vector industry partition table and a feature vector age partition table) may be established according to the attribute fields "date", "region", "industry" and "age". Taking the feature vector region partition table as an example, three regions (namely province GD, region SX and province GX) are used as first-level region label partitions. When feature vector data and attributes are written, they are routed by the region field into the corresponding first-level region label partition, and the vector data is then maintained in each partition table.
When vector index files of multi-attribute tag partitions are constructed for a mixed attribute data structure, the Cartesian product formed by the date, region, industry and age attributes is too large, so it can be reduced through the tag grading mechanism. Specifically, referring to fig. 8, the feature vector date partition table may be partitioned by the second-level label, month. For example, among the second-level labels of 2022 and 2021, only the months 2022-10 and 2022-9 meet the label classification condition, so the number of second-level labels of this attribute drops from 24 to 2. The region attribute can be expanded by the second-level label, city. For example, the first-level tag partitions of the region are province GD, region SX and province GX, and only the second-level labels city GZ and city SZ under province GD meet the standard, so the number of tag partitions becomes 2. In this example, the Cartesian product formed by date and region therefore contains 4 combinations. By a similar method, a corresponding number of index files can be built for the industry and age attributes, so that the number of vector index files of multi-attribute label partitions is reduced.
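The size of the reduced Cartesian product can be checked with a few lines. The partition names are taken from the example above; writing a combined tag as "date/region" is an assumption of this sketch.

```python
from itertools import product

# Second-level partitions that met the tag classification condition.
date_partitions = ["2022-10", "2022-9"]
region_partitions = ["city GZ", "city SZ"]

# One multi-attribute tag partition (and one index file) per combination.
multi_attribute_partitions = [
    f"{date}/{region}" for date, region in product(date_partitions, region_partitions)
]
```

Only 2 × 2 = 4 multi-attribute partitions remain, instead of one for every month and city combination.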
For the primary label partition, a corresponding vector index file needs to be constructed, and with continued reference to fig. 8, taking a feature vector date partition table and a feature vector area partition table as an example, the primary label partitions in the feature vector date partition table include 2022, 2021 and 2020, and then the vector index files of the three primary label partitions need to be constructed. The first-level tag partitions in the feature vector region partition table comprise province GD, region SX and province GX, and then vector index files of the three first-level tag partitions need to be constructed.
The vector index file can be constructed by offline calculation. Based on the data of the partition table, a local file of index IDs and vector data is generated. The vector retrieval tool Faiss may be selected to perform the vector calculation and generate the vector index files of the corresponding partitions, for example three index files index_file_1, index_file_2 and index_file_3. The specific process by which Faiss generates the vector index file of a partition is as follows. Inverted product quantization (also called IVFxPQy) is adopted. IVFx uses the idea of an inverted index to record the vector IDs under each cluster center, that is, a set of non-center vectors is attached behind each center ID; for each query vector, the several nearest center IDs are found first, and then the non-center vectors under those centers are searched. The x in IVFx is the number of k-means cluster centers. PQy is the product quantization method, which improves on general search: the dimensions of a vector are cut into y segments, each segment is searched separately, and the per-segment search results are intersected to obtain the final TopK (that is, the K top-ranked results). y is the number of segments into which the vector is cut, so y must divide the vector dimension evenly; the larger y is, the finer the cut and the higher the time complexity.
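The inverted-list part of this scheme can be illustrated with a toy sketch in pure Python. This is not Faiss and does not use its API: the centroids are assumed to be already trained (in practice by k-means), `nprobe` is the number of nearest centers that are searched, and all function names are hypothetical. Product quantization is not implemented; only the requirement that y divide the vector dimension is shown.

```python
import math

def nearest_centroids(centroids, vector, n):
    """Indices of the n centroids closest to `vector` (Euclidean distance)."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], vector))
    return order[:n]

def build_inverted_lists(centroids, vectors):
    """Attach every vector ID to the list of its nearest cluster center."""
    lists = {c: [] for c in range(len(centroids))}
    for vector_id, vector in vectors.items():
        lists[nearest_centroids(centroids, vector, 1)[0]].append(vector_id)
    return lists

def search(centroids, lists, vectors, query, top_k, nprobe):
    """Search only the lists of the nprobe nearest centers, then rank candidates."""
    candidates = [vid for c in nearest_centroids(centroids, query, nprobe) for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(vectors[vid], query))
    return candidates[:top_k]

def pq_segment_length(dimension, y):
    """Length of each PQ segment; y must divide the vector dimension evenly."""
    if dimension % y != 0:
        raise ValueError("y must divide the vector dimension")
    return dimension // y

centroids = [[0.0, 0.0], [10.0, 10.0]]  # assumed pre-trained, x = 2
vectors = {"a": [0.1, 0.2], "b": [0.5, 0.4], "c": [9.8, 10.1], "d": [10.5, 9.5]}
lists = build_inverted_lists(centroids, vectors)
nearest = search(centroids, lists, vectors, [10.0, 10.0], top_k=2, nprobe=1)
```

With `nprobe=1` the query near (10, 10) only scans the list of the second center, so vectors "a" and "b" are never compared against it.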
According to the method and the device, the number of the tag partitions is reduced through the tag grading mechanism, the number of the formed multi-attribute tag partitions can be reduced, so that the number of vector index files for constructing the multi-attribute tag partitions is reduced, and when vector retrieval is carried out, retrieval from massive vector index files is not needed, and the timeliness of the vector retrieval is further improved.
In some embodiments, when querying the target unstructured data, the method can be specifically implemented by the following steps: acquiring a tag in source data, wherein the source data comprises a tag; searching a target index file from the vector index file according to the label in the source data; searching and obtaining a preset number of target feature vectors according to the target index file, wherein the similarity between the target feature vectors and the source data is greater than a preset similarity threshold; and determining target unstructured data matched with the source data according to the target feature vector.
In this embodiment, since the attribute partition tables have been created according to the attribute tag fields, the feature vector index to be queried can be found quickly by matching and routing on the tag.
For example, set topK = 10 (that is, find the 10 most similar vectors), with the source data: { "region": "province GD", "vector data": "[0.8656856, 0.83863276, … ]" }. The 10 feature vectors closest to the source data are retrieved, and by obtaining the unstructured data (for example, background pictures) corresponding to these feature vectors, abnormal users (for example, black-market fraud groups) in the fintech field can be found. For example, fig. 9 is a schematic diagram of a vector search query flow provided in the embodiment of the present application. As shown in fig. 9, "province GD" is first determined from the "region" tag in the source data; then, based on "province GD", the index file index_file_area1 of the specified tag partition is found, and the vector retrieval tool Faiss can quickly find similar vector data in it.
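The query flow of fig. 9 can be sketched as follows. A brute-force cosine similarity stands in for the Faiss search, the index contents and picture IDs are made-up sample data, and `INDEX_FILES`, `INDEX_CONTENT` and `query` are hypothetical names; only the tag "province GD" and the file name index_file_area1 come from the text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Tag partition -> index file, and index file -> stored vectors (sample data).
INDEX_FILES = {"province GD": "index_file_area1"}
INDEX_CONTENT = {
    "index_file_area1": {
        "img_001": [0.9, 0.8],
        "img_002": [0.1, 0.9],
        "img_003": [0.8, 0.9],
    }
}

def query(source, top_k, min_similarity):
    """Route by the region tag, then return the IDs of the most similar vectors."""
    index_name = INDEX_FILES[source["region"]]
    scored = [(cosine(source["vector data"], vector), vector_id)
              for vector_id, vector in INDEX_CONTENT[index_name].items()]
    scored = [item for item in scored if item[0] > min_similarity]
    scored.sort(reverse=True)
    return [vector_id for _, vector_id in scored[:top_k]]

source = {"region": "province GD", "vector data": [0.8656856, 0.83863276]}
matches = query(source, top_k=2, min_similarity=0.9)
```

The returned IDs would then be mapped back to the unstructured data (for example, the stored background pictures) they were derived from.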
According to the method and the device, the index file of the designated label partition is found based on the label in the source data, the similar vector data can be found by using the vector search tool rapidly, the corresponding target unstructured data can be determined based on the similar vector data, and the vector query and search efficiency is further improved.
Further, based on the above embodiments, in some embodiments, the tags in the source data may include a plurality of tags, for example, the source data includes two tags of "province GD" and "2022-07-27", and the target index file may be found from the vector index data file by the following steps: constructing a regular expression according to each label in the source data; and searching out the target index file from the vector index data file according to the regular expression.
In this embodiment, when the source data contains a plurality of labels, the query dimension is multi-attribute, and regular expressions may be used to match the vector index files of the corresponding label partitions in order to carry out the search query. Illustratively, taking the two labels "province GD" and "2022-07-27" as an example, the constructed regular expressions may include: "2022-07-27/province GD", "2022-07-/province GD". A plurality of vector index files may be matched through the regular expressions. For example, taking fig. 8 as an example, the vector index files obtained by matching may include at least "table_partition_data_area-tag (2022-07/province GD) -index_file_data_area", "table_partition_area-tag (province GD) -index_file_area".
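One possible way to build such a regular expression is sketched below. The exact pattern and the catalog layout are assumptions: the sketch supposes that each index file is registered under a tag string of the form "date/region" or "region", and that a date label should also match coarser partitions (month or year) containing it. The file names for the non-matching entries are invented for the example.

```python
import re

def build_pattern(date_label: str, region_label: str) -> re.Pattern:
    """Regex matching the region tag alone or prefixed by the date at any granularity."""
    year, month, day = date_label.split("-")
    date_part = f"{re.escape(year)}(?:-{re.escape(month)}(?:-{re.escape(day)})?)?"
    return re.compile(rf"^(?:{date_part}/)?{re.escape(region_label)}$")

# Hypothetical catalog: partition tag -> index file name.
CATALOG = {
    "2022-07/province GD": "index_file_data_area",
    "province GD": "index_file_area",
    "2022-10/province GD": "index_file_other_month",  # invented entry
    "province GX": "index_file_other_region",         # invented entry
}

pattern = build_pattern("2022-07-27", "province GD")
target_index_files = sorted(name for tag, name in CATALOG.items() if pattern.match(tag))
```

Both the month-level multi-attribute index and the region-level index are selected, while indexes for other months or other provinces are skipped.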
According to the embodiment of the application, the regular expression is constructed, so that the target index file can be quickly searched when the complex vector search query scene with multiple attributes in the query dimension is dealt with, and the timeliness of vector query retrieval is further improved.
Further, on the basis of the above embodiments, in other embodiments, the method further includes the following steps: monitoring the access traffic change of each level of label partition at the current moment, and determining whether there is an abnormal label partition whose access traffic change is abnormal; when an abnormal label partition exists, constructing an offline data analysis model according to the feature vectors and vector index files of the previous moment, and obtaining the online feature vectors of the current moment; calculating, according to the offline data analysis model, the reference similarity of each feature dimension and the similarity of the online feature vectors, wherein the feature dimensions comprise picture data, video data, audio data and natural language, the similarity of the online feature vectors represents the maximum similarity over all members of the user group under each feature dimension, and the reference similarity is used to determine whether the user group is a suspected group; comparing the reference similarity with the similarity of the online feature vectors, and when the similarity of the online feature vectors is greater than the reference similarity, performing vector retrieval analysis on the primary label partition or the secondary label partition to obtain the coarse-grained similarity of the online analysis; and comparing the coarse-grained similarity with the similarity of the online feature vectors, and constructing an index file of the abnormal label partition in real time when the similarity of the online feature vectors is larger than the coarse-grained similarity.
The above embodiments solve the problems of an excessively large Cartesian product of mixed vectors and of unbalanced distribution of hot and cold data, but the hierarchical labels are constructed from the feature vector partition table data that has already been written, so they cannot be constructed by sensing dynamic changes in label-level data in real time. For example, a certain level of label of a certain attribute may need to be expanded automatically according to the actual service scenario or a traffic change, for which the above method has no solution. This embodiment therefore provides a method for dynamically constructing an index file. Specifically, taking a town as the abnormal label partition: if traffic monitoring finds an anomaly in a town, there may be m suspected members of a group (for example, a black-market group) in it, but no town-level index file has been established. This is solved by the following steps:
First step: an offline data analysis model of the previous moment is established; the model contains highly expanded tag index files. In each of the feature dimensions image, video, text and audio, on the one hand the reference similarity Ti (i = 1, 2, 3, ..., n) of each feature dimension can be obtained through analysis; on the other hand, the similarity Si of the online feature vectors is calculated against the offline data. (Taking black-market fraud as an example, group members generally use similar pictures, voices, videos and the like to carry out fraudulent activities; for instance, the members usually appear in the same office location and so often have similar background pictures, audio or video, and these can be converted into online feature vectors.) As long as there exists an i such that Si >= Ti, the offline fine-grained similarity retrieval analysis is satisfied, and the real-time data analysis of the second step is carried out.
Second step: the attribute dimensions of the tags constructed online are not highly expanded, so a coarse-grained similarity retrieval analysis is performed on the primary tag partition or the secondary tag partition to obtain the coarse-grained similarity Ri. Only when Ri >= Si is satisfied does the process enter the real-time index file construction of the third step.
Third step: the hierarchical labels are constructed in real time, the fine-grained similarity retrieval is completed, and the fraud decision is made. For example, for the four-level tag partition of a town belonging to a region, an index file can be constructed for that four-level tag partition.
Further, in other embodiments, after the vector index file of the abnormal tag partition (the four-level tag partition of villages and towns in the above embodiments) is created, it may be determined whether the user community is a suspected community by: according to the index file of the abnormal label partition, carrying out similar retrieval of each characteristic dimension on the user group to obtain accurate similarity; and determining whether the user group is a suspected group according to the sizes of the accurate similarity and the reference similarity.
In this embodiment, similarity retrieval may be performed on the user group in each dimension to obtain the precise similarity Ui, and when Ui >= Ti is satisfied, it is determined that the user group is engaged in fraudulent activity. Because the tags have been expanded, other suspected groups in the area can also be identified and located quickly. By combining offline benchmark analysis of the data, coarse-grained analysis of real-time data and fine-grained real-time decisions, precise expansion of the tag attributes is achieved. Many other business scenarios can be handled in the same way. For example, a marketing campaign is run in region XJ, but its first-level tag partition is in a merged state, so there is no way to perform vector similarity retrieval analysis on region XJ alone; an index file therefore needs to be constructed separately for this first-level tag partition so that fine-grained similarity retrieval analysis can be performed and suspected groups can be guarded against.
Fig. 10 is a schematic flow chart of dynamic construction of an index file according to an embodiment of the present application. As shown in fig. 10, it includes the following steps: step S1001, monitoring traffic changes of each level of label partition; step S1002, finding abnormal traffic; step S1003, performing suspicion analysis on user groups with abnormal traffic; step S1004, performing offline similarity retrieval analysis on each feature dimension; step S1005, performing real-time first-level label partition similarity retrieval analysis on each feature dimension; step S1006, deciding whether the hierarchical label partition is expanded; step S1007, constructing the expanded hierarchical label index file in real time; step S1008, performing similarity retrieval analysis on the real-time expanded tags in each feature dimension; step S1009, deciding whether the user group is a suspected group.
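The three-stage decision can be summarized in a small sketch. It follows the conditions as stated in the step descriptions (Si >= Ti, then Ri >= Si, then Ui >= Ti); treating each condition as satisfied when it holds for at least one feature dimension i is an assumption for the second and third stages, and `decide` and `exact_search` are hypothetical names. The similarity values are made up for illustration.

```python
def any_reaches(values, thresholds):
    """True if values[i] >= thresholds[i] for at least one feature dimension i."""
    return any(v >= t for v, t in zip(values, thresholds))

def decide(T, S, R, exact_search):
    """Return the outcome of the offline, coarse-grained and fine-grained stages.

    T: reference similarities, S: online similarities from the offline model,
    R: coarse-grained similarities, exact_search: callable that builds the
    index of the abnormal partition and returns the precise similarities U.
    """
    if not any_reaches(S, T):
        return "stop: offline analysis not satisfied"
    if not any_reaches(R, S):
        return "stop: no real-time index construction"
    U = exact_search()  # fine-grained retrieval on the newly built index
    return "suspected group" if any_reaches(U, T) else "not a suspected group"

# Dimensions: picture, video, audio, natural language (illustrative numbers).
T = [0.90, 0.90, 0.90, 0.90]
S = [0.93, 0.40, 0.50, 0.30]
R = [0.95, 0.45, 0.55, 0.35]
outcome = decide(T, S, R, exact_search=lambda: [0.97, 0.50, 0.60, 0.40])
```

The expensive step, building the index of the abnormal partition, is only triggered after both cheaper checks pass.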
For step S1001: monitor the traffic change of the hierarchical tags of each attribute. Specifically, since hierarchical tags have already been divided for each feature vector attribute, the access traffic change of each hierarchical tag can be monitored through anomaly monitoring; if a traffic anomaly alarm occurs, service personnel need to step in and analyze whether abnormal behavior is present. An obvious characteristic of such behavior is that an illicit (black-market) group concentrated in a certain area initiates it, so the traffic changes significantly. For example, fig. 11 is a schematic diagram of abnormal traffic monitoring provided in an embodiment of the present application; as shown in fig. 11, monitoring finds that the access traffic of "town 9" has risen by 10%, which is an abnormal traffic change, so subsequent similarity retrieval analysis is needed.
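The monitoring in step S1001 can be sketched in a few lines of Python. This is only an illustration: the function name `find_abnormal_partitions`, the per-partition visit-count dictionaries, and the 10% relative-rise threshold (borrowed from the "town 9" example) are assumptions, not details fixed by the embodiment.

```python
def find_abnormal_partitions(previous, current, rise_threshold=0.10):
    """Return tag partitions whose access traffic rose by at least rise_threshold.

    previous / current map a tag-partition name to its visit count in the
    previous and current monitoring windows (hypothetical data layout).
    """
    abnormal = []
    for partition, now in current.items():
        before = previous.get(partition, 0)
        if before == 0:
            # No baseline traffic: skip rather than divide by zero.
            continue
        rise = (now - before) / before
        if rise >= rise_threshold:
            abnormal.append(partition)
    return sorted(abnormal)


previous = {"town 8": 1000, "town 9": 1000, "town 10": 500}
current = {"town 8": 1020, "town 9": 1150, "town 10": 490}
print(find_abnormal_partitions(previous, current))
```

A production monitor would more likely compare against a longer baseline than a single previous window; the single-window comparison is used here only to keep the sketch short.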
For step S1004: perform offline similarity retrieval analysis. Specifically, because members of suspected groups often use similar index features, mainly pictures, videos, audio, natural language and the like, feature similarity can be determined by vector similarity retrieval. Taking the feature vector data and the corresponding vector index files at the previous moment as offline data, index files in which the attribute tag levels are fully expanded can be constructed offline; these include the index file of "town 9", which is used directly to calculate the similarity of the suspected group's vector data. The calculation logic is as follows: if there exists an i for which Si >= Ti holds, the offline data analysis condition is determined to be satisfied;
where the parameters are as follows: i = 1, 2, 3, ..., n, and i denotes the feature attribute (i = 1: picture; i = 2: video; i = 3: audio; i = 4: natural language); the range of i can be extended according to service requirements. Si = max(Si1, Si2, ..., Sim), where subscript m denotes the mth individual in the user group; that is, Si is the maximum similarity among all group members under the ith feature attribute. For example, Sim denotes the similarity of the mth individual under the ith feature attribute within the "town 9" four-level tag. The similarity is calculated as follows:
[Formula image BDA0004091143240000141, not reproduced in the text: defines Sim in terms of Dim, max(Dim) and min(Dim).]
In the above formula, Dim denotes the Euclidean distance computed for the mth member in the hierarchical tag vector library by the vector retrieval tool Faiss, where a smaller value indicates a higher similarity. max(Dim) and min(Dim) denote, respectively, the maximum and the minimum of all Euclidean distances computed for the mth member in the hierarchical tag vector library by Faiss.
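The formula itself survives only as an image reference, so its exact form cannot be confirmed from the text. As a hedged illustration, the sketch below assumes an inverted min-max normalisation, which is consistent with the statement that smaller Euclidean distances mean higher similarity but may differ from the formula actually filed. The function names and sample distances are hypothetical.

```python
def min_max_similarity(distance, d_min, d_max):
    """Map a Euclidean distance to a similarity in [0, 1].

    ASSUMPTION: inverted min-max normalisation, so the smallest distance maps
    to 1.0 and the largest to 0.0. The formula in the source is an image that
    is not reproduced, so this is only one plausible reading.
    """
    if d_max == d_min:
        # All distances are equal; treat as maximally similar to avoid 0/0.
        return 1.0
    return (d_max - distance) / (d_max - d_min)


def group_similarity(member_similarities):
    """Si = max(Si1, ..., Sim): highest member similarity for one feature attribute."""
    return max(member_similarities)


distances = [0.2, 0.5, 1.0]  # hypothetical distances returned by a vector search
sims = [min_max_similarity(d, min(distances), max(distances)) for d in distances]
print(sims, group_similarity(sims))
```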
Ti is the reference similarity used to decide whether a group is a suspected group, where i denotes the feature attribute, m indexes the suspected groups that have already been decided and confirmed, w ranges over the members of a suspected group, and xim denotes the similarity of the mth suspected group.
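[Formula image BDA0004091143240000142, not reproduced in the text: defines the reference similarity Ti from the similarities xim of the confirmed suspected groups.]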
In the above formula, xim = max(xim1, xim2, ..., ximw), where ximw denotes the similarity of the wth member of the mth suspected group under the ith feature attribute, so xim is the maximum similarity within that suspected group. ximw is calculated as follows:
[Formula image BDA0004091143240000143, not reproduced in the text: defines ximw in terms of Dimw, max(Dimw) and min(Dimw).]
In the above formula, max(Dimw) denotes the maximum of all Euclidean distances computed, under the ith feature attribute, for the wth member of the mth suspected group in the hierarchical tag vector library by the vector retrieval tool Faiss, and min(Dimw) denotes the minimum of those Euclidean distances.
For step S1005: perform real-time coarse-grained tag similarity retrieval analysis. Specifically, coarse-grained similarity retrieval analysis is performed based on the first-level or second-level tag partition to obtain the coarse-grained similarity Ri of online analysis, where Ri = max(Ri1, Ri2, ..., Rim), and Rim is calculated as shown in the following formula:
[Formula image BDA0004091143240000144, not reproduced in the text: defines the coarse-grained similarity Rim.]
Only if there exists an i for which Ri >= Si is satisfied does the process enter the subsequent step of constructing the index file in real time; otherwise suspicion can be ruled out, because no feature similarity satisfying the condition can be found even at coarse granularity over a large range.
For step S1007: construct the expanded hierarchical tag index. Specifically, once the similarity retrieval conditions above have been satisfied, "town 9" is highly likely to contain a suspected group. "Town 9" is a four-level tag of the region attribute, and a vector index file needs to be constructed in real time from the feature vector partition data of "town 9".
For step S1009: make the anti-fraud decision. Specifically, based on the constructed index file, the accurate similarity Ui is calculated, where Ui = max(Ui1, Ui2, ..., Uim), and Uim is calculated as follows:
[Formula image BDA0004091143240000145, not reproduced in the text: defines the accurate similarity Uim.]
As long as there exists an i for which Ui >= Ti is satisfied, the user group is determined to be a suspected group exhibiting abnormal behavior; and because the tag has already been expanded, other suspected groups in the region can be rapidly identified and located.
The embodiment of the present application establishes a model for dynamically and precisely constructing mixed vector index files. Possible suspected groups are found by monitoring abnormal traffic changes. To confirm a suspected group, the offline fine-grained similarity is first obtained through the offline data analysis model of the previous moment; the online real-time coarse-grained similarity is then analyzed; and if the condition for tag expansion is met, an online fine-grained vector index file is finally constructed dynamically. In this way it can be accurately determined whether the group is in fact a suspected group, so that abnormal behavior of suspected groups can be rapidly monitored and identified in the financial technology field.
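The three-stage decision summarized above can be written as a small Python sketch. It follows the conditions as stated in the step descriptions (offline Si >= Ti, coarse-grained Ri >= Si, fine-grained Ui >= Ti, each required for at least one feature dimension i). The function name, the list-based inputs, and the returned strings are illustrative assumptions.

```python
def decide_suspected_group(S, R, U, T):
    """Three-stage decision over per-feature-dimension similarities.

    S: offline fine-grained similarities Si
    R: online coarse-grained similarities Ri
    U: online fine-grained (accurate) similarities Ui, or None if the
       fine-grained index has not been built yet
    T: reference similarities Ti
    Each argument is a list indexed by feature dimension i.
    """
    dims = range(len(T))
    if not any(S[i] >= T[i] for i in dims):
        return "cleared: offline condition not met"
    if not any(R[i] >= S[i] for i in dims):
        return "cleared: coarse-grained condition not met"
    if U is None:
        # Both gates passed: the expanded tag index must be built first.
        return "build fine-grained index"
    if any(U[i] >= T[i] for i in dims):
        return "suspected group"
    return "cleared: fine-grained condition not met"


print(decide_suspected_group([0.9, 0.2], [0.95, 0.1], None, [0.8, 0.8]))
```

Ordering the checks this way means the costly real-time index construction is only triggered after the two cheaper gates have passed.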
Fig. 12 is a schematic flow chart of a vector search query provided in an embodiment of the present application. As shown in fig. 12, it includes the following steps: step S1201, obtaining unstructured data; step S1202, extracting feature vectors and attribute tags; step S1203, storing the vectors and attribute tags into a feature vector attribute partition table; step S1204, establishing an attribute tag hierarchical calculation model; step S1205, calculating and generating the index data file corresponding to each attribute; step S1206, loading the index files on multiple machines in a distributed, multi-replica manner; step S1207, providing hybrid vector query capability.
For step S1204, the attribute tag hierarchical calculation model is established as follows: (1) Determine the attribute types through product form analysis. For example, the attribute types that product C needs to calculate and analyze are date, region, age, and industry; see fig. 5, which is a schematic diagram of the tag classification of product C.
(2) Decide the hierarchy of each attribute based on the feature vector partition table data already written. At this point the table stores the first-level tag partition data of each attribute; analyzing this data by expanding it level by level yields the detailed data of each attribute level. Illustratively, fig. 13 is a schematic diagram of the attribute tag hierarchy calculation model provided in an embodiment of the present application; as shown in fig. 13, the attribute tag hierarchy is determined by the attribute type and the attribute level. Referring to fig. 6 and fig. 7 above, the distribution of data across each level of tags can be seen clearly. Statistical analysis of the feature vector data is combined with analysis of online query response time to decide, according to the reference value Nij, whether a tag is expanded or merged; the decision logic follows the formulas below:
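$M_i = M_{i1} + M_{i2} + \dots + M_{ij}$
$N_{ij} = K_j \times \gamma_{ij}$
[Formula image BDA0004091143240000151, not reproduced in the text]
$S_i = S_{i1} + S_{i2} + \dots + S_{ij}$
$1 \le i \le n,\quad 1 \le j \le m,\quad K_j \in [K_{jt_1}, K_{jt_2}],\quad 0 < \gamma_{ij} \le 1$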
The parameters in the above formulas are as follows. Subscript i denotes the ith feature dimension and subscript j denotes the jth level of tag partition. Mij denotes the total number of jth-level tag partitions of the ith feature dimension, and Mi denotes the total number of tags of the ith feature dimension. Kjt1 and Kjt2 denote the amount of vector data in an index file when the response time is t1 and t2 respectively; the service response time is acceptable within the interval [t1, t2], and taking Kj from this range ensures that the reference value Nij can be adjusted dynamically according to actual conditions. γij denotes the attenuation degree of the jth-level tag partition of the ith feature dimension: γ = 1 when j = 1, and γ < 1 when j > 1. As the level increases, the amount of data under each tag decreases; to ensure that hot-spot data in deeper-level tag partitions can still meet the construction standard, tag partitions holding hot-spot data are expanded as far as possible, and γ controls how strict the expansion is. Nij denotes the reference value for merging and expanding the jth-level tag partitions of the ith feature dimension and can be set according to actual conditions. If Nij is too small, there are too many tag partitions, partitions are not merged, and the Cartesian product formed across the attribute dimensions becomes too large; if Nij is too large, tag partitions are not fully expanded, and the service cannot efficiently discover suspicious groups in hot-spot data areas in real time. Wij denotes the total amount of feature vector data in the jth-level tag partition of the ith feature dimension, Sij denotes the number of jth-level tag partitions of the ith feature dimension after merging and expansion, and Si denotes the number of tag partitions of the ith feature dimension after merging and expansion.
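The reference value N_ij = K_j × γ_ij and its constraints can be checked with a short Python sketch. The function name, the rounding to a whole number of vectors, and the capacity bounds of 1.5 million and 2.5 million are illustrative assumptions; the attenuation values are those of the worked example in this description.

```python
def reference_value(k_j, gamma_ij, k_t1, k_t2):
    """N_ij = K_j * gamma_ij, rounded to a whole number of vectors.

    k_t1 / k_t2 are the index sizes at response times t1 / t2, so K_j must
    lie in [k_t1, k_t2]; gamma_ij is the attenuation degree in (0, 1].
    """
    if not k_t1 <= k_j <= k_t2:
        raise ValueError("K_j must lie within [K_jt1, K_jt2]")
    if not 0 < gamma_ij <= 1:
        raise ValueError("gamma_ij must lie in (0, 1]")
    return round(k_j * gamma_ij)


K_T1, K_T2 = 1_500_000, 2_500_000   # hypothetical capacity bounds for [t1, t2]
gammas = [1.0, 0.95, 0.90, 0.80]    # attenuation degrees for levels 1 to 4
print([reference_value(2_000_000, g, K_T1, K_T2) for g in gammas])
```

Lowering γ for deeper levels lowers the bar a partition must clear, which is how hot-spot data at the town level can still qualify for its own index.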
The total hierarchical label partition number is reduced from Mi to Si through label merging and expanding processing.
(3) Tag expansion and merging of feature vector data. Tag partitions are expanded as follows, analyzed with reference to fig. 6 and fig. 7 above. For example, let γi1 = 1, Ni1 = 2 million; γi2 = 0.95, Ni2 = 1.9 million; γi3 = 0.90, Ni3 = 1.8 million; γi4 = 0.80, Ni4 = 1.6 million. Since the data amount of province 1 >= Ni1, this first-level tag partition builds its own index. Expanding to the second-level tag partitions, city 1 >= Ni2, so city 1 builds an index file, while cities 2 and 3 do not. Expanding to the third-level and fourth-level tags, region 1 >= Ni3 and town 1 >= Ni4, so they build index files and the others do not. In this way index files are built separately only for hot-spot data, while cold data is not expanded. Tag merging (see fig. 7 above): province 2 and province 3 are both < Ni1, but because first-level tag partitions must be built, the data of province 2 and province 3 is merged and a single index file is built for the merged partition. Through tag expansion and merging, the number of tag levels that each attribute needs to expand is reduced, and a balanced distribution of cold and hot data is ensured.
(4) After the tag partitions are expanded and merged, the number of tag levels in each attribute dimension is reasonable, and the number of mixed-attribute index files to be built is small.
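The expansion and merging rule can be traced with a small Python sketch. The thresholds are the reference values from the example above; the vector counts in `tree` are hypothetical, since the data of fig. 6 and fig. 7 is not reproduced in the text, and the function names and tuple layout are likewise illustrative.

```python
def expand(node, thresholds, level=1):
    """Return the partitions under `node` that get their own index file.

    node is (name, vector_count, children); thresholds[level - 1] is the
    reference value for that level. A partition below its level's reference
    value is treated as cold and is not expanded further.
    """
    name, count, children = node
    if count < thresholds[level - 1]:
        return []
    built = [name]
    if level < len(thresholds):
        for child in children:
            built.extend(expand(child, thresholds, level + 1))
    return built


def plan_index_files(first_level, thresholds):
    """Expand hot first-level partitions; merge cold ones into one index."""
    built, cold = [], []
    for node in first_level:
        names = expand(node, thresholds)
        if names:
            built.extend(names)
        else:
            cold.append(node[0])
    if cold:
        built.append("+".join(cold))  # merged cold partitions share one index
    return built


N = [2_000_000, 1_900_000, 1_800_000, 1_600_000]  # reference values, levels 1 to 4
tree = [  # hypothetical vector counts
    ("province 1", 2_500_000, [
        ("city 1", 2_000_000, [
            ("region 1", 1_850_000, [("town 1", 1_700_000, [])]),
        ]),
        ("city 2", 300_000, []),
        ("city 3", 200_000, []),
    ]),
    ("province 2", 900_000, []),
    ("province 3", 600_000, []),
]
print(plan_index_files(tree, N))
```

With these counts the plan contains one index per hot partition along the province 1 branch plus a single merged index for provinces 2 and 3, matching the outcome described in the example.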
For step S1205: calculate and generate the index data file corresponding to each attribute. Specifically, index files are generated through offline training. The partition table already stores the required feature vector data, and the vector retrieval tool Faiss can be used to train on the vector sample data and generate a Faiss index file.
For step S1206: load the index files in a distributed, multi-replica manner. Specifically, according to the number of replicas configured in the cluster and the index files generated offline, the cluster manager notifies each node to load the index files for the corresponding attribute features.
For step S1207: provide hybrid vector query capability. Specifically, a query component provides the query aggregation service; the query component of each node connects to multiple nodes, providing a highly available query service.
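As an illustration of query aggregation, the sketch below merges partial nearest-neighbour results returned by several nodes and keeps the k closest overall. The `(distance, vector_id)` tuple layout and the function name are assumptions; the description does not specify how the query component combines results.

```python
import heapq
from itertools import chain


def aggregate_top_k(partial_results, k):
    """Merge per-node (distance, vector_id) hits and keep the k nearest overall."""
    return heapq.nsmallest(k, chain.from_iterable(partial_results))


node_a = [(0.10, "vec-17"), (0.40, "vec-03")]  # hypothetical hits from one node
node_b = [(0.20, "vec-88"), (0.90, "vec-41")]  # hypothetical hits from another
print(aggregate_top_k([node_a, node_b], k=3))
```

If replicas of the same index can answer the same query, a real aggregator would also need to de-duplicate by vector identifier; that step is omitted here.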
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application. Fig. 14 is a schematic structural diagram of an information retrieval device provided in the embodiment of the present application, as shown in fig. 14, the information retrieval device 1400 may specifically include a data conversion module 1401, a partition table construction module 1402, a tag classification module 1403, an index construction module 1404, and a data query module 1405. The data conversion module is used for converting unstructured data into a mixed vector, wherein the unstructured data comprises at least one of picture data, video data, audio data and natural language, and the mixed vector comprises a feature vector and an attribute tag of the feature vector. The partition table construction module is used for building an attribute partition table according to the attribute labels of the feature vectors and storing the feature vectors into a first-level label partition in the attribute partition table. The label classification module is used for carrying out step-by-step partition on the first-level label partition according to preset label classification conditions to obtain at least one N-level label partition, determining the feature vector stored under the N-level label partition, wherein N is a positive integer with a value greater than or equal to 1. And the index construction module is used for constructing a vector index file according to each N-level tag partition and the feature vectors stored under the N-level tag partitions. And the data query module is used for querying and obtaining target unstructured data matched with the source data according to the vector index file.
Alternatively, the partition table construction module may specifically be configured to: acquiring the category number of attribute tags of the feature vector; and establishing an attribute partition table corresponding to each attribute tag, and establishing at least one primary tag partition associated with the attribute tag in the attribute partition table.
Alternatively, the tag ranking module may specifically be configured to: determining whether the primary label partition meets preset label grading conditions; if the primary label partition meets the preset label classification condition, expanding the primary label partition to obtain at least one next-stage label partition, and determining whether a target label partition meeting the label classification condition exists in the at least one next-stage label partition; if the target label partition exists, continuing to expand the target label partition until the next label partition obtained by expansion does not meet the label classification condition or the next label partition obtained by expansion is an N-level label partition.
Alternatively, the tag ranking module may specifically be configured to: acquire a reference value of the tag partition; and determine whether the tag partition meets the tag classification condition by comparing the number of feature vectors assigned to the tag partition with the reference value of the tag partition.
Alternatively, the tag ranking module may specifically be configured to: acquire a time interval and an attenuation degree configured for the tag partition, wherein the time interval comprises an upper limit value and a lower limit value; determine the maximum number value of the feature vectors of the tag partition when the response time of vector retrieval is the upper limit value; determine the minimum number value of the feature vectors of the tag partition when the response time of vector retrieval is the lower limit value; select a target value from the maximum number value and the minimum number value of the feature vectors; and multiply the target value by the attenuation degree to obtain the reference value of the tag partition.
Optionally, the device further comprises a partition merging module, configured to merge the non-target tag partitions that do not meet the tag classification condition in each level of tag partitions to form merged partitions.
Alternatively, the index building module may specifically be configured to: constructing a vector index file corresponding to the first-level tag partition in the attribute partition table according to the feature vector of each first-level tag partition in the attribute partition table; selecting at least one tag partition from at least two attribute partition tables to form a multi-attribute tag partition; and constructing a vector index file of the multi-attribute label partition according to the feature vectors in the multi-attribute label partition.
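One way to picture a multi-attribute tag partition is as a combination of one partition from each of two or more attribute partition tables. The sketch below enumerates such combinations and keeps the vectors common to every chosen partition. The intersection semantics, the data layout, and all names are assumptions made for illustration; the description does not state how the combined partition is populated.

```python
from itertools import product


def multi_attribute_partitions(tables):
    """Combine one tag partition per attribute table.

    tables: {attribute: {partition_name: set_of_vector_ids}} (hypothetical layout).
    Returns {(partition, ...): vector_ids} with one key element per attribute,
    in sorted attribute order; empty combinations are dropped.
    """
    if len(tables) < 2:
        raise ValueError("at least two attribute partition tables are required")
    attributes = sorted(tables)
    combined = {}
    for combo in product(*(sorted(tables[a]) for a in attributes)):
        ids = set.intersection(*(tables[a][p] for a, p in zip(attributes, combo)))
        if ids:
            combined[combo] = ids
    return combined


tables = {
    "region": {"province 1": {1, 2, 3}, "province 2": {4}},
    "age": {"18-30": {1, 4}, "31-50": {2, 3}},
}
print(multi_attribute_partitions(tables))
```

The number of combinations grows as the product of the partition counts per attribute, which is why keeping the number of partitions per attribute small matters for the number of mixed index files.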
Optionally, the data query module may specifically be configured to: acquiring a tag in source data, wherein the source data comprises a tag; searching a target index file from the vector index file according to the label in the source data; searching and obtaining a preset number of target feature vectors according to the target index file, wherein the similarity between the target feature vectors and the source data is greater than a preset similarity threshold; and determining target unstructured data matched with the source data according to the target feature vector.
Optionally, the data query module may specifically be configured to: constructing a regular expression according to each label in the source data; and searching out the target index file from the vector index data file according to the regular expression.
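The regular-expression lookup can be sketched as follows. The index file naming scheme (`attribute=value` pairs joined by a double underscore, with an `.index` suffix), the attribute order, and the function names are all hypothetical; the description only states that a regular expression is built from the tags and used to find the target index file.

```python
import re

# Hypothetical naming order for the attributes encoded in an index file name.
ATTRIBUTE_ORDER = ("date", "region", "age", "industry")


def build_index_pattern(labels):
    """Build a regex matching index file names that agree with the given tags."""
    parts = []
    for attribute in ATTRIBUTE_ORDER:
        value = labels.get(attribute)
        # Tags present in the source data must match exactly; absent ones match anything.
        value_pattern = re.escape(value) if value is not None else r"[^_]+"
        parts.append(f"{attribute}={value_pattern}")
    return re.compile("__".join(parts) + r"\.index")


def find_index_files(labels, file_names):
    """Return the index file names whose full name matches the tag pattern."""
    pattern = build_index_pattern(labels)
    return [name for name in file_names if pattern.fullmatch(name)]


files = [
    "date=2023__region=town9__age=18-30__industry=finance.index",
    "date=2023__region=town1__age=18-30__industry=finance.index",
]
print(find_index_files({"region": "town9", "age": "18-30"}, files))
```

Escaping each tag value with `re.escape` keeps characters such as `-` or `.` in a tag from being interpreted as regex syntax.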
Optionally, the device further comprises a real-time construction module, configured to: monitor the access traffic change of each level of tag partition at the current moment, and determine whether there is an abnormal tag partition whose access traffic change is abnormal; when an abnormal tag partition exists, construct an offline data analysis model according to the feature vectors and the vector index file at the previous moment, and obtain the online feature vectors at the current moment; calculate, according to the offline data analysis model, the reference similarity of each feature dimension and the similarity of the online feature vectors, wherein the feature dimensions comprise picture data, video data, audio data and natural language, the similarity of the online feature vectors represents the maximum similarity among all members of the user group under each feature dimension, and the reference similarity is used to determine whether the user group is a suspected group; compare the reference similarity with the similarity of the online feature vectors, and when the similarity of the online feature vectors is greater than or equal to the reference similarity, perform vector retrieval analysis on the first-level or second-level tag partition to obtain the coarse-grained similarity of online analysis; and compare the coarse-grained similarity with the similarity of the online feature vectors, and when the similarity of the online feature vectors is greater than or equal to the coarse-grained similarity, construct the index file of the abnormal tag partition in real time.
Optionally, the device further comprises a group determining module, configured to perform similarity retrieval on the user group in each feature dimension according to the index file of the abnormal tag partition to obtain the accurate similarity, and to determine whether the user group is a suspected group by comparing the accurate similarity with the reference similarity.
The device provided in the embodiment of the present application may be used to perform the method in the foregoing embodiment, and its implementation principle and technical effects are similar, and are not described herein again.
It should be understood that the division of the above device into modules is merely a division of logical functions; in an actual implementation, the modules may be fully or partially integrated into one physical entity, or may be physically separate. The modules may all be implemented as software called by a processing element, or all be implemented in hardware; alternatively, some modules may be implemented as software called by a processing element and the others in hardware. For example, the data conversion module may be a separately provided processing element, may be integrated in a chip of the above device, or may be stored in a memory of the above device in the form of program code and called by a processing element of the above device to execute its functions. The other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 15, the electronic device 1500 may include at least one processor 1501 and a memory 1502; fig. 15 takes an electronic device with one processor as an example. The memory 1502 is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory 1502 may include high-speed RAM and may further include non-volatile memory, such as at least one disk memory. The processor 1501 is configured to execute the computer-executable instructions stored in the memory 1502 to implement the methods of the above method embodiments.
The processor 1501 may be a central processing unit (central processing unit, abbreviated as CPU), or an application specific integrated circuit (application specific integrated circuit, abbreviated as ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
Alternatively, the memory 1502 may be separate from or integrated with the processor 1501. When the memory 1502 is a device separate from the processor 1501, the electronic device 1500 may further include a bus 1503 for connecting the processor 1501 and the memory 1502. The bus may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on, but this does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 1502 and the processor 1501 are integrated on a chip, the memory 1502 and the processor 1501 may complete communication through an internal interface.
Embodiments of the present application also provide a computer-readable storage medium, which may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. Specifically, the computer-readable storage medium stores program instructions for the methods in the above method embodiments.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the method of the above-described method embodiments.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. An information retrieval method, comprising:
converting unstructured data into a mixed vector, wherein the unstructured data comprises at least one of picture data, video data, audio data and natural language, and the mixed vector comprises a feature vector and an attribute tag of the feature vector;
according to the attribute labels of the feature vectors, an attribute partition table is established, and the feature vectors are stored into a first-level label partition in the attribute partition table;
according to preset tag classification conditions, carrying out step-by-step partitioning on the first-level tag partition to obtain at least one N-level tag partition, determining feature vectors stored under the N-level tag partition, wherein N is a positive integer with a value greater than or equal to 1;
constructing a vector index file according to each N-level tag partition and the feature vectors stored under the N-level tag partitions;
and inquiring to obtain target unstructured data matched with the source data according to the vector index file.
2. The method of claim 1, wherein the creating an attribute partition table according to the attribute tags of the feature vectors comprises:
acquiring the category number of the attribute tags of the feature vector;
and establishing an attribute partition table corresponding to each attribute tag, and establishing at least one primary tag partition associated with the attribute tag in the attribute partition table.
3. The method of claim 1, wherein the step-wise partitioning the first-level tag partition according to a preset tag classification condition to obtain at least one N-level tag partition comprises:
determining whether the primary label partition meets preset label classification conditions or not;
if the primary label partition meets the preset label classification condition, expanding the primary label partition to obtain at least one next-stage label partition, and determining whether a target label partition meeting the label classification condition exists in the at least one next-stage label partition;
if the target label partition exists, continuing to expand the target label partition until the next label partition obtained by expansion does not meet the label classification condition or the next label partition obtained by expansion is the N-level label partition.
4. A method according to claim 3, wherein determining whether a tag classification condition is satisfied comprises:
acquiring a reference value of the tag partition;
and determining whether the label partition meets the label grading condition according to the number of the feature vectors divided under the label partition and the size between the reference values of the label partition.
5. The method of claim 4, wherein the obtaining the reference value of the tag partition comprises:
acquiring a time interval and attenuation degree configured for the tag partition, wherein the time interval comprises an upper limit value and a lower limit value;
determining a maximum number value of the feature vectors of the tag partition when the response time of vector retrieval is the upper limit value;
determining a minimum number value of the feature vector of the tag partition when the response time of vector retrieval is the lower limit value;
selecting a target value from the maximum number value of the feature vectors and the minimum number value of the feature vectors;
and multiplying the target value by the attenuation degree, and calculating to obtain the reference value of the label partition.
6. The method according to claim 2, wherein the method further comprises:
and merging non-target label partitions which do not meet the label classification conditions in each level of label partitions to form a merging partition.
7. The method according to claim 1, wherein constructing a vector index file from each N-level tag partition and feature vectors stored under the N-level tag partition comprises:
constructing a vector index file corresponding to the first-level tag partition in the attribute partition table according to the feature vector of each first-level tag partition in the attribute partition table;
selecting at least one tag partition from at least two attribute partition tables to form a multi-attribute tag partition;
and constructing a vector index file of the multi-attribute label partition according to the feature vector in the multi-attribute label partition.
8. The method according to any one of claims 1-7, wherein querying the target unstructured data matching the source data from the vector index file comprises:
acquiring a tag in the source data, wherein the source data comprises a tag;
searching a target index file from the vector index file according to the label in the source data;
searching and obtaining a preset number of target feature vectors according to the target index file, wherein the similarity between the target feature vectors and the source data is greater than a preset similarity threshold;
and determining target unstructured data matched with the source data according to the target feature vector.
9. The method according to claim 8, wherein, if the source data includes more than two tags, searching the target index file from the vector index file comprises:
constructing a regular expression according to each tag in the source data;
and finding the target index file among the vector index files according to the regular expression.
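A sketch of the regular-expression lookup, under a hypothetical naming convention in which each index file is named after its tags joined by `+` with an `.idx` suffix (for example `image+retail+shenzhen.idx`). The patent text does not define the file naming scheme. The pattern uses one lookahead per tag so that a file matches only if it carries every tag, in any order.

```python
import re
from typing import Iterable, List

def build_tag_pattern(tags: Iterable[str]) -> re.Pattern:
    """Compile a pattern matching file names that contain every tag.

    Each lookahead requires the tag to appear as a whole '+'-delimited
    field, so 'image' does not match 'preimage'.
    """
    lookaheads = "".join(
        rf"(?=(?:.*\+)?{re.escape(tag)}(?:\+|\.idx$))" for tag in tags
    )
    return re.compile("^" + lookaheads + ".*$")

def find_index_files(tags: Iterable[str], file_names: Iterable[str]) -> List[str]:
    """Return the index file names that carry all of the given tags."""
    pattern = build_tag_pattern(tags)
    return [name for name in file_names if pattern.match(name)]
```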
10. The method according to any one of claims 1-7, further comprising:
monitoring the change in access traffic of each level of tag partition at the current moment, and determining whether an abnormal tag partition with an abnormal change in access traffic exists;
when the abnormal tag partition exists, constructing an offline data analysis model according to the feature vectors and the vector index file at the previous moment, and obtaining an online feature vector at the current moment;
calculating, according to the offline data analysis model, a reference similarity of each feature dimension and a similarity of the online feature vector, wherein the feature dimensions comprise picture data, video data, audio data and natural language, the similarity of the online feature vector represents the maximum similarity among all members of the user group under each feature dimension, and the reference similarity is used for determining whether the user group is a suspected group;
comparing the reference similarity with the similarity of the online feature vector, and, when the similarity of the online feature vector is greater than or equal to the reference similarity, performing vector retrieval analysis according to the first-level tag partition or the second-level tag partition to obtain an online-analysis coarse-grained similarity;
and comparing the coarse-grained similarity with the similarity of the online feature vector, and constructing an index file of the abnormal tag partition in real time when the similarity of the online feature vector is greater than or equal to the coarse-grained similarity.
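The two comparisons at the end of this claim form a short decision cascade, sketched below. The function name, the returned labels, and the `coarse_search` callable (which stands in for the vector retrieval analysis over first-level or second-level partitions) are hypothetical. The coarse search is only invoked when the first comparison passes, which mirrors the order of the steps in the claim.

```python
from typing import Callable

def handle_abnormal_partition(online_sim: float, reference_sim: float,
                              coarse_search: Callable[[], float]) -> str:
    """Decide how far to escalate analysis of an abnormal tag partition."""
    if online_sim < reference_sim:
        # Below the reference similarity: no further analysis is triggered.
        return "no_action"
    # Coarse-grained retrieval over first- or second-level partitions.
    coarse_sim = coarse_search()
    if online_sim < coarse_sim:
        return "coarse_only"
    # Online similarity reaches the coarse-grained similarity:
    # build the abnormal partition's index in real time.
    return "build_realtime_index"
```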
11. The method according to claim 10, wherein the method further comprises:
performing, according to the index file of the abnormal tag partition, similarity retrieval in each feature dimension on the user group to obtain a precise similarity;
and determining whether the user group is a suspected group by comparing the precise similarity with the reference similarity.
12. An information retrieval apparatus, comprising:
the data conversion module is used for converting unstructured data into a mixed vector, wherein the unstructured data comprises at least one of picture data, video data, audio data and natural language, and the mixed vector comprises a feature vector and an attribute label of the feature vector;
the partition table construction module is used for establishing an attribute partition table according to the attribute tags of the feature vectors and storing the feature vectors into a first-level tag partition in the attribute partition table;
the tag classification module is used for partitioning the first-level tag partition level by level according to preset tag classification conditions to obtain at least one N-level tag partition, and determining the feature vectors stored under the N-level tag partition, wherein N is a positive integer greater than or equal to 1;
the index construction module is used for constructing a vector index file according to each N-level tag partition and the feature vectors stored under the N-level tag partitions;
and the data query module is used for querying and obtaining target unstructured data matched with the source data according to the vector index file.
CN202310151989.5A 2023-02-10 2023-02-10 Information retrieval method and device Pending CN116186298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310151989.5A CN116186298A (en) 2023-02-10 2023-02-10 Information retrieval method and device

Publications (1)

Publication Number Publication Date
CN116186298A true CN116186298A (en) 2023-05-30

Family

ID=86442007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310151989.5A Pending CN116186298A (en) 2023-02-10 2023-02-10 Information retrieval method and device

Country Status (1)

Country Link
CN (1) CN116186298A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910310A (en) * 2023-06-16 2023-10-20 广东电网有限责任公司佛山供电局 Unstructured data storage method and device based on distributed database
CN116910310B (en) * 2023-06-16 2024-02-13 广东电网有限责任公司佛山供电局 Unstructured data storage method and device based on distributed database

Legal Events

Date Code Title Description
PB01 Publication