CN113918807A - Data recommendation method and device, computing equipment and computer-readable storage medium - Google Patents

Data recommendation method and device, computing equipment and computer-readable storage medium

Info

Publication number
CN113918807A
Authority
CN
China
Prior art keywords
data
embedding vector
target
index
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111121375.XA
Other languages
Chinese (zh)
Inventor
韩宇龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111121375.XA priority Critical patent/CN113918807A/en
Publication of CN113918807A publication Critical patent/CN113918807A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9014 Indexing; Data structures therefor; Storage structures hash tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention relates to the technical field of data processing, and discloses a data recommendation method, which comprises the following steps: acquiring a query request, wherein the query request comprises data to be matched; processing the data to be matched into a target embedding vector; generating a target embedding vector hash key-value pair according to the target embedding vector; querying a pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matched with the data to be matched, wherein the pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs, and each embedding vector hash key-value pair comprises the locality-sensitive hash value of an embedding vector and the embedding vector itself; and generating a recommendation result according to the list of embedding vector hash key-value pairs. In this way, the embodiment of the invention improves the efficiency of data recommendation.

Description

Data recommendation method and device, computing equipment and computer-readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data recommendation method, a data recommendation device, a computing device and a computer readable storage medium.
Background
Currently, a mainstream recommendation system has three core steps: feature engineering, recall, and ranking. Depending on the application and scenario, neighbor recommendation may take place in the recall stage, the ranking stage, or both, and it relies on the output of the feature engineering stage as its input. Neighbor recommendation is usually implemented with the idea of collaborative filtering: vectors or a relation matrix are first constructed; the similarity between vectors is then calculated pairwise by a similarity measure (such as cosine similarity) or a weight quantification scheme; an inter-item similarity matrix or a user-item relation is formed according to the similarity, or a model is built on the vectors and matrix relations to predict scores for the missing entries in the relation matrix; and the final recommendation result is generated by ranking according to the scores or similarities. As a result, data processing efficiency is low.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present invention provide a data recommendation method, an apparatus, a computing device, and a computer-readable storage medium, which are used to solve the technical problem in the prior art that data recommendation cannot be performed efficiently.
According to an aspect of an embodiment of the present invention, there is provided a data recommendation method, including:
acquiring a query request, wherein the query request comprises data to be matched;
processing the data to be matched into a target embedding vector;
generating a target embedding vector hash key-value pair according to the target embedding vector;
querying a pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matched with the data to be matched; wherein the pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs, and each embedding vector hash key-value pair comprises the locality-sensitive hash value of an embedding vector and the embedding vector itself;
and generating a recommendation result according to the list of embedding vector hash key-value pairs.
In an optional manner, the construction method of the embedding vector database includes: acquiring user behavior data in a streaming or batch manner; for each piece of acquired user behavior data, processing the user behavior data into an embedding vector; generating an embedding vector hash key-value pair according to the embedding vector; and storing the generated embedding vector hash key-value pair in the embedding vector database.
In an optional manner, the generating a target embedding vector hash key-value pair according to the target embedding vector includes: generating a hash tree with the same dimension as the embedding vector, wherein the hash tree comprises a plurality of random vectors; calculating the angle between the target embedding vector and each random vector in the hash tree; determining the target locality-sensitive hash value according to the angles; and generating the target embedding vector hash key-value pair according to the target locality-sensitive hash value and the target embedding vector.
In an optional manner, the embedding vector database includes basic index data, neighbor set index data, and embedding vector index data; the construction method of the embedding vector database comprises the following steps: establishing neighbor set index data in the embedding vector database, wherein the neighbor set index data comprises a plurality of neighbor sets stored according to the key of the embedding vector hash key-value pair, and each neighbor set comprises similar embedding vector hash key-value pairs; establishing basic index data in the embedding vector database, wherein the basic index data comprises a plurality of topic sets stored according to topic, and each topic set comprises a plurality of embedding vector hash key-value pairs with the same topic; and establishing embedding vector index data in the embedding vector database, wherein the embedding vector index data comprises a plurality of embedding vector sets stored according to the topic and the user identifier or the content identifier.
In an optional manner, the embedding vector database stores data using an embedding index data structure, wherein the embedding index data structure comprises a row key column and a data storage column cluster; the neighbor set index data comprises a first row key column and a first data storage column cluster, the first row key column comprises a plurality of pieces of first row key information, the first data storage column cluster comprises at least one neighbor set corresponding to the first row key information, and embedding vector hash key-value pairs are stored in the neighbor set; the first row key information is generated according to the key of the embedding vector hash key-value pair; the basic index data comprises a second row key column and a second data storage column cluster, the second row key column comprises a plurality of pieces of second row key information, and the second data storage column cluster comprises the embedding vector hash key-value pairs corresponding to the second row key information; the second row key information is generated according to a topic; the embedding vector index data comprises a third row key column and a third data storage column cluster, the third row key column comprises a plurality of pieces of third row key information, the third data storage column cluster comprises the embedding vectors corresponding to the third row key information, and the third row key information is generated according to a topic and a user identifier or a content identifier.
In an optional manner, the method further comprises: acquiring user behavior data to be transmitted; determining, for the user behavior data to be transmitted, one or more of the key of its embedding vector hash key-value pair, its topic, and its topic together with a user identifier or a content identifier; searching the index database for a matching target row key column according to the determined information, wherein the target row key column is any one or more of the first row key column, the second row key column and the third row key column; determining whether the user behavior data to be transmitted is duplicate data according to the target row key column and the corresponding target data storage column cluster, wherein the target row key column stores one or more of the key, the topic, and the topic together with the user identifier or content identifier, and the target data storage column cluster stores the embedding vectors corresponding to the target row key column; when the user behavior data to be transmitted is duplicate data, deleting the user behavior data to be transmitted; and when the user behavior data to be transmitted is not duplicate data, adding the user behavior data to be transmitted to the embedding vector database.
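The duplicate check described above can be pictured with a small in-memory sketch. It is only an illustration under stated assumptions: a Python dictionary stands in for the row key column and data storage column cluster, and the class name `DedupStore` and method `ingest` are hypothetical names that do not come from the application.

```python
class DedupStore:
    """Toy stand-in for the index database: row key -> stored embedding vector."""

    def __init__(self):
        self.rows = {}

    def ingest(self, row_key, vector):
        """Store the record unless the same vector already sits under row_key.

        Returns True when the record was added, False when it was dropped
        as duplicate data.
        """
        vector = tuple(vector)
        if self.rows.get(row_key) == vector:
            return False  # duplicate: discard the incoming record
        self.rows[row_key] = vector
        return True


store = DedupStore()
first = store.ingest("topicA|user1", [0.1, 0.2])
second = store.ingest("topicA|user1", [0.1, 0.2])
```

In this sketch the check is a single key lookup at write time, so its cost does not grow with the number of records already stored.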
In an optional manner, after the establishing neighbor set index data in the embedding vector database, the method includes: constructing an index structure according to the neighbor set index data, wherein the index structure comprises a primary index, a primary index object, a secondary index and a secondary index object; the primary index comprises a truncation of the key of the embedding vector hash key-value pair, and the primary index object is the locality-sensitive hash value of each embedding vector hash key-value pair whose key contains that truncation; the secondary index is the locality-sensitive hash value in the embedding vector hash key-value pair, and the secondary index object is the embedding vector hash key-value pair.
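A two-level index of this kind might look like the following sketch. The choice of the leading bits as the truncation, the default widths, and the names `TwoLevelIndex`, `add` and `lookup` are all assumptions made for illustration; the sketch also keeps only one pair per hash value for brevity.

```python
from collections import defaultdict


class TwoLevelIndex:
    """Primary index: truncated key -> LSH values. Secondary: LSH value -> pair."""

    def __init__(self, prefix_bits=16, hash_bits=64):
        # Assumption: the truncation keeps the leading prefix_bits of the key.
        self.shift = hash_bits - prefix_bits
        self.primary = defaultdict(set)
        self.secondary = {}

    def add(self, lsh, vector):
        self.primary[lsh >> self.shift].add(lsh)
        self.secondary[lsh] = (lsh, vector)

    def lookup(self, lsh):
        """Return all stored pairs whose key shares the truncated prefix of lsh."""
        keys = self.primary.get(lsh >> self.shift, set())
        return [self.secondary[k] for k in sorted(keys)]


index = TwoLevelIndex(prefix_bits=4, hash_bits=8)
index.add(0b10110001, "a")
index.add(0b10111111, "b")
index.add(0b00001111, "c")
candidates = index.lookup(0b10110000)
```

Here `candidates` holds the two pairs whose keys start with `1011`; the pair stored under the `0000` prefix is never touched.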
In an optional manner, the establishing basic index data in the embedding vector database includes: sorting all embedding vector hash key-value pairs by the magnitude of their locality-sensitive hash values; and storing the sorted embedding vector hash key-value pairs, in order, into a plurality of topic sets according to a preset storage threshold, wherein the preset storage threshold is the maximum number of embedding vector hash key-value pairs stored in each topic set.
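The sort-then-split step can be sketched in a few lines. Pairs are modelled here as `(lsh_value, payload)` tuples, and the function name `split_into_sets` is a hypothetical choice; the application does not prescribe either.

```python
def split_into_sets(pairs, storage_threshold):
    """Sort pairs by LSH value, then cut them into sets of at most storage_threshold."""
    if storage_threshold <= 0:
        raise ValueError("storage_threshold must be positive")
    ordered = sorted(pairs, key=lambda pair: pair[0])
    return [
        ordered[start:start + storage_threshold]
        for start in range(0, len(ordered), storage_threshold)
    ]


pairs = [(9, "d"), (1, "a"), (5, "c"), (3, "b"), (7, "e")]
topic_sets = split_into_sets(pairs, storage_threshold=2)
```

Because the pairs are sorted before splitting, each resulting set covers a contiguous range of hash values.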
In an optional manner, the querying a pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matched with the data to be matched includes: looking up the target locality-sensitive hash value of the target embedding vector hash key-value pair in the neighbor set index data, and determining a target neighbor set matched with the target locality-sensitive hash value; calculating the similarity between the target locality-sensitive hash value and the locality-sensitive hash value of each embedding vector hash key-value pair in the target neighbor set; and determining the list of embedding vector hash key-value pairs according to the similarity.
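The similarity-ranking part of this query could be sketched as follows. The application does not name the similarity measure applied to the hash values, so the bitwise (Hamming) similarity used here is an assumption, as are the names `hamming_similarity`, `rank_neighbors` and the `top_k` parameter.

```python
def hamming_similarity(a, b, bits=64):
    """Fraction of bit positions on which two hash values agree."""
    return 1.0 - bin(a ^ b).count("1") / bits


def rank_neighbors(target_lsh, neighbor_set, top_k=10, bits=64):
    """Rank (lsh_value, payload) pairs in a neighbor set by similarity to the target."""
    scored = [
        (hamming_similarity(target_lsh, key, bits), key, payload)
        for key, payload in neighbor_set
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(key, payload) for _, key, payload in scored[:top_k]]


neighbor_set = [(0b00001111, "far"), (0b11110001, "near"), (0b11110000, "same")]
result = rank_neighbors(0b11110000, neighbor_set, top_k=2, bits=8)
```

Only the members of the matched neighbor set are scored, so the work per query depends on the size of that set rather than on the size of the whole database.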
According to another aspect of the embodiments of the present invention, there is provided a data recommendation apparatus including:
an acquisition module, used for acquiring a query request, wherein the query request comprises data to be matched;
a processing module, used for processing the data to be matched into a target embedding vector;
a generating module, used for generating a target embedding vector hash key-value pair according to the target embedding vector;
a matching module, used for querying a pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matched with the data to be matched, wherein the pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs, and each embedding vector hash key-value pair comprises the locality-sensitive hash value of an embedding vector and the embedding vector itself;
and a recommending module, used for generating a recommendation result according to the list of embedding vector hash key-value pairs.
According to another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium, in which at least one executable instruction is stored, and when the executable instruction is executed on a computing device, the computing device executes the operations of the data recommendation method.
The embodiment of the invention preserves locality sensitivity by mapping the embedding vector into a locality-sensitive hash value, and thereby deeply combines embedding techniques with locality-sensitive hashing. The mapping is not limited to static text or a one-dimensional space, and provides a further dimension-reduced representation of high-dimensional objects. Accordingly, after a query request is obtained, the data to be matched in the query request is processed into a target embedding vector; a target embedding vector hash key-value pair is generated according to the target embedding vector; and a pre-constructed embedding vector database is queried according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matched with the data to be matched, wherein the pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs, each comprising the locality-sensitive hash value of an embedding vector and the embedding vector itself. Finally, a recommendation result is generated according to the list, so that the efficiency of neighbor data recommendation is effectively improved.
Furthermore, by establishing index data and performing split sorting, and by combining multi-level index construction with the dynamic, regular movement of global data, the incremental calculation for exact deduplication is completed synchronously when data to be deduplicated is written to storage. The process accumulates incrementally rather than recomputing over the full data set each time, so index construction is efficient and exact deduplication values can be queried in real time. No calculation is performed at query time, and the query cost is independent of the data volume, so data processing efficiency is further improved.
The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the foregoing and other objects, features, and advantages of the embodiments become more apparent, specific embodiments of the present invention are described below.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a data recommendation method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an application environment of a data recommendation method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram illustrating neighbor set index data in a data recommendation method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram illustrating basic index data in a data recommendation method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram illustrating embedding vector index data in the data recommendation method according to the embodiment of the present invention;
FIG. 6 is a schematic flowchart illustrating data synchronization and index construction calculation in the data recommendation method according to the embodiment of the present invention;
FIG. 7 is a schematic flowchart illustrating a neighbor search and accurate deduplication process in a data recommendation method according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a data recommendation device according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computing device provided in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein.
In the process of implementing the embodiment of the present invention, the inventor of the present application found that, in the prior art, neighbor recommendation is mostly implemented with the idea of collaborative filtering, which consumes substantial model computing resources; in the recall stage in particular, the system usually faces a large or even massive amount of data, which further increases resource consumption. Meanwhile, calculating the similarity (such as cosine similarity) between vectors one by one is expensive in engineering terms and generally inefficient; for time-sensitive scenarios the calculation usually has to be completed offline in advance and cannot be performed in near real time. Some neighbor search or recommendation schemes with higher timeliness use a clustering algorithm for the critical similarity calculation and compare against cluster centers instead of calculating similarity one by one, which improves efficiency but usually loses accuracy relative to one-by-one comparison. In addition, the prior art cannot simultaneously support both efficient neighbor recommendation and exact deduplication over massive data. Similarity can also be calculated through locality-sensitive hashing, but in the prior art locality-sensitive hashes are mostly generated for text content (web page) objects, whose content consists entirely of characters with actual meaning or word sense; an embedding vector generated by machine learning cannot be mapped by the same method, and the meaning represented by the spatial properties of the vector cannot be preserved. Moreover, when facing large-scale data, one-by-one comparison also has an efficiency problem.
According to the embodiment of the invention, by designing and implementing a specific embedding index data structure and carrying out key processes such as index construction, split sorting, neighbor aggregation and neighbor search, neighbor recommendation is realized and the timeliness of recommendation in massive-data scenarios is improved without being limited by accuracy. The method is mainly intended to support efficient real-time neighbor recommendation (down to the millisecond level) over massive data (hundreds of millions of records) together with exact deduplication of the data (in theory with no upper limit on data volume, and with results available in real time). Based on multi-level/classified index construction and the dynamic, regular, continuous movement of global data, the same system can support both neighbor recommendation and exact deduplication of massive data.
FIG. 1 is a flow chart illustrating a data recommendation method, performed by a computing device, according to an embodiment of the invention. The computing device may be a computer device, a terminal device, a cloud processing device, a server, and the like, and the embodiment of the present invention is not particularly limited. As shown in fig. 1, the method comprises the steps of:
Step 110: acquiring a query request, wherein the query request comprises data to be matched.
The query request may be a request triggered when a user click or browse behavior is acquired. The data to be matched can be data generated by user clicking or browsing behaviors, and comprises user information data such as user ID and user gender, and content data such as browsing or clicking content.
Step 120: processing the data to be matched into a target embedding vector.
Embedding is a collective term for a set of language modeling and feature learning techniques in Natural Language Processing (NLP), in which words or phrases from the vocabulary are mapped to vectors of real numbers; conceptually, it is a mathematical embedding from a space with one dimension per word into a continuous vector space of much lower dimension.
In the embodiment of the invention, after the data to be matched is obtained, embedding vector processing is carried out on the data to be matched to obtain a target embedding vector. Specifically, on the basis of uniform label classification, the target embedding vector is obtained by means such as unsupervised modeling (word2vec, item2vec or the like for sequence data; DeepWalk, node2vec or the like for graph data), matrix factorization (such as collaborative filtering matrix factorization), or a DNN deep learning method (such as taking the weights of the embedding layer in a DNN model).
The data to be matched comprises user information data and content data. And performing embedding vector processing on the user information data and the content data respectively to obtain user embedding and content embedding.
Step 130: generating a target embedding vector hash key-value pair according to the target embedding vector.
In the embodiment of the present invention, a specific process of generating a target embedding vector hash key-value pair according to a target embedding vector includes:
generating a hash tree with the same dimension as the embedding vector; in the embodiment of the invention, 64 random vectors X form the hash tree. For two embedding vectors S and V, the similarity (distance) is defined as $p = \frac{S \cdot V}{\|S\|\,\|V\|}$. An n-dimensional random vector X is generated from the n-dimensional vector space with the same dimension as the embedding vector; the random vector X serves as a normal vector, and the corresponding random plane is Px.
calculating the angle between the target embedding vector Y and the random vector X.
determining the target embedding vector hash key-value pair according to the angle. A hash calculation is performed on the target embedding vector Y and the random vector X, wherein the half-space on the side toward which the random vector X points represents 1 and the opposite side represents 0. The angle between the target embedding vector Y and the random vector X determines on which side the target embedding vector falls: if $Y \cdot X > 0$, the hash bit is 1; otherwise it is 0. The calculation is performed for each of the 64 random vectors X against the target embedding vector Y, so that Y is mapped into a 64-bit hash value consisting of 0s and 1s, which is the target locality-sensitive hash value.
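The sign test described above can be written out as a short sketch. The embodiment fixes 64 random vectors and the rule that a bit is 1 when $Y \cdot X > 0$; the Gaussian sampling, the seed, and the function names `make_random_vectors`, `lsh_value` and `hamming` are assumptions added for illustration.

```python
import random

NUM_BITS = 64  # the embodiment uses 64 random vectors


def make_random_vectors(dim, num_bits=NUM_BITS, seed=0):
    """Draw num_bits random vectors with the same dimension as the embedding.

    Assumption: components are sampled from a standard Gaussian.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_value(embedding, random_vectors):
    """Pack one bit per random vector: 1 if Y . X > 0, otherwise 0."""
    value = 0
    for x in random_vectors:
        dot = sum(y_i * x_i for y_i, x_i in zip(embedding, x))
        value = (value << 1) | (1 if dot > 0 else 0)
    return value


def hamming(a, b):
    """Number of bit positions on which two hash values differ."""
    return bin(a ^ b).count("1")


vectors = make_random_vectors(dim=8)
y = [0.3, -1.2, 0.7, 0.1, 2.0, -0.4, 0.9, -0.6]
h = lsh_value(y, vectors)
```

Scaling `y` by a positive constant leaves every sign unchanged, so the hash depends only on the direction of the vector. With Gaussian random vectors, two embeddings separated by an angle θ agree on any single bit with probability 1 − θ/π, which is why nearby vectors tend to receive hash values that differ in few bits.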
The embodiment of the invention adopts a specific locality-sensitive hash generation algorithm, whose key difference from traditional locality-sensitive hash generation algorithms (such as minhash or simhash) is as follows: minhash and simhash are mainly aimed at text (web page) objects, whose content consists entirely of characters with actual meaning or word sense. Through the mapping described above, any technique involving space vectors (such as embedding) can be deeply combined with locality-sensitive hashing; the combination is not limited to static text or a one-dimensional space, and provides a further dimension-reduced representation of high-dimensional objects.
In the embodiment of the invention, after the target locality-sensitive hash value corresponding to the target embedding vector is obtained, the target embedding vector hash key-value pair is constructed. Specifically, the target locality-sensitive hash value is used as the key and the target embedding vector is used as the value. The value of the target embedding vector hash key-value pair may further include, in addition to the target embedding vector, the ID value of the corresponding object.
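One way to picture such a key-value pair is a small immutable record. The class name `EmbeddingHashPair`, its field names, and the helper `make_pair` are hypothetical; the application only states that the key is the locality-sensitive hash value and that the value holds the embedding vector and, optionally, an object ID.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EmbeddingHashPair:
    key: int                         # locality-sensitive hash value
    vector: Tuple[float, ...]        # the embedding vector itself
    object_id: Optional[str] = None  # optional ID of the corresponding object


def make_pair(lsh, embedding, object_id=None):
    """Build a pair from an LSH value, an embedding, and an optional object ID."""
    return EmbeddingHashPair(key=lsh, vector=tuple(embedding), object_id=object_id)


pair = make_pair(0b1011, [0.1, 0.2], object_id="user-42")
```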
Step 140: querying a pre-constructed embedding vector database according to the target locality-sensitive hash value to obtain a list of embedding vector hash key-value pairs matched with the data to be matched.
The pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs; each embedding vector hash key-value pair includes the locality-sensitive hash value of the embedding vector and the embedding vector itself.
In the embodiment of the present invention, before querying the pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain the list of embedding vector hash key-value pairs matched with the data to be matched, the method further includes establishing index data in the embedding vector database. In the embodiment of the invention, three functionally distinct kinds of index data are established in the embedding vector database: neighbor set index data, basic index data and embedding vector index data. Specifically, neighbor set index data is established in the embedding vector database; the neighbor set index data comprises a plurality of neighbor sets stored according to the keys of the embedding vector hash key-value pairs, and the neighbor sets comprise similar embedding vector hash key-value pairs and embedding vector hash key-value pair information. Basic index data is established in the embedding vector database; the basic index data comprises a plurality of topic sets stored according to topic, and the topic sets comprise a plurality of embedding vector hash key-value pairs with the same topic and embedding vector hash key-value pair information. Embedding vector index data is established in the embedding vector database; the embedding vector index data comprises a plurality of embedding vector sets stored according to the topic and the user identifier or the content identifier.
In terms of storage form, the embedding vector database stores data as embedding index data, whose storage form comprises a row key column and a data storage column cluster. The row key column comprises one or more of the key of the embedding vector hash key-value pair, the topic, and the topic together with a user identifier or content identifier; the data storage column cluster comprises the set of embedding vector hash key-value pairs corresponding to the row key column. This step is implemented on a preset data recommendation device that comprises the embedding vector database. As shown in fig. 2, in the embodiment of the present invention the data recommendation device is constructed in advance; it can construct a data index according to the topic, according to the locality-sensitive hash value, and according to the topic and a user identifier or content identifier. Specifically, the data recommendation device comprises a feature engineering device, an index builder, an embedding vector database and a neighbor recommender. The feature engineering device performs embedding vector processing on data acquired in real time or periodically to obtain the corresponding embedding vectors, generates the corresponding locality-sensitive hash values through the locality-sensitive hash generation algorithm described above, and constructs embedding vector hash key-value pairs from the locality-sensitive hash values and the corresponding embedding vectors. Specifically, the locality-sensitive hash value is used as the key and the embedding vector is used as the value. The value of the embedding vector hash key-value pair may further include, in addition to the embedding vector, the ID value of the corresponding object, such as a user ID, a user group ID, a content ID, an identification ID, and the like.
After the embedding vector hash key-value pairs of the data are obtained, they are stored in the embedding vector database as index data. Specifically, each piece of index data stored in the embedding vector database consists of two parts: a row key column [ROW_KEY] and a data storage column cluster organized in units of column clusters. The row key column [ROW_KEY] has the structure [Row_Hash_Key]|[Dim_Type][Key_Name][DateTime], where [Row_Hash_Key] is a hash prefix used to prevent data hot spots. It is calculated by hash-coding the value of "[Dim_Type][Key_Name][DateTime]", taking the result modulo the number of pre-partitions, and then taking the absolute value of that result. The data storage column cluster stores the embedding vector hash key-value pairs corresponding to the row key column [ROW_KEY] together with related data information. The specific construction of the neighbor set index data, the base index data and the embedding vector index data is described below in terms of storage form.
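The row key calculation described above can be sketched in Python as follows. The source does not say which hash function is used, so CRC32 is an assumption, as are the function names, the "|" delimiter between fields and the default of 16 pre-partitions.

```python
import zlib


def row_hash_key(dim_type, key_name, date_time, num_partitions):
    """Hash the concatenated fields, take the result modulo the number of
    pre-partitions, then take the absolute value."""
    payload = f"{dim_type}{key_name}{date_time}".encode("utf-8")
    code = zlib.crc32(payload)  # assumed hash function; unsigned in Python 3
    # abs() is a no-op here but mirrors the described steps, which matter
    # in languages where the hash or the modulo can be negative.
    return abs(code % num_partitions)


def row_key(dim_type, key_name, date_time, num_partitions=16):
    """Assemble a row key from the hash prefix and the three fields."""
    prefix = row_hash_key(dim_type, key_name, date_time, num_partitions)
    return f"{prefix}|{dim_type}|{key_name}|{date_time}"


example = row_key("T01", "1011001110001111", "20210924")
```

Because the prefix depends only on the remaining fields, the same logical key always lands in the same partition, while different keys spread across partitions instead of piling up on one.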
Establishing neighbor set index data in the embedding vector database: as shown in fig. 3, the structure of the neighbor set index data includes a first row key column and a first data storage column cluster. The first row key column includes a plurality of pieces of first row key information; the first data storage column cluster includes at least one neighbor set corresponding to the first row key information, and each neighbor set stores embedding vector hash key-value pairs and information about those pairs. The first row key information is generated according to the keyword key of the embedding vector hash key-value pair. That is, the structure of the first row key column [ROW_KEY] is as described above, where the code in [Dim_Type] distinguishes whether the data column corresponding to the current first row key value [ROW_KEY] is a set index position or a set itself, and [Key_Name] is the key of an embedding vector hash key-value pair. The structure of the first data storage column cluster is [NI][NV][NT][NS1][NS2][...][NSi], where [NI] is the position (index) of the [NSi] set corresponding to [ROW_KEY], [NV] is the most recently inserted embedding vector hash key-value pair, [NT] is the total number of set columns ([NSi]) under the current [ROW_KEY], and [NSi] is a set of similar vectors. The elements of a similar-vector set are a plurality of similar embedding vector hash key-value pairs, and a timestamp is added to the value of each pair.
Establishing base index data in the embedding vector database: the base index data is the set of embedding vector hash key-value pairs corresponding to each index under a given topic. As shown in fig. 4, the structure of the base index data includes a second row key column [ROW_KEY] and a second data storage column cluster, where the second row key column includes a plurality of pieces of second row key information, and the second data storage column cluster includes the embedding vector hash key-value pairs corresponding to the second row key information; the second row key information is generated according to a topic. Specifically, the structure of the second row key column [ROW_KEY] is as described above, where [Dim_Type] is a dimension or classification code that is used in the base index data to distinguish different data topics, and the topics may be set according to the specific scenario; [Key_Name] is an index code, which in the base index data is either an n-bit binary string of 0s and 1s or the code corresponding to a given index, and the index code can be used for exact deduplication calculation of that index; [DateTime] is the date and time code. The structure of the second data storage column cluster is [QV][ST][SD][S1][S2][...][Si], where [QV] is the total number of sets [Si] under the current [ROW_KEY], [ST] is a queue of object values pre-stored under the current [ROW_KEY], [SD] is the key-based deduplicated count of the elements stored across all sets [Si], and [Si] is a concrete value storage set whose stored elements form an ordered-map data structure holding object data in key-value form, namely the embedding vector hash key-value pairs. It should be noted that in the embodiment of the present invention a plurality of indexes are provided under a given topic, and each index may have different dimensions.
The state of the indexes under a topic can be seen by comparing the dimensions of the same index. An index here means an indicator: a unit or method for measuring the degree of development of a thing, also called a measure. Examples are population, GDP, revenue, number of users, profit margin, retention and coverage. A dimension is some feature of a thing or phenomenon, such as gender, region or time.
Establishing embedding vector index data in the embedding vector database: the embedding vector index data is a direct-mapping index of embeddings, and it comprises a third row key column and a third data storage column cluster. As shown in fig. 5, the third row key column includes a plurality of pieces of third row key information, the third data storage column cluster includes the embedding vectors corresponding to the third row key information, and the third row key information is generated according to a topic and a user identifier or content identifier. Specifically, the structure of the third row key column [ROW_KEY] is the same as described above, where [Key_Name] is a user ID or user group ID, or a content ID or identification ID, classified according to [Dim_Type]. The structure of the third data storage column cluster is [EV][VERSION], where [EV] is the corresponding concrete embedding vector value and [VERSION] is the version number of [EV].
After the neighbor set index data, the base index data and the embedding vector index data are constructed, indexes are constructed for the embedding vector index data in the embedding vector database by the index builder. Specifically, an index structure is constructed for the neighbor set index data, and this index structure includes a primary index, a primary index object, a secondary index and a secondary index object. The primary index consists of index truncations (segments) of the keyword key of an embedding vector hash key-value pair, and the primary index object is the locality-sensitive hash values of the embedding vector hash key-value pairs that contain that truncation; the secondary index is the locality-sensitive hash value in the embedding vector hash key-value pair, and its index object is the embedding vector hash key-value pair itself. Specifically, the key of the embedding vector hash key-value pair (that is, the N-bit binary locality-sensitive hash value) is cut into segments in order from the low bits to the high bits, and the segments serve as the primary index. The object of this index is a secondary index made up of the complete hash values that contain the truncation. The key of the secondary index is the complete locality-sensitive hash value and the secondary index object is the concrete embedding vector hash key-value pair, which avoids bit-by-bit comparison during the subsequent similarity calculation. The index structure corresponding to the base index data is consistent with that of the neighbor set index data and is not described again here.
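A minimal in-memory sketch of this two-level structure is given below, assuming a 64-bit hash cut into 16-bit segments. The class and function names are hypothetical, and keying the primary index by (segment position, segment value) is an assumption; the source only states that the segments serve as the primary index.

```python
from collections import defaultdict

SEGMENT_BITS = 16  # assumed segment width


def truncate(hash_value, n_bits=64, segment_bits=SEGMENT_BITS):
    """Cut the hash into segments from the low bits to the high bits and
    return (position, segment) pairs."""
    mask = (1 << segment_bits) - 1
    return [
        (pos, (hash_value >> (pos * segment_bits)) & mask)
        for pos in range(n_bits // segment_bits)
    ]


class TwoLevelIndex:
    def __init__(self):
        # primary index: (position, segment) -> secondary index
        # secondary index: complete hash value -> key-value pair payload
        self.primary = defaultdict(dict)

    def add(self, hash_value, value):
        """Register the pair under every segment of its complete hash."""
        for seg in truncate(hash_value):
            self.primary[seg][hash_value] = value

    def candidates(self, hash_value):
        """Collect all pairs sharing at least one segment with the query."""
        found = {}
        for seg in truncate(hash_value):
            found.update(self.primary.get(seg, {}))
        return found


index = TwoLevelIndex()
index.add(0x0001000200030004, "a")
index.add(0xFFFF0002FFFFFFFF, "b")  # shares only the third segment with "a"
index.add(0xAAAABBBBCCCCDDDD, "c")  # shares no segment with "a"
```

Only the candidates returned by the segment lookup then need a full similarity comparison, which is the point of the truncation step.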
After the index is constructed, neighbor search is performed through the index. Specifically, the list of embedding vector hash key-value pairs matching the data to be matched is obtained as follows: the target locality-sensitive hash value in the target embedding vector hash key-value pair is looked up in the neighbor set index data to determine the target neighbor set that matches it; the similarity between the target locality-sensitive hash value and the locality-sensitive hash value of each embedding vector hash key-value pair in the target neighbor set is calculated; and the list of embedding vector hash key-value pairs is determined according to the similarity.
Because the first row key information of the neighbor set index data is generated from the key of the embedding vector hash key-value pair, the target locality-sensitive hash value in the target embedding vector hash key-value pair of the data to be matched is determined first, and the first row key column [ROW_KEY] in row key format that corresponds to it in the neighbor set index data is obtained. The index structure is then searched according to this [ROW_KEY]: the [NI] value in the data storage column cluster is read to obtain the position of the neighbor set, and the set [NSi] is fetched from the memory cache at that position. Whether the set [NSi] exists is judged directly from the key of the embedding vector hash key-value pair. If it exists, the [NV] value is compared with the current embedding vector hash key-value pair. If they are consistent, the values of the first N elements of the corresponding set [NSi] are returned directly as the recommended list of embedding vector hash key-value pairs. If they are not consistent, the Hamming distance between the key of the current embedding vector hash key-value pair and the key of each element in the [NSi] set is calculated, the elements are sorted by Hamming distance from small to large, and the values of the first N specified elements are returned as the recommended list of embedding vector hash key-value pairs.
During a query, the data is first fetched from the memory cache; if no corresponding value is found there, it is fetched from the neighbor set index of the underlying index data structure. If the key of the current embedding vector hash key-value pair does not exist in [NSi], similarity calculation and neighbor aggregation are performed to obtain and return the sorted set of the specified top N, which is synchronously written back to the corresponding [NSi] set in the neighbor set index and loaded into memory. If no neighbor value can be obtained even then, the hot content list is returned by default.
For index truncation, the locality-sensitive hash value to be matched is cut bit-wise into segments (for example, 16 bits per segment), a chain or set is located step by step through the primary index, and the hash value to be matched is compared with the hash values in that set. Assuming the hash value is cut into 4 segments of 16 bits each, 4 corresponding ROW_KEYs are generated and used to traverse the neighbor set index data; the data set column [Si] corresponding to each ROW_KEY is found in the base index through the third-level index according to the key of the embedding vector hash key-value pair, and whether the key exists in the set is judged directly; if it exists, similarity is calculated in turn against the keys of the elements in the set. This improves efficiency. For example, suppose there are 2^32 pieces of original data to be matched, and each hash value is 64 bits long and is cut every 16 bits. Each 16-bit segment has 2^16 possible combinations, so each truncation index yields about 2^(32-16) candidates; with 4 truncation indexes the total number of candidates is 4 × 2^(32-16), which is far fewer than the original 2^32 comparisons, so preliminary retrieval and construction become more efficient.
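The Hamming-distance ranking and the candidate-count arithmetic can be illustrated with the short Python sketch below. The function names are hypothetical, the neighbor set is modeled as a plain dictionary from key to value, and the candidate estimate assumes that hash values are spread uniformly over the segment combinations.

```python
def hamming(a, b):
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def top_n(target_key, neighbor_set, n):
    """Sort the elements of a neighbor set by Hamming distance to the
    target key (small to large) and return the first n values."""
    ranked = sorted(
        neighbor_set.items(), key=lambda kv: hamming(target_key, kv[0])
    )
    return [value for _, value in ranked[:n]]


def candidates_with_truncation(log2_items, segment_bits, segments):
    """Estimated comparisons when each segment lookup narrows 2**log2_items
    items to 2**(log2_items - segment_bits) candidates (uniformity assumed)."""
    return segments * 2 ** (log2_items - segment_bits)


neighbors = {0b1111: "a", 0b1110: "b", 0b0000: "c"}
nearest = top_n(0b1111, neighbors, 2)
estimate = candidates_with_truncation(32, 16, 4)  # versus 2**32 without the index
```

With the figures from the example in the text, the estimate is 4 × 2^16 = 262,144 comparisons instead of 2^32 = 4,294,967,296.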
The embodiment of the invention further includes data deduplication, which specifically comprises the following steps: acquiring user behavior data to be transmitted; determining one or more of the keyword key of the embedding vector hash key-value pair, the topic, and the topic together with the user identifier or content identifier of the user behavior data to be transmitted; searching the embedding vector database for a matching target row key column according to the determined one or more items, the target row key column being any one or more of the first row key column, the second row key column and the third row key column; and determining, according to the target row key column and the corresponding target data storage column cluster, whether the user behavior data to be transmitted is duplicate data. The target row key column stores one or more of keyword keys, topics, and user identifiers or content identifiers, and the target data storage column cluster stores the embedding vector hash key-value pairs or embedding vectors corresponding to the target row key column. When the user behavior data to be transmitted is duplicate data, it is deleted; when it is not duplicate data, it is added to the embedding vector database.
Constructing only the neighbor set index data, the base index data and the embedding vector index data as described above makes it possible to compare and calculate whether a given hash value has a similar object in the library to be matched, but this is not sufficient for efficient neighbor recommendation and exact deduplication, and retrieval of the objects within each truncation index (primary index) is still not efficient. The object elements of the primary index (the secondary index structures) therefore need to be globally sorted and moved by the key of the secondary index through a special split sort, so that the chain heads of the sets are ordered and the chain-head elements serve as a third-level index; a chain (set) in the serial chain can then be located with a time complexity of at most O(log N). The global sorting and moving is a real-time, dynamic splitting process, and the split sets logically form a serial chain structure connected end to end. Therefore, when the index data is established, the embodiment of the present invention further sorts the index data through the index builder, which includes the following steps:
sorting the embedding vector hash key-value pairs in the set of embedding vector hash key-value pairs; the sorting may be performed according to the size of the embedding vector hash key, and specifically, taking the base index data as an example:
sorting all embedding vector hash key-value pairs according to the size of the locality-sensitive hash value;
storing the sorted embedding vector hash key-value pairs in order in a plurality of topic sets according to a preset storage threshold; the preset storage threshold is the maximum number of embedding vector hash key-value pairs stored in each topic set.
According to the embodiment of the invention, the neighbor set index data, the base index data and the embedding vector index data can each be sorted. For example, sorting the base index data proceeds as follows. Split sorting is performed as data is written, and the object values to be written are fetched concurrently from the write queues of the different ROW_KEYs, namely the values in [QV]. For a given [ROW_KEY], if the initial data set [Si] is empty, the value is written directly into the first set [S1]. If [S1] already holds values, values are written in ascending order of the keyword key of the embedding vector hash key-value pair. If the number of elements in the [S1] set reaches the preset threshold, a second set [S2] is split off and the embedding vector hash key-value pair at the tail of [S1], that is, the pair with the largest keyword key, is moved into [S2]; if [S2] is full, its tail element is likewise moved into the next set [S3], and so on. This ensures that the elements of the sets [Si] (i = 1, 2, …) are sorted in ascending order while the data keeps moving and changing. Logically the sets form a serial chain structure connected end to end: within each chain (set) the key of the first element is the smallest and the key of the last element is the largest, and the chain heads of the sets are ordered. A subsequent write first compares the key of the first element of every [Si] set with the key to be inserted, and finds the insertion position by using the chain-head elements (that is, the first elements of the sets [Si] under [QV] for a given [ROW_KEY]) as a third-level index, so that a chain (set) in the serial chain can be located with a time complexity of at most O(log N).
The element is then inserted in order into the corresponding set and data is moved if needed; if the key of the element to be inserted matches an existing key in the set [Si], the value of the element with that key is updated. After the sorted write is finished, the sorting module atomically writes the current number of sets into the [ST] column and atomically increments the [SD] value by one, which serves as the deduplicated count of the elements in the current sets.
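The split-sort write path can be sketched in Python as follows. This is a simplified single-threaded illustration under stated assumptions: the class name SplitSortedChain and its attributes are hypothetical, the sets are ordinary in-memory lists, and the list of chain heads is rebuilt on every insert for brevity (an implementation aiming at the stated O(log N) lookup would maintain it incrementally).

```python
import bisect


class SplitSortedChain:
    """Serial chain of sorted sets; the head key of each set acts as a
    third-level index for locating the set that should hold a key."""

    def __init__(self, threshold):
        self.threshold = threshold  # maximum elements per set
        self.sets = []              # each set is a sorted list of (key, value)

    def _locate(self, key):
        # Binary search over the chain heads; rebuilt here for simplicity.
        heads = [s[0][0] for s in self.sets]
        return max(bisect.bisect_right(heads, key) - 1, 0)

    def insert(self, key, value):
        if not self.sets:
            self.sets.append([(key, value)])
            return
        i = self._locate(key)
        current = self.sets[i]
        j = bisect.bisect_left([k for k, _ in current], key)
        if j < len(current) and current[j][0] == key:
            current[j] = (key, value)  # existing key: update the value
            return
        current.insert(j, (key, value))
        # On overflow, move the tail (largest key) to the head of the next
        # set, cascading down the chain as long as sets are over the limit.
        while len(self.sets[i]) > self.threshold:
            tail = self.sets[i].pop()
            if i + 1 == len(self.sets):
                self.sets.append([])
            self.sets[i + 1].insert(0, tail)
            i += 1


chain = SplitSortedChain(threshold=3)
for k in [5, 1, 9, 3, 7, 2, 8]:
    chain.insert(k, f"v{k}")
```

Moving only the tail element keeps every set sorted and keeps all keys in one set smaller than all keys in the next, so the chain heads remain ordered without a full re-sort.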
For exact deduplication of index data, similarity calculation and comparison of locality-sensitive hashes are not involved, so the index construction process does not need to build locality-sensitive hash indexes and can be more efficient. That is, for the base index data, the key of the primary index is the index code, the key of the secondary index is the identifier (such as a user ID) of the index data, and the index object is the data corresponding to that identifier. The data goes through the split-sort process and is added to a chain. When data is added, the primary, secondary and third-level indexes are used to look up directly by key whether the entry already exists. If it exists, the value is discarded or updated. If it does not exist, the data element is added and the dynamic deduplicated value is incremented as the global index deduplicated value at the current time node; for different time ranges, the data element and the dynamic deduplicated value are also accumulated into the deduplicated value of the corresponding range according to the data timestamp and the data position classification information. This completes the incremental calculation of exact deduplication, and the process is executed iteratively and dynamically as index data keeps growing. Sorting in advance thus enables real-time, dynamic exact deduplication and updating of the deduplicated values.
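The incremental nature of this deduplication can be illustrated with the sketch below. It is a simplification under stated assumptions: plain dictionaries stand in for the split-sorted chains, a date string stands in for the time range, and the class name DedupCounter and its methods are hypothetical.

```python
from collections import defaultdict


class DedupCounter:
    """Maintain exact distinct counts at write time so that a query is a
    single lookup with no recomputation."""

    def __init__(self):
        self._seen = defaultdict(dict)  # scope -> {identifier: detail}
        self._count = defaultdict(int)  # scope -> distinct count

    def add(self, index_code, identifier, date, detail):
        # Update both the global scope and the per-date range scope.
        for scope in ((index_code,), (index_code, date)):
            if identifier not in self._seen[scope]:
                self._count[scope] += 1          # new element: increment
            self._seen[scope][identifier] = detail  # existing: update value

    def distinct(self, index_code, date=None):
        scope = (index_code,) if date is None else (index_code, date)
        return self._count.get(scope, 0)


counter = DedupCounter()
counter.add("active_users", "u1", "2021-09-24", {"clicks": 1})
counter.add("active_users", "u1", "2021-09-24", {"clicks": 2})  # duplicate
counter.add("active_users", "u2", "2021-09-24", {"clicks": 1})
counter.add("active_users", "u1", "2021-09-25", {"clicks": 5})
```

Each write costs one membership check per scope, and reading a deduplicated value never scans the stored data, which matches the incremental accumulation the text describes.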
With reference to fig. 2 and fig. 6, in an embodiment of the present invention, the pre-constructed embedding vector database may update the neighbor set index data (neighbor recommendation index construction) and the base index data (exact deduplication index construction) from original data that is streamed or batch-processed in real time or at a preset period. It is therefore necessary to determine, at the time of incremental storage, whether data is duplicate index data, and to be able to update it quickly in the neighbor set index.
Specifically, as shown in fig. 6, for neighbor recommendation index construction: after the streamed or batch-processed original data is obtained, the neighbor recommendation index construction module obtains the embedding vector hash key-value pair corresponding to the original data and inputs it into the base index module. The key of the pair is segmented in the truncation manner described above to serve as the primary index; the index object of the primary index is a secondary index structure formed by the complete locality-sensitive hash values that contain the truncation, the key of the secondary index structure is the complete locality-sensitive hash value, and its index object is the concrete embedding vector hash key-value pair. The data is then passed to the sorting module, which globally sorts and moves the object elements of the primary index (the secondary index structures) by the key of the secondary index through the special split sort. The global sorting and moving is a continuous, real-time dynamic splitting process; the split sets logically form a serial chain structure connected end to end, the chain heads of the sets are ordered, and the chain-head elements serve as the third-level index. A chain in the serial chain, that is, a neighbor set, can therefore be located from the input embedding vector hash key-value pair. After the data is updated, the neighbor index module is triggered to perform similarity calculation and neighbor aggregation, which yields the neighbor data of the embedding vector hash key-value pair corresponding to the original data; the neighbor calculation result is deduplicated and then written to the corresponding neighbor set in the neighbor set index, which constructs or updates the neighbor set index data.
For base index data construction (exact deduplication index construction): after the original data is processed, exact deduplication index key-value pair data is obtained, in which the key is the identifier and the value is the detail data. The data is classified by index and fed into the base index module. Because the second row key value ROW_KEY in the base index data serves as the primary index, its index object is a secondary index structure whose key is the identifier and whose value is the corresponding detail data; after retrieval by the sorting module, it can be determined whether the exact deduplication index key-value pair data is duplicate data. In the sorting module, the object elements of the primary index (the secondary index structures) are globally sorted and moved by the key of the secondary index. The global sorting and moving is a continuous, real-time dynamic splitting process; the split sets logically form a serial chain structure connected end to end, the chain heads of the sets are ordered, and the chain-head elements serve as the third-level index, so whether data duplicating the exact deduplication index key-value pair data exists can be determined quickly through the third-level index. If it exists, the exact deduplication index key-value pair data is deleted; if not, the corresponding increment is accumulated to the corresponding set. The process combines the construction of multi-level/classified indexes so that the global data keeps moving dynamically and regularly; the incremental calculation of exact deduplication is thus completed at the same time as the data is stored, and the process is incremental accumulation rather than a full calculation each time. In this way, fast and exact deduplication during mass data storage can be achieved. Through the above steps, one iteration of data synchronization and index construction is completed, and the index data in the pre-constructed embedding vector database is continuously updated from original data streamed or batch-processed in real time or at the preset period.
Step 150: generating a recommendation result according to the embedding vector hash key-value pair list.
After the list of embedding vector hash key-value pairs is obtained, the target data list corresponding to those pairs is obtained and recommended to the user.
As shown in fig. 7, in the embodiment of the present invention, neighbor recommendation and exact deduplication of massive data can be achieved in the above manner. For neighbor search, once the neighbor set index data is constructed, the neighbor set of a given embedding vector can be retrieved directly through the neighbor search module according to the neighbor search method and output as the recall/recommendation list. For exact deduplication, a deduplication query can directly look up the dynamically updated global index deduplicated value or range index deduplicated value through the index rules and obtain the exact deduplicated value of the current index data in real time; no calculation takes place at query time and the cost is independent of data volume, so the exact deduplicated value of massive data is obtained with O(1) complexity.
The embodiment of the invention preserves locality sensitivity by mapping the embedding vector to a locality-sensitive hash value, establishes a deep conversion between and combination of embedding and locality-sensitive hashing, is not limited to static text or one-dimensional spaces, and can further reduce the dimensionality of representations of high-dimensional objects or things. After a query request is obtained, the data to be matched in the query request is processed into a target embedding vector, a target locality-sensitive hash value is generated from the target embedding vector, and the pre-constructed embedding vector database is queried according to the target locality-sensitive hash value to obtain a list of embedding vector hash key-value pairs matching the data to be matched. The pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs, the target embedding vector hash key-value pair is any one of the plurality of embedding vector hash key-value pairs, and each embedding vector hash key-value pair comprises the locality-sensitive hash value of an embedding vector and the embedding vector itself. Finally, a recommendation result is generated according to the list of embedding vector hash key-value pairs, which effectively improves the efficiency of recommending neighbor data.
Furthermore, by establishing index data and performing the special split sort, the global data is moved dynamically and regularly in combination with the construction of the multi-level index when data to be deduplicated is stored, so the incremental calculation of exact deduplication is completed at the same time as the data is stored. The process is incremental accumulation rather than a full calculation each time, so index construction is efficient and the exact deduplicated value can be queried in real time; no calculation takes place at query time and the cost is independent of data volume, which further improves data processing efficiency.
Fig. 8 is a schematic structural diagram illustrating a data recommendation apparatus according to an embodiment of the present invention. As shown in fig. 8, the apparatus 300 includes:
an obtaining module 310, configured to obtain a query request, where the query request includes data to be matched;
a processing module 320, configured to process the data to be matched into a target embedding vector;
a generating module 330, configured to generate a target embedding vector hash key-value pair according to the target embedding vector;
a matching module 340, configured to query a pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matching the data to be matched, where the pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs, and each embedding vector hash key-value pair comprises the locality-sensitive hash value of an embedding vector and the embedding vector; and
a recommending module 350, configured to generate a recommendation result according to the list of embedding vector hash key-value pairs.
The work flow of the data recommendation device 300 of the embodiment of the present invention is substantially the same as the specific implementation steps of the data recommendation method, and is not described herein again.
The embodiment of the invention preserves locality sensitivity by mapping the embedding vector to a locality-sensitive hash value, establishes a deep conversion between and combination of embedding and locality-sensitive hashing, is not limited to static text or one-dimensional spaces, and can further reduce the dimensionality of representations of high-dimensional objects or things. After a query request is obtained, the data to be matched in the query request is processed into a target embedding vector, a target locality-sensitive hash value is generated from the target embedding vector, and the pre-constructed embedding vector database is queried according to the target locality-sensitive hash value to obtain a list of embedding vector hash key-value pairs matching the data to be matched. The pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs, the target embedding vector hash key-value pair is any one of the plurality of embedding vector hash key-value pairs, and each embedding vector hash key-value pair comprises the locality-sensitive hash value of an embedding vector and the embedding vector itself. Finally, a recommendation result is generated according to the list of embedding vector hash key-value pairs, which effectively improves the efficiency of recommending neighbor data.
Furthermore, by establishing index data and performing the special split sort, the global data is moved dynamically and regularly in combination with the construction of the multi-level index when data to be deduplicated is stored, so the incremental calculation of exact deduplication is completed at the same time as the data is stored. The process is incremental accumulation rather than a full calculation each time, so index construction is efficient and the exact deduplicated value can be queried in real time; no calculation takes place at query time and the cost is independent of data volume, which further improves data processing efficiency.
Fig. 9 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.
As shown in fig. 9, the computing device may include: a processor 402, a communication interface 404, a memory 406, and a communication bus 408.
The processor 402, the communication interface 404 and the memory 406 communicate with each other via the communication bus 408. The communication interface 404 is used for communicating with network elements of other devices, such as clients or other servers. The processor 402 is configured to execute the program 410, and may specifically perform the relevant steps in the data recommendation method embodiments described above.
In particular, program 410 may include program code comprising computer-executable instructions.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 406 is used for storing the program 410. The memory 406 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
The program 410 may be specifically invoked by the processor 402 to cause the computing device to perform the following operations:
acquiring a query request, wherein the query request comprises data to be matched;
processing the data to be matched into a target embedding vector;
generating a target embedding vector hash key value pair according to the target embedding vector;
querying a pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matching the data to be matched; the pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs; each embedding vector hash key-value pair comprises an embedding vector and the locality-sensitive hash value of that embedding vector;
and generating a recommendation result according to the embedding vector hash key value pair list.
In an optional manner, the construction method of the embedding vector database includes: acquiring user behavior data in a streaming or batch manner; for each piece of acquired user behavior data, processing the user behavior data into an embedding vector; generating an embedding vector hash key-value pair according to the embedding vector; and storing the generated embedding vector hash key-value pair in the embedding vector database.
In an optional manner, the generating of a target embedding vector hash key-value pair according to the target embedding vector includes: generating a hash tree with the same dimension as the target embedding vector, the hash tree comprising a plurality of random vectors; calculating the included angle between the target embedding vector and each random vector in the hash tree; determining a target locality-sensitive hash value according to the included angles; and generating the target embedding vector hash key-value pair according to the target locality-sensitive hash value and the target embedding vector.
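The angle-based hashing step above can be illustrated with a minimal sketch. The source does not state how the included angles are turned into a hash value; the sketch assumes one common choice, in which each random vector contributes one bit that is set when the included angle is below 90 degrees. The function names, the bit-string encoding and the Gaussian sampling of the random vectors are all hypothetical and are not taken from the embodiment.

```python
import math
import random


def make_random_vectors(dim, num_bits, seed=0):
    """Draw num_bits random vectors of the same dimension as the embedding (assumed Gaussian)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def angle_between(u, v):
    """Included angle in radians between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return math.pi / 2  # assumption: a zero vector is treated as orthogonal
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos)


def lsh_value(embedding, random_vectors):
    """One bit per random vector: '1' if the included angle is below 90 degrees, else '0'."""
    return "".join(
        "1" if angle_between(embedding, r) < math.pi / 2 else "0"
        for r in random_vectors
    )


def make_hash_pair(embedding, random_vectors):
    """Build the (locality-sensitive hash value, embedding vector) key-value pair."""
    return (lsh_value(embedding, random_vectors), list(embedding))
```

Under this assumption, vectors pointing in similar directions agree on most bits, which is what allows the hash value to serve as a key for grouping neighbors.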
In an optional manner, the embedding vector database includes basic index data, neighbor set index data, and embedding vector index data, and the construction method of the embedding vector database comprises: establishing neighbor set index data in the embedding vector database, wherein the neighbor set index data comprises a plurality of neighbor sets stored according to the key of the embedding vector hash key-value pair, and each neighbor set comprises similar embedding vector hash key-value pairs; establishing basic index data in the embedding vector database, wherein the basic index data comprises a plurality of topic sets stored by topic, and each topic set comprises a plurality of embedding vector hash key-value pairs with the same topic; and establishing embedding vector index data in the embedding vector database, wherein the embedding vector index data comprises a plurality of embedding vector sets stored according to the topic and the user identifier or content identifier.
In an optional manner, the embedding vector database stores data using an embedding index data structure comprising row key columns and data storage column clusters. The neighbor set index data comprises a first row key column and a first data storage column cluster; the first row key column comprises a plurality of pieces of first row key information, the first data storage column cluster comprises at least one neighbor set corresponding to the first row key information, and embedding vector hash key-value pairs are stored in the neighbor set; the first row key information is generated according to the key of the embedding vector hash key-value pair. The basic index data comprises a second row key column and a second data storage column cluster; the second row key column comprises a plurality of pieces of second row key information, and the second data storage column cluster comprises the embedding vector hash key-value pairs corresponding to the second row key information; the second row key information is generated according to a topic. The embedding vector index data comprises a third row key column and a third data storage column cluster; the third row key column comprises a plurality of pieces of third row key information, the third data storage column cluster comprises the embedding vectors corresponding to the third row key information, and the third row key information is generated according to a topic and a user identifier or content identifier.
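The three row-key layouts described above can be pictured with in-memory dictionaries standing in for the row key columns and data storage column clusters. This is only a sketch: the class name, the `topic:id` row key format and the use of a fixed-length prefix of the hash key as the first row key are assumptions made for illustration, since the source says only that the first row key information is generated from the key.

```python
from collections import defaultdict


class EmbeddingIndexStore:
    """Toy stand-in for the three kinds of index data (hypothetical layout)."""

    def __init__(self, prefix_len=4):
        self.prefix_len = prefix_len
        # First row key (derived from the hash key) -> neighbor set of pairs.
        self.neighbor_index = defaultdict(list)
        # Second row key (topic) -> pairs sharing that topic.
        self.base_index = defaultdict(list)
        # Third row key (topic plus user or content identifier) -> embedding vector.
        self.vector_index = {}

    def put(self, topic, entity_id, lsh_value, embedding):
        """Write one embedding vector hash key-value pair into all three indexes."""
        pair = (lsh_value, tuple(embedding))
        self.neighbor_index[lsh_value[: self.prefix_len]].append(pair)
        self.base_index[topic].append(pair)
        self.vector_index[f"{topic}:{entity_id}"] = tuple(embedding)
        return pair
```

A call such as `store.put("music", "user42", "10110011", [0.1, 0.2])` then makes the same pair reachable by hash prefix, by topic, and by topic plus identifier.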
In an optional manner, the method further comprises: acquiring user behavior data to be transmitted; determining one or more of the following for the user behavior data to be transmitted: the key of its embedding vector hash key-value pair, its topic, and its topic together with a user identifier or content identifier; searching the embedding vector database for a matching target row key column according to the determined one or more items, wherein the target row key column is any one or more of the first row key column, the second row key column and the third row key column; determining, according to the target row key column and the corresponding target data storage column cluster, whether the user behavior data to be transmitted is duplicate data, wherein the target row key column stores one or more of keys, topics, and topics with user identifiers or content identifiers, and the target data storage column cluster stores the embedding vectors corresponding to the target row key column; when the user behavior data to be transmitted is duplicate data, deleting the user behavior data to be transmitted; and when the user behavior data to be transmitted is not duplicate data, adding the user behavior data to be transmitted to the embedding vector database.
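The duplicate check at write time can be sketched as follows, using only the third kind of row key (topic plus identifier). The function name, the `topic:id` key format and the rule that a record counts as a duplicate when the stored embedding vector is identical are assumptions for illustration; the source does not define the comparison.

```python
def ingest(vector_index, topic, entity_id, embedding):
    """Add a record unless an identical one is already stored.

    Returns True if the record was added, False if it was dropped as a duplicate.
    """
    row_key = f"{topic}:{entity_id}"  # hypothetical row key format
    candidate = tuple(embedding)
    if vector_index.get(row_key) == candidate:
        return False  # duplicate data: discard the incoming record
    vector_index[row_key] = candidate  # new or changed data: store it
    return True
```

Because the check is a single keyed lookup performed during the write, no separate deduplication pass over the stored data is needed in this sketch.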
In an optional manner, after the establishing of neighbor set index data in the embedding vector database, the method includes: constructing an index structure according to the neighbor set index data, the index structure comprising a primary index, a primary index object, a secondary index and a secondary index object; the primary index comprises a truncated index of the key of the embedding vector hash key-value pair, and the primary index object is the locality-sensitive hash value in each embedding vector hash key-value pair containing that truncated index; the secondary index is the locality-sensitive hash value in the embedding vector hash key-value pair, and the secondary index object is the embedding vector hash key-value pair.
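A minimal sketch of the two-level index follows. It assumes that the truncated index is a fixed-length prefix of the hash key; the prefix length and both function names are hypothetical.

```python
def build_two_level_index(pairs, prefix_len=4):
    """Build (primary, secondary) indexes from (lsh_value, embedding) pairs.

    primary:   truncated key -> full locality-sensitive hash values sharing it
    secondary: full locality-sensitive hash value -> key-value pairs
    """
    primary = {}
    secondary = {}
    for lsh, embedding in pairs:
        secondary.setdefault(lsh, []).append((lsh, embedding))
        bucket = primary.setdefault(lsh[:prefix_len], [])
        if lsh not in bucket:
            bucket.append(lsh)
    return primary, secondary


def lookup(primary, secondary, lsh, prefix_len=4):
    """Return every stored pair whose hash value shares the query's truncated key."""
    result = []
    for full_hash in primary.get(lsh[:prefix_len], []):
        result.extend(secondary[full_hash])
    return result
```

The primary level narrows the search to hash values with the same prefix, and the secondary level resolves each of those hash values to its stored pairs.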
In an optional manner, the establishing of basic index data in the embedding vector database includes: sorting all embedding vector hash key-value pairs by the magnitude of the locality-sensitive hash value; and storing the sorted embedding vector hash key-value pairs sequentially in a plurality of topic sets according to a preset storage threshold, wherein the preset storage threshold is a threshold on the number of embedding vector hash key-value pairs stored in each topic set.
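The sort-then-split step can be sketched as below, assuming the locality-sensitive hash value is a bit string that can be compared as a binary integer. The function name and that encoding are assumptions, not part of the source.

```python
def build_topic_sets(pairs, storage_threshold):
    """Sort pairs by hash value and split them into sets of at most storage_threshold items."""
    if storage_threshold <= 0:
        raise ValueError("storage_threshold must be positive")
    # Assumption: hash values are bit strings, ordered by their integer value.
    ordered = sorted(pairs, key=lambda pair: int(pair[0], 2))
    return [
        ordered[i : i + storage_threshold]
        for i in range(0, len(ordered), storage_threshold)
    ]
```

Since the pairs are sorted before being split, each resulting set holds a contiguous range of hash values.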
In an optional manner, the querying of the pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matching the data to be matched includes: looking up the target locality-sensitive hash value of the target embedding vector hash key-value pair in the neighbor set index data, and determining a target neighbor set matching the target locality-sensitive hash value; calculating the similarity between the target locality-sensitive hash value and the locality-sensitive hash value in each embedding vector hash key-value pair in the target neighbor set; and determining the list of embedding vector hash key-value pairs according to the similarity.
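The query step can be sketched as follows. The source does not name the similarity measure; this sketch assumes the fraction of matching bits between two equal-length bit-string hash values (one minus the normalized Hamming distance). The function names, the prefix-based neighbor set lookup and the `top_k` parameter are hypothetical.

```python
def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length bit strings agree."""
    if len(a) != len(b):
        raise ValueError("hash values must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def query(neighbor_index, target_lsh, prefix_len=4, top_k=3):
    """Find the target neighbor set, score its pairs, and return the top_k most similar."""
    candidates = neighbor_index.get(target_lsh[:prefix_len], [])
    scored = [
        (hamming_similarity(target_lsh, lsh), (lsh, embedding))
        for lsh, embedding in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:top_k]]
```

Only the hash values in one neighbor set are compared, so the cost of a query depends on the size of that set rather than on the size of the whole database.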
An embodiment of the present invention provides a computer-readable storage medium, where the storage medium stores at least one executable instruction, and when the executable instruction is executed on a computing device, the computing device is caused to execute a data recommendation method in any method embodiment described above.
The executable instructions may be specifically configured to cause the computing device to perform the same operations as those listed above for the program 410, including each of the optional manners described there.
The embodiment of the invention provides a data recommendation device, which is used for executing the data recommendation method.
Embodiments of the present invention provide a computer program, where the computer program can be called by a processor to enable a computing device to execute a data recommendation method in any of the above method embodiments.
Embodiments of the present invention provide a computer program product, which includes a computer program stored on a computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are run on a computer, the computer is caused to execute the data recommendation method in any of the above-mentioned method embodiments.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims (12)

1. A method for recommending data, the method comprising:
acquiring a query request, wherein the query request comprises data to be matched;
processing the data to be matched into a target embedding vector;
generating a target embedding vector hash key value pair according to the target embedding vector;
querying a pre-constructed embedding vector database according to the target embedding vector hash key-value pair to obtain a list of embedding vector hash key-value pairs matching the data to be matched; the pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key-value pairs, and each embedding vector hash key-value pair comprises an embedding vector and the locality-sensitive hash value of that embedding vector;
and generating a recommendation result according to the embedding vector hash key value pair list.
2. The method of claim 1, wherein the construction method of the embedding vector database comprises:
acquiring user behavior data in a streaming or batch manner;
for each piece of acquired user behavior data, processing the user behavior data into an embedding vector;
generating an embedding vector hash key-value pair according to the embedding vector;
and storing the generated embedding vector hash key value pair into an embedding vector database.
3. The method of claim 1, wherein generating a target embedding vector hash key-value pair from the target embedding vector comprises:
generating a hash tree with the same dimension as the target embedding vector; the hash tree comprises a plurality of random vectors;
respectively calculating included angles between the target embedding vector and each random vector in the hash tree;
determining a target locality sensitive hash value according to the included angle;
and generating the target embedding vector hash key value pair according to the target locality sensitive hash value and the target embedding vector.
4. The method according to claim 2, wherein the embedding vector database includes basic index data, neighbor set index data, and embedding vector index data; the construction method of the embedding vector database comprises: establishing neighbor set index data in the embedding vector database; the neighbor set index data comprises a plurality of neighbor sets stored according to the key of the embedding vector hash key-value pair, and each neighbor set comprises similar embedding vector hash key-value pairs;
establishing basic index data in the embedding vector database; the basic index data comprises a plurality of topic sets stored by topic, and each topic set comprises a plurality of embedding vector hash key-value pairs with the same topic;
establishing embedding vector index data in the embedding vector database; the embedding vector index data comprises a plurality of embedding vector sets stored according to the topic and the user identifier or content identifier.
5. The method of claim 4, wherein the embedding vector database stores data using an embedding index data structure, the embedding index data structure comprising row key columns and data storage column clusters; the neighbor set index data comprises a first row key column and a first data storage column cluster, the first row key column comprises a plurality of pieces of first row key information, the first data storage column cluster comprises at least one neighbor set corresponding to the first row key information, and embedding vector hash key-value pairs are stored in the neighbor set; the first row key information is generated according to the key of the embedding vector hash key-value pair;
the basic index data comprises a second row key column and a second data storage column cluster, the second row key column comprises a plurality of pieces of second row key information, and the second data storage column cluster comprises the embedding vector hash key-value pairs corresponding to the second row key information; the second row key information is generated according to a topic;
the embedding vector index data comprises a third row key column and a third data storage column cluster, the third row key column comprises a plurality of pieces of third row key information, the third data storage column cluster comprises the embedding vectors corresponding to the third row key information, and the third row key information is generated according to a topic and a user identifier or content identifier.
6. The method of claim 5, further comprising:
acquiring user behavior data to be transmitted;
determining one or more of the following for the user behavior data to be transmitted: the key of its embedding vector hash key-value pair, its topic, and its topic together with a user identifier or content identifier;
searching the embedding vector database for a matching target row key column according to the determined one or more items; the target row key column is any one or more of the first row key column, the second row key column and the third row key column;
determining, according to the target row key column and the corresponding target data storage column cluster, whether the user behavior data to be transmitted is duplicate data; the target row key column stores one or more of keys, topics, and topics with user identifiers or content identifiers, and the target data storage column cluster stores the embedding vector hash key-value pairs or embedding vectors corresponding to the target row key column;
when the user behavior data to be transmitted is duplicate data, deleting the user behavior data to be transmitted; when the user behavior data to be transmitted is not duplicate data, adding the user behavior data to be transmitted to the embedding vector database.
7. The method of claim 4, wherein the establishing of the base index data in the embedding vector database comprises:
sorting all embedding vector hash key value pairs according to the size of the locality sensitive hash value;
storing the sorted embedding vector hash key value pairs sequentially in a plurality of topic sets according to a preset storage threshold; the preset storage threshold is a threshold on the number of embedding vector hash key value pairs stored in each topic set.
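The two steps of claim 7 can be sketched as follows. The sketch assumes each hash key value pair is a tuple of an integer locality sensitive hash value and its embedding vector, and that a topic set is simply a list; `build_topic_sets` is a hypothetical name.

```python
from typing import List, Tuple

# Hypothetical in-memory form of one pair: (locality sensitive hash value, embedding vector).
Pair = Tuple[int, List[float]]


def build_topic_sets(pairs: List[Pair], storage_threshold: int) -> List[List[Pair]]:
    """Sort pairs by hash value, then fill topic sets in order, each holding at most
    storage_threshold pairs."""
    if storage_threshold <= 0:
        raise ValueError("storage_threshold must be positive")
    ordered = sorted(pairs, key=lambda pair: pair[0])
    return [
        ordered[start:start + storage_threshold]
        for start in range(0, len(ordered), storage_threshold)
    ]
```

With five pairs and a threshold of 2, this yields three topic sets of sizes 2, 2 and 1, so pairs with adjacent hash values end up in the same or neighboring sets.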
8. The method of claim 5, wherein after establishing neighbor set index data in the embedding vector database, the method comprises:
constructing an index structure according to the neighbor set index data; the index structure comprises a primary index, a primary index object, a secondary index and a secondary index object;
the primary index comprises an index truncation of the keyword key of the embedding vector hash key value pair, and the primary index object is the locality sensitive hash value in each embedding vector hash key value pair containing the index truncation;
the secondary index is the locality sensitive hash value in the embedding vector hash key value pair, and the secondary index object is the embedding vector hash key value pair.
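One way to picture the two-level structure of claim 8 is the sketch below. It assumes the keyword key is the locality sensitive hash value written as a bit string and that the index truncation is a fixed-length prefix of that string. Both are assumptions made for illustration, as are the names `build_two_level_index` and `lookup`.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical pair: (locality sensitive hash value as a bit string, embedding vector).
Pair = Tuple[str, List[float]]


def build_two_level_index(
    pairs: List[Pair], prefix_len: int
) -> Tuple[Dict[str, Set[str]], Dict[str, List[Pair]]]:
    """Primary index: key prefix -> hash values; secondary index: hash value -> pairs."""
    primary: Dict[str, Set[str]] = defaultdict(set)
    secondary: Dict[str, List[Pair]] = defaultdict(list)
    for lsh_value, embedding in pairs:
        primary[lsh_value[:prefix_len]].add(lsh_value)
        secondary[lsh_value].append((lsh_value, embedding))
    return dict(primary), dict(secondary)


def lookup(
    primary: Dict[str, Set[str]], secondary: Dict[str, List[Pair]], prefix: str
) -> List[Pair]:
    """Resolve a prefix through the primary index, then fetch pairs via the secondary index."""
    result: List[Pair] = []
    for lsh_value in sorted(primary.get(prefix, ())):
        result.extend(secondary[lsh_value])
    return result
```

The prefix lookup narrows the search to a handful of hash values before any full pair is read, which is the usual motivation for a coarse first level.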
9. The method of claim 5, wherein querying a pre-constructed embedding vector database according to the target embedding vector hash key value pair to obtain a list of embedding vector hash key value pairs matching the data to be matched comprises:
indexing a target locality sensitive hash value in the target embedding vector hash key value pair in the neighbor set index data, and determining a target neighbor set matched with the target locality sensitive hash value;
calculating the similarity between the target locality sensitive hash value and the locality sensitive hash value in each embedding vector hash key value pair in the target neighbor set;
and determining the embedding vector hash key value pair list according to the similarity.
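The query flow of claim 9 can be sketched under a few stated assumptions: hash values are 8-bit integers, a neighbor set is selected by the top 4 bits of the hash value, and similarity is the fraction of matching bits (one minus the normalized Hamming distance). The claim does not name a similarity measure or a bucketing rule, so `NUM_BITS`, `PREFIX_BITS`, `bucket_of`, `hamming_similarity` and `query` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical pair: (locality sensitive hash value, embedding vector).
Pair = Tuple[int, List[float]]

NUM_BITS = 8      # assumed width of a hash value
PREFIX_BITS = 4   # assumed number of leading bits that select a neighbor set


def bucket_of(lsh_value: int) -> int:
    """Map a hash value to the key of its neighbor set (its leading bits)."""
    return lsh_value >> (NUM_BITS - PREFIX_BITS)


def hamming_similarity(a: int, b: int) -> float:
    """Fraction of bit positions in which the two hash values agree."""
    return 1.0 - bin(a ^ b).count("1") / NUM_BITS


def query(target_lsh: int, neighbor_sets: Dict[int, List[Pair]], top_k: int) -> List[Pair]:
    """Find the target neighbor set, rank its pairs by similarity, and keep the best top_k."""
    candidates = neighbor_sets.get(bucket_of(target_lsh), [])
    ranked = sorted(
        candidates,
        key=lambda pair: hamming_similarity(target_lsh, pair[0]),
        reverse=True,
    )
    return ranked[:top_k]
```

Only the pairs in one neighbor set are scored, so the cost of a query depends on the size of that set rather than on the size of the whole database.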
10. A data recommendation apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a query request, wherein the query request comprises data to be matched;
the processing module is used for processing the data to be matched into a target embedding vector;
the generating module is used for generating a target embedding vector hash key value pair according to the target embedding vector;
the matching module is used for querying a pre-constructed embedding vector database according to the target embedding vector hash key value pair to obtain an embedding vector hash key value pair list matched with the data to be matched; the pre-constructed embedding vector database comprises a plurality of pre-stored embedding vector hash key value pairs; each embedding vector hash key value pair comprises an embedding vector and the locality sensitive hash value of that embedding vector;
and the recommending module is used for generating a recommendation result according to the embedding vector hash key value pair list.
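As an illustration of what the generating module might do, the sketch below turns an embedding vector into a hash key value pair using sign random projection, a common locality sensitive hashing scheme. The claims do not say which scheme is used, so this choice, the fixed seed, and the names `make_hyperplanes`, `lsh_value` and `to_hash_pair` are assumptions.

```python
import random
from typing import List, Tuple


def make_hyperplanes(num_bits: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Draw num_bits random hyperplane normals; a fixed seed keeps hashing reproducible."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_value(embedding: List[float], hyperplanes: List[List[float]]) -> int:
    """Pack one bit per hyperplane: 1 if the embedding lies on its non-negative side."""
    value = 0
    for plane in hyperplanes:
        dot = sum(w * x for w, x in zip(plane, embedding))
        value = (value << 1) | (1 if dot >= 0 else 0)
    return value


def to_hash_pair(
    embedding: List[float], hyperplanes: List[List[float]]
) -> Tuple[int, List[float]]:
    """Form the hash key value pair: (locality sensitive hash value, embedding vector)."""
    return lsh_value(embedding, hyperplanes), embedding
```

Vectors that point in similar directions tend to fall on the same side of most hyperplanes, so their hash values differ in few bits; this is the property the matching module relies on.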
11. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation of the data recommendation method according to any one of claims 1-9.
12. A computer-readable storage medium having stored therein at least one executable instruction that, when executed on a computing device, causes the computing device to perform operations of a data recommendation method of any one of claims 1-9.
CN202111121375.XA 2021-09-24 2021-09-24 Data recommendation method and device, computing equipment and computer-readable storage medium Pending CN113918807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111121375.XA CN113918807A (en) 2021-09-24 2021-09-24 Data recommendation method and device, computing equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111121375.XA CN113918807A (en) 2021-09-24 2021-09-24 Data recommendation method and device, computing equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN113918807A true CN113918807A (en) 2022-01-11

Family

ID=79235999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111121375.XA Pending CN113918807A (en) 2021-09-24 2021-09-24 Data recommendation method and device, computing equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113918807A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199827A (en) * 2014-07-24 2014-12-10 北京大学 Locality-sensitive-hashing-based high-dimensional indexing method for large-scale multimedia data
CN104778234A (en) * 2015-03-31 2015-07-15 南京邮电大学 Multi-label file nearest neighbor search method based on LSH (Locality Sensitive Hashing) technology
CN108717407A (en) * 2018-05-11 2018-10-30 北京三快在线科技有限公司 Entity vector determines method and device, information retrieval method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
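林朝晖; 于俊清; 何云峰; 管涛; 艾列富: "High-dimensional distributed locality-sensitive hashing index method" (高维分布式局部敏感哈希索引方法), 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), no. 09, 28 May 2013 (2013-05-28), pages 47 - 55 *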

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115296901A (en) * 2022-08-03 2022-11-04 中国平安财产保险股份有限公司 Authority management method based on artificial intelligence and related equipment
CN115296901B (en) * 2022-08-03 2023-07-04 中国平安财产保险股份有限公司 Rights management method based on artificial intelligence and related equipment
CN117271529A (en) * 2023-11-20 2023-12-22 阿里云计算有限公司 Index processing method, device and storage medium
CN117271529B (en) * 2023-11-20 2024-03-29 阿里云计算有限公司 Index processing method, device and storage medium

Similar Documents

Publication Publication Date Title
US11341419B2 (en) Method of and system for generating a prediction model and determining an accuracy of a prediction model
WO2020182019A1 (en) Image search method, apparatus, device, and computer-readable storage medium
US11853334B2 (en) Systems and methods for generating and using aggregated search indices and non-aggregated value storage
JP5749279B2 (en) Join embedding for item association
KR101265896B1 (en) Ranking system and how to provide ranking
US10521441B2 (en) System and method for approximate searching very large data
CN112115232B (en) Data error correction method, device and server
US20240127575A1 (en) Artificial intelligence system with iterative two-phase active learning
CN111475725A (en) Method, apparatus, device, and computer-readable storage medium for searching content
CN113821657A (en) Image processing model training method and image processing method based on artificial intelligence
CN112883030A (en) Data collection method and device, computer equipment and storage medium
US20210272013A1 (en) Concept modeling system
US9026539B2 (en) Ranking supervised hashing
CN113918807A (en) Data recommendation method and device, computing equipment and computer-readable storage medium
Prasanth et al. Effective big data retrieval using deep learning modified neural networks
CN106294784B (en) resource searching method and device
Yin et al. Content‐Based Image Retrial Based on Hadoop
US10235432B1 (en) Document retrieval using multiple sort orders
CN116578757A (en) Training method for blog vector generation model, blog recommendation method, device and equipment
US20090319505A1 (en) Techniques for extracting authorship dates of documents
Gisolf et al. Search and Explore Strategies for Interactive Analysis of Real-Life Image Collections with Unknown and Unique Categories
JP2011159100A (en) Successive similar document retrieval apparatus, successive similar document retrieval method and program
CN111639099A (en) Full-text indexing method and system
CN118227808B (en) File intelligent retrieval method and device based on big data and electronic equipment
CN118503807B (en) Multi-dimensional cross-border commodity matching method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination