CN116304253B - Data storage method, data retrieval method and method for identifying similar video - Google Patents


Info

Publication number
CN116304253B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310215233.2A
Other languages
Chinese (zh)
Other versions
CN116304253A (en)
Inventor
尹洁
黄贲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data storage method, a data retrieval method and a method for identifying similar videos, and relates to the technical field of computers, in particular to the technical fields of artificial intelligence, big data, data retrieval and the like. The specific implementation scheme is as follows: clustering a plurality of data to obtain at least one first cluster center; clustering the plurality of data according to the at least one first cluster center to obtain at least one second cluster center; dividing the plurality of data into at least one data group according to the at least one first cluster center and the at least one second cluster center; generating a graph structure from the at least one data group; and storing the graph structure to a database.

Description

Data storage method, data retrieval method and method for identifying similar video
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to the technical fields of artificial intelligence, big data, data retrieval, and the like.
Background
With the widespread use of CNNs (Convolutional Neural Networks), ANN (Approximate Nearest Neighbor) search, the basic retrieval technique applied to CNN features, has also developed rapidly. HNSW (Hierarchical Navigable Small World) may be used for large-scale datasets, but the HNSW algorithm is computationally intensive.
Disclosure of Invention
The present disclosure provides a data storage method, a data retrieval method, and a method, an apparatus, a device, a storage medium, and a program product for identifying similar videos.
According to an aspect of the present disclosure, there is provided a data storage method including: clustering the plurality of data to obtain at least one first clustering center; clustering the plurality of data according to the at least one first clustering center to obtain at least one second clustering center; dividing the plurality of data into at least one data set according to the at least one first cluster center and the at least one second cluster center; generating a graph structure according to the at least one data set; and storing the graph structure to a database.
According to another aspect of the present disclosure, there is provided a data retrieval method including: determining, among at least one first cluster center, a target first cluster center that matches data to be retrieved; determining, among at least one second cluster center, a target second cluster center that matches the data to be retrieved; determining a target data set among a plurality of data sets according to the target first cluster center and the target second cluster center; and retrieving, in a graph structure of a database, data that matches the data to be retrieved by taking a node corresponding to the target data set as a starting point, to obtain a retrieval result, wherein the graph structure includes a plurality of nodes, the nodes are in one-to-one correspondence with a plurality of original data, and the plurality of original data are stored in the database according to the data storage method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a method of identifying similar videos, including: determining, among at least one first cluster center, a target first cluster center that matches video information to be identified; determining, among at least one second cluster center, a target second cluster center that matches the video information to be identified; determining a target data set among a plurality of data sets according to the target first cluster center and the target second cluster center; and retrieving, in a graph structure of a database, video information that matches the video information to be identified by taking a node corresponding to the target data set as a starting point, to obtain an identification result, wherein the graph structure includes a plurality of nodes, the nodes are in one-to-one correspondence with a plurality of original video information, and the plurality of original video information is stored in the database according to the data storage method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a data storage device including: a first clustering module configured to cluster a plurality of data to obtain at least one first cluster center; a second clustering module configured to cluster the plurality of data according to the at least one first cluster center to obtain at least one second cluster center; a partitioning module configured to partition the plurality of data into at least one data set according to the at least one first cluster center and the at least one second cluster center; a graph generating module configured to generate a graph structure according to the at least one data set; and a storage module configured to store the graph structure to a database.
According to another aspect of the present disclosure, there is provided a data retrieval apparatus including: a first cluster center determining module configured to determine, among at least one first cluster center, a target first cluster center that matches data to be retrieved; a second cluster center determining module configured to determine, among at least one second cluster center, a target second cluster center that matches the data to be retrieved; a first target data set determining module configured to determine a target data set among a plurality of data sets according to the target first cluster center and the target second cluster center; and a first retrieval module configured to retrieve, in a graph structure of a database, data that matches the data to be retrieved by taking a node corresponding to the target data set as a starting point, to obtain a retrieval result, wherein the graph structure includes a plurality of nodes, the nodes are in one-to-one correspondence with a plurality of original data, and the plurality of original data are stored in the database according to the data storage method of the embodiments of the present disclosure.
Another aspect of the present disclosure provides an apparatus for identifying similar videos, including: a third cluster center determining module configured to determine, among at least one first cluster center, a target first cluster center that matches video information to be identified; a fourth cluster center determining module configured to determine, among at least one second cluster center, a target second cluster center that matches the video information to be identified; a second target data set determining module configured to determine a target data set among a plurality of data sets according to the target first cluster center and the target second cluster center; and a second retrieval module configured to retrieve, in a graph structure of a database, video information that matches the video information to be identified by taking a node corresponding to the target data set as a starting point, to obtain an identification result, wherein the graph structure includes a plurality of nodes, the nodes are in one-to-one correspondence with a plurality of original video information, and the plurality of original video information is stored in the database according to the data storage method of the embodiments of the present disclosure.
Another aspect of the present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods shown in the embodiments of the present disclosure.
According to another aspect of the disclosed embodiments, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the methods shown in the disclosed embodiments.
According to another aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program/instruction, characterized in that the computer program/instruction, when executed by a processor, implements the steps of the method shown in the disclosed embodiments.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a data storage method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a method of generating a graph structure, according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart of a data retrieval method according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of determining a target second cluster center according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a schematic diagram of a three-point relationship according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a method of retrieving data matching data to be retrieved, starting with a node corresponding to a target data set, according to an embodiment of the disclosure;
FIG. 8 schematically illustrates a block diagram of a data storage device according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a data retrieval device according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of an apparatus for identifying similar videos in accordance with an embodiment of the disclosure;
FIG. 11 schematically illustrates a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The system architecture of the data storage method, the data retrieval method, the method and the device for identifying similar videos provided by the present disclosure will be described below with reference to fig. 1.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which data storage methods, data retrieval methods, methods and apparatus for identifying similar videos may be applied, according to embodiments of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the data storage method, the data retrieval method, and the method for identifying similar videos provided in the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the data storage device, the data retrieval device, and the device for identifying similar videos provided by embodiments of the present disclosure may be generally disposed in the server 105. The data storage method, the data retrieval method, and the method of identifying similar videos provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the data storage means, the data retrieving means and the means for identifying similar videos provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
The data storage method, the data retrieval method and the method for identifying similar videos provided by the embodiment of the disclosure can be executed by the same device or can be executed by different devices respectively. Accordingly, the data storage device, the data retrieval device and the device for identifying similar videos provided in the embodiments of the present disclosure may be disposed in the same device, or may be disposed in different devices respectively. The present disclosure is not particularly limited thereto.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of users' personal information all comply with relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
The data storage method provided by the present disclosure will be described below with reference to fig. 2.
Fig. 2 schematically illustrates a flow chart of a data storage method according to an embodiment of the present disclosure.
As shown in fig. 2, the data storage method 200 includes clustering a plurality of data to obtain at least one first cluster center in operation S210.
According to embodiments of the present disclosure, the data may include, for example, images, text, video, audio, and the like. Alternatively, the data may also include, for example, features obtained by feature extraction of images, text, video, audio, and the like.
According to embodiments of the present disclosure, a plurality of data may be clustered, for example, using a clustering algorithm, resulting in at least one first cluster center. A clustering algorithm is an unsupervised learning method that automatically divides a set of unlabeled data into at least one class: samples that are closer together or more similar are grouped into the same cluster according to the distance or similarity between samples, a plurality of clusters are finally formed, and the center of each cluster, namely the cluster center, can be obtained. The obtained cluster center is related to all data within its coverage and may be the mean vector of that data. A cluster center summarizes the data distribution in its cluster, and different cluster centers are distinguishable from one another.
Illustratively, in this embodiment, a K-Means clustering algorithm may be used to cluster a plurality of data to obtain at least one cluster and the cluster center of each cluster, that is, a first cluster center. For example, the desired number K of clusters may be set first, where K is a positive integer. K data may then be randomly selected from the plurality of data as cluster centers, after which, for each of the plurality of data, the distance from the data to each cluster center is calculated, wherein the distance may include, for example, a Euclidean distance (L2 distance), a cosine distance (COS distance), and the like. Each data is assigned to the cluster of the cluster center it is closest to. After all the data are assigned, K clusters are obtained. Then, the cluster center of each cluster is recalculated. If the distance between the newly calculated cluster center and the original cluster center is smaller than a predetermined threshold, the position of the cluster center has changed little and has stabilized or converged, so the clustering can be considered to have reached the expected result and the algorithm terminates. However, if the distance between the new cluster center and the original cluster center is greater than the predetermined threshold, the position of the recalculated cluster center has changed considerably, and the above steps are iterated until the position of the recalculated cluster center becomes stable. The predetermined threshold may be set according to actual needs.
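By way of illustration only, the K-Means procedure described above may be sketched as follows. This is a minimal pure-Python sketch rather than the implementation of the disclosure; the function name `kmeans`, the parameters `threshold`, `max_iters` and `seed`, and the handling of empty clusters are assumptions introduced here.

```python
import math
import random


def kmeans(points, k, threshold=1e-6, max_iters=100, seed=0):
    """Cluster equal-length float tuples; return (centers, labels).

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # K data chosen at random as initial centers
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assign each data item to its nearest center (Euclidean / L2 distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Recompute each center as the mean vector of its cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # assumption: keep an empty cluster's center
        # Largest movement of any center in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:  # centers have stabilized, so stop iterating
            break
    return centers, labels
```

For instance, `kmeans(points, 2)` on two well-separated groups of points would be expected to assign each group its own label. The cosine distance mentioned above could be substituted for `math.dist` without changing the structure of the loop.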
It will be appreciated that other clustering algorithms may be selected to cluster data in the plurality of data in addition to K-Means, which is not specifically limited in this disclosure.
Then, in operation S220, the plurality of data are clustered according to the at least one first cluster center, resulting in at least one second cluster center.
According to an embodiment of the present disclosure, the plurality of data may be clustered again based on the first cluster center, for example, using a clustering algorithm, thereby obtaining at least one second cluster center. The clustering algorithm may include a K-Means clustering algorithm.
In operation S230, the plurality of data is divided into at least one data group according to at least one first cluster center and at least one second cluster center.
According to an embodiment of the present disclosure, for example, data that correspond to the same first cluster center and the same second cluster center among the plurality of data may be divided into one data group. Each data group includes at least one data.
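By way of illustration only, the division into data groups may be sketched as follows, assuming that each data item has already been labeled with the index of its first cluster center and the index of its second cluster center. The function name `group_by_center_pair` and the label lists are hypothetical.

```python
from collections import defaultdict


def group_by_center_pair(first_labels, second_labels):
    """Map each (first center, second center) pair to the indices of its data."""
    groups = defaultdict(list)
    for index, pair in enumerate(zip(first_labels, second_labels)):
        groups[pair].append(index)  # same pair of centers -> same data group
    return dict(groups)


# Five data items: items 0 and 1 share both centers and so share a group.
groups = group_by_center_pair([0, 0, 1, 1, 0], [2, 2, 2, 0, 1])
```

Only pairs that actually occur become keys, so every resulting data group contains at least one data item.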
In operation S240, a graph structure is generated from at least one data set.
According to embodiments of the present disclosure, a node in a graph structure may be determined, for example, from each data in at least one data set. Edges between nodes may then be generated from the correspondence of nodes to the data sets, where the edges are used to represent neighbor relationships between nodes.
In operation S250, the graph structure is stored to a database.
According to embodiments of the present disclosure, a plurality of data may be stored in a database in the form of a graph structure for retrieval.
According to an embodiment of the present disclosure, data is grouped according to the results of the two clustering passes, and a graph structure is generated according to the grouping result. Searching based on this graph structure can improve retrieval performance.
According to embodiments of the present disclosure, for example, an original feature of each of a plurality of data may be determined, resulting in an original feature space. And then clustering the plurality of data based on the original feature space to obtain at least one first cluster and a cluster center of the at least one first cluster, namely a first cluster center.
According to embodiments of the present disclosure, for example, a residual error between each data of the plurality of data and a first cluster center closest to the data may be determined, resulting in a residual error vector space. And then clustering the plurality of data based on the residual vector space to obtain at least one second cluster and a cluster center of the at least one second cluster, namely a second cluster center.
For example, a residual vector between each of the plurality of data and the first cluster center closest to it may be calculated, thereby obtaining a residual vector corresponding to each data. The residual vectors corresponding to all the data constitute a residual vector space, which describes the distribution of the data's positions relative to their cluster centers. The residual vector space may then be clustered using a K-Means clustering algorithm to describe the data distribution within the new space, resulting in at least one corresponding second cluster center.
According to an embodiment of the disclosure, data belonging to the same first cluster center in the original feature space may belong to any second cluster center in the residual vector space. Conversely, residual vectors belonging to the same second cluster center in the residual vector space may correspond to different first cluster centers in the original feature space. Two data may be considered highly similar if they belong to the same first cluster center in the original feature space and to the same second cluster center in the residual vector space.
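By way of illustration only, the residual vectors may be computed as sketched below. The helper names `nearest` and `residuals` are hypothetical, and the Euclidean distance is assumed for choosing the closest first cluster center.

```python
import math


def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance assumed)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def residuals(points, first_centers):
    """Residual of each data item relative to its closest first cluster center."""
    result = []
    for p in points:
        center = first_centers[nearest(p, first_centers)]
        result.append(tuple(x - c for x, c in zip(p, center)))
    return result


first_centers = [(0.0, 0.0), (10.0, 10.0)]
res = residuals([(1.0, 1.0), (9.0, 11.0)], first_centers)
# res == [(1.0, 1.0), (-1.0, 1.0)]
```

Clustering the list returned by `residuals` (for example with K-Means) would then yield the second cluster centers; that second clustering pass is not repeated in this sketch.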
In this embodiment, the first cluster may correspond to a first subspace, and the second cluster may correspond to a second subspace. Theoretically, the spatial position of the cluster center is close to the center point of each subspace. For data under different cluster centers, the distribution of the positions (residuals) relative to the cluster centers is always similar. Thus, the data distribution in the second subspace is similar to the data distribution in the first subspace.
For example, on a data set of 400 million data, assume that the original feature space is divided into 5000 first subspaces, so that each first subspace holds about 80000 samples on average, and that the residual vector space is likewise divided into 5000 subspaces. A first subspace in the original feature space may be referred to by its cluster center Ci. The data in Ci may belong to any of {Fn, Fm, …, Ft}, where {Fn, Fm, …, Ft} is a subset of {F0, F1, …, Fh} and F0, F1, …, Fh are the second subspaces of the residual vector space. Thus, each data can be identified by a cluster center pair, i.e., one of {(Ci, Fn), (Ci, Fm), …, (Ci, Ft)}. The training cost is then only 2×5000 cluster centers, so the cost is greatly reduced. In addition, dividing the data into two layers, the original feature space and the residual vector space, reduces the amount of computation when storing the data, thereby improving both data storage efficiency and retrieval efficiency.
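The arithmetic behind this example can be written out as follows. The symbols N (number of data), K1 (number of first cluster centers) and K2 (number of second cluster centers) are notation introduced here for illustration and do not appear in the original text; the last line is an upper bound on the number of distinct cluster center pairs, not a figure stated in the disclosure.

```latex
\begin{align*}
\frac{N}{K_1} &= \frac{4 \times 10^{8}}{5000} = 8 \times 10^{4}
  && \text{average samples per first subspace} \\
K_1 + K_2 &= 5000 + 5000 = 10^{4}
  && \text{cluster centers to train} \\
K_1 \cdot K_2 &= 5000 \times 5000 = 2.5 \times 10^{7}
  && \text{upper bound on distinct pairs } (C_i, F_j)
\end{align*}
```

In other words, training 10000 centers can address up to 25 million data groups, whereas a single-level clustering with the same granularity would need a center per group.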
According to embodiments of the present disclosure, a node in a graph structure may be determined, for example, from each data in at least one data set. Then for any two nodes, adjacent edges can be generated between the two nodes in the case where the two nodes correspond to the same data set. In case the two nodes correspond to different data sets, a similarity between the two nodes is determined. In the case where the similarity is higher than the similarity threshold, a neighboring edge is generated between the two nodes. The similarity threshold may be set according to actual needs.
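By way of illustration only, the edge-generation rule above may be sketched as follows. The cosine similarity is assumed as the similarity measure (the text leaves the measure open), the vectors are assumed to be non-zero, and the names `cosine_similarity`, `build_graph` and `group_of` are hypothetical. The sketch compares every pair of nodes for clarity and is not tuned for large data volumes.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_graph(vectors, group_of, similarity_threshold):
    """Return an adjacency dict; node i corresponds to vectors[i]."""
    adjacency = {node: set() for node in range(len(vectors))}
    for u, v in combinations(range(len(vectors)), 2):
        same_group = group_of[u] == group_of[v]
        # Same data group: always linked. Different groups: linked only
        # when the similarity exceeds the threshold.
        if same_group or cosine_similarity(vectors[u], vectors[v]) > similarity_threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.1), (-1.0, 0.0)]
graph = build_graph(vectors, group_of=["A", "A", "B", "B"], similarity_threshold=0.9)
```

Here nodes 0 and 1 are linked because they share group A, nodes 2 and 3 because they share group B, and nodes 0 and 2 are linked across groups because their similarity exceeds 0.9.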
According to the embodiment of the disclosure, the data is converted into a graph structure without constructing a fully connected graph, so construction performs better. In the graph structure, adjacent subspaces remain connected, which supports hopping between adjacent subspaces. In addition, the graph structure has fewer adjacent edges and those edges are of higher quality, so the total number of edges required can be greatly reduced, and accordingly the memory occupied by data storage and retrieval can be greatly reduced.
The method of generating a graph structure provided by the present disclosure will be described below in connection with fig. 3.
Fig. 3 schematically illustrates a schematic diagram of a method of generating a graph structure according to an embodiment of the disclosure.
As shown in fig. 3, data set a may include data A1, A2, and A3. Data set B may include data B1, B2, and B3. Based on this, it is possible to generate the nodes 311, 312, and 313 according to A1, A2, and A3, respectively, and generate adjacent edges between the nodes 311, 312, and 313. From B1, B2, and B3, nodes 321, 322, and 323 are generated, respectively, and adjacent edges are generated between the nodes 321, 322, and 323. Then, since the similarity of the data A1 and the data B1 is higher than the similarity threshold, a neighboring edge is generated between the nodes 311 and 321.
The data retrieval method provided by the present disclosure will be described below with reference to fig. 4.
Fig. 4 schematically illustrates a flow chart of a data retrieval method according to an embodiment of the present disclosure.
As shown in fig. 4, the data retrieval method 400 includes, in operation S410, determining, among at least one first cluster center, a target first cluster center that matches the data to be retrieved.
According to embodiments of the present disclosure, the data to be retrieved may include, for example, images, text, video, audio, and the like. Optionally, after the data to be retrieved is obtained, features of the data to be retrieved may be extracted, and normalization processing may be performed on the extracted features, so as to perform retrieval better.
According to embodiments of the present disclosure, for example, a distance between each of the at least one first cluster center and the data to be retrieved may be calculated. And then determining a first cluster center with a distance smaller than a first distance threshold value in the at least one first cluster center as a target first cluster center. The first distance threshold may be set according to actual needs, which is not specifically limited in this disclosure.
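By way of illustration only, the threshold-based selection of target first cluster centers may be sketched as follows. The function name `target_centers` is hypothetical and the Euclidean distance is assumed.

```python
import math


def target_centers(query, centers, distance_threshold):
    """Indices of the cluster centers closer to `query` than the threshold."""
    return [
        index
        for index, center in enumerate(centers)
        if math.dist(query, center) < distance_threshold
    ]


centers = [(1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
targets = target_centers((0.0, 0.0), centers, distance_threshold=2.5)
# targets == [0, 2]
```

More than one center may fall under the threshold, in which case each of them is treated as a target first cluster center.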
In operation S420, a target second cluster center that matches the data to be retrieved is determined among at least one second cluster center.
According to embodiments of the present disclosure, for example, a distance between each of the at least one second cluster center and the data to be retrieved may be determined. Then, a second cluster center whose distance is smaller than a second distance threshold, among the at least one second cluster center, is determined as a target second cluster center. The second distance threshold may be set according to actual needs, which is not specifically limited in this disclosure.
In operation S430, a target data set of the plurality of data sets is determined according to the target first cluster center and the target second cluster center.
According to an embodiment of the present disclosure, a data group corresponding to a target first cluster center and a target second cluster center among a plurality of data groups may be determined as a target data group.
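One way to picture this lookup is an index keyed by the pair of cluster center identifiers. The sketch below assumes integer identifiers for the cluster centers and string identifiers for the data; both are hypothetical.

```python
from collections import defaultdict


def build_group_index(assignments):
    """Map each (first_center_id, second_center_id) pair to the data ids it holds."""
    index = defaultdict(list)
    for data_id, first_id, second_id in assignments:
        index[(first_id, second_id)].append(data_id)
    return dict(index)


def target_groups(index, target_first, target_second):
    """Return the data groups keyed by any matching (first, second) center pair."""
    return {
        (f, s): index[(f, s)]
        for f in target_first
        for s in target_second
        if (f, s) in index
    }


# Hypothetical assignments: (data id, first center id, second center id).
assignments = [("A1", 0, 0), ("A2", 0, 0), ("A3", 0, 0), ("B1", 0, 1), ("C1", 1, 1)]
index = build_group_index(assignments)
hits = target_groups(index, target_first=[0], target_second=[0, 1])
```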
In operation S440, in the graph structure of the database, data matching the data to be retrieved is retrieved with a node corresponding to the target data group as a starting point, resulting in a retrieval result.
According to an embodiment of the present disclosure, the graph structure may include a plurality of nodes, which are in one-to-one correspondence with a plurality of raw data stored to the database according to the data storage method according to the embodiment of the present disclosure.
According to the embodiment of the disclosure, coarse-grained retrieval is performed according to the first clustering center and the second clustering center to obtain a target data set. And then carrying out fine-grained retrieval on the data in the target data set to obtain a retrieval result. Thus, the search efficiency can be improved.
In addition, according to an embodiment of the present disclosure, when searching for data in a target data set, nodes corresponding to the target data set are all used as starting points, traversing is performed in a graph structure, and matched data is searched for. When the target data set corresponds to a plurality of nodes, traversal can be started from the plurality of nodes, so that calculation steps can be reduced, and the retrieval efficiency can be improved.
The method of determining the target second cluster center provided by the present disclosure will be described below in conjunction with fig. 5.
Fig. 5 schematically illustrates a flow chart of a method of determining a target second cluster center according to an embodiment of the disclosure.
As shown in fig. 5, the method 520 of determining the target second cluster center includes acquiring a first distance between the target first cluster center and the second cluster center in operation S521.
According to embodiments of the present disclosure, for example, a distance between a target first cluster center and a second cluster center may be calculated, resulting in a first distance. Wherein the first distance may include a euclidean distance, a cosine distance, and the like.
According to another embodiment of the present disclosure, for example, the distance between each first cluster center and each second cluster center may be pre-calculated and the first distance recorded in the database. After the target first cluster center is determined, the distance between the target first cluster center and each second cluster center can be directly obtained from the stored data, so that the calculation process can be reduced.
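The pre-calculation described above may be sketched as a simple distance table. The sketch assumes that the first and second cluster centers are expressed as points in the same space and that the Euclidean distance is used; the table layout is likewise an assumption.

```python
import math


def precompute_first_distances(first_centers, second_centers):
    """Build table[i][j] = distance between first center i and second center j."""
    return [[math.dist(c, s) for s in second_centers] for c in first_centers]


# Hypothetical cluster centers.
first_centers = [(0.0, 0.0), (3.0, 4.0)]
second_centers = [(3.0, 4.0), (0.0, 1.0)]
table = precompute_first_distances(first_centers, second_centers)

# At retrieval time, the first distances for target first center 0 are read
# from the table instead of being recomputed.
row = table[0]
```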
In operation S522, a distance upper bound is determined according to the first distance and a second distance between the target first cluster center and the data to be retrieved.
According to the embodiment of the disclosure, for example, the distance between the target first clustering center and the data to be retrieved can be calculated, so as to obtain the second distance. Wherein the second distance may include a euclidean distance, a cosine distance, and the like.
According to embodiments of the present disclosure, the distance upper bound may be determined from the first distance and the second distance between the target first cluster center and the data to be retrieved, for example, using the triangle inequality.
In operation S523, a distance lower bound is determined from the first distance, the second distance, and the residual corresponding to the second cluster center.
According to embodiments of the present disclosure, for example, the triangle inequality may be expanded to obtain a transformed formula, based on which the distance lower bound may be determined from the first distance, the second distance, and the residual corresponding to the second cluster center.
In operation S524, a second cluster center, of which a distance from the data to be retrieved among the at least one second cluster center matches a distance upper bound and a distance lower bound, is determined as a target second cluster center.
According to an embodiment of the present disclosure, a match is indicated if the distance is greater than or equal to the lower distance bound and less than or equal to the upper distance bound.
In this embodiment, any three points in the L2 or COS distance space satisfy the triangle inequality, which can be expressed as follows:
$d(O_x, O_y) \le d(O_x, O_z) + d(O_z, O_y)$
where d is the distance, and $O_x$, $O_y$, and $O_z$ represent any three points in the space.
Based on this, fig. 6 schematically shows a schematic diagram of a three-point relationship according to an embodiment of the present disclosure.
As shown in fig. 6, the data, the first cluster center, and the second cluster center satisfy the following formulas:
$\lVert p - (C + F) \rVert_2 \ge \lVert p - C \rVert_2 - \lVert (C + F) - C \rVert_2$
wherein p may be data, C may be a first cluster center, and F may be a second cluster center.
Based on this, in the present embodiment, the distance upper bound can be calculated, for example, according to the following formula:
where max is the distance upper bound, $D_1$ is the first distance, and $D_2$ is the second distance.
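In addition, the squared distance can be expanded to yield the following formula:
$\lVert q - C - F_i \rVert^2 = \lVert q - C \rVert^2 + 2\langle C, F_i \rangle + \lVert F_i \rVert^2 - 2\langle q, F_i \rangle \ge \lVert q - C \rVert^2 + \min(2\langle C, F_i \rangle) + \lVert F_i \rVert^2 - 2\langle q, F_i \rangle$
where q may be data, C may be a first cluster center, and $F_i$ may be the i-th second cluster center.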
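The expansion above can be checked numerically. The following sketch uses hypothetical two-dimensional vectors and assumes that the minimum is taken over all second cluster centers; it verifies the equality term by term and shows that replacing the term 2<C, F_i> with its minimum yields a value that never exceeds the exact squared distance.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def sub(u, v):
    return tuple(x - y for x, y in zip(u, v))


def sq_norm(u):
    return dot(u, u)


def exact_sq_distance(q, C, F):
    """Left-hand side: squared distance between q and the point C + F."""
    return sq_norm(sub(sub(q, C), F))


def expanded_sq_distance(q, C, F):
    """Right-hand side of the equality in the expansion."""
    return sq_norm(sub(q, C)) + 2 * dot(C, F) + sq_norm(F) - 2 * dot(q, F)


def lower_bounds(q, C, residuals):
    """Replace 2<C, F_i> with its minimum over all residuals (assumption)."""
    m = min(2 * dot(C, F) for F in residuals)
    base = sq_norm(sub(q, C))
    return [base + m + sq_norm(F) - 2 * dot(q, F) for F in residuals]


# Hypothetical query, first cluster center, and second cluster centers.
q, C = (2.0, 1.0), (1.0, 3.0)
residuals = [(0.5, -1.0), (1.0, 1.0), (-2.0, 0.0)]
exact = [exact_sq_distance(q, C, F) for F in residuals]
expanded = [expanded_sq_distance(q, C, F) for F in residuals]
bounds = lower_bounds(q, C, residuals)
```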
Based on this, in the present embodiment, the distance lower bound can be calculated according to the following formula, for example:
where min is the distance lower bound, $D_1$ is the first distance, $D_2$ is the second distance, and F is the residual corresponding to the second cluster center.
After the distance between the data to be retrieved and the first cluster center is calculated, the distance upper bound and the distance lower bound can be determined according to the triangle inequality, because the distance between the first cluster center and the second cluster center is already known. A pruning operation can then be performed according to the two bounds: all subspaces are sorted according to a certain rule, and subspaces that do not satisfy the distance upper bound and the distance lower bound are skipped without any distance calculation, so that the overall amount of calculation can be greatly reduced.
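The pruning idea can be illustrated with the generic triangle-inequality bounds |D2 - D1| and D1 + D2. These generic bounds are an assumption used only for this sketch and may differ from the formulas of this embodiment; the function name, the sample points, and the threshold are likewise hypothetical.

```python
import math


def prune_second_centers(query, first_center, second_centers, first_distances, threshold):
    """Return (indices of matching second centers, number of exact distance computations)."""
    d2 = math.dist(query, first_center)       # second distance
    kept, exact = [], 0
    for j, d1 in enumerate(first_distances):  # first distances, known in advance
        lower, upper = abs(d2 - d1), d1 + d2
        if lower >= threshold:
            continue                          # cannot be closer than the threshold
        if upper < threshold:
            kept.append(j)                    # certainly closer than the threshold
            continue
        exact += 1                            # bounds are inconclusive
        if math.dist(query, second_centers[j]) < threshold:
            kept.append(j)
    return kept, exact


# Hypothetical data: one first cluster center and three second cluster centers.
first_center = (0.0, 0.0)
second_centers = [(1.0, 0.5), (10.0, 0.0), (0.0, 0.2)]
first_distances = [math.dist(first_center, s) for s in second_centers]
kept, exact = prune_second_centers(
    (1.0, 0.0), first_center, second_centers, first_distances, threshold=1.0
)
```

In this example the second cluster center at (10.0, 0.0) is rejected from its bounds alone, so only two of the three exact distances are computed.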
A method of retrieving data matching data to be retrieved, provided by the present disclosure, with a node corresponding to a target data group as a starting point, will be described below with reference to fig. 7.
Fig. 7 schematically illustrates a flowchart of a method of retrieving data matching data to be retrieved, starting with a node corresponding to a target data set, according to an embodiment of the disclosure.
As shown in fig. 7, the method 740 of retrieving data matching the data to be retrieved, starting from a node corresponding to the target data group, includes adding the node corresponding to the target data group to the candidate set in operation S741.
In operation S742, for each node in the candidate set, a neighbor node having a neighbor relation with each node is determined.
In operation S743, it is determined whether there is a target node matching the data to be retrieved among the neighbor nodes. In case there is a target node matching the data to be retrieved among the neighbor nodes, operation S744 is performed. In case that there is no target node matching the data to be retrieved among the neighbor nodes, operation S745 is performed.
According to the embodiment of the disclosure, for example, the similarity between the neighbor node and the data to be retrieved can be calculated, and in the case that the similarity is greater than the similarity threshold, the neighbor node is determined as the target node. The similarity threshold may be set according to actual needs, which is not specifically limited in this disclosure.
In operation S744, the candidate set is updated according to the target node, and the flow returns to operation S742 for the updated candidate set.
According to embodiments of the present disclosure, for example, in the case that the candidate set is not full, the target node may be added to the candidate set. In the case that the candidate set is full, if the similarity of the target node is greater than that of the node with the minimum similarity in the candidate set, the node with the minimum similarity is deleted and the target node is added to the candidate set; otherwise, the target node may be discarded.
In operation S745, it is determined whether every node in the candidate set has undergone the above operations. If so, operation S746 is performed. Otherwise, the flow returns to operation S742.
According to the embodiment of the disclosure, on one hand, the small-range search can be performed in the target data set, on the other hand, the search can be performed in other adjacent data sets, and finally, the search result is obtained, the recall rate and the accuracy are high, and in addition, the search speed is high.
In operation S746, raw data corresponding to the candidate set is determined as a search result.
According to the embodiment of the disclosure, once every node in the candidate set has been traversed, the data in the current candidate set can be used as the retrieval result.
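Operations S741 to S746 can be sketched as the loop below. It is a simplified illustration under several assumptions: the similarity is a dot product, the candidate set capacity is at least the number of starting nodes, and a `seen` set (not described in the disclosure) is added so that no node is scored twice. All names and sample values are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def graph_search(graph, vectors, query, start_nodes, similarity, sim_threshold, capacity):
    """Return candidate node ids ordered from most to least similar to the query."""
    # S741: the nodes of the target data group seed the candidate set.
    candidates = {n: similarity(vectors[n], query) for n in start_nodes}
    expanded, seen = set(), set(start_nodes)
    while True:
        # S745: stop once every candidate has been expanded.
        pending = [n for n in candidates if n not in expanded]
        if not pending:
            break
        node = pending[0]
        expanded.add(node)
        # S742: visit the neighbor nodes of the current candidate.
        for nb in sorted(graph[node]):
            if nb in seen:
                continue
            seen.add(nb)
            score = similarity(vectors[nb], query)
            # S743: only neighbors above the similarity threshold are target nodes.
            if score <= sim_threshold:
                continue
            # S744: add while not full, otherwise replace the weakest candidate.
            if len(candidates) < capacity:
                candidates[nb] = score
            else:
                worst = min(candidates, key=candidates.get)
                if score > candidates[worst]:
                    del candidates[worst]
                    candidates[nb] = score
    # S746: the remaining candidates form the retrieval result.
    return sorted(candidates, key=candidates.get, reverse=True)


# Hypothetical graph and features matching the layout of fig. 3.
graph = {
    311: {312, 313, 321}, 312: {311, 313}, 313: {311, 312},
    321: {311, 322, 323}, 322: {321, 323}, 323: {321, 322},
}
vectors = {
    311: (1.0, 0.0), 312: (0.0, 1.0), 313: (-1.0, 0.0),
    321: (0.9, 0.1), 322: (0.0, -1.0), 323: (-0.6, -0.8),
}
result = graph_search(
    graph, vectors, (1.0, 0.2), [311, 312, 313], dot, sim_threshold=0.5, capacity=3
)
```

Starting from the three nodes of one data set, the search crosses the edge between nodes 311 and 321 and replaces the least similar starting node with node 321.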
In this embodiment, the data to be stored may include video information, for example. The video information may include, for example, a video file, a feature of the video, description information of the video, and the like. Based on this, a plurality of original video information may be acquired and then stored in the form of a graph structure to a database, which may be used to identify similar videos. Wherein the original video information may be determined, for example, from video published in the video platform.
For example, a plurality of original video information may be clustered to obtain at least one first cluster center. And clustering the plurality of original video information according to the at least one first clustering center to obtain at least one second clustering center. The plurality of original video information is then partitioned into at least one data set based on the at least one first cluster center and the at least one second cluster center. And generating a graph structure according to the at least one data set. The graph structure is then stored to a database. The specific method for storing the original video information may refer to the data storage method described above, and will not be described herein.
For example, in this embodiment, the data to be retrieved may include video information of the video to be identified, that is, the video information to be identified. The original video information in the database may be utilized to determine whether there is a video similar to the video information to be identified.
For example, for video information to be identified, a target first cluster center that matches the video information to be identified may be determined among at least one first cluster center, and a target second cluster center that matches the video information to be identified may be determined among at least one second cluster center. A target data set of the plurality of data sets is then determined according to the target first cluster center and the target second cluster center. Next, in the graph structure of the database, the original video information matching the video information to be identified is retrieved with the node corresponding to the target data set as a starting point, and an identification result is obtained. For the specific method of retrieving the original video information, reference may be made to the data retrieval method described above, which is not repeated here. Video information similar to the video information to be identified is then determined according to the identification result, thereby obtaining the video similar to the video to be identified.
The data storage device provided by the present disclosure will be described below with reference to fig. 8.
Fig. 8 schematically illustrates a block diagram of a data storage device according to an embodiment of the present disclosure.
As shown in fig. 8, the data storage device 800 includes a first clustering module 810, a second clustering module 820, a partitioning module 830, a graph generation module 840, and a storage module 850.
The first clustering module 810 is configured to cluster the plurality of data to obtain at least one first cluster center.
The second clustering module 820 is configured to cluster the plurality of data according to the at least one first cluster center to obtain at least one second cluster center.
The partitioning module 830 is configured to partition the plurality of data into at least one data group according to the at least one first cluster center and the at least one second cluster center.
The graph generation module 840 is configured to generate a graph structure according to the at least one data group.
The storage module 850 is configured to store the graph structure in a database.
According to an embodiment of the present disclosure, the second clustering module may include: a residual determination sub-module for determining a residual between each data in the plurality of data and the first cluster center closest to the data, to obtain a residual vector space; and a residual clustering sub-module for clustering the plurality of data based on the residual vector space, to obtain at least one second cluster center.
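The residual determination step may be sketched as follows, assuming that the residual is the element-wise difference between a data vector and its nearest first cluster center under the Euclidean distance. The function name and sample values are hypothetical, and the subsequent clustering of the residuals is not shown.

```python
import math


def compute_residuals(data, first_centers):
    """Return (nearest_center_index, residual_vector) for every data vector."""
    out = []
    for vec in data:
        idx = min(
            range(len(first_centers)),
            key=lambda i: math.dist(vec, first_centers[i]),
        )
        center = first_centers[idx]
        out.append((idx, tuple(x - c for x, c in zip(vec, center))))
    return out


# Hypothetical first cluster centers and data vectors.
first_centers = [(0.0, 0.0), (10.0, 10.0)]
data = [(1.0, -1.0), (9.0, 12.0)]
res = compute_residuals(data, first_centers)
```

The collection of residual vectors forms the residual vector space on which the second clustering is then performed.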
According to an embodiment of the present disclosure, the graph generation module may include: a node generating sub-module for determining a node in the graph structure according to each data in the at least one data set; and an edge generation sub-module for generating edges between the nodes according to the corresponding relation between the nodes and the data set, wherein the edges are used for representing the neighbor relation between the nodes.
According to embodiments of the present disclosure, the edges include adjacent edges and neighboring edges. The edge generation sub-module may include: an adjacent edge generation unit configured to generate an adjacent edge between any two nodes in a case where the two nodes correspond to the same data set; a similarity determination unit configured to determine a similarity between two nodes in a case where the two nodes correspond to different data sets; and a neighboring edge generation unit configured to generate a neighboring edge between the two nodes in a case where the similarity is higher than a similarity threshold.
According to an embodiment of the present disclosure, the first clustering module may include: the original characteristic determining submodule is used for determining the original characteristic of each data in the plurality of data to obtain an original characteristic space; and the original characteristic clustering sub-module is used for clustering a plurality of data based on the original characteristic space to obtain at least one first clustering center.
According to an embodiment of the present disclosure, the partitioning module may include: and the dividing sub-module is used for dividing the data corresponding to the same first clustering center and the same second clustering center in the plurality of data into a data group.
According to an embodiment of the present disclosure, the above data storage device may further include: the computing module is used for computing a first distance between each first clustering center and each second clustering center; and a recording module for recording the first distance in the database.
The data retrieval device provided by the present disclosure will be described below with reference to fig. 9.
Fig. 9 schematically shows a block diagram of a data retrieval device according to an embodiment of the present disclosure.
As shown in fig. 9, the data retrieval apparatus 900 includes a first cluster center determination module 910, a second cluster center determination module 920, a first target data group determination module 930, and a first retrieval module 940.
The first cluster center determination module 910 is configured to determine, among at least one first cluster center, a target first cluster center that matches the data to be retrieved.
The second cluster center determination module 920 is configured to determine, among at least one second cluster center, a target second cluster center that matches the data to be retrieved.
A first target data set determining module 930, configured to determine a target data set of the plurality of data sets according to the target first cluster center and the target second cluster center.
The first retrieving module 940 is configured to retrieve, in a graph structure of a database, data matching the data to be retrieved with a node corresponding to the target data set as a starting point, to obtain a retrieval result, where the graph structure includes a plurality of nodes, the plurality of nodes are in one-to-one correspondence with a plurality of original data, and the plurality of original data are stored in the database according to the data storage method according to the embodiment of the present disclosure.
According to an embodiment of the present disclosure, the first cluster center determination module may include: a first distance calculation sub-module for calculating a distance between each first cluster center of the at least one first cluster center and the data to be retrieved; and a first target cluster center determination submodule for determining a first cluster center with a distance smaller than a first distance threshold value in at least one first cluster center as a target first cluster center.
According to an embodiment of the present disclosure, the second cluster center determination module may include: a first distance acquisition sub-module for acquiring a first distance between the target first cluster center and the second cluster center; a first upper bound determination sub-module for determining a distance upper bound according to the first distance and a second distance between the target first cluster center and the data to be retrieved; a first lower bound determination sub-module for determining a distance lower bound according to the first distance, the second distance, and the residual corresponding to the second cluster center; and a second target cluster center determination sub-module for determining, as a target second cluster center, a second cluster center of the at least one second cluster center whose distance from the data to be retrieved matches the distance upper bound and the distance lower bound.
According to an embodiment of the present disclosure, the first retrieval module may include: a first adding sub-module for adding a node corresponding to the target data set to the candidate set; a first neighbor determination sub-module for determining, for each node in the candidate set, a neighbor node having a neighbor relation with the node; a first target node determination sub-module for determining whether a target node matching the data to be retrieved exists among the neighbor nodes; a first updating sub-module for, in the case that a target node matching the data to be retrieved exists among the neighbor nodes, updating the candidate set according to the target node and returning, for the updated candidate set, to the operation of determining, for each node in the candidate set, a neighbor node having a neighbor relation with the node; and a retrieval result determination sub-module for determining the candidate set as the retrieval result in the case that no target node matching the data to be retrieved exists among the neighbor nodes.
According to an embodiment of the present disclosure, the first target node determination submodule may include: the first similarity calculation unit is used for calculating the similarity between the neighbor node and the data to be retrieved; and a first target node determining unit configured to determine, as the target node, the neighboring node in a case where the similarity is greater than a similarity threshold.
The apparatus for recognizing similar videos provided in the present disclosure will be described below with reference to fig. 10.
Fig. 10 schematically illustrates a block diagram of an apparatus for identifying similar videos according to an embodiment of the disclosure.
As shown in fig. 10, the apparatus 1000 for identifying similar videos includes a third cluster center determination module 1010, a fourth cluster center determination module 1020, a second target data set determination module 1030, and a second retrieval module 1040.
The third cluster center determination module 1010 is configured to determine, among at least one first cluster center, a target first cluster center that matches the video information to be identified.
The fourth cluster center determination module 1020 is configured to determine, among at least one second cluster center, a target second cluster center that matches the video information to be identified.
A second target data set determining module 1030 is configured to determine a target data set of the plurality of data sets according to the target first cluster center and the target second cluster center.
The second retrieving module 1040 is configured to retrieve, in a graph structure of the database, video information matching with video information to be identified with a node corresponding to a target data set as a starting point, and obtain an identification result, where the graph structure includes a plurality of nodes, the plurality of nodes and a plurality of original video information are in one-to-one correspondence, and the plurality of original video information is stored in the database according to the data storage method according to the embodiment of the present disclosure.
According to an embodiment of the present disclosure, the third cluster center determination module may include: a second distance calculation sub-module for calculating a distance between each first cluster center of the at least one first cluster center and the video information to be identified; and a third target cluster center determination sub-module for determining, as a target first cluster center, a first cluster center of the at least one first cluster center whose distance is smaller than a first distance threshold.
According to an embodiment of the present disclosure, the fourth cluster center determination module may include: a second distance acquisition sub-module for acquiring a first distance between the target first cluster center and the second cluster center; a second upper bound determination sub-module for determining a distance upper bound according to the first distance and a second distance between the target first cluster center and the video information to be identified; a second lower bound determination sub-module for determining a distance lower bound according to the first distance, the second distance, and the residual corresponding to the second cluster center; and a fourth target cluster center determination sub-module for determining, as a target second cluster center, a second cluster center of the at least one second cluster center whose distance from the video information to be identified matches the distance upper bound and the distance lower bound.
According to an embodiment of the present disclosure, the second retrieval module may include: a second adding sub-module for adding a node corresponding to the target data set to the candidate set; a second neighbor determination sub-module for determining, for each node in the candidate set, a neighbor node having a neighbor relation with the node; a second target node determination sub-module for determining whether a target node matching the video information to be identified exists among the neighbor nodes; a second updating sub-module for, in the case that a target node matching the video information to be identified exists among the neighbor nodes, updating the candidate set according to the target node and returning, for the updated candidate set, to the operation of determining, for each node in the candidate set, a neighbor node having a neighbor relation with the node; and an identification result determination sub-module for determining the video information corresponding to the candidate set as the identification result in the case that no target node matching the video information to be identified exists among the neighbor nodes.
According to an embodiment of the present disclosure, the second target node determination submodule may include: the second similarity calculation unit is used for calculating the similarity between the neighbor node and the video information to be identified; and a second target node determining unit configured to determine, as the target node, the neighboring node in a case where the similarity is greater than the similarity threshold.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 11 schematically illustrates a block diagram of an example electronic device 1100 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the apparatus 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the device 1100 can also be stored. The computing unit 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
Various components in device 1100 are connected to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, etc.; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108, such as a magnetic disk, optical disk, etc.; and a communication unit 1109 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the respective methods and processes described above, such as a data storage method, a data retrieval method, and a method of identifying similar videos. For example, in some embodiments, the data storage method, the data retrieval method, and the method of identifying similar videos may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1108. In some embodiments, some or all of the computer programs may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the data storage method, the data retrieval method, and the method of identifying similar videos described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the data storage method, the data retrieval method, and the method of identifying similar videos in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that addresses the high management difficulty and weak service scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (33)

1. A data retrieval method comprising:
determining a graph structure from a database, wherein the graph structure comprises a plurality of nodes, the plurality of nodes are in one-to-one correspondence with a plurality of original data, and the plurality of nodes comprise at least one first cluster center and at least one second cluster center;
determining, from the at least one first cluster center, a target first cluster center matching data to be retrieved;
obtaining a distance between the target first cluster center and each second cluster center as a first distance;
determining a distance between the target first cluster center and the data to be retrieved as a second distance;
determining a distance upper bound according to the first distance and the second distance;
determining a distance lower bound according to the first distance, the second distance, and a residual corresponding to the second cluster center;
determining a target second cluster center from the at least one second cluster center, wherein a distance between the target second cluster center and the data to be retrieved is greater than or equal to the distance lower bound and less than or equal to the distance upper bound;
determining a target data set among a plurality of data sets according to the target first cluster center and the target second cluster center; and
retrieving, in the graph structure of the database, data matching the data to be retrieved, with a node corresponding to the target data set as a starting point, to obtain a retrieval result.
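As a non-limiting illustration of the bound steps in claim 1, the following Python sketch shows one way the upper and lower bounds could be computed. The claim does not state the formulas; the triangle-inequality forms used here, the way the residual is subtracted, and the names `distance_bounds` and `in_bounds` are all assumptions made for illustration.

```python
def distance_bounds(first_distance, second_distance, residual):
    """Return (lower, upper) bounds on the query-to-second-center distance.

    ASSUMPTION: the claim gives no formulas. This sketch uses the triangle
    inequality for the upper bound and loosens the reverse triangle
    inequality by the residual for the lower bound.
    """
    upper = first_distance + second_distance
    lower = max(0.0, abs(first_distance - second_distance) - residual)
    return lower, upper


def in_bounds(distance, lower, upper):
    """True when a distance lies within the closed interval [lower, upper]."""
    return lower <= distance <= upper


# first distance: target first cluster center to a second cluster center
# second distance: target first cluster center to the data to be retrieved
lower, upper = distance_bounds(3.0, 5.0, 1.0)
print(lower, upper, in_bounds(4.0, lower, upper))
```

Because the first distances do not depend on the query, they can be looked up rather than recomputed, which is consistent with claim 11 recording them in the database.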
2. The method of claim 1, wherein determining, from the at least one first cluster center, the target first cluster center matching the data to be retrieved comprises:
calculating a distance between each first cluster center of the at least one first cluster center and the data to be retrieved; and
determining, as the target first cluster center, a first cluster center of the at least one first cluster center whose distance is smaller than a first distance threshold.
3. The method of claim 1, wherein retrieving, in the graph structure of the database, the data matching the data to be retrieved, with the node corresponding to the target data set as a starting point, to obtain the retrieval result comprises:
adding the node corresponding to the target data set to a candidate set;
determining, for each node in the candidate set, a neighbor node having a neighbor relationship with that node;
determining whether a target node matching the data to be retrieved exists among the neighbor nodes;
in a case where a target node matching the data to be retrieved exists among the neighbor nodes, updating the candidate set according to the target node, and returning, for the updated candidate set, to the operation of determining, for each node in the candidate set, a neighbor node having a neighbor relationship with that node; and
in a case where no target node matching the data to be retrieved exists among the neighbor nodes, determining the original data corresponding to the candidate set as the retrieval result.
4. The method of claim 3, wherein determining whether a target node matching the data to be retrieved exists among the neighbor nodes comprises:
calculating a similarity between the neighbor node and the data to be retrieved; and
determining the neighbor node as the target node in a case where the similarity is greater than a similarity threshold.
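As a non-limiting illustration of the iterative search in claims 3 and 4, the following Python sketch expands a candidate set through neighbor relationships until no new matching neighbor is found. The adjacency-dictionary representation, the reading of "updating the candidate set" as adding the target nodes, and the names `graph_search`, `graph`, and `similarity` are assumptions made for illustration.

```python
def graph_search(graph, start_nodes, similarity, threshold):
    """Expand a candidate set over a neighbor graph.

    graph: dict mapping a node to an iterable of its neighbor nodes.
    similarity: callable returning a node's similarity to the query.
    ASSUMPTION: "updating the candidate set" is read as adding target nodes.
    """
    candidates = set(start_nodes)
    # Only newly added nodes are re-expanded; neighbors of older nodes
    # were already evaluated in an earlier round.
    frontier = set(start_nodes)
    while frontier:
        targets = set()
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor not in candidates and similarity(neighbor) > threshold:
                    targets.add(neighbor)
        candidates |= targets
        frontier = targets  # empty when no target node exists, ending the loop
    return candidates


graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
print(sorted(graph_search(graph, ["a"], scores.get, 0.5)))
```

The loop terminates because each round either adds at least one node that was not yet a candidate or stops, and the graph is finite.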
5. The method of claim 1, wherein the graph structure is stored in the database by:
clustering a plurality of data to obtain the at least one first cluster center;
clustering the plurality of data according to the at least one first cluster center to obtain the at least one second cluster center;
dividing the plurality of data into at least one data set according to the at least one first cluster center and the at least one second cluster center;
generating the graph structure according to the at least one data set; and
storing the graph structure in the database.
6. The method of claim 5, wherein clustering the plurality of data according to the at least one first cluster center to obtain the at least one second cluster center comprises:
determining a residual between each data of the plurality of data and the first cluster center closest to that data, to obtain a residual vector space; and
clustering the plurality of data based on the residual vector space to obtain the at least one second cluster center.
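As a non-limiting illustration of the residual step in claim 6, the following Python sketch computes, for each data vector, its residual relative to the nearest first cluster center. Euclidean distance, tuple-valued vectors, and the name `residuals_to_nearest_center` are assumptions made for illustration; the subsequent clustering of the residuals (for example with k-means) is not shown.

```python
import math


def residuals_to_nearest_center(data, first_centers):
    """Return one residual vector per data vector.

    ASSUMPTION: "closest" means smallest Euclidean distance, and the
    residual is the element-wise difference (data minus center).
    """
    residuals = []
    for point in data:
        nearest = min(first_centers, key=lambda center: math.dist(point, center))
        residuals.append(tuple(p - c for p, c in zip(point, nearest)))
    return residuals


data = [(1.0, 1.0), (9.0, 10.0)]
first_centers = [(0.0, 0.0), (10.0, 10.0)]
print(residuals_to_nearest_center(data, first_centers))
```

The resulting list of residual vectors plays the role of the residual vector space on which the second clustering is performed.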
7. The method of claim 5, wherein generating the graph structure according to the at least one data set comprises:
determining the nodes in the graph structure according to each data in the at least one data set; and
generating edges between the nodes according to a correspondence between the nodes and the data sets, wherein the edges represent neighbor relationships between the nodes.
8. The method of claim 7, wherein the edges comprise adjacent edges and neighboring edges, and generating the edges between the nodes according to the correspondence between the nodes and the data sets comprises:
for any two nodes, generating an adjacent edge between the two nodes in a case where the two nodes correspond to the same data set; and
in a case where the two nodes correspond to different data sets:
determining a similarity between the two nodes; and
generating a neighboring edge between the two nodes in a case where the similarity is higher than a similarity threshold.
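As a non-limiting illustration of the edge generation in claim 8, the following Python sketch produces two kinds of edges: one between nodes of the same data set, and one between sufficiently similar nodes of different data sets. The names `build_edges`, `group_of`, `intra_edges`, and `inter_edges`, and the pairwise `similarity` callable, are assumptions made for illustration.

```python
from itertools import combinations


def build_edges(group_of, similarity, threshold):
    """Return (intra_edges, inter_edges) over all unordered node pairs.

    group_of: dict mapping each node to the identifier of its data set.
    similarity: callable taking two nodes and returning their similarity.
    """
    intra_edges, inter_edges = set(), set()
    for a, b in combinations(sorted(group_of), 2):
        if group_of[a] == group_of[b]:
            # Same data set: always connected.
            intra_edges.add((a, b))
        elif similarity(a, b) > threshold:
            # Different data sets: connected only when similar enough.
            inter_edges.add((a, b))
    return intra_edges, inter_edges


group_of = {"a": 0, "b": 0, "c": 1}
sim = lambda x, y: 0.9 if {x, y} == {"b", "c"} else 0.1
print(build_edges(group_of, sim, 0.5))
```

The all-pairs loop is quadratic in the number of nodes and is shown only for clarity; a practical build would likely restrict the cross-set comparisons to nearby data sets.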
9. The method of claim 5, wherein clustering the plurality of data to obtain the at least one first cluster center comprises:
determining an original feature of each data of the plurality of data to obtain an original feature space; and
clustering the plurality of data based on the original feature space to obtain the at least one first cluster center.
10. The method of claim 5, wherein dividing the plurality of data into at least one data set according to the at least one first cluster center and the at least one second cluster center comprises:
dividing data of the plurality of data that correspond to the same first cluster center and the same second cluster center into one data set.
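As a non-limiting illustration of the division in claim 10, the following Python sketch places data that share both a first cluster center and a second cluster center into the same data set. The triple-based input format and the name `partition_into_groups` are assumptions made for illustration.

```python
from collections import defaultdict


def partition_into_groups(assignments):
    """Group items by their (first center id, second center id) pair.

    assignments: iterable of (item, first_center_id, second_center_id).
    """
    groups = defaultdict(list)
    for item, first_id, second_id in assignments:
        groups[(first_id, second_id)].append(item)
    return dict(groups)


assignments = [("v1", 0, 2), ("v2", 0, 2), ("v3", 0, 1), ("v4", 1, 2)]
print(partition_into_groups(assignments))
```

Keying each data set by the pair of center identifiers lets the target data set of claim 1 be found by a direct lookup once the target first and second cluster centers are known.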
11. The method of claim 5, further comprising:
calculating a first distance between each first cluster center and each second cluster center; and
recording the first distance in the database.
12. A method of identifying similar videos, comprising:
determining a graph structure from a database, wherein the graph structure comprises a plurality of nodes, the plurality of nodes are in one-to-one correspondence with a plurality of original video information, and the plurality of original video information comprises at least one first cluster center and at least one second cluster center;
determining, from the at least one first cluster center, a target first cluster center matching video information to be identified;
obtaining a distance between the target first cluster center and each second cluster center as a first distance;
determining a distance between the target first cluster center and the video information to be identified as a second distance;
determining a distance upper bound according to the first distance and the second distance;
determining a distance lower bound according to the first distance, the second distance, and a residual corresponding to the second cluster center;
determining a target second cluster center from the at least one second cluster center, wherein a distance between the target second cluster center and the video information to be identified is greater than or equal to the distance lower bound and less than or equal to the distance upper bound;
determining a target data set among a plurality of data sets according to the target first cluster center and the target second cluster center; and
retrieving, in the graph structure of the database, video information matching the video information to be identified, with a node corresponding to the target data set as a starting point, to obtain an identification result.
13. The method of claim 12, wherein determining, from the at least one first cluster center, the target first cluster center matching the video information to be identified comprises:
calculating a distance between each first cluster center of the at least one first cluster center and the video information to be identified; and
determining, as the target first cluster center, a first cluster center of the at least one first cluster center whose distance is smaller than a first distance threshold.
14. The method of claim 12, wherein retrieving, in the graph structure of the database, the video information matching the video information to be identified, with the node corresponding to the target data set as a starting point, to obtain the identification result comprises:
adding the node corresponding to the target data set to a candidate set;
determining, for each node in the candidate set, a neighbor node having a neighbor relationship with that node;
determining whether a target node matching the video information to be identified exists among the neighbor nodes;
in a case where a target node matching the video information to be identified exists among the neighbor nodes, updating the candidate set according to the target node, and returning, for the updated candidate set, to the operation of determining, for each node in the candidate set, a neighbor node having a neighbor relationship with that node; and
in a case where no target node matching the video information to be identified exists among the neighbor nodes, determining the video information corresponding to the candidate set as the identification result.
15. The method of claim 14, wherein determining whether a target node matching the video information to be identified exists among the neighbor nodes comprises:
calculating a similarity between the neighbor node and the video information to be identified; and
determining the neighbor node as the target node in a case where the similarity is greater than a similarity threshold.
16. A data retrieval apparatus comprising:
a first graph structure determining module configured to determine a graph structure from a database, wherein the graph structure comprises a plurality of nodes, the plurality of nodes are in one-to-one correspondence with a plurality of original data, and the plurality of nodes comprise at least one first cluster center and at least one second cluster center;
a first cluster center determining module configured to determine, from the at least one first cluster center, a target first cluster center matching data to be retrieved;
a first distance determining module configured to obtain a distance between the target first cluster center and each second cluster center as a first distance;
a second distance determining module configured to determine a distance between the target first cluster center and the data to be retrieved as a second distance;
a distance upper bound determining module configured to determine a distance upper bound according to the first distance and the second distance;
a distance lower bound determining module configured to determine a distance lower bound according to the first distance, the second distance, and a residual corresponding to the second cluster center;
a second cluster center determining module configured to determine a target second cluster center from the at least one second cluster center, wherein a distance between the target second cluster center and the data to be retrieved is greater than or equal to the distance lower bound and less than or equal to the distance upper bound;
a first target data set determining module configured to determine a target data set among a plurality of data sets according to the target first cluster center and the target second cluster center; and
a first retrieval module configured to retrieve, in the graph structure of the database, data matching the data to be retrieved, with a node corresponding to the target data set as a starting point, to obtain a retrieval result.
17. The apparatus of claim 16, wherein the first cluster center determination module comprises:
a first distance calculation sub-module configured to calculate a distance between each first cluster center of the at least one first cluster center and the data to be retrieved; and
a first target cluster center determining sub-module configured to determine, as the target first cluster center, a first cluster center of the at least one first cluster center whose distance is smaller than a first distance threshold.
18. The apparatus of claim 16, wherein the first retrieval module comprises:
a first adding sub-module configured to add the node corresponding to the target data set to a candidate set;
a first neighbor determining sub-module configured to determine, for each node in the candidate set, a neighbor node having a neighbor relationship with that node;
a first target node determining sub-module configured to determine whether a target node matching the data to be retrieved exists among the neighbor nodes;
a first updating sub-module configured to, in a case where a target node matching the data to be retrieved exists among the neighbor nodes, update the candidate set according to the target node and return, for the updated candidate set, to the operation of determining, for each node in the candidate set, a neighbor node having a neighbor relationship with that node; and
a retrieval result determining sub-module configured to determine the original data corresponding to the candidate set as the retrieval result in a case where no target node matching the data to be retrieved exists among the neighbor nodes.
19. The apparatus of claim 18, wherein the first target node determination submodule comprises:
a first similarity calculation unit configured to calculate a similarity between the neighbor node and the data to be retrieved; and
a first target node determining unit configured to determine the neighbor node as the target node in a case where the similarity is greater than a similarity threshold.
20. The apparatus of claim 16, wherein the apparatus further comprises:
a first clustering module configured to cluster a plurality of data to obtain the at least one first cluster center;
a second clustering module configured to cluster the plurality of data according to the at least one first cluster center to obtain the at least one second cluster center;
a partitioning module configured to divide the plurality of data into at least one data set according to the at least one first cluster center and the at least one second cluster center;
a graph generating module configured to generate the graph structure according to the at least one data set; and
a storage module configured to store the graph structure in the database.
21. The apparatus of claim 20, wherein the second clustering module comprises:
a residual determining sub-module configured to determine a residual between each data of the plurality of data and the first cluster center closest to that data, to obtain a residual vector space; and
a residual clustering sub-module configured to cluster the plurality of data based on the residual vector space to obtain the at least one second cluster center.
22. The apparatus of claim 20, wherein the graph generating module comprises:
a node generating sub-module configured to determine the nodes in the graph structure according to each data in the at least one data set; and
an edge generating sub-module configured to generate edges between the nodes according to a correspondence between the nodes and the data sets, wherein the edges represent neighbor relationships between the nodes.
23. The apparatus of claim 22, wherein the edges comprise adjacent edges and neighboring edges, and the edge generating sub-module comprises:
an adjacent edge generating unit configured to, for any two nodes, generate an adjacent edge between the two nodes in a case where the two nodes correspond to the same data set;
a similarity determining unit configured to determine a similarity between the two nodes in a case where the two nodes correspond to different data sets; and
a neighboring edge generating unit configured to generate a neighboring edge between the two nodes in a case where the similarity is higher than a similarity threshold.
24. The apparatus of claim 20, wherein the first clustering module comprises:
an original feature determining sub-module configured to determine an original feature of each data of the plurality of data to obtain an original feature space; and
an original feature clustering sub-module configured to cluster the plurality of data based on the original feature space to obtain the at least one first cluster center.
25. The apparatus of claim 20, wherein the partitioning module comprises:
a dividing sub-module configured to divide data of the plurality of data that correspond to the same first cluster center and the same second cluster center into one data set.
26. The apparatus of claim 20, further comprising:
a computing module configured to calculate a first distance between each first cluster center and each second cluster center; and
a recording module configured to record the first distance in the database.
27. An apparatus for identifying similar videos, comprising:
a second graph structure determining module configured to determine a graph structure from a database, wherein the graph structure comprises a plurality of nodes, the plurality of nodes are in one-to-one correspondence with a plurality of original video information, and the plurality of original video information comprises at least one first cluster center and at least one second cluster center;
a third cluster center determining module configured to determine, from the at least one first cluster center, a target first cluster center matching video information to be identified;
a fourth cluster center determining module configured to: obtain a distance between the target first cluster center and each second cluster center as a first distance; determine a distance between the target first cluster center and the video information to be identified as a second distance; determine a distance upper bound according to the first distance and the second distance; determine a distance lower bound according to the first distance, the second distance, and a residual corresponding to the second cluster center; and determine a target second cluster center from the at least one second cluster center, wherein a distance between the target second cluster center and the video information to be identified is greater than or equal to the distance lower bound and less than or equal to the distance upper bound;
a second target data set determining module configured to determine a target data set among a plurality of data sets according to the target first cluster center and the target second cluster center; and
a second retrieval module configured to retrieve, in the graph structure of the database, video information matching the video information to be identified, with a node corresponding to the target data set as a starting point, to obtain an identification result.
28. The apparatus of claim 27, wherein the third cluster center determining module comprises:
a second distance calculation sub-module configured to calculate a distance between each first cluster center of the at least one first cluster center and the video information to be identified; and
a third target cluster center determining sub-module configured to determine, as the target first cluster center, a first cluster center of the at least one first cluster center whose distance is smaller than a first distance threshold.
29. The apparatus of claim 27, wherein the second retrieval module comprises:
a second adding sub-module configured to add the node corresponding to the target data set to a candidate set;
a second neighbor determining sub-module configured to determine, for each node in the candidate set, a neighbor node having a neighbor relationship with that node;
a second target node determining sub-module configured to determine whether a target node matching the video information to be identified exists among the neighbor nodes;
a second updating sub-module configured to, in a case where a target node matching the video information to be identified exists among the neighbor nodes, update the candidate set according to the target node and return, for the updated candidate set, to the operation of determining, for each node in the candidate set, a neighbor node having a neighbor relationship with that node; and
an identification result determining sub-module configured to determine the video information corresponding to the candidate set as the identification result in a case where no target node matching the video information to be identified exists among the neighbor nodes.
30. The apparatus of claim 29, wherein the second target node determination submodule comprises:
a second similarity calculation unit configured to calculate a similarity between the neighbor node and the video information to be identified; and
a second target node determining unit configured to determine the neighbor node as the target node in a case where the similarity is greater than a similarity threshold.
31. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of any one of claims 1-15.
32. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-15.
33. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method according to any of claims 1-15.
CN202310215233.2A 2023-02-28 2023-02-28 Data storage method, data retrieval method and method for identifying similar video Active CN116304253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310215233.2A CN116304253B (en) 2023-02-28 2023-02-28 Data storage method, data retrieval method and method for identifying similar video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310215233.2A CN116304253B (en) 2023-02-28 2023-02-28 Data storage method, data retrieval method and method for identifying similar video

Publications (2)

Publication Number Publication Date
CN116304253A CN116304253A (en) 2023-06-23
CN116304253B true CN116304253B (en) 2024-05-07

Family

ID=86791928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310215233.2A Active CN116304253B (en) 2023-02-28 2023-02-28 Data storage method, data retrieval method and method for identifying similar video

Country Status (1)

Country Link
CN (1) CN116304253B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656373A (en) * 2021-08-16 2021-11-16 百度在线网络技术(北京)有限公司 Method, device, equipment and storage medium for constructing retrieval database
CN114817657A (en) * 2022-04-29 2022-07-29 上海徐毓智能科技有限公司 To-be-retrieved data processing method, data retrieval method, electronic device and medium
CN115169489A (en) * 2022-07-25 2022-10-11 北京百度网讯科技有限公司 Data retrieval method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230035337A1 (en) * 2021-07-13 2023-02-02 Baidu Usa Llc Norm adjusted proximity graph for fast inner product retrieval

Also Published As

Publication number Publication date
CN116304253A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN108038183B (en) Structured entity recording method, device, server and storage medium
TWI696081B (en) Sample set processing method and device, sample query method and device
US10311288B1 (en) Determining identity of a person in a digital image
CN112765477B (en) Information processing method and device, information recommendation method and device, electronic equipment and storage medium
CN114444619B (en) Sample generation method, training method, data processing method and electronic device
CN112860993B (en) Method, device, equipment, storage medium and program product for classifying points of interest
CN112035626A (en) Rapid identification method and device for large-scale intentions and electronic equipment
CN113656373A (en) Method, device, equipment and storage medium for constructing retrieval database
WO2022007596A1 (en) Image retrieval system, method and apparatus
CN113408660B (en) Book clustering method, device, equipment and storage medium
CN113010752B (en) Recall content determining method, apparatus, device and storage medium
CN110209895B (en) Vector retrieval method, device and equipment
CN115169489B (en) Data retrieval method, device, equipment and storage medium
CN114897666B (en) Graph data storage, access, processing method, training method, device and medium
CN116304253B (en) Data storage method, data retrieval method and method for identifying similar video
CN115934724A (en) Method for constructing database index, retrieval method, device, equipment and medium
CN113220840B (en) Text processing method, device, equipment and storage medium
CN111859192B (en) Searching method, searching device, electronic equipment and storage medium
CN112860626B (en) Document ordering method and device and electronic equipment
CN115146103A (en) Image retrieval method, image retrieval apparatus, computer device, storage medium, and program product
CN110321435B (en) Data source dividing method, device, equipment and storage medium
CN113961720A (en) Method for predicting entity relationship and method and device for training relationship prediction model
CN108090182B (en) A kind of distributed index method and system of extensive high dimensional data
CN111460088A (en) Similar text retrieval method, device and system
CN115794984B (en) Data storage method, data retrieval method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant