CN117056459A - Vector recall method and device - Google Patents

Vector recall method and device

Info

Publication number
CN117056459A
CN117056459A (application CN202310988298.0A)
Authority
CN
China
Prior art keywords
word
data
words
rule
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310988298.0A
Other languages
Chinese (zh)
Other versions
CN117056459B (en)
Inventor
时迎超
王杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wangpin Information Technology Co ltd
Original Assignee
Beijing Wangpin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wangpin Information Technology Co ltd filed Critical Beijing Wangpin Information Technology Co ltd
Priority to CN202310988298.0A priority Critical patent/CN117056459B/en
Publication of CN117056459A publication Critical patent/CN117056459A/en
Application granted granted Critical
Publication of CN117056459B publication Critical patent/CN117056459B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a vector recall method and belongs to the technical field of data processing. The method comprises the following steps: (1) constructing a neural network whose input is a given word (the center word) in a sentence and whose output is the probability of every other word appearing around that center word; the output words may be words adjacent to the center word, related words appearing near it, or unrelated words; (2) training the network, where the parameter matrix of the neural network represents the features of the input text and the output is the word vectors; (3) continuously updating by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words. The method effectively improves the business-side effect, supports writing the vectors into the library, and helps improve recommendation and search business indicators.

Description

Vector recall method and device
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a vector recall method and device.
Background
Natural language processing (NLP) is a discipline that studies language problems in human-computer interaction. In general, natural language processing includes natural language understanding and natural language generation. Natural language understanding converts natural language into a form a computer can understand, turning unstructured text into structured information. Specifically, NLP technology can perform semantic recognition and recommendation on natural language, or call related services, to provide more intelligent functions for users. With the development of artificial intelligence, NLP technology has become increasingly mature.
Unlike conventional ItemCF/UserCF, vectorized recall uses a vector retrieval tool to perform a fast nearest-neighbor search in vector space based on Euclidean distance. Offline, a model is trained, generally so that the similarity of positive sample pairs is as high as possible compared with negative sample pairs; once the model converges, the embeddings are obtained, and the embeddings of the static part (the queried items) are written into the Faiss tool to build an index for online queries. Online, the model outputs a corresponding embedding for each dynamic-part (active query) request, and the Top-K static contents are searched in Faiss and returned as the Top-K recall result.
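As a minimal illustration of this offline/online split (not taken from the patent text; the dimensionality, data and variable names below are assumptions), the following Python sketch writes static embeddings into a Faiss index offline and retrieves the Top-K nearest static items for an online query embedding:

```python
import numpy as np
import faiss  # the vector retrieval tool mentioned above

d, n_static, top_k = 64, 10000, 10                 # assumed embedding size and corpus size

# Offline: embeddings of the static (queried) part, e.g. produced by a converged model.
static_embeddings = np.random.rand(n_static, d).astype("float32")
index = faiss.IndexFlatL2(d)                        # exact Euclidean-distance index
index.add(static_embeddings)                        # write the static embeddings into Faiss

# Online: the model outputs an embedding for the dynamic (active query) part,
# and the Top-K static items are returned as the recall result.
query_embedding = np.random.rand(1, d).astype("float32")
distances, item_ids = index.search(query_embedding, top_k)
print(item_ids[0])                                  # ids of the Top-K recalled static items
```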
In vectorized recall, the number of static items generally ranges from hundreds to hundreds of millions; for example, recommendation scenarios typically have millions of candidate materials, while NLP semantic-similarity recall may involve anywhere from hundreds to tens of thousands of items. When NLP is used for text classification with a relatively large number of classes, a pure classification approach struggles to reach high accuracy, and the vectorized-recall approach can solve this problem. Because the static part faced by vector recall is relatively large, letting all negative samples (everything except the positives) participate in the calculation is difficult to realize in engineering and greatly increases the computational cost, so a negative-sampling strategy is generally adopted for sample processing. Of course, if the static part contains only hundreds of items, all negative samples can participate in the calculation, which achieves a better effect.
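A small illustrative sketch of such a negative-sampling strategy (an assumption for clarity, not the patent's sampler) pairs each positive sample with a handful of randomly drawn negatives instead of the full static corpus:

```python
import random

def sample_training_pairs(positive_pairs, static_ids, num_negatives=5, seed=0):
    """For each (query, positive_item) pair, draw a few random negatives from the
    static corpus instead of letting every non-positive item join the calculation."""
    rng = random.Random(seed)
    samples = []
    for query, pos_item in positive_pairs:
        negatives = []
        while len(negatives) < num_negatives:
            candidate = rng.choice(static_ids)
            if candidate != pos_item:
                negatives.append(candidate)
        samples.append((query, pos_item, negatives))
    return samples

print(sample_training_pairs([("python engineer", 42)], static_ids=list(range(1000))))
```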
FIG. 4 shows a vector recall method used in the prior art. Compared with the present application, the prior art still uses fixed combinations of atomic strategies, and the effect of such hard-coded recall strategies is poor. When the platform was developed, the following problems were evident: the recall mode is single, and the multiple forms of recall do not complement one another's strengths and weaknesses; the strategy granularity is coarse and limited to the channel dimension, so JD-personalized recall cannot be achieved; and production efficiency is low, making it difficult to traverse all effective strategy combinations, so the recall results are inefficient and the effect is poor.
Disclosure of Invention
1. Problems to be solved
Aiming at the problems in the prior art, the application provides a vector recall method that trains a double-tower structure with query-title data, uses the title tower to calculate and provide vectors, extracts words with a word model, represents them as vectors, and packs them into batch data so that the model can compute in parallel and improve performance.
2. Technical proposal
In order to solve the problems, the application adopts the following technical scheme.
A vector recall method comprising the steps of:
(1) Constructing a neural network whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word; the output words may be words adjacent to the center word, related words appearing near it, or unrelated words;
(2) Training the network with a vector recall model, where the parameter matrix of the neural network represents the features of the input text and the output is the word vectors;
(3) Continuously updating by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words.
The vector recall method described above,
before constructing the neural network in step (1), a neural-network library system is constructed, comprising a DGgraph distributed database, a Type system, an API (application program interface) and a Graph Engine; the DGgraph distributed database is connected to the MetaData Store (metadata storage area) and the Index Store (index storage area) respectively; the DGgraph distributed database, the Type system, the API and the Graph Engine are connected in sequence.
The vector recall method described above,
the probability calculation rule in step (1) is as follows:
setting a rule filter, which comprises a rule executor, an input-parameter rule matcher, a scene-planning filter, a rule-relation maintainer, a model adapter and an execution-result analyzer; the rule executor is connected to the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer respectively; the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer are each connected to the model adapter, and the model adapter is connected to the execution-result analyzer; the rule executor comprises a simple rule executor, a parallel-planning executor, a rule-executor selector, a Drools rule executor, a basic-data rule executor and an NLP rule executor, which are arranged in parallel.
The vector recall method described above,
the neural network in step (1) optimizes data transmission between different layers using the following algorithm:
in the formula, the quantity on the left is the parameter value optimized for data transmission, and p0, sj and s' are the vertices of the three network nodes of the PC side, the APP side and the third-party application, respectively.
The vector recall method described above,
the vector recall model described in step (2) uses a double-tower model, whose pattern for data indexing is as follows:
a hash calculation is carried out on the data content to obtain the data identifier of the data;
an index-node matching layer is constructed from the data identifiers; it maps each matching primary key value to a matching index node, and is composed of a plurality of index-node matching tables, each consisting of the related primary key values and key-value pairs; the data identifier of a piece of data serves as the primary key value of the pointer block, and the index node of the data serves as the key value of the pointer block;
when the data identifier obtained for subsequent data does not exist in the index-node matching layer, a new index-node matching table is generated from that identifier and inserted into the index-node matching layer;
when the data identifier obtained for subsequent data already exists in the index-node matching layer, the subsequent data is pointed to the corresponding index node.
The vector recall method described above,
the construction method of the DGgraph distributed database comprises the following steps:
performing DGgraph hypergraph construction on the relation tables in the database cluster, wherein each DGgraph hyperedge Ei ∈ E is defined as Ei = {T(Ei), H(Ei), Ω(Ei)}, with T(Ei).Ai = tsΣ(E) and H(Ei).Ai = hΣ(E), each containing the IDs of all related tuples; a functional dependency X → Y means that tuples with the same X value must also have the same Y value, so the data in the relation table are divided into different equivalence classes according to the value of X, and within each equivalence class all members share the same X value while their Y values may be the same or different; two types of hyperedges exist in the DGgraph, one with only one head node and |H(E)| = 1, the other with multiple head nodes and |H(E)| > 1, these being B-arc edges and edges respectively; in the DGgraph, the presence of one or more such edges means that at least one functional dependency maps the left-hand attribute of the equivalence class to the right-hand attributes of the equivalence classes; here X and Y are attributes in a relation table of the database cluster, Σ is the functional-dependency set, E is the set of hyperedges in the hypergraph, H(E) denotes the head node of a hyperedge, T(E) denotes the tail node of a hyperedge, t is a tuple in table R, Ai is an attribute name, and Ai ∈ U.
The vector recall method described above,
the specific way in which the desired output words are obtained in step (2) is as follows:
setting a model adapter for starting a mathematical model, and inputting data into the mathematical model to obtain a calculation result;
the Drools rule executor in the double-tower model is used to connect to the NLP rule executor in the mathematical model.
The vector recall method described above,
the scheduling algorithm used by the API (application program interface) for data in the gradient-descent method of step (3) is as follows:
wherein P_{i,t} is the allocation strength of the i-th interface in period t, P_{m,0} is the output for processing the m-th piece of network data before processing, P_{n,0} is the output for processing the n-th piece of network data before processing, P'_{m,t} is the allocation strength for the m-th piece of network data in period t, and P'_{n,t} is the allocation strength for the n-th piece of network data in period t.
3. Advantageous effects
Compared with the prior art, the application has the beneficial effects that:
the innovative design is provided with a rule filter which comprises a rule executor, a parameter entering rule matcher, a scene planning filter, a rule relation maintainer, a model adapter and an execution result analyzer; the rule executor is respectively connected with the parameter entering rule matcher, the scene planning filter and the rule relation maintainer, the parameter entering rule matcher, the scene planning filter and the rule relation maintainer are respectively connected with the model adapter, and the model adapter is connected with the execution result analyzer; the rule executor comprises a simple rule executor, a parallel planning executor, a rule executor selector, a Droois rule executor, a basic data rule executor and an NLP rule executor, wherein the simple rule executor, the parallel planning executor, the rule executor selector, the Droois rule executor, the basic data rule executor and the NLP rule executor are arranged in parallel. Training a double-tower structure by using query-title data, calculating by using a title tower to provide vectors, extracting words from a word model, carrying out vectorization representation, and packaging into batch data to enable the parallel calculation of the model to improve the performance.
Drawings
FIG. 1 is a flow chart of a vector recall method of the present application.
FIG. 2 is a CBOW model diagram of the vector recall method of the present application.
FIG. 3 is a graph showing the effect of offline recall index evaluation in the vector recall method of the present application.
FIG. 4 is a diagram of a vector recall method used in the prior art, which still uses fixed combinations of atomic strategies; compared with the present application, the effect of such hard-coded recall strategies is poor.
Detailed Description
The application is further described below in connection with specific embodiments.
Example 1
As shown in fig. 1 and 2, the vector recall method includes the following steps:
(1) A neural network is constructed whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word. The output words may be words adjacent to the center word, related words appearing near it, or words unrelated to it.
(2) When the network is trained, the parameter matrix of the neural network represents the features of the input text, and the output is the word vectors;
in training, the words we want the network to output are those surrounding the center word; that is, the larger the product of the output probabilities, the more relevant the output words are to the center word. When training the network, the parameter matrix of the network captures the features of the input text, i.e. the word vectors we want to obtain.
(3) The network is continuously updated by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words.
The vector recall method described above,
before constructing the neural network in step (1), a neural-network library system is constructed, comprising a DGgraph distributed database, a Type system, an API (application program interface) and a Graph Engine; the DGgraph distributed database is connected to the MetaData Store (metadata storage area) and the Index Store (index storage area) respectively; the DGgraph distributed database, the Type system, the API and the Graph Engine are connected in sequence.
The vector recall method described above,
the probability calculation rule in step (1) is as follows:
setting a rule filter, which comprises a rule executor, an input-parameter rule matcher, a scene-planning filter, a rule-relation maintainer, a model adapter and an execution-result analyzer; the rule executor is connected to the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer respectively; the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer are each connected to the model adapter, and the model adapter is connected to the execution-result analyzer; the rule executor comprises a simple rule executor, a parallel-planning executor, a rule-executor selector, a Drools rule executor, a basic-data rule executor and an NLP rule executor, which are arranged in parallel.
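The wiring of these components could be sketched roughly as below; the class, method and rule names are illustrative assumptions rather than the patent's implementation, and the parallel arrangement of executors is approximated with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

class RuleFilter:
    """Minimal sketch: match rules against the input parameters, filter them by scene,
    run the survivors in parallel, then pass the raw results through a model adapter
    and an execution-result analyzer."""

    def __init__(self, rules, adapter, analyzer):
        self.rules = rules            # list of (name, scene, predicate) tuples
        self.adapter = adapter        # callable: raw results -> adapted results
        self.analyzer = analyzer      # callable: adapted results -> final decision

    def _match(self, scene):
        return [r for r in self.rules if r[1] == scene]

    def run(self, params, scene):
        matched = self._match(scene)
        with ThreadPoolExecutor() as pool:          # executors arranged in parallel
            raw = list(pool.map(lambda r: (r[0], r[2](params)), matched))
        return self.analyzer(self.adapter(raw))

# Tiny usage example with two illustrative rules for a "search" scene.
rules = [("is_long_query", "search", lambda p: len(p["query"]) > 10),
         ("has_digits", "search", lambda p: any(c.isdigit() for c in p["query"]))]
rf = RuleFilter(rules, adapter=dict, analyzer=lambda d: [name for name, ok in d.items() if ok])
print(rf.run({"query": "python engineer 3 years"}, scene="search"))
```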
The vector recall method described above,
the neural network in step (1) optimizes data transmission between different layers using the following algorithm:
in the formula, the quantity on the left is the parameter value optimized for data transmission, and p0, sj and s' are the vertices of the three network nodes of the PC side, the APP side and the third-party application, respectively.
The vector recall method described above,
the vector recall model described in step (2) uses a double-tower model, whose pattern for data indexing is as follows:
a hash calculation is carried out on the data content to obtain the data identifier of the data;
an index-node matching layer is constructed from the data identifiers; it maps each matching primary key value to a matching index node, and is composed of a plurality of index-node matching tables, each consisting of the related primary key values and key-value pairs; the data identifier of a piece of data serves as the primary key value of the pointer block, and the index node of the data serves as the key value of the pointer block;
when the data identifier obtained for subsequent data does not exist in the index-node matching layer, a new index-node matching table is generated from that identifier and inserted into the index-node matching layer;
when the data identifier obtained for subsequent data already exists in the index-node matching layer, the subsequent data is pointed to the corresponding index node.
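A minimal sketch of this hash-based indexing pattern (the hash function, class and field names are assumptions for illustration) could look like:

```python
import hashlib

class IndexNodeMatchingLayer:
    """Maps a content-derived data identifier (primary key value) to an index node
    (key value); each stored entry plays the role of one index-node matching table."""

    def __init__(self):
        self.tables = {}                                  # data identifier -> index node

    @staticmethod
    def data_identifier(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()        # hash of the data content

    def insert_or_point(self, content: bytes, index_node: str) -> str:
        key = self.data_identifier(content)
        if key not in self.tables:
            self.tables[key] = index_node                 # new matching table inserted
        return self.tables[key]                           # existing data points to its node

layer = IndexNodeMatchingLayer()
print(layer.insert_or_point(b"resume text A", "node-001"))   # inserted -> node-001
print(layer.insert_or_point(b"resume text A", "node-002"))   # already present -> node-001
```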
The vector recall method described above,
the construction method of the DGgraph distributed database comprises the following steps:
performing DGgraph hypergraph construction on the relation tables in the database cluster, wherein each DGgraph hyperedge Ei ∈ E is defined as Ei = {T(Ei), H(Ei), Ω(Ei)}, with T(Ei).Ai = tsΣ(E) and H(Ei).Ai = hΣ(E), each containing the IDs of all related tuples; a functional dependency X → Y means that tuples with the same X value must also have the same Y value, so the data in the relation table are divided into different equivalence classes according to the value of X, and within each equivalence class all members share the same X value while their Y values may be the same or different; two types of hyperedges exist in the DGgraph, one with only one head node and |H(E)| = 1, the other with multiple head nodes and |H(E)| > 1, these being B-arc edges and edges respectively; in the DGgraph, the presence of one or more such edges means that at least one functional dependency maps the left-hand attribute of the equivalence class to the right-hand attributes of the equivalence classes; here X and Y are attributes in a relation table of the database cluster, Σ is the functional-dependency set, E is the set of hyperedges in the hypergraph, H(E) denotes the head node of a hyperedge, T(E) denotes the tail node of a hyperedge, t is a tuple in table R, Ai is an attribute name, and Ai ∈ U.
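As a rough illustration of the equivalence-class grouping described above (a sketch over assumed toy data, not the DGgraph implementation itself), the following snippet partitions the tuple IDs of a relation table by their X value for a functional dependency X → Y and builds one illustrative hyperedge per class:

```python
from collections import defaultdict

def equivalence_classes(rows, x_attr):
    """Group tuple ids of a relation table by their X value; within one class
    all members share the same X value."""
    classes = defaultdict(list)
    for tuple_id, row in enumerate(rows):
        classes[row[x_attr]].append(tuple_id)
    return dict(classes)

def hyperedges(rows, x_attr, y_attr):
    """One illustrative hyperedge per equivalence class: the tail holds the tuple ids
    of the class, the head groups those ids by their Y value (one head node if X -> Y
    holds on the class, several head nodes otherwise)."""
    edges = []
    for x_value, tuple_ids in equivalence_classes(rows, x_attr).items():
        heads = defaultdict(list)
        for tid in tuple_ids:
            heads[rows[tid][y_attr]].append(tid)
        edges.append({"X": x_value, "T": tuple_ids, "H": list(heads.values())})
    return edges

table = [{"city": "Beijing", "zone": "CN-N"},
         {"city": "Beijing", "zone": "CN-N"},
         {"city": "Shanghai", "zone": "CN-E"}]
print(hyperedges(table, "city", "zone"))
```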
The vector recall method described above,
the specific way in which the desired output words are obtained in step (2) is as follows:
setting a model adapter for starting a mathematical model, and inputting data into the mathematical model to obtain a calculation result;
the Drools rule executor in the double-tower model is used to connect to the NLP rule executor in the mathematical model.
The vector recall method described above,
the scheduling algorithm used by the API (application program interface) for data in the gradient-descent method of step (3) is as follows:
wherein P_{i,t} is the allocation strength of the i-th interface in period t, P_{m,0} is the output for processing the m-th piece of network data before processing, P_{n,0} is the output for processing the n-th piece of network data before processing, P'_{m,t} is the allocation strength for the m-th piece of network data in period t, and P'_{n,t} is the allocation strength for the n-th piece of network data in period t.
Example 2
In connection with FIG. 2, we use a self-built vector recall model so that it integrates and upgrades seamlessly with the existing modules and interfaces of the system.
Word vector: word vectors (Word emplacement), also known as a collective term for a set of language modeling and feature learning techniques in Word embedded Natural Language Processing (NLP), wherein words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, it involves mathematical embedding from a space of one dimension per word to a continuous vector space with lower dimensions. Specifically, words are mapped into vectors, and natural language is converted into computation between vectors.
Vector representation: in natural-language-processing tasks, there are two ways to represent word vectors. The first is the one-hot representation; the second is the distributed representation. We adopt the CBOW model of the distributed-representation approach.
The specific flow is as follows:
(1) A neural network is constructed whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word. The output word may be a word adjacent to the center word, a related word appearing near it, or a word unrelated to the center word.
(2) In training, we want the output words to be the words surrounding the center word; that is, the larger the product of the output probabilities (maximum likelihood), the more relevant the output words are to the center word. When training the network, the parameter matrix of the network captures the features of the input text, i.e. the word vectors we want to obtain.
(3) Therefore, the product of the output probabilities is made as large as possible by gradient descent, and the word vectors of all words are finally obtained under continuous updating.
When the window size is set to 2, the input of the model is the context words on both sides of the center word in the sentence "I love Beijing Tiananmen", and the output probabilities are used to predict the center word "Tiananmen".
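A toy PyTorch sketch of this CBOW setup is given below (the hyper-parameters, vocabulary and training loop are illustrative assumptions, not the patent's trained model): the context words around the center word are averaged and scored against the whole vocabulary, cross-entropy training by gradient descent maximizes the probability of the center word, and the learned embedding matrix plays the role of the word-vector table.

```python
import torch
import torch.nn as nn

sentence = ["I", "love", "Beijing", "Tiananmen", "Square"]
vocab = {w: i for i, w in enumerate(sentence)}
embed_dim = 16

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # parameter matrix = word vectors
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        hidden = self.embeddings(context_ids).mean(dim=0)      # average the context vectors
        return self.out(hidden)                                # score every vocabulary word

model = CBOW(len(vocab), embed_dim)
loss_fn = nn.CrossEntropyLoss()                                # maximizes the center-word probability
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)       # gradient descent

# Context of "Tiananmen" with window size 2 (truncated at the sentence edge).
context = torch.tensor([vocab["love"], vocab["Beijing"], vocab["Square"]])
target = torch.tensor([vocab["Tiananmen"]])

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(context).unsqueeze(0), target)
    loss.backward()
    optimizer.step()

word_vectors = model.embeddings.weight.detach()                # word vectors of all words
print(word_vectors.shape)
```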
Vector training is performed using a double-tower model in JD-CV understanding, with the following innovation points:
the query-title data is used to train the double-tower structure, and the title tower is used to calculate the vector.
And carrying out vectorization representation on the words extracted by the word model.
Packaging into batch data allows model parallel computing to improve performance.
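The following is a hedged sketch of such a double-tower setup (the encoder structure, loss and batch construction are assumptions for illustration only): the query tower and the title tower encode batched token ids into normalized vectors, in-batch negatives drive a contrastive loss, and at serving time only the title tower is needed to pre-compute title vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One tower: embed the token ids and mean-pool them into a sentence vector."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return F.normalize(self.proj(pooled), dim=-1)

vocab_size, batch, seq_len = 5000, 32, 12
query_tower, title_tower = Tower(vocab_size), Tower(vocab_size)
params = list(query_tower.parameters()) + list(title_tower.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# One batch of query-title pairs packed together so the model computes them in parallel.
query_ids = torch.randint(0, vocab_size, (batch, seq_len))
title_ids = torch.randint(0, vocab_size, (batch, seq_len))

q, t = query_tower(query_ids), title_tower(title_ids)
scores = q @ t.T                                         # off-diagonal pairs act as in-batch negatives
loss = F.cross_entropy(scores / 0.05, torch.arange(batch))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At serving time, only the title tower is used to pre-compute and provide title vectors.
with torch.no_grad():
    title_vectors = title_tower(title_ids)
print(title_vectors.shape)
```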
The application also provides a vector recall device, which is characterized by comprising:
constructing a neural network module, inputting a certain word or a central word in a sentence, and outputting the probability of all other words appearing around the central word; when outputting, the output words are words adjacent to the central word or words or related words which appear nearby the central word, and various words which are irrelevant;
the training module is used for training a network by using a vector recall model, wherein a parameter matrix of the neural network is the characteristic of an input text, and the output is a word vector;
and the updating module is used for continuously updating the word vectors through a gradient descent method, so that the probability product of the word vectors reaches the maximum value of the number sequence, and the word vectors of all words are obtained.
FIG. 3 is a graph showing the effect of offline recall index evaluation in the vector recall method of the present application.
Specifically, the experimental team sampled 2,000 JDs with behavior data, generated a recall queue (exp 5000) according to the recall policy, and ran the offline indicator scripts. Sentence-vector accuracy improved to 90%, and the online interface was used to write the vectors into the library for business-side experiments, effectively improving the business-side effect; the chapter vectors were pushed to the chapter library for business-side experiments, helping to improve recommendation and search business indicators.
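One way such offline indicator scripts might compute a recall metric over the sampled JDs (an illustrative assumption; the experiment's actual scripts and data are not reproduced here) is:

```python
def recall_at_k(recalled_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k of the recall queue."""
    if not relevant_ids:
        return 0.0
    hits = len(set(recalled_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Toy example: a long recall queue is truncated to k for evaluation.
queue = [101, 7, 42, 350, 9]
ground_truth = [42, 500]
print(recall_at_k(queue, ground_truth, k=5))   # 0.5
```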
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted only for clarity, and the specification should be taken as a whole. The embodiments described in the examples may be combined appropriately to form other implementations that will be apparent to those skilled in the art.

Claims (9)

1. A vector recall method comprising the steps of:
(1) Constructing a neural network whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word; the output words may be words adjacent to the center word, related words appearing near it, or unrelated words;
(2) Training the network with a vector recall model, where the parameter matrix of the neural network represents the features of the input text and the output is the word vectors;
(3) Continuously updating by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words.
2. The vector recall method of claim 1 wherein:
before constructing the neural network in step (1), constructing a neural network library system which comprises a DGgraph distributed database, a Type system, an API (application program interface) and a Graph Engine;
the DGgraph distributed database is provided with a MetaData Store MetaData storage area and an Index Store Index storage area;
the DGgraph distributed database is respectively connected with the MetaData Store MetaData storage area and the Index Store Index storage area;
the DGgraph distributed database, the Type system, the API and the Graph Engine are sequentially connected.
3. The vector recall method of claim 2 wherein:
the probability calculation rule in step (1) is as follows:
setting a rule filter, which comprises a rule executor, an input-parameter rule matcher, a scene-planning filter, a rule-relation maintainer, a model adapter and an execution-result analyzer;
the rule executor is connected to the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer respectively; the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer are each connected to the model adapter, and the model adapter is connected to the execution-result analyzer;
the rule executor comprises a simple rule executor, a parallel-planning executor, a rule-executor selector, a Drools rule executor, a basic-data rule executor and an NLP rule executor;
the simple rule executor, the parallel-planning executor, the rule-executor selector, the Drools rule executor, the basic-data rule executor and the NLP rule executor are arranged in parallel.
4. A vector recall method according to claim 3 wherein:
the neural network in step (1) optimizes data transmission between different layers using the following algorithm:
in the formula, the quantity on the left is the parameter value optimized for data transmission, p0 is the vertex of the PC-side network node, sj is the vertex of the APP-side network node, and s' is the vertex of the third-party-application network node.
5. The vector recall method of claim 4 wherein:
the vector recall model described in step (2) uses a double-tower model, whose pattern for data indexing is as follows:
a hash calculation is carried out on the data content to obtain the data identifier of the data;
an index-node matching layer is constructed from the data identifiers; it maps each matching primary key value to a matching index node, and is composed of a plurality of index-node matching tables, each consisting of the related primary key values and key-value pairs;
the data identifier of a piece of data serves as the primary key value of the pointer block, and the index node of the data serves as the key value of the pointer block;
when the data identifier obtained for subsequent data does not exist in the index-node matching layer, a new index-node matching table is generated from that identifier and inserted into the index-node matching layer;
when the data identifier obtained for subsequent data already exists in the index-node matching layer, the subsequent data is pointed to the corresponding index node.
6. The vector recall method of claim 5 wherein:
the construction method of the DGgraph distributed database comprises the following steps:
a DGgraph hypergraph is built over the relation tables in the database cluster;
wherein each DGgraph hyperedge Ei ∈ E is defined as Ei = {T(Ei), H(Ei), Ω(Ei)},
wherein T(Ei).Ai = tsΣ(E) and H(Ei).Ai = hΣ(E),
each containing the IDs of all related tuples; a functional dependency X → Y means that tuples with the same X value must also have the same Y value;
the data in the relation table are divided into different equivalence classes according to the value of X; within each equivalence class, all members have the same X value, and their Y values are the same or different;
two types of hyperedges exist in the DGgraph: one has only one head node, with |H(E)| = 1, and the other has multiple head nodes, with |H(E)| > 1; these are B-arc edges and edges, respectively;
in the DGgraph, the presence of one or more such edges means that at least one functional dependency maps the left-hand attribute of the equivalence class to the right-hand attributes of the equivalence classes;
wherein X and Y are attributes in a relation table of the database cluster, Σ is the functional-dependency set, E is the set of hyperedges in the hypergraph, H(E) denotes the head node of a hyperedge, T(E) denotes the tail node of a hyperedge, t is a tuple in table R, Ai is an attribute name, and Ai ∈ U.
7. The vector recall method of claim 6 wherein:
the specific way of outputting the word vectors in step (2) is as follows:
setting a model adapter for starting a mathematical model, and inputting data into the mathematical model to obtain a calculation result;
the Drools rule executor in the double-tower model is used to connect to the NLP rule executor in the mathematical model.
8. The vector recall method of claim 7 wherein:
the scheduling algorithm used by the API (application program interface) for data in the gradient-descent method of step (3) is as follows:
wherein P_{i,t} is the allocation strength of the i-th interface in period t, P_{m,0} is the output for processing the m-th piece of network data before processing, P_{n,0} is the output for processing the n-th piece of network data before processing, P'_{m,t} is the allocation strength for the m-th piece of network data in period t, and P'_{n,t} is the allocation strength for the n-th piece of network data in period t.
9. A vector recall device, comprising:
a neural-network construction module, whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word; the output words may be words adjacent to the center word, related words appearing near it, or unrelated words;
a training module, which trains the network with a vector recall model, wherein the parameter matrix of the neural network represents the features of the input text and the output is the word vectors;
and an updating module, which continuously updates the word vectors by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words.
CN202310988298.0A 2023-08-07 2023-08-07 Vector recall method and device Active CN117056459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310988298.0A CN117056459B (en) 2023-08-07 2023-08-07 Vector recall method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310988298.0A CN117056459B (en) 2023-08-07 2023-08-07 Vector recall method and device

Publications (2)

Publication Number Publication Date
CN117056459A true CN117056459A (en) 2023-11-14
CN117056459B CN117056459B (en) 2024-05-10

Family

ID=88667088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310988298.0A Active CN117056459B (en) 2023-08-07 2023-08-07 Vector recall method and device

Country Status (1)

Country Link
CN (1) CN117056459B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118034147A (en) * 2024-03-08 2024-05-14 未来城市(上海)建筑规划设计有限公司 Building control system for wireless communication

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220019739A1 (en) * 2019-02-21 2022-01-20 Beijing Jingdong Shangke Information Technology Co., Ltd. Item Recall Method and System, Electronic Device and Readable Storage Medium
CN111782975A (en) * 2020-06-28 2020-10-16 北京百度网讯科技有限公司 Retrieval method and device and electronic equipment
CN113468317A (en) * 2021-06-26 2021-10-01 北京网聘咨询有限公司 Resume screening method, system, equipment and storage medium
WO2023040516A1 (en) * 2021-09-18 2023-03-23 腾讯科技(深圳)有限公司 Event integration method and apparatus, and electronic device, computer-readable storage medium and computer program product
CN113849492A (en) * 2021-09-23 2021-12-28 北京网聘咨询有限公司 System for providing standardized data quality check for multi-scenario service
CN114820134A (en) * 2022-05-12 2022-07-29 北京沃东天骏信息技术有限公司 Commodity information recall method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN117056459B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
JP7468929B2 (en) How to acquire geographical knowledge
CN104318340B (en) Information visualization methods and intelligent visible analysis system based on text resume information
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN104239501B (en) Mass video semantic annotation method based on Spark
CN106951558B (en) Data processing method of tax intelligent consultation platform based on deep search
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111581401A (en) Local citation recommendation system and method based on depth correlation matching
CN111767325B (en) Multi-source data deep fusion method based on deep learning
CN112100397A (en) Electric power plan knowledge graph construction method and system based on bidirectional gating circulation unit
CN117056459B (en) Vector recall method and device
CN116975256B (en) Method and system for processing multisource information in construction process of underground factory building of pumped storage power station
CN113946686A (en) Electric power marketing knowledge map construction method and system
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
CN115222048A (en) Training method, device, equipment and medium for document abstract generation model
Tapsai et al. Natural language interface to database for data retrieval and processing
CN112732944A (en) New method for text retrieval
CN110674293B (en) Text classification method based on semantic migration
CN115795018B (en) Multi-strategy intelligent search question-answering method and system for power grid field
CN111985204A (en) Customs import and export commodity tax number prediction method
CN111581365A (en) Predicate extraction method
CN111339258A (en) University computer basic exercise recommendation method based on knowledge graph
CN115688785A (en) Multi-source knowledge fused aviation equipment model named entity identification method
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN106156259A (en) A kind of user behavior information displaying method and system
CN115238075A (en) Text emotion classification method based on hypergraph pooling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant