CN117056459A - Vector recall method and device - Google Patents

Vector recall method and device

Info

Publication number
CN117056459A
CN117056459A (application CN202310988298.0A)
Authority
CN
China
Prior art keywords
word
data
words
rule
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310988298.0A
Other languages
Chinese (zh)
Other versions
CN117056459B (en)
Inventor
时迎超
王杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wangpin Information Technology Co ltd
Original Assignee
Beijing Wangpin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wangpin Information Technology Co ltd filed Critical Beijing Wangpin Information Technology Co ltd
Priority to CN202310988298.0A priority Critical patent/CN117056459B/en
Publication of CN117056459A publication Critical patent/CN117056459A/en
Application granted granted Critical
Publication of CN117056459B publication Critical patent/CN117056459B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a vector recall method and belongs to the technical field of data processing. The method comprises the following steps: (1) constructing a neural network whose input is a given word (the center word) in a sentence and whose output is the probability of every other word appearing around that center word; the output words may be words adjacent to the center word, related words appearing near it, or unrelated words; (2) training the network, where the parameter matrix of the neural network represents the features of the input text and the output is the word vectors; (3) continuously updating by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words. The method effectively improves the business-side effect, supports writing the vectors into the library, and helps improve recommendation and search business indicators.

Description

Vector recall method and device
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a vector recall method and device.
Background
Natural language processing (NLP) is a discipline that studies language problems in human-computer interaction. In general, natural language processing includes natural language understanding and natural language generation. Natural language understanding converts natural language into a form a computer can understand, turning unstructured text into structured information. Specifically, NLP technology can perform semantic recognition and recommendation on natural language, or call related services, to provide more intelligent functions for users. With the development of artificial intelligence, NLP technology has become increasingly mature.
Unlike conventional ItemCF/UserCF, vectorized recall uses a vector retrieval tool to perform a fast nearest-neighbor search in vector space based on Euclidean distance. Offline, a model is trained, generally so that the similarity of positive sample pairs is as high as possible compared with negative sample pairs; once the model converges, the embeddings are obtained, and the embeddings of the static part (the queried items) are written into the Faiss tool to build an index for online queries. Online, the model outputs a corresponding embedding for each dynamic-part (active query) request, and the Top-K static contents are searched in Faiss and returned as the Top-K recall result.
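As a minimal illustration of this offline/online split (not taken from the patent text; the dimensionality, data and variable names below are assumptions), the following Python sketch writes static embeddings into a Faiss index offline and retrieves the Top-K nearest static items for an online query embedding:

```python
import numpy as np
import faiss  # the vector retrieval tool mentioned above

d, n_static, top_k = 64, 10000, 10                 # assumed embedding size and corpus size

# Offline: embeddings of the static (queried) part, e.g. produced by a converged model.
static_embeddings = np.random.rand(n_static, d).astype("float32")
index = faiss.IndexFlatL2(d)                        # exact Euclidean-distance index
index.add(static_embeddings)                        # write the static embeddings into Faiss

# Online: the model outputs an embedding for the dynamic (active query) part,
# and the Top-K static items are returned as the recall result.
query_embedding = np.random.rand(1, d).astype("float32")
distances, item_ids = index.search(query_embedding, top_k)
print(item_ids[0])                                  # ids of the Top-K recalled static items
```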
In vectorized recall, the number of static items generally ranges from hundreds to hundreds of millions; for example, recommendation scenarios typically have millions of candidate materials, while NLP semantic-similarity recall may involve anywhere from hundreds to tens of thousands of items. When NLP is used for text classification with a relatively large number of classes, a pure classification approach struggles to reach high accuracy, and the vectorized-recall approach can solve this problem. Because the static part faced by vector recall is relatively large, letting all negative samples (everything except the positives) participate in the calculation is difficult to realize in engineering and greatly increases the computational cost, so a negative-sampling strategy is generally adopted for sample processing. Of course, if the static part contains only hundreds of items, all negative samples can participate in the calculation, which achieves a better effect.
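A small illustrative sketch of such a negative-sampling strategy (an assumption for clarity, not the patent's sampler) pairs each positive sample with a handful of randomly drawn negatives instead of the full static corpus:

```python
import random

def sample_training_pairs(positive_pairs, static_ids, num_negatives=5, seed=0):
    """For each (query, positive_item) pair, draw a few random negatives from the
    static corpus instead of letting every non-positive item join the calculation."""
    rng = random.Random(seed)
    samples = []
    for query, pos_item in positive_pairs:
        negatives = []
        while len(negatives) < num_negatives:
            candidate = rng.choice(static_ids)
            if candidate != pos_item:
                negatives.append(candidate)
        samples.append((query, pos_item, negatives))
    return samples

print(sample_training_pairs([("python engineer", 42)], static_ids=list(range(1000))))
```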
FIG. 4 shows a vector recall method used in the prior art. Compared with the present application, the prior art still uses fixed combinations of atomic strategies, and the effect of such hard-coded recall strategies is poor. When the platform was developed, the following problems were evident: the recall mode is single, and the multiple forms of recall do not complement one another's strengths and weaknesses; the strategy granularity is coarse and limited to the channel dimension, so JD-personalized recall cannot be achieved; and production efficiency is low, making it difficult to traverse all effective strategy combinations, so the recall results are inefficient and the effect is poor.
Disclosure of Invention
1. Problems to be solved
Aiming at the problems in the prior art, the application provides a vector recall method that trains a double-tower structure with query-title data, uses the title tower to calculate and provide vectors, extracts words with a word model, represents them as vectors, and packs them into batch data so that the model can compute in parallel and improve performance.
2. Technical proposal
In order to solve the problems, the application adopts the following technical scheme.
A vector recall method comprising the steps of:
(1) Constructing a neural network whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word; the output words may be words adjacent to the center word, related words appearing near it, or unrelated words;
(2) Training the network with a vector recall model, where the parameter matrix of the neural network represents the features of the input text and the output is the word vectors;
(3) Continuously updating by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words.
The vector recall method described above,
before constructing the neural network in step (1), a neural-network library system is constructed, comprising a DGgraph distributed database, a Type system, an API (application program interface) and a Graph Engine; the DGgraph distributed database is connected to the MetaData Store (metadata storage area) and the Index Store (index storage area) respectively; the DGgraph distributed database, the Type system, the API and the Graph Engine are connected in sequence.
The vector recall method described above,
the probability calculation rule in step (1) is as follows:
setting a rule filter, which comprises a rule executor, an input-parameter rule matcher, a scene-planning filter, a rule-relation maintainer, a model adapter and an execution-result analyzer; the rule executor is connected to the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer respectively; the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer are each connected to the model adapter, and the model adapter is connected to the execution-result analyzer; the rule executor comprises a simple rule executor, a parallel-planning executor, a rule-executor selector, a Drools rule executor, a basic-data rule executor and an NLP rule executor, which are arranged in parallel.
The vector recall method described above,
the neural network in step (1) optimizes data transmission between different layers using the following algorithm:
in the formula, the quantity on the left is the parameter value optimized for data transmission, and p0, sj and s' are the vertices of the three network nodes of the PC side, the APP side and the third-party application, respectively.
The vector recall method described above,
the vector recall model described in step (2) uses a double-tower model, whose pattern for data indexing is as follows:
a hash calculation is carried out on the data content to obtain the data identifier of the data;
an index-node matching layer is constructed from the data identifiers; it maps each matching primary key value to a matching index node, and is composed of a plurality of index-node matching tables, each consisting of the related primary key values and key-value pairs; the data identifier of a piece of data serves as the primary key value of the pointer block, and the index node of the data serves as the key value of the pointer block;
when the data identifier obtained for subsequent data does not exist in the index-node matching layer, a new index-node matching table is generated from that identifier and inserted into the index-node matching layer;
when the data identifier obtained for subsequent data already exists in the index-node matching layer, the subsequent data is pointed to the corresponding index node.
The vector recall method described above,
the construction method of the DGgraph distributed database comprises the following steps:
performing DGgraph hypergraph construction on the relation tables in the database cluster, wherein each DGgraph hyperedge Ei ∈ E is defined as Ei = {T(Ei), H(Ei), Ω(Ei)}, with T(Ei).Ai = tsΣ(E) and H(Ei).Ai = hΣ(E), each containing the IDs of all related tuples; a functional dependency X → Y means that tuples with the same X value must also have the same Y value, so the data in the relation table are divided into different equivalence classes according to the value of X, and within each equivalence class all members share the same X value while their Y values may be the same or different; two types of hyperedges exist in the DGgraph, one with only one head node and |H(E)| = 1, the other with multiple head nodes and |H(E)| > 1, these being B-arc edges and edges respectively; in the DGgraph, the presence of one or more such edges means that at least one functional dependency maps the left-hand attribute of the equivalence class to the right-hand attributes of the equivalence classes; here X and Y are attributes in a relation table of the database cluster, Σ is the functional-dependency set, E is the set of hyperedges in the hypergraph, H(E) denotes the head node of a hyperedge, T(E) denotes the tail node of a hyperedge, t is a tuple in table R, Ai is an attribute name, and Ai ∈ U.
The vector recall method described above,
the specific way in which the desired output words are obtained in step (2) is as follows:
setting a model adapter for starting a mathematical model, and inputting data into the mathematical model to obtain a calculation result;
the Drools rule executor in the double-tower model is used to connect to the NLP rule executor in the mathematical model.
The vector recall method described above,
the scheduling algorithm used by the API (application program interface) for data in the gradient-descent method of step (3) is as follows:
wherein P_{i,t} is the allocation strength of the i-th interface in period t, P_{m,0} is the output for processing the m-th piece of network data before processing, P_{n,0} is the output for processing the n-th piece of network data before processing, P'_{m,t} is the allocation strength for the m-th piece of network data in period t, and P'_{n,t} is the allocation strength for the n-th piece of network data in period t.
3. Advantageous effects
Compared with the prior art, the application has the beneficial effects that:
the innovative design is provided with a rule filter which comprises a rule executor, a parameter entering rule matcher, a scene planning filter, a rule relation maintainer, a model adapter and an execution result analyzer; the rule executor is respectively connected with the parameter entering rule matcher, the scene planning filter and the rule relation maintainer, the parameter entering rule matcher, the scene planning filter and the rule relation maintainer are respectively connected with the model adapter, and the model adapter is connected with the execution result analyzer; the rule executor comprises a simple rule executor, a parallel planning executor, a rule executor selector, a Droois rule executor, a basic data rule executor and an NLP rule executor, wherein the simple rule executor, the parallel planning executor, the rule executor selector, the Droois rule executor, the basic data rule executor and the NLP rule executor are arranged in parallel. Training a double-tower structure by using query-title data, calculating by using a title tower to provide vectors, extracting words from a word model, carrying out vectorization representation, and packaging into batch data to enable the parallel calculation of the model to improve the performance.
Drawings
FIG. 1 is a flow chart of a vector recall method of the present application.
FIG. 2 is a CBOW model diagram of the vector recall method of the present application.
FIG. 3 is a graph showing the effect of offline recall index evaluation in the vector recall method of the present application.
FIG. 4 is a diagram of a vector recall method used in the prior art, which still uses fixed combinations of atomic strategies; compared with the present application, the effect of such hard-coded recall strategies is poor.
Detailed Description
The application is further described below in connection with specific embodiments.
Example 1
As shown in fig. 1 and 2, the vector recall method includes the following steps:
(1) A neural network is constructed whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word. The output words may be words adjacent to the center word, related words appearing near it, or words unrelated to it.
(2) When the network is trained, the parameter matrix of the neural network represents the features of the input text, and the output is the word vectors;
in training, the words we want the network to output are those surrounding the center word; that is, the larger the product of the output probabilities, the more relevant the output words are to the center word. When training the network, the parameter matrix of the network captures the features of the input text, i.e. the word vectors we want to obtain.
(3) The network is continuously updated by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words.
The vector recall method described above,
before constructing the neural network in step (1), a neural-network library system is constructed, comprising a DGgraph distributed database, a Type system, an API (application program interface) and a Graph Engine; the DGgraph distributed database is connected to the MetaData Store (metadata storage area) and the Index Store (index storage area) respectively; the DGgraph distributed database, the Type system, the API and the Graph Engine are connected in sequence.
The vector recall method described above,
the probability calculation rule in step (1) is as follows:
setting a rule filter, which comprises a rule executor, an input-parameter rule matcher, a scene-planning filter, a rule-relation maintainer, a model adapter and an execution-result analyzer; the rule executor is connected to the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer respectively; the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer are each connected to the model adapter, and the model adapter is connected to the execution-result analyzer; the rule executor comprises a simple rule executor, a parallel-planning executor, a rule-executor selector, a Drools rule executor, a basic-data rule executor and an NLP rule executor, which are arranged in parallel.
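The wiring of these components could be sketched roughly as below; the class, method and rule names are illustrative assumptions rather than the patent's implementation, and the parallel arrangement of executors is approximated with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

class RuleFilter:
    """Minimal sketch: match rules against the input parameters, filter them by scene,
    run the survivors in parallel, then pass the raw results through a model adapter
    and an execution-result analyzer."""

    def __init__(self, rules, adapter, analyzer):
        self.rules = rules            # list of (name, scene, predicate) tuples
        self.adapter = adapter        # callable: raw results -> adapted results
        self.analyzer = analyzer      # callable: adapted results -> final decision

    def _match(self, scene):
        return [r for r in self.rules if r[1] == scene]

    def run(self, params, scene):
        matched = self._match(scene)
        with ThreadPoolExecutor() as pool:          # executors arranged in parallel
            raw = list(pool.map(lambda r: (r[0], r[2](params)), matched))
        return self.analyzer(self.adapter(raw))

# Tiny usage example with two illustrative rules for a "search" scene.
rules = [("is_long_query", "search", lambda p: len(p["query"]) > 10),
         ("has_digits", "search", lambda p: any(c.isdigit() for c in p["query"]))]
rf = RuleFilter(rules, adapter=dict, analyzer=lambda d: [name for name, ok in d.items() if ok])
print(rf.run({"query": "python engineer 3 years"}, scene="search"))
```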
The vector recall method described above,
the neural network in step (1) optimizes data transmission between different layers using the following algorithm:
in the formula, the quantity on the left is the parameter value optimized for data transmission, and p0, sj and s' are the vertices of the three network nodes of the PC side, the APP side and the third-party application, respectively.
The vector recall method described above,
the vector recall model described in step (2) uses a double-tower model, whose pattern for data indexing is as follows:
a hash calculation is carried out on the data content to obtain the data identifier of the data;
an index-node matching layer is constructed from the data identifiers; it maps each matching primary key value to a matching index node, and is composed of a plurality of index-node matching tables, each consisting of the related primary key values and key-value pairs; the data identifier of a piece of data serves as the primary key value of the pointer block, and the index node of the data serves as the key value of the pointer block;
when the data identifier obtained for subsequent data does not exist in the index-node matching layer, a new index-node matching table is generated from that identifier and inserted into the index-node matching layer;
when the data identifier obtained for subsequent data already exists in the index-node matching layer, the subsequent data is pointed to the corresponding index node.
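A minimal sketch of this hash-based indexing pattern (the hash function, class and field names are assumptions for illustration) could look like:

```python
import hashlib

class IndexNodeMatchingLayer:
    """Maps a content-derived data identifier (primary key value) to an index node
    (key value); each stored entry plays the role of one index-node matching table."""

    def __init__(self):
        self.tables = {}                                  # data identifier -> index node

    @staticmethod
    def data_identifier(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()        # hash of the data content

    def insert_or_point(self, content: bytes, index_node: str) -> str:
        key = self.data_identifier(content)
        if key not in self.tables:
            self.tables[key] = index_node                 # new matching table inserted
        return self.tables[key]                           # existing data points to its node

layer = IndexNodeMatchingLayer()
print(layer.insert_or_point(b"resume text A", "node-001"))   # inserted -> node-001
print(layer.insert_or_point(b"resume text A", "node-002"))   # already present -> node-001
```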
The vector recall method described above,
the construction method of the DGgraph distributed database comprises the following steps:
performing DGgraph hypergraph construction on the relation tables in the database cluster, wherein each DGgraph hyperedge Ei ∈ E is defined as Ei = {T(Ei), H(Ei), Ω(Ei)}, with T(Ei).Ai = tsΣ(E) and H(Ei).Ai = hΣ(E), each containing the IDs of all related tuples; a functional dependency X → Y means that tuples with the same X value must also have the same Y value, so the data in the relation table are divided into different equivalence classes according to the value of X, and within each equivalence class all members share the same X value while their Y values may be the same or different; two types of hyperedges exist in the DGgraph, one with only one head node and |H(E)| = 1, the other with multiple head nodes and |H(E)| > 1, these being B-arc edges and edges respectively; in the DGgraph, the presence of one or more such edges means that at least one functional dependency maps the left-hand attribute of the equivalence class to the right-hand attributes of the equivalence classes; here X and Y are attributes in a relation table of the database cluster, Σ is the functional-dependency set, E is the set of hyperedges in the hypergraph, H(E) denotes the head node of a hyperedge, T(E) denotes the tail node of a hyperedge, t is a tuple in table R, Ai is an attribute name, and Ai ∈ U.
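As a rough illustration of the equivalence-class grouping described above (a sketch over assumed toy data, not the DGgraph implementation itself), the following snippet partitions the tuple IDs of a relation table by their X value for a functional dependency X → Y and builds one illustrative hyperedge per class:

```python
from collections import defaultdict

def equivalence_classes(rows, x_attr):
    """Group tuple ids of a relation table by their X value; within one class
    all members share the same X value."""
    classes = defaultdict(list)
    for tuple_id, row in enumerate(rows):
        classes[row[x_attr]].append(tuple_id)
    return dict(classes)

def hyperedges(rows, x_attr, y_attr):
    """One illustrative hyperedge per equivalence class: the tail holds the tuple ids
    of the class, the head groups those ids by their Y value (one head node if X -> Y
    holds on the class, several head nodes otherwise)."""
    edges = []
    for x_value, tuple_ids in equivalence_classes(rows, x_attr).items():
        heads = defaultdict(list)
        for tid in tuple_ids:
            heads[rows[tid][y_attr]].append(tid)
        edges.append({"X": x_value, "T": tuple_ids, "H": list(heads.values())})
    return edges

table = [{"city": "Beijing", "zone": "CN-N"},
         {"city": "Beijing", "zone": "CN-N"},
         {"city": "Shanghai", "zone": "CN-E"}]
print(hyperedges(table, "city", "zone"))
```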
The vector recall method described above,
the specific way in which the desired output words are obtained in step (2) is as follows:
setting a model adapter for starting a mathematical model, and inputting data into the mathematical model to obtain a calculation result;
the Drools rule executor in the double-tower model is used to connect to the NLP rule executor in the mathematical model.
The vector recall method described above,
the scheduling algorithm used by the API (application program interface) for data in the gradient-descent method of step (3) is as follows:
wherein P_{i,t} is the allocation strength of the i-th interface in period t, P_{m,0} is the output for processing the m-th piece of network data before processing, P_{n,0} is the output for processing the n-th piece of network data before processing, P'_{m,t} is the allocation strength for the m-th piece of network data in period t, and P'_{n,t} is the allocation strength for the n-th piece of network data in period t.
Example 2
In connection with FIG. 2, we use a self-built vector recall model so that it integrates and upgrades seamlessly with the existing modules and interfaces of the system.
Word vector: word vectors (Word emplacement), also known as a collective term for a set of language modeling and feature learning techniques in Word embedded Natural Language Processing (NLP), wherein words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, it involves mathematical embedding from a space of one dimension per word to a continuous vector space with lower dimensions. Specifically, words are mapped into vectors, and natural language is converted into computation between vectors.
Vector representation: in natural-language-processing tasks, there are two ways to represent word vectors. The first is the one-hot representation; the second is the distributed representation. We adopt the CBOW model of the distributed-representation approach.
The specific flow is as follows:
(1) A neural network is constructed whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word. The output word may be a word adjacent to the center word, a related word appearing near it, or a word unrelated to the center word.
(2) In training, we want the output words to be the words surrounding the center word; that is, the larger the product of the output probabilities (maximum likelihood), the more relevant the output words are to the center word. When training the network, the parameter matrix of the network captures the features of the input text, i.e. the word vectors we want to obtain.
(3) Therefore, the product of the output probabilities is made as large as possible by gradient descent, and the word vectors of all words are finally obtained under continuous updating.
When the window size is set to 2, the input of the model is the context words on both sides of the center word in the sentence "I love Beijing Tiananmen", and the output probabilities are used to predict the center word "Tiananmen".
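A toy PyTorch sketch of this CBOW setup is given below (the hyper-parameters, vocabulary and training loop are illustrative assumptions, not the patent's trained model): the context words around the center word are averaged and scored against the whole vocabulary, cross-entropy training by gradient descent maximizes the probability of the center word, and the learned embedding matrix plays the role of the word-vector table.

```python
import torch
import torch.nn as nn

sentence = ["I", "love", "Beijing", "Tiananmen", "Square"]
vocab = {w: i for i, w in enumerate(sentence)}
embed_dim = 16

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # parameter matrix = word vectors
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        hidden = self.embeddings(context_ids).mean(dim=0)      # average the context vectors
        return self.out(hidden)                                # score every vocabulary word

model = CBOW(len(vocab), embed_dim)
loss_fn = nn.CrossEntropyLoss()                                # maximizes the center-word probability
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)       # gradient descent

# Context of "Tiananmen" with window size 2 (truncated at the sentence edge).
context = torch.tensor([vocab["love"], vocab["Beijing"], vocab["Square"]])
target = torch.tensor([vocab["Tiananmen"]])

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(context).unsqueeze(0), target)
    loss.backward()
    optimizer.step()

word_vectors = model.embeddings.weight.detach()                # word vectors of all words
print(word_vectors.shape)
```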
Vector training is performed using a double-tower model in JD-CV understanding, with the following innovation points:
the query-title data is used to train the double-tower structure, and the title tower is used to calculate the vector.
And carrying out vectorization representation on the words extracted by the word model.
Packaging into batch data allows model parallel computing to improve performance.
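The following is a hedged sketch of such a double-tower setup (the encoder structure, loss and batch construction are assumptions for illustration only): the query tower and the title tower encode batched token ids into normalized vectors, in-batch negatives drive a contrastive loss, and at serving time only the title tower is needed to pre-compute title vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One tower: embed the token ids and mean-pool them into a sentence vector."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return F.normalize(self.proj(pooled), dim=-1)

vocab_size, batch, seq_len = 5000, 32, 12
query_tower, title_tower = Tower(vocab_size), Tower(vocab_size)
params = list(query_tower.parameters()) + list(title_tower.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# One batch of query-title pairs packed together so the model computes them in parallel.
query_ids = torch.randint(0, vocab_size, (batch, seq_len))
title_ids = torch.randint(0, vocab_size, (batch, seq_len))

q, t = query_tower(query_ids), title_tower(title_ids)
scores = q @ t.T                                         # off-diagonal pairs act as in-batch negatives
loss = F.cross_entropy(scores / 0.05, torch.arange(batch))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At serving time, only the title tower is used to pre-compute and provide title vectors.
with torch.no_grad():
    title_vectors = title_tower(title_ids)
print(title_vectors.shape)
```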
The application also provides a vector recall device, which is characterized by comprising:
constructing a neural network module, inputting a certain word or a central word in a sentence, and outputting the probability of all other words appearing around the central word; when outputting, the output words are words adjacent to the central word or words or related words which appear nearby the central word, and various words which are irrelevant;
the training module is used for training a network by using a vector recall model, wherein a parameter matrix of the neural network is the characteristic of an input text, and the output is a word vector;
and the updating module is used for continuously updating the word vectors through a gradient descent method, so that the probability product of the word vectors reaches the maximum value of the number sequence, and the word vectors of all words are obtained.
FIG. 3 is a graph showing the effect of offline recall index evaluation in the vector recall method of the present application.
Specifically, the experimental team sampled 2,000 JDs with behavior data, generated a recall queue (exp 5000) according to the recall policy, and ran the offline indicator scripts. Sentence-vector accuracy improved to 90%, and the online interface was used to write the vectors into the library for business-side experiments, effectively improving the business-side effect; the chapter vectors were pushed to the chapter library for business-side experiments, helping to improve recommendation and search business indicators.
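One way such offline indicator scripts might compute a recall metric over the sampled JDs (an illustrative assumption; the experiment's actual scripts and data are not reproduced here) is:

```python
def recall_at_k(recalled_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k of the recall queue."""
    if not relevant_ids:
        return 0.0
    hits = len(set(recalled_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Toy example: a long recall queue is truncated to k for evaluation.
queue = [101, 7, 42, 350, 9]
ground_truth = [42, 500]
print(recall_at_k(queue, ground_truth, k=5))   # 0.5
```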
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted only for clarity, and the specification should be taken as a whole. The embodiments described in the examples may be combined appropriately to form other implementations that will be apparent to those skilled in the art.

Claims (9)

1. A vector recall method comprising the steps of:
(1) Constructing a neural network whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word; the output words may be words adjacent to the center word, related words appearing near it, or unrelated words;
(2) Training the network with a vector recall model, where the parameter matrix of the neural network represents the features of the input text and the output is the word vectors;
(3) Continuously updating by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words.
2. The vector recall method of claim 1 wherein:
before constructing the neural network in step (1), constructing a neural network library system which comprises a DGgraph distributed database, a Type system, an API (application program interface) and a Graph Engine;
the DGgraph distributed database is provided with a MetaData Store MetaData storage area and an Index Store Index storage area;
the DGgraph distributed database is respectively connected with the MetaData Store MetaData storage area and the Index Store Index storage area;
the DGgraph distributed database, the Type system, the API and the Graph Engine are sequentially connected.
3. The vector recall method of claim 2 wherein:
the probability calculation rule in step (1) is as follows:
setting a rule filter, which comprises a rule executor, an input-parameter rule matcher, a scene-planning filter, a rule-relation maintainer, a model adapter and an execution-result analyzer;
the rule executor is connected to the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer respectively; the input-parameter rule matcher, the scene-planning filter and the rule-relation maintainer are each connected to the model adapter, and the model adapter is connected to the execution-result analyzer;
the rule executor comprises a simple rule executor, a parallel-planning executor, a rule-executor selector, a Drools rule executor, a basic-data rule executor and an NLP rule executor;
the simple rule executor, the parallel-planning executor, the rule-executor selector, the Drools rule executor, the basic-data rule executor and the NLP rule executor are arranged in parallel.
4. A vector recall method according to claim 3 wherein:
the neural network in step (1) optimizes data transmission between different layers using the following algorithm:
in the formula, the quantity on the left is the parameter value optimized for data transmission, p0 is the vertex of the PC-side network node, sj is the vertex of the APP-side network node, and s' is the vertex of the third-party-application network node.
5. The vector recall method of claim 4 wherein:
the vector recall model described in step (2) uses a double-tower model, whose pattern for data indexing is as follows:
a hash calculation is carried out on the data content to obtain the data identifier of the data;
an index-node matching layer is constructed from the data identifiers; it maps each matching primary key value to a matching index node, and is composed of a plurality of index-node matching tables, each consisting of the related primary key values and key-value pairs;
the data identifier of a piece of data serves as the primary key value of the pointer block, and the index node of the data serves as the key value of the pointer block;
when the data identifier obtained for subsequent data does not exist in the index-node matching layer, a new index-node matching table is generated from that identifier and inserted into the index-node matching layer;
when the data identifier obtained for subsequent data already exists in the index-node matching layer, the subsequent data is pointed to the corresponding index node.
6. The vector recall method of claim 5 wherein:
the construction method of the DGgraph distributed database comprises the following steps:
a DGgraph hypergraph is built over the relation tables in the database cluster;
wherein each DGgraph hyperedge Ei ∈ E is defined as Ei = {T(Ei), H(Ei), Ω(Ei)},
wherein T(Ei).Ai = tsΣ(E) and H(Ei).Ai = hΣ(E),
each containing the IDs of all related tuples; a functional dependency X → Y means that tuples with the same X value must also have the same Y value;
the data in the relation table are divided into different equivalence classes according to the value of X; within each equivalence class, all members have the same X value, and their Y values are the same or different;
two types of hyperedges exist in the DGgraph: one has only one head node, with |H(E)| = 1, and the other has multiple head nodes, with |H(E)| > 1; these are B-arc edges and edges, respectively;
in the DGgraph, the presence of one or more such edges means that at least one functional dependency maps the left-hand attribute of the equivalence class to the right-hand attributes of the equivalence classes;
wherein X and Y are attributes in a relation table of the database cluster, Σ is the functional-dependency set, E is the set of hyperedges in the hypergraph, H(E) denotes the head node of a hyperedge, T(E) denotes the tail node of a hyperedge, t is a tuple in table R, Ai is an attribute name, and Ai ∈ U.
7. The vector recall method of claim 6 wherein:
the specific way of outputting the word vectors in step (2) is as follows:
setting a model adapter for starting a mathematical model, and inputting data into the mathematical model to obtain a calculation result;
the Drools rule executor in the double-tower model is used to connect to the NLP rule executor in the mathematical model.
8. The vector recall method of claim 7 wherein:
the scheduling algorithm used by the API (application program interface) for data in the gradient-descent method of step (3) is as follows:
wherein P_{i,t} is the allocation strength of the i-th interface in period t, P_{m,0} is the output for processing the m-th piece of network data before processing, P_{n,0} is the output for processing the n-th piece of network data before processing, P'_{m,t} is the allocation strength for the m-th piece of network data in period t, and P'_{n,t} is the allocation strength for the n-th piece of network data in period t.
9. A vector recall device, comprising:
a neural-network construction module, whose input is a given word (the center word) in a sentence and whose output is the probability of all other words appearing around the center word; the output words may be words adjacent to the center word, related words appearing near it, or unrelated words;
a training module, which trains the network with a vector recall model, wherein the parameter matrix of the neural network represents the features of the input text and the output is the word vectors;
and an updating module, which continuously updates the word vectors by gradient descent so that the product of the output-word probabilities is maximized, thereby obtaining the word vectors of all words.
CN202310988298.0A 2023-08-07 2023-08-07 Vector recall method and device Active CN117056459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310988298.0A CN117056459B (en) 2023-08-07 2023-08-07 Vector recall method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310988298.0A CN117056459B (en) 2023-08-07 2023-08-07 Vector recall method and device

Publications (2)

Publication Number Publication Date
CN117056459A true CN117056459A (en) 2023-11-14
CN117056459B CN117056459B (en) 2024-05-10

Family

ID=88667088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310988298.0A Active CN117056459B (en) 2023-08-07 2023-08-07 Vector recall method and device

Country Status (1)

Country Link
CN (1) CN117056459B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118034147A (en) * 2024-03-08 2024-05-14 未来城市(上海)建筑规划设计有限公司 Building control system for wireless communication

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220019739A1 (en) * 2019-02-21 2022-01-20 Beijing Jingdong Shangke Information Technology Co., Ltd. Item Recall Method and System, Electronic Device and Readable Storage Medium
CN111782975A (en) * 2020-06-28 2020-10-16 北京百度网讯科技有限公司 Retrieval method and device and electronic equipment
CN113468317A (en) * 2021-06-26 2021-10-01 北京网聘咨询有限公司 Resume screening method, system, equipment and storage medium
WO2023040516A1 (en) * 2021-09-18 2023-03-23 腾讯科技(深圳)有限公司 Event integration method and apparatus, and electronic device, computer-readable storage medium and computer program product
CN113849492A (en) * 2021-09-23 2021-12-28 北京网聘咨询有限公司 System for providing standardized data quality check for multi-scenario service
CN114820134A (en) * 2022-05-12 2022-07-29 北京沃东天骏信息技术有限公司 Commodity information recall method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN117056459B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
JP7468929B2 (en) How to acquire geographical knowledge
CN104318340B (en) Information visualization methods and intelligent visible analysis system based on text resume information
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN104239501B (en) Mass video semantic annotation method based on Spark
CN106951558B (en) Data processing method of tax intelligent consultation platform based on deep search
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111581401A (en) Local citation recommendation system and method based on depth correlation matching
CN111767325B (en) Multi-source data deep fusion method based on deep learning
CN112100397A (en) Electric power plan knowledge graph construction method and system based on bidirectional gating circulation unit
CN117056459B (en) Vector recall method and device
CN116975256B (en) Method and system for processing multisource information in construction process of underground factory building of pumped storage power station
CN113946686A (en) Electric power marketing knowledge map construction method and system
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
CN115222048A (en) Training method, device, equipment and medium for document abstract generation model
Tapsai et al. Natural language interface to database for data retrieval and processing
CN112732944A (en) New method for text retrieval
CN110674293B (en) Text classification method based on semantic migration
CN115795018B (en) Multi-strategy intelligent search question-answering method and system for power grid field
CN111985204A (en) Customs import and export commodity tax number prediction method
CN111581365A (en) Predicate extraction method
CN111339258A (en) University computer basic exercise recommendation method based on knowledge graph
CN115688785A (en) Multi-source knowledge fused aviation equipment model named entity identification method
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN106156259A (en) A kind of user behavior information displaying method and system
CN115238075A (en) Text emotion classification method based on hypergraph pooling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant