CN117874173A - Training method and related device of vector model - Google Patents

Training method and related device of vector model

Info

Publication number
CN117874173A
CN117874173A
Authority
CN
China
Prior art keywords
text
model
vector
document
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410273478.5A
Other languages
Chinese (zh)
Inventor
陈春全 (Chen Chunquan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410273478.5A priority Critical patent/CN117874173A/en
Publication of CN117874173A publication Critical patent/CN117874173A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a training method and a related device for a vector model, applicable to cloud technology, artificial intelligence, vehicle-mounted scenarios and other scenarios. The method includes: acquiring first text training data; acquiring a network model to be trained, whose position-coding embedding matrix includes an original position code and an extended position code, the extended position code extending the original position code so that the sequence length of the position-coding embedding matrix is extended from the original sequence length to a target sequence length; initializing the model parameters of the network model to be trained to obtain an initial network model; and training the extended position code of the initial network model with the first text training data to obtain a target vector model. The method requires neither segmentation nor sliding-window processing when handling long documents, and thus avoids degrading retrieval performance through loss of context information.

Description

Training method and related device of vector model
Technical Field
The present disclosure relates to the field of computers, and in particular, to a training method for a vector model and a related device.
Background
Vector retrieval is an efficient information retrieval method: a document and a query text are represented as vectors, and the similarity between the vectors is used to quickly find relevant documents. Vector retrieval has broad application across products; for example, it can be used in scenarios such as search engines and question-answering systems.
Various deep learning models are currently available for vector retrieval; however, these deep learning models cannot directly process long documents whose sequence length exceeds 512. For this reason, segmentation or a sliding-window method is generally adopted for long-document retrieval.
When long documents are processed by segmentation or a sliding window, dividing the long document into shorter sub-segments or applying a sliding window may cause some loss of context information. Especially when there is a close semantic association between sub-segments, such division may split the related information and thereby degrade retrieval performance.
Disclosure of Invention
In order to solve the above technical problems, the application provides a training method of a vector model and a related device, so as to solve the problem that retrieval performance is degraded when related information is split apart.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a training method of a vector model, where the method includes:
acquiring first text training data composed of a first query text and a target document with an association relation with the first query text;
acquiring a network model to be trained, wherein a position code embedded matrix of the network model to be trained comprises an original position code and an extended position code, the sequence length of the original position code is the original sequence length, the extended position code is used for expanding the original position code, and the original position code is expanded through the extended position code so that the sequence length of the position code embedded matrix is expanded from the original sequence length to a target sequence length, and the target sequence length is larger than the original sequence length;
initializing model parameters of the network model to be trained to obtain an initial network model, wherein initialization parameter values of extended position codes in the initial network model are obtained by random initialization, initialization parameter values of residual model parameters in the initial network model are obtained by initializing model parameters of an open source vector model which is pre-trained, the residual model parameters are model parameters except the extended position codes in all model parameters of the initial network model, and the residual model parameters comprise the original position codes;
And training the expansion position code of the initial network model by using the first text training data to obtain a target vector model.
In one aspect, an embodiment of the present application provides a training device for a vector model, where the device includes a first obtaining unit, a second obtaining unit, an initializing unit, and a training unit:
the first acquisition unit is used for acquiring first text training data composed of a first query text and a target document with an association relation with the first query text;
the second obtaining unit is configured to obtain a network model to be trained, where a position code embedding matrix of the network model to be trained includes an original position code and an extended position code, a sequence length of the original position code is an original sequence length, the extended position code is a position code for extending the original position code, and the original position code is extended by the extended position code so that the sequence length of the position code embedding matrix is extended from the original sequence length to a target sequence length, and the target sequence length is greater than the original sequence length;
the initialization unit is configured to initialize model parameters of the network model to be trained to obtain an initial network model, initialization parameter values of extended position codes in the initial network model are obtained by random initialization, initialization parameter values of remaining model parameters in the initial network model are obtained by initializing model parameters of an open source vector model after pre-training, the remaining model parameters are model parameters except the extended position codes in all model parameters of the initial network model, and the remaining model parameters include the original position codes;
The training unit is used for training the expansion position code of the initial network model by utilizing the first text training data to obtain a target vector model.
In one aspect, embodiments of the present application provide a computer device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is configured to perform the method of any of the preceding aspects according to instructions in the computer program.
In one aspect, embodiments of the present application provide a computer-readable storage medium for storing a computer program which, when executed by a processor, causes the processor to perform the method of any one of the preceding aspects.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the preceding aspects.
According to the technical scheme, on the basis of the existing model structure, the original position code in the position-coding embedding matrix, whose sequence length is the original sequence length, is extended by the extended position code, so that the sequence length of the position-coding embedding matrix is extended from the original sequence length to the target sequence length. Because the target sequence length is greater than the original sequence length, a target vector model trained from the network model to be trained can process long documents whose sequence length exceeds the original sequence length. During training, the model parameters of the network model to be trained are initialized to obtain an initial network model. The initialization parameter values of the extended position code in the initial network model are obtained by random initialization, while the initialization parameter values of the remaining model parameters, that is, all model parameters of the initial network model other than the extended position code, including the original position code, are obtained from the model parameters of a pre-trained open-source vector model. The already pre-trained remaining model parameters can therefore be frozen during training, and only the extended position code of the initial network model is trained with the acquired first text training data to obtain the target vector model. In this way the rich language knowledge learned by the pre-trained vector model is retained, the training time is shortened, and the training speed is increased. By extending the sequence length of the position-coding embedding matrix of the model to be trained and training only part of the model parameters of the network model to be trained, the maximum sequence length of the target vector model is increased while the language knowledge learned in advance on short texts is retained and the training speed is guaranteed. As a result, neither segmentation nor sliding-window processing is needed when processing long documents, and retrieval performance is not degraded by loss of context information.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive faculty for a person skilled in the art.
Fig. 1 is an application scenario architecture diagram of a training method of a vector model according to an embodiment of the present application;
FIG. 2 is a flowchart of a training method of a vector model according to an embodiment of the present application;
fig. 3 is a schematic diagram of a model structure of a network model to be trained according to an embodiment of the present application;
FIG. 4 is a schematic diagram of preprocessing according to an embodiment of the present application;
FIG. 5 is a schematic diagram of obtaining a target document in an open source text dataset according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another method for obtaining a target document in an open source text dataset according to an embodiment of the present application;
fig. 7 is a schematic diagram of calculating a fifth similarity according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a hint template according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a process for searching through a target vector model according to an embodiment of the present application;
fig. 10 is a schematic flowchart of training and use of a vector model according to an embodiment of the present application;
FIG. 11 is a block diagram of a training device for a vector model according to an embodiment of the present application;
fig. 12 is a block diagram of a terminal according to an embodiment of the present application;
fig. 13 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
Information retrieval is made more efficient by using a vector retrieval method: the document and the query text are represented as vectors, and the similarity between the vectors is used to quickly find the relevant documents; a deep learning model is generally used for vector retrieval when performing information retrieval. Deep learning models that may be used for vector retrieval include pre-trained language models (Pre-trained Language Models, PLMs) such as Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimized BERT Approach (RoBERTa).
The pre-trained language model is obtained by unsupervised training on a large-scale corpus, through which it learns rich language knowledge. When a pre-trained language model is used for vector retrieval, it generates the text vector of a document by extracting the output of a specific layer or by using a preset pooling method, which may be CLS pooling over the semantic feature (cls) vector. The pre-trained language model calculates the similarity between vectors using metrics such as cosine similarity or Euclidean distance, and finds relevant documents through the similarity between vectors.
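As an illustration of the pooling and similarity computation described above, a minimal PyTorch sketch follows; it is not taken from the patent, and the tensor shapes and names are assumptions.

```python
# Minimal sketch (assumption: the encoder's last layer returns a tensor of
# shape (batch, seq_len, hidden); names are illustrative, not from the patent).
import torch
import torch.nn.functional as F

def cls_pool(last_hidden_state: torch.Tensor) -> torch.Tensor:
    # CLS pooling: use the vector at the first ("cls") position as the text vector.
    return last_hidden_state[:, 0, :]

def rank_by_cosine(query_vec: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between one query vector (hidden,) and N document
    # vectors (N, hidden); returns document indices sorted from most to least similar.
    sims = F.cosine_similarity(query_vec.unsqueeze(0), doc_vecs, dim=-1)
    return sims.argsort(descending=True)
```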
However, the maximum sequence length of the encoder for generating text vectors in these deep learning models is usually limited to 512, so long documents with sequence length exceeding 512 cannot be processed directly, and for this reason, a segmentation processing or sliding window method is usually adopted to process long document retrieval.
The segmentation process divides a long document into a plurality of shorter sub-segments, calculates the similarity of each sub-segment with the query text, combines these similarities into a total score, and retrieves related documents using the total score. Segmentation often requires truncating or splitting the text, which may cause some loss of context information. For example, if the document "the weather today is heavy rain to torrential rain" is divided into the two sub-segments "the weather today is heavy rain" and "to torrential rain", the whole long document actually expresses "heavy rain to torrential rain", but segmentation loses the context information between the fields, so the important information "heavy rain to torrential rain" may be difficult to express, which affects the final retrieval performance.
In processing long documents with sequence lengths exceeding 512, a sliding window may be used in addition to the segmentation process. By sliding a window over the document, information of different sub-regions of the document can be captured, the similarity between each sub-region and the query text is calculated, and related documents are retrieved through these similarities. For example, in the document "the weather today is heavy rain to torrential rain; the bad weather appears in the early morning and does not affect citizens' normal travel", when the sliding window captures the information of the sub-region "the weather today is heavy rain to torrential rain", it ignores the following part "the bad weather appears in the early morning and does not affect citizens' normal travel", which affects the final retrieval performance.
Whether it is a segmentation process or a sliding window, when a document is processed, especially when a long document with a tight semantic association between sub-segments is processed, relevant information may be split due to the division of the document, which ultimately affects the retrieval performance.
To solve these problems and improve the retrieval performance on long documents, the present application extends the original position code in the position-coding embedding matrix on the basis of the existing model structure, so that the sequence length of the position-coding embedding matrix is extended from the original sequence length to the target sequence length. Because the target sequence length is greater than the original sequence length, when the target vector model obtained by training the network model to be trained with the extended position code processes a long document whose sequence length exceeds the original sequence length, the document does not need to be divided, and retrieval performance is improved by fully considering the context information.
It should be noted that, the training method of the vector model provided by the embodiment of the application can be applied to various vector retrieval scenes, the document and the query text are respectively represented as text vectors by using the vector model obtained by training, and the related document is quickly found by using the similarity between the text vectors. The vector search scenario may be, for example, a search engine, a question and answer system, and the embodiment of the present application does not limit this.
The vector model trained by the embodiment of the application can be applied to the scene of a search engine, such as a search engine for searching web pages, documents or other resources. The user inputs the query text into the search engine, and the search engine queries the query text input by the user by utilizing the vector model trained by the embodiment of the application, so that the user can more accurately find the search result related to the query text. The deep semantic relation between the query text and the document is captured through the vector model, so that the correlation between the search result and the query text is improved, and the quality of the search result is improved.
The vector model trained by the embodiment of the application can be applied to a scene of a question-answering system, such as an intelligent question-answering system or a chat robot. In the intelligent question-answering system or the chat robot, the vector model trained by the embodiment of the application can be used for searching answers most relevant to the user questions (namely the query text) in the knowledge base, and the vector model can help the intelligent question-answering system or the chat robot to more accurately understand the requirements of the user so as to provide more targeted answers.
The training method of the vector model provided by the embodiment of the application can be executed by computer equipment, and the computer equipment can be a server or a terminal, for example. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. Terminals include, but are not limited to, smart phones, computers, intelligent voice interaction devices, intelligent appliances, vehicle terminals, aircraft, and the like.
As shown in fig. 1, fig. 1 shows an application scenario architecture diagram of a training method of a vector model, the application scenario being introduced with a computer device being a server.
The application scenario may include the server 100 and the terminal 200, where the server 100 may be an independent physical server, may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides a cloud computing service. The server 100 may derive a target vector model for vector retrieval by training. The trained target vector model may be deployed on the server 100, and the server 100 may provide a vector retrieval service for the terminal 200 based on the target vector model. The terminal 200 may install an application having a vector search function, the terminal 200 may transmit a text to be queried to the server 100, the server 100 may search based on the text to be queried using a target vector model to obtain a search result, and return the search result to the terminal 200. Terminal 200 includes, but is not limited to, a smart phone, a computer, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like. The server 100 and the terminal 200 may be directly or indirectly connected through wired or wireless communication, which is not limited herein. For example, the server 100 and the terminal 200 may be connected through a network, which may be a wired or wireless network.
Specifically, the server 100 may obtain first text training data including a first query text and a target document having an association with the first query text, where the query text may be a text input by a user and used for reflecting a search intention of the user, and the target document may be a document that replies to the query text and has an association with the query text. The sequence length of the target document is not limited, and the sequence length of the target document can exceed the original sequence length and be equal to the original sequence length or can be shorter than the original sequence length. The target document having an association with the first query text may be referred to as a positive sample of the first query text.
The server 100 may obtain a network model to be trained, where the position codes of the network model to be trained are embedded in a matrix that includes the original position codes and the extended position codes. The position-coding embedding matrix may be an embedding matrix that represents words in the document with position information. The original position code may be a position code of a position code embedding matrix of an existing model structure, and the length of the original position code is an original sequence length, so that a document with the sequence length within the original sequence length is represented by position information. The extended position code is a position code that extends the original position code for representing a portion of the document beyond the length of the original sequence with position information. The sequence length of the position code embedded matrix obtained by expanding the original position code through the expansion position code is the target sequence length, and the target sequence length is larger than the original sequence length.
In the related art, the pre-trained language model sets a maximum sequence length limit, that is, the original sequence length, at design and training time, and the original sequence length may be relatively short, for example 512. That is, the position-coding embedding matrix of the pre-trained language model includes only the original position code, and the maximum sequence length of the pre-trained language model is the original sequence length. If the pre-trained language model is a BERT model, its maximum sequence length may be limited to 512, and the related art follows this limit of 512 in order to fully exploit the performance and weights of the pre-trained language model. On this basis, in order to make the trained vector model suitable for longer documents and break through the limitation of the original sequence length, the present application extends the original position code and uses, as the training basis, a network model to be trained whose position-coding embedding matrix has a sequence length equal to the target sequence length, so that long documents whose sequence length is greater than the original sequence length, even up to the target sequence length, can be processed.
Then, the network model to be trained, of which the model structure meets the vector retrieval requirement, is required to be trained. In the training process, the network model to be trained can be initialized to obtain an initial network model. When the network model to be trained is initialized, since the expansion position codes are newly added and there is no model parameter which can be referred, the initialization parameter values of the expansion position codes can be obtained through random initialization. The rest model parameters except the extended position codes form an existing model structure, and an open source vector model with pre-trained completion exists, wherein the open source vector model is a vector model with parameter disclosure, and the open source vector model can be a pre-trained language model with parameter disclosure. Therefore, in order to increase the training speed of the target vector model, the model parameters of the pre-trained open source vector model can be utilized to initialize the residual model parameters to obtain the initialization parameter values of the residual model parameters. The remaining model parameters may be the model parameters other than the extended position code among all model parameters of the initial network model (or the network model to be trained), and may include the original position code.
Then, as the open source vector model is a pre-trained vector model, the performance is better, and the parameter values of the model parameters are more accurate. Therefore, for the remaining model parameters, no additional training can be performed, only the extended position codes of the initial network model are trained by using the first text training data, the target vector model is obtained, and the training speed of the target vector model is improved by reducing the number of the model parameters to be trained. And, the language knowledge of the short text learned by the pre-trained open source vector model can be reserved through initialization.
After training of the target vector model is completed, the target vector model may be deployed on the server 100. When the user sends the text to be queried to the server 100 through the terminal 200, the server 100 may input the text to be queried sent by the terminal 200 to the target vector model to obtain a vector representation of the text to be queried. The server 100 searches for a vector representation of the document to be processed, which has the highest similarity with the vector representation of the text to be queried, by using the vector representation of the text to be queried, and the server 100 returns the document to be processed, which has the highest similarity, as a query result to the terminal 200, so as to complete the query of the user.
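The serving flow just described can be sketched as follows; this is a hypothetical outline, and the `encode` helper standing in for the trained target vector model is an assumption, not something defined in the patent.

```python
# Hypothetical sketch of the server-side retrieval flow: pre-compute one
# vector per document to be processed, then answer a query with the most
# similar document.
import torch
import torch.nn.functional as F

def build_index(documents, encode):
    # encode(text) -> 1-D vector representation (assumed helper).
    return torch.stack([encode(doc) for doc in documents])

def answer_query(text_to_query, documents, doc_vecs, encode):
    q = encode(text_to_query)
    sims = F.cosine_similarity(q.unsqueeze(0), doc_vecs, dim=-1)
    # Return the document whose vector representation has the highest similarity.
    return documents[int(sims.argmax())]
```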
Because the sequence length of the position code embedding matrix of the model to be trained is expanded, the target vector model obtained based on the training of the network model to be trained can directly process the long document with the sequence length larger than the original sequence length, and the segmentation processing or sliding window processing is not needed when the long document is processed, so that the retrieval performance is prevented from being influenced by the loss of the context information.
It should be noted that, the server for training the target vector model may be the same as or different from the server for providing the vector search service for the terminal, which is not limited in this embodiment of the present application. Fig. 1 is only an example and is not limiting of the present application.
It should be noted that the methods provided in the embodiments of the present application may relate to artificial intelligence technology, where artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, sense an environment, acquire knowledge, and use knowledge to obtain an optimal result. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include, for example, sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. The pre-training model is also called a large model and a basic model, and can be widely applied to all large-direction downstream tasks of artificial intelligence after fine adjustment. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
It may be appreciated that the training method of the vector model provided in the embodiments of the present application may involve natural language processing. Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques, and the like. The vector model trained by the embodiments of the present application can be applied to scenarios such as search engines and question-answering systems, and technologies such as text processing and semantic understanding can be used when training the vector model.
When training the vector model, machine Learning (ML) may also be involved, which is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like. In the embodiment of the application, the target vector model can be obtained by using machine learning training.
It should be noted that, in the specific embodiment of the present application, relevant data such as user information may be involved in the whole process, and when the above embodiments of the present application are applied to specific products or technologies, individual consent or individual permission of the user needs to be obtained, and the collection, use and processing of relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
Next, taking an example that the computer device is a server, a training method of the vector model provided in the embodiment of the present application will be described with reference to the accompanying drawings. Referring to fig. 2, fig. 2 shows a flowchart of a training method of a vector model, the method comprising:
s201, acquiring first text training data composed of a first query text and a target document with an association relation with the first query text.
The server acquires first text training data, wherein the first text training data is composed of a first query text and a target document with an association relation with the first query text. The first query text may be text for embodying the search intention, and the target document may be a document having an association relationship with the first query text in reply to the first query text. The target document having an association with the first query text may be referred to as a positive sample of the first query text.
In an embodiment of the present application, the first text training data may take a plurality of different forms, which are not limited in this application. In one possible implementation, the first text training data may be expressed as a two-tuple: if query denotes the first query text and passage denotes the target document having an association with the first query text, the first text training data may be expressed as (query, passage) or (passage, query).
The sequence length of the first query text is not limited, and the sequence length of the first query text may exceed the original sequence length, be equal to the original sequence length, or be shorter than the original sequence length. In one possible implementation, the sequence length of the first query text is shorter, even much shorter, than the original sequence length, and the following description will take as an example that the sequence length of the first query text is smaller than the original sequence length. For example, the original sequence length is 512, and in combination with actual usage scenarios, such as search engine scenarios, the sequence length of the first query text is typically relatively short, and the sequence length of the first query text does not exceed 512, hardly exceeds 128, or even does not exceed 64.
The sequence length of the target document may exceed the original sequence length and be equal to the original sequence length, or may be shorter than the original sequence length. In order to improve the processing performance of the vector model obtained by training on long documents, the embodiment of the application mainly takes the case that the sequence length of the target document exceeds the original sequence length as an example. Because the text sequence length of the target document in the first text training data is larger than the original sequence length, the first text training data can make the model pay more attention to the learning of the long document when training the model, and the processing performance of the vector model obtained by training on the long document is improved.
When the text sequence length of the target document having an association with the first query text is greater than the original sequence length, in order to ensure that first text training data containing such a target document can be constructed, the server may first determine the target document from the documents whose text sequence length is greater than the original sequence length, determine the first query text having an association with the target document, and then construct the first text training data based on the first query text and the target document. For example, suppose there are a plurality of documents, each with a certain text sequence length and a corresponding query text, and the original sequence length is 512. The server may determine, from the documents whose text sequence length is greater than 512, the target document "AA has a long culture and history, and is the economic center and cultural center of the country", and determine the first query text "Where is the capital of XX?". The server then constructs the target document and the first query text as first text training data in the form of a two-tuple ("Where is the capital of XX?", "AA has a long culture and history, and is the economic center and cultural center of the country").
In the method and device provided by the embodiments of the present application, screening is performed based on the text sequence length, so that the text sequence length of the target document in the constructed first text training data is greater than the original sequence length, that is, the target document is a long document. Training the model with first text training data constructed from long documents allows the model to fully learn knowledge related to long documents and improves the processing performance of the trained vector model on long documents.
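A possible way to assemble such training data is sketched below in plain Python; the helper names are assumptions, and the 512 threshold follows the example above.

```python
# Sketch: keep only (query, document) pairs whose document exceeds the
# original sequence length after tokenization, and store them as two-tuples.
ORIGINAL_SEQ_LEN = 512

def build_first_text_training_data(corpus, tokenize):
    """corpus: iterable of (query_text, document); tokenize: text -> list of tokens."""
    training_data = []
    for query_text, document in corpus:
        if len(tokenize(document)) > ORIGINAL_SEQ_LEN:
            # (query, passage) two-tuple form mentioned in the description
            training_data.append((query_text, document))
    return training_data
```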
S202, acquiring a network model to be trained.
The server acquires a network model to be trained, and the position code embedding matrix of the network model to be trained comprises an original position code and an extended position code. The position code embedding matrix may be an embedding matrix in which words in a document are represented by position information, and the original position code may be a position code of a position code embedding matrix of an existing model structure, and the length of the original position code is the length of an original sequence. The extended position code is a position code that extends the original position code for representing a portion of the document beyond the length of the original sequence with position information. After the original position codes are expanded through the expansion position codes, the sequence length of the position code embedded matrix is expanded from the original sequence length to the target sequence length, and the target sequence length is larger than the original sequence length.
The maximum sequence length of the open-source vector models provided by the related art is limited to the original sequence length, for example 512. If the open-source vector model is a general sentence vector model, the shape of its position-coding embedding matrix is (512, 1024), and this matrix can only process and encode documents whose text sequence length is within 512. The position-coding embedding matrix of the network model to be trained is obtained after extension by the extended position code; if the target sequence length obtained after extending the original position code through the extended position code is 2048, the shape of the position-coding embedding matrix of the network model to be trained may be (2048, 1024), and this matrix can process and encode documents whose text sequence length is within 2048. Compared with the open-source vector models provided by the related art, long documents with longer text sequence lengths can be processed.
It should be noted that, in general, both the open-source vector model and the network model to be trained mainly handle dense vectors; a dense vector is a vector represented with an array data structure in which most elements are nonzero.
The network model to be trained may adopt a variety of model structures. In one possible implementation, the network model to be trained may adopt a Transformer model structure, and one network layer of the network model to be trained includes two sub-network layers, namely a multi-head self-attention (Multi-Head Self-Attention, MHSA) layer and a feed-forward neural network (Feed-Forward Neural Network, FFNN) layer, each of which is followed by a residual connection (Residual Connection, RC) and layer normalization (Layer Normalization, LN). The network model to be trained may be formed by stacking a plurality of identical network layers as described above.
FIG. 3 is a schematic diagram of a network model to be trained according to an embodiment of the present application, where cls represents the beginning of a document and sep represents the end of the document. x0 to x5 represent different parts of the document input into the network model to be trained, and each Transformer Block represents one network layer; the network model to be trained in FIG. 3 has two network layers. The document is processed by the two network layers and then input into an average pooling layer, which outputs the corresponding text vector.
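The structure around Fig. 3 can be sketched roughly as follows; this is a PyTorch sketch under assumed sizes (hidden width 1024, two encoder layers, a 2048-row position-coding embedding matrix, average pooling over non-padding tokens), not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class VectorEncoderSketch(nn.Module):
    # Illustrative sizes only; the vocabulary size and head count are assumptions.
    def __init__(self, vocab_size=30000, hidden=1024, n_layers=2, n_heads=8,
                 target_seq_len=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Position-coding embedding matrix with the extended target length.
        self.pos_emb = nn.Embedding(target_seq_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, input_ids, attention_mask):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        x = self.layers(x, src_key_padding_mask=(attention_mask == 0))
        # Average pooling over non-padding tokens gives the text vector.
        mask = attention_mask.unsqueeze(-1).float()
        return (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```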
Different network layers in the network model to be trained can play different roles, for example, a multi-head self-attention mechanism layer is used for calculating the association degree between each word and other words in an input document or query text, and long-distance dependency relationship in the document is captured by utilizing the association degree between the word and the other words, and the long-distance dependency relationship refers to the connection between different words in the document. For example, in the document "i drive out today, it can drive in soon," it "refers to the" car "in front, and the relationship between" it "and" car "in the document is a long distance dependency. Because the multi-head self-attention mechanism layer is provided with a multi-head mechanism, the multi-head mechanism can enable the network model to be trained to pay attention to information of different positions of the document at the same time.
As another example, a feed-forward neural network layer is used to extract local features of a text sequence of an input document, and typically comprises two fully connected layers and an activation function.
In order to improve the performance of the vector model obtained by training, the network model to be trained can adopt a bidirectional attention mechanism, and the bidirectional attention mechanism can simultaneously consider the contextual information of the vocabulary when encoding the document. The bi-directional attention mechanism enables the network model to be trained to better capture semantic information in context when interpreting the document.
It should be noted that, in the embodiment of the present application, the execution order of S201 and S202 is not limited, and S201 may be executed first and then S202 may be executed, S202 may be executed first and then S201 may be executed, or S201 and S202 may be executed in parallel.
S203, initializing model parameters of the network model to be trained to obtain an initial network model.
The initialization parameter values of the extended position codes in the initial network model are obtained through random initialization, and the random initialization can be a method for randomly determining the initialization parameter values of the extended position codes by using numerical values in a preset range. The initialization parameter values of the remaining model parameters in the initial network model may be obtained by initializing model parameters based on the pre-trained open source vector model, where the remaining model parameters are model parameters except for extended position codes in all model parameters of the initial network model. Taking an original sequence length of 512 and a target sequence length of 2048 as an example, the original position codes of 0-512 in the initial network model can be initialized through model parameters of the pre-trained open source vector model, and the extended position codes of 512-2048 can be initialized randomly.
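A sketch of this initialization follows, under the assumption that the model is implemented in PyTorch as in the encoder sketch above; the `pretrained` object and the 0.02 standard deviation are illustrative assumptions.

```python
import torch

def init_from_pretrained(model, pretrained, original_len=512):
    # Remaining model parameters (token embeddings, Transformer layers, and
    # the original position codes) are taken from the pre-trained open-source
    # vector model; only the embedding copies are shown here.
    with torch.no_grad():
        model.token_emb.weight.copy_(pretrained.token_emb.weight)
        model.pos_emb.weight[:original_len].copy_(pretrained.pos_emb.weight)
        # Extended position codes (rows original_len .. target_len-1) are
        # randomly initialized, e.g. from a small normal distribution.
        model.pos_emb.weight[original_len:].normal_(mean=0.0, std=0.02)
    return model
```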
The pre-trained open source vector model may have a variety of options, such as a pre-trained chinese-english semantic vector model (BAAI General Embedding, BGE) or a pre-trained generic sentence vector model (General Text Embeddings, GTE). Wherein the position coding embedded matrix of the generic sentence vector model is a trainable absolute position coding matrix. The structure of the pre-trained open source vector model may be a bi-directional encoder representation (Bidirectional Encoder Representations from Transformers, BERT).
Model parameters other than extended position coding may be initialized using a pre-trained open source vector model. Furthermore, since the maximum sequence length of the open source vector model is limited, the maximum sequence length of the open source vector model corresponds to the original sequence length of the network model to be trained, and the original position coding of the network model to be trained can be initialized through the open source vector model.
In the method, the model parameters of the network model to be trained are initialized based on the pre-trained open source vector model, and the network model to be trained can fully utilize the vector representation capability of the pre-trained open source vector model on the short text through initialization. Meanwhile, because the model parameters in the pre-trained open source vector model are relatively accurate, the model parameters do not need to be greatly adjusted in the subsequent training process, and the time for obtaining the vector model through training is shortened.
S204, training the expansion position codes of the initial network model by using the first text training data to obtain a target vector model.
And training the expansion position code of the initial network model by the server by using the first text training data to obtain a target vector model.
When the server trains by using the first text training data, the first text training data can be preprocessed, and the first text training data is converted into a form supported by the initial network model. Preprocessing the first text training data may be performed in a variety of ways, and in one possible implementation, the server may perform preprocessing on the first query text and the target document in the first text training data, respectively. The preprocessing may include word segmentation and indexing, where indexing may be a process of representing words obtained by word segmentation in a form supported by an initial network model.
Next, taking preprocessing of the first query text as an example, during preprocessing, the server adds a special symbol "cls" at the beginning of the first query text, where the cls is used to characterize the beginning of the first query text; a special symbol "sep" is added at the end of the first query text, which is used to characterize the end of the first query text. And then the server performs word segmentation and indexing processing on the first query text.
Fig. 4 is a schematic diagram of preprocessing provided in an embodiment of the present application, where the first query text is "Where is the capital of XX?" (XX的首都是哪里?). Word segmentation yields ['cls', 'X', 'X', '的', '首', '都', '是', '哪', '里', '?', 'sep'], and indexing the resulting tokens yields [2, xxx, xxxx, 4638, 7674, 6963, 3221, 1525, 7027, 8043, 3]. The token "cls" obtained by word segmentation becomes "2" after indexing; in this way the server represents each token obtained by word segmentation in the form supported by the initial network model.
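The preprocessing step can be sketched as follows; the `segment` and `vocab` helpers are assumptions used for illustration and are not named in the patent.

```python
def preprocess(text, segment, vocab):
    # Add the special symbols marking the beginning and end of the text,
    # segment the text into tokens, then index each token into the form
    # supported by the initial network model (e.g. "cls" -> 2 in the example above).
    tokens = ["cls"] + segment(text) + ["sep"]
    return [vocab[token] for token in tokens]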
The server may train the extended position code of the initial network model in a plurality of ways, and the method for training the extended position code of the initial network model is not limited in this application.
In one possible implementation, the server may train the extended position code of the initial network model directly with the first text training data to obtain the target vector model.
In another possible implementation, the server may employ a staged training approach, where the first stage trains part of the model parameters and the second stage trains all of the model parameters. Specifically, the server may train the extended position code of the initial network model with the first text training data to obtain an intermediate network model, completing the first stage. Then, the server fine-tunes all model parameters of the intermediate network model with acquired second text training data to obtain the target vector model. The second text training data acquired by the server is composed of a second query text and a target document having an association with the second query text; the first text training data and the second text training data may be the same or different.
In the method provided by the embodiments of the present application, the target vector model can be obtained through two-stage training. In the first stage, the extended position code of the initial network model learns knowledge related to long documents from the first text training data, improving the target vector model's ability to process long documents. During the first stage, only the extended position code is trained; the remaining model parameters of the initial network model do not participate in training. Training only the extended position code, on one hand, retains the rich language knowledge about short-document processing already learned by the initial network model and avoids damaging its short-document performance during learning; on the other hand, it reduces the number of model parameters that need to be updated, which speeds up training. The second stage of training allows the model parameters corresponding to the original position code and those corresponding to the extended position code of the target vector model to be better fused. Because the second stage only fine-tunes the model parameters, the overall speed of training the target vector model is increased while the ability to process long documents is ensured. A sketch of one way to set up the two stages is given after this paragraph.
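One way to realize the two training stages, assuming the PyTorch encoder sketch above; the gradient-masking hook is an illustrative device for updating only the extended rows of the position embedding, not something prescribed by the patent.

```python
def set_stage_one(model, original_len=512):
    # Stage one: freeze the remaining model parameters and train only the
    # extended position codes.
    for p in model.parameters():
        p.requires_grad = False
    model.pos_emb.weight.requires_grad = True

    def mask_original_rows(grad):
        # Keep the original (pre-trained) position codes unchanged by zeroing
        # their gradient; only rows original_len onward receive updates.
        masked = grad.clone()
        masked[:original_len] = 0
        return masked

    model.pos_emb.weight.register_hook(mask_original_rows)

def set_stage_two(model):
    # Stage two: fine-tune all model parameters of the intermediate model.
    for p in model.parameters():
        p.requires_grad = True
```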
The server can train the extended position code of the initial network model in a number of different ways to obtain the intermediate network model, and the way of obtaining the intermediate network model through training is not limited in this application. In one possible implementation, the server may train with a plurality of pieces of first text training data to obtain the intermediate network model, where each piece of first text training data includes a first query text and a target document having an association with that first query text. When training with multiple pieces of first text training data, contrastive learning, which refers to identifying and distinguishing the relevant positive sample text from other irrelevant negative sample texts, can be used as the training objective of the intermediate network model.
First, for the first query text in each piece of first text training data, the server may acquire positive and negative samples of that first query text: the target document belonging to the same piece of first text training data as the first query text is a positive sample of the first query text, and a target document not belonging to the same piece of first text training data is a negative sample of the first query text.
Second, the server may input the first query text, a positive sample of the first query text, and a negative sample of the first query text into the initial network model, which outputs a first text vector corresponding to the first query text, a second text vector corresponding to the positive sample of the first query text, and a third text vector corresponding to the negative sample of the first query text.
Again, the server may calculate a first similarity between the first query text and positive samples of the first query text based on the first text vector and the second text vector. The server may also calculate a second similarity between the first query text and the negative sample of the first query text based on the first text vector and the third text vector.
The server may then construct a first loss function based on the first similarity and the second similarity. The server trains the extended position codes of the initial network model based on the first loss function to obtain an intermediate network model.
Take as an example the server acquiring n pieces of first text training data. To distinguish each piece, the n pieces of first text training data may be expressed as {(q_i, p_i)}, i = 1, ..., n, where q_i is the first query text in the i-th piece of first text training data and p_i is the target document in the i-th piece of first text training data. For the i-th piece of first text training data (q_i, p_i), the server may collect the negative samples corresponding to the first query text q_i. The set of negative samples corresponding to q_i may be denoted {p⁻_{i,1}, ..., p⁻_{i,m}}, where m is the number of negative samples collected by the server for q_i, p⁻_{i,j} denotes the j-th negative sample corresponding to q_i, and j is a positive integer less than or equal to m.
Second, the server will first query text q i Positive sample p of first query text i And a negative sample of the first query textAnd inputting an initial network model, and outputting a first text vector, a second text vector and a third text vector through the initial network model.
Again, the server may calculate the first query text q based on the first text vector and the second text vector i Positive sample p with first query text i The first similarity therebetween is not limited in the manner of calculating the first similarity, and may be a functionCalculate q i And p i A first similarity between the two, q is calculated by the function i And p i At a first similarity between the first text vector and the second text vector is input. And, the server may calculate the first query text q based on the first text vector and the third text vector i Negative sample of first query text +.>A second degree of similarity therebetween. The way of calculating the second similarity is likewise not limited in this application, for example by the function +.>Calculate q i And->A second similarity between the two, q is calculated by the function i And->In the case of the second similarity, the first text vector and the third text vector are input.
The server may then construct a first loss function from the first similarity and the second similarity, and the first loss function may be expressed by the following formula:

L_cont = -(1/n) · Σ_{i=1..n} log [ exp(s(q_i, p_i)) / ( exp(s(q_i, p_i)) + Σ_{j=1..m} exp(s(q_i, p_ij^-)) ) ]

wherein L_cont represents the first loss function, n is the number of pieces of first text training data, q_i represents the first query text in the i-th piece of first text training data, p_i represents the target document in the i-th piece of first text training data, i.e. the positive sample of the first query text in the i-th piece of first text training data, i is a positive integer less than or equal to n, p_ij^- represents the j-th negative sample corresponding to q_i, j is a positive integer less than or equal to m, s(q_i, p_i) represents the first similarity, and s(q_i, p_ij^-) represents the second similarity.
The function s(·,·) may take a number of different forms. In one possible implementation, s(·,·) may be expressed by the following formula:

s(q, p) = cos(E_q, E_p) / τ

wherein E_q is the vector representation corresponding to the first query text q output by the last network layer of the initial network model, E_p is the vector representation of a positive sample or a negative sample of the first query text q output by the last network layer of the initial network model, and τ is the temperature coefficient used for scaling.

In the embodiment of the application, if the first similarity is calculated through the function s(·,·), E_q may be the first text vector of the first query text q_i in the i-th piece of first text training data and E_p may be the second text vector. If the second similarity is calculated through the function s(·,·), E_q may be the first text vector of the first query text q_i in the i-th piece of first text training data and E_p may be the third text vector.
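For illustration only, the following is a minimal sketch of this contrastive objective in Python, assuming PyTorch-style tensors and a temperature-scaled cosine similarity; the function names and tensor shapes are assumptions rather than the exact implementation of the application.

```python
import torch
import torch.nn.functional as F

def similarity(e_q: torch.Tensor, e_p: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    # s(q, p): temperature-scaled cosine similarity between the query vector and the
    # vector of a positive or negative sample; tau is the temperature coefficient.
    return F.cosine_similarity(e_q, e_p, dim=-1) / tau

def first_loss(e_q, e_pos, e_negs, tau: float = 0.05) -> torch.Tensor:
    """Contrastive first loss L_cont.

    e_q:    (n, d)    first text vectors of the n first query texts
    e_pos:  (n, d)    second text vectors (positive samples)
    e_negs: (n, m, d) third text vectors (m negative samples per query)
    """
    s_pos = similarity(e_q, e_pos, tau)                      # first similarities, shape (n,)
    s_neg = similarity(e_q.unsqueeze(1), e_negs, tau)        # second similarities, shape (n, m)
    logits = torch.cat([s_pos.unsqueeze(1), s_neg], dim=1)   # positive always at index 0
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    # Cross entropy against index 0 maximises the positive's score relative to the negatives.
    return F.cross_entropy(logits, labels)
```

Minimising this loss pulls each first query text towards its positive sample and pushes it away from its negative samples in the vector space.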
It should be noted that the training of the initial network model may be a continuous process, and the above may constitute one training round of the initial network model. When the server confirms that a preset cutoff condition is reached, the training of the initial network model is stopped, so as to obtain the intermediate network model. The preset cutoff condition may be that a preset number of training rounds is reached or that the initial network model reaches a preset accuracy. When the initial network model is trained, the remaining model parameters can be frozen so that they are not adjusted during training.
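As a hedged sketch of this freezing step, the code below assumes a Transformer-style encoder whose position-embedding matrix has already been enlarged from the original sequence length to the target sequence length; the attribute name position_embeddings is hypothetical.

```python
import torch

def train_only_extended_positions(model, original_len: int):
    # Freeze every parameter of the initial network model, then allow gradients only
    # for the position-embedding matrix; a gradient hook zeroes the rows belonging to
    # the original position code so that only the extended position code is updated.
    for param in model.parameters():
        param.requires_grad = False

    pos_emb = model.position_embeddings.weight  # shape (target_len, hidden_dim), assumed attribute
    pos_emb.requires_grad = True

    def keep_only_extended_rows(grad):
        grad = grad.clone()
        grad[:original_len] = 0  # the original position code stays frozen
        return grad

    pos_emb.register_hook(keep_only_extended_rows)
    return [p for p in model.parameters() if p.requires_grad]

# trainable = train_only_extended_positions(initial_model, original_len=512)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```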
According to the method provided by the embodiment of the application, contrastive learning is used in the process of training the extended position code of the initial network model to obtain the intermediate network model. Through contrastive learning, the initial network model learns the ability to identify and distinguish the relevant positive sample documents from other irrelevant negative sample documents, which improves the performance of the intermediate network model obtained through training.
The server trains the initial network model by using the first text training data, and the intermediate network model obtained through training can process long documents whose text sequence length exceeds the original sequence length. In order to further improve the processing effect of the intermediate network model on long documents, the server can further fine-tune all model parameters of the intermediate network model using the acquired second text training data to obtain the target vector model.
The server may perform fine tuning on all model parameters of the intermediate network model in a plurality of different manners to obtain a target vector model, and the fine tuning manner is not limited in the present application. In one possible implementation, the server may train with a plurality of pieces of second text training data, each piece of second text training data including a second query text and a target document having an association with the second query text. The first text training data and the second text training data may be the same or different.
Firstly, the server can acquire a positive sample and a negative sample of the second query text, the server can acquire the positive sample and the negative sample of the second query text aiming at the second query text in each piece of second text training data, the target document of the second text training data which belongs to the same piece of second query text is the positive sample of the second query text, and the target document of the second text training data which does not belong to the same piece of second query text is the negative sample of the second query text.
Second, the server may input the second query text, a positive sample of the second query text, and a negative sample of the second query text into the intermediate network model. And outputting a fourth text vector corresponding to the second query text, a fifth text vector corresponding to the positive sample of the second query text and a sixth text vector corresponding to the negative sample of the second query text through the intermediate network model.
Again, the server may calculate a third similarity between the second query text and positive samples of the second query text based on the fourth text vector and the fifth text vector. The server may also calculate a fourth similarity between the second query text and the negative sample of the second query text based on the fourth text vector and the sixth text vector.
The server may then determine a second loss function based on the third similarity and the fourth similarity. And the server constructs a target loss function based on the second loss function, and fine-tunes all model parameters of the intermediate network model based on the target loss function to obtain a target vector model.
The process of determining the second penalty function by the server is similar to the process of determining the first penalty function, the fourth text vector is similar to the first text vector, the fifth text vector is similar to the second text vector, and the sixth text vector is similar to the third text vector. The process of the server calculating the third similarity using the fourth text vector and the fifth text vector is similar to the process of calculating the first similarity using the first text vector and the second text vector, and the process of calculating the fourth similarity using the fourth text vector and the sixth text vector is similar to the process of calculating the second similarity using the first text vector and the third text vector. The process of determining the second loss function by the server according to the third similarity and the fourth similarity is similar to the process of determining the first loss function according to the first similarity and the second similarity, and will not be described again here.
After the intermediate network model is obtained, it is further trained, and through this training the parameters corresponding to the original position code and the parameters corresponding to the extended position code can be better fused. In addition, contrastive learning is used to train the intermediate network model; through contrastive learning, the intermediate network model further learns to identify and distinguish the relevant positive sample documents from other irrelevant negative sample documents, so that the performance of the model obtained through training is further improved.
It should be noted that, the embodiment of the present application does not limit the manner of constructing the objective loss function based on the second loss function. In one possible implementation, the server may directly take the second loss function as the target loss function. When the server directly uses the second loss function as the target loss function, the process of obtaining the target vector model through training the intermediate network model is similar to the process of obtaining the intermediate network model through training the initial network model, and details are not repeated here.
It can be understood that the intermediate network model obtained through training has already learned rich language knowledge and features and has strong processing performance on long documents. However, in some cases, when all model parameters are trained, the problem of catastrophic forgetting easily occurs: the model parameters (mainly the remaining model parameters of the intermediate network model) are updated and changed too much, so that the target vector model obtained through training loses the previously learned semantic knowledge for processing short documents, and the processing effect on short documents is reduced. In order to avoid the catastrophic forgetting problem during the training of all model parameters, the embodiment of the application provides a method for constructing the target loss function based on the second loss function. The server first obtains the remaining model parameters of the intermediate network model, which are the model parameters other than the extended position code. The server then determines a model parameter loss function based on the initialization parameter values and the updated parameter values of the remaining model parameters. After the model parameter loss function is obtained, the server constructs the target loss function based on the second loss function and the model parameter loss function.
In one possible implementation, the target loss function may be a weighted sum of the second loss function and the model parameter loss function. Specifically, the target loss function may be expressed by the following formula:

L = L_cont + T · L_mse

wherein L represents the target loss function, L_cont represents the second loss function, L_mse represents the model parameter loss function, and T is the hyper-parameter that balances the second loss function and the model parameter loss function. The server can fine-tune all model parameters of the intermediate network model by minimizing the target loss function L.
The model parameter loss function L_mse may take a number of different forms. In one possible implementation, the model parameter loss function L_mse may be a mean squared error (MSE) loss. The server may calculate the mean squared error loss between the initialization parameter values and the updated parameter values of the remaining model parameters, and the mean squared error loss may be expressed by the following formula:

L_mse = (1/N) · Σ_{i=1..N} (θ_i − θ_i⁰)²

wherein L_mse represents the mean squared error loss between the initialization parameter values and the updated parameter values of the remaining model parameters, N represents the number of remaining model parameters, θ_i represents the updated parameter value of the i-th remaining model parameter, θ_i⁰ represents the initialization parameter value of the i-th remaining model parameter, and i is a positive integer less than or equal to N.
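A minimal sketch of this constraint in Python follows, assuming the initialization values of the remaining model parameters were saved before fine-tuning; the weighted-sum form L = L_cont + T·L_mse follows the description above and the helper names are illustrative.

```python
import torch

def model_param_mse(current_params, initial_params) -> torch.Tensor:
    # L_mse: mean squared error between the updated values and the initialization values
    # of the remaining model parameters (everything except the extended position code).
    total = sum(((p - p0) ** 2).sum() for p, p0 in zip(current_params, initial_params))
    count = sum(p.numel() for p in current_params)
    return total / count

def target_loss(l_cont: torch.Tensor, l_mse: torch.Tensor, t: float = 1.0) -> torch.Tensor:
    # Weighted sum of the second loss function and the model parameter loss function.
    return l_cont + t * l_mse
```

Minimising the combined loss keeps the remaining model parameters close to their initialization values while the contrastive term continues to improve long-document retrieval.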
The method provided by the embodiment of the application constrains the model parameter update so as to prevent the remaining model parameters of the intermediate network model from being updated and changed excessively, thereby avoiding catastrophic forgetting in the intermediate network model and allowing it to retain its original performance in processing short text.
According to the embodiment provided by the application, in the training process, the model parameters of the network model to be trained are initialized to obtain the initial network model. The initialization parameter values of the extended position code in the initial network model are obtained by random initialization, and the initialization parameter values of the remaining model parameters in the initial network model are obtained by initializing them with the model parameters of a pre-trained open source vector model, where the remaining model parameters are the model parameters other than the extended position code among all model parameters of the initial network model and include the original position code. Therefore, the remaining model parameters, which have already been pre-trained, can be frozen during training, and only the extended position code of the initial network model is trained by using the acquired first text training data to obtain the target vector model. In this way, the rich language knowledge learned by the pre-trained vector model can be retained, the training time is shortened, and the training speed is increased. By extending the sequence length of the position code embedding matrix of the model to be trained and training only part of the model parameters of the network model to be trained, the maximum sequence length of the target vector model is increased on the premise of retaining the previously learned language knowledge of short text and ensuring the training speed. Thus, no segmentation processing or sliding window processing is needed when processing long documents, and the retrieval performance is not affected by the loss of context information.
The above embodiment describes the training method of the vector model in detail. To complete the training of the vector model, the first text training data used for training is also very important. In the embodiment of the present application, in order to improve the processing performance of the trained vector model on long documents, the construction of the first text training data in S201 is mainly described taking the case where the text sequence length of the target document is greater than the original sequence length as an example. The server may determine a target document whose text sequence length is greater than the original sequence length in a number of different ways, and the way in which the server determines the target document is not limited in this application. In one possible implementation, the server may determine the target document from an open source text dataset. Specifically, the server may screen documents whose text sequence length is greater than the original sequence length from the open source text dataset as candidate documents; the open source text dataset may be a collection of open source text data that includes query texts and documents having an association relationship with the query texts, and a candidate document is a document waiting to be selected. After the server screens out the candidate documents, it may determine the target document based on the candidate documents. After the server determines the target document, the query text in the open source text dataset that has an association relationship with the target document may be determined as the first query text. The server then constructs the first text training data using the determined target document and the corresponding first query text. If the original sequence length is 512, the text sequence length of a candidate document may exceed 512 and may even exceed 2048.
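The following is a small sketch of this screening step, under the assumption that the open source text dataset yields (query text, document) pairs and that a tokenizer is available to measure the text sequence length; the names are illustrative only.

```python
ORIGINAL_SEQ_LEN = 512  # original sequence length separating short and long documents

def build_first_text_training_data(open_source_dataset, tokenizer):
    """Screen candidate documents longer than the original sequence length and pair
    each one with its associated query text as a piece of first text training data."""
    training_data = []
    for query_text, document in open_source_dataset:
        if len(tokenizer.encode(document)) > ORIGINAL_SEQ_LEN:
            training_data.append({"first_query_text": query_text, "target_document": document})
    return training_data
```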
In this method, the first text training data is acquired from the open source text dataset according to the text sequence length of the target document, so that first text training data for model training can be obtained conveniently and quickly.
The server may determine the target document among the candidate documents in a variety of ways, e.g., the server may directly treat the candidate document as the target document.
Fig. 5 is a schematic diagram of acquiring a target document in an open source text dataset according to an embodiment of the present application. The large rectangular box in fig. 5 represents the open source text data set, the small rectangular box represents the data in the open source text data set, and the query text and the document in the same small rectangular box have an association relationship. In the open source text dataset, the text sequence length of document a, document C and document E is greater than the original sequence length. The server screens the document A, the document C and the document E as candidate documents in the open source text data set, and the server can directly take the document A, the document C and the document E as target documents.
In some cases, the key information of a document may be concentrated at the beginning or in the first half of the document, and it may be necessary to pay attention only to the first half of the document to determine whether the document matches or is relevant to the query text, without looking at the second half. For example, a document about "XX" in the open source text dataset reads "The capital of XX is AA; XX has a long culture and history and is the economic center and cultural center of the country", and the query text having an association relationship with this document is "Where is the capital of XX?". For such a document and its corresponding query text, during training, no matter how long the text sequence length of the document is, the initial network model only needs to pay attention to the dozen or so words at the beginning of the document to judge whether the document matches the query text.
In this case, in order to prevent the key information of the target documents in the first text training data from being overly concentrated at the beginning or in the first half, and thus to prevent a vector model trained on such first text training data from taking a shortcut and paying attention only to the text information in the first half, the embodiment of the application provides another method for determining the target document among the candidate documents. Specifically, if the server screens out a plurality of candidate documents, the server may determine, for each candidate document, the position of the key information in the candidate document, where the key information is the information matching the first query text corresponding to the candidate document. By performing key information judgment with the query text having an association relationship with the candidate document, the server can determine the position of the key information in the candidate document and thereby judge whether the key information is concentrated at the beginning or in the first half of the candidate document. If the position is the target position, which may be the beginning or the first half of the candidate document, the server determines that the key information is concentrated at the beginning or in the first half, and the server may discard some of the plurality of candidate documents according to a preset proportion to obtain the target documents. The preset proportion can be set according to actual requirements; for example, it may be 50%.
Fig. 6 is a schematic diagram of another method for acquiring a target document from an open source text dataset according to an embodiment of the present application. The open source text dataset part and the candidate document part in Fig. 6 are consistent with those in Fig. 5. After screening the candidate documents, the server performs key information judgment on document A using query text A, on document C using query text C, and on document E using query text E. The judgment shows that the key information of document A is located at the target position of document A (for example, the first half of document A), the key information of document C is located at the target position of document C (for example, the first half of document C), and the key information of document E is located in the second half of document E and therefore not at the target position of document E. If the preset proportion is 50%, the server selects candidate documents to discard from document A and document C; that is, the server can randomly discard one of document A and document C. In Fig. 6, the server has randomly discarded document C, and the target documents are document A and document E.
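A hedged sketch of the discarding step is given below, assuming the position of the key information of each candidate document has already been judged (see the sketch after the Fig. 7 example for one way of doing that); the structure of the candidate records is assumed for illustration.

```python
import random

def discard_by_preset_proportion(candidates, key_at_target_position, preset_proportion=0.5):
    """Discard a preset proportion of the candidate documents whose key information
    sits at the target position (e.g. the beginning or first half); the remaining
    candidates become target documents."""
    front_loaded = [c for c in candidates if key_at_target_position[c]]
    others = [c for c in candidates if not key_at_target_position[c]]
    keep_count = round(len(front_loaded) * (1 - preset_proportion))
    kept = random.sample(front_loaded, keep_count)  # randomly keep the rest
    return kept + others

# e.g. candidates = ["document A", "document C", "document E"]
#      key_at_target_position = {"document A": True, "document C": True, "document E": False}
#      -> one of document A / document C is kept at random, plus document E
```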
When the first text training data is acquired from the open source text dataset, the text sequence length of the target document is considered and, at the same time, the position of the key information in the target document is also considered. The key information of the target documents acquired in this way is not heavily concentrated at any particular position, for example at the beginning or in the first half. A vector model trained on first text training data obtained by the method provided in the embodiment of the application can therefore pay attention to every position of a document more comprehensively, which avoids, as far as possible, the shortcut of focusing only on a specific position and gives a better effect when processing long documents.
The position of the key information in a candidate document may be determined in a number of different ways, which is not limited in the embodiment of the application. In one possible implementation, the server may segment the candidate document according to the original sequence length to obtain a plurality of text segments. The server calculates a fifth similarity between each text segment and the first query text corresponding to the candidate document; the fifth similarity is used to represent the matching degree between a text segment and the first query text, and the higher the fifth similarity, the better the corresponding text segment matches the first query text. The server determines the text segment with the highest fifth similarity as the position of the key information of the candidate document within the candidate document.
The server may calculate the fifth similarity between the text segment and the first query text corresponding to the candidate document by various methods, and the method for calculating the fifth similarity is not limited in this application. For example, the server may calculate the fifth similarity between the plurality of text segments and the first query text, respectively, using a vector model provided by a related art, which may be a chinese-english semantic vector model or a generic sentence vector model. For another example, the server may calculate the fifth similarity between the plurality of text segments and the first query text, respectively, using a similarity calculation formula.
Fig. 7 is a schematic diagram of calculating a fifth similarity according to an embodiment of the present application, where a document a is a candidate document, and a text sequence length of the document a is 2048. The server segments the document A according to the original sequence length 512 to obtain a text segment 1, a text segment 2, a text segment 3 and a text segment 4, wherein the text sequence lengths of the text segment 1, the text segment 2, the text segment 3 and the text segment 4 are 512. The server calculates fifth similarity between the four text segments and a first query text A, wherein the first query text A and the document A have an association relation. The fifth similarity between the text segment 1 and the first query text a may be represented as score1, the fifth similarity between the text segment 2 and the first query text a may be represented as score2, the fifth similarity between the text segment 3 and the first query text a may be represented as score3, the fifth similarity between the text segment 4 and the first query text a may be represented as score4, score1 is 10, score2 is 26, score3 is 80, and score4 is 20. Since score3 is highest, the server can determine text segment 3 as the location of the key information of document a in document a.
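The example in Fig. 7 can be sketched as follows, assuming a tokenizer for measuring sequence length and an existing vector model exposing a score(query, text) method; both assumptions are for illustration only.

```python
def locate_key_information(candidate_doc: str, first_query_text: str,
                           tokenizer, similarity_model, original_len: int = 512) -> int:
    """Cut the candidate document into segments of at most `original_len` tokens,
    score a fifth similarity between each segment and the first query text, and
    return the index of the best-matching segment as the key-information position."""
    token_ids = tokenizer.encode(candidate_doc)
    segments = [tokenizer.decode(token_ids[i:i + original_len])
                for i in range(0, len(token_ids), original_len)]
    scores = [similarity_model.score(first_query_text, seg) for seg in segments]
    return max(range(len(segments)), key=lambda i: scores[i])

# For document A in Fig. 7 the scores would be [10, 26, 80, 20], so index 2
# (text segment 3) is returned as the position of the key information.
```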
According to the method provided by the embodiment of the application, the candidate document is split into a plurality of text segments according to the original sequence length, and the position of the key information in the candidate document is represented by a text segment, so that the position of the key information can be located more accurately and conveniently. Meanwhile, because the original sequence length can be used to distinguish short documents from long documents, splitting by the original sequence length gives a measurable way of distinguishing the different positions of the candidate document and of reflecting whether the location of the key information in the different text segments meets the requirements of long documents, so the text segments obtained by splitting better match the actual requirements.
It was mentioned above that the server may determine target documents whose text sequence length is greater than the original sequence length in a number of different ways, for example from an open source text dataset, and further obtain the first text training data from the open source text dataset. In the process of training the initial network model, the more first text training data there is, the better the performance of the finally obtained vector model, and unsupervised documents whose text sequence length is greater than the original sequence length are easy to acquire and collect. Therefore, in order to expand the richness of the first text training data, another way of acquiring the first text training data is provided: the server can generate first text training data based on unsupervised documents. An unsupervised document refers to a document for which there is no query text having an association relationship with the document.
The non-supervised documents may be located in non-supervised corpora, so that the server may determine the target documents from the non-supervised corpora, which may be a plurality of documents collected from news, encyclopedia, articles, and other sources, without corresponding query text. Specifically, the server may determine an unsupervised document having a text sequence length greater than an original sequence length in the unsupervised corpus as the target document. For example, the original sequence length is 512, the server may determine an unsupervised document having a text sequence length greater than 512 in the unsupervised corpus as the target document, and in particular, the server may screen for an unsupervised document having a text sequence length greater than 2048.
After determining the target document, the server may generate a first query text having an association with the target document based on the target document. In the embodiment of the application, the server can determine the first query text with the association relation with the target document in various modes. For example, for a target document with a title, the server may determine the title of the target document as the first query text.
For another example, the server may generate text for the target document through a generative language model according to a prompt template, so as to obtain a first query text having an association relationship with the target document, where the prompt template includes the target document and a task instruction. The generative language model is a large language model with strong semantic understanding capability and in-context learning capability; the embodiment of the application does not limit the model structure of the generative language model, which may be, for example, a chat generative model (Chat Generative Pre-trained Transformer, ChatGPT), and in particular a fourth-generation chat generative model (GPT-4).
The prompt template (prompt) can be a prompt input to the large language model, and its function is to prompt the generative language model on how to generate text; the task instruction of the prompt template is an instruction that the generative language model understands and follows during operation. The task instruction of the prompt template may be manually written and adapted so that the generative language model generates a matching first query text for the target document using its powerful semantic understanding and instruction-following capabilities.
The prompt template may take a number of different forms. Fig. 8 is a schematic diagram of a prompt template provided in an embodiment of the present application; the top of Fig. 8 is the task instruction of the prompt template, and the bottom is the target document of the prompt template. The prompt template shown in Fig. 8 is input into a generative language model, and the generative language model generates a first query text for the target document according to the task instruction in the prompt template. The task instruction requires that the key information corresponding to the first query text be concentrated in the second half of the target document. Using its strong semantic understanding and instruction-following capabilities, the generative language model generates a first query text, which may be "When did the original population of the XX region appear?". The server may take the first query text generated by the generative language model using the prompt template, together with the target document, as a piece of first text training data.
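A hedged sketch of such a prompt template follows; the instruction wording and the llm_client.complete call are illustrative placeholders rather than the wording or API actually used.

```python
PROMPT_TEMPLATE = (
    "Task: read the document below and write one question whose answer can only be\n"
    "found in the second half of the document. Output only the question.\n"
    "Document:\n{document}\n"
    "Question:"
)

def generate_first_query_text(target_document: str, llm_client) -> str:
    # The generative language model fills the role described in the text; the client
    # and its complete() method stand in for whatever model interface is available.
    prompt = PROMPT_TEMPLATE.format(document=target_document)
    return llm_client.complete(prompt).strip()

# first_text_training_data = {"first_query_text": generate_first_query_text(doc, llm),
#                             "target_document": doc}
```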
According to this method, a first query text having an association relationship with an unsupervised document is generated for the unsupervised document through a generative language model, which makes better use of unsupervised documents, which are easy to acquire and collect. In addition, the first query text in the embodiment of the application is generated through a prompt template, and the position of the key information in the target document can be specified through the task instruction in the prompt template. More diverse first text training data can be generated in this way, and training with more diverse first text training data makes the trained vector model stronger.
It can be understood that the above two ways of constructing the first text training data are, respectively, obtaining the first text training data from the open source text dataset, and generating the first query text corresponding to an unsupervised document through the generative language model so as to construct the first text training data. The two construction modes can be used independently or in combination to obtain richer first text training data, which is not limited in the embodiment of the application.
After the target vector model is obtained, the target vector model may be used in a vector retrieval stage. When the target vector model is used for vector retrieval, the server can acquire a text to be queried and a plurality of documents to be processed, wherein the text to be queried can be sent by a terminal of a user, and the text to be queried can be text which is input by the user and used for reflecting the retrieval intention of the user. It can be appreciated that, since the text to be queried is typically a short document, the server can process a plurality of documents to be processed through the target vector model, and process the text to be queried through the open source vector model provided by the related art. In another possible implementation, since the target vector model has better processing performance on both short documents and long documents, the server can process the text to be queried and the plurality of documents to be processed through the target vector model. Specifically, the server may process the text to be queried and the plurality of documents to be processed using one target vector model, and the server may also process the text to be queried and the plurality of documents to be processed using a plurality of target vector models, respectively, e.g., the server may process the text to be queried using one target vector model and process the plurality of documents to be processed using another target vector model. The server outputs a first vector representation of the text to be queried through the target vector model, and outputs a second vector representation of each document to be processed through the target vector model. The server calculates a sixth similarity between the first vector representation and each second vector representation, wherein the higher the sixth similarity is, the more likely the document to be processed and the text to be queried have an association relation. The server can determine the corresponding document to be processed, which is represented by the second vector with the highest sixth similarity, as the document retrieved through the target vector model, and the server returns the document to be processed to the terminal of the user.
Fig. 9 is a schematic diagram of a process of searching through a target vector model according to an embodiment of the present application, where a text to be queried and a plurality of documents to be processed in fig. 9 are respectively processed through respective target vector models, and parameters of the two target vector models are shared. And processing the text to be queried through the target vector model to obtain a first vector representation, and processing each document to be processed through the target vector model to obtain a second vector representation of each document to be processed. And then obtaining a sixth similarity between the first vector representation and each second vector representation, wherein the sixth similarity can be used as a score for measuring whether the document to be processed has an association relationship with the text to be queried. In another possible implementation, the text to be queried and the plurality of documents to be processed may be processed by the same object vector model.
It will be appreciated that the purpose of the target vector model is to convert the input query text or document into a corresponding vector representation, which is similar to the encoder function in a Transformer model structure; since the model structure of the pre-trained network model obtained in the embodiments of the present application may be a Transformer model structure, the target vector model may be regarded as an encoder.
It should be noted that, in the embodiment of the present application, the timing of outputting the second vector representation of each document to be processed through the target vector model is not limited. In one possible implementation, the server may perform the outputting of the first vector representation and the second vector representation after receiving the text to be queried sent by the terminal. In another possible implementation manner, the server may output the second vector representation of each document to be processed in advance, and after the target vector model outputs the second vector representation of each document to be processed, the server may save the second vector representation of each document to be processed, without waiting for the server to start processing the document to be processed after receiving the text to be queried sent by the terminal, so as to reduce the time required for vector retrieval and improve the vector retrieval efficiency.
Specifically, there are N documents to be processed in the corpus, and the target vector model processes the N documents to be processed in the corpus respectively to obtain second vector representations {p_1, p_2, …, p_N} of the N documents to be processed, each of a fixed dimension.
After the server receives the text to be queried, the server inputs the text to be queried into the target vector model, and the target vector model outputs a first vector representation q of the text to be queried. Then a sixth similarity between the first vector representation and each second vector representation is calculated separately; the sixth similarity between the first vector representation and the i-th second vector representation is calculated by the following formula:

score = (q · p_i) / (|q| · |p_i|)

wherein score represents the sixth similarity, i is a positive integer less than or equal to N, q is the first vector representation, p_i is the i-th second vector representation among the N second vector representations, |q| represents the modulus of the first vector representation q, and |p_i| represents the modulus of the i-th second vector representation. The value range of the sixth similarity is 0 to 1.
The higher the sixth similarity, the greater the similarity between the text to be queried and the document to be processed corresponding to the i-th second vector representation; the lower the sixth similarity, the lower that similarity. The server selects the document to be processed corresponding to the second vector representation with the highest sixth similarity and returns it to the terminal of the user.
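Putting the retrieval stage together, a minimal sketch is shown below; the encode() method on the target vector model is an assumed interface, and NumPy is used only for the cosine computation.

```python
import numpy as np

def build_document_index(documents, target_vector_model):
    # Offline step: second vector representations can be computed and stored before
    # any text to be queried arrives, reducing the time needed for vector retrieval.
    return np.stack([target_vector_model.encode(doc) for doc in documents])

def retrieve(text_to_query: str, documents, doc_vectors, target_vector_model):
    q = target_vector_model.encode(text_to_query)                 # first vector representation
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = int(np.argmax(scores))                                 # highest sixth similarity
    return documents[best], float(scores[best])
```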
The training method based on the aforementioned vector model is described below in connection with fig. 10 as an overall description of the training and use method of the vector model. Fig. 10 is a schematic diagram of a training method and a process for using a vector model according to an embodiment of the present application, where fig. 10 is divided into three stages altogether, and the first stage is a stage for constructing first text training data. The server may construct the first text training data in an open source text dataset, or may construct the first text training data based on a generative language model. The second stage is a model training stage, which in turn is divided into training of part of the model parameters and training of all the model parameters (i.e. fine tuning of all the model parameters). The third stage is a stage of vector retrieval using a target vector model.
It should be noted that, based on the implementation manner provided in the above aspects, further combinations may be further combined to provide further implementation manners.
Based on the training method of the vector model provided in the embodiment corresponding to fig. 2, the embodiment of the application further provides a training device 1100 of the vector model. Referring to fig. 11, the training device 1100 of the vector model includes a first acquisition unit 1101, a second acquisition unit 1102, an initialization unit 1103, and a training unit 1104:
the first obtaining unit 1101 is configured to obtain first text training data configured by a first query text and a target document having an association relationship with the first query text;
the second obtaining unit 1102 is configured to obtain a network model to be trained, where a position code embedding matrix of the network model to be trained includes an original position code and an extended position code, the sequence length of the original position code is an original sequence length, the extended position code is a position code for extending the original position code, and the original position code is extended by the extended position code so that the sequence length of the position code embedding matrix is extended from the original sequence length to a target sequence length, and the target sequence length is greater than the original sequence length;
The initializing unit 1103 is configured to initialize model parameters of the network model to be trained to obtain an initial network model, where initialization parameter values of extended position codes in the initial network model are obtained by random initialization, initialization parameter values of remaining model parameters in the initial network model are obtained by initializing model parameters of an open source vector model that is pre-trained, the remaining model parameters are model parameters except for the extended position codes in all model parameters of the initial network model, and the remaining model parameters include the original position codes;
the training unit 1104 is configured to train the extended position code of the initial network model by using the first text training data, so as to obtain a target vector model.
In one possible implementation manner, the training unit is specifically configured to:
training the expansion position code of the initial network model by using the first text training data to obtain an intermediate network model;
acquiring second text training data consisting of a second query text and a target document with an association relationship with the second query text;
And fine tuning all model parameters of the intermediate network model by using the second text training data to obtain the target vector model.
In one possible implementation manner, the number of the first text training data is multiple, each first text training data includes a first query text and a target document having an association relationship with the first query text, and the training unit is specifically configured to:
for a first query text in each piece of first text training data, acquiring a positive sample and a negative sample of the first query text, wherein the positive sample of the first query text is a target document belonging to the same piece of first text training data as the first query text, and the negative sample of the first query text is a target document belonging to different pieces of first text training data as the first query text;
outputting, for each first query text, a first text vector of the first query text, a second text vector of a positive sample of the first query text, and a third text vector of a negative sample of the first query text through the initial network model;
calculating a first similarity between the first query text and positive samples of the first query text based on the first text vector and the second text vector, and calculating a second similarity between the first query text and negative samples of the first query text based on the first text vector and the third text vector;
Constructing a first loss function according to the first similarity and the second similarity;
training the expansion position code of the initial network model based on the first loss function to obtain the intermediate network model.
In one possible implementation manner, the number of the second text training data is multiple, each second text training data includes a second query text and a target document having an association relationship with the second query text, and the training unit is specifically configured to:
for a second query text in each piece of second text training data, acquiring a positive sample and a negative sample of the second query text, wherein the positive sample of the second query text is a target document belonging to the same piece of second text training data as the second query text, and the negative sample of the second query text is a target document belonging to different pieces of second text training data as the second query text;
outputting, for each second query text, a fourth text vector of the second query text, a fifth text vector of positive samples of the second query text, and a sixth text vector of negative samples of the second query text through the intermediate network model;
Calculating a third similarity between the second query text and positive samples of the second query text based on the fourth text vector and the fifth text vector, and calculating a fourth similarity between the second query text and negative samples of the second query text based on the fourth text vector and the sixth text vector;
determining a second loss function from the third similarity and the fourth similarity;
constructing a target loss function based on the second loss function;
and fine tuning all model parameters of the intermediate network model based on the target loss function to obtain the target vector model.
In one possible implementation manner, the training unit is specifically configured to:
acquiring the residual model parameters of the intermediate network model;
determining a model parameter loss function based on the initialized parameter values and updated parameter values of the remaining model parameters;
the target loss function is constructed based on the second loss function and the model parameter loss function.
In one possible implementation manner, the first obtaining unit is specifically configured to:
determining the target document from documents with text sequence lengths larger than the original sequence length, and determining a first query text with an association relation with the target document;
And constructing the first text training data based on the first query text and a target document with an association relation with the first query text.
In one possible implementation manner, the first obtaining unit is specifically configured to:
screening documents with text sequence length larger than the original sequence length from an open source text data set as candidate documents;
determining the target document based on the candidate document;
and determining the query text with the association relation with the target document in the open source text data set as a first query text with the association relation with the target document.
In one possible implementation manner, the first obtaining unit is specifically configured to:
determining the position of key information of the candidate document in the candidate document according to each candidate document, wherein the key information is information matched with a first query text corresponding to the candidate document;
and if the position is the target position, discarding the candidate document from a plurality of candidate documents according to a preset proportion to obtain the target document.
In one possible implementation manner, the first obtaining unit is specifically configured to:
segmenting the candidate document according to the original sequence length to obtain a plurality of text segments;
Respectively calculating fifth similarity between the text segments and the first query text corresponding to the candidate document;
and determining the text segment with the highest fifth similarity as the position of the key information of the candidate document in the candidate document.
In one possible implementation manner, the first obtaining unit is specifically configured to:
determining an unsupervised document with a text sequence length larger than the original sequence length in an unsupervised corpus as the target document;
the determining the first query text with the association relation with the target document comprises the following steps:
and generating text of the target document through a generated language model according to a prompt template comprising the target document and the task instruction, and obtaining a first query text with an association relation with the target document.
In one possible implementation manner, the training device for the vector model provided by the application may further include a search unit, where the search unit is specifically configured to:
acquiring a text to be queried and a plurality of documents to be processed;
outputting a first vector representation of the text to be queried through the target vector model, and outputting a second vector representation of each document to be processed through the target vector model;
Calculating a sixth similarity between the first vector representation and each of the second vector representations, respectively;
and returning the document to be processed corresponding to the second vector representation with the highest sixth similarity.
It can be seen from the device that, on the basis of the existing model structure, the original position code in the position code embedding matrix is extended by the extended position code; since the sequence length of the original position code is the original sequence length, the sequence length of the position code embedding matrix is extended from the original sequence length to the target sequence length. Because the target sequence length is greater than the original sequence length, the target vector model trained on the basis of the network model to be trained can process long documents whose sequence length is greater than the original sequence length. In the training process, the model parameters of the network model to be trained are initialized to obtain the initial network model. The initialization parameter values of the extended position code in the initial network model are obtained by random initialization, and the initialization parameter values of the remaining model parameters in the initial network model are obtained by initializing them with the model parameters of a pre-trained open source vector model, where the remaining model parameters are the model parameters other than the extended position code among all model parameters of the initial network model and include the original position code. Therefore, the remaining model parameters, which have already been pre-trained, can be frozen during training, and only the extended position code of the initial network model is trained with the acquired first text training data to obtain the target vector model. In this way, the rich language knowledge learned by the pre-trained vector model can be retained, the training time is shortened, and the training speed is increased. By extending the sequence length of the position code embedding matrix of the model to be trained and training only part of the model parameters of the network model to be trained, the maximum sequence length of the target vector model is increased on the premise of retaining the previously learned language knowledge of short text and ensuring the training speed. Thus, no segmentation processing or sliding window processing is needed when processing long documents, and the retrieval performance is not affected by the loss of context information.
The embodiment of the application also provides computer equipment which can execute the training method of the vector model. The computer device may be a terminal, and fig. 12 shows a structure diagram of a terminal provided in an embodiment of the present application. In fig. 12, taking a terminal as a smart phone as an example:
referring to fig. 12, the smart phone includes: radio Frequency (RF) circuit 1210, memory 1220, input unit 1230, display unit 1240, sensor 1250, audio circuit 1260, wireless fidelity (WiFi) module 1270, processor 1280, and power supply 1290. The input unit 1230 may include a touch panel 1231 and other input devices 1232, the display unit 1240 may include a display panel 1241, and the audio circuit 1260 may include a speaker 1261 and a microphone 1262. It will be appreciated that the smartphone structure shown in fig. 12 is not limiting of the smartphone, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
Memory 1220 may be used to store software programs and modules, and processor 1280 may perform various functional applications and data processing for the smartphone by executing the software programs and modules stored in memory 1220. The memory 1220 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data (such as audio data, phonebooks, etc.) created according to the use of the smart phone, etc. In addition, memory 1220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
Processor 1280 is a control center of the smartphone, connects various parts of the entire smartphone using various interfaces and lines, performs various functions of the smartphone and processes data by running or executing software programs and/or modules stored in memory 1220, and invoking data stored in memory 1220. In the alternative, processor 1280 may include one or more processing units; preferably, the processor 1280 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, application programs, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1280.
In this embodiment, the processor 1280 in the smart phone may execute the training method of the vector model provided in each embodiment of the present application.
The computer device provided in the embodiment of the present application may also be a server, as shown in fig. 13, fig. 13 is a block diagram of a server 1300 provided in the embodiment of the present application, where the server 1300 may have a relatively large difference due to different configurations or performances, and may include one or more processors, such as a central processing unit (Central Processing Units, abbreviated as CPU) 1322, a memory 1332, one or more storage media 1330 (such as one or more mass storage devices) storing application programs 1342 or data 1344. Wherein the memory 1332 and storage medium 1330 may be transitory or persistent. The program stored on the storage medium 1330 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Further, the central processor 1322 may be configured to communicate with the storage medium 1330, and execute a series of instruction operations in the storage medium 1330 on the server 1300.
The server 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the cpu 1322 in the server 1300 may execute the training method of the vector model provided in the embodiments of the present application.
According to an aspect of the present application, there is provided a computer readable storage medium for storing a computer program for executing the training method of the vector model according to the foregoing embodiments.
According to one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.
The descriptions of the processes or structures corresponding to the drawings have emphasis, and the descriptions of other processes or structures may be referred to for the parts of a certain process or structure that are not described in detail.
The terms "first," "second," "third," "fourth," and the like (if any) in the description of the present application and in the above-described figures are used for distinguishing between similar objects and are not necessarily used for describing a particular order or sequence. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the present application described herein can, for example, be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a terminal, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function, and works together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (15)

1. A method of training a vector model, the method comprising:
acquiring first text training data composed of a first query text and a target document with an association relation with the first query text;
acquiring a network model to be trained, wherein a position code embedded matrix of the network model to be trained comprises an original position code and an extended position code, the sequence length of the original position code is the original sequence length, the extended position code is used for expanding the original position code, and the original position code is expanded through the extended position code so that the sequence length of the position code embedded matrix is expanded from the original sequence length to a target sequence length, and the target sequence length is larger than the original sequence length;
initializing model parameters of the network model to be trained to obtain an initial network model, wherein initialization parameter values of extended position codes in the initial network model are obtained by random initialization, initialization parameter values of residual model parameters in the initial network model are obtained by initializing model parameters of an open source vector model which is pre-trained, the residual model parameters are model parameters except the extended position codes in all model parameters of the initial network model, and the residual model parameters comprise the original position codes;
and training the extended position code of the initial network model by using the first text training data to obtain a target vector model.
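By way of an illustrative, non-limiting sketch of the position-code extension recited in claim 1: assuming a learned absolute position embedding matrix and PyTorch, the following shows how the original position codes could be copied from a pre-trained open source vector model while the extended rows are randomly initialized. The function name, the normal initialization with standard deviation 0.02, and the tensor shapes are assumptions of this sketch and are not taken from the application.

```python
import torch
import torch.nn as nn

def build_extended_position_embedding(pretrained_pos_emb: nn.Embedding,
                                      target_len: int) -> nn.Embedding:
    """Extend a learned position-embedding matrix from its original sequence
    length to target_len: original rows are copied from the pre-trained
    open-source model, extended rows are randomly initialized (sketch only)."""
    orig_len, hidden_size = pretrained_pos_emb.weight.shape
    assert target_len > orig_len, "target sequence length must exceed the original"
    extended = nn.Embedding(target_len, hidden_size)
    with torch.no_grad():
        # original position codes: initialized from the pre-trained open-source model
        extended.weight[:orig_len] = pretrained_pos_emb.weight
        # extended position codes: random initialization
        nn.init.normal_(extended.weight[orig_len:], mean=0.0, std=0.02)
    return extended
```

The remaining model parameters (including the original position codes) are left untouched by this step; they keep the values initialized from the pre-trained open source vector model.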
2. The method of claim 1, wherein training the extended position code of the initial network model using the first text training data to obtain a target vector model comprises:
training the extended position code of the initial network model by using the first text training data to obtain an intermediate network model;
acquiring second text training data consisting of a second query text and a target document with an association relationship with the second query text;
and fine tuning all model parameters of the intermediate network model by using the second text training data to obtain the target vector model.
3. The method according to claim 2, wherein the number of the first text training data is a plurality, each first text training data includes a first query text and a target document having an association relationship with the first query text, and the training the extended position code of the initial network model by using the first text training data to obtain an intermediate network model includes:
for a first query text in each piece of first text training data, acquiring a positive sample and a negative sample of the first query text, wherein the positive sample of the first query text is a target document belonging to the same piece of first text training data as the first query text, and the negative sample of the first query text is a target document belonging to a different piece of first text training data from the first query text;
outputting, for each first query text, a first text vector of the first query text, a second text vector of a positive sample of the first query text, and a third text vector of a negative sample of the first query text through the initial network model;
calculating a first similarity between the first query text and positive samples of the first query text based on the first text vector and the second text vector, and calculating a second similarity between the first query text and negative samples of the first query text based on the first text vector and the third text vector;
constructing a first loss function according to the first similarity and the second similarity;
training the extended position code of the initial network model based on the first loss function to obtain the intermediate network model.
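As an illustrative sketch of the first-stage training of claim 3: the loss below is built from the first similarity (query versus its positive sample) and the second similarity (query versus negative samples drawn from other training pairs). The InfoNCE-style cross-entropy form, the cosine similarity, and the temperature value are assumptions of this sketch; the claim only requires a first loss function constructed from the two similarities. Assuming PyTorch:

```python
import torch
import torch.nn.functional as F

def first_stage_loss(query_vecs, pos_vecs, neg_vecs, temperature=0.05):
    """Claim 3 (sketch): build a loss from the first similarity (query vs.
    its positive sample) and the second similarity (query vs. negative
    samples, i.e. target documents of other training pairs).
    Shapes: query_vecs [B, d], pos_vecs [B, d], neg_vecs [N, d]."""
    # first similarity: each query against its own positive document
    pos_sim = F.cosine_similarity(query_vecs, pos_vecs, dim=-1)                       # [B]
    # second similarity: each query against the negative documents
    neg_sim = F.normalize(query_vecs, dim=-1) @ F.normalize(neg_vecs, dim=-1).T       # [B, N]
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)      # positive at index 0
    return F.cross_entropy(logits, labels)
```

During this stage, only the extended position code would have gradients enabled, so updates affect the extended rows while the remaining model parameters stay fixed.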
4. The method according to claim 2, wherein the number of the second text training data is a plurality, each second text training data includes a second query text and a target document having an association relationship with the second query text, and the fine tuning of all model parameters of the intermediate network model by using the second text training data to obtain the target vector model includes:
for a second query text in each piece of second text training data, acquiring a positive sample and a negative sample of the second query text, wherein the positive sample of the second query text is a target document belonging to the same piece of second text training data as the second query text, and the negative sample of the second query text is a target document belonging to a different piece of second text training data from the second query text;
outputting, for each second query text, a fourth text vector of the second query text, a fifth text vector of positive samples of the second query text, and a sixth text vector of negative samples of the second query text through the intermediate network model;
calculating a third similarity between the second query text and positive samples of the second query text based on the fourth text vector and the fifth text vector, and calculating a fourth similarity between the second query text and negative samples of the second query text based on the fourth text vector and the sixth text vector;
Determining a second loss function from the third similarity and the fourth similarity;
constructing a target loss function based on the second loss function;
and fine tuning all model parameters of the intermediate network model based on the target loss function to obtain the target vector model.
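A minimal sketch of the second-stage full-parameter fine tuning of claim 4, assuming PyTorch; the AdamW optimizer, the learning rate, and the loop structure are assumptions of this sketch, and loss_fn stands for the target loss described in claims 4 and 5:

```python
import torch

def finetune_all_parameters(intermediate_model, data_loader, loss_fn,
                            lr=1e-5, epochs=1):
    """Claim 4 (sketch): in the second stage every model parameter of the
    intermediate network model is trainable; optimizer and learning rate
    are illustrative assumptions."""
    for p in intermediate_model.parameters():
        p.requires_grad = True  # unfreeze everything, including position codes
    optimizer = torch.optim.AdamW(intermediate_model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in data_loader:
            loss = loss_fn(intermediate_model, batch)  # target loss of claims 4-5
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return intermediate_model
```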
5. The method of claim 4, wherein constructing a target loss function based on the second loss function comprises:
acquiring the residual model parameters of the intermediate network model;
determining a model parameter loss function based on the initialized parameter values and updated parameter values of the remaining model parameters;
the target loss function is constructed based on the second loss function and the model parameter loss function.
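A hedged sketch of the model parameter loss and the target loss of claim 5. The squared-L2 distance between initialized and updated parameter values and the additive weighting are assumptions; the claim only requires that the model parameter loss be determined from those two sets of values and combined with the second loss function:

```python
def model_parameter_loss(model, init_param_values, remaining_param_names):
    """Claim 5 (sketch): measure how far the remaining model parameters have
    drifted from their initialization values; the squared-L2 form is an
    assumption of this sketch."""
    drift = 0.0
    params = dict(model.named_parameters())
    for name in remaining_param_names:          # remaining parameters only
        drift = drift + ((params[name] - init_param_values[name]) ** 2).sum()
    return drift

def target_loss(second_loss, param_loss, weight=0.1):
    """Claims 4-5 (sketch): combine the second loss with the model parameter
    loss; the additive form and the 0.1 weight are assumptions."""
    return second_loss + weight * param_loss
```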
6. The method according to any one of claims 1-5, wherein a text sequence length of a target document having an association with the first query text is greater than the original sequence length, and the acquiring first text training data composed of the first query text and the target document having an association with the first query text includes:
determining the target document from documents with text sequence lengths larger than the original sequence length, and determining a first query text with an association relation with the target document;
And constructing the first text training data based on the first query text and a target document with an association relation with the first query text.
7. The method of claim 6, wherein said determining the target document from documents having a text sequence length greater than the original sequence length comprises:
screening documents with text sequence length larger than the original sequence length from an open source text data set as candidate documents;
determining the target document based on the candidate document;
the determining the first query text with the association relation with the target document comprises the following steps:
and determining the query text with the association relation with the target document in the open source text data set as a first query text with the association relation with the target document.
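A small sketch of the candidate-document screening of claim 7, assuming the open source text data set is available as an iterable of (query text, document) pairs; the whitespace tokenizer used to measure text sequence length is a placeholder assumption:

```python
def select_candidate_documents(dataset, original_seq_len, tokenize=str.split):
    """Claim 7 (sketch): screen documents whose text sequence length exceeds
    the original sequence length from an open source text data set, keeping
    each candidate together with its associated query text."""
    candidates = []
    for query_text, document in dataset:
        if len(tokenize(document)) > original_seq_len:
            candidates.append((query_text, document))
    return candidates
```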
8. The method of claim 7, wherein the number of candidate documents is a plurality, the determining the target document based on the candidate documents comprising:
for each candidate document, determining the position, in the candidate document, of key information of the candidate document, wherein the key information is information matched with the first query text corresponding to the candidate document;
and if the position is the target position, discarding, according to a preset proportion, the candidate document from the plurality of candidate documents, so as to obtain the target document.
9. The method of claim 8, wherein the determining the location of key information of the candidate document in the candidate document comprises:
segmenting the candidate document according to the original sequence length to obtain a plurality of text segments;
respectively calculating fifth similarity between the text segments and the first query text corresponding to the candidate document;
and determining the position of the text segment with the highest fifth similarity as the position of the key information of the candidate document in the candidate document.
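An illustrative sketch covering claims 8 and 9: the candidate document is segmented at the original sequence length, the segment most similar to the corresponding first query text gives the position of the key information (the fifth similarity), and candidates whose key information sits at the target position are discarded at a preset proportion. The encode_segment callable, the choice of the first segment as the target position, and the 0.5 keep ratio are assumptions of this sketch, assuming PyTorch:

```python
import random

import torch
import torch.nn.functional as F

def key_info_segment_index(document_tokens, query_vec, encode_segment,
                           original_seq_len):
    """Claim 9 (sketch): split the candidate document into segments of the
    original sequence length, compute the fifth similarity between each
    segment and the query, and return the index of the most similar segment
    as the position of the key information. `encode_segment` is an assumed
    callable mapping a token list to a 1-D vector."""
    segments = [document_tokens[i:i + original_seq_len]
                for i in range(0, len(document_tokens), original_seq_len)]
    sims = [F.cosine_similarity(encode_segment(seg), query_vec, dim=-1)
            for seg in segments]
    return int(torch.stack(sims).argmax())

def downsample_by_key_position(candidates, positions, target_position=0,
                               keep_ratio=0.5, seed=0):
    """Claim 8 (sketch): candidates whose key information lies at the target
    position (assumed here: the first segment) are discarded at a preset
    proportion; the remaining candidates are kept as target documents."""
    rng = random.Random(seed)
    kept = []
    for cand, pos in zip(candidates, positions):
        if pos == target_position and rng.random() < (1.0 - keep_ratio):
            continue  # discarded at the preset proportion
        kept.append(cand)
    return kept
```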
10. The method of claim 6, wherein said determining the target document from documents having a text sequence length greater than the original sequence length comprises:
determining an unsupervised document with a text sequence length larger than the original sequence length in an unsupervised corpus as the target document;
the determining the first query text with the association relation with the target document comprises the following steps:
and generating text for the target document through a generative language model according to a prompt template comprising the target document and a task instruction, so as to obtain a first query text with an association relation with the target document.
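A minimal sketch of the query generation of claim 10: a prompt template containing the target document and a task instruction is fed to a generative language model, whose output serves as the first query text. The template wording is an assumption, and generate is a hypothetical callable standing in for whatever generative-model interface is actually used:

```python
PROMPT_TEMPLATE = (
    "Task instruction: read the document below and write one question "
    "that the document answers.\n\nDocument:\n{document}\n\nQuestion:"
)

def build_query_prompt(document: str) -> str:
    """Claim 10 (sketch): a prompt template comprising the target document
    and a task instruction; the wording is an illustrative assumption."""
    return PROMPT_TEMPLATE.format(document=document)

def generate_first_query(document: str, generate) -> str:
    # `generate` is a hypothetical callable (prompt string -> generated text)
    # standing in for the generative language model actually used.
    return generate(build_query_prompt(document)).strip()
```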
11. The method according to any one of claims 1-5, further comprising:
acquiring a text to be queried and a plurality of documents to be processed;
outputting a first vector representation of the text to be queried through the target vector model, and outputting a second vector representation of each document to be processed through the target vector model;
calculating a sixth similarity between the first vector representation and each of the second vector representations, respectively;
and returning the document to be processed corresponding to the second vector representation with the highest sixth similarity.
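A short usage sketch of the retrieval procedure of claim 11, assuming PyTorch and an encode callable that wraps the trained target vector model (text in, 1-D vector out); cosine similarity is used here for the sixth similarity as an assumption:

```python
import torch
import torch.nn.functional as F

def retrieve_best_document(query_text, documents, encode):
    """Claim 11 (sketch): encode the text to be queried and every document to
    be processed with the target vector model, compute the sixth similarity
    between the first vector representation and each second vector
    representation, and return the most similar document."""
    query_vec = F.normalize(encode(query_text), dim=-1)          # first vector representation
    doc_vecs = torch.stack([F.normalize(encode(d), dim=-1)
                            for d in documents])                 # second vector representations
    sims = doc_vecs @ query_vec                                  # sixth similarity (cosine)
    return documents[int(sims.argmax())]
```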
12. A training device for a vector model, which is characterized in that the device comprises a first acquisition unit, a second acquisition unit, an initialization unit and a training unit:
the first acquisition unit is used for acquiring first text training data composed of a first query text and a target document with an association relation with the first query text;
the second obtaining unit is configured to obtain a network model to be trained, where a position code embedding matrix of the network model to be trained includes an original position code and an extended position code, a sequence length of the original position code is an original sequence length, the extended position code is a position code for extending the original position code, and the original position code is extended by the extended position code so that the sequence length of the position code embedding matrix is extended from the original sequence length to a target sequence length, and the target sequence length is greater than the original sequence length;
The initialization unit is configured to initialize model parameters of the network model to be trained to obtain an initial network model, initialization parameter values of extended position codes in the initial network model are obtained by random initialization, initialization parameter values of remaining model parameters in the initial network model are obtained by initializing model parameters of an open source vector model after pre-training, the remaining model parameters are model parameters except the extended position codes in all model parameters of the initial network model, and the remaining model parameters include the original position codes;
the training unit is used for training the extended position code of the initial network model by utilizing the first text training data to obtain a target vector model.
13. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is configured to perform the method of any of claims 1-11 according to instructions in the computer program.
14. A computer readable storage medium for storing a computer program which, when executed by a processor, causes the processor to perform the method of any one of claims 1-11.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-11.
CN202410273478.5A 2024-03-11 2024-03-11 Training method and related device of vector model Pending CN117874173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410273478.5A CN117874173A (en) 2024-03-11 2024-03-11 Training method and related device of vector model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410273478.5A CN117874173A (en) 2024-03-11 2024-03-11 Training method and related device of vector model

Publications (1)

Publication Number Publication Date
CN117874173A true CN117874173A (en) 2024-04-12

Family

ID=90595204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410273478.5A Pending CN117874173A (en) 2024-03-11 2024-03-11 Training method and related device of vector model

Country Status (1)

Country Link
CN (1) CN117874173A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710574A (en) * 2018-12-25 2019-05-03 东软集团股份有限公司 A kind of method and apparatus for extracting key message from document
CN112487146A (en) * 2020-12-02 2021-03-12 重庆邮电大学 Legal case dispute focus acquisition method and device and computer equipment
CN114201581A (en) * 2021-11-29 2022-03-18 中国科学院深圳先进技术研究院 Long text retrieval model based on contrast learning
CN114998182A (en) * 2021-12-24 2022-09-02 长沙理工大学 In-service concrete structure surface disease classification and identification method
CN115620107A (en) * 2022-11-07 2023-01-17 国网江西省电力有限公司电力科学研究院 Transformer substation bird-involved fault related bird species identification method based on deep learning
CN116205235A (en) * 2023-05-05 2023-06-02 北京脉络洞察科技有限公司 Data set dividing method and device and electronic equipment
CN117373185A (en) * 2023-10-25 2024-01-09 湖南五凌电力科技有限公司 Early warning method and device for downstream restricted area of hydropower station

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
月来客栈: "BERT模型输入长度超过512如何解决?" [How to handle BERT model inputs longer than 512?], pages 1 - 8, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/493424507> *

Similar Documents

Publication Publication Date Title
CN109033068B (en) Method and device for reading and understanding based on attention mechanism and electronic equipment
CN112214593B (en) Question-answering processing method and device, electronic equipment and storage medium
CN109635083B (en) Document retrieval method for searching topic type query in TED (tele) lecture
CN108287858A (en) The semantic extracting method and device of natural language
CN109829104A (en) Pseudo-linear filter model information search method and system based on semantic similarity
CN116127095A (en) Question-answering method combining sequence model and knowledge graph
CN106844788B (en) Library intelligent search sorting method and system
CN114328852B (en) Text processing method, related device and equipment
CN117573821A (en) Knowledge question-answering method, device, equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN117648429A (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
CN113962228A (en) Long document retrieval method based on semantic fusion of memory network
CN117520491A (en) Intelligent question-answering method and device based on large language model
CN112883182A (en) Question-answer matching method and device based on machine reading
CN118113815B (en) Content searching method, related device and medium
CN118093834A (en) AIGC large model-based language processing question-answering system and method
CN117151052B (en) Patent query report generation method based on large language model and graph algorithm
CN110442681A (en) A kind of machine reads method, electronic equipment and the readable storage medium storing program for executing of understanding
CN117708324A (en) Text topic classification method, device, chip and terminal
CN117076598A (en) Semantic retrieval model fusion method and system based on self-adaptive weight
CN116186220A (en) Information retrieval method, question and answer processing method, information retrieval device and system
CN118715523A (en) Generating output sequences with inline evidence using language model neural networks
CN113868389B (en) Data query method and device based on natural language text and computer equipment
CN113128210B (en) Webpage form information analysis method based on synonym discovery
CN113157892B (en) User intention processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination