RU2775815C2

RU2775815C2 - Methods and servers for ranking digital documents in response to a query

Info

Publication number: RU2775815C2
Application number: RU2020142462A
Authority: RU
Inventors: Эдуард Мечиславович Волынец; Денис Сергеевич Пастушик; Евгений Александрович Гречников
Original assignee: Общество С Ограниченной Ответственностью «Яндекс»
Filing date: 2020-12-22
Publication date: 2022-07-11

Abstract

FIELD: computing technology.

SUBSTANCE: the group of inventions relates to search engine technologies and can be used for ranking digital documents in response to a query. A search engine associated with an inverted index is hosted on a server. The method includes accessing the inverted index to retrieve query-independent data for a first document-term pair and a second document-term pair, wherein the query-independent data indicates (i) a term-dependent occurrence of the first term in the document and (ii) a term-dependent occurrence of the second term in the document. The method includes generating, using the query-independent data, a query-dependent feature indicating a group occurrence of the first term with the second term in the document. The method includes generating a ranking feature for the document based on at least the first term, the second term, and the query-dependent feature, and ranking the document based on at least the ranking feature.

EFFECT: increased processing speed.

20 cl, 7 dwg

Description

Technical field

[01] The present technology relates to search engine technologies. In particular, the present technology is directed to methods and servers for ranking digital documents in response to a query.

State of the art

[02] The Internet provides access to a wide range of resources, such as video files, image files, audio files, or web pages. Search engines are used to find these resources. For example, digital images that satisfy a user's information needs may be identified by a search engine in response to receiving a query submitted by the user. User queries may consist of one or more query terms (words). The search engine selects and ranks search results based on their relevance to the user's query and their importance or quality compared to other search results, and provides the user with the best search results.

[03] Search engine systems contain a component called an inverted index, which includes a large number of posting lists associated with respective search terms and stores indications of the documents that contain those search terms. Such data structures reduce the time, memory, and processing resources required to perform searches.

[04] During a search, the system retrieves information related to the terms of the in-use query from the inverted index and is configured to use this information to rank documents in response to that query. The data stored in an inverted index is query-independent and term-dependent. Processing the retrieved data in real time to rank documents is challenging because of the trade-off between ranking quality and the processing time it requires.

[05] U.S. Patent No. 10,417,687, entitled "Generating modified query to identify similar items in a data store" and published September 17, 2019, discloses techniques for identifying similar items in a data store by generating a modified query from a user query and from information about a specific item stored in the data store. Predefined descriptive terms may be designated as useful for identifying items, and these terms may be located in the keyword data for a particular item. Item correlations can also be identified with respect to the item's category designation. The modified query can be generated based on the predefined descriptive terms in the keyword data for the item and on the item correlations.

Summary of the invention

[06] Embodiments of the present technology were developed based on the developers' identification of at least one technical problem associated with prior art approaches to object classification.

[07] In at least one broad aspect of the present technology, a server is provided that runs a plurality of computer-implemented algorithms, referred to herein as a "search engine", which is broadly configured to provide search results in response to a query submitted by a user of an electronic device.

[08] Broadly speaking, a web search engine is a software system designed to perform web searches. Search results are usually presented in a ranked format, that is, in a list ranked based on relevance to the query. The information is usually displayed to the user on a search engine results page (SERP). Search results may be digital documents such as, but not limited to, web pages, images, videos, infographics, articles, research papers, and other types of digital files.

[09] In at least one aspect of the present technology, the server is communicatively coupled to a storage device storing a structured set of data, referred to herein as an "inverted index". Broadly speaking, an inverted index is a data structure that is used as a component of search engine indexing algorithms. For example, a forward index, which stores a list of words for each document, can be generated first; the forward index can then be "inverted", in a sense, so that it stores a list of documents for each word. Querying the forward index may require sequential iteration over each document and each word to check for a matching document. The time, memory, and processing resources needed to execute such a query are not always technically realistic. In contrast, one advantage of the inverted index is that querying such a data structure requires comparatively less time, memory, and processing resources.
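
The relationship between the two index types described in paragraph [09] can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names `invert` and `search_forward`, the dictionary layout, and the sample documents are hypothetical and are not taken from the present technology.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn a forward index {doc_id: [words]} into an inverted index {word: [doc_ids]}."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    # Sorted lists make the posting lists deterministic and easy to compare.
    return {word: sorted(doc_ids) for word, doc_ids in inverted.items()}


def search_forward(forward_index, word):
    """Answer the same question from the forward index by scanning every document."""
    return sorted(doc_id for doc_id, words in forward_index.items() if word in words)


# Hypothetical sample data.
forward_index = {
    "doc1": ["ranking", "digital", "documents"],
    "doc2": ["inverted", "index", "ranking"],
    "doc3": ["digital", "index"],
}
inverted_index = invert(forward_index)

# A single dictionary lookup replaces the scan over all documents.
print(inverted_index["ranking"])
print(search_forward(forward_index, "ranking"))
```

Both lookups return the same documents; the difference is that the forward-index search visits every document, whereas the inverted index answers with one lookup per term.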

[10] In at least one broad aspect of the present technology, the inverted index stores data for respective "document-term pairs". It is contemplated that the data stored in association with the respective document-term (DT) pairs comprises "query-independent" data. In at least some embodiments of the present technology, the query-independent data for a given DT pair may indicate a term-dependent occurrence of the respective term in the content associated with the respective document. It can be said that the inverted index stores "term-dependent" data for respective terms, as opposed to data that depends on more than one term from the query.

[11] In at least some aspects of the present technology, the developers have devised methods and servers that use the query-independent data stored in the inverted index to generate, in real time during a given search performed in response to a submitted query, one or more "query-dependent features" based on the query-independent data retrieved for one or more terms from the submitted query. In at least some embodiments of the present technology, it can be said that data previously stored in the inverted index can be retrieved in real time during the search and processed by a plurality of computer-implemented algorithms, referred to herein as a "dynamic feature generator", which is broadly configured to generate one or more dynamic features, where each dynamic feature depends on more than one term from the query.

[12] In at least one embodiment, the dynamic feature generator executed by the server is configured to (i) use the query-independent data associated with respective terms to (ii) generate a query-dependent feature indicative of a "group occurrence" of at least a pair of terms from the query in the content associated with the respective document. It is contemplated that the server is configured to use the query-dependent feature to rank the respective document among a plurality of potentially relevant documents in response to the query.

[13] In a first broad aspect of the present technology, a method of ranking digital documents in response to a query is provided. The digital documents are potentially relevant to the query, which contains a first term and a second term. The query has been submitted by a user of an electronic device communicatively coupled to a server hosting a search engine. The search engine is associated with an inverted index that stores information associated with document-term (DT) pairs. The method is executed by the server. The method comprises, for a given document from a plurality of potentially relevant documents, accessing, by the server, the inverted index to retrieve query-independent data for a first DT pair and a second DT pair. The first DT pair has the given document and the first term. The second DT pair has the given document and the second term. The query-independent data indicates (i) a term-dependent occurrence of the first term in the content associated with the given document, and (ii) a term-dependent occurrence of the second term in the content associated with the given document. The method comprises, for the given document, generating, by the server, a query-dependent feature using the query-independent data retrieved for the first DT pair and the second DT pair. The query-dependent feature indicates a group occurrence of the first term with the second term in the content associated with the given document. The method comprises, for the given document, generating, by the server, a ranking feature for the given document based on at least the first term, the second term, and the query-dependent feature. The method comprises ranking, by the server, the given document among the plurality of potentially relevant documents based on at least the ranking feature.
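
The four steps of paragraph [13] (retrieve, generate the query-dependent feature, generate the ranking feature, rank) can be sketched as follows. This is a toy sketch under loud assumptions: the index contents, the adjacency-based definition of group occurrence, the `TERM_WEIGHT` table, and the scoring formula are all hypothetical stand-ins; in particular, the simple formula takes the place of whatever model actually produces the ranking feature.

```python
# Hypothetical inverted index: (document, term) -> token positions of the term in the body.
INVERTED_INDEX = {
    ("doc_a", "red"): [0, 7],
    ("doc_a", "car"): [1, 20],
    ("doc_b", "red"): [3],
    ("doc_b", "car"): [15],
}

# Hypothetical per-term weights, used only so the toy score depends on both terms.
TERM_WEIGHT = {"red": 0.5, "car": 1.0}


def adjacent_pairs(positions_1, positions_2):
    """Query-dependent feature (assumed definition): how many times the
    second term immediately follows the first term."""
    second = set(positions_2)
    return sum(1 for p in positions_1 if p + 1 in second)


def ranking_feature(term_1, term_2, query_dependent_feature):
    """Toy stand-in for the ranking-feature model; not the method of the source."""
    base = TERM_WEIGHT.get(term_1, 0.0) + TERM_WEIGHT.get(term_2, 0.0)
    return base * (1.0 + query_dependent_feature)


def rank_documents(query_terms, candidates, index=INVERTED_INDEX):
    term_1, term_2 = query_terms
    scored = []
    for doc in candidates:
        # Step 1: retrieve query-independent data for both DT pairs.
        data_1 = index.get((doc, term_1), [])
        data_2 = index.get((doc, term_2), [])
        # Step 2: derive the query-dependent feature at query time.
        feature = adjacent_pairs(data_1, data_2)
        # Step 3: compute the ranking feature from the terms and the feature.
        scored.append((ranking_feature(term_1, term_2, feature), doc))
    # Step 4: rank by the ranking feature, highest first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored]


print(rank_documents(("red", "car"), ["doc_b", "doc_a"]))
```

Note that nothing pair-specific is stored in the index: the co-occurrence count is derived at query time from per-term position lists, which is the point of the approach described above.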

[14] In some embodiments of the method, generating the ranking feature for the given document is performed by a neural network (NN).

[15] In some embodiments of the method, the method further comprises training, by the server, the NN to generate the ranking feature. Training the NN comprises generating, by the server, a training set for a training document-query (DQ) pair to be used during a given training iteration of the NN. The training DQ pair has a training query and a training document. The generating comprises generating, by the server, a plurality of training term embeddings based on respective terms from the training query. The generating comprises accessing, by the server, the inverted index associated with the search engine to retrieve a plurality of query-independent datasets associated with respective ones of a plurality of training DT pairs. A given one of the plurality of training DT pairs includes the training document and a respective one of the plurality of terms from the training query. The generating comprises generating, by the server, a plurality of training feature vectors for the plurality of training DT pairs using the plurality of query-independent datasets. Training the NN comprises, during the given training iteration, inputting, by the server, into the NN said plurality of training term embeddings and said plurality of training feature vectors to generate a predicted ranking feature for the training DQ pair. Training the NN comprises, during the given training iteration, adjusting, by the server, the NN based on a comparison between a label and the predicted ranking feature, such that the NN generates, for a given in-use DQ pair, a respective predicted ranking feature that indicates the relevance of the respective in-use document to the respective in-use query.
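
The training iteration of paragraph [15] can be sketched in plain Python. Everything here is an assumption made for illustration: a single linear unit stands in for the NN, `term_embedding` is a hand-made placeholder rather than a learned embedding, the feature vector simply counts positions per zone, and the squared-error update rule is a swapped-in textbook technique that the source does not specify.

```python
def term_embedding(term, dim=4):
    """Hypothetical placeholder embedding built from character codes (not learned)."""
    vec = [0.0] * dim
    for i, ch in enumerate(term):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def feature_vector(index, doc, term):
    """Training feature vector for one DT pair: occurrence counts per zone (assumed layout)."""
    data = index.get((doc, term), {"title": [], "url": [], "body": []})
    return [float(len(data["title"])), float(len(data["url"])), float(len(data["body"]))]


def build_training_set(index, query_terms, doc):
    """One embedding and one feature vector per term of the training query."""
    embeddings = [term_embedding(t) for t in query_terms]
    features = [feature_vector(index, doc, t) for t in query_terms]
    return embeddings, features


def combine(embeddings, features):
    """Sum over terms so the model input size does not depend on query length."""
    summed_emb = [sum(col) for col in zip(*embeddings)]
    summed_feat = [sum(col) for col in zip(*features)]
    return summed_emb + summed_feat


def predict(weights, inputs):
    """Predicted ranking feature of the stand-in linear model."""
    return sum(w * x for w, x in zip(weights, inputs))


def training_iteration(weights, inputs, label, lr=0.01):
    """Compare the prediction with the label and adjust the weights (gradient step)."""
    error = predict(weights, inputs) - label
    return [w - lr * error * x for w, x in zip(weights, inputs)]


# Hypothetical query-independent data for one training document.
index = {
    ("doc1", "red"): {"title": [0], "url": [], "body": [1, 10]},
    ("doc1", "car"): {"title": [1], "url": [], "body": [2]},
}
embeddings, features = build_training_set(index, ["red", "car"], "doc1")
inputs = combine(embeddings, features)
label = 1.0  # hypothetical relevance label for the training DQ pair
weights = [0.0] * len(inputs)
before = predict(weights, inputs)
weights = training_iteration(weights, inputs, label)
after = predict(weights, inputs)
print(before, after)
```

After one iteration the prediction moves toward the label, which is the behavior the comparison-and-adjust step is meant to produce.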

[16] In some embodiments of the method, the query-independent data has been stored in the inverted index prior to receipt of the query from the electronic device, and the query-dependent feature is generated after receipt of the query from the electronic device.

[17] In some embodiments of the method, the query-dependent feature is generated from the query-independent data in real time during a document ranking procedure of the search engine.

[18] In some embodiments of the method, the ranking is performed by a decision-tree-based machine learning algorithm (MLA) configured to rank the plurality of potentially relevant documents based on their relevance to the query.

[19] In some embodiments of the method, the method further comprises determining a similar term for a given one of the plurality of terms. When the inverted index is accessed to retrieve the query-independent data, the retrieved query-independent data comprises query-independent data for a third DT pair, the third DT pair having the given document and said similar term.

[20] In some embodiments of the method, accessing the inverted index further retrieves content-based query-independent data associated with the first DT pair and the second DT pair. The content-based query-independent data indicates the textual context of the respective term in the content associated with the given document.

[21] In some embodiments of the method, the term-dependent occurrence of the first term comprises at least one of: one or more positions of the first term in a title associated with the given document; one or more positions of the first term in a URL associated with the given document; and one or more positions of the first term in the body of the given document.
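
One possible shape for the term-dependent occurrence data of paragraph [21] is sketched below. The `TermOccurrence` class, its field names, and the whitespace tokenization are hypothetical; the source only states that positions in the title, URL, and body may be recorded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermOccurrence:
    """Query-independent data for one DT pair (assumed layout)."""
    title_positions: List[int] = field(default_factory=list)
    url_positions: List[int] = field(default_factory=list)
    body_positions: List[int] = field(default_factory=list)


def positions(tokens, term):
    """Zero-based token positions at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def index_document(title, url_tokens, body, term):
    """Build the occurrence record for one term at indexing time, before any query arrives."""
    return TermOccurrence(
        title_positions=positions(title.lower().split(), term),
        url_positions=positions(url_tokens, term),
        body_positions=positions(body.lower().split(), term),
    )


# Hypothetical document.
occurrence = index_document(
    title="Red car review",
    url_tokens=["example", "com", "red", "car"],
    body="the red car is a fast car",
    term="car",
)
print(occurrence)
```

Because the record depends on a single term and a single document, it can be computed and stored without knowing which other terms a future query will contain.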

[22] In some embodiments of the method, the group occurrence of the first term with the second term in the content associated with the given document comprises at least one of: the number of times the second term from the query is included in addition to the first term in the title associated with the given document; the number of times the second term from the query is included in addition to the first term in the URL associated with the given document; and the number of times the second term from the query is included in addition to the first term in the body of the given document.
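
The per-zone counts of paragraph [22] could be derived from per-term position lists along the following lines. This is a sketch under an explicit assumption: the source does not define when two terms count as occurring together, so the `max_gap` proximity window used here is a hypothetical choice, as are the function names and the sample positions.

```python
ZONES = ("title", "url", "body")


def count_together(first_positions, second_positions, max_gap=2):
    """Count occurrences of the second term that lie within `max_gap`
    tokens of at least one occurrence of the first term (assumed definition)."""
    return sum(
        1
        for s in second_positions
        if any(abs(s - f) <= max_gap for f in first_positions)
    )


def group_occurrence(first, second, max_gap=2):
    """Per-zone group-occurrence counts from two per-term position records."""
    return {
        zone: count_together(first[zone], second[zone], max_gap)
        for zone in ZONES
    }


# Hypothetical query-independent data for two DT pairs of the same document.
first_term = {"title": [0], "url": [2], "body": [1, 10]}
second_term = {"title": [1], "url": [3], "body": [2, 6, 11]}

print(group_occurrence(first_term, second_term))
```

With the sample data, the body occurrence at position 6 is not counted because it is more than two tokens away from both occurrences of the first term.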

[23] In a second broad aspect of the present technology, a server for ranking digital documents in response to a query is provided. The digital documents are potentially relevant to the query, which contains a first term and a second term. The query has been submitted by a user of an electronic device communicatively coupled to the server, which hosts a search engine. The search engine is associated with an inverted index that stores information associated with document-term (DT) pairs. The server is configured to, for a given document from a plurality of potentially relevant documents, access the inverted index to retrieve query-independent data for a first DT pair and a second DT pair. The first DT pair has the given document and the first term, and the second DT pair has the given document and the second term. The query-independent data indicates (i) a term-dependent occurrence of the first term in the content associated with the given document, and (ii) a term-dependent occurrence of the second term in the content associated with the given document. The server is configured to, for the given document, generate a query-dependent feature using the query-independent data retrieved for the first DT pair and the second DT pair. The query-dependent feature indicates a group occurrence of the first term with the second term in the content associated with the given document. The server is configured to, for the given document, generate a ranking feature for the given document based on at least the first term, the second term, and the query-dependent feature. The server is configured to rank the given document among the plurality of potentially relevant documents based on at least the ranking feature.

[24] In some embodiments of the server, the server uses a neural network (NN) to generate the ranking feature for the given document.

[25] In some embodiments of the server, the server is further configured to train the NN to generate the ranking feature. The server is configured to generate a training set for a training document-query (DQ) pair to be used during a given training iteration of the NN. The training DQ pair has a training query and a training document. To generate the training set, the server is configured to generate a plurality of training term embeddings based on respective terms from the training query. To generate the training set, the server is configured to access the inverted index associated with the search engine to retrieve a plurality of query-independent datasets associated with respective ones of a plurality of training document-term (DT) pairs. A given one of the plurality of training DT pairs includes the training document and a respective one of the plurality of terms from the training query. To generate the training set, the server is configured to generate a plurality of training feature vectors for the plurality of training DT pairs using the plurality of query-independent datasets. The server is configured to, during the given training iteration of the NN, input into the NN the plurality of training term embeddings and the plurality of training feature vectors to generate a predicted ranking feature for the training DQ pair. The server is configured to, during the given training iteration of the NN, adjust the NN based on a comparison between the label and the predicted ranking feature, so that the NN generates, for a given in-use DQ pair, a respective predicted ranking feature that is indicative of the relevance of the respective in-use document to the respective in-use query.
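The training iteration described above can be illustrated with a minimal sketch. It is not the network of the present technology: a single sigmoid unit stands in for the NN, the term embeddings and DT feature vectors are combined by simple averaging, and all function names, the learning rate and the label scale (0 to 1) are assumptions made for illustration only.

```python
import math

def build_input(term_embeddings, feature_vectors):
    """Concatenate each term embedding with its DT feature vector, then average over terms.

    Averaging is an assumed way of combining the per-term rows into one input vector.
    """
    rows = [emb + feats for emb, feats in zip(term_embeddings, feature_vectors)]
    return [sum(column) / len(rows) for column in zip(*rows)]

def predict(weights, bias, inputs):
    """Predicted ranking feature in (0, 1) from a single sigmoid unit (stand-in for the NN)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def training_iteration(weights, bias, term_embeddings, feature_vectors, label, lr=0.5):
    """One iteration: predict, compare with the label, adjust the parameters."""
    x = build_input(term_embeddings, feature_vectors)
    predicted = predict(weights, bias, x)
    error = predicted - label  # gradient of the log loss with respect to z
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias, predicted

# Hypothetical training DQ pair with two query terms.
embeddings = [[0.1, 0.2], [0.3, 0.4]]   # one embedding per training query term
dt_features = [[1.0, 0.0], [0.0, 2.0]]  # one feature vector per training DT pair
weights, bias = [0.0] * 4, 0.0
weights, bias, first_prediction = training_iteration(
    weights, bias, embeddings, dt_features, label=1.0)
```

Repeating the call with the same label moves the prediction toward that label, which is the adjustment the paragraph describes as being based on the comparison between the label and the predicted ranking feature.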

[26] In some embodiments of the server, the query-independent data has been stored in the inverted index before the query is received from the electronic device, and the query-dependent feature is generated after the query is received from the electronic device.

[27] In some embodiments of the server, the query-dependent feature is generated from the query-independent data in real time, during a document ranking procedure of the search engine.
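The split between data stored before the query arrives and a feature computed after it arrives can be sketched as follows. The index layout, the function names and the choice of minimum positional distance as the query-dependent feature are assumptions for illustration; the description does not prescribe them.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Offline step: map term -> {doc_id: [positions]}. No query is involved."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def min_distance(index, doc_id, first_term, second_term):
    """Query-time step: smallest gap between positions of the two terms in one document.

    Returns None when either term does not occur in the document.
    """
    first = index.get(first_term, {}).get(doc_id, [])
    second = index.get(second_term, {}).get(doc_id, [])
    if not first or not second:
        return None
    return min(abs(a - b) for a in first for b in second)

# Built once, before any query is received.
index = build_inverted_index({
    "d1": "fast search engine ranks documents for a search query",
})

# Computed only after the query "search query" is received.
feature = min_distance(index, "d1", "search", "query")
```

The positions stored for each document-term pair do not depend on any query, so they can be precomputed; only the pairing of two terms from a specific query makes the resulting value query-dependent.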

[28] In some embodiments of the server, the server is configured to perform the ranking using a decision-tree-based machine learning algorithm (MLA) configured to rank a plurality of potentially relevant documents based on their relevance to the query.

[29] In some embodiments of the server, the server is further configured to determine a similar term for a given one of the plurality of terms. When the inverted index is accessed to retrieve the query-independent data, the retrieved query-independent data includes query-independent data for a third DT pair, the third DT pair having the given document and the similar term.
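As a minimal sketch of this expansion, the snippet below adds a DT pair for a similar term next to the pairs built from the query terms themselves. The lookup table `similar_terms`, the index keyed by `(doc_id, term)` and both function names are hypothetical; how the similar term is actually determined is not specified here.

```python
def dt_pairs_with_similar_terms(doc_id, query_terms, similar_terms):
    """Return DT pairs for the query terms plus pairs for their similar terms, if any."""
    pairs = []
    for term in query_terms:
        pairs.append((doc_id, term))
        similar = similar_terms.get(term)
        if similar is not None and similar not in query_terms:
            pairs.append((doc_id, similar))  # e.g. the "third DT pair"
    return pairs

def retrieve(index, pairs):
    """Fetch query-independent data for each DT pair (empty dict when absent)."""
    return {pair: index.get(pair, {}) for pair in pairs}

# Hypothetical index holding body positions per (document, term) pair.
index = {
    ("d1", "cheap"): {"body": [4]},
    ("d1", "inexpensive"): {"body": [12, 30]},
    ("d1", "flights"): {"body": [5]},
}
pairs = dt_pairs_with_similar_terms(
    "d1", ["cheap", "flights"], {"cheap": "inexpensive"})
data = retrieve(index, pairs)
```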

[30] In some embodiments of the server, the server is configured to access the inverted index to additionally retrieve content-based query-independent data associated with the first DT pair and the second DT pair. The content-based query-independent data is indicative of a textual context of the respective term in the content associated with the given document.

[31] In some embodiments of the server, the term-dependent occurrence of the first term comprises at least one of: one or more positions of the first term in a title associated with the given document; one or more positions of the first term in a URL associated with the given document; and one or more positions of the first term in a body of the given document.

[32] In some embodiments of the server, the group occurrence of the first term with the second term in the content associated with the given document includes at least one of: a number of times the second term from the query is included, in addition to the first term, in the title associated with the given document; a number of times the second term from the query is included, in addition to the first term, in the URL associated with the given document; and a number of times the second term from the query is included, in addition to the first term, in the body of the given document.
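One possible reading of these per-zone counts is sketched below. It assumes that the term-dependent occurrence of each term is stored as lists of positions per zone (title, URL, body), and that the second term is counted in a zone only when the first term also occurs in that zone. Both the data layout and this counting rule are assumptions, not a definition taken from the description.

```python
ZONES = ("title", "url", "body")

def group_occurrence(first_positions, second_positions):
    """Count occurrences of the second term in each zone where the first term also occurs."""
    counts = {}
    for zone in ZONES:
        if first_positions.get(zone):
            counts[zone] = len(second_positions.get(zone, []))
        else:
            counts[zone] = 0
    return counts

# Hypothetical term-dependent occurrences for two DT pairs of the same document.
first_term = {"title": [0], "url": [], "body": [3, 17]}
second_term = {"title": [1], "url": [2], "body": [4, 18, 40]}

counts = group_occurrence(first_term, second_term)
```

In this example the second term appears in the URL, but the first term does not, so the URL count is zero under the assumed rule.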

[33] In the context of the present description, unless expressly stated otherwise, an "electronic device", a "server", a "remote server" and a "computer system" are any hardware and/or software suitable for the relevant task. Thus, some non-limiting examples of hardware and/or software include computers (servers, desktops, laptops, netbooks, etc.), smartphones, tablets, network equipment (routers, switches, gateways, etc.) and/or combinations thereof.

[34] In the context of the present description, unless expressly stated otherwise, the expressions "computer-readable medium" and "memory" are intended to include media of any nature and kind, non-limiting examples of which include RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard disk drives, etc.), USB keys, flash memory cards, solid-state drives and tape drives.

[35] In the context of the present description, unless expressly provided otherwise, an "indication" of an information element may be the information element itself or a pointer, reference, link, hyperlink or other indirect mechanism enabling the recipient of the indication to locate a network, memory, database or other computer-readable medium location from which the information element may be retrieved. For example, an indication of a document could include the document itself (i.e., its contents), or it could be a unique document descriptor identifying a file with respect to a particular file system, or some other means of directing the recipient of the indication to a network location, a database table or another location where the file may be accessed. A person skilled in the art will recognize that the degree of precision required in such an indication depends on the extent of any prior understanding about the interpretation to be given to information exchanged between the sender and the recipient of the indication. For example, if it is understood prior to a communication between a sender and a recipient that an indication of an information element will take the form of a database key for an entry in a particular table of a predetermined database containing the information element, then sending the database key is all that is required to effectively convey the information element to the recipient, even though the information element itself was not transmitted between the sender and the recipient of the indication.

[36] In the context of the present description, unless expressly provided otherwise, the words "first", "second", "third", etc. are used as adjectives only to distinguish the nouns they modify from one another, and not to describe any particular relationship between those nouns. Thus, for example, it should be understood that the use of the terms "first server" and "third server" is not intended to imply any particular order, type, chronology, hierarchy or ranking (for example) of or between those servers, nor is their use (by itself) intended to imply that any "second server" must necessarily exist in any given situation. Further, as discussed herein in other contexts, reference to a "first" element and a "second" element does not preclude the two elements from being the same actual real-world element. Thus, for example, in some instances a "first" server and a "second" server may be the same software and/or hardware, and in other cases they may be different software and/or hardware.

[37] Each implementation of the present technology has at least one of the above-mentioned objects and/or aspects, but does not necessarily have all of them. It should be understood that some aspects of the present technology that have resulted from attempting to attain the above-mentioned object may not satisfy this object and/or may satisfy other objects not explicitly recited herein. Additional and/or alternative features, aspects and advantages of implementations of the present technology will become apparent from the following description, the accompanying drawings and the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[38] For a better understanding of the present technology, as well as other aspects and further features thereof, reference is made to the following description, which is to be used in conjunction with the accompanying drawings, in which:

[39] FIG. 1 depicts a schematic diagram of a system implemented in accordance with non-limiting embodiments of the present technology.

[40] FIG. 2 depicts a representation of data stored by a database subsystem of the system of FIG. 1, in accordance with non-limiting embodiments of the present technology.

[41] FIG. 3 depicts a representation of a training iteration of a ranking feature generator of the system of FIG. 1, in accordance with non-limiting embodiments of the present technology.

[42] FIG. 4 depicts a representation of how a dynamic feature generator of the system of FIG. 1 is configured to generate query-dependent data based on query-independent data retrieved from an inverted index of the system of FIG. 1, in accordance with non-limiting embodiments of the present technology.

[43] FIG. 5 depicts a representation of an in-use phase of the ranking feature generator of the system of FIG. 1, in accordance with non-limiting embodiments of the present technology.

[44] FIG. 6 depicts a representation of how a ranking model of the system of FIG. 1 is configured to generate a ranked list of documents, in accordance with non-limiting embodiments of the present technology.

[45] FIG. 7 is a schematic representation of a method performed by a server of the system of FIG. 1, in accordance with non-limiting embodiments of the present technology.

Detailed description

[46] The examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the present technology, and not to limit its scope to such specifically recited examples and conditions. It will be appreciated that those skilled in the art may devise various arrangements which, although not explicitly described herein, nonetheless embody the principles of the present technology and are included within its spirit and scope.

[47] Furthermore, the following description may describe implementations of the present technology in a relatively simplified form as an aid to understanding. Those skilled in the art will understand that various implementations of the present technology may be of greater complexity.

[48] In some cases, examples of modifications to the present technology that are believed to be helpful may also be set forth. This is done merely as an aid to understanding and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and a person skilled in the art may make other modifications while nonetheless remaining within the scope of the present technology. Further, where no examples of modifications have been set forth, it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology.

[49] Moreover, all statements herein reciting principles, aspects and implementations of the present technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof, whether they are currently known or developed in the future. Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the present technology. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo-code and the like represent various processes which may be substantially represented in computer-readable media and executed by a computer or processor, whether or not such a computer or processor is explicitly shown.

[50] The functions of the various elements shown in the figures, including any functional block labeled as a "processor" or a "graphics processing unit", may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In some embodiments of the present technology, the processor may be a general-purpose processor, such as a central processing unit (CPU), or a processor dedicated to a specific purpose, such as a graphics processing unit (GPU). Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, a network processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.

[51] Software modules, or simply modules which are implied to be software, may be represented herein as any combination of flowchart elements or other elements indicating performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown.

[52] With these fundamentals in place, we will now consider some non-limiting examples to illustrate various implementations of aspects of the present technology.

[53] Referring to FIG. 1, there is shown a schematic diagram of a system 100, the system 100 being suitable for implementing non-limiting embodiments of the present technology. It is to be expressly understood that the system 100 as depicted is merely an illustrative implementation of the present technology. Thus, the description that follows is intended only as a description of illustrative examples of the present technology.

[54] In the illustrated example, the system 100 may be used to provide one or more online services to a given user. To that end, the system 100 comprises, inter alia, an electronic device 104 associated with a user 101, a server 112, a plurality of resource servers 108 and a database subsystem 150.

[55] In the context of the present technology, the system 100 is used to provide search engine services. For example, the user 101 may submit a given query via the electronic device 104 to the server 112, which, in response, is configured to provide search results to the user 101. The server 112 generates these search results based on information that has been retrieved, for example, from the plurality of resource servers 108 and stored in the database subsystem 150. The search results provided by the system 100 may be relevant to the submitted query.

[56] The server 112 can be said to be configured to execute a plurality of computer-implemented algorithms, hereinafter referred to as a "search engine" 160. As will become apparent from the description below, the search engine 160 is generally configured to identify digital documents that are potentially relevant to a query and to rank them based on their relevance to the query.

ELECTRONIC DEVICE

[57] As mentioned above, the system 100 comprises the electronic device 104 associated with the user 101. As such, the electronic device 104, or simply the "device" 104, can sometimes be referred to as a "client device", an "end user device" or a "client electronic device". It should be noted that the fact that the electronic device 104 is associated with the user 101 does not need to suggest or imply any mode of operation, such as a need to log in, a need to be registered, or the like.

[58] In the context of the present description, unless expressly stated otherwise, an "electronic device" or "device" is any computer hardware that is capable of running software appropriate to the relevant task at hand. Thus, some non-limiting examples of the device 104 include personal computers (desktops, laptops, netbooks, etc.), smartphones, tablets and the like. The device 104 comprises hardware and/or software and/or firmware (or a combination thereof), as is known in the art, to execute a given browser application (not shown).

[59] Broadly speaking, the purpose of the given browser application is to enable the user 101 to access one or more web resources. How the given browser application is implemented is not particularly limited. One example of the given browser application executed by the device 104 is the Yandex™ browser. For example, the user 101 may use the given browser application to (i) navigate to a given search engine website and (ii) submit a query in response to which the user is to be provided with relevant search results.

[60] The device 104 is configured to generate a request 180 for communicating with the server 112. The request 180 may take the form of one or more data packets comprising information indicative of, in one example, the query submitted by the user 101. The device 104 is also configured to receive a response 190 from the server 112. The response 190 may take the form of one or more data packets comprising information indicative of, in one example, search results that are relevant to the submitted query, together with computer-readable instructions for the given browser application to display these search results to the user 101.

Communication network

[61] The system 100 comprises a communication network 110. In one non-limiting example, the communication network 110 may be implemented as the Internet. In other non-limiting examples, the communication network 110 may be implemented differently, such as any wide area communication network, local area communication network, private communication network, and the like. In fact, how the communication network 110 is implemented is not limiting and will depend, among other things, on how the other components of the system 100 are implemented.

[62] The purpose of the communication network 110 is to communicatively couple at least some components of the system 100, such as the device 104, the plurality of resource servers 108, and the server 112. For example, this means that the plurality of resource servers 108 is accessible via the communication network 110 by the device 104. In another example, this means that the plurality of resource servers 108 is accessible via the communication network 110 by the server 112. In a further example, this means that the server 112 is accessible via the communication network 110 by the device 104.

[63] The communication network 110 may be used to transmit data packets among the device 104, the plurality of resource servers 108, and the server 112. For example, the communication network 110 may be used to transmit the request 180 from the device 104 to the server 112. In another example, the communication network 110 may be used to transmit the response 190 from the server 112 to the device 104.

Plurality of resource servers

[64] As mentioned above, the plurality of resource servers 108 can be accessed via the communication network 110. The plurality of resource servers 108 may be implemented as conventional computer servers. In a non-limiting example of an embodiment of the present technology, a given one of the plurality of resource servers 108 may be implemented as a Dell™ PowerEdge™ server running the Microsoft™ Windows Server™ operating system. The given one of the plurality of resource servers 108 may also be implemented in any other suitable hardware and/or software and/or firmware, or a combination thereof.

[65] The plurality of resource servers 108 is configured to host (web) resources that can be accessed by the device 104 and/or the server 112. The type of resources hosted by the plurality of resource servers 108 is not limited. However, in some embodiments of the present technology, the resources may comprise digital documents, or simply "documents", that are representative of web pages.

[66] For example, the plurality of resource servers 108 may host web pages, which means that the plurality of resource servers 108 may store documents that are representative of web pages and accessible by the device 104 and/or the server 112. A given document may be written in a markup language and may comprise, among other things, (i) the content of a respective web page and (ii) computer-readable instructions for displaying the respective web page (its content).

[67] The device 104 may access a given one of the plurality of resource servers 108 to retrieve a given document stored on that resource server. For example, the user 101 may enter a web address associated with a given web page in the given browser application of the device 104 and, in response, the device 104 may access the resource server hosting the given web page to retrieve the document representative of that web page and display its content via the given browser application.

[68] The server 112 may access a given one of the plurality of resource servers 108 to retrieve a given document stored on that resource server. The purpose for which the server 112 accesses and retrieves documents from the plurality of resource servers 108 will be described in greater detail herein below.

Database subsystem

[69] The server 112 is communicatively coupled to a database subsystem 150. Broadly speaking, the database subsystem 150 is configured to receive data from the server 112, store the data, and/or provide the data to the server 112 for further use.

[70] In some embodiments, the database subsystem 150 may be configured to store information associated with the server 112, referred to herein as "search engine data" 175. For example, the database subsystem 150 may store information about searches previously performed by the search engine 160, about queries previously submitted to the server 112, and about documents that have been provided by the search engine 160 of the server 112 as search results.

[71] It is contemplated that, as part of the search engine data 175, the database subsystem 150 may store query data associated with respective queries submitted to the search engine 160. The query data associated with a given query may be of different types and is not limiting. For example, the database subsystem 150 may store query data for respective queries such as, but not limited to:

the popularity of a given query;

the frequency of submission of the given query;

the number of clicks associated with the given query;

indications of other submitted queries associated with the given query;

indications of documents associated with the given query;

other statistical data associated with the given query;

search terms associated with the given query;

the number of characters in the given query; and

other query-intrinsic characteristics of the given query.

[72] It is contemplated that, as part of the search engine data 175, the database subsystem 150 may also store document data associated with respective documents. The document data associated with a given document may be of different types and is not limiting. For example, the database subsystem 150 may store document data for respective documents such as, but not limited to:

the popularity of a given document;

the ratio of clicks to impressions for the given document;

the time per click associated with the given document;

indications of queries associated with the given document;

other statistical data associated with the given document;

text associated with the given document;

the file size of the given document; and

other document-intrinsic characteristics of the given document.

[73] As will be discussed in greater detail below with reference to FIG. 2, the database subsystem 150 may be configured to store content associated with respective digital documents on a document-by-document basis. For example, the database subsystem 150 may be configured to store content associated with a given digital document.

[74] It is contemplated that, as part of the search engine data 175, the database subsystem 150 may also store user data associated with respective users. The user data associated with a given user may be of different types and is not limiting. For example, the database subsystem 150 may store user data for respective users such as, but not limited to:

past web session data associated with a given user;

past queries submitted by the given user;

click history data associated with the given user;

other data on interactions between the given user and documents; and

user preferences.

[75] As illustrated in FIG. 1, the database subsystem 150 is also configured to store a structured collection of data, hereinafter referred to as an "inverted index" 170. Broadly speaking, the inverted index 170 is a data structure that can be said to be a component of the search engine 160. For example, a forward index may be generated first, which stores a list of words (or terms) for each document; the forward index may then be inverted so that it stores a list of documents for each term. Querying the forward index may require sequentially iterating over each document and each term to verify a matching document. The time, memory, and processing resources needed to perform such a query are not always technically realistic. In contrast, one advantage of the inverted index 170 is that querying such a data structure in real time requires comparatively less time, memory, and processing resources.
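The inversion described above can be illustrated with a minimal Python sketch. The document identifiers, the terms, and the helper names `invert` and `scan_forward` are hypothetical and chosen only for illustration; they are not part of the described system.

```python
from collections import defaultdict

# Hypothetical forward index: document ID -> list of terms in that document.
forward_index = {
    "doc1": ["cheap", "flights", "to", "paris"],
    "doc2": ["paris", "hotels"],
    "doc3": ["cheap", "hotels"],
}


def invert(forward):
    """Build a mapping term -> sorted list of document IDs from a forward index."""
    inverted = defaultdict(set)
    for doc_id, terms in forward.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in inverted.items()}


def scan_forward(forward, term):
    """Query the forward index: every document and its terms must be visited."""
    return sorted(doc_id for doc_id, terms in forward.items() if term in terms)


inverted_index = invert(forward_index)

# Querying the inverted index is a single dictionary lookup per term.
hits = inverted_index.get("paris", [])
```

Both `scan_forward(forward_index, "paris")` and the lookup in `inverted_index` return the same documents; the difference is that the scan touches every document, while the lookup touches only the list stored for the term.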

[76] As will be described in greater detail below with reference to FIG. 2, the inverted index 170 is configured to store a plurality of document lists, each associated with a respective term, where a given document list comprises a plurality of documents that contain the respective term. In addition, the inverted index 170 is configured to store data for respective term-document pairs.
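As a rough sketch of such a layout, the Python snippet below keeps one document list per term and attaches data to each term-document pair. Using the positions of the term inside the document as the per-pair data is an assumption made for this example only, as are the documents and the name `build_index`.

```python
from collections import defaultdict

# Hypothetical documents, for illustration only.
documents = {
    "doc1": "cheap flights to paris and cheap hotels",
    "doc2": "paris hotels",
}


def build_index(docs):
    """Map term -> {document ID -> data stored for that term-document pair}.

    The per-pair data here is the list of word positions of the term in the
    document (an assumption; the text does not fix the data at this point).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


index = build_index(documents)

# Document list for the term "paris".
document_list = sorted(index["paris"])

# Data stored for the pair ("cheap", "doc1").
pair_data = index["cheap"]["doc1"]
```

Because the per-pair data depends only on the term and the document, it can be computed once at indexing time, before any query is received.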

Server

[77] The system 100 comprises the server 112, which may be implemented as a conventional computer server. In an example of an embodiment of the present technology, the server 112 may be implemented as a Dell™ PowerEdge™ server running the Microsoft™ Windows Server™ operating system. Needless to say, the server 112 may be implemented in any other suitable hardware and/or software and/or firmware, or a combination thereof. In the depicted non-limiting embodiment of the present technology, the server 112 is a single server. In alternative non-limiting embodiments of the present technology, the functionality of the server 112 may be distributed and implemented via multiple servers.

[78] As shown in FIG. 1, the server 112 is configured to host a search engine 160 for providing search engine services. In some embodiments, the server 112 may be under the control and/or management of a search engine provider (not shown), such as, for example, the operator of the Yandex™ search engine. As such, the server 112 may be configured to host the search engine 160 for performing one or more searches in response to queries submitted by users of the search engine 160.

[79] For example, the server 112 may receive the request 180 from the device 104, indicative of the query submitted by the user 101. The server 112 may perform a search in response to the submitted query to generate search results relevant to the submitted query. As a result, the server 112 may be configured to generate the response 190, indicative of the search results, and may transmit the response 190 to the device 104 for display of the search results to the user 101, for example via the given browser application.

[80] The search results generated for the submitted query may take many forms. However, in one non-limiting example of the present technology, the search results generated by the server 112 may be indicative of documents that are relevant to the submitted query. How the server 112 is configured to determine and retrieve documents that are relevant to the submitted query will become apparent from the description herein.

[81] The server 112 may also be configured to execute a crawler application 120. Broadly speaking, the crawler application 120 may be used by the server 112 to "visit" resources accessible via the communication network 110 and to retrieve/download them for further use. For example, the crawler application 120 may be used by the server 112 to access the plurality of resource servers 108 and to retrieve/download documents representative of web pages hosted by the plurality of resource servers 108.

[82] It is contemplated that the crawler application 120 may be periodically executed by the server 112 to retrieve/download documents that have been updated and/or have become accessible over the communication network 110 since a previous execution of the crawler application 120.

[83] The server 112 may also be configured to execute a "ranking model" 130, broadly configured to use information about a given query and a plurality of potentially relevant documents to rank those documents in response to the query. In at least one embodiment of the present technology, the ranking model 130 may be implemented as one or more machine learning algorithms (MLAs). It is contemplated that the plurality of potentially relevant documents may be identified by the server 112 using information stored in the inverted index 170.
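The ranking step can be pictured with the short Python sketch below. The candidate documents, the feature names `term_match` and `popularity`, and the fixed linear `score` function are all hypothetical stand-ins; in the described system the scoring would be performed by the ranking model 130 rather than by hand-picked weights.

```python
# Hypothetical candidates, e.g. documents identified through the inverted index,
# each with made-up feature values.
candidates = {
    "doc1": {"term_match": 0.9, "popularity": 0.2},
    "doc2": {"term_match": 0.4, "popularity": 0.9},
    "doc3": {"term_match": 0.7, "popularity": 0.6},
}


def score(features):
    """Stand-in for a ranking model: a fixed linear combination of features."""
    return 0.7 * features["term_match"] + 0.3 * features["popularity"]


# Rank candidate documents from the highest score to the lowest.
ranked = sorted(candidates, key=lambda doc_id: score(candidates[doc_id]), reverse=True)
```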

[84] Broadly speaking, a given MLA is first "built" (or trained) using training data and training targets. During a given training iteration, the MLA is fed a training input and generates a respective prediction. The server 112 is then configured to, in a sense, "adjust" the MLA based on a comparison of the prediction against the respective training target for that training input. For example, the adjustment may be performed by the server 112 using one or more machine learning techniques such as, but not limited to, back-propagation. Thus, after a large number of training iterations, the MLA is "adjusted" so that the predictions it makes for given inputs are close to the respective training targets.
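The predict, compare, and adjust cycle described above can be sketched in a few lines of Python. The sketch assumes a one-weight linear model trained by plain gradient descent on squared error, which stands in for the MLA; the training pairs, the learning rate, and the iteration count are illustrative values, not taken from the described system.

```python
# Hypothetical training set: (input, target) pairs that follow target = 2 * x + 1.
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# Model parameters, initially untrained.
weight, bias = 0.0, 0.0
learning_rate = 0.05

for _ in range(2000):  # training iterations over the whole set
    for x, target in training_data:
        prediction = weight * x + bias      # the model generates a prediction
        error = prediction - target         # compare against the training target
        weight -= learning_rate * error * x  # adjust parameters to reduce the error
        bias -= learning_rate * error
```

After training, `weight` and `bias` end up close to 2 and 1, so predictions for the training inputs are close to their targets.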

[85] In at least some embodiments of the present technology, the ranking model 130 may be implemented as a given decision-tree-based MLA. Broadly speaking, a given decision-tree-based MLA is a machine learning model having one or more "decision trees" that are used to go from observations about an object (represented in the branches) to conclusions about the object's target value (represented in the leaves). In one non-limiting implementation of the present technology, the decision-tree-based MLA may be implemented in accordance with the CatBoost framework.
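The branch-to-leaf traversal can be illustrated with the hand-written Python sketch below. It does not use CatBoost and does not reproduce how CatBoost builds or stores its trees; the tuple-based tree encoding, the feature names, the thresholds, and the leaf values are all hypothetical.

```python
# A tree is either a leaf value (float) or a tuple
# (feature name, threshold, left subtree, right subtree).
tree_a = ("term_match", 0.5, -0.2, ("click_rate", 0.1, 0.1, 0.6))
tree_b = ("click_rate", 0.3, 0.0, 0.3)


def predict_tree(tree, features):
    """Follow branches according to the observed features until a leaf is reached."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = right if features[feature] > threshold else left
    return tree


def predict_ensemble(trees, features):
    """Sum the leaf values of all trees, as an additive tree ensemble would."""
    return sum(predict_tree(tree, features) for tree in trees)


doc_features = {"term_match": 1.0, "click_rate": 0.4}
score = predict_ensemble([tree_a, tree_b], doc_features)
```

Each internal node tests one observation about the object, and the leaf that is reached supplies that tree's contribution to the predicted target value.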

[86] How a decision-tree-based MLA may be trained in accordance with at least some embodiments of the present technology is disclosed in US Patent Publication No. 2019/0164084, entitled "METHOD OF AND SYSTEM FOR GENERATING PREDICTION QUALITY PARAMETER FOR A PREDICTION MODEL EXECUTED IN A MACHINE LEARNING ALGORITHM", published on May 30, 2019, the contents of which are incorporated herein by reference in their entirety. Additional information about the CatBoost library, its implementation, and gradient boosting algorithms is available at https://catboost.ai.

[87] What data is used by the ranking model 130 to generate a ranked list of documents in response to a query will be discussed in greater detail herein below with reference to FIG. 6. It should be noted, however, that the ranking model 130 may be configured to use a "ranking feature" for a given "document-query pair" to rank the respective document in response to the query.

[88] The server 112 is configured to execute a "ranking feature generator" 140, which is broadly configured to use query-dependent data to generate one or more ranking features to be used by the ranking model 130. In at least one embodiment of the present technology, the ranking feature generator 140 may be implemented as a neural network (NN).

[89] Broadly speaking, NNs are a specific class of MLAs consisting of interconnected groups of artificial "neurons" that process information using a connectionist approach to computation. NNs are used to model complex relationships between inputs and outputs (without actually knowing those relationships) or to find patterns in data. NNs are first conditioned during a training phase, in which they are provided with a known set of "inputs" and information for adapting the NN to generate appropriate outputs (for a given situation that is being modeled). During this training phase, the NN adapts to the situation being learned and changes its structure so that it can provide reasonable predicted outputs for given inputs in a new situation (based on what was learned). Thus, rather than trying to determine complex statistical arrangements or mathematical algorithms for a given situation, the NN tries to provide an "intuitive" answer based on a "feeling" for the situation. The NN is therefore a kind of trained "black box", which can be used where what is inside the "box" is less important than having a "box" that gives reasonable answers to the given inputs. For example, NNs are commonly used to optimize the distribution of web traffic between servers and in data processing, including filtering, clustering, signal separation, compression, vector generation, and the like.

[90] How the ranking feature generator 140 (e.g., an NN) can be trained to generate a given ranking feature for a given document-query pair will be described in more detail below with reference to FIG. 3. How the ranking feature generator 140 can then be used in real time, during its in-use phase, by the search engine 160 will be described in more detail below with reference to FIG. 5. However, it should be mentioned that the ranking feature generator 140 may use query-dependent data that is dynamically generated for a given document-query pair.

[91] The server 112 may also be configured to execute one or more computer-implemented algorithms, hereinafter referred to as a "dynamic feature generator" 155, which is broadly configured to use query-independent data retrieved from the inverted index 170 for one or more terms in order to generate query-dependent data in real time (dynamically). How the query-independent data may be used by the dynamic feature generator 155 to generate query-dependent data, and how the dynamic feature generator may be implemented by the server 112, will be discussed in more detail below with reference to FIG. 4.

[92] With reference to FIG. 2, there is depicted a representation 200 of at least some of the data stored in the database subsystem 150. For example, there is depicted a representation 202 of at least some of the document data stored in the database subsystem 150 as part of the search engine data 175, as explained above.

[93] As shown, the database subsystem 150 may be configured to store a plurality of documents 204 associated with respective content data. For example, the database subsystem 150 may store the document 210 together with the content data 212. In this example, the document 210 may be a given web page that has been crawled by the crawler application 120 and downloaded from one of the resource servers 108.

[94] The content data 212 may comprise a plurality of content types 214. For example, one of the plurality of content types 214 may include the title of the document 210. In the same example, another one of the plurality of content types 214 may include a Uniform Resource Locator (URL) associated with the document 210. In the same example, a further one of the plurality of content types 214 may include the body content of the document 210. Additionally or alternatively, content types other than those non-exhaustively listed above may be stored as part of the content data for a given document without departing from the scope of the present technology.

[95] Also depicted is a representation 222 of at least some of the data stored as part of the inverted index 170. As explained above, the inverted index 170 may store a plurality of document lists 224 associated with respective terms. Broadly speaking, a given document list for a given search term contains references, for example in the form of document numbers, to those documents in which the search term occurs. The references in a given document list may themselves be in numerical order, although there will be gaps between the document numbers, since the search term does not occur in the documents whose numbers are skipped. For example, the inverted index 170 is configured to store a document list 250 associated with the term 220 (T1) and a plurality of documents 230. As illustrated, the plurality of documents 230 comprises the document 210 (D1) and other documents that contain the term 220 (T1) in their content.

[96] It should be noted that a given document list may comprise data indicative of one or more location pointers regarding the position of the respective term in the respective documents.
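The document lists described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a toy three-document collection; the function name `build_inverted_index` and the sample documents are hypothetical and are not taken from the present description.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_number, positions), sorted by doc_number.

    `docs` is a dict {doc_number: text}. Tokenization is plain whitespace
    splitting, which is an assumption made only for this illustration.
    """
    index = defaultdict(list)
    for doc_number in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_number].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            # Each entry references one document and the term positions in it.
            index[term].append((doc_number, pos_list))
    return dict(index)


# Toy collection (hypothetical).
docs = {
    1: "table borders in css",
    2: "styling a list",
    3: "css table example with a table border",
}
index = build_inverted_index(docs)
table_postings = index["table"]
```

In this sketch the list for "table" references documents 1 and 3 only; document number 2 is skipped because the term does not occur there, which is the kind of gap between document numbers mentioned above.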

[97] In some embodiments, it is contemplated that the plurality of document lists 224 may comprise document lists for "similar terms". For example, a first term may be similar to a second term if the second term is, without limitation, a synonym of the first term, a normalized version of the first term, and the like. In one example, the first term may be "occupation" and the second term may be "work". In another example, the first term may be "working" and the second term may be "work". As will become apparent from the description below, the server 112 can be configured to identify one or more terms similar to the terms from the query and to use them, in addition to the terms from the query, to retrieve data from the inverted index 170.
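As a rough illustration of using similar terms in addition to the query terms, the sketch below expands a query through a lookup table. The table `SIMILAR_TERMS` and the function `expand_query_terms` are hypothetical; how similar terms are actually identified is not specified here.

```python
# Hypothetical lookup of similar terms (synonyms or normalized forms).
SIMILAR_TERMS = {
    "occupation": ["work"],
    "working": ["work"],
}


def expand_query_terms(query_terms, similar=SIMILAR_TERMS):
    """Return the query terms plus their similar terms, without duplicates."""
    expanded = []
    for term in query_terms:
        for candidate in [term, *similar.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Each expanded term could then be looked up in the inverted index.
expanded = expand_query_terms(["working", "hours"])
```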

[98] In the context of the present technology, the inverted index 170 is configured to store query-independent data for respective document-term (DT) pairs. For example, the server 112 may be configured to store query-independent data 260 for the D1-T1 pair, that is, the query-independent data 260 is associated with the pair comprising the term 220 and the document 210.

[99] It can be said that the query-independent data 260 includes a DT-pair-specific dataset stored for the respective D1-T1 pair. Similarly, the inverted index 170 may be configured to store a plurality of query-independent data 240 (a plurality of DT-pair-specific datasets) for respective DT pairs.

[100] It should be noted that, in the context of the present technology, "query-independent" data may refer to data that is determined and stored without a priori knowledge of what the current in-use query is. It is contemplated that the query-independent data 260 is indicative of a term-specific occurrence of the term 220 (T1) in the content associated with the document 210 (D1).

[101] It is contemplated that the term-specific occurrence of the term 220 (T1) may include at least one of: (i) one or more positions of the term 220 (T1) in the title associated with the document 210 (D1), (ii) one or more positions of the term 220 (T1) in the URL associated with the document 210 (D1), and (iii) one or more positions of the term 220 (T1) in the body of the document 210 (D1).
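One possible in-memory shape for such a DT-pair dataset is sketched below. The class `TermOccurrence`, the helper names, and the tokenization rule (lowercase runs of letters and digits) are assumptions made for illustration only; the sample document is a toy input.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TermOccurrence:
    """Positions of one term in one document, split by content type."""
    title_positions: list = field(default_factory=list)
    url_positions: list = field(default_factory=list)
    body_positions: list = field(default_factory=list)


def tokenize(text):
    # Assumed tokenization: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(title, url, body):
    """Return {term: TermOccurrence} for one document; no query is involved."""
    record = {}
    zones = (
        ("title_positions", title),
        ("url_positions", url),
        ("body_positions", body),
    )
    for zone, text in zones:
        for pos, term in enumerate(tokenize(text)):
            occ = record.setdefault(term, TermOccurrence())
            getattr(occ, zone).append(pos)
    return record


d1 = index_document(
    title="CSS styling tables",
    url="https://www.w3schools.com/css/css_table.asp",
    body="To specify table borders in CSS, use the border property",
)
```

Because the record is built from the document alone, it can be computed at indexing time and stored before any query is received.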

[102] In some embodiments of the present technology, the query-independent data 260 for the D1-T1 pair may further comprise query-independent content-based data associated with the D1-T1 pair. In these embodiments, the query-independent content-based data may be indicative of the textual context of the respective term within the content associated with the given document. In the example of the D1-T1 pair, the query-independent content-based data in the query-independent data 260 may be indicative of one or more terms neighboring the term 220 (T1) in the content of the document 210 (D1).

[103] As will become apparent from the description below, the dynamic feature generator 155 can use the query-independent data stored in the inverted index 170 to generate, in real time, query-dependent data for a given document-query (DQ) pair. The ranking feature generator 140 can then use the query-dependent data and data associated with the various terms from the query to generate a given ranking feature for the given DQ pair.

[104] How the server 112 is configured to use the dynamic feature generator 155 in real time, in response to receiving a current query, so that the ranking feature generator 140 can generate a ranking feature in real time will be discussed further below with reference to FIG. 5. However, how the server 112 is configured to use the dynamic feature generator 155 to generate training data for training the ranking feature generator 140 will be discussed first, with reference to FIGS. 3 and 4.

[105] FIG. 3 depicts a representation 300 of training data 310 and a representation 350 of a single training iteration of the ranking feature generator 140 (the NN). The server 112 may be configured to use the training data 310 to generate a training set for performing the single training iteration.

[106] The training data 310 comprises a training DQ pair 302 having a training query 304 (Qa) and a training document 306 (Da). The training query 304 has a plurality of terms 330 comprising a first term 332 (Ta), a second term 334 (Tb), and a third term 336 (Tc). The training data 310 is associated with a label 320. The label 320 is indicative of the relevance of the training document 306 (Da) to the training query 304 (Qa).

[107] For example, the label 320 may be determined by a human assessor who has been tasked with "assessing" the relevance of the training document 306 (Da) to the training query 304 (Qa). Similarly, one or more assessors may be tasked with assessing a plurality of training DQ pairs, and the plurality of training DQ pairs may be stored in the database subsystem 150 together with the respective assessed labels.

[108] The server 112 may be configured to use the training data 310 and the label 320 to generate a training set 380 for performing the single training iteration of the ranking feature generator 140 shown in FIG. 3. The training set 380 comprises a plurality of training term embeddings 382, a plurality of training feature vectors 384, and the label 320.
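The three components of such a training set can be pictured with a small container type. This is only a sketch: the class name `TrainingSet`, the field names, and the numeric values are hypothetical, and the one-to-one pairing of embeddings and feature vectors per query term is an assumption drawn from the per-term description in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSet:
    term_embeddings: List[List[float]]  # one embedding per query term (cf. 382)
    feature_vectors: List[List[float]]  # one feature vector per query term (cf. 384)
    label: float                        # assessed relevance (cf. 320)

    def __post_init__(self):
        # Assumption: exactly one feature vector accompanies each term embedding.
        if len(self.term_embeddings) != len(self.feature_vectors):
            raise ValueError("need one feature vector per term embedding")


# Toy values for a three-term query (hypothetical numbers).
training_set = TrainingSet(
    term_embeddings=[[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]],
    feature_vectors=[[1.0, 0.0], [0.0, 0.0], [2.0, 1.0]],
    label=1.0,
)
```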

[109] The server 112 can be configured to generate the plurality of training term embeddings 382 based on the respective terms from the training query 304. Broadly speaking, an embedding is a relatively low-dimensional space into which high-dimensional vectors can be translated. Embeddings make it easier to do machine learning on large inputs, such as sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. It is also contemplated that embeddings can be learned and reused across different models. The server 112 may be configured to execute one or more computer-implemented embedding algorithms configured to receive a given training term as input and to output a respective training term embedding. For example, the server 112 may be configured to generate a first training term embedding 342 based on the first term 332 (Ta), a second training term embedding 344 based on the second term 334 (Tb), and a third training term embedding 346 based on the third term 336 (Tc).
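The idea that an embedding places semantically similar inputs close together can be sketched with a toy lookup table and cosine similarity. The vectors below are invented for illustration (a real embedding algorithm would learn them), and the names `EMBEDDINGS`, `embed`, and `cosine` are hypothetical.

```python
import math

# Toy 3-dimensional embeddings; the values are made up for illustration.
EMBEDDINGS = {
    "table": [0.9, 0.1, 0.0],
    "border": [0.8, 0.3, 0.1],
    "css": [0.1, 0.9, 0.2],
}
UNKNOWN = [0.0, 0.0, 0.0]


def embed(term):
    """Return the embedding of a term, or a zero vector for unknown terms."""
    return EMBEDDINGS.get(term, UNKNOWN)


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# One embedding per term of the training query.
query_embeddings = [embed(t) for t in ["table", "border", "css"]]
sim_table_border = cosine(embed("table"), embed("border"))
sim_table_css = cosine(embed("table"), embed("css"))
```

With these made-up vectors, "table" ends up closer to "border" than to "css"; with learned embeddings the geometry would reflect the training data instead.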

[110] As mentioned above, the training set 380 comprises the plurality of feature vectors 384. It should be noted that one or more features from the plurality of feature vectors 384 may be generated by the dynamic feature generator 155. As will now be discussed in more detail with reference to FIG. 4, one or more features from the plurality of feature vectors 384 may comprise query-dependent data generated by the dynamic feature generator 155 based on query-independent data retrieved from the inverted index 170 for the plurality of training terms 330.

[111] FIG. 4 depicts a representation 400 of how the dynamic feature generator 155 generates the plurality of feature vectors 384 for the training set 380. The server 112 is configured to access the inverted index 170 to retrieve a plurality of query-independent datasets associated with respective training DT pairs. That is, the server 112 may retrieve a first query-independent dataset 402 for the training Da-Ta pair, a second query-independent dataset 404 for the training Da-Tb pair, and a third query-independent dataset 406 for the training Da-Tc pair. It can be said that the first query-independent dataset 402, the second query-independent dataset 404, and the third query-independent dataset 406 represent the query-independent data retrieved from the inverted index 170 for the plurality of training DT pairs.

[112] The server 112 is configured to input the so-retrieved query-independent data into the dynamic feature generator 155. Broadly speaking, the dynamic feature generator 155 is configured to generate a respective feature vector for each training term from the training query 304. In the illustrated example, the dynamic feature generator 155 uses the so-retrieved query-independent data to generate a first feature vector 352 for the training term 332 (Ta), a second feature vector 354 for the training term 334 (Tb), and a third feature vector 356 for the training term 336 (Tc).

[113] It should be noted that the dynamic feature generator 155 may comprise a plurality of computer-implemented algorithms configured to process the query-independent data to generate a plurality of features (for each training term) whose types are predetermined. For example, the types of features to be generated by the dynamic feature generator 155 can be determined by an operator of the server 112. In this example, the operator can determine the types of features that the operator considers useful for a particular implementation of the present technology.

[114] In some embodiments, the feature types may include query-independent features and query-dependent features. The developers of the present technology have realized that the query-independent data stored in the inverted index 170 can be retrieved in real time and processed to generate query-dependent data. In particular, the developers have realized that the query-independent datasets retrieved for respective DT pairs, taken individually, do not include information that depends on more than one term from the query (hence, query-independent). However, when the query-independent datasets are retrieved for each DT pair and analyzed in combination with one another, the server 112 may, in a sense, "extract" query-dependent data from them in the form of one or more query-dependent features.
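A minimal sketch of this combination step follows. Each position list on its own depends on a single term, but comparing two of them yields a value that depends on a pair of query terms. The feature chosen here (smallest distance between two terms in the body) is only one hypothetical example of a query-dependent feature, and the position values are toy inputs.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any occurrence of term A and any occurrence of term B.

    Returns None when either term does not occur, since no distance exists.
    """
    if not positions_a or not positions_b:
        return None
    return min(abs(a - b) for a in positions_a for b in positions_b)


# Per-term body positions, as they might be retrieved for three DT pairs
# (toy values; each list alone says nothing about the other terms).
body_positions = {
    "table": [2],
    "border": [8],
    "css": [5],
}

# Combining two per-term lists produces a feature tied to a pair of query terms.
table_border = min_distance(body_positions["table"], body_positions["border"])
table_css = min_distance(body_positions["table"], body_positions["css"])
```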

[115] To better illustrate this, let it be assumed that the query 304 is "table border css" and, therefore, the term 332 (Ta) is "table", the term 334 (Tb) is "border", and the term 336 (Tc) is "css". Let it also be assumed that the document 306 (Da) is associated with the URL "https://www.w3schools.com/css/css_table.asp", has the title "CSS styling tables", and has a body containing the following sentences: "Table Borders", "To specify table borders in CSS, use the border property", and "The example below specifies a black border for <table>, <th>, and <td> elements".

[116] In this example, the first query-independent dataset 402 is indicative of one or more term-specific occurrences of the term 332 (Ta) "table" in the content associated with the document 306. For example, the first query-independent dataset 402 may indicate the presence of the term 332 (Ta) "table" in the title of the document 306. In another case, the first query-independent dataset 402 may indicate the presence of the term 332 (Ta) "table" in the body of the document 306. In a further case, the first query-independent dataset 402 may indicate the presence of the term 332 (Ta) "table" in the URL of the document 306. In yet another case, the first query-independent dataset 402 may indicate the position (or offset) of the term 332 (Ta) "table" in the content of the document 306. In a further case, the first query-independent dataset 402 may indicate the number of times the term 332 (Ta) "table" occurs in the content of the document 306.

[117] Similarly, the second query-independent data set 404 is indicative of one or more term-dependent occurrences of the term 334 (Tb) "border" in the content associated with the document 306. Similarly, the third query-independent data set 406 is indicative of one or more term-dependent occurrences of the term 336 (Tc) "css" in the content associated with the document 306. It should be noted that the nature.

[118] It should be noted that the query-independent data can be used to generate a query-dependent feature that is indicative of a group occurrence of a given term with one or more other terms from the query 304.

[119] For example, the operator may predetermine that one type of query-dependent feature that the generator 155 is to generate based on the query-independent data is the number of times a given term from the query 304 occurs together with the second term in the URL associated with the document 306. Thus, when generating the first feature vector 352 for the Da-Ta pair, the server 112 may determine that the term 332 (Ta) "table" does not occur together with the term 334 (Tb) "border" in the URL associated with the document 306 ("https://www.w3schools.com/css/css_table.asp"). In this case, the server 112 may set the value of this feature in the first feature vector 352 to "0".

[120] In another example, the operator may predetermine that another type of query-dependent feature that the generator 155 is to generate based on the query-independent data is the number of times a given term from the query 304 occurs together with the third term in the URL associated with the document 306. Thus, when generating the first feature vector 352 for the Da-Ta pair, the server 112 may determine that the term 332 (Ta) "table" occurs once together with the term 336 (Tc) "css" in the URL associated with the document 306 ("https://www.w3schools.com/css/css_table.asp"). In this case, the server 112 may set the value of this feature in the first feature vector 352 to "1".

[121] In a further example, the operator may predetermine that another type of query-dependent feature that the generator 155 is to generate based on the query-independent data is the number of times a given term from the query 304 occurs together with the first term in the URL associated with the document 306. Thus, when generating the third feature vector 356 for the Da-Tc pair, the server 112 may determine that the term 336 (Tc) "css" occurs twice together with the term 332 (Ta) "table" in the URL associated with the document 306 ("https://www.w3schools.com/css/css_table.asp"). In this case, the server 112 may set the value of this feature in the third feature vector 356 to "2".

[122] In yet another example, the operator may predetermine that another type of query-dependent feature that the generator 155 is to generate based on the query-independent data is the percentage of terms from the query 304 that occur together with a given term in the URL associated with the document 306. Thus, when generating the first feature vector 352 for the Da-Ta pair, the server 112 may determine that two of the three terms from the query 304 occur with the term 332 (Ta) "table" in the URL associated with the document 306 ("https://www.w3schools.com/css/css_table.asp"). In this case, the server 112 may set the value of this feature in the first feature vector 352 to "2/3". However, when generating the second feature vector 354 for the Da-Tb pair, the server 112 may determine that none of the terms from the query 304 occur with the term 334 (Tb) "border" in the URL associated with the document 306 ("https://www.w3schools.com/css/css_table.asp"), because the term 334 (Tb) "border" does not occur in this URL. In this case, the server 112 may set the value of this feature in the second feature vector 354 to "0".
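The URL-based examples above can be reproduced with a short Python sketch. It is one possible reading of those examples, not a definitive implementation: the function names `cooccurrence_count` and `cooccurring_fraction` and the `url_positions` mapping (term to token positions in the URL) are hypothetical, and the fraction is assumed to count the given term itself, which is what yields "2/3" for "table".

```python
from typing import Dict, List

# Query-independent input (assumed shape): token positions of each query term
# in the URL "https://www.w3schools.com/css/css_table.asp".
url_positions: Dict[str, List[int]] = {
    "table": [6],
    "border": [],
    "css": [4, 5],
}
query_terms = ["table", "border", "css"]


def cooccurrence_count(positions: Dict[str, List[int]], term: str, other: str) -> int:
    """Occurrences of `term` in the URL, counted only if `other` is also in the URL."""
    if not positions.get(other):
        return 0
    return len(positions.get(term, []))


def cooccurring_fraction(positions: Dict[str, List[int]], term: str,
                         terms: List[str]) -> float:
    """Share of query terms present in the URL together with `term`.

    Assumption: the term itself is counted; the value is 0 if `term` is absent.
    """
    if not positions.get(term):
        return 0.0
    present = sum(1 for t in terms if positions.get(t))
    return present / len(terms)


table_with_border = cooccurrence_count(url_positions, "table", "border")
table_with_css = cooccurrence_count(url_positions, "table", "css")
css_with_table = cooccurrence_count(url_positions, "css", "table")
table_fraction = cooccurring_fraction(url_positions, "table", query_terms)
border_fraction = cooccurring_fraction(url_positions, "border", query_terms)
```

Each value is computed only from per-term data that could have been stored before the query was known; the query supplies only the choice of which terms to combine.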

[123] It should be noted that, when generating feature vectors for respective DT pairs, the generator 155 can be configured to generate feature vectors of a predetermined size. For example, the size may be determined by the operator based on, for example, the number of query-dependent features that the operator considers useful for a particular implementation. However, it is contemplated that a given feature vector may contain, in addition to one or more query-dependent features of various types, query-independent features for the corresponding term. In one implementation of the present technology, the size of the feature vectors generated by the generator 155 is "5".

[124] Returning to the description of FIG. 3, the server 112 generates, for the training set 380, the first training feature vector 352, the second training feature vector 354, and the third training feature vector 356 as described above. The server 112 inputs the term embeddings 342, 344 and 346 and the training feature vectors 352, 354 and 356 into the NN being trained (the ranking feature generator 140). In some embodiments, the NN may comprise a plurality of parallel input layers and a plurality of fully connected layers.

[125] It is contemplated that the server 106 can be configured to concatenate the term embeddings with the corresponding training feature vectors. For example, the server 106 may be configured to concatenate the term embedding 342 with the training feature vector 352 to generate a concatenated training input. In the same example, the server 106 may be configured to concatenate the term embedding 344 with the training feature vector 354 to generate a concatenated training input. In the same example, the server 106 may be configured to concatenate the term embedding 346 with the training feature vector 356 to generate a concatenated training input. The server 106 can be configured to input the concatenated inputs into the NN so that the NN, in a sense, understands which term embeddings are associated with which training feature vectors.
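The pairing by concatenation described above can be sketched in Python as follows. The function name `concatenate_inputs` and all numeric values are hypothetical placeholders chosen for illustration; the embedding size of 4 is an assumption, while the feature vector size of 5 follows the implementation mentioned earlier.

```python
from typing import List, Sequence


def concatenate_inputs(term_embeddings: Sequence[Sequence[float]],
                       feature_vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Join the i-th term embedding with the i-th feature vector, one input per term."""
    if len(term_embeddings) != len(feature_vectors):
        raise ValueError("each term embedding needs exactly one feature vector")
    return [list(e) + list(f) for e, f in zip(term_embeddings, feature_vectors)]


# Toy values: three terms, embeddings of size 4 (assumed), feature vectors of size 5.
embeddings = [
    [0.1, 0.2, 0.3, 0.4],   # stands in for term embedding 342
    [0.5, 0.6, 0.7, 0.8],   # stands in for term embedding 344
    [0.9, 1.0, 1.1, 1.2],   # stands in for term embedding 346
]
features = [
    [0.0, 1.0, 2 / 3, 0.0, 1.0],  # stands in for feature vector 352
    [0.0, 0.0, 0.0, 0.0, 0.0],    # stands in for feature vector 354
    [2.0, 0.0, 2 / 3, 0.0, 1.0],  # stands in for feature vector 356
]
concatenated = concatenate_inputs(embeddings, features)
```

Because each concatenated input carries one embedding and its own feature vector side by side, the association between the two is preserved by position rather than by a separate lookup.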

[126] In response to the input data, the NN outputs a predicted ranking feature 360. The server 112 is configured to compare the predicted ranking feature 360 with the label 320. For example, the server 112 may apply a given penalty function when comparing the predicted ranking feature 360 with the label 320. The result of the comparison is then used by the server 112 to, in a sense, "adjust" the NN so that the NN generates predicted ranking features that are close to the corresponding labels. For example, the adjustment may be performed using one or more NN training techniques (for example, backpropagation). After a large number of iterations performed in a similar manner, the NN is trained to generate predicted ranking features that are predictions of the relevance of respective documents to respective queries.
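The compare-and-adjust loop described above can be illustrated with a deliberately simplified Python sketch. It is not the NN of the present technology: a single linear unit stands in for the network, squared error is assumed as the penalty function, and plain gradient descent stands in for backpropagation. The names `predict` and `train_step`, the learning rate and the toy input are all hypothetical.

```python
from typing import List, Tuple


def predict(weights: List[float], bias: float, x: List[float]) -> float:
    """Linear stand-in for the NN: maps an input vector to a predicted ranking feature."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_step(weights: List[float], bias: float, x: List[float],
               label: float, lr: float = 0.1) -> Tuple[List[float], float, float]:
    """One adjustment: compare the prediction with the label, then move toward it."""
    error = predict(weights, bias, x) - label
    loss = error ** 2                       # assumed penalty function
    # Gradients of the squared error with respect to each weight and the bias.
    new_weights = [w - lr * 2 * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * 2 * error
    return new_weights, new_bias, loss


weights, bias = [0.0, 0.0, 0.0], 0.0
x, label = [1.0, 0.0, 0.5], 1.0            # toy training input and label
losses = []
for _ in range(50):
    weights, bias, loss = train_step(weights, bias, x, label)
    losses.append(loss)
final_prediction = predict(weights, bias, x)
```

With these toy values the penalty shrinks at every iteration and the prediction approaches the label, which is the behaviour the paragraph describes for the far larger NN.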

[127] How the server 112 is configured to use the ranking feature generator 140 and the dynamic feature generator 155 in real time for a current query submitted by the user 102 will now be described with reference to FIG. 5. There is depicted a representation 500 of how the server 112 is configured to generate a predicted ranking feature 540 for a given DQ pair.

[128] Suppose that the server 112 receives a current (in-use) query 502, submitted by the user 102, from the electronic device 104 via the request 180. It can be said that once the query 502 is received by the server 112, the server 112 is configured to use the search engine 160 in real time to generate a given SERP. The query 502 (Qw) has a term 506 (Tx), a term 507 (Ty), and a term 508 (Tz).

[129] The server 112 is configured to determine a plurality of potentially relevant documents 504 for the in-use query 502. In some embodiments, the server 112 may use one or more computer-implemented procedures to determine a plurality of documents that are potentially relevant to the in-use query 502. It is contemplated that the server 112 may use techniques known in the art to identify the documents to be included in the plurality of potentially relevant documents 504.

[130] The server 112 is configured to generate predicted ranking features for respective ones of the plurality of potentially relevant documents 504. In one example, the server 112 can be configured to generate the predicted ranking feature 540 for a pair including the query 502 (Qw) and a document 505 (Dw) from the plurality of potentially relevant documents 504.

[131] The server 112 is configured to access the inverted index 170 to retrieve query-independent data 510. The query-independent data 510 comprises a first query-independent data set 512 for the Dw-Tx pair, a second query-independent data set 514 for the Dw-Ty pair, and a third query-independent data set 516 for the Dw-Tz pair.

[132] In some embodiments, it can be said that the query-independent data is "term-dependent" data. It should be noted that the first query-independent data set 512 for the Dw-Tx pair is indicative of a term-dependent occurrence of the term 506 (Tx) in the content of the document 505 (Dw), the second query-independent data set 514 for the Dw-Ty pair is indicative of a term-dependent occurrence of the term 507 (Ty) in the content of the document 505 (Dw), and the third query-independent data set 516 for the Dw-Tz pair is indicative of a term-dependent occurrence of the term 508 (Tz) in the content of the document 505 (Dw).

[133] In some embodiments, the query-independent data 510 may further comprise content-based query-independent (term-dependent) data indicative of a textual context of the respective term in the content associated with a given document. For example, the textual context may include the preceding term and the following term for the respective term in the content associated with the given document.

[134] The server 112 is configured to input the query-independent data 510 into the dynamic feature generator 155 to generate a plurality of feature vectors comprising a first feature vector 532, a second feature vector 534, and a third feature vector 536. As explained above, each of the first feature vector 532, the second feature vector 534, and the third feature vector 536 comprises at least one query-dependent feature indicative of a group occurrence of the respective term with one or more other terms from the query 502.

[135] In at least some embodiments of the present technology, it can be said that the query-independent data is stored in the inverted index 170 before the query 502 is received from the electronic device 104, whereas the query-dependent feature is generated after the query 502 is received from the electronic device 104. It can also be said that the query-dependent feature is generated using the query-independent data in real time, during the document ranking procedure of the search engine 160. It should be noted that data indicative of the query-dependent feature could not have been stored in the inverted index 170 for a given DT pair before the query 502 was received, because the feature depends on one or more other terms from the query 502.

[136] The server 112 is also configured to generate term embeddings 518, 520 and 522 for the terms 506, 507 and 508, respectively. It can be said that the server 112 can be configured to generate an in-use set 550 for input into the ranking feature generator 140 (the trained NN), where the in-use set 550 comprises the term embeddings 518, 520 and 522 and the feature vectors 532, 534 and 536.

[137] The server 112 is configured to input the in-use set 550 into the ranking feature generator 140, which, in response, generates the predicted ranking feature 540 for the pair of the document 505 and the query 502. It is contemplated that the server 106 can be configured to generate concatenated inputs by concatenating the term embedding 518 with the feature vector 532, the term embedding 520 with the feature vector 534, and the term embedding 522 with the feature vector 536, and to input the concatenated inputs into the ranking feature generator 140.

[138] As mentioned above, the server 112 may be configured to use the predicted ranking feature 540 to rank the document 505 among the plurality of potentially relevant documents 504 in response to the query 502. With reference to FIG. 6, there is depicted a representation 600 of how the server 112 is configured to use the ranking model 130 (for example, a decision-tree-based MLA) to generate a ranked list 680 of documents for the user 102 in response to the query 502. As can be seen, the server 112 can input query data 602 into the ranking model 130. For example, the server 112 may access the database subsystem 150 and retrieve information associated with the query 502, such as past interaction data associated with the query 502.

[139] The server 112 may also input, for the document 505, document data 604 and the predicted ranking feature 540. For example, the server 112 may be configured to access the database subsystem 150 and retrieve information associated with the document 505. For example, the document data 604 may comprise content-based information and/or past interaction data associated with the document 505. Similarly, the server 112 may also input, for a document 509 (another one of the plurality of potentially relevant documents 504), document data 606 and a predicted ranking feature 620. The server 112 may be configured to generate the predicted ranking feature 620 for the pair of the document 509 and the query 502 in a manner similar to how the server 112 is configured to generate the predicted ranking feature 540 for the pair of the document 505 and the query 502. The ranking model 130 is configured to generate, as output, the ranked list 680 of documents ranked based on their relevance to the query 502. The server 112 can be configured to use the ranked list 680 of documents to generate a SERP and to transmit data indicative thereof, via the response 190, to the electronic device 104 for display to the user 102.
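The final ranking step can be sketched in Python as follows. The sketch only shows how a predicted ranking feature can enter a ranking function alongside other per-document data; the weighted sum in `toy_score` is a hypothetical stand-in for the ranking model 130 (which the description says may be a decision-tree-based MLA), and the field names and numeric values are assumptions.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, float]


def rank_documents(candidates: Dict[str, Candidate],
                   score: Callable[[Candidate], float]) -> List[str]:
    """Return document ids ordered from most to least relevant according to `score`."""
    return sorted(candidates, key=lambda doc_id: score(candidates[doc_id]), reverse=True)


def toy_score(candidate: Candidate) -> float:
    # Hypothetical stand-in for the ranking model: a fixed weighted sum.
    return (0.7 * candidate["predicted_ranking_feature"]
            + 0.3 * candidate["past_interaction"])


candidates = {
    "document_505": {"predicted_ranking_feature": 0.9, "past_interaction": 0.2},
    "document_509": {"predicted_ranking_feature": 0.4, "past_interaction": 0.5},
}
ranked_list = rank_documents(candidates, toy_score)
```

In this toy setup the document with the higher predicted ranking feature is placed first, and swapping `toy_score` for a trained model would leave `rank_documents` unchanged.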

[140] FIG. 7 schematically depicts a method 700 of ranking digital documents in response to a query. For example, the server 112 may be configured to execute the method 700, the steps of which will now be discussed in greater detail.

STEP 702: for a given document, accessing the inverted index to retrieve query-independent data for a first document-term (DT) pair and a second DT pair

[141] The method 700 begins at step 702, at which the server 112 is configured to access the inverted index 170 to retrieve query-independent data for a first DT pair and a second DT pair. For example, suppose that the query comprises a first term and a second term. The first DT pair has the given document and the first term, and the second DT pair has the given document and the second term.

[142] As explained above, the query-independent data retrieved by the server 112 is indicative of (i) a term-dependent occurrence of the first term in the content associated with the given document, and (ii) a term-dependent occurrence of the second term in the content associated with the given document.

[143] In some embodiments of the present technology, the server 112 may be configured to retrieve query-independent data for a term that is similar to one of the first and second terms. For example, the server 112 may be configured to determine, for a given one of the plurality of terms from the in-use query, a similar term, and, when accessing the inverted index 170 to retrieve the query-independent data, the server 112 may be configured to retrieve query-independent data for a third DT pair, where the third DT pair has the given document and the similar term.

[144] In further embodiments, it is contemplated that the server 112 can access the inverted index 170 to retrieve content-based query-independent data associated with the first DT pair and the second DT pair. The content-based query-independent data may be indicative of a textual context of the respective term in the content associated with the given document. Additionally or alternatively, the database subsystem 150 may store a forward index in addition to the inverted index 170, and the server 112 may use the forward index to retrieve the content-based query-independent data (the textual context of terms in the document).

[145] It is contemplated that the term-dependent occurrence of a given term may comprise at least one of: (i) one or more positions of the first term in a title associated with the given document, (ii) one or more positions of the first term in a URL associated with the given document, and (iii) one or more positions of the first term in a body of the given document.
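
One possible layout for such term-dependent occurrence data is sketched below. It is an illustration under stated assumptions: the `TermOccurrence` class, the `tokenize` helper, the zone names, and the dictionary keyed by (document id, term) are hypothetical and are not taken from the description of the inverted index 170.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ZONES = ("title", "url", "body")


@dataclass
class TermOccurrence:
    """Query-independent data for one DT pair: token positions per zone."""
    positions: Dict[str, List[int]] = field(
        default_factory=lambda: {zone: [] for zone in ZONES}
    )


def tokenize(text: str) -> List[str]:
    # Simplistic tokenizer (an assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(
    docs: Dict[str, Dict[str, str]]
) -> Dict[Tuple[str, str], TermOccurrence]:
    """Record, for every (document, term) pair, where the term occurs in each zone."""
    index: Dict[Tuple[str, str], TermOccurrence] = {}
    for doc_id, zones in docs.items():
        for zone in ZONES:
            for pos, term in enumerate(tokenize(zones.get(zone, ""))):
                entry = index.setdefault((doc_id, term), TermOccurrence())
                entry.positions[zone].append(pos)
    return index


docs = {
    "Dw": {
        "title": "Fast ranking",
        "url": "example.com/ranking",
        "body": "Ranking documents with fast ranking features",
    }
}
index = build_inverted_index(docs)
```

Because each entry depends only on one document and one term, it can be computed and stored before any query arrives.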

STEP 704: for the given document, generating a query-dependent feature using the query-independent data retrieved for the first DT pair and the second DT pair

[146] The method 700 proceeds to step 704, at which the server 112 is configured to generate a query-dependent feature using the query-independent data retrieved for the first DT pair and the second DT pair, where the query-dependent feature is indicative of a group occurrence of the first term with the second term in the content associated with the given document.

[147] In one non-limiting example, shown in FIG. 5, the server 112 may be configured to access the inverted index 170 to retrieve query-independent data 510 (e.g., at step 702). The query-independent data 510 comprises a first query-independent data set 512 for the Dw-Tx pair, a second query-independent data set 514 for the Dw-Ty pair, and a third query-independent data set 516 for the Dw-Tz pair.

[148] In some embodiments, the query-independent data may be said to be "term-dependent" data. It should be noted that the first query-independent data set 512 for the Dw-Tx pair is indicative of a term-dependent occurrence of the term 506 (Tx) in the content of the document 505 (Dw), the second query-independent data set 514 for the Dw-Ty pair is indicative of a term-dependent occurrence of the term 507 (Ty) in the content of the document 505 (Dw), and the third query-independent data set 516 for the Dw-Tz pair is indicative of a term-dependent occurrence of the term 508 (Tz) in the content of the document 505 (Dw).

[149] As part of step 704, the server 112 may be configured to input the query-independent data 510 into the dynamic feature generator 155 to generate a plurality of feature vectors comprising a first feature vector 532, a second feature vector 534, and a third feature vector 536. As explained above, each of the first feature vector 532, the second feature vector 534, and the third feature vector 536 comprises at least one query-dependent feature indicative of a group occurrence of the respective term with one or more other terms from the query 502.
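
A rough sketch of how one feature vector per query term might be assembled from per-term data is given below. The three features chosen (own occurrence count, number of other query terms present alongside the term, smallest offset to any other query term), the value -1.0 used when no offset exists, and the function name `feature_vectors` are assumptions for illustration; the description does not specify the internals of the dynamic feature generator 155.

```python
from typing import Dict, List


def feature_vectors(body_positions: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Build one feature vector per query term from per-term body positions."""
    vectors: Dict[str, List[float]] = {}
    for term, own in body_positions.items():
        others = {t: p for t, p in body_positions.items() if t != term}
        # Number of other query terms that occur in the body together with this term.
        co_present = sum(1 for p in others.values() if own and p)
        # Distances between this term and every occurrence of every other query term.
        offsets = [abs(a - b) for p in others.values() for a in own for b in p]
        min_offset = float(min(offsets)) if offsets else -1.0  # -1.0: no offset (assumed)
        vectors[term] = [float(len(own)), float(co_present), min_offset]
    return vectors


# Body positions of Tx, Ty and Tz in document Dw (made-up values).
positions = {"Tx": [3, 17], "Ty": [5], "Tz": []}
vectors = feature_vectors(positions)
```

Each vector mixes query-independent input (the stored positions) with information that only exists once the other query terms are known.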

[150] In at least some embodiments of the present technology, it can be said that the query-independent data is stored in the inverted index 170 before the query 502 is received from the electronic device 104, whereas the query-dependent feature is generated after the query 502 is received from the electronic device 104. It can also be said that the query-dependent feature is generated from the query-independent data in real time, during the document ranking procedure of the search engine 160. It should be noted that data indicative of the query-dependent feature could not have been stored in the inverted index 170 for a given DT pair before the query 502 was received, because the feature depends on one or more other terms from the query 502.

[151] In at least some embodiments of the present technology, the group occurrence of the first term with the second term in the content associated with the given document comprises at least one of: (i) the number of times the second term from the query is included, in addition to the first term, in the title associated with the given document, (ii) the number of times the second term from the query is included, in addition to the first term, in the URL associated with the given document, (iii) the number of times the second term from the query is included, in addition to the first term, in the body of the given document, and (iv) a positional offset between the first term and the second term in the body of the given document.
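
The four quantities listed above can be derived at query time from per-term position lists, for instance as in the sketch below. Two interpretations are assumptions made here and are not stated in the description: the per-zone count is taken as the number of occurrences of the second term in a zone that also contains the first term, and the positional offset is taken as the smallest absolute distance between the two terms in the body. The function name `group_occurrence` and the feature names are hypothetical.

```python
from typing import Dict, List, Optional

Occurrence = Dict[str, List[int]]  # zone name -> positions of one term in that zone


def group_occurrence(first: Occurrence, second: Occurrence) -> Dict[str, Optional[int]]:
    """Derive query-dependent features from two query-independent data sets."""
    features: Dict[str, Optional[int]] = {}
    for zone in ("title", "url", "body"):
        first_pos = first.get(zone, [])
        second_pos = second.get(zone, [])
        # Assumed reading: count the second term only where the first term is also present.
        features[f"{zone}_cooccurrence"] = len(second_pos) if first_pos else 0
    body_first = first.get("body", [])
    body_second = second.get("body", [])
    if body_first and body_second:
        # Assumed reading: positional offset = minimum absolute distance in the body.
        features["body_min_offset"] = min(
            abs(a - b) for a in body_first for b in body_second
        )
    else:
        features["body_min_offset"] = None
    return features


tx = {"title": [0], "url": [], "body": [3, 17]}   # data for the Dw-Tx pair (made up)
ty = {"title": [1], "url": [2], "body": [5, 40]}  # data for the Dw-Ty pair (made up)
feats = group_occurrence(tx, ty)
```

Nothing in `feats` could have been precomputed for the Dw-Tx pair alone, since every value depends on which second term the query happens to contain.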

STEP 706: for the given document, generating a ranking feature for the given document based on at least the first term

[152] The method 700 proceeds to step 706, at which the server 112 is configured to generate a ranking feature for the given document based on at least the first term, the second term, and the query-dependent feature. For example, the server 112 may be configured to use the ranking feature generator 140 to generate the ranking feature for the given document.

[153] In some embodiments of the present technology, the ranking feature generator 140 may be implemented as a given NN.

STEP 708: ranking the given document from the plurality of potentially relevant documents based on at least the ranking feature

[154] The method 700 proceeds to step 708, at which the server 112 is configured to rank the given document from the plurality of potentially relevant documents based on at least the ranking feature determined at step 706. For example, the server 112 may be configured to execute the ranking model 130.

[155] In at least some embodiments of the present technology, the ranking model 130 may be implemented as a given decision tree-based MLA. Broadly speaking, a given decision tree-based MLA is a machine learning model having one or more "decision trees" that are used to go from observations about an object (represented in the branches) to conclusions about the object's target value (represented in the leaves). In one non-limiting implementation of the present technology, the decision tree-based MLA may be implemented in accordance with the CatBoost framework.

[156] It is contemplated that the server 112 may be configured to rank the given document using a decision tree-based MLA configured to rank the plurality of potentially relevant documents based on their relevance to the query.
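
To make the "branches to leaves" idea concrete, the sketch below scores documents with a tiny hand-written ensemble of decision trees and sorts them by score. It is not CatBoost and not the ranking model 130: the `Leaf` and `Split` classes, the feature names, the thresholds, and the leaf values are all invented for illustration, whereas a trained model would learn them from labeled data.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Leaf:
    value: float  # conclusion about the target value


@dataclass
class Split:
    feature: str       # observation tested at this branch
    threshold: float
    left: "Node"       # followed when the feature value <= threshold
    right: "Node"      # followed otherwise


Node = Union[Leaf, Split]


def predict_tree(node: Node, features: Dict[str, float]) -> float:
    """Walk from the root through the branches until a leaf is reached."""
    while isinstance(node, Split):
        node = node.left if features[node.feature] <= node.threshold else node.right
    return node.value


def rank(docs: Dict[str, Dict[str, float]], trees: List[Node]) -> List[str]:
    """Order document ids by the summed output of all trees, highest first."""
    scores = {d: sum(predict_tree(t, f) for t in trees) for d, f in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


trees = [
    Split("ranking_feature", 0.5, Leaf(0.1), Leaf(0.9)),
    Split("body_min_offset", 3.0, Leaf(0.4), Leaf(0.0)),
]
docs = {
    "D1": {"ranking_feature": 0.8, "body_min_offset": 2.0},
    "D2": {"ranking_feature": 0.3, "body_min_offset": 10.0},
    "D3": {"ranking_feature": 0.7, "body_min_offset": 8.0},
}
ranked = rank(docs, trees)
```

In this toy setup the ranking feature from step 706 is just one input among others, which matches the "based on at least the ranking feature" wording of step 708.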

[157] It should be expressly understood that not all technical effects mentioned herein need to be achieved in each and every embodiment of the present technology. For example, embodiments of the present technology may be implemented without achieving some of these technical effects, while other embodiments may be implemented with other technical effects or with none at all.

[158] Some of the above steps, as well as the sending and receiving of signals, are well known in the art and, as such, have been omitted from certain portions of this description for the sake of simplicity. Signals may be sent and received using optical means (e.g., a fiber-optic connection), electronic means (e.g., a wired or wireless connection), and mechanical means (e.g., means based on pressure, on temperature, or on any other suitable physical parameter).

Modifications and improvements to the above-described implementations of the present technology may become apparent to those skilled in the art. The foregoing description is intended to be exemplary rather than limiting. The scope of the present technology is therefore intended to be limited solely by the scope of the appended claims.

Claims

1. A method for ranking digital documents in response to a query, wherein the digital documents are potentially relevant to the query, the query having a first term and a second term and being submitted by a user of an electronic device communicatively coupled to a server hosting a search engine, the search engine being associated with an inverted index storing information associated with document-term (DT) pairs, the method being executable by the server, the method comprising:

for a given document from a set of potentially relevant documents:

accessing, by the server, the inverted index to retrieve query-independent data for a first DT pair and a second DT pair, where the first DT pair has the given document and the first term, and the second DT pair has the given document and the second term,

wherein the query-independent data indicates (i) a term-dependent occurrence of the first term in the content associated with the given document, and (ii) a term-dependent occurrence of the second term in the content associated with the given document;

generating, by the server, a query-dependent feature using the query-independent data retrieved for the first DT pair and the second DT pair,

wherein the query-dependent feature indicates a group occurrence of the first term with the second term in the content associated with the given document;

generating by the server a ranking feature for the given document based on at least the first term, the second term, and the query-dependent feature; and

ranking, by the server, the given document from the set of potentially relevant documents based on at least the ranking feature.

2. The method of claim 1, wherein the ranking feature for a given document is generated by a neural network (NN).

3. The method of claim 2, wherein the method further comprises training, by the server, the NN to generate the ranking feature, wherein training the NN comprises:

generating, by the server, a training set for a training document-query (DQ) pair to be used during a given training iteration of the NN, wherein the training DQ pair has a training query and a training document, the training document is associated with a label, the label indicates the relevance of the training document to the training query, and the generating comprises:

generating by the server a plurality of training term embeddings based on the corresponding terms from the training query;

accessing, by the server, the inverted index associated with the search engine to retrieve a plurality of query-independent data sets associated with respective ones of a plurality of training DT pairs,

wherein a given one of the plurality of training DT pairs includes the training document and a respective one of the plurality of terms from the training query;

generating, by the server, a plurality of training feature vectors for the plurality of training DT pairs using the plurality of query-independent data sets;

during a given iteration of NN training:

inputting, by the server, the plurality of training term embeddings and the plurality of training feature vectors into the NN to generate a predicted ranking feature for the training DQ pair; and

adjusting, by the server, the NN based on a comparison between the label and the predicted ranking feature, so that the NN generates, for a given in-use DQ pair, a respective predicted ranking feature indicating the relevance of the respective in-use document to the respective in-use query.

4. The method of claim 1, wherein the query-independent data was stored in the inverted index prior to receiving a query from the electronic device, and wherein the query-dependent feature is generated after receiving a query from the electronic device.

5. The method of claim 1, wherein the query-dependent feature is generated using real-time query-independent data during a search engine document ranking procedure.

6. The method of claim 1, wherein the ranking is performed by a decision tree-based machine learning algorithm (MLA) configured to rank a set of potentially relevant documents based on their relevance to a query.

7. The method of claim 1, wherein the method further comprises, for a given one of the plurality of terms, determining a similar term, and wherein:

when accessing the inverted index to retrieve query-independent data, the retrieved query-independent data comprises query-independent data for a third DT pair, the third DT pair having the given document and the similar term.

8. The method of claim 1, wherein accessing the inverted index further comprises retrieving content-based query-independent data associated with the first DT pair and the second DT pair,

wherein the content-based query-independent data indicates the textual context of the respective term within the content associated with the given document.

9. The method of claim 1, wherein the term-dependent occurrence of the first term comprises at least one of:

one or more positions of the first term in the title associated with the given document;

one or more positions of the first term in the URL associated with the given document; and

one or more positions of the first term in the body of the given document.

10. The method of claim 1, wherein the group occurrence of the first term with the second term in the content associated with the given document comprises at least one of:

the number of times the second term from the query is included, in addition to the first term, in the title associated with the given document;

the number of times the second term from the query is included, in addition to the first term, in the URL associated with the given document; and

the number of times the second term from the query is included in addition to the first term in the body of the given document.

11. A server for ranking digital documents in response to a query, wherein the digital documents are potentially relevant to the query, the query having a first term and a second term and being submitted by a user of an electronic device communicatively coupled to the server, the server hosting a search engine, wherein the search engine is associated with an inverted index storing information associated with document-term (DT) pairs, and the server is configured to:

for a given document from a set of potentially relevant documents:

access the inverted index to retrieve query-independent data for the first DT pair and the second DT pair, where the first DT pair has the given document and the first term, the second DT pair has the given document and the second term,

generate a query-dependent feature using the query-independent data retrieved for the first DT pair and the second DT pair,

generate a ranking feature for the given document based on at least the first term, the second term, and the query-dependent feature; and

rank the given document from the set of potentially relevant documents based on at least the ranking feature.

12. The server of claim 11, wherein the server uses a neural network to generate a ranking feature for a given document.

13. The server of claim 12, wherein the server is further configured to train the NN to generate the ranking feature, and wherein, to train the NN, the server is configured to:

generate a training set for a training document-query (DQ) pair to be used during a given training iteration of the NN, where the training DQ pair has a training query and a training document, where the training document is associated with a label, where the label indicates the relevance of the training document to the training query, and where, to generate the training set, the server is configured to:

generate a plurality of training term embeddings based on the corresponding terms from the training query;

access the inverted index associated with the search engine to retrieve a set of query-independent datasets associated with the corresponding pairs from the set of training pairs DT,

generate a plurality of training feature vectors for the plurality of training pairs DT using the plurality of query-independent datasets;

during a given iteration of NN training:

input the plurality of training term embeddings and the plurality of training feature vectors into the NN to generate a predicted ranking feature for the training DQ pair; and

adjust the NN based on a comparison between the label and the predicted ranking feature, so that the NN generates, for a given in-use DQ pair, a respective predicted ranking feature indicating the relevance of the respective in-use document to the respective in-use query.

14. The server of claim 11, wherein the query-independent data was stored in the inverted index prior to receiving a query from the electronic device, and wherein the query-dependent feature is generated after receiving a query from the electronic device.

15. The server of claim 11, wherein the query-dependent feature is generated using real-time query-independent data during a search engine document ranking procedure.

16. The server of claim 11, wherein the server is configured to perform the ranking using a decision tree-based machine learning algorithm (MLA) configured to rank the set of potentially relevant documents based on their relevance to the query.

17. The server of claim 11, wherein the server is further configured to determine a similar term for a given one of the plurality of terms, and wherein:

18. The server of claim 11, wherein the server is configured to access the inverted index to further retrieve query-independent data based on the content associated with the first DT pair and the second DT pair,

wherein the content-based query-independent data indicates the textual context of the respective term within the content associated with the given document.

19. The server of claim 11, wherein the term-dependent occurrence of the first term includes at least one of:

one or more positions of the first term in the body of the given document.

20. The server of claim 11, wherein the group occurrence of the first term with the second term in the content associated with the document contains at least one of: