CN113535958B

CN113535958B - Production line aggregation method, device and system, electronic equipment and medium

Info

Publication number: CN113535958B
Application number: CN202110859746.8A
Authority: CN
Inventors: 万志文; 雷谦; 姚后清; 施鹏
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2023-08-08
Anticipated expiration: 2041-07-28
Also published as: CN113535958A

Abstract

The disclosure provides a production cue aggregation method, apparatus, and system, an electronic device, a computer-readable storage medium, and a computer program product, and relates to the field of computers, in particular to the technical field of intelligent search. The implementation scheme is as follows: obtaining production cues from search logs; classifying the production cues based on category information to obtain one or more first production cue groups; for each of the one or more first production cue groups, performing the following aggregation operations: obtaining a vector of each production cue in the first production cue group; determining, based on the vectors, a first number of production cues whose similarity to each production cue in the first production cue group is greater than a first threshold; and aggregating each production cue with the determined first number of production cues to obtain one or more clusters.

Description

Production cue aggregation method, device and system, electronic equipment and medium

Technical Field

The present disclosure relates to the field of computers, in particular to the field of intelligent search, and more specifically to a production cue aggregation method, apparatus, system, electronic device, computer-readable storage medium, and computer program product.

Background

Existing knowledge search engines provide users with a simple and dependable way to acquire information. Meanwhile, the demand for knowledge content is continuously updated and iterated. Mining production cues that reflect what users search for, and carrying out directional production based on them, can better meet the users' demand for new knowledge content. However, the production cues mined from search logs are large in number, and some production cues express the same deep semantics. Therefore, how to quickly aggregate production cues semantically so as to reduce repeated production has become a problem to be solved.

Disclosure of Invention

The present disclosure provides a production cue aggregation method, apparatus, system, electronic device, computer-readable storage medium, and computer program product.

According to an aspect of the present disclosure, there is provided a production cue aggregation method, including: obtaining production cues from search logs; classifying the production cues based on category information to obtain one or more first production cue groups; and, for each of the one or more first production cue groups: obtaining a vector of each production cue in the first production cue group; determining, based on the vectors, a first number of production cues whose similarity to each production cue in the first production cue group is greater than a first threshold; and aggregating each production cue with the determined first number of production cues to obtain one or more clusters.

According to another aspect of the present disclosure, there is provided a production cue aggregation apparatus including: an acquisition unit configured to obtain production cues from search logs; a classification unit configured to classify the production cues based on category information to obtain one or more first production cue groups; and one or more aggregation units, wherein each of the aggregation units is configured to perform the following operations: obtaining a vector of each production cue in the first production cue group; determining, based on the vectors, a first number of production cues whose similarity to each production cue in the first production cue group is greater than a first threshold; and aggregating each production cue with the determined first number of production cues to obtain one or more clusters.

According to another aspect of the present disclosure, there is provided a production cue aggregation system, comprising: a first apparatus configured to obtain production cues from search logs and to classify the production cues based on category information to obtain one or more first production cue groups; and one or more second apparatuses, wherein each of the second apparatuses is configured to perform the following operations: obtaining a vector of each production cue in the first production cue group; determining, based on the vectors, a first number of production cues whose similarity to each production cue in the first production cue group is greater than a first threshold; and aggregating each production cue with the determined first number of production cues to obtain one or more clusters.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods described in the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described in the present disclosure.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method described in the present disclosure.

According to one or more embodiments of the present disclosure, the production cues are classified based on category information, so that the production cues within each resulting group are similar to one another. Recalling by vector similarity greatly reduces the probability of missed recalls, so that the time needed to aggregate tens of millions of production cues can be kept within an acceptable range and the aggregation efficiency is improved.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;

FIG. 2 illustrates a flow chart of a production cue aggregation method according to an embodiment of the present disclosure;

FIG. 3 illustrates a flow chart for classifying production cues based on category information according to an embodiment of the present disclosure;

FIG. 4 illustrates a flow chart for aggregating production cues according to an embodiment of the present disclosure;

FIG. 5 shows a block diagram of a production cue aggregation apparatus according to an embodiment of the present disclosure;

FIG. 6 illustrates a block diagram of a production cue aggregation system according to an embodiment of the present disclosure; and

FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.

The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the production cue aggregation method.

In some embodiments, server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

The user may use client devices 101, 102, 103, 104, 105, and/or 106 to enter query content and obtain search knowledge. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., Google Chrome OS); or include various mobile operating systems such as Microsoft Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include various handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.

In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.

In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store data such as search logs, production cues, and the like. The databases 130 may reside in a variety of locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The databases 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

Knowledge-based products such as Baidu Baike (encyclopedia), Baidu Zhidao, Baidu Jingyan (experience), Baidu Wenku (library), and Baobao Zhidao provide various types of knowledge content in response to user searches. The user can enter a search term or search sentence, for example "what is baby eczema?", to obtain a corresponding reply page and thereby the knowledge content corresponding to the retrieval information. Meanwhile, the demand for knowledge content is continuously updated and iterated. Mining the search terms and/or search sentences (queries) that reflect what users search for, using them as production cues, and carrying out directional production based on those cues can better meet the users' demand for new knowledge content. However, the production cues obtained by mining the search logs are large in number, and some search terms and/or search sentences express the same deep semantics; for example, "what to do about baby eczema?" and "how is infantile eczema treated?" have the same meaning. How to quickly aggregate production cues semantically so as to reduce repeated production has become a problem to be solved.

Thus, embodiments according to the present disclosure provide a production cue aggregation method. Fig. 2 illustrates a flow chart of a production cue aggregation method 200 according to an embodiment of the present disclosure. As shown in fig. 2, the method 200 includes: obtaining production cues from search logs (step 210); classifying the production cues based on category information to obtain one or more first production cue groups (step 220); and, for each of the one or more first production cue groups, performing the following aggregation operations: obtaining a vector of each production cue in the first production cue group (step 230); determining, based on the vectors, a first number of production cues whose similarity to each production cue in the first production cue group is greater than a first threshold (step 240); and aggregating each production cue with the determined first number of production cues to obtain one or more clusters (step 250).

According to the embodiments of the disclosure, the production cues are classified based on category information, so that the production cues within each resulting group are similar to one another. Recalling by vector similarity greatly reduces the probability of missed recalls, so that the time needed to aggregate tens of millions of production cues can be kept within an acceptable range and the aggregation efficiency is improved.

In some embodiments, the search logs may be users' search histories obtained by a search engine. To improve search accuracy, a search engine typically keeps each user's search history, which includes, among other things, the user's query content, query time, query IP, operating system, and browser information. The search history may also record the search results that the user clicked and viewed for each query, including display page position information, product line identification, and the like. Further, the query content (search query) may be one or more sentences, one or more questions, one or more nouns, one or more symbols, etc. The search result corresponding to the query content may be paraphrased content, such as an explanation of a noun or symbol; it may also be content produced by further processing the query content, for example, the query content is "tomato scrambled eggs" and the search result is "how to make tomato scrambled eggs?".

In some embodiments, a production cue is query content of a user in the search log that meets the production needs of a product line business. For example, "how to do ...?" type questions are query content that the Zhidao or Jingyan (experience) business lines can produce, and can therefore serve as production cues for those business lines.
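The mining step can be pictured with a minimal sketch. The record fields, the how-to prefixes, and the function name below are all hypothetical; the disclosure does not specify how a query is judged to meet a business line's production needs, and a real system would more likely rely on a trained model than on prefix matching.

```python
from dataclasses import dataclass


@dataclass
class SearchLogRecord:
    # Hypothetical fields; the disclosure only lists the kinds of data a log may hold.
    query: str
    query_time: str
    product_line_id: str


# Assumed stand-in for "meets the production needs of a business line".
HOW_TO_PREFIXES = ("how to ", "how do ", "what to do ")


def mine_production_cues(records):
    """Return de-duplicated, normalized queries that look like how-to questions."""
    seen = set()
    cues = []
    for record in records:
        query = record.query.strip().lower()
        if query.startswith(HOW_TO_PREFIXES) and query not in seen:
            seen.add(query)
            cues.append(query)
    return cues


logs = [
    SearchLogRecord("How to treat baby eczema", "2021-07-01T10:00", "zhidao"),
    SearchLogRecord("great wall", "2021-07-01T10:01", "baike"),
    SearchLogRecord("how to treat baby eczema", "2021-07-01T10:02", "zhidao"),
]
cues = mine_production_cues(logs)
```

Exact-text de-duplication at this stage only removes literal repeats; cues that differ in wording but share a meaning are what the later aggregation steps address.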

In some embodiments, the production cues may be query content and the search results corresponding to the query content. For example, the query content is "tomato scrambled eggs recipe" and the search result is "how to make tomato scrambled eggs?"; both the query content and the search result may be used as production cues.

Alternatively, a production cue may also be at least one cue selected from a plurality of initial production cues that meet the production needs of the product line business. Illustratively, the initial production cues are the query content and/or search results in the search log that preliminarily meet the business requirements, and the production cues can be obtained by processing (e.g., screening) the initial production cues.

In some examples, all query content (i.e., search queries) in the search log that meets the product line business requirements may be mined to obtain the production cues.

Illustratively, a product line is a business line that produces (or processes) a production cue, e.g., a knowledge business line, an encyclopedia business line. The product line provides the processed content (e.g., landing pages) to a search engine to enable the search engine to search for the processed content.

According to some embodiments, the category information of the production cue may include a first category and a second category. Illustratively, as shown in FIG. 3, classifying the production cues based on category information (step 220) may include: classifying the production cues based on the first category to obtain one or more second production cue groups (step 310); determining whether the number of production cues in each of the second production cue groups is not greater than a second threshold (step 320); in response to determining that the number of production cues in at least one second production cue group is greater than the second threshold (step 320, "no"), further classifying the production cues in that at least one second production cue group based on the second category to obtain one or more first production cue groups (step 330); otherwise (step 320, "yes"), taking the one or more second production cue groups as the one or more first production cue groups (step 340).

In some examples, the second category may be finer-grained category information than the first category. After the production cues in an oversized second production cue group are classified based on the second category, the groups obtained by this further classification, together with the second production cue groups whose number of production cues is not greater than the second threshold, are used as the first production cue groups.
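The two-stage flow of FIG. 3 can be sketched as follows. This is a minimal illustration under assumptions: each cue is represented as a dictionary with hypothetical keys `first_category` and `second_category`, and an oversized group is simply split into one group per second category (merging small second categories up to the threshold is a separate refinement).

```python
from collections import defaultdict


def classify_two_stage(cues, second_threshold):
    """Group cues by first category, splitting oversized groups by second category."""
    # Step 310: one second production cue group per first category.
    second_groups = defaultdict(list)
    for cue in cues:
        second_groups[cue["first_category"]].append(cue)

    first_groups = []
    for group in second_groups.values():
        # Step 320 "yes" -> step 340: small enough, keep the group unchanged.
        if len(group) <= second_threshold:
            first_groups.append(group)
            continue
        # Step 320 "no" -> step 330: split further by the finer second category.
        by_second = defaultdict(list)
        for cue in group:
            by_second[cue["second_category"]].append(cue)
        first_groups.extend(by_second.values())
    return first_groups


# Hypothetical sample data; category names are illustrative only.
sample = [
    {"text": "solve quadratic equation", "first_category": "K12", "second_category": "math"},
    {"text": "factor a polynomial", "first_category": "K12", "second_category": "math"},
    {"text": "newton's second law", "first_category": "K12", "second_category": "physics"},
    {"text": "baby eczema treatment", "first_category": "disease knowledge", "second_category": "eczema"},
]
groups = classify_two_stage(sample, second_threshold=2)
sizes = [len(group) for group in groups]
```

With a threshold of 2, the three K12 cues exceed the limit and are split into a math group and a physics group, while the single disease-knowledge cue stays in its original group.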

In some examples, the production cues obtained from the search logs may carry first category information, which may be determined by a preset classification model, without limitation. The first category may be, for example, K12, disease knowledge, etc. The production cues may be classified by the first category first. However, because the distribution of production cues is non-uniform, the number of production cues in some first categories, such as K12 and disease knowledge, is much greater than in other categories. Accordingly, a corresponding threshold may be set: after classification based on the first category, any production cue group whose number of production cues is greater than the threshold is further classified based on the second category.

In some examples, the second category of a production cue is category data obtained by deep semantic refinement of the production cue. Production cues with the same second category are more similar in their deep semantics. Therefore, aggregating each group separately after classification by the second category keeps the production cues within each resulting cluster similar, and reduces the probability of missed recalls when similar production cues are recalled for each production cue. This facilitates subsequent directional production based on the aggregated production cue clusters and reduces the possibility of repeated production.

According to some embodiments, further classifying the production cues in the at least one second production cue group based on the second category may comprise, for each of the at least one second production cue group: counting the number of production cues corresponding to each second category; and classifying the second production cue group based on the number of production cues corresponding to each second category and the second threshold, so as to obtain one or more first production cue groups. Production cues having the same second category belong to the same first production cue group.

In some examples, for each second production cue group whose number of production cues is greater than the preset threshold, the number of production cues corresponding to each second category in the group may be counted. For example, the second categories may be sorted by their number of production cues in descending order and then grouped based on those counts and the preset second threshold. For example: second category 1 has 50,000 production cues, second category 2 has 45,000, second category 3 has 40,000, second category 4 has 30,000, second category 5 has 29,000, and so on, and the preset second threshold is 100,000 production cues. The production cues of second categories 1 and 2 can then be combined into one group (95,000), the production cues of second categories 3, 4, and 5 into another group (99,000), and so on. Of course, it should be understood that the grouping may also be performed directly, without sorting, based on the number of production cues corresponding to each second category and the preset second threshold, as long as the number of production cues in each resulting group is not greater than the preset second threshold.
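The grouping in this example can be reproduced with a short sketch. The function name and the sequential packing strategy are assumptions for illustration; the text only requires that each resulting group stay within the second threshold.

```python
def pack_second_categories(counts, second_threshold):
    """Sort categories by cue count (descending) and pack them sequentially.

    Returns a list of (category_names, total_count) tuples. A category is added
    to the current group unless that would push the total above the threshold.
    Note: a single category larger than the threshold still forms its own group.
    """
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    groups, current, current_size = [], [], 0
    for category, count in ordered:
        if current and current_size + count > second_threshold:
            groups.append((current, current_size))
            current, current_size = [], 0
        current.append(category)
        current_size += count
    if current:
        groups.append((current, current_size))
    return groups


counts = {
    "category 1": 50_000,
    "category 2": 45_000,
    "category 3": 40_000,
    "category 4": 30_000,
    "category 5": 29_000,
}
packed = pack_second_categories(counts, second_threshold=100_000)
# packed == [(["category 1", "category 2"], 95000),
#            (["category 3", "category 4", "category 5"], 99000)]
```

Sequential packing is simple but not optimal in general; a bin-packing heuristic such as first-fit decreasing could produce fewer groups for other count distributions.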

In some examples, the second category of a production cue may be obtained by a semantic classification model based on Natural Language Processing (NLP), e.g., the open-source LAC model. The second category may be, for example, category information obtained by further subdividing one or more of the first categories. Additionally or alternatively, the second category may also be custom fine-grained category information that is not necessarily subordinate to the first category.

In some embodiments, the production cues are divided at a fine granularity, and the second category may include an entity category and an abstract category. An entity category is, for example, the name of a specific concept such as a hometown or the Great Wall; an abstract category is, for example, the name of an abstract concept such as mother and infant or religion.

According to some embodiments, the entity category and the abstract category may each be determined by a deep learning based model. For example, the entity category may be determined by using a lexical analysis model (e.g., LAC) to identify the entity words in the query sentence that serves as the production cue. The abstract category may be determined by a deep learning based abstract category model. For example, the abstract category model may be a multi-label classification model trained on the ERNIE 2.0 Chinese pre-trained model.

Thus, in accordance with some embodiments, when the second category includes an entity category and an abstract category, the method 200 may further include: in response to determining that a production cue has both an entity category and an abstract category, deduplicating the second category of the production cue.

In some examples, for each second production cue group whose number of production cues is greater than the preset threshold, the number of production cues corresponding to each second category in the group may be counted. If a production cue has both an entity category and an abstract category, it only needs to take part in the counting and classification under one of its second categories, e.g., only under its entity category or only under its abstract category.
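A minimal sketch of this deduplication follows, assuming each cue is a dictionary with hypothetical keys `entity_category` and `abstract_category`. Which of the two categories takes precedence is a free choice in the text; the `prefer` parameter here is an illustrative assumption.

```python
from collections import Counter


def single_second_category(cue, prefer="entity"):
    """Pick exactly one second category for a cue so it is counted only once."""
    entity = cue.get("entity_category")
    abstract = cue.get("abstract_category")
    if entity and abstract:
        return entity if prefer == "entity" else abstract
    return entity or abstract


def count_by_second_category(cues, prefer="entity"):
    """Count cues per second category, with each cue contributing to one category."""
    return Counter(single_second_category(cue, prefer) for cue in cues)


# Hypothetical cues; "history" is an invented abstract category for illustration.
cues = [
    {"text": "who built the great wall", "entity_category": "Great Wall", "abstract_category": "history"},
    {"text": "origins of religion", "entity_category": None, "abstract_category": "religion"},
    {"text": "great wall tickets", "entity_category": "Great Wall", "abstract_category": None},
]
category_counts = count_by_second_category(cues)
```

Because every cue contributes to exactly one category, the per-category counts sum to the number of cues, which keeps the threshold comparison in the grouping step meaningful.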

After the classified production cue groups are obtained, each production cue group can be aggregated separately, so that directional knowledge production can be carried out based on the aggregated production cue groups.

According to some embodiments, the method 200 may further comprise: sending the first production cue groups to a plurality of units or machines, respectively, so that each first production cue group is aggregated on its unit or machine.

For example, the classified production cue groups may be distributed to respective machines (slaves) so that the production cues in each group are aggregated on the corresponding machine. Therefore, when the number of production cues is very large, the time consumed by aggregation can be greatly reduced and the aggregation efficiency greatly improved.
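The fan-out of groups to workers can be sketched on a single host. A thread pool is used here purely as a stand-in for separate machines, and `aggregate_group` is a hypothetical placeholder for the per-group aggregation; neither is prescribed by the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregate_group(group):
    # Placeholder for the per-group aggregation (vectorize, recall, cluster).
    # Here each distinct cue text simply becomes its own single-element cluster.
    return [[text] for text in sorted(set(group))]


def aggregate_all(first_groups, max_workers=4):
    """Aggregate every first production cue group independently and in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with first_groups.
        return list(pool.map(aggregate_group, first_groups))


results = aggregate_all([["b", "a", "a"], ["c"]])
```

The design point is that groups never need to exchange data during aggregation, so the work can be spread over workers without coordination beyond the initial dispatch.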

In some examples, before each production cue group is aggregated, a number of production cues similar to each production cue (i.e., search query) need to be recalled first. Based on the recall results, the similarity of each pair of production cues is calculated, and similar production cues are aggregated into clusters.
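One way to turn pairwise recall results into clusters is a union-find (disjoint-set) structure, sketched below. This is an assumption for illustration: the disclosure does not name a clustering algorithm, and merging transitively (A similar to B, B similar to C, so A, B, and C share a cluster) is a design choice of this sketch.

```python
def cluster_by_similar_pairs(num_cues, similar_pairs):
    """Merge cue indices into clusters given (i, j) pairs judged similar."""
    parent = list(range(num_cues))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root index under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for i in range(num_cues):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


clusters = cluster_by_similar_pairs(5, [(0, 1), (1, 2), (3, 4)])
# clusters == [[0, 1, 2], [3, 4]]
```

Transitive merging can chain loosely related cues into one large cluster; a stricter variant would require every member to be similar to a cluster representative.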

In general, keywords of the production cues may be extracted and matched against each other, so as to recall, for each production cue, a predetermined number of production cues that share the same keywords. That is, keyword matching is a measure of textual similarity, which requires the recall results and the original production cue to share keywords. Thus, production cues having the same semantics but different expressions cannot be recalled. In the embodiments according to the present disclosure, recall based on vector similarity is a measure of semantic similarity: even if the production cue to be recalled shares no keyword with the original production cue, it can still be recalled as long as the expressed semantics are the same.
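The contrast can be made concrete with two queries that share no words. The vectors below are hand-written toy values standing in for the output of a semantic model (an assumption; they are not produced by any real model), and whitespace tokenization is a simplification of keyword extraction.

```python
import math


def keyword_overlap(query_a, query_b):
    """Return the set of lowercase whitespace tokens shared by two queries."""
    return set(query_a.lower().split()) & set(query_b.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


query_a = "baby eczema remedy"
query_b = "how to treat infantile atopic dermatitis"

# Toy vectors only; a real system would obtain these from a trained model.
toy_vectors = {query_a: [0.9, 0.1, 0.4], query_b: [0.8, 0.2, 0.5]}

shared = keyword_overlap(query_a, query_b)
similarity = cosine_similarity(toy_vectors[query_a], toy_vectors[query_b])
```

Here `shared` is empty, so keyword matching would never pair the two queries, while the cosine similarity of the toy vectors is above 0.98 and would pass a typical similarity threshold.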

In some examples, the precondition for vector recall is that each production thread be transformed into a multidimensional vector, i.e., that a vector be obtained for each production thread in the first production thread group. For example, the vector representation of the production cue may be obtained by an Ernie-Sim-sender model trained based on an Ernie 2.0 pre-training model, with a vector dimension of, for example, 64. After the vectors of the production threads are obtained, the data containing the vectors is written to a search engine (e.g., Elasticsearch), and the vector recall is then performed.

Because vector recall measures the similarity of production clues at the semantic level, the probability of missed recall is greatly reduced compared with keyword recall. A good clustering effect can therefore be achieved with a small number of recall results. Reducing the recall quantity in turn saves subsequent aggregation time and improves the aggregation efficiency.

According to some embodiments, determining the first number of production threads based on the vector may include: a first number of production threads for each production thread having a similarity greater than a first threshold is determined by an approximate nearest neighbor search algorithm.

In the field of machine learning, a problem that often arises in semantic retrieval, image recognition, recommendation systems, and the like is the following: given a vector x = [x1, x2, x3, ..., xn], the top K most similar vectors need to be found in a huge vector library. These vectors typically have a very high dimension, and conventional search methods are time-consuming, so that latency easily becomes a bottleneck. The exact most-similar search can therefore be converted to an approximate nearest neighbor (ANN) search. With an approximate nearest neighbor search algorithm, the K vectors returned are not necessarily the K most similar ones, but valid content can be found quickly in massive data, saving search time.

The approximate nearest neighbor search algorithm may be implemented based on any suitable algorithm or library, including but not limited to Annoy, Faiss, NMSLIB, FALCONN, etc.
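The recall step can be sketched as below. Note the assumptions: an exact brute-force cosine search is used as a stand-in for an approximate nearest neighbor index, the three-dimensional vectors are toy values rather than model output, and `recall_similar` is a hypothetical name. The parameter `k` plays the role of the first number and `threshold` the role of the first threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_similar(vectors, query_id, k, threshold):
    """Return up to k ids whose similarity to query_id exceeds threshold."""
    scored = [(cosine(vectors[query_id], vec), other)
              for other, vec in vectors.items() if other != query_id]
    scored = [(s, other) for s, other in scored if s > threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [other for _, other in scored[:k]]


vectors = {
    "q1": [1.0, 0.0, 0.0],
    "q2": [0.9, 0.1, 0.0],
    "q3": [0.0, 1.0, 0.0],
    "q4": [0.8, 0.0, 0.6],
}
neighbors = recall_similar(vectors, "q1", k=2, threshold=0.7)
```

An ANN library would replace the linear scan with an index lookup, trading a small loss of exactness for a large reduction in search time on massive vector libraries.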

According to some embodiments, aggregating each production cue and all of the determined first number of production cues may comprise: performing the following operations for each production cue to obtain a second number of clusters (step 410): forming the production thread itself into a cluster (step 4101); determining, in turn, a similarity score between each of the first number of production threads corresponding to the production thread and each production thread in the cluster (step 4102); and, in response to the similarity scores each being greater than a third threshold, merging the respective ones of the first number of production cues into the cluster (step 4103); and merging the second number of clusters to obtain one or more merged clusters (step 420). Note that the similarity score between production threads in each cluster is greater than the third threshold.

It is understood that the second number is the number of production threads in the corresponding first production thread group obtained after classification.

Illustratively, each production thread is first aggregated with a first number of production threads based on its recall. For example, given that for each production thread 10 production threads similar to it are recalled in their respective production thread group, the respective production threads of the 11 production threads are clustered such that the similarity scores between the production threads in the generated clusters are each greater than a third threshold. After the clusters corresponding to each production cue are obtained, then cluster-to-cluster merging may be performed. Also, in the cluster merging process, the similarity score between the production threads in each cluster needs to be satisfied to be greater than the third threshold.
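Steps 410 and 420 can be sketched as follows, under stated assumptions: the pairwise scores come from a small hand-written table instead of a trained model, clusters are merged greedily in input order, and the names `seed_cluster` and `merge_clusters` are hypothetical.

```python
def seed_cluster(cue, recalled, score, threshold):
    """Step 410: start a cluster from the cue, then admit each recalled cue
    only if it scores above the threshold against every current member."""
    cluster = [cue]
    for cand in recalled:
        if cand not in cluster and all(score(cand, m) > threshold for m in cluster):
            cluster.append(cand)
    return cluster


def merge_clusters(clusters, score, threshold):
    """Step 420: merge a cluster into an earlier one only if every cross pair
    of distinct cues scores above the threshold."""
    merged = []
    for cluster in clusters:
        for target in merged:
            if all(score(a, b) > threshold
                   for a in cluster for b in target if a != b):
                target.extend(c for c in cluster if c not in target)
                break
        else:
            merged.append(list(cluster))
    return merged


# Toy symmetric score table (assumption); unknown pairs score 0.0.
SCORES = {frozenset(("a", "b")): 0.9,
          frozenset(("a", "c")): 0.2,
          frozenset(("b", "c")): 0.3}


def score(x, y):
    return SCORES.get(frozenset((x, y)), 0.0)


recall = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
seeds = [seed_cluster(cue, cands, score, 0.8) for cue, cands in recall.items()]
clusters = merge_clusters(seeds, score, 0.8)
```

Here the seed clusters of "a" and "b" contain the same two cues and collapse into one merged cluster, while "c" stays alone because none of its pairwise scores exceeds the threshold.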

According to some embodiments, a method according to the present disclosure may further comprise: the similarity scores between production threads are saved during the aggregation process such that merging between clusters is performed based on the saved similarity scores. Therefore, the similarity scores among the production cue clusters are prevented from being repeatedly calculated, the aggregation time is saved, and the operation efficiency is improved.
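Saving the similarity scores can be sketched as a symmetric cache wrapped around a scoring function. The class name `ScoreCache` and the token-overlap `toy_scorer` are assumptions for illustration; the disclosure obtains the score from trained models.

```python
class ScoreCache:
    """Memoize a symmetric pairwise scorer so each pair is scored only once."""

    def __init__(self, scorer):
        self._scorer = scorer
        self._cache = {}
        self.calls = 0  # number of times the underlying scorer actually ran

    def __call__(self, a, b):
        key = (a, b) if a <= b else (b, a)  # order-independent key
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._scorer(a, b)
        return self._cache[key]


def toy_scorer(a, b):
    """Hypothetical stand-in: Jaccard overlap of whitespace tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


cached = ScoreCache(toy_scorer)
first = cached("tomato eggs", "tomato eggs recipe")
second = cached("tomato eggs recipe", "tomato eggs")  # served from the cache
```

Passing such a cached scorer to both the per-cue clustering and the cluster merging means that a pair scored during the first stage is not scored again during the second.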

According to some embodiments, the similarity score may be determined by a trained semantic matching model and a semantic sentence pattern classification model. Thus, determining the similarity score may comprise: obtaining a first similarity score between pairs of production cues based on the semantic matching model; and in response to the first similarity score being greater than a fourth threshold, obtaining a second similarity score between the pair of production cues based further on a semantic sentence pattern classification model to treat the second similarity score as the determined similarity score.

The semantic matching model and the semantic sentence pattern classification model can accurately measure the similarity between the production clue pairs. Thus, the deep semantics among the aggregated thread clusters are made identical to reduce the repetitive production.
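The two-stage scoring can be sketched as below. The stand-in models `toy_match` and `toy_pattern` are hypothetical and far simpler than trained models. Returning the first score when it does not exceed the fourth threshold is also an assumption, since the text above only specifies the case in which the threshold is exceeded.

```python
def similarity_score(pair, match_model, pattern_model, fourth_threshold):
    """Score a pair with the matching model and, only if that first score
    exceeds the fourth threshold, refine it with the sentence pattern model."""
    first = match_model(pair)
    if first > fourth_threshold:
        return pattern_model(pair, first)
    return first  # assumption: low-scoring pairs keep the first score


def toy_match(pair):
    """Hypothetical matching model: Jaccard overlap of lowercase tokens."""
    a, b = (set(text.lower().split()) for text in pair)
    return len(a & b) / len(a | b)


def toy_pattern(pair, first_score):
    """Hypothetical pattern model: halve the score if only one cue is a question."""
    same_pattern = pair[0].endswith("?") == pair[1].endswith("?")
    return first_score if same_pattern else first_score * 0.5


refined = similarity_score(("how to fry eggs ?", "how to fry eggs"),
                           toy_match, toy_pattern, 0.5)
skipped = similarity_score(("tomato eggs", "subway map"),
                           toy_match, toy_pattern, 0.5)
```

Gating the second model on the first score means the more detailed sentence pattern check runs only for pairs that are already likely matches.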

According to some embodiments, the training data of the semantic matching model may include supervision data formed after data filtering of the log containing the user click data.

In some examples, to increase the accuracy of the search, each search engine typically maintains a search history of the user, which may include, for each query, the search results that the user clicked on to view. For example, the content of the user query (i.e., search query) is "scrambled eggs with tomatoes", and the user clicks on the page "How to make scrambled eggs with tomatoes?". The user click data contains, to some extent, content of interest to the user, i.e., what the user wants to acquire. It can thus be seen that the user click data is, to some extent, associated with or similar to the query content (i.e., search query). Therefore, training the semantic matching model on the supervision data formed after screening the log containing the user click data allows the similarity between production clue pairs to be obtained more accurately.

According to some embodiments, the input of the semantic sentence pattern classification model may include a first similarity score output by the semantic matching model, and a feature obtained after feature extraction of each production thread in the production thread pair.

According to some embodiments, the features include at least one of: the number of terms, whether a pronoun is included, whether the sentence is active or passive, etc.

According to some embodiments, the semantic sentence pattern classification model may be based on a gradient boosted decision tree (Gradient Boosted Decision Tree, GBDT) model.

For example, after vector-based recall, similarity clustering may be performed based on the recall results. However, vector-based similarity matching alone does not, to some extent, accurately measure the similarity between pairs of production threads. To measure this similarity more accurately, a similarity model may be pre-trained. The similarity model is formed by fusing a semantic matching model and a semantic sentence pattern classification model. The training data of the semantic matching model comprises weak supervision data formed after screening the click log of the search engine, wherein the screening process can comprise screening of question-type search words/search sentences, data standardization and the like. The semantic matching model may use, for example, a SimNet network with a bag-of-words (BOW) representation layer and a hinge loss function. It should be appreciated that the structure of the semantic matching model is not so limited and that any other suitable network (e.g., CNN) and algorithm are possible. The semantic sentence pattern classification model serves as a supplement to the semantic matching model and improves the matching precision by identifying whether the sentence patterns of the production clue pair are the same. For example, the semantic sentence pattern classification model may be obtained by extracting features such as the number of terms of each production clue, whether a pronoun (e.g., him, me, etc.) is included, whether the sentence is active or passive, and the similarity score output by the semantic matching model, and then training a GBDT-based model on these features. It should be understood that the semantic sentence pattern classification model is not so limited and that any other suitable model and network are possible. After the similarity scores of the corresponding production cue pairs are obtained, similarity clustering can be performed based on the similarity scores.
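The feature extraction for the sentence pattern classification model can be sketched as follows. The pronoun list, the crude passive-voice heuristic, and the names `cue_features` and `pair_features` are all assumptions for illustration and apply to English text only. A real system would use proper word segmentation and syntactic analysis, and would pass the resulting vectors to a GBDT trainer.

```python
# Hypothetical English word lists used only for this illustration.
PRONOUNS = {"i", "me", "we", "us", "you", "he", "him", "she", "her",
            "it", "they", "them"}
BE_VERBS = {"is", "are", "was", "were", "be", "been", "being"}


def is_passive(terms):
    """Crude heuristic (assumption): a be-verb directly followed by an -ed word."""
    return any(t in BE_VERBS and nxt.endswith("ed")
               for t, nxt in zip(terms, terms[1:]))


def cue_features(cue):
    """Features of one cue: term count, pronoun flag, passive flag."""
    terms = cue.lower().split()
    return [len(terms),
            int(any(t in PRONOUNS for t in terms)),
            int(is_passive(terms))]


def pair_features(cue_a, cue_b, first_score):
    """Feature vector for a cue pair: both cues' features plus the
    similarity score output by the semantic matching model."""
    return cue_features(cue_a) + cue_features(cue_b) + [first_score]


features = pair_features("eggs are fried by him", "he fried eggs", 0.6)
```

In this example the two cues share a pronoun flag but differ in the passive flag, which is the kind of sentence pattern difference the classifier is meant to pick up.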

The embodiment of the disclosure also provides a production line aggregation device. Fig. 5 shows a block diagram of a production line aggregation apparatus 500 according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 500 may include: an acquisition unit 510 configured to obtain a production cue from the search log; a classification unit 520 configured to classify the production cues based on category information to obtain one or more first production cue groups; one or more aggregation units 530, wherein each of the aggregation units 530 is configured to perform the following slave operations: obtaining a vector of each production cue in the first production cue group; determining a first number of production threads for which the similarity of each production thread in the first group of production threads is greater than a first threshold based on the vector; and aggregating each production cue and all the determined first number of production cues to obtain one or more clusters.

Here, the operations of the above units 510 to 530 of the production line aggregation apparatus 500 are similar to those of the steps 210 to 250 described above, respectively, and are not repeated here.

There is also provided, in accordance with an embodiment of the present disclosure, a production cue aggregation system. Fig. 6 shows a block diagram of a production line aggregation system 600 according to an embodiment of the present disclosure. As shown in fig. 6, the system 600 may include: a first apparatus 610 configured to: obtain production clues from the search logs; and classify the production cues based on category information to obtain one or more first production cue groups; and one or more second apparatuses 620, wherein each of the second apparatuses is configured to perform the following slave operations: obtaining a vector of each production cue in the first production cue group; determining a first number of production threads for which the similarity of each production thread in the first group of production threads is greater than a first threshold based on the vector; and aggregating each production cue and all the determined first number of production cues to obtain one or more clusters.

Here, the operations of the above apparatuses 610 and 620 of the production line aggregation system 600 are similar to the operations of the steps 210 to 250 described above, respectively, and are not repeated here.

In the technical solution of the present disclosure, the acquisition, storage, application and the like of the user personal information involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.

According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.

Referring to fig. 7, a block diagram of an electronic device 700, which may be a server or a client of the present disclosure and which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in device 700 are connected to I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the device 700. The input unit 706 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. Storage unit 708 may include, but is not limited to, magnetic disks and optical disks. The communication unit 709 allows the device 700 to exchange information/data with other devices through computer networks, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. One or more of the steps of the method 200 described above may be performed when a computer program is loaded into RAM 703 and executed by the computing unit 701. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims

1. A production line aggregation method, comprising:

obtaining production clues from the search logs;

classifying the production cues based on category information to obtain one or more first production cue groups, wherein the category information comprises a first category and a second category, the first category is category information determined through a preset classification model, the second category is category data obtained after deep semantic refinement of the production cues, and the classifying of the production cues based on the category information comprises:

classifying the production cues based on the first category to obtain one or more second production cue groups; and

in response to determining that the number of production threads in the at least one second production thread group is greater than a second threshold, classifying production threads in the at least one second production thread group based on the second category, respectively, to obtain one or more first production thread groups;

for each of the one or more first production cue groups, performing the following aggregation operations:

obtaining a vector of each production cue in the first production cue group;

determining a first number of production threads for which the similarity of each production thread in the first group of production threads is greater than a first threshold based on the vector; and

aggregating each production cue and all the determined first number of production cues to obtain one or more clusters.

2. The method of claim 1, wherein classifying the production threads in the at least one second production thread group based further on the second category comprises:

for each of the at least one second set of production cues:

counting the number of production threads corresponding to each second category; and

classifying the second production thread groups based on the respective corresponding number of production threads of the respective second category and the second threshold to obtain one or more first production thread groups, wherein,

production threads having the same second category belong to the same first production thread group.

3. The method of claim 2, wherein the second category includes an entity category and an abstract category, the method further comprising:

in response to determining that a production thread contains both the entity category and the abstract category, a second category of the production thread is deduplicated.

4. The method of any one of claims 1-3, wherein the second category comprises an entity category and an abstract category, and wherein,

the entity category and the abstract category are each determined by a deep learning based model.

5. The method of claim 1, wherein determining the first number of production cues based on the vector comprises:

and determining a first number of production threads, the similarity of which is greater than a first threshold, of each production thread through an approximate nearest neighbor search algorithm.

6. The method of claim 1, wherein aggregating each of the production threads and all of the determined first number of production threads comprises:

the following operations are performed for each production cue to obtain a second number of clusters:

forming the production thread itself into clusters;

determining a similarity score of each production cue of the first number of production cues corresponding to the production cue with each production cue of the cluster in turn; and

merging respective ones of the first number of production cues into the cluster in response to the similarity scores each being greater than a third threshold; and

combining the second number of clusters to obtain one or more clusters after combining, wherein,

the similarity score between production threads in each cluster is greater than the third threshold.

7. The method of claim 6, further comprising: the similarity scores between production threads are saved during the aggregation process such that merging between clusters is performed based on the saved similarity scores.

8. The method of claim 6 or 7, wherein the similarity score is determined by a trained semantic matching model and a semantic sentence pattern classification model, wherein,

determining the similarity score includes:

obtaining a first similarity score between pairs of production cues based on the semantic matching model; and

and in response to the first similarity score being greater than a fourth threshold, obtaining a second similarity score between the pair of production cues based further on the semantic sentence pattern classification model to treat the second similarity score as the determined similarity score.

9. The method of claim 8, wherein,

the training data of the semantic matching model comprises supervision data formed after data screening of a log containing user click data, and wherein,

the input of the semantic sentence pattern classification model comprises a first similarity score output by the semantic matching model and a feature obtained after feature extraction of each production cue in the production cue pair.

10. The method of claim 9, wherein the features comprise at least one of the group consisting of: the number of terms, whether a pronoun is included, and whether the sentence is active or passive.

11. The method of claim 9, wherein the semantic sentence pattern classification model is based on a gradient boosted decision tree model.

12. The method of any one of claims 1-11, further comprising: the first production cue groups are sent to a plurality of units or machines, respectively, such that the aggregation operation is performed on the units or machines.

13. A production line aggregation apparatus comprising:

an acquisition unit configured to obtain a production cue from the search log;

a classification unit configured to classify the production thread based on category information to obtain one or more first production thread groups, wherein the category information includes a first category and a second category, the first category information is category information determined by a preset classification model, the second category is category data obtained after deep semantic refinement of the production thread, and wherein the classification unit includes:

a unit that classifies the production cues based on the first category to obtain one or more second production cue groups; and

a unit that, in response to determining that the number of production threads in the at least one second production thread group is greater than a second threshold, classifies production threads in the at least one second production thread group based further on the second category to obtain one or more first production thread groups;

one or more aggregation units, wherein each of the aggregation units is configured to perform the following slave operations:

obtaining a vector of each production cue in the first production cue group;

14. The apparatus of claim 13, wherein the aggregation unit comprises:

the following operations are performed for each production cue to obtain units of a second number of clusters:

forming the production thread itself into clusters;

determining a similarity score for each of the corresponding first number of production threads and each of the clusters in turn; and

merging the second number of clusters to obtain a unit of one or more clusters after merging, wherein,

the similarity score between the production threads in each cluster is greater than a third threshold.

15. A production line aggregation system, comprising:

a first apparatus configured to:

obtaining production clues from the search logs; and

one or more second apparatuses, wherein each of the second apparatuses is configured to perform the following slave operations:

obtaining a vector of each production cue in the first production cue group;

16. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.

17. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-12.