CN116932705A - Processing method, device, equipment, storage medium and program product for searching entry - Google Patents

Publication number
CN116932705A
Authority
CN
China
Prior art keywords
candidate search
search terms
repetition
candidate
video
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210377991.XA
Other languages
Chinese (zh)
Inventor
陈小帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210377991.XA
Publication of CN116932705A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3322 Query formulation using system suggestions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a processing method, a device, equipment, a storage medium and a program product for search terms, which can be applied to video search and can improve the deduplication coverage of search terms. In the application, a plurality of candidate search terms corresponding to an input text are determined; a repetition degree based on text semantics between any two candidate search terms is determined from the semantic representations of the two candidate search terms and of the input text; the repetition degree between the search results of the two candidate search terms is counted to obtain a repetition degree based on search results between them; the repetition degree between the tag features corresponding to the historical click object groups of the two candidate search terms is counted to obtain a repetition degree based on object click behaviors between them; and the candidate search terms are deduplicated based on the repetition degree based on text semantics, the repetition degree based on search results, and the repetition degree based on object click behaviors between any two candidate search terms.

Description

Processing method, device, equipment, storage medium and program product for searching entry
Technical Field
The present application relates to the field of computer technology, and in particular, to a processing method, an apparatus, a computer device, a storage medium, and a computer program product for searching for an entry.
Background
With the rapid development of internet technology and smart devices, users can search for information on various platforms, such as searching videos, searching goods, searching articles, searching answers, and the like. Typically, to facilitate a user's search, after the user enters text, the platform provides the user with a list of candidate search terms that include more information and have more definite directionality, so that the user may select a candidate search term from the list and trigger the search.
However, the number of candidate search terms that a platform can currently provide is limited, and even within this limited number the candidate search terms are often highly repetitive, for example the candidate search term "film work participated in by Alice" and the candidate search term "film participated in by Alice". As a result, the number of candidate search terms actually available for the user to choose from is even smaller, which reduces search efficiency and degrades the search experience.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a search term processing method, apparatus, computer device, computer readable storage medium, and computer program product that accurately deduplicates candidate search terms to improve search efficiency.
The application provides a processing method for searching entries, which comprises the following steps:
determining a plurality of candidate search terms corresponding to the input text;
determining the repetition degree based on text semantics between any two candidate search terms, based on semantic representations of the any two candidate search terms and of the input text;
counting the repetition degree between the search results of any two candidate search terms to obtain the repetition degree between any two candidate search terms based on the search results;
counting the repetition degree between the tag features corresponding to the historical click object groups of any two candidate search terms to obtain the repetition degree between any two candidate search terms based on the object click behaviors; the tag features characterize the interestingness distribution of the historical click object group for the content of various tags;
and de-duplicating the candidate search terms based on the repetition degree based on text semantics, the repetition degree based on search results and the repetition degree based on object clicking behaviors between any two candidate search terms.
The application provides a processing device for searching entries, which comprises:
a candidate search term determining module for determining a plurality of candidate search terms corresponding to the input text;
the first repetition degree acquisition module is used for determining the repetition degree based on text semantics between any two candidate search terms in the plurality of candidate search terms, based on semantic representations of the any two candidate search terms and of the input text;
the second repetition degree acquisition module is used for counting the repetition degree between the search results of any two candidate search terms and obtaining the repetition degree between any two candidate search terms based on the search results;
the third repetition acquisition module is used for counting the repetition between the tag features corresponding to the historical click object groups of any two candidate search terms to obtain the repetition between any two candidate search terms based on the object click behaviors; the tag features characterize the interestingness distribution of the historical click object group for the content of various tags;
and the deduplication module is used for deduplicating the candidate search terms based on the repetition degree based on text semantics, the repetition degree based on search results and the repetition degree based on object clicking behaviors between any two candidate search terms.
In one embodiment, the candidate search term determining module is further configured to obtain an input text; querying a candidate search term database to obtain candidate search terms matched with the input text; and sorting the candidate search terms matched with the input text according to the relevance of the input text and the candidate search terms, the heat of the candidate search terms and the historical click rate, so as to obtain a plurality of candidate search terms corresponding to the input text.
In one embodiment, the first repetition obtaining module is further configured to input the arbitrary two candidate search terms and the input text into a semantic repetition prediction model; respectively obtaining the respective depth semantic representation of any two candidate search terms and the depth semantic representation of the input text through the semantic representation network of the semantic repetition prediction model, and then performing splicing processing to obtain the spliced depth semantic representation; and predicting the repetition degree based on text semantics between any two candidate search terms according to the spliced deep semantic representation by a fully connected network connected with the semantic representation network in the semantic repetition degree prediction model.
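The structure described in this embodiment (three representations, a splicing step, then a fully connected layer) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `embed` function is a toy hashed bag-of-characters encoder standing in for the semantic representation network, the embedding size `DIM` is arbitrary, and the weights are random rather than trained, so the printed score has no meaning beyond showing the data flow.

```python
import math
import random
import zlib

DIM = 8  # hypothetical embedding size


def embed(text: str, dim: int = DIM) -> list[float]:
    """Toy stand-in for the semantic representation network:
    a hashed bag-of-characters vector, L2-normalized."""
    vec = [0.0] * dim
    for ch in text:
        vec[zlib.crc32(ch.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def spliced_representation(term_a: str, term_b: str, input_text: str) -> list[float]:
    """Concatenate (splice) the three representations into one vector."""
    return embed(term_a) + embed(term_b) + embed(input_text)


def predict_repetition(features: list[float], weights: list[float], bias: float) -> float:
    """Single fully connected layer followed by a sigmoid (assumed output form)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Untrained, randomly initialized parameters (illustration only).
rng = random.Random(0)
weights = [rng.uniform(-0.5, 0.5) for _ in range(3 * DIM)]

features = spliced_representation(
    "film work participated in by Alice",
    "film participated in by Alice",
    "Alice",
)
score = predict_repetition(features, weights, bias=0.0)
print(round(score, 3))
```

In a trained model the weights would come from the positive and negative samples described for the training module; the sketch only shows how the input text enters the prediction alongside the two candidate search terms.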
In one embodiment, the device comprises a first training module, a second training module and a third training module, wherein the first training module is used for acquiring a training sample of the semantic repetition prediction model, the training sample of the semantic repetition prediction model comprises a positive example sample and a negative example sample, the positive example sample is composed of candidate search terms marked with repetition in candidate search terms of sample input text and the sample input text, and the negative example sample is composed of candidate search terms not marked with repetition in candidate search terms of the sample input text and the sample input text; and optimizing network parameters of a semantic representation network and a fully-connected network in the semantic repeatability prediction model based on positive examples and negative examples included in training samples of the semantic repeatability prediction model until training is stopped.
In one embodiment, the any two candidate search terms include a first candidate search term and a second candidate search term; the second repetition obtaining module is further configured to obtain a first search result list corresponding to the first candidate search term; acquiring a second search result list corresponding to the second candidate search term; and counting the repetition degree between the top-ranked multiple search results in the first search result list and the top-ranked multiple search results in the second search result list to obtain the repetition degree between the first candidate search term and the second candidate search term based on the search results, wherein a higher-ranked search result has a higher degree of correlation with the corresponding candidate search term.
In one embodiment, the second repetition obtaining module is further configured to determine a video intersection and a video union formed by the top-ranked videos in the first search result list and the top-ranked videos in the second search result list; for each video in the video union, counting respective time length, playing times in a first preset time period and playing average integrity of the video to obtain respective statistic values of the video; for each video in the video intersection, counting respective time length, playing times in a first preset time period and playing average integrity of the video to obtain respective statistic values of the video; and obtaining the repetition degree between the first candidate search term and the second candidate search term based on the search result according to the ratio of the sum of the statistic values of all videos in the video intersection to the sum of the statistic values of all videos in the video union.
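The ratio described in this embodiment behaves like a weighted overlap between two top-ranked result lists. The sketch below is one possible reading: the text does not say how duration, play count and average playback completeness are combined into a single statistic, so multiplying them is an assumption, and the names `VideoStats`, `statistic` and `result_repetition` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoStats:
    duration_s: float        # video duration in seconds
    plays: int               # play count within the first preset time period
    avg_completeness: float  # average playback completeness


def statistic(v: VideoStats) -> float:
    # Assumption: the per-video statistic is the product of the three quantities.
    return v.duration_s * v.plays * v.avg_completeness


def result_repetition(top_a: list[str], top_b: list[str],
                      stats: dict[str, VideoStats]) -> float:
    """Sum of statistics over the intersection divided by the sum over the union."""
    set_a, set_b = set(top_a), set(top_b)
    union_total = sum(statistic(stats[v]) for v in set_a | set_b)
    if union_total == 0:
        return 0.0
    inter_total = sum(statistic(stats[v]) for v in set_a & set_b)
    return inter_total / union_total


stats = {
    "v1": VideoStats(100.0, 10, 0.5),  # statistic 500
    "v2": VideoStats(200.0, 5, 0.8),   # statistic 800
    "v3": VideoStats(50.0, 4, 1.0),    # statistic 200
}
repetition = result_repetition(["v1", "v2"], ["v2", "v3"], stats)
print(round(repetition, 4))  # 800 / 1500
```

Weighting by the statistic means that a shared long, frequently and fully watched video contributes more to the repetition degree than a shared video that is rarely played.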
In one embodiment, the device further includes an integrity obtaining module, configured to obtain a playing duration and a playing number of times of the video in a second preset time period; and taking the ratio between the playing time length and the playing times of the video in a second preset time period as the playing average integrity of the video.
In one embodiment, the third repetition obtaining module is further configured to determine a historical click object group of the candidate search term, obtain, for each historical click object in the historical click object group, an interest level distribution of each historical click object in contents of various tags according to a viewing record of the historical click object, calculate, for the historical click object group of the candidate search term, a mean value of the interest level distribution of each historical click object in contents of various tags, and obtain a tag feature corresponding to the historical click object group of the candidate search term; and for any two candidate search terms, obtaining the repetition degree between any two candidate search terms based on the object clicking behaviors according to the repetition degree between the tag features corresponding to the historical clicking object groups of the any two candidate search terms.
In one embodiment, the third repetition degree obtaining module is further configured to calculate a cosine distance between tag features corresponding to the historical click object groups of the any two candidate search terms, and obtain, based on the cosine distance, a repetition degree based on an object click behavior between the any two candidate search terms, where the repetition degree based on the object click behavior is inversely related to the cosine distance.
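A minimal sketch of the two steps above (averaging per-object interest distributions into a group tag feature, then comparing two groups by cosine distance) is given below. The text only states that the repetition degree is inversely related to the cosine distance; taking it as one minus the distance is an assumption, and the function names are hypothetical.

```python
import math


def group_tag_feature(distributions: list[list[float]]) -> list[float]:
    """Mean of the interest distributions of all historical click objects in a group."""
    if not distributions:
        raise ValueError("the historical click object group is empty")
    n = len(distributions)
    dim = len(distributions[0])
    return [sum(d[i] for d in distributions) / n for i in range(dim)]


def cosine_distance(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty feature as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def click_behavior_repetition(u: list[float], v: list[float]) -> float:
    # Assumption: repetition = 1 - cosine distance (any decreasing mapping would do).
    return 1.0 - cosine_distance(u, v)


# Interest distributions over three tags, e.g. (movie, variety, music).
group_a = group_tag_feature([[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]])
group_b = group_tag_feature([[0.7, 0.3, 0.0]])
print(round(click_behavior_repetition(group_a, group_b), 4))
```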
In one embodiment, the search result obtained based on searching the entry is a video, and the third repetition obtaining module is further configured to determine various tags preset for the video; obtaining labels of videos watched by the historical click object according to the watching record of the historical click object, and determining the interestingness of the historical click object in the videos of various labels according to the video duration, the playing completion degree and the playing time point of the corresponding videos watched by the historical click object for various labels; normalizing the interestingness of the historical click object in the videos of various labels to obtain the interestingness distribution of the historical click object in the videos of various labels.
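The following sketch shows one way an interest degree distribution could be derived from a viewing record. The embodiment names the inputs (video duration, playback completion and playback time point) but not the formula, so the product of duration and completion with an exponential recency decay is purely an assumption, as are the names `ViewRecord`, `interest_distribution` and the `half_life_days` parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewRecord:
    tag: str           # tag of the watched video
    duration_s: float  # video duration in seconds
    completion: float  # playback completion in [0, 1]
    day: int           # day index of the playback time point


def interest_distribution(records: list[ViewRecord], tags: list[str],
                          now_day: int, half_life_days: float = 30.0) -> dict[str, float]:
    """Accumulate a raw interest value per tag, then normalize so the values sum to 1."""
    raw = {tag: 0.0 for tag in tags}
    for r in records:
        if r.tag not in raw:
            continue  # ignore tags outside the preset tag set
        # Assumption: more recent views count more (exponential decay).
        decay = 0.5 ** ((now_day - r.day) / half_life_days)
        raw[r.tag] += r.duration_s * r.completion * decay
    total = sum(raw.values())
    if total == 0.0:
        return raw  # no usable viewing history: all-zero distribution
    return {tag: value / total for tag, value in raw.items()}


records = [
    ViewRecord("movie", 100.0, 1.0, day=10),
    ViewRecord("variety", 100.0, 0.5, day=10),
]
dist = interest_distribution(records, ["movie", "variety", "music"], now_day=10)
print(dist)
```

With both views on the current day the decay factor is 1, so the distribution reflects only duration and completion: two thirds for "movie", one third for "variety", and zero for "music".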
In one embodiment, the deduplication module is further configured to perform weighted summation on the repetition level based on text semantics, the repetition level based on search results, and the repetition level based on object click behaviors between the arbitrary two candidate search terms, so as to obtain the repetition level between the arbitrary two candidate search terms; identifying a plurality of repeated groups from the plurality of candidate search terms based on the repetition degree between any two candidate search terms, wherein the repetition degree between candidate search terms belonging to the same repeated group in the plurality of repeated groups is higher than a set threshold value, and the repetition degree between candidate search terms belonging to different repeated groups is lower than the set threshold value; and de-duplicating candidate search terms belonging to the same repeated group.
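The weighted summation in this embodiment can be written in a few lines. The weights below are placeholder values chosen for illustration; the text does not specify them, and `combined_repetition` is a hypothetical name.

```python
def combined_repetition(semantic: float, result: float, click: float,
                        weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted sum of the three repetition degrees between two candidate search terms."""
    w_semantic, w_result, w_click = weights  # assumed values, not from the source
    return w_semantic * semantic + w_result * result + w_click * click


# Three per-dimension repetition degrees for one pair of candidate search terms.
overall = combined_repetition(semantic=0.9, result=0.8, click=0.7)
print(round(overall, 2))  # 0.36 + 0.24 + 0.21 = 0.81
```

If the weights sum to 1 and each input lies in [0, 1], the combined value also lies in [0, 1], which makes it straightforward to compare against a single set threshold.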
In one embodiment, the deduplication module is further configured to construct a candidate search term distance graph based on a repetition degree between the arbitrary two candidate search terms; nodes in the candidate search term distance graph represent candidate search terms, and the distance between the nodes is inversely related to the repetition degree between the candidate search terms; and excavating the candidate search term distance graph to obtain a plurality of repeated groups.
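One simple way to mine repeated groups from such a distance graph is to link every pair of nodes whose distance is below a cut-off and take the connected components. This is only a sketch of one possible mining strategy: the text does not name the algorithm, the mapping `distance = 1 - repetition` is an assumption, and the function names are hypothetical. Note that connected components guarantee that candidate search terms in different groups are farther apart than the cut-off, but two terms in the same group may be linked only indirectly through a third term.

```python
from itertools import combinations


def build_distance_graph(terms: list[str],
                         repetition: dict[tuple[str, str], float]) -> dict[tuple[str, str], float]:
    """Edge weights of the candidate search term distance graph (distance = 1 - repetition)."""
    graph = {}
    for a, b in combinations(terms, 2):
        rep = repetition.get((a, b), repetition.get((b, a), 0.0))
        graph[(a, b)] = 1.0 - rep
    return graph


def mine_repeated_groups(terms: list[str], distances: dict[tuple[str, str], float],
                         max_distance: float) -> list[list[str]]:
    """Connected components over edges shorter than max_distance, via union-find."""
    parent = {t: t for t in terms}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d < max_distance:
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())


terms = ["A", "B", "C"]
repetition = {("A", "B"): 0.9, ("A", "C"): 0.1, ("B", "C"): 0.1}
graph = build_distance_graph(terms, repetition)
# A repetition threshold of 0.6 corresponds to a distance cut-off of 0.4.
groups = mine_repeated_groups(terms, graph, max_distance=0.4)
print(groups)
```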
In one embodiment, the deduplication module is further configured to obtain an object tag sequence corresponding to an object input to the input text; for a repeated group with the number of the candidate search terms being greater than 1, respectively inputting any candidate search term in the repeated group, the object tag sequence and the input text into a click probability prediction model, and outputting the click probability of the object on any candidate search term in the repeated group; and in the repeated group, eliminating candidate search terms with click probability lower than a threshold value to obtain candidate search terms after duplication elimination.
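The filtering step inside each repeated group can be sketched as follows. The click probability prediction model is represented by a plain callable, since its internals are described separately; `dedup_groups` is a hypothetical name. The fallback that keeps the most probable term when every term in a group falls below the threshold is an added assumption, not something stated in the text.

```python
from typing import Callable


def dedup_groups(groups: list[list[str]],
                 click_probability: Callable[[str], float],
                 threshold: float) -> list[str]:
    """Within each repeated group, drop candidate search terms whose predicted
    click probability is below the threshold."""
    kept: list[str] = []
    for group in groups:
        if len(group) == 1:
            kept.extend(group)  # nothing to deduplicate
            continue
        probs = {term: click_probability(term) for term in group}
        survivors = [term for term in group if probs[term] >= threshold]
        if not survivors:
            # Assumption: never drop a whole group; keep its most probable term.
            survivors = [max(group, key=lambda term: probs[term])]
        kept.extend(survivors)
    return kept


# Stand-in for the model output for one object and one input text.
predicted = {"films of Alice": 0.7, "film works of Alice": 0.2, "Alice variety shows": 0.1}
result = dedup_groups(
    [["films of Alice", "film works of Alice"], ["Alice variety shows"]],
    click_probability=predicted.__getitem__,
    threshold=0.5,
)
print(result)
```

Singleton groups bypass the threshold entirely, so a candidate search term with no near-duplicate is kept even when its predicted click probability is low.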
In one embodiment, the deduplication module is further configured to input any candidate search term in the repetition group, the object tag sequence, and the input text into a click probability prediction model; respectively obtaining the depth semantic representation of the candidate search term, the depth semantic representation of the object tag sequence and the depth semantic representation of the input text through the semantic representation network of the click probability prediction model, and then performing splicing processing to obtain the spliced depth semantic representation; and predicting the click probability of the object on the candidate search term according to the spliced deep semantic representation by a fully connected network connected with the semantic representation network in the click probability prediction model.
In one embodiment, the apparatus further includes a second training module configured to obtain a training sample of the click probability prediction model, where the training sample of the click probability prediction model includes a positive sample and a negative sample, the positive sample is composed of sample input text, candidate search terms that an object inputting the sample input text has clicked, and an object tag sequence of the object inputting the sample input text, and the negative sample is composed of the sample input text, candidate search terms that an object inputting the sample input text has not clicked, and an object tag sequence of the object inputting the sample input text; and optimizing network parameters of a semantic representation network and a fully-connected network in the click probability prediction model based on the positive samples and negative samples included in the training samples of the click probability prediction model until training is stopped.
The application provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the above processing method of the search term when executing the computer program.
The present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above processing method of the search term.
The application provides a computer program product comprising a computer program which, when executed by a processor, implements the method of processing a search term as described above.
With the above processing method, apparatus, computer device, storage medium and computer program product for search terms, after a plurality of candidate search terms corresponding to the input text are obtained, the repetition degree between any two candidate search terms is determined from multiple dimensions. Specifically, the semantic representations of the two candidate search terms are mined together with the semantic representation of the input text to determine the repetition degree based on text semantics between them; the repetition degree based on search results between them is obtained from the repetition degree between their search results; and the historical click object groups of the two candidate search terms are determined, so that the repetition degree based on object click behaviors between them is obtained from the repetition degree between the interest degree distributions of those groups over the content of various tags. Because the repetition degree is no longer limited to the candidate search terms themselves, the information dimensions on which deduplication can rely are expanded and the deduplication effect is improved. Finally, the candidate search terms are deduplicated based on the multi-dimensional repetition degrees between any two candidate search terms, which improves the deduplication effect and ultimately the search efficiency and search experience of the user.
Drawings
FIG. 1 is an application environment diagram of a processing method for searching for terms in one embodiment;
FIG. 2 is a flow diagram of a processing method for searching for terms in one embodiment;
FIG. 3 is a schematic diagram of a video search interface in one embodiment;
FIG. 4 is a flow diagram of computing text semantic based repeatability in one embodiment;
FIG. 5 is a diagram of candidate search term distances in one embodiment;
FIG. 6 is a flow diagram of predicting a probability of clicking on a first candidate search term in one embodiment;
FIG. 7 is a block diagram of a method of processing for searching for terms in one embodiment;
FIG. 8 is a flowchart of another embodiment of a processing method for searching for terms;
FIG. 9 is a block diagram of a processing device for searching for terms in one embodiment;
FIG. 10 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the described embodiments of the application may be combined with other embodiments. It should be noted that references to "first," "second," etc. in this description are for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation and other directions.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering robots, knowledge graph techniques, and the like.
Machine learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and how it reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, learning from demonstration, and the like.
With the research and progress of artificial intelligence technology, it is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, smart customer service, the Internet of Vehicles, and intelligent transportation. It is believed that, as the technology develops, artificial intelligence will be applied in more fields and will be of increasing importance.
FIG. 1 is an application environment diagram of a processing method for searching for terms in one embodiment. The terminal 102 communicates and interacts with the server 104 via a communication network. The terminal 102 may be, but is not limited to, a desktop computer, notebook computer, smart phone, tablet computer, Internet of Things device, or portable wearable device; the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, or the like, and the portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server, as a server cluster composed of a plurality of servers, or as a cloud server. The server 104 may store the candidate search terms using a data storage system that may be integrated into the server 104 or located separately from it.
In one embodiment, after the terminal 102 obtains the input text, the input text may be sent to the server 104; server 104 determines a plurality of candidate search terms corresponding to the input text; based on semantic representations of any two candidate search terms of the plurality of candidate search terms and the input text, server 104 may determine a degree of repetition between any two candidate search terms based on text semantics; the server 104 may count the repetition degree between the search results of any two candidate search terms, to obtain the repetition degree between any two candidate search terms based on the search results; the server 104 may count the repetition degree between the tag features corresponding to the historical click object groups of any two candidate search terms, to obtain the repetition degree between any two candidate search terms based on the object click behavior; the tag features characterize the interest degree distribution of the historical click object group on the contents of various tags; based on the repetition degree based on text semantics, the repetition degree based on search results, and the repetition degree based on object clicking behaviors between any two candidate search terms, the server 104 may perform deduplication on the multiple candidate search terms, and feed back the candidate search terms obtained after deduplication to the terminal 102.
FIG. 2 is a flow chart of a processing method for searching for an entry in one embodiment, where the method may be executed by a server or a terminal, or may be executed by the server and the terminal together, and in an embodiment of the present application, the method is described as being executed by the server; the method comprises the following steps:
step S202, a plurality of candidate search terms corresponding to the input text are determined.
When an object (that is, a user) wants to find content such as videos or merchandise on a video platform, shopping platform or the like, the object may enter a corresponding text, which may be referred to as the input text (the input text is a search query), for the platform to search for the corresponding content.
The search term may be a search suggestion constructed based on the input text, so that the object may select a search term desired by the object itself from among the search terms without inputting the complete text and only inputting a part of the text, for the platform to perform content search based on the candidate search term selected by the object.
Taking video search as an example introduction:
FIG. 3 is a schematic diagram of a video search interface in one embodiment. Referring to FIG. 3, the video search interface 301 may be displayed on a terminal. When an object wants to search for films that Alice participates in, "Alice" may be input in the video search box 3011 of the video search interface 301, and the terminal may send "Alice" as the input text to a server. After receiving the input text, the server may construct a plurality of search terms based on the input text and feed them back to the terminal. The search terms fed back by the server may include: "Alice 2021 New Year's Eve", "variety shows Alice participates in", "works Alice and Bob jointly participate in", "film and television works Alice participates in" and "films Alice participates in". After receiving the plurality of search terms fed back by the server, the terminal may display them on the video search interface 301; referring again to FIG. 3, the video search interface 301 may display the search terms in the search term columns 3012 to 3016, respectively. When the object selects the search term "films Alice participates in" and clicks the search control 3018, the terminal may feed back the selected search term "films Alice participates in" to the server, and the server performs a video search based on that search term and feeds back the search result to the terminal. In addition, when the object wants to delete the text in the video search box 3011, the delete control 3017 can be clicked.
The search terms "film and television works Alice participates in" and "films Alice participates in" are repeated search terms. If repeated search terms are displayed to the object at the same time, the search efficiency of the object may be affected. Therefore, after the server preliminarily determines a plurality of search terms based on the input text, the server may perform deduplication processing on the preliminarily determined search terms and then feed back the search terms remaining after deduplication to the terminal. The search terms preliminarily determined by the server based on the input text may be referred to as candidate search terms.
After obtaining a plurality of candidate search terms based on the input text, the server may perform multiple-dimension repetition calculation on each two candidate search terms, where the multiple-dimension repetition calculation is described in steps S204 to S208.
In one embodiment, the server, when determining a plurality of candidate search terms corresponding to the input text, may perform the steps of: acquiring an input text; querying a candidate search term database to obtain candidate search terms matched with the input text; and sorting the candidate search terms matched with the input text according to the relevance of the input text and the candidate search terms, the popularity of the candidate search terms and the historical click rate, so as to obtain a plurality of candidate search terms corresponding to the input text.
Wherein the candidate search term database comprises a plurality of candidate search terms, the search terms are usually determined according to search content in a server, for example, the candidate search term database can be constructed based on description information of the search content and historical input text. In the scene of video search, the description information of the search result is, for example, a video name, a name of a participating actor, a name of a character, music, and the like, and in the commodity search scene, the description information of the search result is, for example, a commodity name, a commodity size, commodity use, and the like.
The above is described by taking video search as an example:
after the server receives an input text input by an object, the server queries a candidate search term database based on the input text, and constructs a candidate search term matched with the input text based on the description information such as video names, names of participated actors, role names, music and the like included in the candidate search term database and historical input texts input by other objects in the past; then, the server may sort the constructed candidate search terms matched with the input text according to the relevance of the candidate search terms and the input text, the popularity of the candidate search terms and the historical click rate, and determine a plurality of candidate search terms corresponding to the input text from the constructed candidate search terms matched with the input text according to the sorting result so as to perform the deduplication process.
In the above embodiment, after the input text is obtained, the candidate search term matched with the input text may be obtained based on the candidate search term database, and the multiple candidate search terms for performing the deduplication processing are determined by combining the relevance of the candidate search term and the input text, the popularity of the candidate search term, and the historical click rate, so that the deduplication efficiency of the search term may be improved.
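The retrieval and sorting steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the in-memory `CANDIDATE_DB`, the prefix-match rule in `relevance`, the scoring weights, and the function `rank_candidates` are all hypothetical. The text only states that relevance, popularity and historical click rate are used for sorting, not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    term: str
    popularity: float  # assumed to be normalised to [0, 1]
    click_rate: float  # historical click rate in [0, 1]


# Hypothetical in-memory stand-in for the candidate search term database.
CANDIDATE_DB = [
    Candidate("Alice participates in a film", 0.9, 0.30),
    Candidate("Alice participates in a movie work", 0.8, 0.25),
    Candidate("Alice participates in a variety", 0.6, 0.10),
    Candidate("Bob participates in a film", 0.7, 0.20),
]


def relevance(input_text, term):
    """Toy relevance (assumption): share of the term covered by the typed prefix."""
    if not input_text or not term.lower().startswith(input_text.lower()):
        return 0.0
    return len(input_text) / len(term)


def rank_candidates(input_text, db=CANDIDATE_DB, top_n=3, weights=(0.5, 0.3, 0.2)):
    """Return the top_n matching terms, sorted by an assumed weighted score."""
    w_rel, w_pop, w_ctr = weights

    def score(c):
        return (w_rel * relevance(input_text, c.term)
                + w_pop * c.popularity
                + w_ctr * c.click_rate)

    matched = [c for c in db if relevance(input_text, c.term) > 0.0]
    matched.sort(key=score, reverse=True)
    return [c.term for c in matched[:top_n]]
```

Calling `rank_candidates("Alice")` returns only terms that match the typed prefix; the resulting list is what would then be passed to the deduplication steps.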
Step S204, determining the repetition degree between any two candidate search terms based on text semantics based on semantic representations of the any two candidate search terms and the input text.
The repetition between candidate search terms shows up not only literally but also semantically. In the present application, text semantics is taken as one dimension along which the repetition between candidate search terms is analyzed. To further improve the accuracy of the text-semantic repetition between candidate search terms, the input text described in step S202 is used as the context when that repetition is calculated.
In the following description, among two candidate search terms to be subjected to the repetition calculation of text semantics, one candidate search term is referred to as a first candidate search term, and the other candidate search term is referred to as a second candidate search term.
After obtaining the input text described in step S202 and the first and second candidate search terms corresponding to it, the server obtains the semantic representation of the input text, of the first candidate search term, and of the second candidate search term. Using the input text as the context, the server combines the semantic representations of the two candidate search terms with the semantic representation of the input text to determine the text-semantics-based repetition between the first candidate search term and the second candidate search term.
In one embodiment, when calculating the text-semantics-based repetition, the server may input any two candidate search terms and the input text into the semantic repetition prediction model; obtain, through the semantic representation network of the model, the deep semantic representation of each of the two candidate search terms and the deep semantic representation of the input text, and splice them to obtain a spliced deep semantic representation; and predict, through the fully connected network connected to the semantic representation network in the model, the text-semantics-based repetition between the two candidate search terms from the spliced deep semantic representation.
The semantic repetition prediction model is used to predict the text-semantics-based repetition of candidate search terms. The model may include a semantic representation network, which may be a Bidirectional Encoder Representations from Transformers (BERT) network, and a fully connected network connected to the semantic representation network.
The semantic repetition prediction model may include a plurality of mutually independent semantic representation networks, for example three, whose network parameters are shared and which are used to extract the deep semantic representation of the input text, of the first candidate search term, and of the second candidate search term, respectively.
Extracting the deep semantic representations of the input text, the first candidate search term and the second candidate search term through mutually independent semantic representation networks with shared network parameters can improve the efficiency of calculating the text-semantics-based repetition.
FIG. 4 is a flow diagram of computing the text-semantics-based repetition in one embodiment. Referring to FIG. 4, after obtaining the input text, the first candidate search term and the second candidate search term, the server inputs each of them into one of the semantic representation networks with shared network parameters, obtaining the deep semantic representation of the input text, of the first candidate search term and of the second candidate search term. The server may then splice the three deep semantic representations to obtain a spliced deep semantic representation, input the spliced deep semantic representation into the fully connected network of the semantic repetition prediction model, and take the output of the fully connected network as the text-semantics-based repetition between the first candidate search term and the second candidate search term with the input text as the context.
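The data flow of FIG. 4 (shared encoder applied three times, splicing, fully connected layer) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `encode` is a hashed character-trigram embedding standing in for the BERT-style semantic representation network, the fully connected network is reduced to a single linear layer with a sigmoid, and the weights are untrained placeholders rather than learned parameters.

```python
import hashlib
import math

DIM = 8  # toy embedding size (assumption)


def encode(text):
    """Hypothetical stand-in for the shared semantic representation network.

    Hashes character trigrams into a fixed-size, L2-normalised vector.
    """
    vec = [0.0] * DIM
    padded = "  " + text.lower() + " "
    for i in range(len(padded) - 2):
        digest = hashlib.sha256(padded[i:i + 3].encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def predict_repetition(input_text, term_a, term_b, weights, bias=0.0):
    """Splice the three representations and apply one linear layer + sigmoid."""
    if len(weights) != 3 * DIM:
        raise ValueError("weights must have length 3 * DIM")
    # The same encoder (i.e. shared parameters) is applied to all three texts.
    spliced = encode(input_text) + encode(term_a) + encode(term_b)
    logit = sum(w * x for w, x in zip(weights, spliced)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

With untrained weights the returned score carries no meaning; the sketch only shows that one encoder serves all three inputs and that the prediction is made from the spliced vector.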
In the embodiment, the text-semantic-based repeatability among candidate search terms is obtained by using the pre-constructed semantic-based repeatability prediction model, so that the accuracy of the repeatability can be improved, and the accuracy of the repeatability can be further improved by combining the input text to perform text-semantic-based repeatability calculation.
In one embodiment, the training of the semantic repetition prediction model comprises: obtaining training samples of the semantic repetition prediction model, wherein the training samples include positive samples and negative samples, a positive sample consists of a sample input text and candidate search terms of that sample input text that are labeled as repeated, and a negative sample consists of the sample input text and candidate search terms of that sample input text that are not labeled as repeated; and optimizing the network parameters of the semantic representation network and the fully connected network in the semantic repetition prediction model based on the positive samples and negative samples until training stops.
Because the semantic repetition prediction model predicts the text-semantics-based repetition between candidate search terms with the input text as the context, its training samples also need to take that context into account.
Taking video search as an example: the server may acquire log data that includes historical input texts and their corresponding candidate search terms, and perform heuristic mining on the log data. A historical input text is used as a sample input text; repeated candidate search terms are then manually labeled among the candidate search terms corresponding to that sample input text as context, and the sample input text together with the repeated candidate search terms forms a positive sample in the training samples. The server may randomly select non-repeated candidate search terms from the candidate search terms corresponding to the sample input text, label them as non-repeated, and use the sample input text together with the non-repeated candidate search terms as a negative sample. The server may then input the positive and negative samples into the semantic repetition prediction model to optimize the network parameters of its semantic representation network and fully connected network until training stops.
In the above embodiment, the sample input text is used as the context to obtain the corresponding positive and negative samples, and the model is trained with them, so that the semantic repetition prediction model acquires the ability to calculate the text-semantics-based repetition of candidate search terms with the input text as the context.
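The construction of positive and negative samples described above might look like the following sketch. The data layout is assumed for illustration: `log` maps each sample input text to its candidate search terms, and `duplicate_pairs` holds the manually labeled repeated pairs. The function name `build_training_samples` and the `negatives_per_text` parameter are hypothetical.

```python
import random


def build_training_samples(log, duplicate_pairs, negatives_per_text=1, seed=0):
    """Build (input_text, term_a, term_b, label) tuples.

    log:             dict mapping input text -> list of candidate search terms
    duplicate_pairs: dict mapping input text -> set of frozenset({a, b})
                     pairs that annotators labeled as repeated
    """
    rng = random.Random(seed)
    samples = []
    for text, terms in log.items():
        labeled = duplicate_pairs.get(text, set())
        # Positive samples: manually labeled repeated pairs in this context.
        for pair in labeled:
            a, b = sorted(pair)
            samples.append((text, a, b, 1))
        # Negative samples: randomly chosen pairs not labeled as repeated.
        non_dup = [
            (a, b)
            for i, a in enumerate(terms)
            for b in terms[i + 1:]
            if frozenset((a, b)) not in labeled
        ]
        for a, b in rng.sample(non_dup, min(negatives_per_text, len(non_dup))):
            samples.append((text, a, b, 0))
    return samples
```

Every sample carries the input text alongside the two candidate search terms, so the context is available to the model during training.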
Step S206, counting the repetition degree between the search results of any two candidate search terms to obtain the repetition degree between any two candidate search terms based on the search results.
The search results are content obtained by searching based on the search term, and the search results are correspondingly different in different search scenes, for example, the search results are video in a video search scene, the search results are commodity in a commodity search scene, and the search results are images in an image search scene.
If the search results corresponding to the two candidate search terms are substantially identical, then the two candidate search terms are repeated with a high probability, i.e., the higher the identity of the search results corresponding to the two candidate search terms, the higher the repetition of the two candidate search terms based on the search results.
Taking the first candidate search term and the second candidate search term as examples for introduction:
after obtaining the first candidate search term and the second candidate search term, the server searches with the first candidate search term to obtain a search result corresponding to the first candidate search term, and searches with the second candidate search term to obtain a search result corresponding to the second candidate search term; the server may then compare the consistency between the search results corresponding to the first candidate search term and the search results corresponding to the second candidate search term, and if the consistency is higher, determine that the first candidate search term and the second candidate search term are higher in the repetition based on the search results.
In the commodity searching scene, when the server compares the consistency between the searching results corresponding to the first candidate searching term and the searching results corresponding to the second candidate searching term, the server can compare the consistency of the searching results in terms of commodity links, commodity names, commodity purposes and the like. In the image searching scene, when the server compares the consistency between the search results corresponding to the first candidate search term and the search results corresponding to the second candidate search term, the server can compare the consistency of the search results in terms of image semantics, image pixel distribution and the like. In the video searching scene, the consistency comparison of the searching results can be carried out in terms of video duration, video playing times, video average integrity and the like.
Step S208, counting the repetition degree between the tag features corresponding to the historical click object groups of any two candidate search terms to obtain the repetition degree between any two candidate search terms based on the object click behaviors; the tag features characterize the interestingness distribution of the historical click object group for the content of various types of tags.
The historical click object group of a candidate search term may include the historical objects that have clicked that candidate search term; such objects may be referred to as historical click objects.
The tag feature corresponding to a historical click object group represents the interestingness distribution of the group over the contents of the various tags, and can be recorded as [(tag_1, prob_1), (tag_2, prob_2), (tag_3, prob_3), ..., (tag_n, prob_n)], where tag_n represents tag n and prob_n represents the interest level of the historical click object group in the content of tag n, which may be characterized by a probability.
In the same context of the input text described in step S202, if two candidate search terms are clicked by similar historical click object groups, the two candidate search terms satisfy those groups to a similar degree in that context. This means that the two candidate search terms very probably express the same meaning, that is, they may be repeated candidate search terms. Conversely, the less likely it is that similar objects click both candidate search terms, the lower the repetition between the two candidate search terms based on object click behavior.
Taking the first candidate search term and the second candidate search term as examples for introduction:
after obtaining the first candidate search term and the second candidate search term, the server may obtain the tag feature corresponding to the historical click object group of the first candidate search term, recorded as [(tag_1, prob_11), (tag_2, prob_21), (tag_3, prob_31), ..., (tag_n, prob_n1)], and the tag feature corresponding to the historical click object group of the second candidate search term, recorded as [(tag_1, prob_12), (tag_2, prob_22), (tag_3, prob_32), ..., (tag_n, prob_n2)]. Then, for these two tag features, the server calculates the consistency of the interestingness distributions over the contents of the various tags. The higher the consistency, the higher the repetition between the two tag features, and accordingly the higher the repetition between the first candidate search term and the second candidate search term based on object click behavior.
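The text does not fix a particular measure for the consistency of two interestingness distributions. As one possible choice, assumed here purely for illustration, cosine similarity can be computed over the two tag features; the function name `tag_feature_repetition` is hypothetical.

```python
import math


def tag_feature_repetition(feature_a, feature_b):
    """Cosine similarity between two tag features (an assumed consistency measure).

    Each feature is a list of (tag, prob) pairs; tags missing on one side
    are treated as having interest 0.
    """
    a, b = dict(feature_a), dict(feature_b)
    tags = set(a) | set(b)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in tags)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical distributions give a value of 1 and distributions with no tags in common give 0, which matches the stated direction: higher consistency means higher repetition based on object click behavior. Other measures, such as one based on a divergence between distributions, would fit the description equally well.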
Step S210, performing deduplication on a plurality of candidate search terms based on the repetition degree based on text semantics, the repetition degree based on search results and the repetition degree based on object clicking behaviors between any two candidate search terms.
Taking the first candidate search term and the second candidate search term as examples for introduction:
through steps S204 to S208, the server obtains the text-semantics-based repetition, the search-result-based repetition, and the object-click-behavior-based repetition between the first candidate search term and the second candidate search term, and then combines the repetition in these three dimensions to obtain the overall repetition between the two candidate search terms. The higher this repetition, the closer the meanings expressed by the first and second candidate search terms and the higher the probability that they are repeated. The server can therefore reject one of the two candidate search terms, retain the other, and feed back the retained candidate search term to the terminal.
In the above processing method for search terms, after a plurality of candidate search terms corresponding to an input text are obtained, the repetition between any two candidate search terms is determined in several dimensions. Specifically: the semantic representations of the two candidate search terms are mined together with the semantic representation of the input text to determine the text-semantics-based repetition between them; the search-result-based repetition between them is obtained from the repetition between their search results; and the historical click object groups of the two candidate search terms are determined, and the object-click-behavior-based repetition between them is obtained from the repetition between the interestingness distributions of those groups over the contents of the various tags. Deduplication is thus not limited to the candidate search terms themselves; the information dimensions on which deduplication can rely are expanded, and the deduplication effect is improved. Finally, the candidate search terms are deduplicated based on the multi-dimensional repetition between any two of them, which improves the deduplication of the candidate search terms and ultimately the search efficiency and search experience of the user.
In one embodiment, when calculating the search-result-based repetition between any two candidate search terms, the server may acquire a first search result list corresponding to the first candidate search term; acquire a second search result list corresponding to the second candidate search term; and count the repetition between the top-ranked search results in the first search result list and the top-ranked search results in the second search result list to obtain the search-result-based repetition between the first candidate search term and the second candidate search term, wherein a search result ranks higher in its list the higher its relevance to the corresponding candidate search term.
Taking video search and the first and second candidate search terms as an example, in which case the search results are videos:
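This passage does not specify how the three repetition values are combined or which of two repeated terms is retained. The sketch below assumes a weighted average compared against a threshold and keeps the higher-ranked term; the weights, the threshold of 0.8, and the function names are hypothetical choices for illustration.

```python
def combined_repetition(p_semantic, p_search, p_click, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the three repetition values (assumed combination rule)."""
    w_sem, w_search, w_click = weights
    return w_sem * p_semantic + w_search * p_search + w_click * p_click


def deduplicate(terms, pair_scores, threshold=0.8):
    """Greedy deduplication that keeps the earlier (higher-ranked) term.

    terms:       candidate search terms in ranking order
    pair_scores: dict mapping frozenset({a, b}) -> (p_semantic, p_search, p_click)
    """
    kept = []
    for term in terms:
        is_duplicate = any(
            combined_repetition(
                *pair_scores.get(frozenset((term, other)), (0.0, 0.0, 0.0))
            ) >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(term)
    return kept
```

Pairs without a recorded score are treated as non-repeated, so a term is removed only when there is positive evidence that it repeats a term already kept.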
after obtaining the first candidate search term and the second candidate search term, the server takes the first candidate search term as an index, searches videos corresponding to the first candidate search term in a video search index library, and forms a search result list corresponding to the first candidate search term, wherein the search result list corresponding to the first candidate search term can be called as a first search result list; then, the server may use the second candidate search term as an index, and search the video corresponding to the second candidate search term in the video search index library to form a search result list corresponding to the second candidate search term, where the search result list corresponding to the second candidate search term may be referred to as a second search result list.
Because the video and the candidate search terms have the correlation degree in terms of semantics, after the first search result list and the second search result list are obtained, videos in each search result list can be respectively ordered, and the higher the correlation degree between the videos and the candidate search terms is, the higher the video ordering is.
The server calculates the semantic relevance between the first candidate search term and each video in the first search result list and sorts the videos in that list accordingly, with more relevant videos ranked earlier. Similarly, the server calculates the semantic relevance between the second candidate search term and each video in the second search result list and sorts the videos in that list, with more relevant videos ranked earlier.
The server may then determine a top-ranked plurality of videos (e.g., top 200) in the first search result list and a top-ranked plurality of videos (e.g., top 200) in the second search result list, and determine a degree of repetition between the top-ranked plurality of videos in the first search result list and the top-ranked plurality of videos in the second search result list, thereby obtaining a degree of repetition between the first candidate search term and the second candidate search term based on the search results.
In the above embodiment, according to the correlation between the candidate search term and the search result, a part of the search result is selected from the search result list to calculate the repetition degree, so that the accuracy of the repetition degree is ensured while the calculation efficiency of the repetition degree is improved.
In one embodiment, in a video search scenario, the search results obtained by searching with the search terms are videos. The server can determine the video intersection and the video union formed by the top-ranked videos in the first search result list and the top-ranked videos in the second search result list; for each video in the video union, compute a statistic from the video's duration, its number of plays in a first preset time period and its playing average integrity; for each video in the video intersection, compute a statistic in the same way; and obtain the search-result-based repetition between the first candidate search term and the second candidate search term from the ratio of the sum of the statistics of all videos in the video intersection to the sum of the statistics of all videos in the video union.
The duration of the video is the duration of the video itself, for example, a certain video has a duration of 4 minutes, and the playing duration of the video is the duration of the video being played, for example, the video having a duration of 4 minutes, and the playing duration is 2 minutes. The playing average integrity of the video characterizes the situation that the video is played completely, and the higher the playing average integrity of the video is, the more the video is played completely.
The video union includes the top-ranked videos in the first search result list and the top-ranked videos in the second search result list. Taking the top 200 as an example, the video union may include the top 200 videos in the first search result list and the top 200 videos in the second search result list.
The video intersection comprises videos which repeatedly appear in the first search result list and the second search result list; for example, a video is a video in both the first search result list and the second search result list, and then the video may be considered as a video included in the video intersection.
Take as an example a video union that comprises video a, video b, video c and video d, and a video intersection that comprises video a and video d:
after determining the video union and the video intersection formed by the top-ranked videos in the first search result list and the top-ranked videos in the second search result list, the server computes a statistic for each video in the video union. Taking video a as an example: the server can acquire the duration of video a, the number of plays of video a in the first preset time period T1 and the playing average integrity of video a, and combine these three quantities to obtain the statistic of video a. One way to combine them is to multiply the duration of video a, the number of plays of video a in the first preset time period T1 and the playing average integrity of video a; the statistic of video a is then the resulting product.
The server calculates the statistics of the other videos in the video union in the same way as for video a, obtaining the statistic of video b, the statistic of video c and the statistic of video d. The server may then calculate the sum of the statistics of video a, video b, video c and video d, which are included in the video union, and record it as CA.
For each video in the video intersection, for example, the statistics of video a and the statistics of video d, the above calculation results may be directly multiplexed to obtain the statistics of video a and the statistics of video d. The server may then calculate the sum of the statistics of each video in the video intersection, i.e., the sum of the statistics of video a and the statistics of video d, and record as CI.
Then, the server may obtain the repetition (denoted P_search) between the top-ranked search results in the first search result list and the top-ranked search results in the second search result list based on CA and CI, where CA and P_search are inversely related and CI and P_search are positively related. Specifically, the server may use the ratio of CI to CA as P_search, i.e., P_search = CI / CA, and use P_search as the search-result-based repetition between the first candidate search term and the second candidate search term.
In the above embodiment, the repeatability based on the search result between candidate search terms is obtained from three aspects of the duration of the video, the playing times of the video in the first preset time period and the playing average integrity of the video, so that the accuracy of the repeatability can be improved.
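The computation of P_search = CI / CA can be sketched as follows. The product form of the per-video statistic and the top-200 cut-off follow the text; the dictionary layout of `stats`, its key names, and the function names are assumptions made for the example.

```python
def video_statistic(video):
    """Product of duration, plays in the first preset period, and average integrity."""
    return video["duration"] * video["plays"] * video["avg_integrity"]


def search_result_repetition(first_list, second_list, stats, top_k=200):
    """Return P_search = CI / CA for two ranked lists of video ids.

    first_list, second_list: video ids sorted by relevance, best first
    stats: dict mapping video id -> dict with keys
           "duration", "plays", "avg_integrity" (hypothetical layout)
    """
    top_first = set(first_list[:top_k])
    top_second = set(second_list[:top_k])
    union = top_first | top_second
    intersection = top_first & top_second
    ca = sum(video_statistic(stats[v]) for v in union)         # CA: sum over the union
    ci = sum(video_statistic(stats[v]) for v in intersection)  # CI: sum over the intersection
    return ci / ca if ca else 0.0
```

Because each video is weighted by its statistic, two lists that share their long, frequently and fully played videos score higher than two lists that share only rarely played ones, even when the number of shared videos is the same.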
In one embodiment, regarding the average integrity of playing of the video, the server may obtain the playing duration and the playing times of the video within the second preset period; and taking the ratio between the playing time length and the playing times of the video in the second preset period as the playing average integrity of the video.
The second preset period may be the same as the first preset period, or may be different from the first preset period, and when the second preset period is different from the first preset period, the second preset period may be longer than the first preset period.
Taking video a as an example: the server obtains the playing duration of video a in the second preset time period and the number of plays of video a in the second preset time period, and then takes the ratio of that playing duration to that number of plays as the playing average integrity of video a.
In the above embodiment, based on the ratio between the playing time length and the playing times of the video in the second preset period, the playing average integrity of the video is obtained, so that the situation that the video is completely played can be more comprehensively and accurately reflected.
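As a small worked example of this ratio, the function below divides the total playing duration in the second preset period by the number of plays in that period. The function name and the handling of a video with no plays are assumptions.

```python
def play_average_integrity(total_play_duration, play_count):
    """Ratio of total playing duration to number of plays in the second preset period."""
    if play_count == 0:
        return 0.0  # assumed convention for a video that was never played
    return total_play_duration / play_count
```

For instance, a video played 60 times for a total of 120 minutes has a playing average integrity of 2 minutes per play, which corresponds to the earlier example of a 4-minute video being played for 2 minutes.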
In one embodiment, when calculating the repetition degree based on the object clicking behaviors between any two candidate search terms, the server may determine a historical clicking object group of the candidate search terms, obtain, for each historical clicking object in the historical clicking object group, the interestingness distribution of each historical clicking object in the content of each type of tag according to the viewing record of the historical clicking object, and calculate the average value of the interestingness distribution of each historical clicking object in the content of each type of tag in the historical clicking object group of the candidate search terms, so as to obtain the tag feature corresponding to the historical clicking object group of the candidate search terms; and for any two candidate search terms, obtaining the repetition degree between any two candidate search terms based on the object clicking behaviors according to the repetition degree between the tag features corresponding to the historical clicking object groups of the any two candidate search terms.
The historical click objects of a candidate search term are the objects that have clicked the candidate search term. To improve the accuracy of the repetition degree based on object click behavior, the context of the input text may be taken into account as a condition; under the condition of the input text, the historical click objects of a candidate search term are the objects that have clicked the candidate search term when the entered text was that input text.
Assume that the video platform has n tags: tag_1, tag_2, ..., tag_n, and that each video published on the video platform is marked with its corresponding tags.
Taking the tag feature corresponding to the historical click object group of the first candidate search term as an example:
Under the condition that the entered text is the input text described in step S202, the objects that have clicked the first candidate search term are taken as the historical click objects of the first candidate search term and form its historical click object group, and the interestingness distribution of each of these historical click objects over the content of the various tags is obtained. If the historical click object group of the first candidate search term includes user_1 and user_2, the interestingness distribution of user_1 over the content of the various tags may be calculated as follows:
According to the viewing record of user_1 on the video platform, the interestingness of user_1 in the content of each type of tag can be determined, giving the interestingness distribution of user_1 over the content of the various tags, denoted [(tag_1, prob_user_11), (tag_2, prob_user_12), (tag_3, prob_user_13), ..., (tag_n, prob_user_1n)].
Similarly, the interestingness distribution of user_2 over the content of the various tags can be obtained in the same manner, denoted [(tag_1, prob_user_21), (tag_2, prob_user_22), (tag_3, prob_user_23), ..., (tag_n, prob_user_2n)].
Then, for each tag, the mean of the interestingness of the historical click objects of the first candidate search term in the content under that tag is calculated; for example, for tag_1, prob_user_11 and prob_user_21 are averaged to obtain the mean for tag_1, denoted prob_11. In this way the server may obtain the mean interestingness of the historical click object group of the first candidate search term in the content under the various tags, giving [(tag_1, prob_11), (tag_2, prob_21), (tag_3, prob_31), ..., (tag_n, prob_n1)], which may be referred to as the tag feature corresponding to the historical click object group of the first candidate search term.
The server may obtain, in the same manner, the tag feature corresponding to the historical click object group of the second candidate search term: [(tag_1, prob_12), (tag_2, prob_22), (tag_3, prob_32), ...]. The higher the repetition degree between the tag feature corresponding to the historical click object group of the first candidate search term and the tag feature corresponding to the historical click object group of the second candidate search term, the higher the repetition degree between the first candidate search term and the second candidate search term based on the object click behavior.
In the above embodiment, based on the viewing record of the historical click object of the candidate search term, the interest degree distribution of the historical click object in the content of each type of label is obtained, and the label characteristics corresponding to the historical click object group of the candidate search term are obtained in a mean value mode, so that the interest degree of the historical click object group of the candidate search term in the content of each type of label is accurately reflected, and the accuracy of the repeatability of the candidate search term based on the object click behavior is improved.
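As a non-limiting illustration, the averaging of the interestingness distributions of user_1 and user_2 into a tag feature may be sketched as follows. The function name, the dictionary layout and the numeric values are hypothetical and chosen only for this sketch.

```python
from typing import Dict, List


def group_tag_feature(distributions: List[Dict[str, float]],
                      tags: List[str]) -> Dict[str, float]:
    """Average the per-tag interestingness over all historical click objects."""
    if not distributions:
        # Assumption: an empty historical click object group yields all zeros.
        return {tag: 0.0 for tag in tags}
    count = len(distributions)
    return {tag: sum(d.get(tag, 0.0) for d in distributions) / count
            for tag in tags}


tags = ["tag_1", "tag_2", "tag_3"]
user_1 = {"tag_1": 0.5, "tag_2": 0.25, "tag_3": 0.25}
user_2 = {"tag_1": 0.25, "tag_2": 0.25, "tag_3": 0.5}

# Tag feature of the historical click object group {user_1, user_2}.
feature = group_tag_feature([user_1, user_2], tags)
print(feature)
```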
In one embodiment, the server may calculate a cosine distance between tag features corresponding to the historical click object groups of any two candidate search terms, and obtain a repetition degree between any two candidate search terms based on the object click behavior based on the cosine distance, where the repetition degree based on the object click behavior is inversely related to the cosine distance.
Taking the first candidate search term and the second candidate search term as an example:
After obtaining the tag feature [(tag_1, prob_11), (tag_2, prob_21), (tag_3, prob_31), ..., (tag_n, prob_n1)] corresponding to the historical click object group of the first candidate search term and the tag feature [(tag_1, prob_12), (tag_2, prob_22), (tag_3, prob_32), ..., (tag_n, prob_n2)] corresponding to the historical click object group of the second candidate search term, the server may calculate the cosine distance between the two and obtain, based on the cosine distance, the repetition degree based on the object click behavior (denoted P_user) between the first candidate search term and the second candidate search term; the repetition degree based on the object click behavior is inversely related to the cosine distance. Specifically, it may be calculated as: P_user = 1 - cosine distance.
In the above embodiment, the repetition degree based on the object click behavior between the candidate search terms is obtained based on the cosine distance between the tag features corresponding to the historical click object groups of the candidate search terms under the condition that the cosine distance and the repetition degree based on the object click behavior are inversely related, so that the accuracy of the repetition degree based on the object click behavior is improved.
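As a non-limiting illustration, the calculation P_user = 1 - cosine distance may be sketched as follows, taking the cosine distance as 1 minus the cosine similarity of the two tag features. The function names and the treatment of an all-zero tag feature are assumptions made for this sketch.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero tag feature is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def click_behavior_repetition(feature_1: Sequence[float],
                              feature_2: Sequence[float]) -> float:
    """P_user = 1 - cosine distance between the two tag features."""
    return 1.0 - cosine_distance(feature_1, feature_2)


# Tag features ordered as (tag_1, tag_2, tag_3); the values are illustrative.
p_user = click_behavior_repetition([0.375, 0.25, 0.375], [0.1, 0.8, 0.1])
print(p_user)
```

Because interestingness values are non-negative, P_user computed this way falls between 0 and 1, with 1 indicating tag features that point in the same direction.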
In one embodiment, regarding the interestingness distribution of the content of each historical click object in various labels, the server may determine various labels preset for the video; obtaining labels of videos watched by the historical click object according to the watching record of the historical click object, and determining the interestingness of the historical click object in the videos of various labels according to the video duration, the playing completion degree and the playing time point of the corresponding videos watched by the historical click object for various labels; and normalizing the interestingness of the historical click object in the videos of various labels to obtain the interestingness distribution of the historical click object in the videos of various labels.
The playing completion degree of an object for a watched video represents the proportion of the video content that the object watched relative to the complete video content; the higher the playing completion degree, the closer the content watched by the object is to the complete video content. Specifically, it may be determined as the ratio between the playing duration of the video and the duration of the video. The playing time point of an object for a watched video is the time at which the object watched the video.
Taking the interestingness of user_1, a historical click object of the first candidate search term, in the videos of tag_1 as an example:
On a video platform, various tags can be preset for videos, and according to the viewing record of user_1, it can be determined which videos user_1 has watched and which tags are marked on those videos. When calculating the interestingness of user_1 in tag_1, the server may obtain the video duration of the videos belonging to tag_1 watched by user_1, the playing completion degree of user_1 for those videos, and the playing time point at which user_1 watched those videos, so as to obtain the interestingness of user_1 in the videos of tag_1. Specifically, it may be calculated as: interestingness of user_1 in the videos of tag_1 = (video duration of the video belonging to tag_1 watched by user_1 × playing completion degree of user_1 for the video belonging to tag_1) / log(e + number of weeks between the playing time point of the video belonging to tag_1 watched by user_1 and the current time point).
The server can obtain the interestingness of user_1 in tag_2, tag_3, ..., tag_n in the same manner as for tag_1, and normalize these values to form the interestingness distribution of user_1 over the various tags: [(tag_1, prob_user_11), (tag_2, prob_user_12), (tag_3, prob_user_13), ..., (tag_n, prob_user_1n)].
In the above embodiment, when calculating the interest degree distribution of the historical click object in the videos of various tags, the video duration, the playing completion degree and the playing time point of the videos of various tags watched by the historical click object are combined, so that the interest degree distribution can accurately reflect the interest degree of the historical click object in the videos of various tags, the interest degree of the historical click object in the videos of various tags is normalized, the comparison calculation can be performed on the same dimension, and the accuracy of the repetition degree based on the object click behavior among candidate search terms is improved.
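As a non-limiting illustration, the interestingness calculation and its normalization may be sketched as follows. The record layout and function name are hypothetical. Summing the per-video values when an object has watched several videos under the same tag is an assumption of this sketch, since the embodiment states the formula for a single tag without specifying the aggregation.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical viewing record: (tag, video duration in seconds,
# playing completion degree in [0, 1], weeks elapsed since the viewing).
ViewRecord = Tuple[str, float, float, float]


def interest_distribution(records: Iterable[ViewRecord],
                          tags: List[str]) -> Dict[str, float]:
    """Per-tag interestingness, normalized so that the values sum to 1."""
    raw = defaultdict(float)
    for tag, duration, completion, weeks_ago in records:
        # duration x completion / log(e + weeks elapsed); older views count less.
        raw[tag] += duration * completion / math.log(math.e + weeks_ago)
    total = sum(raw[tag] for tag in tags)
    if total == 0.0:
        return {tag: 0.0 for tag in tags}
    return {tag: raw[tag] / total for tag in tags}


records = [
    ("tag_1", 100.0, 1.0, 0.0),   # watched in full this week
    ("tag_2", 100.0, 1.0, 10.0),  # same video length, watched ten weeks ago
]
distribution = interest_distribution(records, ["tag_1", "tag_2", "tag_3"])
print(distribution)
```

With equal duration and completion, the more recent viewing receives the larger share, which is the effect of the logarithmic time decay in the denominator.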
In one embodiment, the server may perform weighted summation on the repetition degree based on text semantics, the repetition degree based on search results, and the repetition degree based on object click behaviors between any two candidate search terms to obtain the repetition degree between any two candidate search terms; identifying a plurality of repeated groups from the plurality of candidate search terms based on the repetition degree between any two candidate search terms, wherein the repetition degree between candidate search terms belonging to the same repeated group in the plurality of repeated groups is higher than a set threshold value, and the repetition degree between candidate search terms belonging to different repeated groups is lower than the set threshold value; and de-duplicating candidate search terms belonging to the same repeated group.
Denote the repetition degree based on text semantics as P_text, the repetition degree based on search results as P_search, and the repetition degree based on object click behavior as P_user. Taking the first candidate search term and the second candidate search term as an example:
After obtaining P_text, P_search and P_user between the first candidate search term and the second candidate search term, the server may perform a weighted summation of P_text, P_search and P_user according to preset weights, thereby obtaining the repetition degree between the first candidate search term and the second candidate search term. Specifically, it may be calculated as: repetition degree between the first candidate search term and the second candidate search term = w1 × P_text + w2 × P_search + w3 × P_user, where w1, w2 and w3 may be floating-point weights with w1 + w2 + w3 = 1.0. The server may determine reasonable values of w1, w2 and w3 by performing a grid search on a data set, for example w1 = 0.5, w2 = 0.15 and w3 = 0.35.
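As a non-limiting illustration, the weighted summation and a simple grid search over the weights may be sketched as follows. The function names, the labeled pairs, the decision threshold, the step size and the use of accuracy as the selection criterion are all assumptions of this sketch; the embodiment only states that a grid search on a data set may be used.

```python
from itertools import product
from typing import List, Tuple

Scores = Tuple[float, float, float]  # (P_text, P_search, P_user)


def combined_repetition(p_text: float, p_search: float, p_user: float,
                        w1: float = 0.5, w2: float = 0.15, w3: float = 0.35) -> float:
    """Weighted sum w1 * P_text + w2 * P_search + w3 * P_user."""
    return w1 * p_text + w2 * p_search + w3 * p_user


def grid_search_weights(labeled_pairs: List[Tuple[Scores, bool]],
                        threshold: float = 0.5, step: float = 0.05):
    """Try weights on a grid with w1 + w2 + w3 = 1 and keep the most accurate."""
    if not labeled_pairs:
        raise ValueError("labeled_pairs must not be empty")
    steps = round(1.0 / step)
    best_weights, best_accuracy = None, -1.0
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        w1, w2 = i * step, j * step
        w3 = max(0.0, 1.0 - w1 - w2)  # guard against tiny negative rounding
        correct = sum(
            (combined_repetition(*scores, w1, w2, w3) > threshold) == is_duplicate
            for scores, is_duplicate in labeled_pairs
        )
        accuracy = correct / len(labeled_pairs)
        if accuracy > best_accuracy:
            best_weights, best_accuracy = (w1, w2, w3), accuracy
    return best_weights, best_accuracy


# Hypothetical labeled pairs: scores plus whether annotators judged them repeated.
pairs = [((0.9, 0.1, 0.1), True), ((0.1, 0.9, 0.9), False)]
weights, accuracy = grid_search_weights(pairs)
print(weights, accuracy)
```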
For the plurality of candidate search terms of the input text described in step S202, the server may obtain the repetition degree between every two candidate search terms in the above manner and, based on these repetition degrees, identify a plurality of repeated groups from among the candidate search terms, where the repetition degree between candidate search terms of the same repeated group is higher than a set threshold and the repetition degree between candidate search terms of different repeated groups is lower than the set threshold. After obtaining the repeated groups, the server may determine whether to deduplicate the candidate search terms of a repeated group based on the number of candidate search terms it includes: for example, if a repeated group includes only one candidate search term, no deduplication is needed for that group, and if a repeated group includes two or more candidate search terms, the candidate search terms in that group may be deduplicated.
The candidate search terms in the same repeated group may be deduplicated by randomly selecting one candidate search term to retain, or by retaining a candidate search term according to the interests of the object that entered the input text described in step S202.
In the above embodiment, after the repetition degree based on text semantics, the repetition degree based on search results and the repetition degree based on object clicking behaviors between candidate search terms are obtained, weighted summation may be performed according to preset weights, and grouping and deduplication may be performed based on the repetition degree obtained by weighted summation, so that the deduplication accuracy of the candidate search terms may be improved.
In one embodiment, the server may construct a candidate search term distance map based on the degree of repetition between any two candidate search terms; nodes in the candidate search term distance graph represent candidate search terms, and the distance between the nodes is inversely related to the repetition degree between the candidate search terms; and excavating the candidate search term distance graph to obtain a plurality of repeated groups.
For the plurality of candidate search terms of the input text described in step S202, the server may take the candidate search terms as nodes and construct the candidate search term distance graph, with the distance between nodes determined by the repetition degree between the corresponding candidate search terms; the higher the repetition degree between candidate search terms, the closer the distance between the corresponding nodes.
FIG. 5 is a candidate search term distance graph in one embodiment. Referring to FIG. 5, each candidate search term of the input text described in step S202 is taken as a node, forming nodes 1 to 11, where the farther apart two nodes are, the lower the repetition degree between the corresponding candidate search terms. The server may then mine the candidate search term distance graph to obtain a plurality of repeated groups. When mining the candidate search term distance graph, the server may mine it iteratively through a community discovery algorithm (such as a modularity-based graph algorithm), so as to ensure that the distance between candidate search terms in the same repeated group is smaller than a threshold and the distance between repeated groups is larger than the threshold.
In the above embodiment, after the repetition degrees among the candidate search terms are integrated in multiple dimensions, the candidate search terms can be grouped in a graph mining manner, so that the duplication removal efficiency of the candidate search terms is improved.
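As a non-limiting illustration, the grouping may be sketched as follows. This sketch substitutes a much simpler technique for the community discovery algorithm of the embodiment: it links two candidate search terms whenever their repetition degree exceeds the threshold and returns the connected components, using a union-find structure. Unlike a modularity-based algorithm, connected components do not guarantee that every pair inside a group exceeds the threshold. The function name and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def repeated_groups(terms: List[str],
                    repetition: Dict[Tuple[str, str], float],
                    threshold: float) -> List[List[str]]:
    """Group terms linked by repetition > threshold (connected components)."""
    parent = {term: term for term in terms}

    def find(term: str) -> str:
        # Follow parent links to the root, halving the path along the way.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for (a, b), degree in repetition.items():
        if degree > threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for term in terms:
        groups.setdefault(find(term), []).append(term)
    return list(groups.values())


terms = ["a", "b", "c", "d", "e"]
repetition = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.2,
              ("c", "d"): 0.8, ("d", "e"): 0.3}
groups = repeated_groups(terms, repetition, threshold=0.6)
print(groups)
```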
In one embodiment, the server may obtain the object tag sequence corresponding to the object that entered the input text; for a repeated group in which the number of candidate search terms is greater than 1, input each candidate search term in the repeated group, together with the object tag sequence and the input text, into a click probability prediction model, which outputs the click probability of the object on that candidate search term; and, within the repeated group, eliminate candidate search terms whose click probability is lower than a threshold to obtain the deduplicated candidate search terms.
The object tag sequence of an object represents the tags of the content in which the object is interested; for example, if the object is interested in videos of tag_1 and tag_3, the object tag sequence of the object is [tag_1, tag_3]. The click probability prediction model may be used to predict, for each candidate search term corresponding to the input text described in step S202 and with that input text as context, the probability that the candidate search term is clicked by the object that entered the input text.
For a repeated group including candidate search terms with a number greater than 1, the server may deduplicate the candidate search terms within the repeated group, retaining one of the candidate search terms.
Taking a repeated group that includes the first candidate search term and the second candidate search term as an example:
The server may obtain the object tag sequence of the object that entered the input text described in step S202, and input the object tag sequence, the first candidate search term and the input text described in step S202 into the click probability prediction model. The result output by the click probability prediction model is the probability that, under the condition of the input text described in step S202, the first candidate search term is clicked by the object that entered the input text, that is, the click probability of the object on the first candidate search term. Similarly, the server may obtain the click probability of the object on the second candidate search term, and eliminate the candidate search terms whose click probability is lower than the threshold to obtain the deduplicated candidate search terms. When deduplicating the first candidate search term and the second candidate search term, the server may eliminate the one with the smaller click probability and retain the one with the larger click probability.
In the above embodiment, with the input text as context and in combination with the object tag sequence of the object, the click probability of the object on each candidate search term is obtained through the click probability prediction model and the candidate search terms are deduplicated accordingly; by taking the object tags into account, personalized deduplication is achieved.
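As a non-limiting illustration, retaining the candidate search term with the largest click probability in each repeated group may be sketched as follows. The function name is hypothetical, and the click probabilities are supplied by a lookup table standing in for the click probability prediction model.

```python
from typing import Callable, List


def deduplicate(groups: List[List[str]],
                click_probability: Callable[[str], float]) -> List[str]:
    """Keep one candidate search term per repeated group."""
    kept = []
    for group in groups:
        if len(group) == 1:
            # A group with a single candidate needs no deduplication.
            kept.append(group[0])
        else:
            # Retain the candidate the object is most likely to click.
            kept.append(max(group, key=click_probability))
    return kept


# Stand-in for the model output; the values are illustrative only.
probabilities = {"term_1": 0.7, "term_2": 0.4, "term_3": 0.2}
result = deduplicate([["term_1", "term_2"], ["term_3"]], probabilities.__getitem__)
print(result)  # ['term_1', 'term_3']
```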
In one embodiment, the server may input any candidate search term, object tag sequence and input text within the repeated group into the click probability prediction model; respectively obtaining depth semantic representations of candidate search terms, depth semantic representations of object tag sequences and depth semantic representations of input texts through a semantic representation network of a click probability prediction model, and then performing splicing processing to obtain spliced depth semantic representations; and predicting the click probability of the object on the candidate search term according to the spliced deep semantic representation by a fully connected network connected with the semantic representation network in the click probability prediction model.
The click probability prediction model may include a semantic representation network, which may be a bi-directional encoder representation network, a fully connected network connected to the semantic representation network, and a classification network connected to the fully connected network.
The click probability prediction model may include a plurality of independent semantic representation networks, the number of which may be 3, network parameters of which are shared, for extracting a deep semantic representation of the input text, a deep semantic representation of the candidate search term, and a deep semantic representation of the object tag sequence, respectively.
The depth semantic representation of the input text, the depth semantic representation of the candidate search term and the depth semantic representation of the object tag sequence are respectively extracted through the mutually independent semantic representation network shared by network parameters, so that the prediction efficiency of the click probability can be improved.
Taking the calculation of the click probability of the object on the first candidate search term as an example:
FIG. 6 is a flow diagram of predicting the click probability on the first candidate search term in one embodiment. Referring to FIG. 6, after obtaining the input text described in step S202, the first candidate search term and the corresponding object tag sequence, the server inputs them into the respective semantic representation networks, whose network parameters are shared, to obtain the deep semantic representation of the input text, the deep semantic representation of the first candidate search term and the deep semantic representation of the object tag sequence, and splices these to obtain a spliced deep semantic representation. The server then inputs the spliced deep semantic representation into the fully connected network connected to the semantic representation network in the click probability prediction model, inputs the result output by the fully connected network into the classification network, and takes the result output by the classification network as the click probability of the object on the first candidate search term.
In the embodiment, the click probability prediction model constructed in advance is beneficial to obtaining the click probability of the object on the candidate search term under the condition of inputting the text, so that the accuracy of the click probability is improved, and the personalized search term de-duplication is realized.
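As a non-limiting illustration, the data flow of the click probability prediction model (shared encoder, splicing, fully connected layer, classification) may be sketched as follows. This is a toy forward pass with random, untrained parameters: the encoder is a mean of per-token random embeddings standing in for the bi-directional encoder representation network, and a sigmoid stands in for the classification network. All names, dimensions and values are assumptions of this sketch.

```python
import math
import random
from typing import Dict, List

DIM = 8
_rng = random.Random(0)
# One embedding table used for all three inputs, mimicking shared parameters.
_embeddings: Dict[str, List[float]] = {}


def encode(text: str) -> List[float]:
    """Toy shared encoder: mean of per-token embeddings."""
    tokens = text.split() or [""]
    vectors = []
    for token in tokens:
        if token not in _embeddings:
            _embeddings[token] = [_rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        vectors.append(_embeddings[token])
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy fully connected layer mapping the spliced representation to one logit.
fc_weights = [_rng.uniform(-0.5, 0.5) for _ in range(3 * DIM)]
fc_bias = 0.0


def click_probability(input_text: str, candidate: str,
                      tag_sequence: List[str]) -> float:
    """Encode the three inputs, splice them, and map to a probability."""
    spliced = (encode(input_text) + encode(candidate)
               + encode(" ".join(tag_sequence)))
    logit = sum(w * x for w, x in zip(fc_weights, spliced)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid as the classification step


probability = click_probability("cat", "cat videos", ["tag_1", "tag_3"])
print(probability)
```

The printed value is meaningless until the parameters are trained; the sketch only shows how the three representations are produced by one encoder and combined.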
In one embodiment, the training step of the click probability prediction model includes: the server can acquire a training sample of the click probability prediction model, wherein the training sample of the click probability prediction model comprises a positive example sample and a negative example sample, the positive example sample is composed of a sample input text, a candidate search term clicked by an object inputting the sample input text and an object tag sequence of the object inputting the sample input text, and the negative example sample is composed of the sample input text, a candidate search term not clicked by the object inputting the sample input text and an object tag sequence of the object inputting the sample input text; based on positive examples and negative examples included in the training samples of the click probability prediction model, optimizing network parameters of the semantic representation network and the full-connection network in the click probability prediction model until training is stopped.
The server may obtain log data including the historical input text and corresponding candidate search terms; the server can take the history input text as a sample input text, then determines candidate search terms clicked by an object inputting the sample input text from candidate search terms corresponding to the context of the sample input text, and takes the sample input text, the candidate search terms clicked by the object inputting the sample input text and an object tag sequence of the object inputting the sample input text as positive sample; the server may randomly determine candidate search terms that the object inputting the sample input text does not click on, and take the sample input text, the candidate search terms that the object inputting the sample input text does not click on, and the object tag sequence of the object inputting the sample input text as negative examples. Then, the server can input the positive example sample and the negative example sample into the click probability prediction model to optimize the respective network parameters of the semantic representation network, the full-connection network and the classification network in the click probability prediction model until training is stopped.
In the above embodiment, when the input text is taken as the context, the corresponding positive example sample and negative example sample are obtained by combining the object tag sequence, and model training is performed by using the positive example sample and the negative example sample, so that the click probability prediction model has corresponding capability, and when the input text is taken as the context, the probability of clicking the corresponding candidate search term by the object can be predicted by combining the object tag sequence of the object corresponding to the input text, thereby realizing personalized deduplication.
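As a non-limiting illustration, the construction of positive example samples and negative example samples from log data may be sketched as follows. The log entry layout, the function name and the choice of one randomly drawn negative per log entry are assumptions of this sketch.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str, Tuple[str, ...], int]  # (input text, term, tags, label)


def build_samples(log_entries: List[Dict], seed: int = 0) -> List[Sample]:
    """Build labeled samples: 1 for clicked terms, 0 for unclicked terms."""
    rng = random.Random(seed)
    samples: List[Sample] = []
    for entry in log_entries:
        tags = tuple(entry["tags"])
        clicked = set(entry["clicked"])
        for term in entry["clicked"]:
            samples.append((entry["input_text"], term, tags, 1))
        unclicked = [c for c in entry["candidates"] if c not in clicked]
        if unclicked:
            # Randomly pick one candidate the object did not click.
            samples.append((entry["input_text"], rng.choice(unclicked), tags, 0))
    return samples


log_entries = [{
    "input_text": "cat",
    "candidates": ["cat videos", "cat food", "cat toys"],
    "clicked": ["cat videos"],
    "tags": ["tag_1", "tag_3"],
}]
samples = build_samples(log_entries)
print(samples)
```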
In order to better understand the above method, an application example of the processing method for searching for an entry according to the present application is described in detail below. The application embodiment corresponds to a video search scene, and accordingly, the search result is video.
Fig. 7 is a block diagram of a processing method for search terms in one embodiment. Referring to fig. 7, the server may form a plurality of candidate search terms corresponding to the input text of an object, then determine the repetition degree between the candidate search terms from three aspects, namely the repetition degree based on text semantics, the repetition degree based on search results and the repetition degree based on object click behavior, to form repeated groups, and then, for each repeated group, deduplicate the candidate search terms based on the object tag sequence of the object.
Fig. 8 is a flowchart of another embodiment of a processing method for searching for an entry, and referring to fig. 8, the method may be executed by a server or a terminal, or may be executed by the server and the terminal together, and in an embodiment of the present application, the method is executed by the server as an example, and the method mainly includes the following steps:
Step S802, an input text is acquired.
Step S804, the candidate search term database is queried to obtain candidate search terms matched with the input text.
Step S806, sorting the candidate search terms matched with the input text according to the relevance of the input text and the candidate search terms, the heat of the candidate search terms and the historical click rate, so as to obtain a plurality of candidate search terms corresponding to the input text.
Step S808, determining the repetition degree between any two candidate search terms based on text semantics based on semantic representations of the any two candidate search terms and the input text.
Specifically, step S808 may include the steps of:
inputting any two candidate search terms and the input text into a semantic repetition prediction model;
obtaining, through the semantic representation network of the semantic repetition prediction model, the deep semantic representations of the two candidate search terms and the deep semantic representation of the input text, and then splicing them to obtain a spliced deep semantic representation;
and predicting, through the fully connected network connected to the semantic representation network in the semantic repetition prediction model, the repetition degree based on text semantics between the two candidate search terms according to the spliced deep semantic representation.
Step S810, counting the repetition degree between the search results of any two candidate search terms, and obtaining the repetition degree between any two candidate search terms based on the search results.
Taking the case where the two candidate search terms are the first candidate search term and the second candidate search term as an example, step S810 may specifically include the following steps:
acquiring a first search result list corresponding to the first candidate search term;
acquiring a second search result list corresponding to the second candidate search term;
determining a video union and a video intersection formed by a plurality of top-ranked videos in the first search result list and a plurality of top-ranked videos in the second search result list, where the higher a search result is ranked, the higher its relevance to the corresponding candidate search term;
for each video in the video union, collecting statistics on its duration, its playing times in a first preset time period and its playing average integrity to obtain the statistic value of the video;
for each video in the video intersection, collecting statistics on its duration, its playing times in the first preset time period and its playing average integrity to obtain the statistic value of the video;
and obtaining the repetition degree between the first candidate search term and the second candidate search term based on the search results according to the ratio of the sum of the statistic values of all videos in the video intersection to the sum of the statistic values of all videos in the video union.
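As a non-limiting illustration, the sub-steps of step S810 may be sketched as follows. The function names are hypothetical, and combining duration, playing times and playing average integrity by multiplication is an assumption of this sketch, since the steps above do not state how the statistic value is formed.

```python
from typing import Dict, List


def video_statistic(duration: float, play_count: int,
                    average_integrity: float) -> float:
    """Assumed statistic value: product of the three quantities."""
    return duration * play_count * average_integrity


def search_result_repetition(top_first: List[str], top_second: List[str],
                             statistics: Dict[str, float]) -> float:
    """Sum of statistic values over the intersection divided by the union."""
    union = set(top_first) | set(top_second)
    intersection = set(top_first) & set(top_second)
    denominator = sum(statistics[video] for video in union)
    if denominator == 0.0:
        return 0.0
    return sum(statistics[video] for video in intersection) / denominator


# Top-ranked videos of the two search result lists; the values are illustrative.
statistics = {"v1": 1.0, "v2": 2.0, "v3": 1.0}
p_search = search_result_repetition(["v1", "v2"], ["v2", "v3"], statistics)
print(p_search)  # 0.5
```

Weighting each video by its statistic value means that sharing a long, frequently and fully played video contributes more to the repetition degree than sharing a rarely played one.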
Step S812, counting the repetition degree between the tag features corresponding to the historical click object groups of any two candidate search terms, and obtaining the repetition degree between any two candidate search terms based on the object click behaviors.
The tag features characterize the interestingness distribution of the historical click object group over the content of the various tags.
Step S812 may specifically include the following steps:
determining the historical click object group of a candidate search term; for each historical click object in the historical click object group, obtaining the interestingness distribution of the historical click object over the content of the various tags according to its viewing record; and calculating the mean of the interestingness distributions of the historical click objects in the group to obtain the tag feature corresponding to the historical click object group of the candidate search term;
and calculating the cosine distance between the tag features corresponding to the historical click object groups of any two candidate search terms, and obtaining, based on the cosine distance, the repetition degree based on object click behavior between the two candidate search terms, where the repetition degree based on object click behavior is inversely related to the cosine distance.
Step S814, the repetition degree based on text semantics, the repetition degree based on search results and the repetition degree based on object clicking behaviors between any two candidate search terms are weighted and summed to obtain the repetition degree between any two candidate search terms.
Step S816, a plurality of repeated groups are identified from the plurality of candidate search terms based on the repetition degree between any two candidate search terms.
The repetition degree among candidate search terms belonging to the same repetition group is higher than a set threshold value, and the repetition degree among candidate search terms belonging to different repetition groups is lower than the set threshold value.
Step S816 may specifically include the following steps:
constructing a candidate search term distance graph based on the repetition degree between any two candidate search terms; nodes in the candidate search term distance graph represent candidate search terms, and the distance between the nodes is inversely related to the repetition degree between the candidate search terms;
and mining the candidate search term distance graph to obtain a plurality of repeated groups.
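The description does not name a specific graph mining algorithm. As one possible sketch, the example below links two candidate search terms whenever their repetition degree exceeds the threshold (equivalently, whenever their distance is small) and takes the connected components as repeated groups. Both the function name and the choice of connected components are assumptions; note that connected components link terms transitively, so a stricter clustering method would be needed to guarantee that every pair inside a group exceeds the threshold.

```python
def repetition_groups(terms, repetition, threshold):
    """Group terms into connected components of the 'repetition > threshold' graph.

    `repetition(a, b)` is a caller-supplied function returning the combined
    repetition degree between two candidate search terms.
    """
    parent = {t: t for t in terms}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if repetition(a, b) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())
```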
In step S818, for each repeated group containing more than one candidate search term, each candidate search term in the repeated group, the object tag sequence, and the input text are input into the click probability prediction model, which outputs the click probability of the object for that candidate search term.
Step S818 may specifically include the following steps:
inputting any candidate search term, object tag sequence and input text in the repeated group into a click probability prediction model;
respectively obtaining depth semantic representations of candidate search terms, depth semantic representations of object tag sequences and depth semantic representations of input texts through a semantic representation network of a click probability prediction model, and then performing splicing processing to obtain spliced depth semantic representations;
and predicting the click probability of the object on the candidate search term according to the spliced deep semantic representation by a fully connected network connected with the semantic representation network in the click probability prediction model.
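The splicing and fully connected prediction can be illustrated with the simplified sketch below. It assumes the three deep semantic representations have already been produced by the semantic representation network, and it reduces the fully connected network to a single sigmoid layer; the function names, the layer shape, and the parameters are all hypothetical.

```python
import math

def splice(*representations):
    """Concatenate several representation vectors into one spliced vector."""
    spliced = []
    for r in representations:
        spliced.extend(r)
    return spliced

def predict_click_probability(term_repr, tag_seq_repr, text_repr, weights, bias):
    """Single-layer stand-in for the fully connected network (assumed shape)."""
    x = splice(term_repr, tag_seq_repr, text_repr)
    if len(x) != len(weights):
        raise ValueError("weights must match the spliced representation length")
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps the score to a probability
```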
In step S820, for each repeated group containing more than one candidate search term, the candidate search terms whose click probability is lower than a threshold are eliminated to obtain the deduplicated candidate search terms.
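Step S820 can be sketched as a simple filter. The fallback that keeps the single most probable term when every term falls below the threshold is an assumption added for illustration; the description does not state what happens in that case.

```python
def deduplicate_group(group, click_probability, threshold):
    """Drop low-probability terms from a repeated group with more than one term."""
    if len(group) <= 1:
        return list(group)  # single-term groups need no deduplication
    kept = [t for t in group if click_probability[t] >= threshold]
    if not kept:
        # Assumption: always retain at least the most likely term of the group.
        kept = [max(group, key=lambda t: click_probability[t])]
    return kept
```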
In this embodiment of the present application, after a plurality of candidate search terms corresponding to an input text are obtained, the repetition degree between any two candidate search terms is determined from multiple dimensions. Specifically, the semantic representations of any two candidate search terms are mined together with the semantic representation of the input text to determine the repetition degree between the two candidate search terms based on text semantics. The repetition degree between the two candidate search terms based on search results is obtained from the repetition degree between their search results. In addition, the historical click object groups of the two candidate search terms are determined, and the repetition degree between the two candidate search terms based on object click behaviors is obtained from the repetition degree between the interest degree distributions of those groups over the contents of various tags. Deduplication is therefore not limited to the candidate search terms themselves: the information dimensions on which deduplication can rely are expanded, and the deduplication effect is improved. Finally, the candidate search terms are deduplicated based on the multi-dimensional repetition degree between any two search terms, which improves the deduplication effect on the candidate search terms and ultimately improves the search efficiency and search experience of the user. Furthermore, based on the personalized interest degree of the object in videos of various tags, this embodiment can select, from the candidate search terms in the same repeated group, the candidate search terms that are friendlier to the object and retain them. This optimizes the deduplication of the search terms, improves the richness of the search terms, makes it more convenient for the object to search for the desired video based on the search terms, and improves the video search efficiency of the object.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time and may be performed at different times; nor are they necessarily performed sequentially, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a search term processing device for realizing the above related search term processing method. The implementation scheme of the solution to the problem provided by the device is similar to the implementation scheme described in the above method, so the specific limitation and technical effects in the embodiments of the processing device for one or more search terms provided below can be referred to the limitation and technical effects of the processing method for the search terms hereinabove, and are not repeated herein.
Fig. 9 is a block diagram of a processing device for searching for an entry in one embodiment. Referring to fig. 9, the apparatus includes:
a candidate search term determination module 902, configured to determine a plurality of candidate search terms corresponding to an input text;
a first repetition level obtaining module 904, configured to determine a repetition level between any two candidate search terms based on text semantics based on semantic representations of the any two candidate search terms and the input text;
a second repetition degree obtaining module 906, configured to count the repetition degrees between the search results of any two candidate search terms, to obtain the repetition degrees between any two candidate search terms based on the search results;
a third repetition degree obtaining module 908, configured to count the repetition degrees between the tag features corresponding to the historical click object groups of any two candidate search terms, to obtain the repetition degrees between any two candidate search terms based on the object click behaviors; the tag features characterize the interest degree distribution of the historical click object group on the contents of various tags;
the deduplication module 910 is configured to deduplicate a plurality of candidate search terms based on a repetition level based on text semantics, a repetition level based on search results, and a repetition level based on object click behaviors between any two candidate search terms.
In one embodiment, the candidate search term determination module 902 is further configured to obtain input text; querying a candidate search term database to obtain candidate search terms matched with the input text; and sorting the candidate search terms matched with the input text according to the relevance of the input text and the candidate search terms, the popularity of the candidate search terms and the historical click rate, so as to obtain a plurality of candidate search terms corresponding to the input text.
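The sorting performed by module 902 might look like the following sketch. How relevance, popularity, and historical click rate are combined is not specified in the description; the linear score, its weights, the `Candidate` type, and the `top_k` cutoff are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    term: str
    relevance: float    # relevance between the input text and the term
    popularity: float   # popularity (heat) of the term
    click_rate: float   # historical click rate of the term

def rank_candidates(candidates, weights=(0.5, 0.3, 0.2), top_k=10):
    """Sort candidates by an assumed linear score and keep the top_k."""
    w_rel, w_pop, w_ctr = weights

    def score(c):
        return w_rel * c.relevance + w_pop * c.popularity + w_ctr * c.click_rate

    return sorted(candidates, key=score, reverse=True)[:top_k]
```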
In one embodiment, the first repetition obtaining module 904 is further configured to input any two candidate search terms and input text into the semantic repetition prediction model; respectively obtaining respective depth semantic representations of any two candidate search terms and depth semantic representations of input texts through a semantic representation network of a semantic repetition prediction model, and then performing splicing processing to obtain spliced depth semantic representations; and predicting the text semantic-based repeatability between any two candidate search terms according to the spliced deep semantic representation by a fully-connected network connected with the semantic representation network in the semantic repeatability prediction model.
In one embodiment, the device comprises a first training module, a second training module and a third training module, wherein the first training module is used for acquiring a training sample of a semantic repetition prediction model, the training sample of the semantic repetition prediction model comprises a positive sample and a negative sample, the positive sample is composed of candidate search terms marked with repetition in candidate search terms of a sample input text and a sample input text, and the negative sample is composed of candidate search terms not marked with repetition in the candidate search terms of the sample input text and the sample input text; and optimizing network parameters of a semantic representation network and a fully-connected network in the semantic repeatability prediction model based on positive examples and negative examples included in training samples of the semantic repeatability prediction model until training is stopped.
In one embodiment, the any two candidate search terms include a first candidate search term and a second candidate search term; the second repetition obtaining module 906 is further configured to obtain a first search result list corresponding to the first candidate search term; obtain a second search result list corresponding to the second candidate search term; and count the repetition degree between a plurality of top-ranked search results in the first search result list and a plurality of top-ranked search results in the second search result list, to obtain the repetition degree between the first candidate search term and the second candidate search term based on the search results, wherein the higher a search result is ranked, the higher its relevance to the corresponding candidate search term.
In one embodiment, the second repetition obtaining module 906 is further configured to determine a video intersection and a video union formed by a plurality of top-ranked videos in the first search result list and a plurality of top-ranked videos in the second search result list; for each video in the video union, obtain a statistic value of the video from its duration, its number of plays in a first preset time period, and its average play integrity; for each video in the video intersection, obtain a statistic value of the video in the same way; and obtain the repetition degree between the first candidate search term and the second candidate search term based on the search results according to the ratio of the sum of the statistic values of all videos in the video intersection to the sum of the statistic values of all videos in the video union.
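The ratio of the intersection sum to the union sum, as also recited in claim 6, is a weighted form of the Jaccard index and can be sketched as follows. The way duration, play count, and average play integrity are combined into a single statistic value is not given in the description; the product used in `video_statistic` is a hypothetical choice.

```python
def video_statistic(duration, play_count, avg_integrity):
    """Hypothetical statistic value: the product of the three quantities."""
    return duration * play_count * avg_integrity

def search_result_repetition(first_top, second_top, statistic):
    """Weighted intersection-over-union of two lists of top-ranked video ids.

    `statistic` maps each video id to its statistic value.
    """
    first, second = set(first_top), set(second_top)
    union = first | second
    union_sum = sum(statistic[v] for v in union)
    if union_sum == 0:
        return 0.0  # assumption: no usable statistics means no measurable overlap
    return sum(statistic[v] for v in first & second) / union_sum
```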
In one embodiment, the apparatus further includes an integrity obtaining module, configured to obtain a playing duration and a playing number of times of the video in a second preset time period; and taking the ratio between the playing time length and the playing times of the video in the second preset time period as the playing average integrity of the video.
In one embodiment, the third repetition obtaining module 908 is further configured to determine a historical click object group of the candidate search term, obtain, for each historical click object in the historical click object group, an interest degree distribution of each historical click object in the content of each type of tag according to a viewing record of the historical click object, and calculate, for each historical click object in the historical click object group of the candidate search term, a mean value of the interest degree distribution of each historical click object in the content of each type of tag, to obtain a tag feature corresponding to the historical click object group of the candidate search term; and for any two candidate search terms, obtaining the repetition degree between any two candidate search terms based on the object clicking behaviors according to the repetition degree between the tag features corresponding to the historical clicking object groups of the any two candidate search terms.
In one embodiment, the third repetition obtaining module 908 is further configured to calculate the cosine distance between the tag features corresponding to the historical click object groups of any two candidate search terms, and to obtain, based on the cosine distance, the repetition degree between the two candidate search terms based on object click behaviors, wherein the repetition degree based on object click behaviors is inversely related to the cosine distance.
In one embodiment, the search result obtained based on searching the entry is video, and the third repetition obtaining module 908 is further configured to determine various tags preset for the video; obtaining labels of videos watched by the historical click object according to the watching record of the historical click object, and determining the interestingness of the historical click object in the videos of various labels according to the video duration, the playing completion degree and the playing time point of the corresponding videos watched by the historical click object for various labels; and normalizing the interestingness of the historical click object in the videos of various labels to obtain the interestingness distribution of the historical click object in the videos of various labels.
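A possible sketch of the interest computation and normalization is shown below. The description names the inputs (video duration, play completion degree, and play time point) but not the formula, so the product with an exponential recency decay, the `half_life_days` parameter, and the record layout are all assumptions.

```python
from collections import defaultdict

def interest_distribution(viewing_records, now, half_life_days=30.0):
    """Return a per-tag interest distribution that sums to 1.

    Each record is (tag, duration_seconds, completion, played_at), where
    `completion` lies in [0, 1] and `played_at` / `now` are Unix timestamps.
    """
    raw = defaultdict(float)
    for tag, duration, completion, played_at in viewing_records:
        age_days = max(0.0, (now - played_at) / 86400.0)
        decay = 0.5 ** (age_days / half_life_days)  # assumed: recent plays count more
        raw[tag] += duration * completion * decay
    total = sum(raw.values())
    if total == 0:
        return {}  # no usable viewing history
    return {tag: value / total for tag, value in raw.items()}
```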
In one embodiment, the deduplication module 910 is further configured to perform weighted summation on the repetition level based on text semantics, the repetition level based on search results, and the repetition level based on object click behaviors between any two candidate search terms, so as to obtain the repetition level between any two candidate search terms; identifying a plurality of repeated groups from the plurality of candidate search terms based on the repetition degree between any two candidate search terms, wherein the repetition degree between candidate search terms belonging to the same repeated group in the plurality of repeated groups is higher than a set threshold value, and the repetition degree between candidate search terms belonging to different repeated groups is lower than the set threshold value; and de-duplicating candidate search terms belonging to the same repeated group.
In one embodiment, the deduplication module 910 is further configured to construct a candidate search term distance graph based on the repetition degree between any two candidate search terms, wherein nodes in the candidate search term distance graph represent candidate search terms and the distance between nodes is inversely related to the repetition degree between the corresponding candidate search terms; and to mine the candidate search term distance graph to obtain a plurality of repeated groups.
In one embodiment, the deduplication module 910 is further configured to obtain an object tag sequence corresponding to an object of the input text; for a repeated group with the number of the included candidate search terms being greater than 1, respectively inputting any candidate search term in the repeated group, the object tag sequence and the input text into a click probability prediction model, and outputting the click probability of the object for any candidate search term in the repeated group; and in the repeated group, eliminating candidate search terms with click probability lower than a threshold value to obtain candidate search terms after duplication elimination.
In one embodiment, the deduplication module 910 is further configured to input any candidate search term, object tag sequence and input text in the repetition group into the click probability prediction model; respectively obtaining depth semantic representations of candidate search terms, depth semantic representations of object tag sequences and depth semantic representations of input texts through a semantic representation network of a click probability prediction model, and then performing splicing processing to obtain spliced depth semantic representations; and predicting the click probability of the object on the candidate search term according to the spliced deep semantic representation by a fully connected network connected with the semantic representation network in the click probability prediction model.
In one embodiment, the apparatus further includes a second training module configured to obtain a training sample of the click probability prediction model, where the training sample of the click probability prediction model includes a positive sample and a negative sample, the positive sample is composed of a sample input text, a candidate search term clicked by an object inputting the sample input text, and an object tag sequence of the object inputting the sample input text, and the negative sample is composed of the sample input text, a candidate search term not clicked by the object inputting the sample input text, and an object tag sequence of the object inputting the sample input text; based on positive examples and negative examples included in the training samples of the click probability prediction model, optimizing network parameters of the semantic representation network and the full-connection network in the click probability prediction model until training is stopped.
In the above processing device for search terms, after a plurality of candidate search terms corresponding to an input text are obtained, the repetition degree between any two candidate search terms is determined from multiple dimensions. Specifically, the semantic representations of any two candidate search terms are mined together with the semantic representation of the input text to determine the repetition degree between the two candidate search terms based on text semantics. The repetition degree between the two candidate search terms based on search results is obtained from the repetition degree between their search results. In addition, the historical click object groups of the two candidate search terms are determined, and the repetition degree between the two candidate search terms based on object click behaviors is obtained from the repetition degree between the interest degree distributions of those groups over the contents of various tags. Deduplication is therefore not limited to the candidate search terms themselves: the information dimensions on which deduplication can rely are expanded, and the deduplication effect is improved. Finally, the candidate search terms are deduplicated based on the multi-dimensional repetition degree between any two search terms, which improves the deduplication effect on the candidate search terms and ultimately improves the search efficiency and search experience of the user.
The respective modules in the processing device for searching the term may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal or a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, an Input/Output interface (I/O interface) and a communication interface connected by a system bus. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store the processed data of the search term. The input-output interface of the computer device is used for exchanging information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method of processing search terms.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the processor executes the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the respective method embodiments described above.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the various method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (19)

1. A method of processing a search term, the method comprising:
determining a plurality of candidate search terms corresponding to the input text;
determining the repetition degree between any two candidate search terms based on text semantics based on semantic representations of the any two candidate search terms and the input text;
counting the repetition degree between the search results of any two candidate search terms to obtain the repetition degree between any two candidate search terms based on the search results;
counting the repetition degree between the tag features corresponding to the historical click object groups of any two candidate search terms to obtain the repetition degree between any two candidate search terms based on the object click behaviors; the tag features characterize the interestingness distribution of the historical click object group for the content of various tags;
and de-duplicating the candidate search terms based on the repetition degree based on text semantics, the repetition degree based on search results and the repetition degree based on object clicking behaviors between any two candidate search terms.
2. The method of claim 1, wherein the determining a plurality of candidate search terms corresponding to the input text comprises:
acquiring an input text;
querying a candidate search term database to obtain candidate search terms matched with the input text;
and sorting the candidate search terms matched with the input text according to the relevance of the input text and the candidate search terms, the heat of the candidate search terms and the historical click rate, so as to obtain a plurality of candidate search terms corresponding to the input text.
3. The method of claim 1, wherein the determining a degree of repetition between any two candidate search terms based on text semantics based on semantic representations of the any two candidate search terms and the input text comprises:
inputting the arbitrary two candidate search terms and the input text into a semantic repetition prediction model;
respectively obtaining the respective depth semantic representation of any two candidate search terms and the depth semantic representation of the input text through the semantic representation network of the semantic repetition prediction model, and then performing splicing processing to obtain the spliced depth semantic representation;
and predicting the repetition degree based on text semantics between any two candidate search terms according to the spliced deep semantic representation by a fully connected network connected with the semantic representation network in the semantic repetition degree prediction model.
4. A method according to claim 3, wherein the training step of the semantic repeatability prediction model comprises:
obtaining a training sample of the semantic repetition prediction model, wherein the training sample of the semantic repetition prediction model comprises a positive example sample and a negative example sample, the positive example sample consists of candidate search terms marked with repetition in candidate search terms of a sample input text and the sample input text, and the negative example sample consists of candidate search terms not marked with repetition in candidate search terms of the sample input text and the sample input text;
and optimizing network parameters of a semantic representation network and a fully-connected network in the semantic repeatability prediction model based on positive examples and negative examples included in training samples of the semantic repeatability prediction model until training is stopped.
5. The method of claim 1, wherein the any two candidate search terms include a first candidate search term and a second candidate search term, the counting the repetition level between the search results of the any two candidate search terms, and the obtaining the repetition level between the any two candidate search terms based on the search results, includes:
acquiring a first search result list corresponding to the first candidate search term;
acquiring a second search result list corresponding to the second candidate search term;
and counting the repetition degree between the top-ranked multiple search results in the first search result list and the top-ranked multiple search results in the second search result list to obtain the repetition degree between the first candidate search term and the second candidate search term based on the search results, wherein the higher a search result is ranked, the higher its relevance to the corresponding candidate search term.
6. The method of claim 5, wherein the search result based on searching the term is video; the counting of the repetition degree between the top-ranked plurality of search results in the first search result list and the top-ranked plurality of search results in the second search result list, to obtain the repetition degree between the first candidate search term and the second candidate search term based on the search results, includes:
determining a video intersection and a video union formed by a plurality of videos which are ranked ahead in the first search result list and a plurality of videos which are ranked ahead in the second search result list;
for each video in the video union, counting respective time length, playing times in a first preset time period and playing average integrity of the video to obtain respective statistic values of the video;
for each video in the video intersection, counting respective time length, playing times in a first preset time period and playing average integrity of the video to obtain respective statistic values of the video;
and obtaining the repetition degree between the first candidate search term and the second candidate search term based on the search result according to the ratio of the sum of the statistic values of all videos in the video intersection to the sum of the statistic values of all videos in the video union.
7. The method of claim 6, wherein the method further comprises:
acquiring the playing time length and the playing times of the video in a second preset time period;
and taking the ratio between the playing time length and the playing times of the video in a second preset time period as the playing average integrity of the video.
8. The method of claim 1, wherein the counting the repetition degree between the tag features corresponding to the historical click object groups of any two candidate search terms to obtain the repetition degree based on object click behavior between the any two candidate search terms comprises:
for each candidate search term, determining the historical click object group of the candidate search term; for each historical click object in the historical click object group, obtaining the interestingness distribution of the historical click object over the content of each type of tag according to the viewing record of the historical click object; and averaging the interestingness distributions of the historical click objects in the historical click object group to obtain the tag feature corresponding to the historical click object group of the candidate search term;
and for any two candidate search terms, obtaining the repetition degree based on object click behavior between the any two candidate search terms according to the repetition degree between the tag features corresponding to the historical click object groups of the any two candidate search terms.
9. The method of claim 8, wherein the obtaining, for any two candidate search terms, the repetition degree based on object click behavior between the any two candidate search terms according to the repetition degree between the tag features corresponding to the historical click object groups of the any two candidate search terms comprises:
calculating the cosine distance between the tag features corresponding to the historical click object groups of the any two candidate search terms, and obtaining the repetition degree based on object click behavior between the any two candidate search terms based on the cosine distance, wherein the repetition degree based on object click behavior is inversely related to the cosine distance.
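As an illustration of claims 8 and 9, the sketch below averages per-object interestingness distributions into a group tag feature and then maps the cosine distance between two such features to a repetition degree. The claims only require the repetition degree to be inversely related to the cosine distance; the mapping `1 - distance` is one assumed choice, and all function names are hypothetical.

```python
import math


def group_tag_feature(distributions):
    """Element-wise mean of the interestingness distributions of a click object group."""
    n = len(distributions)
    dims = len(distributions[0])
    return [sum(d[i] for d in distributions) / n for i in range(dims)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty feature as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def click_repetition(feature_a, feature_b):
    # ASSUMPTION: repetition = 1 - cosine distance; any decreasing
    # function of the distance would satisfy the claim wording.
    return 1.0 - cosine_distance(feature_a, feature_b)
```

Two candidate search terms clicked by audiences with the same tag interests score close to 1, and terms clicked by audiences with non-overlapping interests score close to 0.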
10. The method of claim 8, wherein the search result obtained by searching with a search term is a video, and the obtaining the interestingness distribution of the historical click object over the content of each type of tag according to the viewing record of the historical click object comprises:
determining the labels preset for videos;
obtaining the labels of the videos watched by the historical click object according to the viewing record of the historical click object, and, for each label, determining the interestingness of the historical click object in videos of the label according to the video duration, the play completion degree and the play time point of the corresponding videos watched by the historical click object;
and normalizing the interestingness of the historical click object in the videos of the respective labels to obtain the interestingness distribution of the historical click object over the videos of the respective labels.
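A possible reading of claim 10 is sketched below. The claim names the three inputs (video duration, play completion degree, play time point) but not the scoring formula, so the formula here is an assumption: each viewing contributes duration times completion, discounted by an exponential time decay with a hypothetical `half_life_days` parameter. The record field names are likewise illustrative.

```python
from collections import defaultdict


def interest_distribution(records, labels, now, half_life_days=30.0):
    """Return a normalized interestingness distribution over the preset labels.

    records: iterable of dicts with keys label, duration_s, completion (0..1),
             and played_at (Unix seconds); now is also in Unix seconds.
    """
    raw = defaultdict(float)
    for r in records:
        age_days = (now - r["played_at"]) / 86400.0
        # ASSUMPTION: older viewings count less, halving every half_life_days.
        decay = 0.5 ** (age_days / half_life_days)
        raw[r["label"]] += r["duration_s"] * r["completion"] * decay
    total = sum(raw[label] for label in labels)
    if total == 0:
        return {label: 0.0 for label in labels}
    # Normalize so the values over the preset labels sum to 1.
    return {label: raw[label] / total for label in labels}
```

Normalizing puts every historical click object on the same scale, so heavy viewers do not dominate the group average taken over the historical click object group.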
11. The method of claim 1, wherein the de-duplicating the plurality of candidate search terms based on the repetition degree based on text semantics, the repetition degree based on search results, and the repetition degree based on object click behavior between any two candidate search terms comprises:
performing a weighted summation of the repetition degree based on text semantics, the repetition degree based on search results, and the repetition degree based on object click behavior between any two candidate search terms to obtain the repetition degree between the any two candidate search terms;
identifying a plurality of repeated groups from the plurality of candidate search terms based on the repetition degree between any two candidate search terms, wherein the repetition degree between candidate search terms belonging to the same repeated group is higher than a set threshold, and the repetition degree between candidate search terms belonging to different repeated groups is lower than the set threshold;
and de-duplicating the candidate search terms belonging to the same repeated group.
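The weighted summation in claim 11 can be sketched as follows. The weights are not given in the claim, so the default `(0.4, 0.3, 0.3)` is purely illustrative, and the function names are hypothetical.

```python
def combined_repetition(semantic, result, click, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three per-pair repetition degrees."""
    w_sem, w_res, w_clk = weights  # ASSUMPTION: illustrative weights only
    return w_sem * semantic + w_res * result + w_clk * click


def pairwise_repetition(scores, weights=(0.4, 0.3, 0.3)):
    """scores maps a pair of terms to (semantic, result, click) repetition degrees."""
    return {
        pair: combined_repetition(sem, res, clk, weights)
        for pair, (sem, res, clk) in scores.items()
    }
```

For example, `combined_repetition(1.0, 0.5, 0.0, weights=(0.5, 0.25, 0.25))` evaluates to `0.625`. If each input lies in [0, 1] and the weights sum to 1, the combined value also lies in [0, 1], which makes a single threshold meaningful.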
12. The method of claim 11, wherein the identifying a plurality of repeated groups from the plurality of candidate search terms based on the repetition degree between any two candidate search terms comprises:
constructing a candidate search term distance graph based on the repetition degree between any two candidate search terms, wherein nodes in the candidate search term distance graph represent candidate search terms, and the distance between nodes is inversely related to the repetition degree between the corresponding candidate search terms;
and mining the candidate search term distance graph to obtain the plurality of repeated groups.
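Claim 12 does not name the graph mining algorithm. One simple option, shown here as an assumption rather than as the claimed method, is to take the distance as `1 - repetition`, link two nodes when their distance falls below `1 - threshold`, and treat each connected component as a repeated group. The function name and the default threshold are hypothetical.

```python
def repeated_groups(terms, repetition, threshold=0.8):
    """Group terms by connected components of the thresholded distance graph.

    repetition maps (term_a, term_b) tuples to a repetition degree in [0, 1].
    """
    parent = {t: t for t in terms}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in repetition.items():
        distance = 1.0 - score  # ASSUMPTION: distance = 1 - repetition
        if distance < 1.0 - threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())
```

Connected components guarantee that no pair above the threshold is split across groups, but terms within a group may be linked only transitively. A stricter method such as clique detection or density-based clustering would be needed if every pair inside a group must exceed the threshold.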
13. The method of claim 11, wherein the de-duplicating the candidate search terms belonging to the same repeated group comprises:
acquiring an object tag sequence corresponding to the object that entered the input text;
for a repeated group containing more than one candidate search term, inputting each candidate search term in the repeated group, together with the object tag sequence and the input text, into a click probability prediction model, and outputting the click probability of the object for that candidate search term;
and, in the repeated group, eliminating the candidate search terms whose click probability is lower than a threshold to obtain the de-duplicated candidate search terms.
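The filtering step of claim 13 can be sketched with the prediction model passed in as a callable. The `predict` signature, the default threshold, and the fallback that keeps the best term when none reaches the threshold are assumptions made for this illustration; the claim itself only states that terms below the threshold are eliminated.

```python
def deduplicate(groups, input_text, tag_sequence, predict, threshold=0.5):
    """Keep, per repeated group, the terms whose predicted click probability is high enough.

    predict(term, tag_sequence, input_text) -> probability in [0, 1] (hypothetical API).
    """
    kept = []
    for group in groups:
        if len(group) == 1:
            kept.extend(group)  # nothing to de-duplicate
            continue
        scored = [(predict(term, tag_sequence, input_text), term) for term in group]
        survivors = [term for prob, term in scored if prob >= threshold]
        if not survivors:
            # ASSUMPTION: never drop a whole group; keep its most likely term.
            survivors = [max(scored)[1]]
        kept.extend(survivors)
    return kept
```

Because the object tag sequence is an input, the same repeated group can resolve to different surviving terms for different users.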
14. The method of claim 13, wherein the inputting each candidate search term in the repeated group, together with the object tag sequence and the input text, into the click probability prediction model and outputting the click probability of the object for the candidate search term comprises:
inputting the candidate search term in the repeated group, the object tag sequence and the input text into the click probability prediction model;
obtaining, through the semantic representation network of the click probability prediction model, the deep semantic representation of the candidate search term, the deep semantic representation of the object tag sequence and the deep semantic representation of the input text, and splicing them to obtain a spliced deep semantic representation;
and predicting the click probability of the object for the candidate search term according to the spliced deep semantic representation through a fully connected network connected to the semantic representation network in the click probability prediction model.
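The data flow of claim 14 (encode three inputs, splice, then a fully connected layer) is shown below in a deliberately toy form. The `encode` function is a hand-written stand-in for the semantic representation network, which in practice would be a trained neural encoder; the single linear layer with a sigmoid output, the dimension `DIM`, and all names are assumptions for illustration.

```python
import math

DIM = 4  # hypothetical size of each semantic representation


def encode(text, dim=DIM):
    # Toy stand-in for the semantic representation network:
    # a normalized histogram of character positions and codes.
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[(ord(ch) + i) % dim] += 1.0
    n = max(len(text), 1)
    return [v / n for v in vec]


def predict_click_probability(term, tag_sequence, input_text, weights, bias=0.0):
    # Splice (concatenate) the three representations into one vector.
    spliced = encode(term) + encode(" ".join(tag_sequence)) + encode(input_text)
    # ASSUMPTION: one fully connected layer followed by a sigmoid.
    logit = sum(w * x for w, x in zip(weights, spliced)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

With all-zero weights the sketch returns exactly 0.5, the untrained baseline. Only the shape of the computation matters here: three representations of length `DIM` are concatenated into a vector of length `3 * DIM` before the final layer.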
15. The method of claim 13, wherein the training step of the click probability prediction model comprises:
obtaining training samples of the click probability prediction model, wherein the training samples comprise positive samples and negative samples, a positive sample being composed of a sample input text, a candidate search term clicked by the object that entered the sample input text, and the object tag sequence of that object, and a negative sample being composed of the sample input text, a candidate search term not clicked by the object that entered the sample input text, and the object tag sequence of that object;
and optimizing the network parameters of the semantic representation network and the fully connected network in the click probability prediction model based on the positive samples and the negative samples included in the training samples until training is stopped.
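Sample construction for claim 15 can be sketched from impression logs. The log layout (`shown_terms`, `clicked_terms`, and so on) is hypothetical, and the binary cross-entropy loss is an assumed training objective, since the claim does not name one.

```python
import math


def build_samples(impressions):
    """Turn impression logs into (input_text, term, tag_sequence, label) tuples.

    label is 1 for a clicked candidate search term (positive sample)
    and 0 for a shown but unclicked one (negative sample).
    """
    samples = []
    for imp in impressions:
        for term in imp["shown_terms"]:
            label = 1 if term in imp["clicked_terms"] else 0
            samples.append((imp["input_text"], term, tuple(imp["tag_sequence"]), label))
    return samples


def bce_loss(prob, label, eps=1e-12):
    # ASSUMPTION: binary cross-entropy as the objective to minimize.
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
```

Each sample pairs the same input text and object tag sequence with one candidate search term, so the model learns which of several similar terms a given kind of user prefers.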
16. A processing apparatus for searching for terms, the apparatus comprising:
a candidate search term determining module, configured to determine a plurality of candidate search terms corresponding to the input text;
a first repetition degree acquisition module, configured to determine the repetition degree based on text semantics between any two candidate search terms in the plurality of candidate search terms and the semantic representation of the input text;
a second repetition degree acquisition module, configured to count the repetition degree between the search results of any two candidate search terms to obtain the repetition degree based on search results between the any two candidate search terms;
a third repetition degree acquisition module, configured to count the repetition degree between the tag features corresponding to the historical click object groups of any two candidate search terms to obtain the repetition degree based on object click behavior between the any two candidate search terms, wherein the tag features characterize the interestingness distribution of the historical click object group over the content of various tags;
and a deduplication module, configured to de-duplicate the plurality of candidate search terms based on the repetition degree based on text semantics, the repetition degree based on search results, and the repetition degree based on object click behavior between any two candidate search terms.
17. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 15 when executing the computer program.
18. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any one of claims 1 to 15.
19. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 15.
CN202210377991.XA 2022-04-12 2022-04-12 Processing method, device, equipment, storage medium and program product for searching entry Pending CN116932705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210377991.XA CN116932705A (en) 2022-04-12 2022-04-12 Processing method, device, equipment, storage medium and program product for searching entry

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210377991.XA CN116932705A (en) 2022-04-12 2022-04-12 Processing method, device, equipment, storage medium and program product for searching entry

Publications (1)

Publication Number Publication Date
CN116932705A true CN116932705A (en) 2023-10-24

Family

ID=88393120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210377991.XA Pending CN116932705A (en) 2022-04-12 2022-04-12 Processing method, device, equipment, storage medium and program product for searching entry

Country Status (1)

Country Link
CN (1) CN116932705A (en)

Similar Documents

Publication Publication Date Title
CN111008332B (en) Content item recommendation method, device, server and storage medium
US9449271B2 (en) Classifying resources using a deep network
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
Qian et al. Social media based event summarization by user–text–image co-clustering
CN111125422A (en) Image classification method and device, electronic equipment and storage medium
CN113392651B (en) Method, device, equipment and medium for training word weight model and extracting core words
CN110795657A (en) Article pushing and model training method and device, storage medium and computer equipment
CN112749330B (en) Information pushing method, device, computer equipment and storage medium
CN113011172B (en) Text processing method, device, computer equipment and storage medium
CN111506820A (en) Recommendation model, method, device, equipment and storage medium
CN111783903A (en) Text processing method, text model processing method and device and computer equipment
CN114168790A (en) Personalized video recommendation method and system based on automatic feature combination
CN116578729B (en) Content search method, apparatus, electronic device, storage medium, and program product
CN114817692A (en) Method, device and equipment for determining recommended object and computer storage medium
CN114741587A (en) Article recommendation method, device, medium and equipment
CN112148994A (en) Information push effect evaluation method and device, electronic equipment and storage medium
Huang et al. Tag refinement of micro-videos by learning from multiple data sources
CN115878761A (en) Event context generation method, apparatus, and medium
CN116975359A (en) Resource processing method, resource recommending method, device and computer equipment
CN116932705A (en) Processing method, device, equipment, storage medium and program product for searching entry
CN112417260B (en) Localized recommendation method, device and storage medium
CN114817697A (en) Method and device for determining label information, electronic equipment and storage medium
CN116150428B (en) Video tag acquisition method and device, electronic equipment and storage medium
CN117056587A (en) Content pushing method, device, computer equipment and storage medium
CN114662480A (en) Synonym label judging method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination