CN112989164A - Search result processing method and device and electronic equipment


Info

Publication number: CN112989164A
Authority: CN (China)
Prior art keywords: data, target, relevance, model, sample data
Legal status: Granted
Application number: CN202110329657.2A
Other languages: Chinese (zh)
Other versions: CN112989164B
Inventors: 范成, 冯志, 胡阿敏
Current Assignee: Beijing Jindi Credit Service Co., Ltd.
Original Assignee: Beijing Jindi Credit Service Co., Ltd.
Events: application filed by Beijing Jindi Credit Service Co., Ltd., with priority to CN202110329657.2A; publication of CN112989164A; application granted; publication of CN112989164B; current legal status: Active.


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/951 - Indexing; Web crawling techniques
    • G06F 16/953 - Querying, e.g. by the use of web search engines
    • G06F 16/9535 - Search customisation based on user profiles and personalisation
    • G06F 16/9538 - Presentation of query results
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present disclosure provide a search result processing method and device and electronic equipment. The method comprises the following steps: acquiring index information and a search result obtained by searching with the index information, the search result comprising N domain data corresponding to N description domains; determining target data hit by the index information from the domain data corresponding to the description domain associated with a target relevance model among the N description domains; determining a correlation characteristic parameter according to the target data; obtaining a relevance score of the search result and the index information according to the correlation characteristic parameter and the target relevance model; and displaying the search result according to the obtained relevance score. With this technical solution, the rationality of the search result page can be ensured, thereby ensuring a high-quality user search experience.

Description

Search result processing method and device and electronic equipment
Technical Field
The present disclosure relates to the field of information search technologies, and in particular, to a method and an apparatus for processing search results, and an electronic device.
Background
In a search scenario, a search may be performed using index information to obtain search results. For example, a user may search with "Beijing Jindi" as the index information to obtain a plurality of search results, which may specifically be a plurality of documents. The search results may then be presented on a search result page.
It should be noted that, from the perspective of user perception, the rationality of the search result page determines the quality of the user's search experience. Therefore, how to ensure the rationality of the search result page, and thus a high-quality user search experience, is an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
The present disclosure aims to provide a search result processing method, a search result processing device and electronic equipment, so as to ensure the rationality of a search result page at least to a certain extent and thereby ensure a high-quality user search experience.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a search result processing method, including:
acquiring index information and a search result obtained by searching by using the index information; the search result comprises N domain data corresponding to N description domains;
determining target data hit by the index information from domain data corresponding to a description domain associated with a target relevance model among the N description domains;
determining a correlation characteristic parameter according to the target data;
obtaining a relevance score of the search result and the index information according to the relevance characteristic parameter and the target relevance model;
and displaying the search result according to the acquired relevance score.
In an exemplary embodiment of the present disclosure, the number of the target relevance models is at least two, one of the target relevance models is a first target relevance model, the other one of the target relevance models is a second target relevance model, the description domain associated with the first target relevance model is a name domain, the description domain associated with the second target relevance model includes at least two description domains, and the number of the search results is K;
the displaying the search result according to the obtained relevance score comprises the following steps:
acquiring a first score threshold corresponding to the first target correlation model and a second score threshold corresponding to the second target correlation model;
according to the first score threshold value, extracting P relevance scores meeting a first preset condition from K relevance scores corresponding to K search results obtained according to the first target relevance model;
according to the second score threshold value, extracting Q relevance scores meeting a second preset condition from K relevance scores corresponding to K search results obtained according to the second target relevance model;
determining a union of the P search results corresponding to the P relevance scores and the Q search results corresponding to the Q relevance scores;
and sequencing and displaying all the search results in the union set.
In an exemplary embodiment of the present disclosure, the performing an ordered presentation on all the search results in the union includes:
acquiring a first model weight corresponding to the first target correlation model and a second model weight corresponding to the second target correlation model;
for each search result in the union set, performing weighted summation on corresponding relevance scores respectively obtained according to the first target relevance model and the second target relevance model by using the first model weight and the second model weight to obtain a comprehensive score;
and sequencing and displaying all the search results in the union set according to the respective comprehensive scores of all the search results in the union set.
In an exemplary embodiment of the present disclosure, the correlation characteristic parameter includes at least one of a field hit coverage parameter, a miss coverage parameter before a hit field, and a miss coverage parameter after a hit field of the domain data in which the target data is located;
the determining of the correlation characteristic parameter according to the target data includes:
performing word segmentation processing on the domain data where the target data is located to obtain a first word segmentation processing result;
calculating a first sum of importance weights of all words in the first word segmentation processing result;
calculating a second sum of importance weights of all terms used for forming the target data in the first word segmentation processing result, and calculating the field hit coverage rate parameter according to the second sum and the first sum;
and/or calculating a third sum of importance weights of all terms positioned before all terms used for forming the target data in the first word segmentation processing result, and calculating a miss coverage rate parameter before the hit field according to the third sum and the first sum;
and/or calculating a fourth sum of importance weights of all terms after all terms used for forming the target data in the first word segmentation processing result, and calculating a miss coverage rate parameter after the hit field according to the fourth sum and the first sum.
In an exemplary embodiment of the present disclosure, the correlation characteristic parameter includes an information hit coverage parameter of the index information;
the determining of the correlation characteristic parameter according to the target data includes:
performing word segmentation processing on the index information to obtain a second word segmentation processing result;
calculating a fifth sum of importance weights of all words in the second word segmentation processing result;
calculating a sixth sum of importance weights of the words forming the target data in the second word segmentation processing result;
and calculating the information hit coverage rate parameter according to the sixth sum and the fifth sum.
In an exemplary embodiment of the present disclosure, the relevance feature parameters include a core word loss evaluation parameter;
the determining of the correlation characteristic parameter according to the target data includes:
determining core data in the domain data where the target data is located;
performing word segmentation processing on the core data to obtain a third word segmentation processing result;
calculating a seventh sum of importance weights of the words in the third word segmentation processing result;
performing word segmentation on the part hitting the core data in the target data to obtain a fourth word segmentation result;
calculating an eighth sum of importance weights of the words in the fourth word segmentation processing result;
and calculating the core word loss evaluation parameter according to the eighth sum and the seventh sum.
In an exemplary embodiment of the present disclosure, the correlation characteristic parameter includes a hit window length ratio parameter;
the determining of the correlation characteristic parameter according to the target data includes:
determining a data window consisting of all data from a first field of a first target data to an end field of a last target data in the domain data of the target data;
performing word segmentation processing on the data window to obtain a fifth word segmentation processing result;
and calculating the length ratio parameter of the hit window according to the number of the words forming the target data in the fifth word segmentation processing result and the number of the words in the fifth word segmentation processing result.
In an exemplary embodiment of the disclosure, the correlation characteristic parameter includes at least one of a hit number ratio parameter and a distance escape parameter, the hit number ratio parameter is a ratio of the number of the target data in the domain data where the target data is located to a hit number threshold, and the distance escape parameter is an edit distance between the index information and the target data.
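For illustration, the two parameters in the preceding paragraph could be computed as sketched below; the edit distance is taken to be the standard Levenshtein distance, and the hit count threshold value used here is an assumption, not a value fixed by the disclosure.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, used as the distance escape parameter
    between the index information and the target data."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def hit_count_ratio(hit_count: int, hit_count_threshold: int = 5) -> float:
    """Ratio of the number of occurrences of the target data in the domain data
    to a hit count threshold; the default threshold of 5 is an illustrative assumption."""
    return hit_count / hit_count_threshold
```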
In an exemplary embodiment of the present disclosure, the method further comprises:
obtaining R sample data from historical user data; each sample data in the R sample data comprises sample index information, a sample search result and the frequency of occurrence of a preset event on the sample search result;
dividing the R sample data into a positive sample data group and a negative sample data group according to whether the number of times included in each of the R sample data is zero;
extracting S positive sample data from the positive sample data group, and extracting T negative sample data from the negative sample data group; wherein S and T satisfy a preset proportional relationship;
and training to obtain the target correlation model according to the S pieces of positive sample data and the T pieces of negative sample data.
In an exemplary embodiment of the present disclosure, the training to obtain the target correlation model according to the S positive sample data and the T negative sample data includes:
according to a frequency threshold value, extracting U positive sample data of which the frequency meets a third preset condition from the S positive sample data;
according to the T pieces of negative sample data and the initial correlation model, obtaining a correlation score between a sample search result and sample index information included in each piece of negative sample data in the T pieces of negative sample data;
extracting V negative sample data meeting a fourth preset condition from the T negative sample data according to the acquired relevance scores;
and performing model training by using the U positive sample data and the V negative sample data to obtain the target correlation model.
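A minimal sketch of the sample construction described in the two preceding embodiments is given below; the log field names, sampling ratio and thresholds are illustrative assumptions rather than values fixed by the disclosure.

```python
import random

def build_training_samples(history, s_to_t_ratio=(1, 1), count_threshold=3,
                           score_threshold=0.5, initial_model_score=None):
    """Split R sample data into positive/negative groups by whether the number of
    times the preset event occurred is zero, sample S positive and T negative items
    in a preset ratio, then keep U positives whose count satisfies a threshold and
    V negatives selected by the initial relevance model's score. All thresholds and
    field names here are illustrative assumptions."""
    positives = [x for x in history if x["event_count"] > 0]
    negatives = [x for x in history if x["event_count"] == 0]

    # Extract S positive and T negative samples satisfying the preset ratio.
    t = len(negatives)
    s = min(len(positives), t * s_to_t_ratio[0] // s_to_t_ratio[1])
    s_samples = random.sample(positives, s)
    t_samples = random.sample(negatives, t)

    # U positives whose included count satisfies the third preset condition.
    u_samples = [x for x in s_samples if x["event_count"] >= count_threshold]
    # V negatives whose score from the initial relevance model satisfies the
    # fourth preset condition (e.g. scored as confidently irrelevant).
    v_samples = [x for x in t_samples
                 if initial_model_score is None
                 or initial_model_score(x) < score_threshold]
    return u_samples, v_samples
```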
According to a second aspect of the present disclosure, there is provided a search result processing apparatus including:
the first acquisition module is used for acquiring index information and a search result obtained by searching by using the index information; the search result comprises N domain data corresponding to N description domains;
a first determining module, configured to determine, from domain data corresponding to a description domain associated with a target relevance model among the N description domains, target data hit by the index information;
the second determination module is used for determining a correlation characteristic parameter according to the target data;
the second obtaining module is used for obtaining the relevance scores of the search results and the index information according to the relevance characteristic parameters and the target relevance model;
and the processing module is used for displaying the search result according to the acquired relevance score.
In an exemplary embodiment of the present disclosure, the number of the target relevance models is at least two, one of the target relevance models is a first target relevance model, the other one of the target relevance models is a second target relevance model, the description domain associated with the first target relevance model is a name domain, the description domain associated with the second target relevance model includes at least two description domains, and the number of the search results is K;
the processing module comprises:
a first obtaining submodule, configured to obtain a first score threshold corresponding to the first target correlation model and a second score threshold corresponding to the second target correlation model;
the first extraction submodule is used for extracting P relevance scores meeting a first preset condition from K relevance scores corresponding to K search results obtained according to the first target relevance model according to the first score threshold;
the second extraction submodule is used for extracting Q relevance scores meeting a second preset condition from K relevance scores corresponding to K search results obtained according to the second target relevance model according to the second score threshold;
a first determining submodule, configured to determine a union of the P search results corresponding to the P relevance scores and the Q search results corresponding to the Q relevance scores;
and the display sub-module is used for carrying out sequencing display on all the search results in the union set.
In an exemplary embodiment of the present disclosure, the presentation submodule includes:
a first obtaining unit, configured to obtain a first model weight corresponding to the first target correlation model and a second model weight corresponding to the second target correlation model;
a second obtaining unit, configured to perform weighted summation on corresponding relevance scores respectively obtained according to the first target relevance model and the second target relevance model by using the first model weight and the second model weight for each search result in the union to obtain a comprehensive score;
and the display unit is used for carrying out sequencing display on all the search results in the union set according to the respective comprehensive scores of all the search results in the union set.
In an exemplary embodiment of the present disclosure, the correlation characteristic parameter includes at least one of a field hit coverage parameter, a miss coverage parameter before a hit field, and a miss coverage parameter after a hit field of the domain data in which the target data is located;
the second determining module includes:
the first processing submodule is used for performing word segmentation processing on the domain data where the target data is located to obtain a first word segmentation processing result;
the first calculation submodule is used for calculating a first sum of importance weights of all terms in the first word segmentation processing result;
a second determining submodule, configured to calculate a second sum of importance weights of words used for forming the target data in the first word segmentation processing result, and calculate the field hit coverage parameter according to the second sum and the first sum;
and/or the second determining submodule is used for calculating a third sum of importance weights of all terms positioned before all terms used for forming the target data in the first word segmentation processing result, and calculating a miss coverage rate parameter before the hit field according to the third sum and the first sum;
and/or the second determining submodule is used for calculating a fourth sum of importance weights of all terms after all terms used for forming the target data in the first word segmentation processing result, and calculating the miss coverage rate parameter after the hit field according to the fourth sum and the first sum.
In an exemplary embodiment of the present disclosure, the correlation characteristic parameter includes an information hit coverage parameter of the index information;
the second determining module includes:
the second processing submodule is used for performing word segmentation processing on the index information to obtain a second word segmentation processing result;
the second calculation submodule is used for calculating a fifth sum of importance weights of all words in the second word segmentation processing result;
a third calculation submodule, configured to calculate a sixth sum of importance weights of words used for forming the target data in the second word segmentation processing result;
and the third determining submodule is used for calculating the information hit coverage rate parameter according to the sixth sum and the fifth sum.
In an exemplary embodiment of the present disclosure, the relevance feature parameters include a core word loss evaluation parameter;
the second determining module includes:
a fourth determining submodule, configured to determine core data in the domain data where the target data is located;
the third processing submodule is used for carrying out word segmentation processing on the core data to obtain a third word segmentation processing result;
a fourth calculation submodule, configured to calculate a seventh sum of importance weights of the words in the third word segmentation processing result;
the fourth processing submodule is used for performing word segmentation processing on the part hitting the core data in the target data to obtain a fourth word segmentation processing result;
a fifth calculation submodule, configured to calculate an eighth sum of importance weights of the terms in the fourth term processing result;
and the fifth determining submodule is used for calculating the core word loss evaluation parameter according to the eighth sum and the seventh sum.
In an exemplary embodiment of the present disclosure, the correlation characteristic parameter includes a hit window length ratio parameter;
the second determining module includes:
a sixth determining submodule, configured to determine a data window, in the domain data where the target data is located, that is formed by all data from a first field of a first target data to a last field of a last target data;
the fifth processing submodule is used for performing word segmentation processing on the data window to obtain a fifth word segmentation processing result;
and the seventh determining submodule is used for calculating the length ratio parameter of the hit window according to the number of the words forming the target data in the fifth word processing result and the number of the words in the fifth word processing result.
In an exemplary embodiment of the present disclosure, the correlation characteristic parameter includes at least one of a hit number ratio parameter and a distance escape parameter, the hit number ratio parameter is a ratio of the number of the target data in the domain data to a hit number threshold, and the distance escape parameter is an edit distance between the index information and the target data.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the third acquisition module is used for acquiring R sample data from historical user data; each sample data in the R sample data comprises sample index information, a sample search result and the frequency of occurrence of a preset event on the sample search result;
the dividing module is used for dividing the R sample data into a positive sample data group and a negative sample data group according to whether the number of times included in each of the R sample data is zero;
the extraction module is used for extracting S positive sample data from the positive sample data group and extracting T negative sample data from the negative sample data group; wherein S and T satisfy a preset proportional relationship;
and the training module is used for training to obtain the target correlation model according to the S positive sample data and the T negative sample data.
In an exemplary embodiment of the present disclosure, the training module includes:
a third extraction submodule, configured to extract, according to a frequency threshold, U positive sample data whose included frequency satisfies a third preset condition from the S positive sample data;
the second obtaining sub-module is used for obtaining the relevance scores of the sample search results and the sample index information included in each negative sample data in the T negative sample data according to the T negative sample data and the initial relevance model;
a fourth extraction submodule, configured to extract, according to the obtained correlation score, V negative sample data that meet a fourth preset condition from the T negative sample data;
and the training submodule is used for performing model training by utilizing the U positive sample data and the V negative sample data to obtain the target correlation model.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the above search result processing method via execution of the executable instructions.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described search result processing method.
According to a fifth aspect of the present disclosure, there is provided a computer program comprising computer-readable code which, when run on a device, causes a processor in the device to execute the steps of the above search result processing method.
As can be seen from the foregoing technical solutions, the search result processing method, apparatus, electronic device, computer-readable storage medium, and computer program in the exemplary embodiments of the present disclosure have at least the following advantages and positive effects:
according to the search result processing method in the embodiment of the disclosure, index information and a search result obtained by searching by using the index information can be obtained firstly; the search result comprises N domain data corresponding to the N description domains; next, target data hit by the index information may be determined from domain data corresponding to the description domain associated with the target relevance model among the N description domains, and a relevance feature parameter may be determined according to the target data; then, the relevance scores of the search results and the index information can be obtained according to the relevance characteristic parameters and the target relevance model, and the search results are displayed according to the obtained relevance scores. It can be seen that, in the embodiment of the present disclosure, after the search is performed by using the index information, the target data hit by the index information may be determined from the appropriate domain data in combination with the association relationship between the model and the description domain, and then, the relevance score that can effectively characterize the relevance degree of the search result and the index information may be obtained in combination with the relevance characteristic parameter determination processing and the model-based relevance score acquisition processing.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied.
Fig. 2 is a flowchart of a search result processing method in an exemplary embodiment of the present disclosure.
Fig. 3 is another flowchart of a search result processing method in an exemplary embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a composition of correlation characteristic parameters in an exemplary embodiment of the present disclosure.
Fig. 5 is an overall design schematic in an exemplary embodiment of the present disclosure.
Fig. 6 is a flow chart of sample construction in an exemplary embodiment of the present disclosure.
Fig. 7 is a block diagram of a search result processing apparatus in an exemplary embodiment of the present disclosure.
Fig. 8 is another block diagram of a search result processing apparatus in an exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram of an electronic device in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, apparatus, steps, etc. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present disclosure, "a plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. The symbol "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In the present disclosure, unless otherwise expressly specified or limited, the terms "connected" and the like are to be construed broadly, e.g., as meaning electrically connected or in communication with each other; components may be directly connected or indirectly connected through an intermediary. The specific meaning of the above terms in the present disclosure can be understood by those of ordinary skill in the art as appropriate.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having display screens including, but not limited to, smart phones, tablets, portable and desktop computers, digital cinema projectors, and the like.
The server 105 may be a server that provides various services. For example, a user sends a search request carrying index information to the server 105 by using the terminal device 103 (or the terminal device 101 or 102), and the server 105 may respond to the search request to obtain a search result and control the terminal device 103 to perform a presentation process of the search result on a search result page, so as to present the search result page to the user through the terminal device 103.
Referring to fig. 2, a flowchart of a search result processing method according to an exemplary embodiment of the present disclosure is provided. The method shown in fig. 2 may include step 201, step 202, step 203, step 204 and step 205, which are described below.
Step 201, obtaining index information and a search result obtained by searching by using the index information; the search result comprises N domain data corresponding to the N description domains.
In step 201, a user may input index information, where the index information may be a search term; herein, the search term may also be referred to as query.
Next, a search may be performed using the index information to obtain a search result, which may be a document; wherein a document may also be referred to as doc. It should be noted that the search result may include N domain data corresponding to the N description domains, where the N description domains and the N domain data may be in a one-to-one correspondence relationship, and N may be 1, 4, 8, 10, 15, 30, or another integer, which is not listed here.
In the case that the search scenario in the embodiment of the present disclosure is a company search scenario, the N description domains in the search result may include at least one of the following: a name domain (specifically, a company name domain), a trademark information domain, a shareholder information domain, an investment institution domain, and a project domain. Assuming that the index information is "Beijing Jindi", the domain data corresponding to the name domain may be "Beijing Jindi Technology Co., Ltd.", the domain data corresponding to the trademark information domain may include information about each trademark registered by Beijing Jindi Technology Co., Ltd., and the domain data corresponding to the shareholder information domain may include information about each shareholder of Beijing Jindi Technology Co., Ltd.
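For illustration, such a search result can be pictured as a mapping from description domains to domain data. The following minimal sketch uses assumed field names and values for this example only, not a schema required by the disclosure.

```python
# A minimal sketch of one search result (doc) in a company search scenario.
# Field names and values are illustrative assumptions.
search_result = {
    "name": "Beijing Jindi Technology Co., Ltd.",           # name domain
    "trademark_info": ["Trademark A", "Trademark B"],        # trademark information domain
    "shareholder_info": ["Shareholder X", "Shareholder Y"],  # shareholder information domain
    "investment_institution": [],                            # investment institution domain
    "project": [],                                           # project domain
}
index_info = "Beijing Jindi"  # the query entered by the user
```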
Step 202, determining target data hit by the index information from domain data corresponding to the description domain associated with the target relevance model in the N description domains.
Here, the target correlation model may be a model that is trained in advance using a machine learning algorithm and is capable of performing correlation score prediction; the machine learning algorithm may be an eXtreme Gradient Boosting (XGBoost) algorithm.
In step 202, the description domain associated with the target relevance model in the N description domains may be determined according to the pre-configured association relationship between the target relevance model and the description domain. Optionally, by pre-configuration, the description domain associated with the target relevance model may be only a name domain, or the description domain associated with the target relevance model may include at least two of the N description domains, or the description domain associated with the target relevance model may include each of the N description domains.
Next, the target data hit by the index information can be determined from the domain data corresponding to the description domain associated with the target relevance model by using a specific tool; the specific tool may be Elasticsearch, a popular enterprise-level search engine. Optionally, in the case that the index information is "Beijing Jindi" and the domain data corresponding to the description domain associated with the target relevance model is "Beijing Jindi Technology Co., Ltd.", the following result may be obtained by using the Elasticsearch tool: "<em>Beijing Jindi</em> Technology Co., Ltd.". The part between <em> and </em> can be highlighted, for example shown in red, so the part between <em> and </em> can be regarded as the highlighted result (which is composed of one or more highlighted fields) and can be used as the target data; that is, the target data is "Beijing Jindi".
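As a minimal illustration of turning such a highlighted fragment into target data (assuming the highlighter wraps hits in <em>...</em>, as Elasticsearch's default highlighter does; the helper name is hypothetical):

```python
import re

def extract_target_data(highlighted: str) -> list[str]:
    """Collect the hit fragments marked with <em>...</em> by the search engine's
    highlighter. Illustrative sketch only; tag names follow Elasticsearch's
    default highlighter and may differ in a real deployment."""
    return re.findall(r"<em>(.*?)</em>", highlighted)

# For "<em>Beijing Jindi</em> Technology Co., Ltd." this yields ["Beijing Jindi"].
print(extract_target_data("<em>Beijing Jindi</em> Technology Co., Ltd."))
```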
Step 203, determining a correlation characteristic parameter according to the target data.
In step 203, a correlation characteristic parameter that can be used to assist in obtaining a correlation score between the search result and the index information may be determined according to the target data, and the number of parameters in the correlation characteristic parameter may be at least one, including but not limited to a field hit coverage parameter, a miss coverage parameter before a hit field, a miss coverage parameter after a hit field, an information hit coverage parameter, a core word loss evaluation parameter, a hit window length ratio parameter, a hit number ratio parameter, a distance escape parameter, and the like.
And step 204, obtaining a relevance score of the search result and the index information according to the relevance characteristic parameter and the target relevance model.
In the case that the correlation characteristic parameter contains only one parameter, that parameter can be input into the target relevance model to obtain the relevance score of the search result and the index information output by the target relevance model according to the parameter; in the case that the correlation characteristic parameter contains at least two parameters, the at least two parameters can be combined to obtain a combined feature vector, and the combined feature vector can then be input into the target relevance model to obtain the relevance score of the search result and the index information output by the target relevance model according to the combined feature vector. Optionally, the relevance score of the search result and the index information may lie between 0 and 1.
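A minimal sketch of this scoring step, assuming the target relevance model is an already-trained XGBoost regressor (the disclosure names the XGBoost algorithm above) and using illustrative feature names and ordering:

```python
import numpy as np
import xgboost as xgb

def relevance_score(model: xgb.XGBRegressor, features: dict[str, float]) -> float:
    """Combine the correlation characteristic parameters into one feature vector
    and let the trained target relevance model predict a relevance score.
    Feature names and their order are illustrative assumptions."""
    vector = np.array([[
        features.get("field_hit_coverage", 0.0),
        features.get("miss_coverage_before_hit", 0.0),
        features.get("miss_coverage_after_hit", 0.0),
        features.get("info_hit_coverage", 0.0),
        features.get("core_word_loss", 0.0),
        features.get("hit_window_length_ratio", 0.0),
        features.get("hit_count_ratio", 0.0),
        features.get("edit_distance", 0.0),
    ]])
    return float(model.predict(vector)[0])
```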
And step 205, displaying the search result according to the acquired relevance score.
It should be noted that the relevance score of the search result and the index information output by the target relevance model can effectively represent the degree of relevance between the search result and the index information. Based on this, the search result can be displayed in an appropriate manner to ensure the rationality of the search result page. For example, when the degree of relevance between the search result and the index information is very high, the search result can be directly displayed on the search result page and ranked as high as possible; when the degree of relevance is very low, the search result can be directly displayed but ranked as low as possible, or its direct display can be suppressed so that it is shown on the search result page only when the user chooses to view all results.
According to the search result processing method in the embodiment of the disclosure, index information and a search result obtained by searching by using the index information can be obtained firstly; the search result comprises N domain data corresponding to the N description domains; next, target data hit by the index information may be determined from domain data corresponding to the description domain associated with the target relevance model among the N description domains, and a relevance feature parameter may be determined according to the target data; then, the relevance scores of the search results and the index information can be obtained according to the relevance characteristic parameters and the target relevance model, and the search results are displayed according to the obtained relevance scores. It can be seen that, in the embodiment of the present disclosure, after the search is performed by using the index information, the target data hit by the index information may be determined from the appropriate domain data in combination with the association relationship between the model and the description domain, and then, the relevance score that can effectively characterize the relevance degree of the search result and the index information may be obtained in combination with the relevance characteristic parameter determination processing and the model-based relevance score acquisition processing.
In an alternative example, the number of target relevance models is at least two, and the description fields associated with different target relevance models are not identical.
In one embodiment, the number of the object correlation models is at least two, the description domain associated with one of the object correlation models is a name domain, and the description domain associated with the other object correlation model includes at least two description domains. Optionally, the at least two description fields may include a name field.
In this embodiment, both the two target relevance models (i.e., the target relevance model whose associated description domain is the name domain and the target relevance model whose associated description domain includes at least two description domains) can output the relevance scores of the search result and the index information, so that the search result can be displayed by combining the two relevance scores output by the two target relevance models, and the two target relevance models can play a complementary role, thereby being beneficial to further ensuring the reasonability of the search result page.
In an optional example, the number of the target relevance models is at least two, wherein one target relevance model is a first target relevance model, the other target relevance model is a second target relevance model, the description domain associated with the first target relevance model is a name domain, the description domain associated with the second target relevance model comprises at least two description domains, and the number of the search results is K;
based on the embodiment shown in fig. 2, as shown in fig. 3, step 205 includes:
step 2051, acquiring a first score threshold corresponding to the first target correlation model and a second score threshold corresponding to the second target correlation model;
step 2052, according to the first score threshold, extracting P relevance scores meeting a first preset condition from K relevance scores corresponding to K search results obtained according to the first target relevance model;
step 2053, according to a second score threshold, extracting Q relevance scores meeting a second preset condition from K relevance scores corresponding to K search results obtained according to the second target relevance model;
step 2054, determining a union of P search results corresponding to the P relevance scores and Q search results corresponding to the Q relevance scores;
and step 2055, sequencing and displaying all the search results in the union set.
Here, K may be 2, 3, 4, 5, or an integer greater than 5, which is not listed here.
In the embodiment of the disclosure, the corresponding relationship between each target correlation model and the corresponding score threshold may be preset, so that the first score threshold corresponding to the first target correlation model and the second score threshold corresponding to the second target correlation model may be conveniently and reliably obtained according to the preset corresponding relationship.
The number of the search results is K, the first target relevance model may output K relevance scores corresponding to the K search results, and after the first score threshold is obtained, P relevance scores meeting a first preset condition may be extracted from the K relevance scores corresponding to the K search results output by the first target relevance model according to the first score threshold. Optionally, each relevance score larger than a first score threshold may be screened from K relevance scores corresponding to K search results output by the first target relevance model, and all screened relevance scores may be taken as P extracted relevance scores meeting the first preset condition, or further screening may be performed on the basis of all screened relevance scores, and all further screened relevance scores may be taken as P extracted relevance scores meeting the first preset condition. It should be noted that, in the above cases, the value of P is not fixed, and the value of P is specifically related to the actual screening situation, and of course, the value of P may also be fixed, for example, the value of P may be set to 5, 8, 10, and the like in advance.
Similar to the first target relevance model, the second target relevance model may also output K relevance scores corresponding to the K search results, and after the second score threshold is obtained, Q relevance scores meeting a second preset condition may be extracted from the K relevance scores corresponding to the K search results output by the second target relevance model according to the second score threshold, where the specific extraction manner may refer to the description in the previous paragraph, and details are not repeated here. It should be noted that, similar to P, the value of Q may not be fixed, for example, related to the actual screening situation, and of course, the value of Q may also be fixed.
After the P relevance scores meeting the first preset condition and the Q relevance scores meeting the second preset condition are extracted, a union of the P search results corresponding to the P relevance scores and the Q search results corresponding to the Q relevance scores may be determined, and all the search results in the union may be displayed in an order.
In a specific example, there are 3 search results, namely search result 1, search result 2 and search result 3; the relevance scores output by the first target relevance model for them are 0.55, 0.60 and 0.85, respectively, the relevance scores output by the second target relevance model for them are 0.91, 0.70 and 0.95, respectively, the first score threshold corresponding to the first target relevance model is 0.80, and the second score threshold corresponding to the second target relevance model is 0.90. Then, each relevance score greater than 0.80 may be extracted from among 0.55, 0.60 and 0.85; at this time only 0.85 is extracted, so the P relevance scores satisfying the first preset condition include only 0.85, and the corresponding P search results include only search result 3. Similarly, each relevance score greater than 0.90 may be extracted from among 0.91, 0.70 and 0.95; at this time only 0.91 and 0.95 are extracted, so the Q relevance scores satisfying the second preset condition include only 0.91 and 0.95, and the corresponding Q search results include only search result 1 and search result 3. The union of the P search results and the Q search results therefore includes only search result 1 and search result 3, which may then be displayed in order on the search result page, so that all search results displayed on the search result page have a high degree of relevance to the index information.
It can be seen that, in the embodiment of the present disclosure, in combination with the first score threshold, P relevance scores may be reasonably extracted from K relevance scores obtained according to the first target relevance model, in combination with the second score threshold, Q relevance scores may be reasonably extracted from K relevance scores obtained according to the second target relevance model, and then, through processing of the union set and performing ranking display of search results based on the union set processing result, the first target relevance model and the second target relevance model can be effectively made to play a complementary role, thereby ensuring the reasonability of the search result page.
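The thresholding-and-union logic above might look roughly as follows; the helper name, score thresholds and inputs are illustrative assumptions.

```python
def select_and_merge(results, scores_model1, scores_model2,
                     threshold1: float, threshold2: float):
    """Keep results whose score from the first target relevance model exceeds the
    first score threshold (the P results) or whose score from the second target
    relevance model exceeds the second score threshold (the Q results), then
    return the union of the two sets."""
    p_results = {r for r, s in zip(results, scores_model1) if s > threshold1}
    q_results = {r for r, s in zip(results, scores_model2) if s > threshold2}
    return p_results | q_results

# With the worked example above: first-model scores 0.55/0.60/0.85 (threshold 0.80)
# and second-model scores 0.91/0.70/0.95 (threshold 0.90), the union contains
# search result 1 and search result 3.
union = select_and_merge(["result 1", "result 2", "result 3"],
                         [0.55, 0.60, 0.85], [0.91, 0.70, 0.95], 0.80, 0.90)
```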
In an alternative example, step 2055 includes:
acquiring a first model weight corresponding to the first target correlation model and a second model weight corresponding to the second target correlation model;
for each search result in the union, performing weighted summation on the corresponding relevance scores respectively obtained according to the first target relevance model and the second target relevance model by using the first model weight and the second model weight, so as to obtain a comprehensive score;
and sequencing and displaying all the search results in the union set according to the respective comprehensive scores of all the search results in the union set.
In the embodiment of the disclosure, the corresponding relationship between each target correlation model and the corresponding model weight can be preset, so that the first model weight corresponding to the first target correlation model and the second model weight corresponding to the second target correlation model can be conveniently and reliably obtained according to the preset corresponding relationship; wherein the first model weight may be represented as Z1 and the second model weight may be represented as Z2.
Next, for each search result in the union of P search results and Q search results, a weighted summation process may be performed by using the first model weight and the second model weight to obtain a composite score, so that all search results in the union may be displayed in a sorted manner according to all the obtained composite scores.
Continuing with the example in the above embodiment, in which the union includes only search result 1 and search result 3: since the relevance scores output by the first target relevance model for search result 1 and search result 3 are 0.55 and 0.85, respectively, and the relevance scores output by the second target relevance model for search result 1 and search result 3 are 0.91 and 0.95, respectively, the composite score S1 corresponding to search result 1 can be calculated as S1 = Z1 × 0.55 + Z2 × 0.91, and the composite score S2 corresponding to search result 3 can be calculated as S2 = Z1 × 0.85 + Z2 × 0.95. Thereafter, the composite score S1 and the composite score S2 can be compared; assuming the comparison shows that S2 is greater than S1, the degree of relevance of search result 3 to the index information can be considered higher than that of search result 1, so that search result 3 can be ranked before search result 1 when the search result page is displayed.
Therefore, in the embodiment of the present disclosure, by combining the first model weight and the second model weight, the comprehensive score corresponding to each search result in the union can be obtained conveniently and reliably, and each comprehensive score can very accurately represent the degree of relevance between the corresponding search result and the index information.
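A minimal sketch of the weighted-summation ranking, with illustrative model weights:

```python
def rank_by_composite_score(union_results, scores_model1, scores_model2,
                            z1: float, z2: float):
    """Weighted summation of the two relevance scores per result, followed by
    ranking by the comprehensive score. Model weights z1/z2 are assumptions."""
    composite = {r: z1 * scores_model1[r] + z2 * scores_model2[r]
                 for r in union_results}
    # A higher comprehensive score means the result is shown earlier on the page.
    return sorted(union_results, key=lambda r: composite[r], reverse=True)

# From the text: S1 = Z1*0.55 + Z2*0.91 and S2 = Z1*0.85 + Z2*0.95; with any
# positive weights, search result 3 is ranked before search result 1.
order = rank_by_composite_score(
    ["result 1", "result 3"],
    {"result 1": 0.55, "result 3": 0.85},
    {"result 1": 0.91, "result 3": 0.95},
    z1=0.5, z2=0.5)
```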
In an optional example, the correlation characteristic parameter includes at least one of a field hit coverage parameter of the domain data where the target data is located, a miss coverage parameter before the hit field, and a miss coverage parameter after the hit field;
step 203, comprising:
performing word segmentation processing on the domain data where the target data is located to obtain a first word segmentation processing result;
calculating a first sum of importance weights of all words in the first word segmentation processing result;
calculating a second sum of importance weights of all words used for forming the target data in the first word segmentation processing result, and calculating a field hit coverage rate parameter according to the second sum and the first sum;
and/or calculating a third sum of importance weights of all terms positioned before all terms forming the target data in the first word segmentation processing result, and calculating a miss coverage rate parameter before a hit field according to the third sum and the first sum;
and/or calculating a fourth sum of importance weights of all terms positioned after all terms used for forming the target data in the first word segmentation processing result, and calculating a miss coverage rate parameter after the hit field according to the fourth sum and the first sum.
In the embodiment of the present disclosure, a word segmentation tool may be used to perform word segmentation on the domain data where the target data is located (for example, the domain data corresponding to the name domain) to obtain a first word segmentation processing result. If the domain data is "Beijing Jindi Technology Co., Ltd.", the first word segmentation processing result may be represented as: ["Beijing", "Jindi", "Technology", "Co., Ltd."]. In addition, the importance weight of each word may be determined in advance by analyzing a text corpus in the company field; for example, the importance weight of "Beijing" may be 3.3322, the importance weight of "Jindi" may be 5.02418, the importance weight of "Technology" may be 1.75534, and the importance weight of "Co., Ltd." may be 0.742791. The sum of 3.3322, 5.02418, 1.75534 and 0.742791 may then be used as the first sum.
Suppose the index information is "Beijing Jindi" and the following result is obtained using the Elasticsearch tool: "<em>Beijing Jindi</em> Technology Co., Ltd.". The target data is obviously "Beijing Jindi". In this case, the words used for composing the target data in the first word segmentation processing result include "Beijing" and "Jindi"; there is no word located before the words composing the target data; and the words located after the words composing the target data include "Technology" and "Co., Ltd.". Then, the sum of 3.3322 and 5.02418 may be taken as the second sum, 0 may be taken as the third sum, and the sum of 1.75534 and 0.742791 may be taken as the fourth sum. At this point, the first sum, the second sum, the third sum and the fourth sum have all been obtained. Thereafter, the ratio of the second sum to the first sum may be taken as the field hit coverage parameter, the ratio of the third sum to the first sum may be taken as the miss coverage parameter before the hit field, and the ratio of the fourth sum to the first sum may be taken as the miss coverage parameter after the hit field, so that these three parameters can be calculated conveniently and reliably through simple division operations.
If the search result is represented as doc, the field hit coverage parameter may also be referred to as the doc coverage, because it is the ratio of the second sum to the first sum and may represent the importance weight ratio of the highlighted field (the portion marked with <em> tags) to the domain data corresponding to the name domain. Since the miss coverage parameter before the hit field is the ratio of the third sum to the first sum, it may also be referred to as the miss coverage before the highlighted field, which may represent the importance ratio of the continuous non-highlighted length counted from front to back. Since the miss coverage parameter after the hit field is the ratio of the fourth sum to the first sum, it may also be referred to as the miss coverage after the highlighted field, which may represent the importance ratio of the continuous non-highlighted length counted from back to front.
Therefore, in the embodiment of the disclosure, through word segmentation processing on the domain data and the introduction of importance weights, the field hit coverage parameter, the miss coverage parameter before the hit field and the miss coverage parameter after the hit field can be calculated conveniently and reliably.
It should be noted that, in the foregoing, the importance weight is introduced when calculating the field hit coverage parameter, the miss coverage parameter before the hit field and the miss coverage parameter after the hit field; according to the actual situation, the calculation of these coverage parameters may also omit the importance weight. For example, if the index information is "Jindi" and the Elasticsearch tool returns: "Beijing <em>Jindi</em> Technology Co., Ltd.", the target data is obviously "Jindi", the number of words in the first word segmentation processing result is 4, the number of words before the words composing the target data is 1, and the number of words after the words composing the target data is 2; then the miss coverage parameter before the hit field may be 1/4, and the miss coverage parameter after the hit field may be 2/4.
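The calculations above can be summarized in a short sketch. The following Python snippet is illustrative only and not part of the disclosed embodiment: the token list, the importance weights and the membership-based lookup of hit words are simplifications introduced here.

```python
def coverage_params(domain_tokens, hit_tokens, weights):
    """Field hit coverage, miss coverage before and after the hit field."""
    total = sum(weights[t] for t in domain_tokens)                  # first sum
    hit_idx = [i for i, t in enumerate(domain_tokens) if t in hit_tokens]
    first, last = hit_idx[0], hit_idx[-1]
    hit_sum = sum(weights[domain_tokens[i]] for i in hit_idx)       # second sum
    before = sum(weights[t] for t in domain_tokens[:first])         # third sum
    after = sum(weights[t] for t in domain_tokens[last + 1:])       # fourth sum
    return hit_sum / total, before / total, after / total

weights = {"Beijing": 3.3322, "Jindi": 5.02418,
           "Technology": 1.75534, "Co., Ltd.": 0.742791}
domain = ["Beijing", "Jindi", "Technology", "Co., Ltd."]
print(coverage_params(domain, {"Beijing", "Jindi"}, weights))
# ≈ (0.77, 0.0, 0.23): field hit coverage, miss before hit field, miss after hit field
```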
In an optional example, the correlation characteristic parameter includes an information hit coverage parameter of the index information;
determining a correlation characteristic parameter according to the target data, comprising:
performing word segmentation processing on the index information to obtain a second word segmentation processing result;
calculating a fifth sum of importance weights of all words in the second word segmentation processing result;
calculating a sixth sum of importance weights of all words forming the target data in the second word segmentation processing result;
and calculating the information hit coverage rate parameter according to the sixth sum and the fifth sum.
In the embodiment of the present disclosure, a word segmentation tool may be utilized to perform word segmentation processing on the index information to obtain a second word segmentation processing result. Assuming that the index information is "Beijing Jindi", the second word segmentation processing result may be represented as: ["Beijing", "Jindi"]. In addition, the importance weight of each word may be determined in advance by analyzing a text set in the company field; for example, the importance weight of "Beijing" may be 3.3322 and the importance weight of "Jindi" may be 5.02418, and then the sum of 3.3322 and 5.02418 may be used as the fifth sum.
Assume that the domain data is "Beijing Jindi Technology Co., Ltd." and the following result is obtained using the Elasticsearch tool: "<em>Beijing Jindi</em> Technology Co., Ltd.". The target data is obviously "Beijing Jindi", and the words used for composing the target data in the second word segmentation processing result are "Beijing" and "Jindi", so the sum of 3.3322 and 5.02418 may be used as the sixth sum. The fifth sum and the sixth sum are thus both obtained, and the ratio of the sixth sum to the fifth sum can then be used as the information hit coverage parameter, so that the information hit coverage parameter can be calculated conveniently and reliably through a simple division operation.
If the index information is represented as a query, since the information hit coverage parameter is the ratio of the sixth sum to the fifth sum, the information hit coverage parameter may also be referred to as the query coverage, which may represent the importance ratio of the part of the query that hits the highlighted field.
Therefore, in the embodiment of the disclosure, the information hit coverage rate parameter can be calculated conveniently and reliably by performing word segmentation processing on the index information and introducing the importance weight.
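For reference, a minimal sketch of the query coverage calculation under the same assumed tokens and weights; the function name and values are hypothetical.

```python
def query_coverage(query_tokens, hit_tokens, weights):
    fifth = sum(weights[t] for t in query_tokens)                     # fifth sum
    sixth = sum(weights[t] for t in query_tokens if t in hit_tokens)  # sixth sum
    return sixth / fifth

weights = {"Beijing": 3.3322, "Jindi": 5.02418}
print(query_coverage(["Beijing", "Jindi"], {"Beijing", "Jindi"}, weights))  # 1.0
```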
In an optional example, the correlation characteristic parameter includes a core word loss evaluation parameter;
determining a correlation characteristic parameter according to the target data, comprising:
determining core data in domain data in which the target data is located;
performing word segmentation processing on the core data to obtain a third word segmentation processing result;
calculating a seventh sum of importance weights of all words in the third word segmentation processing result;
performing word segmentation on the part hitting the core data in the target data to obtain a fourth word segmentation result;
calculating an eighth sum of importance weights of all words in the fourth word segmentation processing result;
and calculating the core word loss evaluation parameter according to the eighth sum and the seventh sum.
In the embodiment of the disclosure, a core word dictionary may be used to determine the core data in the domain data (e.g., the domain data corresponding to the name domain) where the target data is located. The core data refers to the most important component in the domain data, whose loss may cause the data to escape from its original meaning; for example, when the domain data is "Beijing Jindi Technology Co., Ltd.", the core data determined by the core word dictionary may be "Beijing Jindi", and of course the core data determined by the core word dictionary may also be "Jindi".
Next, a word segmentation tool may be utilized to perform word segmentation on the core data to obtain a third word segmentation result. Assuming that the core data is "Beijing Jindi", the third word segmentation result may be represented as: ["Beijing", "Jindi"]. In addition, the importance weight of each word may be determined in advance by analyzing a text set in the company field; for example, the importance weight of "Beijing" may be 3.3322 and the importance weight of "Jindi" may be 5.02418, and then the sum of 3.3322 and 5.02418 may be used as the seventh sum.
Assuming that the target data is "Jindi", the part of the target data hitting the core data is "Jindi", the fourth word segmentation processing result obtained after performing word segmentation on this part may be represented as ["Jindi"], and 5.02418 may be used as the eighth sum. The seventh sum and the eighth sum are thus both obtained, and the ratio of the eighth sum to the seventh sum can then be used as the core word loss evaluation parameter, so that the core word loss evaluation parameter can be calculated conveniently and reliably through a simple division operation.
Therefore, in the embodiment of the disclosure, the core word loss evaluation parameters can be calculated conveniently and reliably by performing word segmentation on the core data in the domain data and the part hitting the core data in the target data and introducing the importance weight.
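A corresponding sketch of the core word loss evaluation parameter, again with assumed tokens and weights:

```python
def core_word_loss(core_tokens, hit_core_tokens, weights):
    seventh = sum(weights[t] for t in core_tokens)       # seventh sum
    eighth = sum(weights[t] for t in hit_core_tokens)    # eighth sum
    return eighth / seventh

weights = {"Beijing": 3.3322, "Jindi": 5.02418}
# Core data segmented as ["Beijing", "Jindi"]; the target data hits only "Jindi".
print(core_word_loss(["Beijing", "Jindi"], ["Jindi"], weights))  # ≈ 0.601
```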
In an optional example, the correlation characteristic parameters comprise a hit window length ratio parameter;
determining a correlation characteristic parameter according to the target data, comprising:
determining a data window which is composed of all data from a first field of first target data to an end field of last target data in domain data where the target data is located;
performing word segmentation processing on the data window to obtain a fifth word segmentation processing result;
and calculating the length ratio parameter of the hit window according to the number of the words forming the target data in the fifth word processing result and the number of the words in the fifth word processing result.
In the embodiment of the present disclosure, it is assumed that the domain data where the target data is located is "Beijing Jindi Technology Co., Ltd." and the following result is obtained using the Elasticsearch tool: "Beijing <em>Jindi</em> Technology <em>Co., Ltd.</em>". There are obviously two pieces of target data, namely "Jindi" and "Co., Ltd.", and the data window composed of all data from the first field of the first target data to the last field of the last target data in the domain data is "Jindi Technology Co., Ltd.".
Then, a word segmentation tool may be utilized to perform word segmentation processing on the data window to obtain a fifth word segmentation processing result, which may be expressed as: ["Jindi", "Technology", "Co., Ltd."]. The number of words used for composing the target data in the fifth word segmentation processing result is 2, and the number of words in the fifth word segmentation processing result is 3, so the ratio 2/3 between 2 and 3 can be used as the hit window length ratio parameter, which can thus be calculated conveniently and reliably through a simple division operation.
It should be noted that the data window may also be referred to as the highlighted window, and since the hit window length ratio parameter is the ratio of the number of words used for composing the target data in the fifth word segmentation processing result to the total number of words in the fifth word segmentation processing result, the hit window length ratio parameter may also be regarded as the ratio of the highlighted length within the highlighted window.
Therefore, in the embodiment of the disclosure, by determining the data window and performing word segmentation processing on the data window, the length ratio parameter of the hit window can be calculated conveniently and reliably.
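The hit window length ratio can be sketched as follows; the boolean hit flags stand in for the <em> highlighting returned by the search tool and are an assumption of this illustration.

```python
def hit_window_ratio(domain_tokens, hit_flags):
    # hit_flags[i] is True when domain_tokens[i] belongs to some target data.
    hit_idx = [i for i, hit in enumerate(hit_flags) if hit]
    start, end = hit_idx[0], hit_idx[-1] + 1
    window = domain_tokens[start:end]           # words inside the highlighted window
    return sum(hit_flags[start:end]) / len(window)

tokens = ["Beijing", "Jindi", "Technology", "Co., Ltd."]
flags = [False, True, False, True]              # "Jindi" and "Co., Ltd." are highlighted
print(hit_window_ratio(tokens, flags))          # 2/3
```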
In an optional example, the correlation characteristic parameter includes at least one of a hit number ratio parameter and a distance escape parameter, the hit number ratio parameter is a ratio of the number of target data in the domain data where the target data is located to a hit number threshold, and the distance escape parameter is an edit distance between the index information and the target data.
Here, the hit number threshold may be a number threshold set in advance.
In the embodiment of the present disclosure, it is assumed that the domain data where the target data is located is "Beijing Jindi Technology Co., Ltd." and the following result is obtained using the Elasticsearch tool: "<em>Beijing Jindi</em> Technology Co., Ltd.". The number of pieces of target data in the domain data is obviously 1, and if the hit number threshold is 4, 1/4 can be used as the hit number ratio parameter. Suppose instead that the Elasticsearch tool returns "Beijing <em>Jindi</em> Technology <em>Co., Ltd.</em>"; the number of pieces of target data in the domain data is obviously 2, and if the hit number threshold is 4, 2/4 can be used as the hit number ratio parameter. The hit number ratio parameter may also be referred to as the ratio of the number of highlight tags.
In the embodiment of the present disclosure, assume that the index information is a one-character variant of the company's core word, the domain data is "Beijing Jindi Technology Co., Ltd.", and the Elasticsearch tool returns a result in which the target data is "Jindi". The edit distance between the index information and the target data can then be calculated, and since the calculated edit distance is 1, 1 can be used as the distance escape parameter.
Therefore, in the embodiment of the disclosure, the hit number ratio parameter can be calculated very conveniently and reliably by dividing the number of the target data by the hit number threshold, and the distance escape parameter can be calculated very conveniently and reliably by determining the edit distance.
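A sketch of these two quantities follows; the hit number threshold of 4 and the one-character variant "Jinde" are illustrative assumptions, and the Levenshtein routine is a generic implementation rather than the one used in the embodiment.

```python
def hit_number_ratio(num_hits, hit_threshold=4):
    # Number of target data in the domain data divided by a preset hit number threshold.
    return num_hits / hit_threshold

def edit_distance(a, b):
    # Classic Levenshtein distance, used here as the distance escape parameter.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

print(hit_number_ratio(1))               # 0.25
print(edit_distance("Jindi", "Jinde"))   # 1
```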
In combination with the above, a plurality of parameters can be calculated according to the target data. Specifically, as shown in fig. 4, through basic literal matching, the query coverage, the doc coverage, the ratio of the number of highlight tags, the miss coverage before the highlighted field, the miss coverage after the highlighted field, and the ratio of the highlighted length within the highlighted window can be calculated; through escape matching, the distance escape parameter and the core word loss evaluation parameter can be calculated. Based on the calculated parameters, the correlation characteristic parameter may be composed for subsequent processing.
In one optional example, the method further comprises:
obtaining R sample data from historical user data; each sample data in the R sample data comprises sample index information, a sample search result and the frequency of occurrence of a preset event on the sample search result;
dividing the R sample data into a positive sample data group and a negative sample data group according to whether the times of the R sample data are zero or not;
extracting S positive sample data from the positive sample data group, and extracting T negative sample data from the negative sample data group; wherein S and T satisfy a preset proportional relationship;
and training to obtain a target correlation model according to the S positive sample data and the T negative sample data.
Here, the preset event may be a click event; the condition that S and T satisfy the preset proportional relationship can be as follows: the ratio of S to T is 1: 1.
In the embodiment of the disclosure, historical user data may be acquired from a user database. The historical user data may include a user behavior log of each of a plurality of users, and the user behavior log of each user may record at what time the user performed a search, what index information was used (which may subsequently be used as sample index information), what search results were obtained (which may subsequently be used as sample search results), whether the search results were clicked, and the like. Therefore, R sample data may be acquired by performing statistics and analysis on the historical user data, and each sample data may include sample index information, a sample search result, and the number of times a preset event occurred on the sample search result; here, R may be 2000, 5000, 10000 or another integer, which are not listed one by one.
Next, the R sample data may be divided into a positive sample data group and a negative sample data group according to whether the number of times included in each of the R sample data is zero. Specifically, the positive sample data group may include all sample data with a non-zero number of times among the R sample data, and each sample data in the positive sample data group may be regarded as one positive sample data; the negative sample data group may include all sample data with a number of times of zero among the R sample data, and each sample data in the negative sample data group may be regarded as one negative sample data. The ratio of the number of positive sample data in the positive sample data group to the number of negative sample data in the negative sample data group may be about 1:13.
Then, according to a preset proportional relationship (for example, according to a ratio of 1: 1), extraction processing may be performed on the positive sample data group and the negative sample data group, so as to extract S positive sample data from the positive sample data group, and extract T negative sample data from the negative sample data group; wherein, S and T may be 1000, 2500, 5000 or other integers, which are not listed here.
And then, training to obtain a target correlation model according to the S positive sample data and the T negative sample data. Specifically, a set composed of S positive sample data and T negative sample data may be used as a training sample library, and training of the target correlation model is performed according to the set; or, the S positive sample data and the T negative sample data may be further filtered, and a set of the positive sample data and the negative sample data obtained after the further filtering is used as a training sample library, and the training of the target correlation model is performed according to the set of the positive sample data and the negative sample data.
In the embodiment of the disclosure, after R sample data are acquired from historical user data, positive and negative sample division can be performed according to whether the number of times that the R sample data respectively include is zero, so that a positive sample data group and a negative sample data group can be conveniently and reliably obtained, then, extraction processing can be performed on the positive sample data group and the negative sample data group respectively, and the number of the positive sample data extracted from the positive sample data group and the number of the negative sample data extracted from the negative sample data group meet a preset proportional relationship, so that on one hand, the number of the data samples used for model training can be reduced, so that the model training speed is increased, on the other hand, the uniformity of the positive and negative samples used for model training can be improved, and the model training effect is ensured.
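A minimal sketch of this split and the balanced 1:1 extraction, assuming samples are (query, doc, click_count) triples; the data and random seed are illustrative only.

```python
import random

def split_and_sample(samples, seed=0):
    """samples: (query, doc, click_count) triples mined from the user behavior log."""
    positives = [s for s in samples if s[2] > 0]     # clicked at least once
    negatives = [s for s in samples if s[2] == 0]    # never clicked
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))          # enforce the 1:1 extraction ratio
    return rng.sample(positives, n), rng.sample(negatives, n)

raw = [("Jindi", "doc_1", 12), ("Jindi", "doc_2", 0),
       ("Beijing Jindi", "doc_3", 3), ("Beijing Jindi", "doc_4", 0)]
pos, neg = split_and_sample(raw)
print(len(pos), len(neg))  # 2 2
```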
In an optional example, training to obtain a target correlation model according to the S positive sample data and the T negative sample data includes:
extracting, according to a times threshold, U positive sample data of which the number of times meets a third preset condition from the S positive sample data;
according to the T negative sample data and the initial correlation model, obtaining a correlation score between a sample search result and sample index information included in each negative sample data in the T negative sample data;
extracting V negative sample data meeting a fourth preset condition from the T negative sample data according to the acquired relevance score;
and carrying out model training by utilizing the U positive sample data and the V negative sample data to obtain a target correlation model.
Here, the times threshold may be a preset number of times, and may be 10, 15, 20 or other values, which are not listed here.
Here, the initial correlation model may be a model that is trained in advance by using a machine learning algorithm and is capable of performing correlation score prediction, and a training manner of the initial correlation model and a training manner of the target correlation model may be substantially similar, but an accuracy requirement of the initial correlation model may be lower than an accuracy requirement of the target correlation model.
In the embodiment of the present disclosure, after obtaining S pieces of positive sample data, U pieces of positive sample data whose number of times satisfies a third preset condition may be extracted from the S pieces of positive sample data according to a number of times threshold. Optionally, each positive sample data included in the S positive sample data whose number of times is greater than the number threshold may be screened, and all the screened positive sample data may be used as U positive sample data satisfying the third preset condition, or further screening may be performed on the basis of all the screened positive sample data, and all the further screened positive sample data may be used as U positive sample data satisfying the third preset condition.
After the T pieces of negative sample data are obtained, the relevance score of the sample search result included in each negative sample data in the T pieces of negative sample data and the sample index information, which are output by the initial relevance model, may be obtained according to the T pieces of negative sample data and the initial relevance model.
Specifically, for each negative sample data in the T negative sample data, the correlation characteristic parameter may be determined based on the sample index information and the sample search result included in the sample data (the specific determination manner may refer to the above description of step 102 to step 103, and is not described here again). When the number of the determined correlation characteristic parameters is only one, inputting the one of the determined correlation characteristic parameters into the initial correlation model to obtain a correlation score between a sample search result included in the negative sample data and the sample index information, which is output by the initial correlation model; under the condition that the number of the determined parameters in the correlation characteristic parameters is at least two, at least two of the determined parameters in the correlation characteristic parameters can be combined to obtain a combined characteristic vector, and then the combined characteristic vector is input into the initial correlation model to obtain a correlation score of a sample search result and sample index information, wherein the sample search result and the sample index information are included in the negative sample data and are output by the initial correlation model.
After acquiring the corresponding correlation scores respectively output by the initial correlation model for each negative sample data in the T negative sample data, V negative sample data meeting a fourth preset condition may be extracted from the T negative sample data according to the acquired correlation scores. Optionally, each negative sample data with the numerical value of the corresponding correlation score sorted in a later preset proportion (for example, later 60%) may be screened from the T negative sample data, and all the screened negative sample data may be used as V negative sample data meeting a fourth preset condition, or further screening may be performed on the basis of all the screened negative sample data, and all the further screened negative sample data may be used as V negative sample data meeting the fourth preset condition.
And then, performing model training by using U positive sample data and V negative sample data to obtain a target correlation model. Optionally, the process of model training specifically may include: (1) taking a set consisting of U positive sample data and V negative sample data as a training sample library; (2) for each sample data in the training sample library, determining the corresponding label as 0 under the condition that the included times of the sample data is zero, otherwise, determining the corresponding label as 1; (3) for each sample data in the training sample library, determining a correlation characteristic parameter based on the sample index information and the sample search result included in the sample data; (4) and taking the correlation characteristic parameter corresponding to each sample data in the training sample library as input data, and taking the label corresponding to each sample data in the training sample library as output data for training to obtain a target correlation model.
It should be noted that, as experiments show, a sample confidence problem may occur when the distinction between positive and negative samples relies entirely on click behavior: the click behavior of a positive sample may come from an accidental click by the user, and the absence of click behavior for a negative sample does not mean that the objective relevance is low. In view of this, in the embodiments of the present disclosure, a hierarchical policy may be used to extract sample data. Specifically, U positive sample data whose number of times satisfies the third preset condition may be extracted from the S positive sample data according to the times threshold, which is conducive to extracting sample data with high click counts; and V negative sample data satisfying the fourth preset condition may be extracted from the T negative sample data according to the relevance score, output by the initial relevance model, of the sample search result and the sample index information included in each of the T negative sample data, so that negative sample data with high relevance scores can be removed. Therefore, the embodiment of the disclosure can well guarantee the rationality of the training sample library and thus the model training effect.
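The hierarchical extraction and the subsequent training can be sketched as follows. LogisticRegression is used only as a stand-in learner, and the feature function, initial model, thresholds and toy data are assumptions of this illustration, not the disclosed algorithm.

```python
from sklearn.linear_model import LogisticRegression  # stand-in learner only

def select_training_samples(positives, negatives, features, initial_model,
                            times_threshold=10, keep_ratio=0.6):
    # U positives: click count above the times threshold (third preset condition).
    u_pos = [s for s in positives if s[2] > times_threshold]
    # Score every negative with the initial relevance model and keep the fraction
    # whose scores rank lowest (fourth preset condition).
    ranked = sorted(negatives, key=lambda s: initial_model(features(s)))
    v_neg = ranked[: int(len(ranked) * keep_ratio)]
    return u_pos, v_neg

def train_target_model(u_pos, v_neg, features):
    X = [features(s) for s in u_pos + v_neg]
    y = [1] * len(u_pos) + [0] * len(v_neg)          # label 1 if clicked, else 0
    return LogisticRegression().fit(X, y)

# Toy usage with a dummy feature function and a dummy initial model.
feats = lambda s: [len(s[0]), s[2]]
init_model = lambda x: 0.1 * x[0]
pos = [("Jindi", "doc_1", 25), ("Jindi", "doc_2", 3)]
neg = [("Beijing", "doc_3", 0), ("Technology", "doc_4", 0)]
u, v = select_training_samples(pos, neg, feats, init_model)
target_model = train_target_model(u, v, feats)
```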
It should be noted that, as shown in fig. 5, the implementation process of the embodiment of the present disclosure may be divided into 4 parts, namely sample construction, feature design, model training and scoring application. In sample construction, basic data are taken from a basic data source and then preprocessed to obtain the positive and negative samples for model training (which correspond to the U positive sample data and V negative sample data above). The main function of feature design is to design reasonable relevance operators (which corresponds to the determination of the correlation characteristic parameters above). The main function of model training is to use a machine learning algorithm suitable for the embodiments of the present disclosure to train a model suited to the company search scenario (that is, the above target correlation model). The main function of the scoring application is to deploy the model into a project and then automatically score and grade the doc set obtained by searching with the query (which is equivalent to sorting and displaying all docs in the doc set), so as to avoid docs with low relevance being ranked at the front in company search.
Since earlier data processing has a huge influence on the later stages, sample construction can be considered the factor that most affects the accuracy of the model. As shown in fig. 6, when constructing samples, the queries with click behavior records, the clicked company docs and the numbers of clicks may be obtained by mining the user behavior log. Then, the docs with click behavior may be used to simulate access to the recall service; the recall returns a doc set composed of all docs related to the query, each doc includes a plurality of domain data, and the fields of the query hit in each domain data are highlighted, so that query-doc pairs are obtained by accessing the recall service. Then, screening of clicked and non-clicked samples can be carried out: each sample (which corresponds to the sample data above) may be composed of 3 factors, namely the query, the doc and the number of clicks (0 by default when there is no click behavior), and the samples are preliminarily screened according to whether each doc in the doc set under a clicked query has click behavior, thereby completing the screening of basic samples (which corresponds to the division into the positive sample data group and the negative sample data group above). Sampling can then be carried out; the purpose of sampling is to make reasonable use of the positive and negative samples for model training. Positive and negative samples are distinguished according to whether there is a click, since a user clicking a certain doc in the search for a query is evidence that the query and the doc have a certain correlation. It should be noted that, after executing the first 3 steps in fig. 6, the obtained ratio of clicked samples to non-clicked samples (which corresponds to the ratio of the number of positive sample data in the positive sample data group to the number of negative sample data in the negative sample data group above) is about 1:13. If all the samples were used for training, there would be two problems: first, the training set would be large and training slow; second, the positive and negative samples would be unbalanced, which would affect the final training effect. In addition, in order to avoid the sample confidence problem caused by relying entirely on click behavior to determine positive and negative samples, a strategy of hierarchically extracting samples may also be adopted: for positive samples, those with high click counts are extracted (which corresponds to extracting the U positive sample data whose number of times satisfies the third preset condition from the S positive sample data), and for negative samples, those to which the model assigns relatively high prediction scores are culled (which corresponds to extracting the V negative sample data satisfying the fourth preset condition from the T negative sample data).
The complexity of feature design lies in describing the relationship between the two entities query and doc through operators, forming numbers with physical meaning. Specifically, as shown in fig. 4, in the embodiment of the present disclosure, two types of feature parameters may be designed based on the highlighting information. One type is the basic literal feature parameters, 6 in total, namely the query coverage, the doc coverage, the ratio of the number of highlight tags, the miss coverage before the highlighted field, the miss coverage after the highlighted field, and the ratio of the highlighted length within the highlighted window. The other type is the escape matching feature parameters, 2 in total, namely the distance escape parameter and the core word loss evaluation parameter. The two types of feature parameters may together constitute the correlation characteristic parameter.
After the preceding sample construction and feature design are completed, the correlation characteristic parameters can be used as input parameters of the model, and a machine learning algorithm is used for training to obtain the target correlation model. Optionally, the number of target correlation models may be two: one may be obtained by constructing samples and features for the name domain alone and may also be referred to as the name domain relevance model; the other may be obtained by constructing samples and features over all description domains combined and may be referred to as the global relevance model.
After model training is completed, the name domain relevance model and the global relevance model can be applied to the architecture of the company search scenario. During scoring and grading, after the doc set is obtained by searching with the query, the relevance score of each doc with the query can be predicted by the two models respectively, a suitable score threshold can be formulated for each model, and finally the results of the two models can be combined in use so as to automatically sort and display all docs in the doc set.
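A sketch of how the two models' scores might be combined at serving time; the thresholds, weights and toy models below are assumptions for illustration, not values from the embodiment.

```python
def rank_results(docs, query, name_model, global_model,
                 name_threshold=0.5, global_threshold=0.5,
                 name_weight=0.6, global_weight=0.4):
    """Keep the union of docs passing either model's threshold, then sort by
    the weighted sum of the two relevance scores."""
    combined = {}
    for doc in docs:
        s_name = name_model(query, doc)      # name domain relevance model
        s_glob = global_model(query, doc)    # global relevance model
        if s_name >= name_threshold or s_glob >= global_threshold:
            combined[doc] = name_weight * s_name + global_weight * s_glob
    return sorted(combined, key=combined.get, reverse=True)

# Toy models returning fixed scores, for illustration only.
name_m = lambda q, d: {"doc_1": 0.9, "doc_2": 0.2}[d]
glob_m = lambda q, d: {"doc_1": 0.7, "doc_2": 0.6}[d]
print(rank_results(["doc_1", "doc_2"], "Jindi", name_m, glob_m))  # ['doc_1', 'doc_2']
```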
In summary, in the embodiment of the disclosure, by innovatively utilizing external information, that is, the user information reflected in user click logs and the highlighting information of the recall service, and combining them with the relevance feature design, the relevance perceived by the user can be greatly improved, and the iteration of the highlighting service and of the relevance model can promote each other. By performing relevance modeling in the company search field, extracting a plurality of description domains for modeling, and distinguishing the importance of different domains to perform global and domain-specific modeling, the core function of relevance in the company search scenario can be fully exploited. By applying the relevance models to the company search scenario and automatically scoring and grading documents, the quality of search results and the rationality of the search result page can be ensured, and the search experience of users can be improved.
Fig. 7 schematically shows a block diagram of a search result processing apparatus according to an embodiment of the present disclosure. The search result processing apparatus provided in the embodiment of the present disclosure may be disposed on a terminal device, may be disposed on a server, or may be partially disposed on a terminal device and partially disposed on a server; for example, it may be disposed on the server 105 in fig. 1 (which may be set according to the actual situation), but the present disclosure is not limited thereto.
The search result processing apparatus provided by the embodiment of the present disclosure may include a first obtaining module 701, a first determining module 702, a second determining module 703, a second obtaining module 704, and a processing module 705.
A first obtaining module 701, configured to obtain index information and a search result obtained by performing a search using the index information; the search result comprises N domain data corresponding to the N description domains;
a first determining module 702, configured to determine, from domain data corresponding to a description domain associated with the target relevance model among the N description domains, target data hit by the index information;
a second determining module 703, configured to determine a correlation characteristic parameter according to the target data;
a second obtaining module 704, configured to obtain a relevance score between the search result and the index information according to the relevance feature parameter and the target relevance model;
the processing module 705 is configured to perform display processing on the search result according to the obtained relevance score.
In an optional example, the number of the target relevance models is at least two, wherein one target relevance model is a first target relevance model, the other target relevance model is a second target relevance model, the description domain associated with the first target relevance model is a name domain, the description domain associated with the second target relevance model comprises at least two description domains, and the number of the search results is K;
as shown in fig. 8, the processing module 705 includes:
a first obtaining sub-module 7051, configured to obtain a first score threshold corresponding to the first target correlation model and a second score threshold corresponding to the second target correlation model;
the first extraction submodule 7052 is configured to extract, according to a first score threshold, P relevance scores that satisfy a first preset condition from K relevance scores corresponding to K search results obtained according to the first target relevance model;
the second extraction submodule 7053 is configured to extract, according to a second score threshold, Q relevance scores that satisfy a second preset condition from K relevance scores corresponding to K search results obtained according to the second target relevance model;
a first determining sub-module 7054, configured to determine a union of P search results corresponding to the P relevance scores and Q search results corresponding to the Q relevance scores;
and the display sub-module 7055 is configured to perform ranking display on all search results in the union.
In an optional example, the display sub-module 7055 includes:
the first obtaining unit is used for obtaining a first model weight corresponding to the first target correlation model and a second model weight corresponding to the second target correlation model;
the second obtaining unit is used for carrying out weighted summation on corresponding relevance scores respectively obtained according to the first target relevance model and the second target relevance model by utilizing the first model weight and the second model weight aiming at each search result in the union set so as to obtain a comprehensive score;
and the display unit is used for sequencing and displaying all the search results in the union set according to the respective comprehensive scores of all the search results in the union set.
In an optional example, the correlation characteristic parameter includes at least one of a field hit coverage parameter of the domain data where the target data is located, a miss coverage parameter before the hit field, and a miss coverage parameter after the hit field;
a second determining module 703, comprising:
the first processing submodule is used for performing word segmentation processing on the domain data where the target data is located to obtain a first word segmentation processing result;
the first calculation submodule is used for calculating a first sum of importance weights of all words in the first word segmentation processing result;
the second determining submodule is used for calculating a second sum of importance weights of all words forming the target data in the first word segmentation processing result and calculating a field hit coverage rate parameter according to the second sum and the first sum;
and/or the second determining submodule is used for calculating a third sum of importance weights of all terms positioned before all terms for forming the target data in the first word segmentation processing result, and calculating a miss coverage rate parameter before a hit field according to the third sum and the first sum;
and/or the second determining submodule is used for calculating a fourth sum of importance weights of all terms positioned after all terms used for forming the target data in the first word segmentation processing result, and calculating the miss coverage rate parameter after the hit field according to the fourth sum and the first sum.
In an optional example, the correlation characteristic parameter includes an information hit coverage parameter of the index information;
a second determining module 703, comprising:
the second processing submodule is used for performing word segmentation processing on the index information to obtain a second word segmentation processing result;
the second calculation submodule is used for calculating a fifth sum of importance weights of all words in the second word segmentation processing result;
the third calculation submodule is used for calculating a sixth sum of importance weights of all words forming the target data in the second word segmentation processing result;
and the third determining submodule is used for calculating the information hit coverage rate parameter according to the sixth sum and the fifth sum.
In an optional example, the correlation characteristic parameter includes a core word loss evaluation parameter;
a second determining module 703, comprising:
the fourth determining submodule is used for determining core data in the domain data where the target data are located;
the third processing submodule is used for carrying out word segmentation processing on the core data to obtain a third word segmentation processing result;
the fourth calculation submodule is used for calculating a seventh sum of importance weights of all words in the third word segmentation processing result;
the fourth processing submodule is used for performing word segmentation processing on the part hitting the core data in the target data to obtain a fourth word segmentation processing result;
the fifth calculation submodule is used for calculating an eighth sum of importance weights of all words in the fourth word segmentation processing result;
and the fifth determining submodule is used for calculating the core word loss evaluation parameter according to the eighth sum and the seventh sum.
In an optional example, the correlation characteristic parameters comprise a hit window length ratio parameter;
a second determining module 703, comprising:
a sixth determining submodule, configured to determine a data window, in the domain data where the target data is located, where the data window is formed by all data from a first field of the first target data to a last field of the last target data;
the fifth processing submodule is used for performing word segmentation processing on the data window to obtain a fifth word segmentation processing result;
and the seventh determining submodule is used for calculating the length ratio parameter of the hit window according to the number of the words forming the target data in the fifth word processing result and the number of the words in the fifth word processing result.
In an optional example, the correlation characteristic parameter includes at least one of a hit number ratio parameter and a distance escape parameter, the hit number ratio parameter is a ratio of the number of target data in the domain data to a hit number threshold, and the distance escape parameter is an edit distance between the index information and the target data.
In one optional example, the apparatus further comprises:
the third acquisition module is used for acquiring R sample data from historical user data; each sample data in the R sample data comprises sample index information, a sample search result and the frequency of occurrence of a preset event on the sample search result;
the dividing module is used for dividing the R sample data into a positive sample data group and a negative sample data group according to whether the times of the R sample data are zero or not;
the extraction module is used for extracting S positive sample data from the positive sample data group and extracting T negative sample data from the negative sample data group; wherein S and T satisfy a preset proportional relationship;
and the training module is used for training to obtain a target correlation model according to the S positive sample data and the T negative sample data.
In one optional example, a training module, comprising:
the third extraction submodule is used for extracting, according to the times threshold, U positive sample data of which the number of times meets a third preset condition from the S positive sample data;
the second obtaining sub-module is used for obtaining the relevance scores of the sample search results and the sample index information included in each negative sample data in the T negative sample data according to the T negative sample data and the initial relevance model;
the fourth extraction submodule is used for extracting V negative sample data meeting a fourth preset condition from the T negative sample data according to the acquired relevance scores;
and the training submodule is used for carrying out model training by utilizing the U positive sample data and the V negative sample data to obtain a target correlation model.
According to the search result processing device in the embodiment of the disclosure, after the search is performed by using the index information, the target data hit by the index information can be determined from the appropriate domain data by combining the incidence relation between the model and the description domain, then, the relevance score which can effectively represent the relevance degree of the search result and the index information can be obtained by combining the relevance characteristic parameter determination processing and the model-based relevance score acquisition processing, and the display processing of the search result can be performed in an appropriate manner according to the obtained relevance score, so that the rationality of the search result page can be favorably ensured, and the high-quality user search experience can be ensured.
The specific implementation of each module, unit and subunit in the search result processing apparatus provided in the embodiment of the present disclosure may refer to the content in the search result processing method, and is not described herein again.
It should be noted that although several modules, units and sub-units of the apparatus for action execution are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functionality of two or more modules, units and sub-units described above may be embodied in one module, unit and sub-unit, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module, unit and sub-unit described above may be further divided into embodiments by a plurality of modules, units and sub-units.
As shown in FIG. 9, the example electronic device 90 includes a processor 901 for executing software routines. Although a single processor is shown for clarity, the electronic device 90 may also include a multi-processor system. The processor 901 is connected to a communication infrastructure 902 for communicating with other components of the electronic device 90. The communication infrastructure 902 may include, for example, a communication bus, a crossbar, or a network.
Electronic device 90 also includes Memory, such as Random Access Memory (RAM), which may include a main Memory 903 and a secondary Memory 910. The secondary memory 910 may include, for example, a hard disk drive 911 and/or a removable storage drive 912, which may include a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. Removable storage drive 912 reads from and/or writes to a removable storage unit 913 in a conventional manner. Removable storage unit 913 may comprise a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 912. As will be appreciated by those skilled in the relevant art(s), the removable storage unit 913 includes a computer-readable storage medium having stored thereon computer-executable program code instructions and/or data.
In an alternative embodiment, the secondary memory 910 may additionally or alternatively include other similar means for allowing computer programs or other instructions to be loaded into the electronic device 90. Such means may include, for example, a removable storage unit 921 and an interface 920. Examples of removable storage unit 921 and interface 920 include: a program cartridge and cartridge interface (such as that found in video game console devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 921 and interfaces 920 which allow software and data to be transferred from the removable storage unit 921 to electronic device 90.
The electronic device 90 also includes at least one communication interface 940. Communications interface 940 allows software and data to be transferred between electronic device 90 and external devices via communications path 941. In various embodiments of the present disclosure, the communication interface 940 allows data to be transferred between the electronic device 90 and a data communication network, such as a public data or private data communication network. The communication interface 940 may be used to exchange data between different electronic devices 90, which electronic devices 90 form part of an interconnected computer network. Examples of communication interface 940 may include a modem, a network interface (such as an ethernet card), a communication port, an antenna with associated circuitry, and so forth. The communication interface 940 may be wired or may be wireless. Software and data transferred via communications interface 940 are in the form of signals which may be electronic, magnetic, optical or other signals capable of being received by communications interface 940. These signals are provided to a communications interface via communications path 941.
As shown in fig. 9, the electronic device 90 further includes a display interface 931 to perform operations for rendering images to an associated display 930, and an audio interface 932 to perform operations for playing audio content through an associated speaker 933.
In this document, the term "computer program product" may refer, in part, to: a removable storage unit 913, a removable storage unit 921, a hard disk installed in the hard disk drive 911, or a carrier wave carrying software over a communication path 941 (wireless link or cable) to the communication interface 940. Computer-readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to electronic device 90 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROMs, DVDs, Blu-ray (TM) disks, hard disk drives, ROMs, or integrated circuits, USB memory, magneto-optical disks, or a computer-readable card, such as a PCMCIA card, etc., whether internal or external to the electronic device 90. Transitory or non-tangible computer-readable transmission media may also participate in providing software, applications, instructions, and/or data to the electronic device 90, examples of such transmission media including radio or infrared transmission channels, network connections to another computer or another networked device, and the internet or intranet including e-mail transmissions and information recorded on websites and the like.
Computer programs (also called computer program code) are stored in the main memory 903 and/or the secondary memory 910. Computer programs may also be received via communications interface 940. Such computer programs, when executed, enable the electronic device 90 to perform one or more features of embodiments discussed herein. In various embodiments, the computer programs, when executed, enable the processor 901 to perform the features of the embodiments described above. Accordingly, such computer programs represent controllers of the computer system.
The software may be stored in a computer program product and loaded into electronic device 90 using removable storage drive 912, hard disk drive 911 or interface 920. Alternatively, the computer program product may be downloaded to the electronic device 90 via communications path 941. The software, when executed by the processor 901, causes the electronic device 90 to perform the functions of the embodiments described herein.
It should be understood that the embodiment of fig. 9 is given by way of example only. Accordingly, in some embodiments, one or more features of the electronic device 90 may be omitted. Also, in some embodiments, one or more features of the electronic device 90 may be combined together. Additionally, in some embodiments, one or more features of the electronic device 90 may be separated into one or more components.
It will be appreciated that the elements shown in fig. 9 serve to provide a means for performing the various functions and operations of the server described in the above embodiments.
In one embodiment, a server may be generally described as a physical device including at least one processor and at least one memory including computer program code. The at least one memory and the computer program code are configured to, with the at least one processor, cause the physical device to perform necessary operations.
Embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the functions of the method shown in fig. 2 to 3.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by an electronic device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
An embodiment of the present application further provides a computer program, which includes a computer readable code, and when the computer readable code runs on a device, a processor in the device executes instructions of the steps in the search result processing method.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be essentially, or in the parts contributing to the prior art, implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
The basic principles of the present invention have been described above with reference to specific embodiments, but it should be noted that the advantages, effects, etc. mentioned in the present invention are only examples and are not limiting; these advantages and effects should not be regarded as necessarily possessed by every embodiment of the present invention. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description only and is not intended to be limiting, since the invention is not limited to the specific details described above.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions.
The method and apparatus of the present invention may be implemented in a number of ways. For example, the methods and apparatus of the present invention may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (14)

1. A method for processing search results, comprising:
acquiring index information and a search result obtained by searching with the index information; the search result comprises N pieces of domain data corresponding to N description domains;
determining target data hit by the index information from domain data corresponding to a description domain associated with a target relevance model among the N description domains;
determining a relevance characteristic parameter according to the target data;
obtaining a relevance score of the search result and the index information according to the relevance characteristic parameter and the target relevance model;
and displaying the search result according to the acquired relevance score.
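The following is a minimal, illustrative Python sketch of the flow recited in claim 1. It is not the claimed method itself: the helper names (find_hits, feature_fn, predict_score) are hypothetical, and the hit detection is deliberately simplified to substring matching of query terms.

def find_hits(index_info, domain_data):
    # Simplified hit detection: every whitespace-separated query term that
    # occurs in the domain data is treated as a piece of target data.
    return [term for term in index_info.split() if term and term in domain_data]

def score_and_rank(index_info, search_results, model, associated_domains, feature_fn):
    # index_info: the query; search_results: list of dicts mapping description
    # domains to domain data; model: an object exposing predict_score(features);
    # associated_domains: the description domains associated with the target
    # relevance model; feature_fn: computes the relevance characteristic parameters.
    scored = []
    for result in search_results:
        target_data = []
        for domain in associated_domains:
            target_data.extend(find_hits(index_info, result.get(domain, "")))
        features = feature_fn(index_info, target_data, result)
        scored.append((model.predict_score(features), result))
    # Display (here: return) the search results ordered by relevance score.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)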
2. The method according to claim 1, wherein the number of the target relevance models is at least two, one of the target relevance models is a first target relevance model and another is a second target relevance model, the description domain associated with the first target relevance model is a name domain, the description domain associated with the second target relevance model comprises at least two description domains, and the number of the search results is K;
the displaying of the search result according to the acquired relevance score comprises:
acquiring a first score threshold corresponding to the first target relevance model and a second score threshold corresponding to the second target relevance model;
according to the first score threshold, extracting P relevance scores meeting a first preset condition from K relevance scores corresponding to the K search results obtained according to the first target relevance model;
according to the second score threshold, extracting Q relevance scores meeting a second preset condition from K relevance scores corresponding to the K search results obtained according to the second target relevance model;
determining a union of the P search results corresponding to the P relevance scores and the Q search results corresponding to the Q relevance scores;
and sequencing and displaying all the search results in the union set.
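A small Python sketch of the threshold-and-union selection in claim 2, assuming the first and second preset conditions are simply "score is not lower than the corresponding threshold" (the claim does not fix the form of the conditions); all names are illustrative.

def select_union(result_ids, scores_model1, scores_model2, threshold1, threshold2):
    # scores_model1 / scores_model2 map each of the K result ids to its
    # relevance score under the first / second target relevance model.
    p_ids = {rid for rid in result_ids if scores_model1[rid] >= threshold1}
    q_ids = {rid for rid in result_ids if scores_model2[rid] >= threshold2}
    return p_ids | q_ids   # union of the P results and the Q results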
3. The method of claim 2, wherein the sequencing and displaying of all the search results in the union set comprises:
acquiring a first model weight corresponding to the first target relevance model and a second model weight corresponding to the second target relevance model;
for each search result in the union set, performing weighted summation on corresponding relevance scores respectively obtained according to the first target relevance model and the second target relevance model by using the first model weight and the second model weight to obtain a comprehensive score;
and sequencing and displaying all the search results in the union set according to the respective comprehensive scores of all the search results in the union set.
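Claim 3 combines the two per-model scores by a weighted sum; a small illustrative helper follows (the default weight values are assumptions, not claim language).

def rank_by_composite_score(union_ids, scores_model1, scores_model2, w1=0.6, w2=0.4):
    # Weighted sum of the two relevance scores gives the composite score.
    composite = {rid: w1 * scores_model1[rid] + w2 * scores_model2[rid]
                 for rid in union_ids}
    # Sort the union by composite score, highest first, for display.
    return sorted(composite.items(), key=lambda kv: kv[1], reverse=True)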
4. The method according to claim 1, wherein the relevance characteristic parameter includes at least one of a field hit coverage parameter, a miss coverage parameter before the hit field, and a miss coverage parameter after the hit field of the domain data where the target data is located;
the determining of the relevance characteristic parameter according to the target data includes:
performing word segmentation processing on the domain data where the target data is located to obtain a first word segmentation processing result;
calculating a first sum of importance weights of all words in the first word segmentation processing result;
calculating a second sum of importance weights of all terms used for forming the target data in the first word segmentation processing result, and calculating the field hit coverage parameter according to the second sum and the first sum;
and/or calculating a third sum of importance weights of all terms positioned before all the terms used for forming the target data in the first word segmentation processing result, and calculating the miss coverage parameter before the hit field according to the third sum and the first sum;
and/or calculating a fourth sum of importance weights of all terms positioned after all the terms used for forming the target data in the first word segmentation processing result, and calculating the miss coverage parameter after the hit field according to the fourth sum and the first sum.
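An illustrative computation of the three coverage parameters of claim 4, assuming per-token importance weights (for example IDF values) are already available; all names are hypothetical, and each parameter is taken as the ratio of the corresponding sum to the first sum.

def field_coverage_parameters(field_tokens, token_weights, hit_tokens):
    # field_tokens: word-segmentation result of the domain data, in order;
    # token_weights: importance weight of each token, same order;
    # hit_tokens: the tokens that form the target data.
    first_sum = sum(token_weights)
    hit_positions = [i for i, tok in enumerate(field_tokens) if tok in hit_tokens]
    second_sum = sum(token_weights[i] for i in hit_positions)
    third_sum = sum(token_weights[: hit_positions[0]]) if hit_positions else first_sum
    fourth_sum = sum(token_weights[hit_positions[-1] + 1:]) if hit_positions else first_sum
    field_hit_coverage = second_sum / first_sum if first_sum else 0.0
    miss_before_hit = third_sum / first_sum if first_sum else 0.0
    miss_after_hit = fourth_sum / first_sum if first_sum else 0.0
    return field_hit_coverage, miss_before_hit, miss_after_hit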
5. The method of claim 1, wherein the relevance characteristic parameter comprises an information hit coverage parameter of the index information;
the determining of the relevance characteristic parameter according to the target data includes:
performing word segmentation processing on the index information to obtain a second word segmentation processing result;
calculating a fifth sum of importance weights of all words in the second word segmentation processing result;
calculating a sixth sum of importance weights of the words forming the target data in the second word segmentation processing result;
and calculating the information hit coverage parameter according to the sixth sum and the fifth sum.
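A corresponding sketch for the information hit coverage parameter of claim 5, which measures how much of the query (by importance weight) is covered by the target data; again the parameter is taken as a simple ratio of the two sums.

def information_hit_coverage(query_tokens, query_weights, hit_tokens):
    # query_tokens/query_weights: word-segmentation result of the index
    # information and the importance weight of each token, same order.
    fifth_sum = sum(query_weights)
    sixth_sum = sum(w for tok, w in zip(query_tokens, query_weights) if tok in hit_tokens)
    return sixth_sum / fifth_sum if fifth_sum else 0.0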
6. The method according to claim 1, wherein the relevance characteristic parameter comprises a core word loss evaluation parameter;
the determining of the relevance characteristic parameter according to the target data includes:
determining core data in the domain data where the target data is located;
performing word segmentation processing on the core data to obtain a third word segmentation processing result;
calculating a seventh sum of importance weights of the words in the third word segmentation processing result;
performing word segmentation processing on the part of the target data that hits the core data to obtain a fourth word segmentation processing result;
calculating an eighth sum of importance weights of the words in the fourth word segmentation processing result;
and calculating the core word loss evaluation parameter according to the eighth sum and the seventh sum.
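Claim 6 derives the core word loss evaluation parameter from the weights of the core data and of the part of the target data that hits it. The claim only fixes the two sums, not how they are combined; one plausible combination, shown in this sketch, is one minus their ratio.

def core_word_loss(core_weights, hit_part_weights):
    # core_weights: importance weights of the words of the core data (seventh sum);
    # hit_part_weights: importance weights of the words of the part of the
    # target data that hits the core data (eighth sum).
    seventh_sum = sum(core_weights)
    eighth_sum = sum(hit_part_weights)
    # Assumed form: the smaller the hit share of the core words, the larger the loss.
    return 1.0 - (eighth_sum / seventh_sum) if seventh_sum else 1.0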
7. The method of claim 1, wherein the relevance characteristic parameter comprises a hit window length ratio parameter;
the determining of the relevance characteristic parameter according to the target data includes:
determining a data window consisting of all data from the start of the first piece of target data to the end of the last piece of target data in the domain data where the target data is located;
performing word segmentation processing on the data window to obtain a fifth word segmentation processing result;
and calculating the length ratio parameter of the hit window according to the number of the words forming the target data in the fifth word segmentation processing result and the number of the words in the fifth word segmentation processing result.
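An illustrative version of the hit window length ratio of claim 7. The tokenize callable stands in for the word-segmentation step, and the hit fragments are assumed to be actual substrings of the domain data given in order of appearance.

def hit_window_length_ratio(domain_data, hit_fragments, tokenize):
    if not hit_fragments:
        return 0.0
    # Window: from the start of the first hit fragment to the end of the last one.
    start = domain_data.find(hit_fragments[0])
    end = domain_data.rfind(hit_fragments[-1]) + len(hit_fragments[-1])
    window_tokens = tokenize(domain_data[start:end])   # fifth word-segmentation result
    hit_tokens = {tok for frag in hit_fragments for tok in tokenize(frag)}
    hits_in_window = sum(1 for tok in window_tokens if tok in hit_tokens)
    return hits_in_window / len(window_tokens) if window_tokens else 0.0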
8. The method according to claim 1, wherein the relevance characteristic parameter includes at least one of a hit number ratio parameter and a distance escape parameter, the hit number ratio parameter is a ratio of the number of pieces of target data in the domain data where the target data is located to a hit number threshold, and the distance escape parameter is an edit distance between the index information and the target data.
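The two parameters of claim 8 are straightforward to compute. In the sketch below, capping the hit number ratio at 1.0 is an assumption, and the edit distance is the standard Levenshtein distance.

def hit_number_ratio(hit_count, hit_number_threshold):
    # Ratio of the number of pieces of target data to the hit number threshold.
    return min(hit_count / hit_number_threshold, 1.0) if hit_number_threshold else 0.0

def edit_distance(a, b):
    # Levenshtein distance between the index information a and the target data b,
    # computed with a single rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]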
9. The method of claim 1, further comprising:
obtaining R pieces of sample data from historical user data; each piece of sample data in the R pieces of sample data comprises sample index information, a sample search result, and the number of times a preset event occurs on the sample search result;
dividing the R pieces of sample data into a positive sample data group and a negative sample data group according to whether the number of times of each piece of sample data is zero;
extracting S pieces of positive sample data from the positive sample data group, and extracting T pieces of negative sample data from the negative sample data group; wherein S and T satisfy a preset proportional relationship;
and training to obtain the target relevance model according to the S pieces of positive sample data and the T pieces of negative sample data.
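A Python sketch of the sample construction of claim 9, assuming the preset event is something like a click and the preset S:T ratio is 1:3; both are assumptions, since the claim fixes neither.

import random

def build_training_samples(history, ratio=(1, 3)):
    # history: list of (sample_index_info, sample_search_result, event_count)
    # tuples drawn from historical user data.
    positives = [rec for rec in history if rec[2] > 0]
    negatives = [rec for rec in history if rec[2] == 0]
    s, t = ratio
    # Largest multiple of the preset ratio that both groups can supply.
    k = min(len(positives) // s, len(negatives) // t)
    return random.sample(positives, k * s), random.sample(negatives, k * t)

The claim leaves the model family open; any scorer trained on the relevance characteristic parameters of these samples (for example a gradient-boosted tree or a logistic model) would fit the description.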
10. The method according to claim 9, wherein the training to obtain the target relevance model according to the S pieces of positive sample data and the T pieces of negative sample data comprises:
according to a threshold number of times, extracting U pieces of positive sample data of which the number of times meets a third preset condition from the S pieces of positive sample data;
according to the T pieces of negative sample data and an initial relevance model, obtaining a relevance score between the sample search result and the sample index information included in each piece of negative sample data among the T pieces of negative sample data;
extracting V pieces of negative sample data meeting a fourth preset condition from the T pieces of negative sample data according to the acquired relevance scores;
and performing model training by using the U pieces of positive sample data and the V pieces of negative sample data to obtain the target relevance model.
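Claim 10 refines the sample set before training. The sketch below assumes the third preset condition keeps positives whose event count reaches the threshold and the fourth keeps the highest-scoring ("hard") negatives under the initial relevance model; both readings are interpretations, not claim language.

def refine_training_samples(positives, negatives, initial_model, count_threshold, v):
    # positives / negatives: (index_info, search_result, event_count) tuples;
    # initial_model: a callable returning a relevance score for (index_info, search_result).
    strong_positives = [rec for rec in positives if rec[2] >= count_threshold]
    scored_negatives = sorted(negatives,
                              key=lambda rec: initial_model(rec[0], rec[1]),
                              reverse=True)
    hard_negatives = scored_negatives[:v]
    return strong_positives, hard_negatives   # the U positives and V negatives used for training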
11. A search result processing apparatus, comprising:
a first acquisition module, configured to acquire index information and a search result obtained by searching with the index information; the search result comprises N pieces of domain data corresponding to N description domains;
a first determination module, configured to determine, from domain data corresponding to a description domain associated with a target relevance model among the N description domains, target data hit by the index information;
a second determination module, configured to determine a relevance characteristic parameter according to the target data;
a second acquisition module, configured to acquire a relevance score of the search result and the index information according to the relevance characteristic parameter and the target relevance model;
and a processing module, configured to display the search result according to the acquired relevance score.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the search result processing method of any of claims 1 to 10 via execution of the executable instructions.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the search result processing method of any one of claims 1 to 10.
14. A computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for carrying out the steps of the search result processing method of any one of claims 1 to 10.
CN202110329657.2A 2021-03-26 2021-03-26 Search result processing method and device and electronic equipment Active CN112989164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110329657.2A CN112989164B (en) 2021-03-26 2021-03-26 Search result processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110329657.2A CN112989164B (en) 2021-03-26 2021-03-26 Search result processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112989164A true CN112989164A (en) 2021-06-18
CN112989164B CN112989164B (en) 2023-11-03

Family

ID=76333971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110329657.2A Active CN112989164B (en) 2021-03-26 2021-03-26 Search result processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112989164B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090106235A1 (en) * 2007-10-18 2009-04-23 Microsoft Corporation Document Length as a Static Relevance Feature for Ranking Search Results
CN102150158A (en) * 2008-09-12 2011-08-10 诺基亚公司 Method, system, and apparatus for arranging content search results
US20100312782A1 (en) * 2009-06-05 2010-12-09 Microsoft Corporation Presenting search results according to query domains
US20110106850A1 (en) * 2009-10-29 2011-05-05 Microsoft Corporation Relevant Individual Searching Using Managed Property and Ranking Features
CN104063523A (en) * 2014-07-21 2014-09-24 焦点科技股份有限公司 E-commerce search scoring and ranking method and system
CN105069086A (en) * 2015-07-31 2015-11-18 焦点科技股份有限公司 Method and system for optimizing electronic commerce commodity searching
US20170091343A1 (en) * 2015-09-29 2017-03-30 Yandex Europe Ag Method and apparatus for clustering search query suggestions
CN108664515A (en) * 2017-03-31 2018-10-16 北京三快在线科技有限公司 A kind of searching method and device, electronic equipment
US20190354604A1 (en) * 2018-05-18 2019-11-21 Yandex Europe Ag Method of and system for recommending fresh search query suggestions on search engine
US20200126675A1 (en) * 2018-10-23 2020-04-23 International Business Machines Corporation Utilizing unstructured literature and web data to guide study design in healthcare databases
CN110399515A (en) * 2019-06-28 2019-11-01 中山大学 Picture retrieval method, device and picture retrieval system
CN111368058A (en) * 2020-03-09 2020-07-03 昆明理工大学 Question-answer matching method based on transfer learning
CN111695526A (en) * 2020-06-15 2020-09-22 北京爱笔科技有限公司 Network model generation method, pedestrian re-identification method and device
CN112116435A (en) * 2020-10-06 2020-12-22 广州智物互联科技有限公司 Commodity searching method and system based on big data and electronic mall platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
蔡飞;陈洪辉;舒振;: "基于用户相关反馈的排序学习算法研究", 国防科技大学学报, no. 02, pages 135 - 139 *
阎磊;马宏琳;彭松;: "基于本体的粮食机械垂直搜索引擎主题相关性研究", 制造业自动化, no. 05, pages 88 - 91 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676227A (en) * 2022-04-06 2022-06-28 北京百度网讯科技有限公司 Sample generation method, model training method and search method
CN114676227B (en) * 2022-04-06 2023-07-18 北京百度网讯科技有限公司 Sample generation method, model training method and retrieval method
CN117076783A (en) * 2023-10-16 2023-11-17 广东省科技基础条件平台中心 Scientific and technological information recommendation method, device, medium and equipment based on data analysis
CN117076783B (en) * 2023-10-16 2023-12-26 广东省科技基础条件平台中心 Scientific and technological information recommendation method, device, medium and equipment based on data analysis

Also Published As

Publication number Publication date
CN112989164B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN110704674B (en) Video playing integrity prediction method and device
CN110221965B (en) Test case generation method, test case generation device, test case testing method, test case testing device, test equipment and test system
CN112749608B (en) Video auditing method, device, computer equipment and storage medium
CN112989164B (en) Search result processing method and device and electronic equipment
CN103744889A (en) Method and device for clustering problems
CN111767393A (en) Text core content extraction method and device
CN112818995B (en) Image classification method, device, electronic equipment and storage medium
CN111191133B (en) Service search processing method, device and equipment
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112995690B (en) Live content category identification method, device, electronic equipment and readable storage medium
CN111522724A (en) Abnormal account determination method and device, server and storage medium
CN112613321A (en) Method and system for extracting entity attribute information in text
CN117409419A (en) Image detection method, device and storage medium
CN112989312B (en) Verification code identification method and device, electronic equipment and storage medium
CN113609390A (en) Information analysis method and device, electronic equipment and computer readable storage medium
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device
CN111259975A (en) Method and device for generating classifier and method and device for classifying text
CN114697127B (en) Service session risk processing method based on cloud computing and server
CN107133644B (en) Digital library&#39;s content analysis system and method
CN116980665A (en) Video processing method, device, computer equipment, medium and product
CN114048294B (en) Similar population extension model training method, similar population extension method and device
KR101551879B1 (en) A Realization of Injurious moving picture filtering system and method with Data pruning and Likelihood Estimation of Gaussian Mixture Model
CN112507214B (en) User name-based data processing method, device, equipment and medium
CN104915408B (en) A kind of method and device of social search result displaying
CN112784015A (en) Information recognition method and apparatus, device, medium, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant