CN115248847B - Search data set construction method and device, electronic equipment and storage medium - Google Patents

Search data set construction method and device, electronic equipment and storage medium

Info

Publication number
CN115248847B
Authority
CN
China
Prior art keywords
long
search
tail
target
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211156488.8A
Other languages
Chinese (zh)
Other versions
CN115248847A (en)
Inventor
简仁贤
卢露
吴文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhujian Smart Technology Beijing Co ltd
Emotibot Technologies Ltd
Original Assignee
Zhujian Smart Technology Beijing Co ltd
Emotibot Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhujian Smart Technology Beijing Co ltd, Emotibot Technologies Ltd
Priority to CN202211156488.8A
Publication of CN115248847A
Application granted
Publication of CN115248847B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a method and a device for constructing a search data set, electronic equipment and a storage medium, which relate to the technical field of data processing. The method comprises the following steps: counting a first search index of each text, and dividing the texts into long-tail texts and non-long-tail texts according to the first search indexes; determining the search mode corresponding to each non-long-tail text, and identifying the distinct target search modes among the search modes; counting a second search index of each target search mode, and determining the target non-long-tail texts contained in each target search mode; sampling the target non-long-tail texts according to the second search indexes and the first search indexes of the target non-long-tail texts to obtain a non-long-tail search data set; and sampling the long-tail texts according to the first search indexes of the long-tail texts to obtain a long-tail search data set, which together with the non-long-tail search data set forms the search data set. The resulting search data set therefore represents high-frequency texts while also covering low-frequency texts, and reflects the search requirements of the user group.

Description

Search data set construction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for constructing a search dataset, an electronic device, and a storage medium.
Background
At present, evaluating or optimizing the search effect of a website requires a search data set, which may be constructed in advance for this purpose.
In the related art, search data sets are mostly constructed by statistical random sampling or by subjective manual sampling. Such samples often lack representativeness and cannot represent the search requirements of the user group, which impairs the accuracy and effectiveness of website search effect evaluation or hinders optimization of the website search effect.
Disclosure of Invention
In the related art, search data sets are mostly constructed by statistical random sampling or by subjective manual sampling, which often lacks representativeness and cannot represent the search requirements of the user group, so that the accuracy and effectiveness of website search effect evaluation, or the optimization of the website search effect, are affected. To solve this technical problem, the embodiment of the invention provides a construction method and a device of the search data set, electronic equipment and a storage medium. The specific technical scheme is as follows:
in a first aspect of the embodiments of the present invention, a method for constructing a search data set is provided first, where the method includes:
counting first search indexes of all texts, and dividing all the texts into long-tail texts and non-long-tail texts according to the first search indexes;
determining a search mode corresponding to the non-long-tail text, and searching different target search modes in the search mode;
counting second search indexes of each target search mode, and determining target non-long-tail texts contained in each target search mode;
sampling the target non-long-tail text according to the second search index and the first search index of the target non-long-tail text to obtain a non-long-tail search data set;
and sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set, and constructing a search data set with the non-long-tail search data set.
In an optional embodiment, the counting the first search index of each text includes:
obtaining a search log of a website in a preset time period, and determining each text contained in the search log, wherein the texts are distinct from one another;
counting first search indexes of texts in the search logs;
the dividing each text into a long-tail text and a non-long-tail text according to the first search index includes:
aiming at any text, judging whether a first search index of the text is larger than a preset index threshold value corresponding to the website or not;
if the first search index of the text is larger than the preset index threshold value, dividing the text into a non-long-tail text;
and if the first search index of the text is not greater than the preset index threshold value, dividing the text into a long-tail text.
In an optional embodiment, the number of the non-long-tail texts is multiple, and the determining the search mode corresponding to the non-long-tail text includes:
performing word segmentation processing on the non-long-tail text aiming at any non-long-tail text to obtain a plurality of segmented words;
for any segmented word, determining the category label to which the segmented word belongs through a preset classification algorithm, wherein a search mode is obtained by combining category labels;
and combining the category labels to which the segmented words belong to obtain the search mode corresponding to the non-long-tail text.
In an optional implementation manner, the sampling the target non-long-tail text according to the second search index and the first search index of the target non-long-tail text to obtain a non-long-tail search data set includes:
determining a first sampling proportion of target non-long-tail texts contained in each target search mode according to the second search index of each target search mode;
determining the number of samples corresponding to a non-long tail search data set, and determining the first sampling number of target non-long tail texts contained in each target search mode according to the sample number and the first sampling proportion;
and extracting the target non-long-tail texts with the first sampling quantity from the target non-long-tail texts contained in each target search mode according to the first search index of the target non-long-tail texts contained in each target search mode to obtain a non-long-tail search data set.
In an optional implementation manner, the determining, according to the second search index of each target search mode, a first sampling proportion of target non-long-tail text included in each target search mode includes:
aiming at any one target search mode, determining a logarithm corresponding to a second search index of the target search mode;
summing the logarithms corresponding to the second search indexes of all the target search modes to obtain a logarithm sum;
aiming at any one target search mode, obtaining a quotient between a logarithm corresponding to a second search index of the target search mode and the logarithm sum;
and determining the quotient of the logarithm corresponding to the second search index of the target search mode and the logarithm sum as the first sampling proportion of the target non-long-tail text contained in the target search mode.
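The logarithm-based proportion described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `first_sampling_proportions`, the mode names and the index values are hypothetical, and the natural logarithm is an assumption (the text does not name a base, and the quotient is the same for any base as long as one base is used throughout). The sketch also assumes every second search index is greater than 1, since an index of 1 has logarithm 0.

```python
import math

def first_sampling_proportions(mode_index):
    """Map each target search mode to log(index) / sum of all logs."""
    # Logarithm of the second search index of each target search mode.
    logs = {mode: math.log(index) for mode, index in mode_index.items()}
    log_sum = sum(logs.values())
    if log_sum <= 0:
        raise ValueError("the logarithm sum must be positive")
    # Quotient of each logarithm and the logarithm sum.
    return {mode: value / log_sum for mode, value in logs.items()}

# Hypothetical second search indexes (for example, search counts per mode).
proportions = first_sampling_proportions(
    {"mode 1": 1000, "mode 2": 100, "mode 3": 10}
)
```

Taking logarithms compresses the gap between modes: in this hypothetical input the raw indexes differ by a factor of 100, while the resulting proportions differ only by a factor of 3.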
In an optional implementation manner, the extracting, according to the first search index of the target non-long-tail text included in each target search mode, the first sample number of target non-long-tail texts from the target non-long-tail texts included in each target search mode to obtain a non-long-tail search data set includes:
judging whether the first sampling quantity of the target non-long-tail text contained in each target search mode is larger than the text quantity of the target non-long-tail text contained in each target search mode;
if the first sampling quantity of the target non-long-tail text contained in each target search mode is not larger than the text quantity of the target non-long-tail text contained in each target search mode, extracting the target non-long-tail text with the first sampling quantity from the target non-long-tail text contained in each target search mode according to the first search index of the target non-long-tail text contained in each target search mode, and obtaining a non-long-tail search data set.
In an optional implementation manner, the extracting, according to the first search index of the target non-long-tail text included in each target search mode, the first sample number of target non-long-tail texts from the target non-long-tail texts included in each target search mode to obtain a non-long-tail search data set further includes:
if the first sampling quantity of the target non-long-tail text contained in a first search mode is larger than the text quantity of the target non-long-tail text contained in the first search mode, extracting all the target non-long-tail text contained in the first search mode;
the first search mode comprises any one of the target search modes, and the first sampling quantity of the target non-long-tail text contained in each of the remaining second search modes is not greater than the text quantity of the target non-long-tail text contained in each of the second search modes;
extracting the target non-long-tail texts with the first sampling quantity from the target non-long-tail texts contained in each second search mode according to the first search index of the target non-long-tail texts contained in each second search mode;
determining the missing quantity of the target non-long-tail text contained in the first search mode according to the text quantity of the target non-long-tail text contained in the first search mode and the first sampling quantity of the target non-long-tail text contained in the first search mode;
distributing the missing quantity to each second search mode, and extracting the distributed quantity of target non-long-tail texts from the remaining target non-long-tail texts contained in each second search mode according to the first search index of the remaining target non-long-tail texts contained in each second search mode;
and forming a non-long-tail search data set by all target non-long-tail texts contained in the first search mode, the target non-long-tail texts with the first sampling quantity extracted from the target non-long-tail texts contained in each second search mode and the distributed quantity of target non-long-tail texts.
In an optional implementation manner, the allocating the missing amount to each of the second search modes, and extracting the allocated amount of the target non-long-tail text from the remaining target non-long-tail text included in each of the second search modes according to the first search index of the remaining target non-long-tail text included in each of the second search modes includes:
determining a second sampling proportion of the remaining target non-long-tail text contained in each second search mode according to the second search index of each second search mode;
determining the second sampling quantity of the residual target non-long-tail text contained in each second search mode according to the missing quantity and the second sampling proportion;
and extracting the target non-long-tail texts with the second sampling quantity from the residual target non-long-tail texts contained in each second search mode according to the first search index of the residual target non-long-tail texts contained in each second search mode.
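The redistribution of a missing quantity can be sketched as a computation on counts only. The sketch below rests on several assumptions that the text does not state: the second sampling proportion is computed with the same logarithm quotient as the first, quotas are rounded with Python's `round`, and more than one search mode may fall short. All names and numbers are hypothetical, and the actual extraction of texts by first search index is not modeled.

```python
import math

def log_proportions(mode_index):
    """Quotient of each mode's log(index) and the sum of the logs."""
    logs = {mode: math.log(index) for mode, index in mode_index.items()}
    total = sum(logs.values())
    return {mode: value / total for mode, value in logs.items()}

def allocate_counts(sample_size, mode_index, available):
    """Return how many texts to draw from each target search mode."""
    proportions = log_proportions(mode_index)
    counts = {m: round(sample_size * p) for m, p in proportions.items()}
    # "First" search modes: the quota exceeds the number of texts they contain.
    short = [m for m in counts if counts[m] > available[m]]
    missing = sum(counts[m] - available[m] for m in short)
    for m in short:
        counts[m] = available[m]  # extract all texts of this mode
    # "Second" search modes share the missing quantity among themselves.
    rest = {m: mode_index[m] for m in counts if m not in short}
    if missing and rest:
        for m, p in log_proportions(rest).items():
            extra = round(missing * p)
            counts[m] = min(available[m], counts[m] + extra)
    return counts

counts = allocate_counts(
    sample_size=60,
    mode_index={"mode 1": 1000, "mode 2": 100, "mode 3": 10},
    available={"mode 1": 20, "mode 2": 50, "mode 3": 50},
)
```

Because of rounding, the allocated counts are not guaranteed to add up to the requested sample size for every input; a fuller implementation would need a rule for the remainder.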
In an optional embodiment, the sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set includes:
clustering the non-long-tail texts by a preset clustering algorithm, and judging whether the long-tail texts belong to the clustering result of the non-long-tail texts;
and if the long-tail text does not belong to the clustering result of the non-long-tail text, sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set.
In an optional embodiment, the method further comprises:
and if the long-tail text belongs to the clustering result of the non-long-tail text, dividing the long-tail text into the non-long-tail text, and jumping to the step of determining the search mode corresponding to the non-long-tail text.
In a second aspect of the embodiments of the present invention, there is also provided a search data set constructing apparatus, including:
the first index counting module is used for counting the first search indexes of all texts;
the text dividing module is used for dividing each text into a long-tail text and a non-long-tail text according to the first search index;
the mode determining module is used for determining a search mode corresponding to the non-long-tail text;
the mode searching module is used for searching different target searching modes in the searching modes;
the second index counting module is used for counting second search indexes of the target search modes;
the text determining module is used for determining target non-long-tail texts contained in each target search mode;
the first sampling module is used for sampling the target non-long-tail text according to the second search index and the first search index of the target non-long-tail text to obtain a non-long-tail search data set;
and the second sampling module is used for sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set and constructing a search data set with the non-long-tail search data set.
In a third aspect of the embodiments of the present invention, there is further provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor configured to implement the method for constructing a search data set according to any one of the first aspect described above when executing a program stored in a memory.
In a fourth aspect of the embodiments of the present invention, there is also provided a storage medium, in which instructions are stored, and when the storage medium is run on a computer, the storage medium causes the computer to execute the method for constructing a search data set according to any one of the first aspect.
In a fifth aspect of the embodiments of the present invention, there is also provided a computer program product including instructions, which when run on a computer, cause the computer to perform any one of the above-mentioned search data set construction methods.
The technical scheme provided by the embodiment of the invention comprises: counting a first search index of each text, and dividing the texts into long-tail texts and non-long-tail texts according to the first search indexes; determining the search mode corresponding to each non-long-tail text, and identifying the distinct target search modes among the search modes; counting a second search index of each target search mode, and determining the target non-long-tail texts contained in each target search mode; sampling the target non-long-tail texts according to the second search indexes and the first search indexes of the target non-long-tail texts to obtain a non-long-tail search data set; and sampling the long-tail texts according to the first search indexes of the long-tail texts to obtain a long-tail search data set, which together with the non-long-tail search data set forms the search data set. Because sampling uses both the search index of each text and the search index of its search mode, the search data set represents high-frequency texts while also covering low-frequency texts and reflects the search requirements of the user group, so that the accuracy and effectiveness of website search effect evaluation, and the optimization of the website search effect, are not impaired.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flowchart of an implementation of a method for constructing a search data set according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an implementation of a text partitioning method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of an implementation of a search mode determining method according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of an implementation of a sampling method according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of an implementation of a sampling rate determining method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a search data set constructing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In the embodiment of the invention, in order for the search data set to represent both high-frequency and low-frequency texts and to reflect the search requirements of the user group, and thereby the main aspects of search engine performance, the texts are analyzed from two angles, search frequency and search mode frequency, so as to distinguish high-frequency from low-frequency user searches and high-frequency from low-frequency search modes.
High-frequency and low-frequency searches are distinguished so that the problem can be decomposed into sub-problems handled in different stages, according to differences in search difficulty and optimization technique and according to the commonality and individuality of user needs. According to a number of studies and experiments, text search frequency follows a power-law distribution: a small number of texts account for 50%-70% of the total search volume, so head searches and long-tail searches can be distinguished and processed separately. Searches for texts in the middle part may come from different users or from repeated attempts by the same user. In view of this, cluster analysis is performed on the texts: middle and tail searches that cannot be clustered with the head searches fall into the long-tail search part, and the rest are classified as non-long-tail searches.
Search modes (patterns) are distinguished in order to determine the intention type and intention frame of a text, that is, an abstraction of the intention at the level of phrase structure. For example, the text "601318 scale 5 month 25" can be abstracted into the intention frame "time + stock code + scale". The intention frames of non-long-tail texts tend to be relatively concentrated, and abstracting the intention patterns of high-frequency texts is an important part of understanding text intention; these patterns are also the most basic search modes of the users of a vertical-domain website. In fact, each concrete text is an instantiation of an intention pattern. If, when sampling non-long-tail searches, the search mode is ignored and only the text search frequency is considered, or the sampling is completely random, the sample cannot truly represent the real search behavior of the user group.
Based on the above, the embodiment of the invention comprehensively utilizes the search index of the text and the search index of the search mode of the text to sample, so that the search data set can represent the high-frequency text and also can give consideration to the low-frequency text, and meanwhile, the search requirements of the user group can be reflected, thereby avoiding influencing the accuracy and effectiveness of website search effect evaluation or influencing the optimization of the website search effect.
Specifically, as shown in fig. 1, an implementation flow diagram of a method for constructing a search data set according to an embodiment of the present invention is shown, where the method is applied to an electronic device, and specifically includes the following steps:
S101, counting first search indexes of all texts, and dividing all the texts into long-tail texts and non-long-tail texts according to the first search indexes.
In the embodiment of the present invention, the first search index of each text may be counted. The first search index may be, for example, the first search frequency of the text, which is not limited in the embodiment of the present invention.
Based on its first search index, each text can be classified as a long-tail text or a non-long-tail text, so that high-frequency and low-frequency user searches are distinguished.
In the embodiment of the invention, the search logs of the website within a preset time period can be cleaned to obtain the cleaned search logs; the texts contained in the search logs are then determined, the texts being distinct from one another, and the first search index of each text in the search logs is counted.
For example, the search logs of a website (which may be a vertical-domain website) over the last year are cleaned; the texts contained in the cleaned search logs are determined, the texts being distinct from one another; and the first search frequency of each text in the search logs is counted.
In addition, in the embodiment of the invention, the index threshold value of the website can be set according to the service characteristics of the website, so that the text can be divided into long-tail text and non-long-tail text according to the size relationship between the first search index of the text and the index threshold value.
Specifically, as shown in fig. 2, which is a schematic diagram of an implementation flow of a text partitioning method according to an embodiment of the present invention, the method may be applied to an electronic device, and specifically includes the following steps:
S201, for any text, judging whether the first search index of the text is larger than a preset index threshold corresponding to the website.
S202, if the first search index of the text is larger than a preset index threshold value, the text is divided into non-long-tail texts.
S203, if the first search index of the text is not larger than the preset index threshold value, the text is divided into long-tail texts.
In the embodiment of the invention, whether the first search index of any text is greater than the preset index threshold corresponding to the website is judged, and if the first search index of the text is not greater than the preset index threshold, the text is divided into long-tail texts, which means that the text is a low-frequency user search.
In addition, if the first search index of the text is larger than the preset index threshold value, the text is divided into non-long-tail texts, that is, the text is a high-frequency user search, and therefore the high-frequency and low-frequency user searches can be distinguished through the first search index of the text.
It should be noted that the preset index threshold corresponding to the website is set according to the service characteristics of the website, and since different websites have different service characteristics, the set index thresholds are different, that is, different websites have different corresponding index thresholds, which is not limited in the embodiment of the present invention.
For example, for any text, it is determined whether the first search frequency of the text is greater than a preset frequency threshold corresponding to a website, for example, the preset frequency threshold is 5, if the first search frequency of the text is not greater than 5, the text is divided into long-tail texts, and if the first search frequency of the text is greater than 5, the text is divided into non-long-tail texts.
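The counting and threshold comparison of steps S201 to S203 can be sketched as follows. This is a minimal sketch, assuming the cleaned search log is already available as a list of query strings and that the first search index is the number of occurrences of each text; the function name `split_by_threshold` and the sample log are hypothetical.

```python
from collections import Counter

def split_by_threshold(queries, threshold):
    """Count each distinct text and split the texts by the index threshold."""
    counts = Counter(queries)  # first search index of each distinct text
    # Strictly greater than the threshold: non-long-tail text.
    non_long_tail = {t: c for t, c in counts.items() if c > threshold}
    # Not greater than the threshold: long-tail text.
    long_tail = {t: c for t, c in counts.items() if c <= threshold}
    return non_long_tail, long_tail

# Hypothetical cleaned search log with three distinct texts.
log = ["text A"] * 8 + ["text B"] * 5 + ["text C"] * 2
non_long_tail, long_tail = split_by_threshold(log, threshold=5)
```

With the threshold of 5 used in the example above, a text searched exactly 5 times is treated as long-tail, because the comparison is "not greater than".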
S102, determining the search mode corresponding to the non-long-tail text, and searching different target search modes in the search mode.
In the embodiment of the present invention, there are a plurality of non-long-tail texts, and the search mode corresponding to each non-long-tail text is determined, so that each non-long-tail text has a corresponding search mode.
Because several non-long-tail texts may share the same search mode, the search modes contain repetitions. The distinct target search modes among them are therefore identified, so that the second search index of each target search mode can subsequently be counted.
For example, the search mode corresponding to each non-long-tail text is determined as shown in Table 1 below, and the distinct target search modes, such as search mode 1, search mode 2, …, are identified among these search modes.
Non-long-tail text    Search mode
Text A                Search mode 1
Text B                Search mode 2
Text C                Search mode 1
……                    ……
TABLE 1
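Identifying the distinct target search modes, and the non-long-tail texts each one contains, amounts to grouping the rows of Table 1 by search mode. The following is a minimal sketch, assuming the text-to-mode mapping is already known; the function name `group_by_mode` is hypothetical.

```python
def group_by_mode(text_to_mode):
    """Group non-long-tail texts under their (distinct) search mode."""
    groups = {}
    for text, mode in text_to_mode.items():
        groups.setdefault(mode, []).append(text)
    return groups

# The three rows shown in Table 1.
table_1 = {
    "Text A": "Search mode 1",
    "Text B": "Search mode 2",
    "Text C": "Search mode 1",
}
groups = group_by_mode(table_1)
target_modes = list(groups)  # distinct target search modes
```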
In the embodiment of the invention, for any non-long-tail text, the corresponding search mode can be determined by analyzing the non-long-tail text. Specifically, as shown in fig. 3, an implementation flow diagram of a search mode determination method shown in the embodiment of the present invention is shown, where the method is applied to an electronic device, and specifically may include the following steps:
S301, for any non-long-tail text, performing word segmentation processing on the non-long-tail text to obtain a plurality of segmented words.
In the embodiment of the invention, for any non-long-tail text, word segmentation processing is performed on the non-long-tail text to obtain a plurality of segmented words corresponding to the non-long-tail text. Any relatively mature existing word segmentation tool or algorithm may be used for this purpose, and the details are not described herein.
For example, taking the non-long-tailed text "601318 scale 5 month 25" as an example, the non-long-tailed text is subjected to word segmentation processing to obtain a plurality of words corresponding to the non-long-tailed text, and the plurality of words are "601318", "scale", "5 month 25".
S302, for any word segment, determining the category label to which the word segment belongs through a preset classification algorithm, wherein the search mode is obtained by combining category labels.
In the embodiment of the invention, for each of the word segments corresponding to a non-long-tail text, the category label to which the word segment belongs is determined through the classification algorithm. The category labels are the components that are combined to form a search mode.
It should be noted that, if the website has a category system, the category label to which any word segment belongs may be determined through the classification algorithm using that system. If the website has no category system, category labels may be labeled manually, and the classification algorithm then determines which manually labeled category label any word segment belongs to.
In addition, the classification algorithm may be a currently mature classification algorithm, such as a named entity recognition algorithm. In the case that the website has no category system, the category labels are labeled manually on a small number of training samples. For example, the training sample "the financial report in 2013 of luzhou council" carries 3 manually labeled category labels, namely company abbreviation, financial report, and time, and its search mode is company abbreviation + financial report + time. On the basis of this small number of labeled training samples, the training samples are segmented, each word segment and its corresponding category label are used as input, and supervised training is performed on the classification algorithm. After training, the classification algorithm can determine the manually labeled category label to which any word segment belongs.
For example, for the word segments "601318", "scale", and "5 month 25" corresponding to the non-long-tail text "601318 scale 5 month 25", the category label to which each word segment belongs is determined through the classification algorithm, as shown in Table 2 below.
Word segment | Category label
601318 | Stock code
scale | Scale
5 month 25 | Time
TABLE 2
S303, combining the category labels to which the word segments belong to obtain the search mode corresponding to the non-long-tail text.
In the embodiment of the invention, each word segment of a non-long-tail text has a corresponding category label, and the category labels to which the word segments belong are combined to obtain the search mode corresponding to the non-long-tail text.
For example, for the non-long-tail text "601318 scale 5 month 25", the category labels to which its word segments belong, as shown in Table 2, are combined to obtain the search mode corresponding to the non-long-tail text, that is, "time + scale + stock code".
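Steps S301 to S303 can be illustrated with a minimal Python sketch. The rule-based `classify_segment` function is only a hypothetical stand-in for the preset classification algorithm (the source suggests a trained model such as named entity recognition), and the regular expressions, the `SCALE_WORDS` lexicon, and the function names are all assumptions. The labels are joined in word-segment order, which is also an assumption: the example above lists the same three labels in a different order.

```python
import re

# Hypothetical lexicon and rules standing in for the trained classifier.
SCALE_WORDS = {"scale"}


def classify_segment(segment: str) -> str:
    """Return an illustrative category label for one word segment."""
    if re.fullmatch(r"\d{6}", segment):
        return "stock code"            # six digits, e.g. "601318"
    if segment in SCALE_WORDS:
        return "scale"
    if re.fullmatch(r"\d{1,2} month \d{1,2}", segment):
        return "time"                  # e.g. "5 month 25"
    return "other"


def search_mode(segments: list[str]) -> str:
    """Combine the category labels of all word segments into a search mode."""
    return " + ".join(classify_segment(s) for s in segments)


# Assumed output of the word segmentation step S301.
segments = ["601318", "scale", "5 month 25"]
print(search_mode(segments))  # stock code + scale + time
```

In practice the rules would be replaced by the supervised classifier described above; only the combination step is meant to mirror S303.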
S103, counting second search indexes of each target search mode, and determining target non-long-tail texts contained in each target search mode.
In the embodiment of the present invention, the second search index of each target search mode may be counted. Because each non-long-tail text has a corresponding search mode, the second search index of each target search mode can be counted over the search modes of all non-long-tail texts.
For example, for the target search modes search mode 1, search mode 2, …, each non-long-tail text has a corresponding search mode, as shown in Table 1 above, so the second search frequency of each target search mode may be counted over these search modes.
In addition, in the embodiment of the present invention, the target non-long-tail texts contained in each target search mode also need to be determined; that is, for any target search mode, it is determined which non-long-tail texts belong to that target search mode.
For example, for the target search modes search mode 1, search mode 2, …, the target non-long-tail texts contained in each target search mode are determined. As shown in Table 1 above, search mode 1 contains the target non-long-tail texts text A, text C, …, and search mode 2 contains the target non-long-tail texts text B, ….
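The counting and grouping in S103 can be sketched as follows. The source does not state how the second search index is aggregated, so this sketch assumes it is the sum of the first search counts of the texts sharing a search mode. The record values and the name `group_by_mode` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (non-long-tail text, first search count, search mode).
records = [
    ("Text A", 600, "Search mode 1"),
    ("Text B", 500, "Search mode 2"),
    ("Text C", 400, "Search mode 1"),
]


def group_by_mode(records):
    """Return (second search index per mode, texts contained in each mode)."""
    second_index = defaultdict(int)
    texts_by_mode = defaultdict(list)
    for text, first_index, mode in records:
        second_index[mode] += first_index          # assumed aggregation rule
        texts_by_mode[mode].append((text, first_index))
    return dict(second_index), dict(texts_by_mode)


second_index, texts_by_mode = group_by_mode(records)
print(second_index)   # {'Search mode 1': 1000, 'Search mode 2': 500}
print(texts_by_mode)
```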
S104, sampling the target non-long-tail text according to the second search index and the first search index of the target non-long-tail text to obtain a non-long-tail search data set.
In the embodiment of the present invention, the target non-long-tail texts contained in each target search mode may be sampled according to the second search index of that target search mode and the first search indexes of the target non-long-tail texts it contains, so as to obtain a non-long-tail search data set. It should be noted that the second search index here may be a second search count or a second search frequency, which is not limited in the embodiment of the present invention.
For example, the target search modes are search mode 1 and search mode 2, the second search frequency of search mode 1 is 1000, and the second search frequency of search mode 2 is 500. Search mode 1 contains the target non-long-tail texts text A, text C, …, and search mode 2 contains the target non-long-tail texts text B, …. The target non-long-tail texts contained in search mode 1 are sampled according to the second search frequency 1000 of search mode 1 and the first search indexes of those texts, and the target non-long-tail texts contained in search mode 2 are sampled according to the second search frequency 500 of search mode 2 and the first search indexes of those texts. The two sampling results together form the non-long-tail search data set.
That is, the second search index of each target search mode and the first search indexes of the target non-long-tail texts it contains serve as the rules for sampling those texts and constructing the non-long-tail search data set. Specifically, fig. 4 is a schematic implementation flow diagram of a sampling method according to an embodiment of the present invention. The method is applied to an electronic device and may include the following steps:
S401, determining a first sampling proportion of the target non-long-tail text contained in each target search mode according to the second search index of each target search mode.
In the embodiment of the present invention, for each target search mode, a first sampling proportion of a target non-long-tail text included in each target search mode may be determined according to a second search index of each target search mode, which means that for any target search mode, a sampling proportion of a target non-long-tail text included in the target search mode is determined by the second search index of the target search mode.
For example, taking search mode 1 as the target search mode, the first sampling proportion of the target non-long-tail texts contained in search mode 1 (text A, text C, …) is determined according to the second search frequency of search mode 1. In other words, the proportion of the data set sampled from search mode 1 is determined by the second search frequency of search mode 1.
The first sampling proportion of the target non-long-tail text included in each target search mode can be determined by the method shown in fig. 5. Specifically, as shown in fig. 5, which is a schematic view of an implementation flow of a sampling ratio determining method according to an embodiment of the present invention, the method may be applied to an electronic device, and specifically includes the following steps:
S501, for any target search mode, determining the logarithm corresponding to the second search index of the target search mode.
S502, summing the logarithms corresponding to the second search indexes of all target search modes to obtain a logarithm sum.
S503, for any target search mode, obtaining the quotient of the logarithm corresponding to the second search index of the target search mode divided by the logarithm sum.
S504, determining the quotient as the first sampling proportion of the target non-long-tail text contained in the target search mode.
In the embodiment of the present invention, for any target search mode, the logarithm corresponding to the second search index of the target search mode is determined, and thus for each target search mode, the logarithm corresponding to the second search index of each target search mode is determined.
For example, for the target search modes search mode 1, search mode 2, and search mode 3, the logarithm corresponding to the second search frequency of each target search mode is determined, as shown in Table 3 below.
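Target search mode | Second search frequency | Logarithm
Search mode 1 | 1000 | 3
Search mode 2 | 500 | 2.698970004
Search mode 3 | 400 | 2.602059991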
TABLE 3
In the embodiment of the present invention, the logarithms corresponding to the second search indexes of all target search modes are summed to obtain the logarithm sum. For example, as shown in Table 3, the logarithm 3 of the second search frequency 1000 of search mode 1, the logarithm 2.698970004 of the second search frequency 500 of search mode 2, and the logarithm 2.602059991 of the second search frequency 400 of search mode 3 are summed, giving a logarithm sum of 8.301029996.
In the embodiment of the invention, for any target search mode, the quotient of the logarithm corresponding to the second search index of the target search mode divided by the logarithm sum is obtained, and this quotient is taken as the first sampling proportion of the target non-long-tail text contained in the target search mode. In this way, the first sampling proportion of the target non-long-tail text contained in each target search mode can be determined.
For example, for the target search modes search mode 1, search mode 2, and search mode 3, the quotient of the logarithm corresponding to the second search frequency of each target search mode divided by the logarithm sum is obtained and taken as the first sampling proportion of the target non-long-tail text contained in that target search mode, as shown in Table 4 below.
[Table 4 image not reproduced: first sampling proportion of each target search mode]
TABLE 4
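Steps S501 to S504 amount to normalizing the logarithms of the second search indexes. The sketch below assumes base-10 logarithms, which matches the value 3 given for a frequency of 1000, and assumes every second search index is greater than 1 so that all logarithms are positive. The function name `sampling_proportions` is hypothetical.

```python
import math


def sampling_proportions(second_index: dict[str, float]) -> dict[str, float]:
    """Return log(index) / sum(log(index)) for every target search mode."""
    logs = {mode: math.log10(value) for mode, value in second_index.items()}  # S501
    log_sum = sum(logs.values())                                              # S502
    return {mode: log / log_sum for mode, log in logs.items()}                # S503, S504


proportions = sampling_proportions(
    {"Search mode 1": 1000, "Search mode 2": 500, "Search mode 3": 400}
)
for mode, proportion in proportions.items():
    print(f"{mode}: {proportion:.4f}")
```

Because each proportion is a ratio of logarithms, the result does not depend on the logarithm base. Taking logarithms compresses the gap between frequently and rarely searched modes, so a mode searched 1000 times receives only a slightly larger share than one searched 400 times.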
S402, determining the sample number corresponding to the non-long-tail search data set, and determining the first sampling number of the target non-long-tail text contained in each target search mode according to the sample number and the first sampling proportion.
In the embodiment of the present invention, the sample number corresponding to the non-long-tail search data set, for example X texts, may be preset. The sample number is thereby obtained, and the first sampling number of the target non-long-tail text contained in each target search mode is determined according to the sample number and the first sampling proportion of the target non-long-tail text contained in that target search mode.
Specifically, for any target search mode, the product of the sample number and the first sampling proportion of the target non-long-tail text contained in the target search mode is obtained, and the product is rounded to obtain the first sampling number of the target non-long-tail text contained in the target search mode, which means that the first sampling number is a positive integer.
For example, for the target search modes search mode 1, search mode 2, and search mode 3, with a sample number of 10, the product of the sample number and the first sampling proportion of each target search mode is obtained and rounded to obtain the first sampling number of the target non-long-tail text contained in that target search mode, as shown in Table 5 below.
Target search mode | First sampling number (rounded)
Search mode 1 | 4
Search mode 2 | 3
Search mode 3 | 3
TABLE 5
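The values in Table 5 can be reproduced with a short sketch of S402, assuming base-10 logarithms and a sample number of 10. The function name `first_sampling_numbers` is hypothetical, and Python's built-in `round` (which rounds exact halves to the nearest even integer) is an assumption, since the source does not name a rounding rule.

```python
import math


def first_sampling_numbers(second_index: dict[str, float],
                           sample_number: int) -> dict[str, int]:
    """Round sample_number * first sampling proportion for each search mode."""
    logs = {mode: math.log10(value) for mode, value in second_index.items()}
    log_sum = sum(logs.values())
    return {mode: round(sample_number * log / log_sum)
            for mode, log in logs.items()}


numbers = first_sampling_numbers(
    {"Search mode 1": 1000, "Search mode 2": 500, "Search mode 3": 400}, 10
)
print(numbers)  # {'Search mode 1': 4, 'Search mode 2': 3, 'Search mode 3': 3}
```

Rounding each product independently does not guarantee in general that the rounded values sum to the sample number, although they do in this example.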
S403, extracting the first sampling number of target non-long-tail texts from the target non-long-tail texts contained in each target search mode according to the first search indexes of the target non-long-tail texts contained in each target search mode, so as to obtain a non-long-tail search data set.
In the embodiment of the invention, the target non-long-tail texts with the first sampling quantity can be extracted from the target non-long-tail texts contained in each target search mode according to the first search index of the target non-long-tail texts contained in each target search mode, so as to obtain the non-long-tail search data set.
Specifically, the target non-long-tail texts contained in each target search mode may be sorted according to their first search indexes (for example, from large to small), and the top N target non-long-tail texts are extracted from each target search mode to obtain the non-long-tail search data set, where N is a positive integer equal to the first sampling number of that target search mode.
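The top-N extraction for a single target search mode can be sketched as follows. The text names, the first search index values, and the function name `top_n_texts` are hypothetical.

```python
def top_n_texts(texts: list[tuple[str, int]], n: int) -> list[str]:
    """texts: (text, first search index) pairs belonging to one search mode."""
    ranked = sorted(texts, key=lambda item: item[1], reverse=True)  # large to small
    return [text for text, _ in ranked[:n]]


# Hypothetical target non-long-tail texts of one search mode.
mode_texts = [("Text A", 120), ("Text C", 300), ("Text D", 45)]
print(top_n_texts(mode_texts, 2))  # ['Text C', 'Text A']
```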
In the embodiment of the present invention, it may first be judged, for each target search mode, whether the first sampling number of the target non-long-tail text contained in the target search mode is greater than the number of target non-long-tail texts contained in that target search mode. If no target search mode has a first sampling number greater than its text number, the first sampling number of target non-long-tail texts are extracted from each target search mode according to the first search indexes of the target non-long-tail texts it contains, so as to obtain the non-long-tail search data set.
For example, for the target search modes search mode 1, search mode 2, and search mode 3, it is judged whether the first sampling number of each of these search modes is greater than the number of target non-long-tail texts it contains. If none of the first sampling numbers is greater than the corresponding text number, the first sampling number of target non-long-tail texts are extracted from each of search mode 1, search mode 2, and search mode 3 according to the first search frequencies of the target non-long-tail texts they contain, so as to obtain the non-long-tail search data set.
In addition, if the first sampling number of the target non-long-tail text contained in a first search mode is greater than the number of target non-long-tail texts contained in the first search mode, all target non-long-tail texts contained in the first search mode are extracted. Here, the first search mode is any one of the target search modes, and the remaining target search modes are second search modes, for each of which the first sampling number is not greater than the number of target non-long-tail texts it contains.
For each second search mode, the first sampling number of target non-long-tail texts are extracted from the target non-long-tail texts contained in the second search mode according to the first search indexes of those texts. The missing quantity for the first search mode is determined according to the number of target non-long-tail texts contained in the first search mode and the first sampling number of the first search mode.
The missing quantity is then distributed among the second search modes, so that each second search mode obtains its allocated quantity, and the allocated quantity of target non-long-tail texts are extracted from the remaining target non-long-tail texts contained in each second search mode according to the first search indexes of those remaining texts. The non-long-tail search data set is formed by all target non-long-tail texts contained in the first search mode, the first sampling number of target non-long-tail texts extracted from each second search mode, and the allocated quantity of target non-long-tail texts extracted from each second search mode.
It should be noted that the missing quantity may be allocated to the second search modes according to a certain rule, for example, evenly or in proportion, which is not limited in this embodiment of the present invention.
When the missing quantity is allocated in proportion, a second sampling proportion of the target non-long-tail text contained in each second search mode is determined according to the second search index of each second search mode. A second sampling number of the remaining target non-long-tail texts contained in each second search mode is determined according to the missing quantity and the second sampling proportion, so that the missing quantity is allocated to the second search modes in proportion. The second sampling number of target non-long-tail texts are then extracted from the remaining target non-long-tail texts contained in each second search mode according to the first search indexes of those remaining texts. Here, the second sampling proportion is calculated in a manner similar to the first sampling proportion, and details are not repeated here.
In addition, in the embodiment of the present invention, it may be judged whether the second sampling number of any second search mode is greater than the number of remaining target non-long-tail texts contained in that second search mode. If no second sampling number is greater than the corresponding number of remaining texts, the second sampling number of target non-long-tail texts are extracted from the remaining target non-long-tail texts contained in each second search mode according to the first search indexes of those remaining texts. Otherwise, the above steps are repeated: the search modes whose sampling number exceeds the upper limit are found and all of their texts are extracted; texts are extracted from the search modes whose sampling number does not exceed the upper limit according to their sampling numbers; and the remaining quantity is allocated to the search modes that have not exceeded the upper limit. Sampling continues in this way until no search mode exceeds the upper limit.
For example, the target search modes are search mode 1, search mode 2, and search mode 3. Search mode 1 contains 2 target non-long-tail texts and has a second search frequency of 1000 (meaning that search mode 1 has been searched 1000 times); search mode 2 contains 85 target non-long-tail texts and has a second search frequency of 500; search mode 3 contains 13 target non-long-tail texts and has a second search frequency of 400. If 10 texts need to be sampled from search mode 1, search mode 2, and search mode 3, the following operations are performed:
The logarithms corresponding to the second search frequency 1000 of search mode 1, the second search frequency 500 of search mode 2, and the second search frequency 400 of search mode 3 are determined, as shown in Table 3 above. These three logarithms are summed to obtain the logarithm sum, that is, 8.301029996. For each of search mode 1, search mode 2, and search mode 3, the quotient of the logarithm corresponding to its second search frequency divided by the logarithm sum is determined and taken as the first sampling proportion of the target non-long-tail text contained in that search mode, as shown in Table 4 above.
For each of search mode 1, search mode 2, and search mode 3, the product of 10 and the first sampling proportion of the target non-long-tail text contained in that search mode is obtained and rounded to obtain the first sampling number of that search mode, as shown in Table 5 above.
Here, the first sampling number of search mode 1 (4) is greater than the number of target non-long-tail texts contained in search mode 1 (2), whereas the first sampling numbers of search mode 2 and search mode 3 are not greater than the numbers of target non-long-tail texts they contain. Therefore, all target non-long-tail texts contained in search mode 1 are extracted, 3 texts are extracted from the target non-long-tail texts contained in search mode 2 according to their first search frequencies, and 3 texts are extracted from the target non-long-tail texts contained in search mode 3 according to their first search frequencies.
According to the number of target non-long-tail texts contained in search mode 1 and the first sampling number of search mode 1, the missing quantity for search mode 1 is determined to be 2, which means that 2 places remain unfilled and sampling needs to continue. At this time, the logarithm of the second search frequency 500 of search mode 2 and the logarithm of the second search frequency 400 of search mode 3 are summed to obtain a logarithm sum, that is, 5.301029996. For search mode 2, the quotient of the logarithm corresponding to its second search frequency divided by this logarithm sum is taken as the second sampling proportion of the remaining target non-long-tail texts contained in search mode 2, that is, 0.509140678. For search mode 3, the corresponding quotient is taken as the second sampling proportion of the remaining target non-long-tail texts contained in search mode 3, that is, 0.490859322.
For search mode 2, the product of 2 and the second sampling proportion of search mode 2 is obtained and rounded to obtain the second sampling number of the remaining target non-long-tail texts contained in search mode 2, that is, 1. For search mode 3, the product of 2 and the second sampling proportion of search mode 3 is obtained and rounded to obtain the second sampling number of the remaining target non-long-tail texts contained in search mode 3, that is, 1.
Here, the second sampling numbers of search mode 2 and search mode 3 are not greater than the numbers of remaining target non-long-tail texts they contain. Therefore, 1 text is extracted from the remaining target non-long-tail texts contained in search mode 2 according to their first search frequencies, and 1 text is extracted from the remaining target non-long-tail texts contained in search mode 3 according to their first search frequencies.
At this point, sampling from search mode 1, search mode 2, and search mode 3 is complete. In the first sampling pass, 2 texts are extracted from search mode 1, 3 texts from search mode 2, and 3 texts from search mode 3. In the second sampling pass, 1 text is extracted from the remaining texts of search mode 2 and 1 text from the remaining texts of search mode 3. Over the whole sampling process, therefore, 2 texts are extracted from search mode 1, 4 texts from search mode 2, and 4 texts from search mode 3, and these texts form the non-long-tail search data set.
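The whole sampling procedure, including proportional redistribution of the missing quantity, can be sketched as the loop below. This is one possible reading of the steps above, not a definitive implementation: it assumes base-10 logarithms, Python's built-in `round`, second search indexes greater than 1, and top-ranked extraction within each mode. The names `sample_non_long_tail` and `make_texts` and the generated first search indexes are hypothetical.

```python
import math


def sample_non_long_tail(modes, sample_number):
    """modes: {mode: (second search index, [(text, first search index), ...])}.

    Returns {mode: [selected texts]}.
    """
    # Rank the texts of each mode by first search index, highest first.
    remaining = {
        mode: sorted(texts, key=lambda item: item[1], reverse=True)
        for mode, (_, texts) in modes.items()
    }
    selected = {mode: [] for mode in modes}
    active = set(modes)          # modes that have not exceeded their upper limit
    quota = sample_number        # quantity still to be allocated

    while quota > 0 and active:
        logs = {mode: math.log10(modes[mode][0]) for mode in active}
        log_sum = sum(logs.values())
        numbers = {mode: round(quota * logs[mode] / log_sum) for mode in active}

        missing = 0
        for mode in sorted(active):
            take = min(numbers[mode], len(remaining[mode]))
            selected[mode] += [text for text, _ in remaining[mode][:take]]
            remaining[mode] = remaining[mode][take:]
            if numbers[mode] > take:
                # Upper limit exceeded: all texts taken, shortfall carried over.
                missing += numbers[mode] - take
                active.discard(mode)
        quota = missing

    return selected


def make_texts(prefix, count):
    """Hypothetical texts with descending first search indexes."""
    return [(f"{prefix}-{i}", count - i) for i in range(count)]


modes = {
    "Search mode 1": (1000, make_texts("m1", 2)),
    "Search mode 2": (500, make_texts("m2", 85)),
    "Search mode 3": (400, make_texts("m3", 13)),
}
result = sample_non_long_tail(modes, 10)
print({mode: len(texts) for mode, texts in result.items()})
# {'Search mode 1': 2, 'Search mode 2': 4, 'Search mode 3': 4}
```

Each pass either finishes the allocation or removes at least one exhausted mode from `active`, so the loop terminates after at most as many passes as there are search modes.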
S105, sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set, and constructing a search data set from the long-tail search data set and the non-long-tail search data set.
In the embodiment of the invention, the long-tail text can be sampled according to the first search index of the long-tail text to obtain a long-tail search data set, and the long-tail search data set and the non-long-tail search data set together constitute the search data set.
Specifically, the long-tail texts may be divided into three classes, namely high frequency, medium frequency, and low frequency, according to the first search index of each long-tail text, and the long-tail search data set is obtained by randomly sampling from each class.
In addition, although the search count or search frequency of a long-tail text is not high, the long-tail text may be highly similar to a non-long-tail text. For this reason, long-tail texts that have a low search count or search frequency but are highly similar to non-long-tail texts need to be selected and processed as non-long-tail texts.
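The three-class random sampling of long-tail texts can be sketched as follows. The source gives neither the class boundaries nor the number of texts drawn per class, so the `high` and `low` thresholds, the `per_class` parameter, the fixed random seed, and the function name `sample_long_tail` are all assumptions.

```python
import random


def sample_long_tail(texts, high, low, per_class, seed=0):
    """texts: list of (text, first search index) pairs for long-tail texts."""
    classes = {"high": [], "medium": [], "low": []}
    for text, index in texts:
        if index >= high:
            classes["high"].append(text)
        elif index >= low:
            classes["medium"].append(text)
        else:
            classes["low"].append(text)

    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    sampled = []
    for members in classes.values():
        # Draw without replacement, never more than the class contains.
        sampled += rng.sample(members, min(per_class, len(members)))
    return sampled


# Hypothetical long-tail texts with first search indexes 1..30.
long_tail_texts = [(f"query {i}", i) for i in range(1, 31)]
long_tail_set = sample_long_tail(long_tail_texts, high=20, low=10, per_class=3)
print(len(long_tail_set))  # 9
```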
Based on this, in the embodiment of the present invention, the non-long-tail texts are clustered through a preset clustering algorithm, and it is judged whether a long-tail text belongs to a clustering result of the non-long-tail texts. If the long-tail text does not belong to any clustering result of the non-long-tail texts, the long-tail text is sampled according to the first search index of the long-tail text to obtain the long-tail search data set. If the long-tail text belongs to a clustering result of the non-long-tail texts, the long-tail text is classified as a non-long-tail text, and the method jumps to the step of determining the search mode corresponding to the non-long-tail text.
Specifically, the distance between the long-tail text and the clustering center of a clustering result can be calculated. If the distance is smaller than a threshold, the long-tail text belongs to that clustering result; otherwise, it does not. The embodiment of the present invention does not limit this. In addition, an existing clustering algorithm may be used, and details are not repeated here.
For example, the non-long-tail texts are clustered through a preset clustering algorithm to obtain 3 clustering results, and it is judged whether a long-tail text belongs to any of these clustering results. If the long-tail text does not belong to any of them, the long-tail text is sampled according to the first search index of the long-tail text to obtain the long-tail search data set. If the long-tail text belongs to one of them, the long-tail text is classified as a non-long-tail text, the method jumps to the step of determining the search mode corresponding to the non-long-tail text, and the text is processed as a non-long-tail text.
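The distance-to-center check can be sketched as follows, assuming that each text has already been converted to a numeric vector and that Euclidean distance is used; the source fixes neither the text representation nor the distance measure. The vectors, the threshold, and the function names are hypothetical.

```python
import math


def belongs_to_cluster(vector, centers, threshold):
    """True if the vector lies within the threshold of any clustering center."""
    return any(math.dist(vector, center) < threshold for center in centers)


def split_long_tail(long_tail, centers, threshold):
    """long_tail: {text: vector}. Returns (reclassified texts, remaining texts)."""
    reclassified, kept = [], []
    for text, vector in long_tail.items():
        if belongs_to_cluster(vector, centers, threshold):
            reclassified.append(text)   # processed as non-long-tail text
        else:
            kept.append(text)           # sampled as long-tail text
    return reclassified, kept


# Hypothetical clustering centers of 3 non-long-tail clustering results.
centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
long_tail = {"query x": (0.5, 0.5), "query y": (5.0, 5.0)}
reclassified, kept = split_long_tail(long_tail, centers, threshold=1.0)
print(reclassified, kept)  # ['query x'] ['query y']
```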
Through the above description of the technical solution provided by the embodiment of the present invention, the first search indexes of each text are counted, each text is divided into a long-tail text and a non-long-tail text according to the first search indexes, a search mode corresponding to the non-long-tail text is determined, different target search modes in the search modes are searched, the second search indexes of each target search mode are counted, target non-long-tail texts included in each target search mode are determined, the target non-long-tail texts are sampled according to the second search indexes and the first search indexes of the target non-long-tail texts to obtain a non-long-tail search data set, the long-tail texts are sampled according to the first search indexes of the long-tail texts to obtain a long-tail search data set, and a search data set is constructed from the non-long-tail search data set.
In this way, sampling draws on both the search indexes of the texts and the search indexes of their search modes, so that the search data set represents high-frequency texts while still covering low-frequency texts and reflects the search needs of the user population. This avoids degrading the accuracy and validity of website search evaluation and the optimization of the website search effect.
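How sampling by the second search index can keep low-frequency search modes represented is easiest to see with the logarithmic proportion recited in claim 4. The sketch below is an assumption-laden illustration: the function name `first_sampling_counts` is hypothetical, the natural logarithm is chosen arbitrarily (the ratio is the same for any base), and rounding to the nearest integer is an assumption because the source does not say how fractional counts are handled.

```python
import math


def first_sampling_counts(pattern_index, sample_total):
    """Allocate `sample_total` samples across target search modes in
    proportion to the logarithm of each mode's second search index."""
    logs = {p: math.log(v) for p, v in pattern_index.items()}
    log_sum = sum(logs.values())
    if log_sum <= 0:
        # Happens when every index is 1; the source does not cover this case.
        raise ValueError("logarithm sum must be positive")
    proportions = {p: lg / log_sum for p, lg in logs.items()}
    # Rounding is an assumption, not something the source specifies.
    counts = {p: round(sample_total * r) for p, r in proportions.items()}
    return proportions, counts
```

For second search indexes of 1000, 100 and 10 and a sample size of 60, this yields 30, 20 and 10 samples. A share proportional to the raw indexes would instead give the first mode about 90% of the samples, which is the imbalance the logarithm dampens.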
Corresponding to the above method embodiment, an embodiment of the present invention further provides a device for constructing a search data set, and as shown in fig. 6, the device may include: a first index statistics module 610, a text partitioning module 620, a pattern determination module 630, a pattern lookup module 640, a second index statistics module 650, a text determination module 660, a first sampling module 670, a second sampling module 680.
A first index counting module 610, configured to count a first search index of each text;
a text dividing module 620, configured to divide each text into a long-tail text and a non-long-tail text according to the first search index;
a mode determining module 630, configured to determine a search mode corresponding to the non-long-tail text;
a pattern searching module 640, configured to search different target search patterns in the search patterns;
a second index counting module 650, configured to count a second search index of each target search pattern;
a text determining module 660, configured to determine target non-long-tail texts included in each target search mode;
the first sampling module 670 is configured to sample the target non-long-tail text according to the second search index and the first search index of the target non-long-tail text, so as to obtain a non-long-tail search data set;
the second sampling module 680 is configured to sample the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set, and construct a search data set together with the non-long-tail search data set.
The embodiment of the present invention further provides an electronic device, as shown in fig. 7, which includes a processor 71, a communication interface 72, a memory 73 and a communication bus 74, where the processor 71, the communication interface 72, and the memory 73 complete mutual communication through the communication bus 74,
a memory 73 for storing a computer program;
the processor 71, when executing the program stored in the memory 73, implements the following steps:
counting first search indexes of all texts, and dividing all the texts into long-tail texts and non-long-tail texts according to the first search indexes; determining a search mode corresponding to the non-long-tail text, and searching different target search modes in the search mode; counting second search indexes of each target search mode, and determining target non-long-tail texts contained in each target search mode; sampling the target non-long-tail text according to the second search index and the first search index of the target non-long-tail text to obtain a non-long-tail search data set; and sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set, and constructing a search data set with the non-long-tail search data set.
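The first steps executed by the processor can be sketched as below. Several details are assumptions made only for illustration: the first search index is taken to be a search frequency, whitespace splitting stands in for a real word segmenter, a dictionary lookup (`label_of`) stands in for the preset classification algorithm, the second search index of a mode is taken to be the summed frequency of its texts, and "sampling according to the first search index" is read as taking the most frequent texts first. All function names are hypothetical.

```python
from collections import defaultdict


def split_texts(first_index, threshold):
    """Texts above the threshold are non-long-tail; the rest are long-tail."""
    non_long_tail = {t: n for t, n in first_index.items() if n > threshold}
    long_tail = {t: n for t, n in first_index.items() if n <= threshold}
    return non_long_tail, long_tail


def search_pattern(text, label_of):
    """Combine the category labels of the word segments into a search mode."""
    # Whitespace split is a stand-in for a real word segmenter.
    return "+".join(label_of.get(tok, "other") for tok in text.split())


def group_by_pattern(non_long_tail, label_of):
    """Map each distinct target search mode to the texts it contains."""
    groups = defaultdict(dict)
    for text, n in non_long_tail.items():
        groups[search_pattern(text, label_of)][text] = n
    return dict(groups)


def second_index(groups):
    """Assumption: a mode's index is the summed frequency of its texts."""
    return {p: sum(members.values()) for p, members in groups.items()}


def top_by_first_index(members, k):
    """Assumption: pick the k texts with the highest first search index."""
    return sorted(members, key=members.get, reverse=True)[:k]
```

The per-mode sample sizes would then be derived from the second search indexes, and `top_by_first_index` would be applied within each mode to draw that many texts.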
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this is not intended to represent only one bus or type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the processor may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, there is also provided a storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the method for constructing a search data set according to any one of the above embodiments.
In yet another embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method for constructing a search data set according to any one of the above embodiments.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a storage medium or transmitted from one storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a related manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiment is substantially similar to the method embodiment, its description is relatively brief, and for the relevant points, reference may be made to the corresponding description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method of constructing a search data set, the method comprising:
counting first search indexes of all texts, and dividing all the texts into long-tail texts and non-long-tail texts according to the first search indexes;
determining a search mode corresponding to the non-long-tail text, and searching different target search modes in the search mode;
counting second search indexes of each target search mode, and determining target non-long-tail texts contained in each target search mode;
sampling the target non-long-tail text according to the second search index and the first search index of the target non-long-tail text to obtain a non-long-tail search data set, comprising: determining a first sampling proportion of target non-long-tail texts contained in each target search mode according to the second search index of each target search mode; determining the number of samples corresponding to a non-long tail search data set, and determining the first sampling number of target non-long tail texts contained in each target search mode according to the sample number and the first sampling proportion; extracting the target non-long-tail texts with the first sampling quantity from the target non-long-tail texts contained in each target search mode according to the first search indexes of the target non-long-tail texts contained in each target search mode to obtain a non-long-tail search data set;
and sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set, and constructing a search data set with the non-long-tail search data set.
2. The method of claim 1, wherein the counting the first search index of each text comprises:
obtaining a search log of a website within a preset time period, and determining the texts contained in the search log, wherein the texts are distinct from one another;
counting first search indexes of texts in the search logs;
the dividing each text into a long-tail text and a non-long-tail text according to the first search index comprises:
for any text, judging whether the first search index of the text is greater than a preset index threshold corresponding to the website;
if the first search index of the text is larger than the preset index threshold value, dividing the text into non-long-tail texts;
and if the first search index of the text is not greater than the preset index threshold value, dividing the text into a long-tail text.
3. The method according to claim 1, wherein the number of the non-long-tail texts is multiple, and the determining the search pattern corresponding to the non-long-tail text comprises:
for any non-long-tail text, performing word segmentation processing on the non-long-tail text to obtain a plurality of word segments;
for any word segment, determining the category label to which the word segment belongs through a preset classification algorithm, wherein a search mode is obtained by combining category labels;
and combining the category labels to which the word segments belong to obtain the search mode corresponding to the non-long-tail text.
4. The method of claim 1, wherein the determining a first sampling proportion of target non-long-tail text included in each of the target search modes according to the second search index of each of the target search modes comprises:
for any one target search mode, determining the logarithm of the second search index of the target search mode;
summing the logarithms of the second search indexes of the target search modes to obtain a logarithm sum;
for any one target search mode, obtaining the quotient of the logarithm of the second search index of the target search mode and the logarithm sum;
and determining the quotient of the logarithm of the second search index of the target search mode and the logarithm sum as the first sampling proportion of the target non-long-tail text contained in the target search mode.
5. The method according to claim 1, wherein said extracting, according to the first search index of the target non-long-tail text included in each of the target search modes, the first sample number of target non-long-tail texts from the target non-long-tail texts included in each of the target search modes to obtain a non-long-tail search data set comprises:
judging whether the first sampling quantity of the target non-long-tail text contained in each target search mode is larger than the text quantity of the target non-long-tail text contained in each target search mode;
if the first sampling quantity of the target non-long-tail text contained in each target search mode is not larger than the text quantity of the target non-long-tail text contained in each target search mode, extracting the target non-long-tail text with the first sampling quantity from the target non-long-tail text contained in each target search mode according to the first search index of the target non-long-tail text contained in each target search mode, and obtaining a non-long-tail search data set.
6. The method according to claim 5, wherein the extracting, according to the first search index of the target non-long-tail text included in each of the target search modes, the first sample number of target non-long-tail texts from the target non-long-tail texts included in each of the target search modes to obtain a non-long-tail search data set further comprises:
if the first sampling quantity of the target non-long-tail text contained in the first search mode is larger than the text quantity of the target non-long-tail text contained in the first search mode, extracting all the target non-long-tail text contained in the first search mode;
the first search mode comprises any one of the target search modes, and the number of first samples of target non-long-tail texts contained in the remaining second search modes is not greater than the number of texts of the target non-long-tail texts contained in the second search modes;
extracting the target non-long-tail texts with the first sampling quantity from the target non-long-tail texts contained in each second search mode according to the first search index of the target non-long-tail texts contained in each second search mode;
determining the missing quantity of the target non-long-tail text contained in the first search mode according to the text quantity of the target non-long-tail text contained in the first search mode and the first sampling quantity of the target non-long-tail text contained in the first search mode;
distributing the missing quantity to each second search mode, and extracting the distributed quantity of target non-long-tail texts from the remaining target non-long-tail texts contained in each second search mode according to the first search index of the remaining target non-long-tail texts contained in each second search mode;
and forming a non-long-tail search data set by all target non-long-tail texts contained in the first search mode, the target non-long-tail texts with the first sampling quantity extracted from the target non-long-tail texts contained in each second search mode and the distributed quantity of target non-long-tail texts.
7. The method according to claim 6, wherein the assigning the missing number to each of the second search patterns, and extracting the assigned number of target non-long-tail texts from the remaining target non-long-tail texts included in each of the second search patterns according to the first search index of the remaining target non-long-tail texts included in each of the second search patterns comprises:
determining a second sampling proportion of the remaining target non-long-tail text contained in each second search mode according to the second search index of each second search mode;
determining the second sampling quantity of the residual target non-long-tail text contained in each second search mode according to the missing quantity and the second sampling proportion;
and extracting the target non-long-tail texts with the second sampling quantity from the residual target non-long-tail texts contained in each second search mode according to the first search index of the residual target non-long-tail texts contained in each second search mode.
8. The method of claim 1, wherein sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set comprises:
clustering the non-long-tail texts by a preset clustering algorithm, and judging whether the long-tail texts belong to the clustering result of the non-long-tail texts;
and if the long-tail text does not belong to the clustering result of the non-long-tail text, sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set.
9. The method of claim 8, further comprising:
if the long-tail text belongs to the clustering result of the non-long-tail text, dividing the long-tail text into the non-long-tail text, and jumping to the step of determining the search mode corresponding to the non-long-tail text.
10. An apparatus for constructing a search data set, the apparatus comprising:
the first index counting module is used for counting first search indexes of all texts;
the text dividing module is used for dividing each text into a long-tail text and a non-long-tail text according to the first search index;
the mode determining module is used for determining a search mode corresponding to the non-long-tail text;
the mode searching module is used for searching different target searching modes in the searching modes;
the second index counting module is used for counting second search indexes of the target search modes;
the text determining module is used for determining target non-long-tail texts contained in each target search mode;
the first sampling module is configured to sample the target non-long-tail text according to the second search index and the first search index of the target non-long-tail text, so as to obtain a non-long-tail search data set, and includes: determining a first sampling proportion of target non-long-tail texts contained in each target search mode according to the second search index of each target search mode; determining the number of samples corresponding to a non-long-tail search data set, and determining the first sampling number of target non-long-tail texts contained in each target search mode according to the sample number and the first sampling proportion; extracting the target non-long-tail texts with the first sampling quantity from the target non-long-tail texts contained in each target search mode according to the first search indexes of the target non-long-tail texts contained in each target search mode to obtain a non-long-tail search data set;
and the second sampling module is used for sampling the long-tail text according to the first search index of the long-tail text to obtain a long-tail search data set and constructing a search data set with the non-long-tail search data set.
11. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 9 when executing a program stored in a memory.
12. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
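The shortfall handling recited in claims 5 to 7 can be illustrated with the following sketch. It is not the claimed implementation: the function name `sample_with_redistribution` is hypothetical, the second sampling proportion is assumed to use the same logarithmic quotient as claim 4, texts are assumed to be drawn in descending order of first search index, and rounding may leave the total slightly off the target.

```python
import math


def sample_with_redistribution(groups, counts, pattern_index):
    """groups: mode -> {text: first search index};
    counts: mode -> first sampling number;
    pattern_index: mode -> second search index."""
    def by_freq(members):
        # Assumption: higher first search index is drawn first.
        return sorted(members, key=members.get, reverse=True)

    # "First" modes ask for more texts than they contain.
    short = [p for p in groups if counts[p] > len(groups[p])]
    rest = [p for p in groups if p not in short]

    picked, missing = {}, 0
    for p in short:
        picked[p] = by_freq(groups[p])             # take every text
        missing += counts[p] - len(groups[p])      # record the shortfall
    for p in rest:
        picked[p] = by_freq(groups[p])[:counts[p]]

    # Spread the shortfall over the remaining ("second") modes.
    log_sum = sum(math.log(pattern_index[p]) for p in rest)
    if missing and log_sum > 0:
        for p in rest:
            share = math.log(pattern_index[p]) / log_sum
            extra = round(missing * share)         # rounding is an assumption
            remaining = [t for t in by_freq(groups[p]) if t not in picked[p]]
            picked[p] += remaining[:extra]
    return picked
```

In the usage below, mode A is asked for 4 texts but holds only 2, so the missing 2 are split evenly between B and C, whose second search indexes are equal.

```python
groups = {
    "A": {"a1": 9, "a2": 8},
    "B": {"b1": 7, "b2": 6, "b3": 5, "b4": 4},
    "C": {"c1": 3, "c2": 2, "c3": 1},
}
result = sample_with_redistribution(
    groups, {"A": 4, "B": 2, "C": 1}, {"A": 100, "B": 100, "C": 100}
)
```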
CN202211156488.8A 2022-09-22 2022-09-22 Search data set construction method and device, electronic equipment and storage medium Active CN115248847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211156488.8A CN115248847B (en) 2022-09-22 2022-09-22 Search data set construction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211156488.8A CN115248847B (en) 2022-09-22 2022-09-22 Search data set construction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115248847A CN115248847A (en) 2022-10-28
CN115248847B true CN115248847B (en) 2022-12-16

Family

ID=83699615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211156488.8A Active CN115248847B (en) 2022-09-22 2022-09-22 Search data set construction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115248847B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733756A (en) * 2018-04-11 2018-11-02 北京三快在线科技有限公司 Data preload method, apparatus, electronic equipment and readable storage medium storing program for executing
CN111950254A (en) * 2020-09-22 2020-11-17 北京百度网讯科技有限公司 Method, device and equipment for extracting word features of search sample and storage medium
CN114661910A (en) * 2022-03-25 2022-06-24 平安科技(深圳)有限公司 Intention identification method and device, electronic equipment and storage medium
CN114860872A (en) * 2022-04-13 2022-08-05 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650907B (en) * 2020-12-25 2023-07-14 百度在线网络技术(北京)有限公司 Search word recommendation method, target model training method, device and equipment

Also Published As

Publication number Publication date
CN115248847A (en) 2022-10-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant