CN111782950A - Sample data set acquisition method, device, equipment and storage medium - Google Patents


Info

Publication number
CN111782950A
Authority
CN
China
Prior art keywords
sample data
initial
search
negative sample
click
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010616445.8A
Other languages
Chinese (zh)
Inventor
王步霖
杨一帆
李悦
郭圣昱
屠川川
陶然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202010616445.8A
Publication of CN111782950A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application discloses a sample data set acquisition method, apparatus, device and storage medium, and belongs to the field of internet technologies. The method comprises: acquiring a first sample data set corresponding to any search word; selecting at least one target negative sample data according to the position, in a search result interface, of the initial positive sample data located in that search result interface; selecting at least one target positive sample data according to the historical click rate of the user identifier; and forming, from the target negative sample data and the target positive sample data, a second sample data set corresponding to the search word, the second sample data set being used for training a ranking model. This reduces the number of negative sample data and avoids the number of negative sample data being far larger than the number of positive sample data, so that the subsequently trained ranking model is not biased toward the features of the negative sample data. Training the ranking model with the second sample data set therefore improves the accuracy of the ranking model.

Description

Sample data set acquisition method, device, equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for obtaining a sample data set.
Background
To ensure accurate search results, a ranking model is usually called during a search to rank the pieces of data retrieved. How to train an accurate ranking model has therefore become an urgent problem.
In the related art, after a user searches based on a search word and obtains at least one piece of data, each piece of displayed data is regarded as having been seen by the user and can be used as sample data. If the user clicks a piece of displayed data, that data is recorded as positive sample data; if the user does not click a piece of displayed data, that data is recorded as negative sample data. Positive sample data and negative sample data are thus obtained from the user's click behavior, and the ranking model is then trained on them to obtain a trained ranking model.
However, because a user browses far more data than the user clicks, the positive sample data obtained in this way is far less than the negative sample data. During training, the features learned by the ranking model are biased toward the features of the negative sample data, so the accuracy of the trained ranking model is low.
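The click-based labelling of the related art can be sketched as follows. This is only an illustration: the function name `label_impressions` and the item identifiers are hypothetical and do not appear in the application.

```python
def label_impressions(displayed_item_ids, clicked_item_ids):
    """Label every displayed item as positive (clicked) or negative (not clicked)."""
    clicked = set(clicked_item_ids)
    positives = [item for item in displayed_item_ids if item in clicked]
    negatives = [item for item in displayed_item_ids if item not in clicked]
    return positives, negatives


# Hypothetical example: ten items are displayed and the user clicks one of them.
displayed = [f"item{i}" for i in range(10)]
positives, negatives = label_impressions(displayed, ["item3"])
print(len(positives), len(negatives))
```

With one click among ten displayed items, the labelling yields nine negatives for one positive, which is the imbalance described above.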
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for acquiring a sample data set, and solves the problems in the related art. The technical scheme is as follows:
In one aspect, a method for acquiring a sample data set is provided, where the method includes:
acquiring a first sample data set corresponding to any search word, wherein the first sample data set comprises a plurality of initial positive sample data and a plurality of initial negative sample data, the initial positive sample data is data on which a click behavior occurred in a search result interface corresponding to the search word, and the initial negative sample data is data on which no click behavior occurred in the search result interface corresponding to the search word;
selecting at least one target negative sample data from the plurality of initial negative sample data according to the positions, in the search result interface, of each initial negative sample data and of the initial positive sample data located in the same search result interface;
selecting at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data;
and forming a second sample data set corresponding to the search word from the selected at least one target negative sample data and the selected at least one target positive sample data, wherein the second sample data set is used for training a ranking model.
In one possible implementation manner, the selecting of at least one target negative sample data from the plurality of initial negative sample data according to the positions, in the search result interface, of each initial negative sample data and of the initial positive sample data located in the same search result interface includes:
determining, as the target position of the search result interface to which any initial negative sample data belongs, the position of the initial positive sample data ranked last in that search result interface;
and if the search result interface comprises initial negative sample data located before the target position, determining the initial negative sample data located before the target position as the target negative sample data.
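A minimal sketch of this implementation, assuming one search result interface is represented as a list of click flags in display order. The function name and the treatment of an interface with no click (no target position, so nothing is selected) are assumptions made for illustration.

```python
def negatives_before_last_click(clicked):
    """Return indices of unclicked items displayed before the last clicked item.

    clicked: list of bool in display order for one search result interface.
    """
    clicked_positions = [i for i, was_clicked in enumerate(clicked) if was_clicked]
    if not clicked_positions:
        # Assumption: an interface without any click has no target position.
        return []
    target_position = clicked_positions[-1]  # position of the last-ranked positive
    return [i for i in range(target_position) if not clicked[i]]


# Items at indices 1 and 4 were clicked; the target position is index 4.
flags = [False, True, False, False, True, False, False]
print(negatives_before_last_click(flags))
```

Items displayed above the last click were very likely seen and skipped by the user, which is why they are treated as the more reliable negatives.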
In another possible implementation, the method further includes:
if the search result interface further comprises initial negative sample data located after the target position, selecting, from the first number of initial negative sample data located after the target position, a second number of initial negative sample data as the target negative sample data, wherein the ratio of the second number to the first number is a preset ratio, and the preset ratio is smaller than 1.
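The down-sampling of negatives located after the target position could look like the sketch below. The application only states that the ratio of the second number to the first number is a preset ratio smaller than 1; uniform random selection, rounding down, and requiring a positive ratio are assumptions of this sketch, as are all names.

```python
import random


def sample_negatives_after_last_click(clicked, preset_ratio, rng=None):
    """Return indices of a subset of the items displayed after the last click.

    clicked: list of bool in display order for one search result interface.
    preset_ratio: fraction of the trailing negatives to keep (assumed 0 < ratio < 1).
    """
    if not 0 < preset_ratio < 1:
        raise ValueError("preset_ratio must be between 0 and 1 (exclusive)")
    rng = rng or random.Random()
    clicked_positions = [i for i, was_clicked in enumerate(clicked) if was_clicked]
    if not clicked_positions:
        return []  # assumption: no click means no target position
    target_position = clicked_positions[-1]
    # Everything after the last click is unclicked by construction.
    after = list(range(target_position + 1, len(clicked)))
    second_number = int(len(after) * preset_ratio)  # assumption: round down
    return sorted(rng.sample(after, second_number))


flags = [True] + [False] * 10
kept = sample_negatives_after_last_click(flags, 0.5, rng=random.Random(0))
print(len(kept))
```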
In another possible implementation manner, the selecting, according to the position of each initial negative sample data and the initial positive sample data located in the same search result interface in the search result interface, at least one target negative sample data from the multiple initial negative sample data includes:
determining, as the target position of the search result interface to which any initial negative sample data belongs, the position of the initial positive sample data ranked last in that search result interface;
and if the initial negative sample data is located before the target position, determining the initial negative sample data as the target negative sample data.
In another possible implementation manner, before selecting at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data, the method further includes:
obtaining at least one search record and at least one click record of any user identifier, wherein the search record comprises at least one piece of data corresponding to any user identifier, and the click record comprises data of click behaviors in the corresponding search record;
and determining the historical click rate of any user identifier according to the at least one search record and the at least one click record.
In another possible implementation manner, determining a historical click rate of any user identifier according to the at least one search record and the at least one click record includes:
determining the number of search records, among the at least one search record, that have a click record as the number of search clicks, and determining the number of pieces of data in each click record as the number of clicks of that click record;
and determining the historical click rate of any user identifier according to the number of search clicks and the number of clicks of each click record.
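The two counts described here can be computed as in the following sketch. The representation of a user's history as a list of per-search click lists is an assumption, and the names are hypothetical. The formula that combines the counts into the historical click rate appears in the source only as a figure, so it is deliberately not reconstructed here.

```python
def click_counts(clicks_per_search):
    """Compute the number of search clicks and the number of clicks per click record.

    clicks_per_search: one list per search record, holding the IDs of the data
    clicked in that search (an empty list means the search has no click record).
    """
    clicks_per_record = [len(clicked) for clicked in clicks_per_search if clicked]
    search_click_count = len(clicks_per_record)  # searches with a click record
    return search_click_count, clicks_per_record


# Hypothetical history: three searches, the second one without any click.
history = [["a", "b"], [], ["c"]]
print(click_counts(history))
```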
In another possible implementation manner, the determining a historical click rate of any user identifier according to the search click number and the click number of each click record includes:
determining the historical click rate of any user identifier by adopting the following formula:
[Formula image BDA0002563885030000021 is not reproduced in this text.]
wherein Q is the number of search clicks of the user identifier, I_i is the number of clicks of the click record corresponding to the i-th search record, N_i is the historical average number of clicks of the user identifier, and the symbol shown in image BDA0002563885030000022 (not reproduced in this text) is the historical click rate of the user identifier.
In another possible implementation, the method further includes:
training the ranking model according to the at least one target negative sample data and the at least one target positive sample data in the second sample data set, wherein the ranking model is used for ranking a plurality of pieces of data obtained by searching according to any search word.
In another possible implementation, the method further includes:
acquiring a search data set according to a currently input search word, wherein the search data set comprises a plurality of pieces of data;
calling the ranking model to rank the plurality of pieces of data to obtain the arrangement order of the plurality of pieces of data;
and displaying the plurality of pieces of data, in the arrangement order, in a search result interface corresponding to the search word.
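The use of the trained model at search time can be sketched as below. The `score` callable is a hypothetical stand-in for the trained ranking model, and ordering by descending score is an assumption; the application does not specify the model's interface.

```python
def rank_results(items, score):
    """Order retrieved items for display using a scoring function.

    score: hypothetical stand-in for the trained ranking model; maps an item
    to a relevance score. Assumption: higher scores are displayed first.
    """
    return sorted(items, key=score, reverse=True)


# Toy scores standing in for model output.
toy_scores = {"a": 0.1, "b": 0.9, "c": 0.5}
print(rank_results(["a", "b", "c"], toy_scores.get))
```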
In another aspect, an apparatus for acquiring a sample data set is provided, the apparatus comprising:
the data set acquisition module is used for acquiring a first sample data set corresponding to any search word, wherein the first sample data set comprises a plurality of initial positive sample data and a plurality of initial negative sample data, the initial positive sample data is data of a click behavior occurring in a search result interface corresponding to the search word, and the initial negative sample data is data of no click behavior occurring in the search result interface corresponding to the search word;
the first selection module is used for selecting at least one target negative sample data from the plurality of initial negative sample data according to the position of each initial negative sample data and the initial positive sample data in the same search result interface in the search result interface;
the second selection module is used for selecting at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data;
and the forming module is used for forming a second sample data set corresponding to any search word by using the selected at least one target negative sample data and the selected at least one target positive sample data, and the second sample data set is used for training the ranking model.
In one possible implementation manner, the first selecting module includes:
the position determining unit is used for determining the position of the initial positive sample data ranked at the last in the search result interface to which any initial negative sample data belongs as the target position of the search result interface;
and the selecting unit is used for determining the initial negative sample data positioned before the target position as the target negative sample data if the initial negative sample data positioned before the target position is included in the search result interface.
In another possible implementation manner, the selecting unit is further configured to, if the search result interface further includes initial negative sample data located after the target position, obtain, from the first number of initial negative sample data located after the target position, a second number of initial negative sample data as the target negative sample data, where a ratio between the second number and the first number is a preset ratio, and the preset ratio is smaller than 1.
In another possible implementation manner, the first selecting module includes:
the position determining unit is used for determining the position of the initial positive sample data ranked at the last in the search result interface to which any initial negative sample data belongs as the target position of the search result interface;
and the selecting unit is used for determining any initial negative sample data as the target negative sample data if the any initial negative sample data is positioned in front of the target position.
In another possible implementation manner, the apparatus further includes:
the record acquisition module is used for acquiring at least one search record and at least one click record of any user identifier, wherein the search record comprises at least one piece of data corresponding to any user identifier, and the click record comprises data of click behaviors in the corresponding search record;
and the click rate determining module is used for determining the historical click rate of any user identifier according to the at least one search record and the at least one click record.
In another possible implementation manner, the click rate determining module includes:
the number determining unit is used for determining the number of the search records with the click records in the at least one search record as the number of search clicks, and determining the number of the data included in each click record as the number of click of each click record;
and the click rate determining unit is used for determining the historical click rate of any user identifier according to the search click times and the click times of each click record.
In another possible implementation manner, the click rate determining unit is configured to determine a historical click rate of any user identifier by using the following formula:
[Formula image BDA0002563885030000031 is not reproduced in this text.]
wherein Q is the number of search clicks of the user identifier, I_i is the number of clicks of the click record corresponding to the i-th search record, N_i is the historical average number of clicks of the user identifier, and the symbol shown in image BDA0002563885030000032 (not reproduced in this text) is the historical click rate of the user identifier.
In another possible implementation manner, the apparatus further includes:
and the training module is used for training the ranking model according to the at least one target negative sample data and the at least one target positive sample data in the second sample data set, and the ranking model is used for ranking a plurality of pieces of data obtained by searching according to any search word.
In another possible implementation manner, the apparatus further includes:
the data set acquisition module is used for acquiring a search data set according to a currently input search word, wherein the search data set comprises a plurality of pieces of data;
the sorting module is used for calling the ranking model and ranking the plurality of pieces of data to obtain the arrangement order of the plurality of pieces of data;
and the display module is used for displaying the plurality of pieces of data, in the arrangement order, in the search result interface corresponding to the search word.
In another aspect, a computer device is provided, which comprises one or more processors and one or more memories having stored therein at least one instruction, which is loaded and executed by the one or more processors to carry out the operations performed by the sample data set acquisition method according to the above aspect.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the operations performed by the sample data set acquisition method according to the above aspect.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are loaded and executed by a processor to implement the operations performed by the sample data set acquisition method as described in the above aspect.
According to the sample data set acquisition method, apparatus, device and storage medium provided by the embodiments of the application, data on which a click behavior occurred in the search result interface corresponding to a search word is positive sample data, and data on which no click behavior occurred in the search result interface is negative sample data. The positive sample data is data that the user has viewed and is therefore highly reliable. Selecting the negative sample data according to the position of the positive sample data in the search result interface improves the reliability of the selected negative sample data and reduces the number of negative sample data, which avoids the number of negative sample data being far larger than the number of positive sample data and thus avoids a subsequently trained ranking model being biased toward the features of the negative sample data. Selecting the positive sample data according to the historical click rate of the user identifier improves the reliability of the selected positive sample data. Because the reliability of both the selected positive sample data and the selected negative sample data is improved, training the ranking model with them improves the accuracy of the ranking model.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flowchart of a sample data set acquisition method according to an embodiment of the present application;
FIG. 3 is a flowchart of a sample data set acquisition method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a search result interface provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a search result interface provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a search result interface provided by an embodiment of the present application;
FIG. 7 is a flowchart of a sample data set acquisition method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a sample data set acquisition apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of another sample data set acquisition apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a terminal according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.
The sample data set acquisition method provided by the application obtains at least one target negative sample data and at least one target positive sample data from the plurality of initial positive sample data and the plurality of initial negative sample data in a sample data set, which can improve the accuracy of the trained ranking model. The method can be applied to the following scenario:
For example, the sample data set acquisition method provided by the application is applied to a search scenario. When a user needs to view some data, the user inputs a search word in a terminal; the terminal obtains a plurality of corresponding pieces of data based on the search word and displays them in a certain arrangement order.
In addition, the method provided by the embodiment of the application is applied to the electronic equipment, and the electronic equipment can comprise a terminal and can also comprise a server.
FIG. 1 shows a schematic structural diagram of an implementation environment of an embodiment of the present application. Referring to FIG. 1, the implementation environment includes a terminal 101 and a server 102, which are connected through a communication network. After the terminal 101 obtains a plurality of pieces of data according to a search word input by a user, it obtains, according to the user's click operations, a first sample data set including a plurality of initial positive sample data and a plurality of initial negative sample data, and sends the first sample data set to the server 102. The server 102 obtains at least one target positive sample data and at least one target negative sample data from the first sample data set, and a ranking model is subsequently trained according to the at least one target positive sample data and the at least one target negative sample data.
The terminal can be various terminals such as a mobile phone, a tablet computer, a computer and the like, and the server can be a server, a server cluster consisting of a plurality of servers or a cloud computing service center.
Fig. 2 is a flowchart of a sample data set obtaining method provided in an embodiment of the present application, and referring to fig. 2, the method includes:
201. Acquire a first sample data set corresponding to any search word.
The first sample data set comprises a plurality of initial positive sample data and a plurality of initial negative sample data, the initial positive sample data is data on which a click behavior occurred in a search result interface corresponding to the search word, and the initial negative sample data is data on which no click behavior occurred in the search result interface corresponding to the search word.
202. Select at least one target negative sample data from the plurality of initial negative sample data according to the positions, in the search result interface, of each initial negative sample data and of the initial positive sample data located in the same search result interface.
203. Select at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data.
204. Form a second sample data set corresponding to the search word from the selected at least one target negative sample data and the selected at least one target positive sample data, wherein the second sample data set is used for training the ranking model.
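Steps 201 to 204 can be combined as in the following sketch. Several points are assumptions made for illustration: the `Sample` fields, the use of "located before the last clicked position" as the negative-selection rule (one of the possible implementations), and the `keep_user` predicate, which stands in for the positive-selection criterion because this part of the text does not state how the historical click rate is applied.

```python
from typing import Callable, Dict, List, NamedTuple


class Sample(NamedTuple):
    user_id: str
    page_id: str   # identifies one search result interface
    position: int  # display position within that interface
    clicked: bool  # True for initial positive, False for initial negative


def build_second_sample_set(
    first_set: List[Sample],
    click_rate_by_user: Dict[str, float],
    keep_user: Callable[[float], bool],
) -> List[Sample]:
    # Step 202: the target position of each interface is its last clicked position.
    last_click: Dict[str, int] = {}
    for s in first_set:
        if s.clicked:
            last_click[s.page_id] = max(last_click.get(s.page_id, -1), s.position)
    target_negatives = [
        s for s in first_set
        if not s.clicked and s.position < last_click.get(s.page_id, -1)
    ]
    # Step 203: keep positives whose user passes the (hypothetical) click-rate test.
    target_positives = [
        s for s in first_set
        if s.clicked and keep_user(click_rate_by_user[s.user_id])
    ]
    # Step 204: the second sample data set is the union of both selections.
    return target_positives + target_negatives
```

Passing the click-rate criterion as a callable keeps the sketch neutral about whether high or low historical click rates are preferred.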
According to the method provided by the embodiment of the application, data on which a click behavior occurred in the search result interface corresponding to a search word is positive sample data, and data on which no click behavior occurred in the search result interface is negative sample data. The positive sample data is data that the user has viewed and is therefore highly reliable. Selecting the negative sample data according to the position of the positive sample data in the search result interface improves the reliability of the selected negative sample data and reduces the number of negative sample data, which avoids the number of negative sample data being far larger than the number of positive sample data and thus avoids a subsequently trained ranking model being biased toward the features of the negative sample data. Selecting the positive sample data according to the historical click rate of the user identifier improves the reliability of the selected positive sample data. Because the reliability of both the selected positive sample data and the selected negative sample data is improved, training the ranking model with them improves the accuracy of the ranking model.
In one possible implementation manner, selecting at least one target negative sample data from a plurality of initial negative sample data according to the position of each initial negative sample data and the initial positive sample data located in the same search result interface in the search result interface, includes:
determining, as the target position of the search result interface to which any initial negative sample data belongs, the position of the initial positive sample data ranked last in that search result interface;
and if the search result interface comprises initial negative sample data positioned before the target position, determining the initial negative sample data positioned before the target position as the target negative sample data.
In another possible implementation, the method further includes:
if the search result interface further comprises initial negative sample data located after the target position, selecting, from the first number of initial negative sample data located after the target position, a second number of initial negative sample data as the target negative sample data, wherein the ratio of the second number to the first number is a preset ratio, and the preset ratio is smaller than 1.
In another possible implementation manner, selecting at least one target negative sample data from a plurality of initial negative sample data according to the position of each initial negative sample data and the initial positive sample data located in the same search result interface in the search result interface, includes:
determining, as the target position of the search result interface to which any initial negative sample data belongs, the position of the initial positive sample data ranked last in that search result interface;
and if the initial negative sample data is located before the target position, determining the initial negative sample data as the target negative sample data.
In another possible implementation manner, before selecting at least one target positive sample data from a plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data, the method further includes:
acquiring at least one search record and at least one click record of any user identifier, wherein the search record comprises at least one piece of data corresponding to any user identifier, and the click record comprises data of click behaviors in the corresponding search record;
and determining the historical click rate of any user identifier according to the at least one search record and the at least one click record.
In another possible implementation manner, determining a historical click rate of any user identifier according to at least one search record and at least one click record includes:
determining the number of search records with click records in at least one search record as the number of search click times, and determining the number of data in each click record as the number of click times of each click record;
and determining the historical click rate of any user identifier according to the search click times and the click times of each click record.
In another possible implementation manner, determining a historical click rate of any user identifier according to the number of search clicks and the number of clicks recorded by each click record includes:
determining the historical click rate of any user identifier by adopting the following formula:
[Formula image BDA0002563885030000061 is not reproduced in this text.]
wherein Q is the number of search clicks of the user identifier, I_i is the number of clicks of the click record corresponding to the i-th search record, N_i is the historical average number of clicks of the user identifier, and the symbol shown in image BDA0002563885030000062 (not reproduced in this text) is the historical click rate of the user identifier.
In another possible implementation, the method further includes:
training a ranking model according to at least one target negative sample data and at least one target positive sample data, wherein the ranking model is used for ranking a plurality of pieces of data obtained by searching according to any search word.
In another possible implementation, the method further includes:
acquiring a search data set according to a currently input search word, wherein the search data set comprises a plurality of pieces of data;
calling the ranking model to rank the plurality of pieces of data to obtain the arrangement order of the plurality of pieces of data;
and displaying the plurality of pieces of data, in the arrangement order, in the search result interface corresponding to the search word.
Fig. 3 is a flowchart of a sample data set obtaining method provided in an embodiment of the present application, and referring to fig. 3, the method is applied to an electronic device, and the method includes:
301. Acquiring a first sample data set corresponding to any search term.
In the embodiment of the application, a user can search based on a search word to obtain a plurality of pieces of data corresponding to the search word, and when the data are displayed in the user's terminal, they need to be displayed in a certain order. One approach is to invoke a trained ranking model to rank the pieces of data found by the search and obtain their arrangement order; the terminal then displays the pieces of data according to the arrangement order, so that the user can view them through the terminal and perform click operations on them.
The ranking model needs to be trained with sample data, and the quality of the sample data directly affects the accuracy of the ranking model. Therefore, to improve the quality of the sample data, target positive sample data and target negative sample data are obtained by screening the sample data, and training the ranking model with the target positive sample data and the target negative sample data helps ensure the accuracy of the resulting ranking model.
The search term is any term input by any user in the search process, and may be "tea water", "afternoon tea", "lunch", and the like.
The first sample data set comprises a plurality of initial positive sample data and a plurality of initial negative sample data, the initial positive sample data are data on which a click behavior occurred in the search result interface corresponding to the search word, and the initial negative sample data are data on which no click behavior occurred in the search result interface corresponding to the search word.
Optionally, the first sample data set includes a plurality of initial positive sample data and a plurality of initial negative sample data, which are obtained by a plurality of user identifications based on the same search word. Or, the first sample data set comprises a plurality of initial positive sample data and a plurality of initial negative sample data which are acquired by a user identification based on the search word.
The first sample data set can be obtained based on search processes. In one search process, after the terminal detects a search word input by a user, the terminal searches based on the search word and displays the plurality of pieces of data obtained by the search in a search result interface. When the user performs a click operation on any piece of data, the terminal detects the click operation and determines that piece of data as initial positive sample data; the terminal determines the data on which the user did not perform a click operation as initial negative sample data. In this manner, the terminal can determine a plurality of initial positive sample data and a plurality of initial negative sample data to form the first sample data set.
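As an illustrative sketch only, the labeling of one search result interface into initial positive and initial negative sample data could look as follows. The `Sample` record, its field names, and the function name are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One displayed piece of data; all field names are hypothetical."""
    search_word: str
    user_id: str
    item_id: str
    position: int   # 1-based position in the search result interface
    clicked: bool   # True -> initial positive, False -> initial negative


def build_first_sample_set(search_word, user_id, displayed_items, clicked_items):
    """Label every displayed item by whether a click operation was detected on it."""
    clicked = set(clicked_items)
    return [
        Sample(search_word, user_id, item, pos, item in clicked)
        for pos, item in enumerate(displayed_items, start=1)
    ]


samples = build_first_sample_set("afternoon tea", "u1", ["a", "b", "c"], ["b"])
positives = [s for s in samples if s.clicked]
negatives = [s for s in samples if not s.clicked]
```

Keeping the position and the user identifier on every record is a design choice made here because the later screening steps rely on both.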
Optionally, the search result interface may be a list page, where the list page includes a plurality of pieces of data sorted in order, and the user may view more data by pulling down the list page. In addition, the search result interface may also include a heterogeneous document, which is a document consisting of a plurality of pictures. The user can view the content corresponding to any picture by performing a trigger operation on the picture. The user can also slide left and right in the heterogeneous document to view different contents in it.
For example, when the user inputs the search word "afternoon tea", a plurality of pieces of data as shown in fig. 4 may be displayed in the search result interface. When a click operation on any piece of data is detected in the search result interface, that piece of data may be determined as initial positive sample data, and the data on which no click operation is detected may be determined as initial negative sample data. In addition, the document corresponding to the first picture is a normal document, and the two pictures arranged below that picture form a heterogeneous document.
Optionally, an application program is installed in the terminal, and the user may open a search function of the application program and then search in the application program to display a plurality of searched data in a search result interface of the application program.
The application may be an item recommendation application, a merchant recommendation application, or another type of recommendation application.
In a possible implementation manner, since the obtained initial positive sample data and the initial negative sample data are obtained by searching based on the search word, the search word is included in both the initial positive sample data and the initial negative sample data.
Alternatively, after the terminal logs in based on a user identifier, a search is performed based on the search word to obtain a plurality of pieces of data, and the initial positive sample data and initial negative sample data are then obtained according to the user's click behavior, so that the obtained initial positive sample data and initial negative sample data both correspond to the user identifier.
302. Selecting at least one target negative sample data from the plurality of initial negative sample data according to the positions, in the search result interface to which they belong, of each initial negative sample data and of the initial positive sample data located in the same search result interface.
The search result interface is used for displaying the data obtained after the terminal searches based on the search word.
After a user searches based on a search word, the number of pieces of data the user clicks is far smaller than the number of pieces of data that are not clicked. Therefore, if the first sample data set is obtained from a large number of search behaviors of user identifiers, the number of initial positive sample data in the first sample data set is generally far smaller than the number of initial negative sample data, so the initial negative sample data need to be screened to reduce their number.
Moreover, different users use different terminals, and different terminals correspond to different search result interfaces, so the data found when different terminals search based on a search word are displayed in different search result interfaces. Therefore, when processing the initial positive sample data and initial negative sample data, the search result interface to which each of them belongs needs to be determined first, and the target negative sample data are then determined per search result interface.
Each initial negative sample data and each initial positive sample data is displayed in its corresponding search result interface and has a position in that interface. The initial positive sample data in a search result interface have been viewed by the user, so the target negative sample data are selected according to the positions of the viewed initial positive sample data in the corresponding search result interface.
For example, as shown in the search interface of fig. 4, 3 documents are displayed in order from top to bottom, the position of the first document may be regarded as position 1, the position of the second document may be regarded as position 2, and the position of the third document may be regarded as position 3.
In the process of acquiring the first sample data set, initial negative sample data and initial positive sample data are acquired according to the click behavior of a user in a search interface, and at least one target negative sample data is selected from the multiple initial negative sample data according to the position of each initial negative sample data in the first sample data set and the position of the initial positive sample data in the same search result interface with the initial negative sample data.
It should be noted that, in the process of selecting at least one target negative sample data from a plurality of initial negative sample data, in one case, the selection is performed in units of search result interfaces to which the initial negative sample data belongs, and in another case, the selection is performed in units of a single initial negative sample data, which will be described below:
In the first case: the position of the last-ranked initial positive sample data in the search result interface to which any initial negative sample data belongs is determined as the target position of the search result interface, and if the search result interface includes initial negative sample data located before the target position, the initial negative sample data located before the target position are determined as the target negative sample data.
In the embodiment of the application, the search result interface to which any initial negative sample data belongs is determined, and the positions of the initial positive sample data in that search result interface are obtained. The positions of the plurality of initial positive sample data are then compared, and the position of the last-ranked initial positive sample data is determined as the target position of the search result interface. The target position can be regarded as the position the user last clicked in the search result interface, and the user has seen all the data before the target position. Therefore, the initial negative sample data before the target position can be determined to be data the user is not interested in, and can be determined as target negative sample data.
For example, when the target position in the search result interface to which the determined initial negative sample data belongs is position 7, the positions of the initial negative sample data before the target position include position 1, position 2, position 4 and position 6, and the remaining positions 3 and 5 are the initial positive sample data, the initial negative sample data corresponding to position 1, position 2, position 4 and position 6 are determined as the target negative sample data.
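The position rule in this example can be sketched as follows. This is a minimal illustration that assumes positions are 1-based integers within one search result interface; the function name is hypothetical, and the handling of an interface without any click is an assumption that the text does not address.

```python
def split_negatives_by_target_position(positive_positions, negative_positions):
    """Return (reliable, uncertain) negative positions for one search result interface.

    The target position is the position of the last-ranked initial positive sample.
    Negatives before it are reliable; negatives after it are uncertain.
    """
    if not positive_positions:
        # Assumption: without any click there is no target position,
        # so every negative is treated as uncertain.
        return [], sorted(negative_positions)
    target = max(positive_positions)
    reliable = sorted(p for p in negative_positions if p < target)
    uncertain = sorted(p for p in negative_positions if p > target)
    return reliable, uncertain


# Positions 3, 5 and 7 were clicked; target position is 7.
reliable, uncertain = split_negatives_by_target_position([3, 5, 7], [1, 2, 4, 6, 8, 9])
```

Here `reliable` holds the positions that would be taken directly as target negative sample data, while `uncertain` holds the positions after the target position.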
In addition, on the basis of fig. 4, fig. 5 and fig. 6 are taken as examples. The solid-line frame in fig. 5 is the determined target position, and the data in the dashed-line frame before the target position are the determined target negative sample data, which can be regarded as reliable negative sample data. The solid-line frame in fig. 6 is the determined target position, the data in the dashed-line frame after the target position are unreliable negative sample data, and target negative sample data need to be extracted from them according to probability.
In the second case: the position of the last-ranked initial positive sample data in the search result interface to which any initial negative sample data belongs is determined as the target position of the search result interface, and if that initial negative sample data is located before the target position, it is determined as target negative sample data.
The process of determining the target position in the search result interface is similar to the above process, and is not repeated here.
It should be noted that, on the basis of the first case, the search result interface no longer includes initial positive sample data after the determined target position, and all the data after the target position are initial negative sample data. Since these initial negative sample data may still include data the user is not interested in, a part of them is acquired and used as target negative sample data.
Optionally, if the search result interface further includes initial negative sample data located after the target position, a second amount of initial negative sample data is obtained from the first amount of initial negative sample data located after the target position as the target negative sample data.
The ratio of the second quantity to the first quantity is a preset ratio, and the preset ratio is smaller than 1.
When target negative sample data are acquired from the first quantity of initial negative sample data after the target position, each initial negative sample data has the same probability of being selected, so that the ratio of the second quantity of acquired target negative sample data to the first quantity is the preset ratio.
The preset ratio may be set by the electronic device or by the operator. For example, the preset ratio may be 0.4, 0.5, or another value.
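A minimal sketch of this equal-probability extraction, assuming the second quantity is obtained by rounding the product of the first quantity and the preset ratio down to an integer (the rounding rule and the function name are assumptions):

```python
import random


def sample_uncertain_negatives(uncertain_negatives, preset_ratio, rng=None):
    """Pick a subset of the negatives after the target position, each with equal probability."""
    if not 0 < preset_ratio < 1:
        raise ValueError("preset_ratio must be greater than 0 and smaller than 1")
    rng = rng or random.Random()
    first_quantity = len(uncertain_negatives)
    second_quantity = int(first_quantity * preset_ratio)  # rounding down is an assumption
    # random.sample draws without replacement, so no sample is picked twice.
    return rng.sample(list(uncertain_negatives), second_quantity)


picked = sample_uncertain_negatives(list(range(10)), 0.4, random.Random(0))
```

Passing a seeded `random.Random` instance makes the extraction reproducible, which can be convenient when the sample data set is rebuilt.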
Optionally, since the number of acquired initial negative sample data is very large, the number of negative sample data cannot be effectively reduced if the preset ratio is set too large. Therefore, to effectively reduce the number of negative sample data, the preset ratio may be determined according to the number of initial negative sample data, with the preset ratio inversely proportional to that number. In other words, the preset ratio is decreased when the number of initial negative sample data is large and increased when the number of initial negative sample data is small.
In the embodiment of the application, the number of initial negative sample data is reduced and the accuracy of the determined target negative sample data is ensured. In addition, since the determined target position is the position the user last viewed in the search result interface, the user has not viewed the data after the target position, and whether the user is interested in them cannot be determined. Acquiring a part of these initial negative sample data as target negative sample data therefore further ensures the comprehensiveness of the acquired target negative sample data.
303. Acquiring at least one search record and at least one click record of any user identifier.
In this embodiment of the application, at least one target negative sample data has been obtained in step 302. The plurality of initial positive sample data in the first sample data set, however, were clicked by users, and different users have different habits. For example, some users habitually click many pieces of data and may not be interested in the data they click, so the initial positive sample data obtained from these users may be unreliable; other users click only the data they are interested in, so the initial positive sample data obtained from these users are very reliable.
Therefore, in the embodiment of the present application, at least one target positive sample data needs to be selected from the plurality of initial positive sample data. In the selection process, the historical click rates of the different user identifiers are determined first, and the target positive sample data are then selected according to those historical click rates. The historical click rate of a user identifier is determined according to the search records and click records of that user identifier.
In the embodiment of the application, for any user identifier, if a search is performed based on a search word under the user identifier, the terminal displays a plurality of pieces of data corresponding to the search word in a search result interface, records the search word and the at least one piece of data corresponding to it, and generates a search record of the user identifier from them. If a click operation on any piece of data in the search result interface is triggered under the user identifier, the terminal also records the data on which the click operation was triggered and generates a click record.
Therefore, the search record comprises at least one piece of data corresponding to any user identification, and the click record comprises the data of the click behavior in the corresponding search record.
For any user identifier, the user identifier corresponds to a search record, if the user identifier triggers a click behavior on data in the search record, the search record corresponds to a click record, and if the user identifier does not trigger a click behavior on data in the search record, the search record does not correspond to a click record.
304. Determining the historical click rate of any user identifier according to the at least one search record and the at least one click record.
When the first sample data set is obtained, the sample data are obtained from a plurality of user identifiers, and the habits of the users corresponding to different user identifiers differ; for example, some users like to click many times after searching.
After at least one search record and at least one click record of the user identifier are obtained, historical search data and data including click behaviors of the user identifier can be determined, and further the historical click rate of the user identifier can be determined.
A higher historical click rate of a user identifier indicates that the user identifier triggers fewer click behaviors, so the initial positive sample data obtained for that user identifier are more reliable; a lower historical click rate indicates that the user identifier triggers more click behaviors, so the initial positive sample data obtained for that user identifier are less reliable.
Optionally, the number of search records in which a click record exists in at least one search record is determined as the number of search clicks, the number of data included in each click record is determined as the number of click times of each click record, and the historical click rate of any user identifier is determined according to the number of search clicks and the number of click times of each click record.
In the process of determining the historical click rate of the user identifier, the search records in which a click behavior exists are determined first, and the number of these search records is the number of search clicks of the user identifier. The number of pieces of data included in each click record is then obtained, which is the number of clicks of that click record, and the historical click rate of the user identifier is determined according to the number of search clicks and the number of clicks of each click record.
Optionally, the following formula is used to determine the historical click rate of any user identifier:
Figure BDA0002563885030000101
wherein Q is the number of search clicks of any user identifier, I_i is the number of clicks of the click record corresponding to the i-th search record, N_i is the historical average number of clicks of any user identifier, and
Figure BDA0002563885030000102
is the historical click rate of any user identifier.
In addition, when N_i is 0, the user identifier has never had a click behavior before the current click behavior, which indicates that the initial positive sample data corresponding to the user identifier are reliable; in this case, the initial positive sample data corresponding to the user identifier can be directly determined as target positive sample data.
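The formula images are not reproduced in this text, so the following sketch rests on an assumed form that is only consistent with the surrounding description: the historical average number of clicks is taken as the mean of I_i over the Q search records that have a click record, and the historical click rate is taken as its reciprocal. Both the formula and the function name are assumptions, not the formula of the application.

```python
def historical_click_rate(clicks_per_record):
    """clicks_per_record: I_i for each search record that has a click record (Q = its length).

    ASSUMED formula: average = sum(I_i) / Q, rate = 1 / average.
    Returns None when the historical average is 0, i.e. there is no earlier
    click behavior and the positive samples are taken directly.
    """
    q = len(clicks_per_record)
    if q == 0:
        return None
    average_clicks = sum(clicks_per_record) / q
    if average_clicks == 0:
        return None
    return 1.0 / average_clicks


rate_single_clicker = historical_click_rate([1, 1, 1])   # one click per search
rate_heavy_clicker = historical_click_rate([2, 2])        # two clicks per search
rate_new_user = historical_click_rate([])                 # no click history
```

Under this assumption a user who clicks more pieces of data per search gets a lower rate, which matches the statement that more clicks correspond to a lower historical click rate.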
It should be noted that, in the embodiment of the present application, there is no fixed order between steps 303 and 304 on the one hand and step 302 on the other. Steps 303 and 304 may be performed first and then step 302, or step 302 may be performed first and then steps 303 and 304.
The second point to be noted is that, step 303-.
305. Selecting at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data.
The historical click rate of a user identifier reflects the number of clicks of the user identifier: the more clicks the user identifier has, the lower the historical click rate. Therefore, a higher historical click rate of the user identifier corresponding to an initial positive sample data indicates that the initial positive sample data is more reliable, and the probability of determining it as target positive sample data is higher; a lower historical click rate indicates that the initial positive sample data is less reliable, and the probability of determining it as target positive sample data is lower.
Therefore, after the historical click rate of the user identifier corresponding to each initial positive sample data is determined, the reliability of the initial positive sample data corresponding to the user identifier can be determined, and at least one target positive sample data is obtained according to the historical click rate of the user identifier corresponding to each initial positive sample data.
In a possible implementation manner, since each initial positive sample data corresponds to one historical click rate, when target positive sample data is extracted from a plurality of initial positive sample data, the target positive sample data is extracted according to the historical click rate of each initial positive sample data.
For example, when the historical click rate of the initial positive sample data 1 is 0.6, the probability of extracting the initial positive sample data 1 is 0.6, the historical click rate of the initial positive sample data 2 is 0.9, the probability of extracting the initial positive sample data 2 is 0.9, and so on, at least one target positive sample data can be extracted from a plurality of initial positive sample data.
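The probabilistic extraction in this example can be sketched as follows, assuming each initial positive sample data is represented by a hypothetical `(user_id, item_id)` pair and that a rate of `None` marks a user identifier with no earlier click behavior, whose samples are kept directly. Treating a user identifier that is missing from the mapping the same way is an additional assumption.

```python
import random


def select_target_positives(initial_positives, click_rate_by_user, rng=None):
    """Keep each (user_id, item_id) pair with probability equal to the user's historical click rate."""
    rng = rng or random.Random()
    selected = []
    for user_id, item_id in initial_positives:
        rate = click_rate_by_user.get(user_id)  # None if unknown (assumption: keep directly)
        # rng.random() is in [0, 1), so rate 1.0 always keeps and rate 0.0 never keeps.
        if rate is None or rng.random() < rate:
            selected.append((user_id, item_id))
    return selected


result = select_target_positives(
    [("u1", "a"), ("u2", "b"), ("u3", "c")],
    {"u1": 1.0, "u2": 0.0, "u3": None},
    random.Random(1),
)
```

With rates such as 0.6 and 0.9 the outcome varies from run to run, so the usage above uses the boundary rates 1.0 and 0.0 to keep the result deterministic.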
In another possible implementation manner, a plurality of initial positive sample data are grouped according to the user identifier, the initial positive sample data in each group correspond to the same user identifier, and then for each group of initial positive sample data, at least one target positive sample data is selected according to the historical click rate of the user identifier corresponding to the group of positive sample data.
306. Forming a second sample data set corresponding to any search word from the selected at least one target negative sample data and at least one target positive sample data.
The second sample data set is used for training the ranking model.
In the embodiment of the application, the selected at least one target negative sample data and the selected at least one target positive sample data can be regarded as reliable sample data, a second sample data set corresponding to any search word can be formed after the target negative sample data and the target positive sample data are mixed, and then the ranking model is trained according to the second sample data set.
Optionally, the at least one target negative sample data and the at least one target positive sample data are uniformly mixed so that they are evenly distributed, and the target negative sample data and target positive sample data can then be acquired evenly when the ranking model is trained.
It should be noted that the embodiment of the present application may be executed by a terminal or by a server; alternatively, the terminal may execute step 301 and send the first sample data set to the server, and the server executes steps 302 to 306.
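One simple way to mix the two kinds of sample data is a random shuffle of the labeled samples. The sketch below assumes this interpretation of "uniformly mixed"; the label values 1 and 0 and the function name are illustrative.

```python
import random


def build_second_sample_set(target_positives, target_negatives, rng=None):
    """Label the samples (1 = positive, 0 = negative) and shuffle them together."""
    rng = rng or random.Random()
    labeled = [(s, 1) for s in target_positives] + [(s, 0) for s in target_negatives]
    rng.shuffle(labeled)  # in-place shuffle so positives and negatives are interleaved
    return labeled


second_set = build_second_sample_set(["p1", "p2"], ["n1", "n2", "n3"], random.Random(0))
```

Shuffling once up front means that consecutive training batches drawn from the list tend to contain both positive and negative sample data.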
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
According to the method provided by the embodiment of the application, data on which a click behavior occurred in the search result interface corresponding to the search word are positive sample data, and data on which no click behavior occurred are negative sample data. The positive sample data are data the user has viewed and are therefore highly reliable, and the negative sample data are selected according to the positions of the positive sample data in the search result interface. This improves the reliability of the selected negative sample data and reduces their number, which avoids the situation where the negative sample data far outnumber the positive sample data and the subsequently trained ranking model is biased toward the characteristics of the negative sample data. In addition, the positive sample data are selected according to the historical click rate of the user identifier, which improves the reliability of the selected positive sample data. Because the selected positive and negative sample data are more reliable, training the ranking model with them improves the accuracy of the ranking model.
In addition, a part of the initial negative sample data located after the target position can be acquired as target negative sample data, which reduces the number of initial negative sample data while ensuring the accuracy of the determined target negative sample data. Because this part is acquired from initial negative sample data for which it cannot be determined whether the user is interested, the comprehensiveness of the acquired target negative sample data is further ensured.
Fig. 7 is a flowchart of a sample data set obtaining method provided in an embodiment of the present application, and referring to fig. 7, the method includes:
701. Acquiring a first sample data set corresponding to any search term.
The first sample data set comprises a plurality of initial positive sample data and a plurality of initial negative sample data, the initial positive sample data are data on which a click behavior occurred in the search result interface corresponding to the search word, and the initial negative sample data are data on which no click behavior occurred in the search result interface corresponding to the search word.
702. Selecting at least one target negative sample data from the plurality of initial negative sample data according to the positions, in the search result interface to which they belong, of each initial negative sample data and of the initial positive sample data located in the same search result interface.
703. Selecting at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data.
704. Forming a second sample data set corresponding to any search word from the selected at least one target negative sample data and at least one target positive sample data.
The second sample data set is used for training the ranking model.
The process in steps 701 to 704 is similar to that in steps 301 to 306, and is not described herein again.
705. Training the ranking model according to at least one target negative sample data and at least one target positive sample data in the second sample data set.
The ranking model is used for ranking a plurality of pieces of data obtained by searching according to any search term.
After the second sample data set is obtained, the ranking model can be trained according to the at least one target negative sample data and the at least one target positive sample data in the second sample data set.
First, an initial ranking model, or a ranking model that has already been trained one or more times, is obtained. A search word and its corresponding initial positive sample data or initial negative sample data are then obtained, and the ranking model is trained according to the search word and the corresponding initial positive sample data or initial negative sample data to obtain a trained ranking model.
If the search word corresponds to initial positive sample data, the sample probability between the search word and the initial positive sample data is 1, which means that the probability of the user clicking the initial positive sample data is 1; if the search word corresponds to initial negative sample data, the sample probability between the search word and the initial negative sample data is 0, which means that the probability of the user clicking the initial negative sample data is 0. In the training process, at least one search word and the corresponding initial positive sample data or initial negative sample data are input into the ranking model, and a prediction probability is obtained based on the ranking model, where the prediction probability represents the predicted probability that the user clicks the data. The error between the sample probability corresponding to the initial positive sample data or initial negative sample data and the prediction probability is then obtained, and the ranking model is adjusted so that the error obtained with the adjusted ranking model converges, which completes the training of the ranking model.
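The application does not specify the structure of the ranking model or the error function. Purely as a stand-in, the sketch below uses logistic regression trained by stochastic gradient descent on log loss, with each (search word, data) pair assumed to be already encoded as a hypothetical numeric feature vector.

```python
import math


def predict(weights, bias, x):
    """Predicted click probability: sigmoid of the weighted feature sum."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_ranking_model(features, labels, learning_rate=0.5, epochs=200):
    """Fit weights so the predicted probability approaches the 0/1 sample probability."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y  # gradient of log loss w.r.t. z
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical one-dimensional features: higher value -> more relevant to the search word.
weights, bias = train_ranking_model([[1.0], [0.9], [0.1], [0.0]], [1, 1, 0, 0])
```

A fixed number of epochs is used here for brevity; stopping once the error no longer decreases would be closer to the convergence criterion described in the text.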
706. Acquiring a search data set according to the currently input search word, wherein the search data set comprises a plurality of pieces of data.
707. Calling the ranking model to rank the plurality of pieces of data to obtain the arrangement order of the plurality of pieces of data.
After the plurality of pieces of data are obtained, the ranking model can be called with the search word and the plurality of pieces of data to obtain the probability between the search word and each piece of data, and the pieces of data are then sorted in descending order of the obtained probability to obtain their arrangement order.
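The descending sort by predicted probability can be sketched as follows. The `predict_click_probability` callable stands in for the trained ranking model, and its signature is an assumption.

```python
def rank_by_predicted_probability(search_word, items, predict_click_probability):
    """Sort items from highest to lowest predicted click probability."""
    scored = [(predict_click_probability(search_word, item), item) for item in items]
    # sort is stable, so items with equal probability keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]


# Stub model for illustration: fixed probabilities per item.
stub_scores = {"a": 0.2, "b": 0.9, "c": 0.5}
order = rank_by_predicted_probability(
    "afternoon tea", ["a", "b", "c"], lambda word, item: stub_scores[item]
)
```

The resulting `order` is the arrangement order in which the terminal would display the pieces of data.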
708. Displaying the plurality of pieces of data in the search result interface corresponding to the search word according to the arrangement order.
After the arrangement sequence of the plurality of pieces of data is determined, the plurality of pieces of data can be sequentially displayed according to the determined arrangement sequence, and then a user can view each piece of data in the search result interface.
In addition, while viewing, the user can trigger click behaviors on pieces of data in the search result interface. The terminal can continue to use the data on which a click behavior occurred as initial positive sample data and the data on which no click behavior occurred as initial negative sample data, update the first sample data set with the newly acquired initial positive and negative sample data, and then update the second sample data set by the method provided by the embodiment of the application. The ranking model is subsequently trained with the updated second sample data set, which continues to improve its accuracy.
According to the method provided by the embodiment of the application, reducing the number of initial negative sample data improves the reliability of the obtained target negative sample data and target positive sample data, and thereby the accuracy of the subsequently trained ranking model. Calling the ranking model to rank the plurality of pieces of data then improves the accuracy of the ranking and helps ensure that the data ranked first are the data the user is more interested in.
Fig. 8 is a schematic structural diagram of a sample data set acquisition apparatus according to an embodiment of the present application. Referring to fig. 8, the apparatus includes:
a data set obtaining module 801, configured to obtain a first sample data set corresponding to any search word, where the first sample data set includes multiple initial positive sample data and multiple initial negative sample data, the initial positive sample data is data where a click behavior occurs in a search result interface corresponding to the search word, and the initial negative sample data is data where a click behavior does not occur in the search result interface corresponding to the search word;
a first selecting module 802, configured to select at least one target negative sample data from multiple initial negative sample data according to the position of each initial negative sample data and the position of the initial positive sample data located in the same search result interface in the search result interface to which the initial positive sample data belongs;
a second selecting module 803, configured to select at least one target positive sample data from the multiple initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data;
a constructing module 804, configured to construct a second sample data set corresponding to any search term from the selected at least one target negative sample data and the at least one target positive sample data, where the second sample data set is used to train the ranking model.
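As an illustration of the input handled by the data set obtaining module 801, the split of one search word's displayed data into initial positive and initial negative sample data might look like the sketch below. The `Impression` record and its field names are hypothetical; the text does not define a storage format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Impression:
    """One piece of data shown in a search result interface (hypothetical record)."""
    user_id: str
    page_id: str
    position: int
    item: str
    clicked: bool

def split_initial_samples(impressions):
    """Clicked data becomes initial positive sample data; the rest, initial negative."""
    positives = [imp for imp in impressions if imp.clicked]
    negatives = [imp for imp in impressions if not imp.clicked]
    return positives, negatives

page = [
    Impression("u1", "p1", 1, "a", False),
    Impression("u1", "p1", 2, "b", True),
    Impression("u1", "p1", 3, "c", False),
]
initial_positives, initial_negatives = split_initial_samples(page)
```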
According to the apparatus provided by the embodiment of the application, data on which a click behavior occurs in the search result interface corresponding to the search word is taken as positive sample data, and data on which no click behavior occurs in the search result interface is taken as negative sample data. The positive sample data is data that the user has viewed, so its reliability is high. Selecting the negative sample data according to the position of the positive sample data in the search result interface improves the reliability of the selected negative sample data and reduces the number of negative sample data, which avoids the number of negative sample data being far larger than the number of positive sample data and thus avoids the subsequently trained ranking model being biased toward the features of the negative sample data. In addition, positive sample data is selected according to the historical click rate of the user identifier, which improves the reliability of the selected positive sample data. Because the reliability of both the selected positive sample data and the selected negative sample data is improved, subsequently training the ranking model with them improves the accuracy of the ranking model.
In one possible implementation, referring to fig. 9, the first selecting module 802 includes:
a position determining unit 8021, configured to determine, as a target position of the search result interface, a position of the initial positive sample data arranged at the last position in the search result interface to which any initial negative sample data belongs;
the selecting unit 8022 is configured to determine, if the search result interface includes initial negative sample data located before the target position, the initial negative sample data located before the target position as the target negative sample data.
In another possible implementation manner, the selecting unit 8022 is further configured to, if the search result interface further includes initial negative sample data located after the target position, obtain, from the first number of initial negative sample data located after the target position, a second number of initial negative sample data as the target negative sample data, where a ratio between the second number and the first number is a preset ratio, and the preset ratio is smaller than 1.
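The position-based selection performed by units 8021 and 8022 can be sketched as follows. The tuple layout, the default ratio of 0.5, the use of random sampling, and the handling of pages without any click are assumptions made for the example; the text only requires that the ratio of the second number to the first number is a preset ratio smaller than 1.

```python
import random

def select_target_negatives(page, preset_ratio=0.5, rng=None):
    """page: list of (position, item, clicked) for one search result interface."""
    rng = rng or random.Random(0)
    clicked_positions = [pos for pos, _, clicked in page if clicked]
    if not clicked_positions:
        # No positive sample data on this page, so no target position exists.
        # Returning nothing here is an assumption, not stated in the text.
        return []
    # Target position: the last-ranked initial positive sample data.
    target_position = max(clicked_positions)
    before = [item for pos, item, clicked in page
              if not clicked and pos < target_position]
    after = [item for pos, item, clicked in page
             if not clicked and pos > target_position]
    # Keep every negative before the target position; keep only a fraction
    # of the negatives after it (rounded down, so the ratio is approximate).
    second_number = int(len(after) * preset_ratio)
    return before + rng.sample(after, second_number)

page = [(1, "a", False), (2, "b", True), (3, "c", False), (4, "d", True),
        (5, "e", False), (6, "f", False), (7, "g", False), (8, "h", False)]
target_negatives = select_target_negatives(page)
```

Here the clicks at positions 2 and 4 put the target position at 4, so "a" and "c" are always kept, and two of the four later negatives are sampled.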
In another possible implementation manner, referring to fig. 9, the first selecting module 802 includes:
a position determining unit 8021, configured to determine, as a target position of the search result interface, a position of the initial positive sample data arranged at the last position in the search result interface to which any initial negative sample data belongs;
the selecting unit 8022 is configured to determine any initial negative sample data as the target negative sample data if the any initial negative sample data is located before the target position.
In another possible implementation, referring to fig. 9, the apparatus further includes:
the record obtaining module 805 is configured to obtain at least one search record and at least one click record of any user identifier, where the search record includes at least one piece of data corresponding to any user identifier, and the click record includes data of a click behavior occurring in the corresponding search record;
and a click rate determining module 806, configured to determine a historical click rate of any user identifier according to the at least one search record and the at least one click record.
In another possible implementation, referring to fig. 9, the click-through rate determining module 806 includes:
a number determining unit 8061, configured to determine, among the at least one search record, the number of search records for which a click record exists as the number of search click times, and determine the number of pieces of data included in each click record as the number of click times of each click record;
the click rate determining unit 8062 is configured to determine a historical click rate of any user identifier according to the search click times and the click times of each click record.
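The two quantities computed by the number determining unit 8061 can be sketched as below. The dictionary layout keyed by a search identifier is a hypothetical representation of the search records and click records. The sketch stops at these inputs and does not combine them into a click rate, because the combining formula is not reproduced in this text.

```python
def click_counts(search_records, click_records):
    """search_records: dict mapping search id -> list of displayed data.
    click_records: dict mapping search id -> list of clicked data."""
    # Number of click times of each click record (only for searches with clicks).
    per_record_clicks = {
        sid: len(clicked)
        for sid, clicked in click_records.items()
        if sid in search_records and clicked
    }
    # Number of search click times: search records that have a click record.
    search_click_times = len(per_record_clicks)
    return search_click_times, per_record_clicks

search_records = {"s1": ["a", "b", "c"], "s2": ["d", "e"], "s3": ["f", "g"]}
click_records = {"s1": ["a", "b"], "s3": ["g"]}
q, clicks_per_record = click_counts(search_records, click_records)
```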
In another possible implementation, the click-through rate determining unit 8062 is configured to determine the historical click-through rate of any user identifier by using the following formula:
[Formula image BDA0002563885030000131; not reproduced in this text]
where Q is the number of search click times of any user identifier, I_i is the number of click times of the click record corresponding to the i-th search record, N_i is the historical average number of clicks of any user identifier, and the quantity shown in formula image BDA0002563885030000132 is the historical click rate of any user identifier.
In another possible implementation, referring to fig. 9, the apparatus further includes:
the training module 807 is configured to train a ranking model according to at least one target negative sample data and at least one target positive sample data in the second sample data set, where the ranking model is configured to rank multiple pieces of data obtained by searching according to any search term.
In another possible implementation, referring to fig. 9, the apparatus further includes:
a data set obtaining module 808, configured to obtain a search data set according to a currently input search word, where the search data set includes multiple pieces of data;
the sorting module 809 is configured to call the ranking model to rank the multiple pieces of data and obtain an arrangement order of the multiple pieces of data;
the display module 810 is configured to display the plurality of pieces of data in the search result interface corresponding to the search term according to the arrangement order.
It should be noted that: the sample data set acquiring apparatus provided in the above embodiment is only illustrated by the division of the above functional modules when acquiring a data set, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure of the electronic device is divided into different functional modules to complete all or part of the above described functions. In addition, the embodiment of the sample data set obtaining apparatus provided in the foregoing embodiment and the embodiment of the sample data set obtaining method belong to the same concept, and specific implementation processes thereof are described in the method embodiment and are not described herein again.
Fig. 10 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 1000 can be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group audio Layer III, motion Picture Experts compression standard audio Layer 3), an MP4 player (Moving Picture Experts Group audio Layer IV, motion Picture Experts compression standard audio Layer 4), a notebook computer, a desktop computer, a head-mounted device, or any other intelligent terminal. Terminal 1000 can also be referred to as user equipment, portable terminal, laptop terminal, desktop terminal, or the like by other names.
In general, terminal 1000 can include: a processor 1001 and a memory 1002.
Processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1001 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also referred to as a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1001 may further include an AI (Artificial Intelligence) processor for processing a computing operation related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 1002 is used to store at least one instruction, and the at least one instruction is executed by the processor 1001 to implement the sample data set acquisition method provided by the method embodiments herein.
In some embodiments, terminal 1000 can also optionally include: a peripheral interface 1003 and at least one peripheral. The processor 1001, memory 1002 and peripheral interface 1003 may be connected by a bus or signal line. Various peripheral devices may be connected to peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1004, display screen 1005, camera assembly 1006, audio circuitry 1007, positioning assembly 1008, and power supply 1009.
The peripheral interface 1003 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1001 and the memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The radio frequency circuit 1004 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1004 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1004 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to capture touch signals on or over the surface of the display screen 1005. The touch signal may be input to the processor 1001 as a control signal for processing. In this case, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1005, disposed on a front panel of terminal 1000; in other embodiments, there may be at least two display screens 1005, respectively disposed on different surfaces of terminal 1000 or in a folded design; in still other embodiments, display screen 1005 may be a flexible display disposed on a curved surface or on a folded surface of terminal 1000. The display screen 1005 may even be arranged in a non-rectangular irregular shape, i.e., a shaped screen. The display screen 1005 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Generally, a front camera is disposed on the front panel of the terminal, and a rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1006 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1001 for processing or inputting the electric signals to the radio frequency circuit 1004 for realizing voice communication. For stereo sound collection or noise reduction purposes, multiple microphones can be provided, each at a different location of terminal 1000. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The positioning component 1008 is used to locate the current geographic location of terminal 1000 for navigation or LBS (Location Based Service). The positioning component 1008 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 1009 is used to supply power to various components in terminal 1000. The power source 1009 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 1009 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1000 can also include one or more sensors 1010. The one or more sensors 1010 include, but are not limited to: acceleration sensor 1011, gyro sensor 1012, pressure sensor 1013, fingerprint sensor 1014, optical sensor 1015, and proximity sensor 1016.
Acceleration sensor 1011 can detect acceleration magnitudes on three coordinate axes of a coordinate system established with terminal 1000. For example, the acceleration sensor 1011 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1001 may control the display screen 1005 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1011. The acceleration sensor 1011 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1012 may detect a body direction and a rotation angle of the terminal 1000, and the gyro sensor 1012 and the acceleration sensor 1011 may cooperate to acquire a 3D motion of the user on the terminal 1000. From the data collected by the gyro sensor 1012, the processor 1001 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1013 can be disposed on a side frame of terminal 1000 and/or underneath display screen 1005. When pressure sensor 1013 is disposed on a side frame of terminal 1000, a user's grip signal on terminal 1000 can be detected, and processor 1001 performs left-right hand recognition or shortcut operation according to the grip signal collected by pressure sensor 1013. When the pressure sensor 1013 is disposed at a lower layer of the display screen 1005, the processor 1001 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1005. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1014 is used to collect a fingerprint of the user, and the processor 1001 identifies the user according to the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. Fingerprint sensor 1014 can be disposed on the front, back, or side of terminal 1000. When a physical key or vendor logo is provided on terminal 1000, fingerprint sensor 1014 can be integrated with the physical key or vendor logo.
The optical sensor 1015 is used to collect the ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the display screen 1005 according to the ambient light intensity collected by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1005 is increased; when the ambient light intensity is low, the display brightness of the display screen 1005 is turned down. In another embodiment, the processor 1001 may also dynamically adjust the shooting parameters of the camera assembly 1006 according to the intensity of the ambient light collected by the optical sensor 1015.
Proximity sensor 1016, also known as a distance sensor, is typically disposed on a front panel of terminal 1000. Proximity sensor 1016 is used to measure the distance between the user and the front face of terminal 1000. In one embodiment, when proximity sensor 1016 detects that the distance between the user and the front face of terminal 1000 is gradually decreasing, processor 1001 controls display screen 1005 to switch from a screen-on state to a screen-off state; when proximity sensor 1016 detects that the distance between the user and the front face of terminal 1000 is gradually increasing, processor 1001 controls display screen 1005 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in FIG. 10 is not intended to be limiting and that terminal 1000 can include more or fewer components than shown, or some components can be combined, or a different arrangement of components can be employed.
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1100 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 1101 and one or more memories 1102, where the memory 1102 stores at least one instruction, and the at least one instruction is loaded and executed by the processors 1101 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
The server 1100 may be configured to perform the steps performed by the server in the sample data set obtaining method.
An embodiment of the present application further provides an electronic device, where the electronic device includes one or more processors and one or more memories, where at least one instruction is stored in the one or more memories, and the at least one instruction is loaded and executed by the one or more processors to implement the operations performed by the sample data set obtaining method.
An embodiment of the present application further provides a computer-readable storage medium, where at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the operation performed by the sample data set obtaining method.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method for acquiring a sample data set, the method comprising:
acquiring a first sample data set corresponding to any search word, wherein the first sample data set comprises a plurality of initial positive sample data and a plurality of initial negative sample data, the initial positive sample data is data of a click behavior occurring in a search result interface corresponding to the search word, and the initial negative sample data is data of no click behavior occurring in the search result interface corresponding to the search word;
selecting at least one target negative sample data from the plurality of initial negative sample data according to the position of each initial negative sample data and the initial positive sample data in the same search result interface in the search result interface;
selecting at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data;
and forming a second sample data set corresponding to any search word by using the selected at least one target negative sample data and the selected at least one target positive sample data, wherein the second sample data set is used for training the ranking model.
2. The method of claim 1, wherein said selecting at least one target negative sample data from said plurality of initial negative sample data according to the position of each initial negative sample data and the initial positive sample data in the same search result interface in the search result interface comprises:
determining the position of the initial positive sample data ranked at the last in the search result interface to which any initial negative sample data belongs as the target position of the search result interface;
and if the search result interface comprises initial negative sample data positioned before the target position, determining the initial negative sample data positioned before the target position as the target negative sample data.
3. The method of claim 2, further comprising:
if the search result interface further comprises initial negative sample data located after the target position, acquiring a second amount of initial negative sample data from the first amount of initial negative sample data located after the target position as the target negative sample data, wherein the ratio of the second amount to the first amount is a preset ratio, and the preset ratio is smaller than 1.
4. The method of claim 1, wherein said selecting at least one target negative sample data from said plurality of initial negative sample data according to the position of each initial negative sample data and the initial positive sample data in the same search result interface in the search result interface comprises:
determining the position of the initial positive sample data ranked at the last in the search result interface to which any initial negative sample data belongs as the target position of the search result interface;
and if any initial negative sample data is positioned in front of the target position, determining any initial negative sample data as the target negative sample data.
5. The method according to claim 1, wherein before selecting at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data, the method further comprises:
obtaining at least one search record and at least one click record of any user identifier, wherein the search record comprises at least one piece of data corresponding to any user identifier, and the click record comprises data of click behaviors in the corresponding search record;
and determining the historical click rate of any user identifier according to the at least one search record and the at least one click record.
6. The method of claim 5, wherein determining the historical click rate for any of the user identifiers based on the at least one search record and the at least one click record comprises:
determining the number of search records with click records in the at least one search record as the number of search click times, and determining the number of data in each click record as the number of click times of each click record;
and determining the historical click rate of any user identifier according to the search click times and the click times of each click record.
7. The method of claim 6, wherein determining the historical click rate of any user identifier based on the search click and the click count of each click record comprises:
determining the historical click rate of any user identifier by adopting the following formula:
[Formula image FDA0002563885020000021; not reproduced in this text]
wherein Q is the number of search click times of the any user identifier, I_i is the number of click times of the click record corresponding to the i-th search record, N_i is the historical average number of clicks of the any user identifier, and the quantity shown in formula image FDA0002563885020000022 is the historical click rate of the any user identifier.
8. The method of claim 1, further comprising:
training the ranking model according to the at least one target negative sample data and the at least one target positive sample data in the second sample data set, wherein the ranking model is used for ranking a plurality of pieces of data obtained by searching according to any search word.
9. The method of claim 8, further comprising:
acquiring a search data set according to a currently input search word, wherein the search data set comprises a plurality of pieces of data;
calling the ranking model to rank the plurality of pieces of data to obtain the arrangement order of the plurality of pieces of data;
and displaying the plurality of pieces of data according to the arrangement order in a search result interface corresponding to the search word.
10. An apparatus for sample data set acquisition, the apparatus comprising:
the data set acquisition module is used for acquiring a first sample data set corresponding to any search word, wherein the first sample data set comprises a plurality of initial positive sample data and a plurality of initial negative sample data, the initial positive sample data is data of a click behavior occurring in a search result interface corresponding to the search word, and the initial negative sample data is data of no click behavior occurring in the search result interface corresponding to the search word;
the first selection module is used for selecting at least one target negative sample data from the plurality of initial negative sample data according to the position of each initial negative sample data and the initial positive sample data in the same search result interface in the search result interface;
the second selection module is used for selecting at least one target positive sample data from the plurality of initial positive sample data according to the historical click rate of the user identifier corresponding to each initial positive sample data;
and the forming module is used for forming a second sample data set corresponding to any search word by using the selected at least one target negative sample data and the selected at least one target positive sample data, and the second sample data set is used for training the ranking model.
11. An electronic device, comprising one or more processors and one or more memories, the one or more memories storing at least one instruction, the at least one instruction being loaded and executed by the one or more processors to carry out the operations of the sample data set acquisition method according to any one of claims 1 to 9.
12. A computer-readable storage medium storing at least one instruction, the at least one instruction being loaded and executed by a processor to carry out the operations of the sample data set acquisition method according to any one of claims 1 to 9.
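The four modules of claim 10 can be read as a small filtering pipeline. The claim text here does not spell out the exact selection rules, so the following Python sketch fills them in with labeled assumptions: an unclicked item is kept as a target negative only if it appears above the lowest clicked item in the same search result interface, and a clicked item is kept as a target positive only if the user's historical click rate does not exceed a threshold. The names `Sample`, `page_id`, `user_ctr` and `max_ctr`, and the value 0.8, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    item_id: str
    user_id: str
    page_id: str    # identifies one search result interface (hypothetical field)
    position: int   # 0-based position of the item within that interface
    clicked: bool   # True for initial positive samples, False for initial negatives


def select_target_negatives(samples: List[Sample]) -> List[Sample]:
    """First selection module (assumed rule): keep an unclicked item only
    if it sits above the lowest clicked item on the same page."""
    lowest_click: Dict[str, int] = {}
    for s in samples:
        if s.clicked:
            lowest_click[s.page_id] = max(lowest_click.get(s.page_id, -1), s.position)
    return [
        s for s in samples
        if not s.clicked and s.position < lowest_click.get(s.page_id, -1)
    ]


def select_target_positives(
    samples: List[Sample], user_ctr: Dict[str, float], max_ctr: float = 0.8
) -> List[Sample]:
    """Second selection module (assumed rule): drop clicks from users whose
    historical click rate is above max_ctr, since such clicks carry less signal."""
    return [
        s for s in samples
        if s.clicked and user_ctr.get(s.user_id, 0.0) <= max_ctr
    ]


def build_second_sample_set(
    samples: List[Sample], user_ctr: Dict[str, float], max_ctr: float = 0.8
) -> List[Sample]:
    """Forming module: combine the target positives and target negatives."""
    return select_target_positives(samples, user_ctr, max_ctr) + select_target_negatives(samples)
```

The resulting list plays the role of the second sample data set that would be passed to whatever ranking-model trainer is in use. Both selection rules are illustrative stand-ins and should be replaced by the criteria defined in claims 1 to 9.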
CN202010616445.8A 2020-06-30 2020-06-30 Sample data set acquisition method, device, equipment and storage medium Pending CN111782950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010616445.8A CN111782950A (en) 2020-06-30 2020-06-30 Sample data set acquisition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111782950A true CN111782950A (en) 2020-10-16

Family

ID=72761539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010616445.8A Pending CN111782950A (en) 2020-06-30 2020-06-30 Sample data set acquisition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111782950A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328891A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method for training search model, method for searching target object and device thereof
CN112597208A (en) * 2020-12-29 2021-04-02 深圳价值在线信息科技股份有限公司 Enterprise name retrieval method, enterprise name retrieval device and terminal equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570197A (en) * 2016-11-15 2017-04-19 北京百度网讯科技有限公司 Searching and ordering method and device based on transfer learning
CN107424043A (en) * 2017-06-15 2017-12-01 北京三快在线科技有限公司 A kind of Products Show method and device, electronic equipment
CN107526846A (en) * 2017-09-27 2017-12-29 百度在线网络技术(北京)有限公司 Generation, sort method, device, server and the medium of channel sequencing model
CN109508394A (en) * 2018-10-18 2019-03-22 青岛聚看云科技有限公司 A kind of training method and device of multi-medium file search order models
CN110458602A (en) * 2019-07-09 2019-11-15 北京三快在线科技有限公司 Method of Commodity Recommendation, device, electronic equipment and storage medium
CN110472027A (en) * 2019-07-18 2019-11-19 平安科技(深圳)有限公司 Intension recognizing method, equipment and computer readable storage medium
US20190391983A1 (en) * 2017-06-05 2019-12-26 Ancestry.Com Dna, Llc Customized coordinate ascent for ranking data records
CN111061954A (en) * 2019-12-19 2020-04-24 腾讯音乐娱乐科技(深圳)有限公司 Search result sorting method and device and storage medium
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
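Wu Fei; Zhuang Yueting: "Internet cross-media analysis and retrieval: theory and algorithms", Journal of Computer-Aided Design & Computer Graphics, no. 01, 15 January 2010 (2010-01-15) *
Li Mingqi: "Research on web page search ranking models", Intelligent Computer and Applications, no. 02, 1 February 2020 (2020-02-01) *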

Similar Documents

Publication Publication Date Title
CN109740068B (en) Media data recommendation method, device and storage medium
CN110674022B (en) Behavior data acquisition method and device and storage medium
CN108737897B (en) Video playing method, device, equipment and storage medium
CN110278464B (en) Method and device for displaying list
CN110865754B (en) Information display method and device and terminal
CN108717432B (en) Resource query method and device
CN111836069A (en) Virtual gift presenting method, device, terminal, server and storage medium
CN111935516B (en) Audio file playing method, device, terminal, server and storage medium
CN112084811A (en) Identity information determining method and device and storage medium
CN111858382A (en) Application program testing method, device, server, system and storage medium
CN111031391A (en) Video dubbing method, device, server, terminal and storage medium
CN110147503B (en) Information issuing method and device, computer equipment and storage medium
CN109618192B (en) Method, device, system and storage medium for playing video
CN109547847B (en) Method and device for adding video information and computer readable storage medium
CN110890969A (en) Method and device for mass-sending message, electronic equipment and storage medium
CN111782950A (en) Sample data set acquisition method, device, equipment and storage medium
CN112131473B (en) Information recommendation method, device, equipment and storage medium
CN112100528A (en) Method, device, equipment and medium for training search result scoring model
CN111563201A (en) Content pushing method, device, server and storage medium
CN111796990A (en) Resource display method, device, terminal and storage medium
CN111641853B (en) Multimedia resource loading method and device, computer equipment and storage medium
CN114817709A (en) Sorting method, device, equipment and computer readable storage medium
CN113377976A (en) Resource searching method and device, computer equipment and storage medium
CN111258673A (en) Fast application display method and terminal equipment
CN115905374A (en) Application function display method and device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination