CN109508394A

CN109508394A - Training method and device for a multimedia file search ranking model

Info

Publication number: CN109508394A
Application number: CN201811214519.4A
Authority: CN
Inventors: 赵明; 徐钊; 于松; 袁丽; 王永选; 杨梅
Original assignee: Qingdao Poly Cloud Technology Co Ltd
Current assignee: Qingdao Poly Cloud Technology Co Ltd
Priority date: 2018-10-18
Filing date: 2018-10-18
Publication date: 2019-03-22

Abstract

The present invention relates to computer technology and discloses a training method and device for a multimedia file search ranking model, intended to improve the ranking accuracy of search results. In the method, after keywords and their corresponding search results are determined, a ranking labelling function value is calculated for the search result of each keyword and corresponding sample data is generated, wherein the ranking labelling function value is positively correlated with the number of times the multimedia files were clicked and with the degree of topic-type association between the multimedia files. Sample data whose ranking labelling function value meets a preset condition is then selected, and model training is carried out with a preset algorithm to generate the corresponding multimedia file search ranking model. In this way, sample data in which the keyword and the search result share a consistent search intent can be selected, which improves the quality of the sample data, in turn improves the ranking accuracy of the trained model, and addresses the ranking optimization problem for search results.

Description

A training method and device for a multimedia file search ranking model

Technical field

The present invention relates to computer technology, and in particular to a training method and device for a multimedia file search ranking model.

Background art

In the prior art, when a user searches for multimedia files (e.g., films) on a smart TV, the smart TV usually returns the search results as a list according to the keyword entered by the user. To provide users with more accurate search results, a search model needs to be trained on existing historical search data. However, training a search model on existing historical search data has the following drawbacks:

1) After the smart TV receives the keyword entered by the user, it presents every multimedia file containing the keyword as a search result.

However, judging from current feedback, search results selected only because they contain the keyword usually carry a large amount of noise data.

For example, suppose the user enters the keyword "superman". After searching, the smart TV presents every film whose title contains "superman" as a search result. The user, however, only wants to watch films in the Superman series; other films whose titles contain "superman" but whose content is unrelated, e.g., "The Incredibles" (whose Chinese title contains the word for "superman"), are noise data. Because of their higher popularity, such weakly related results may nevertheless be ranked ahead of "Superman", so the ranking does not meet the user's expectation.

Obviously, if search results containing popular noise data are used as sample data for training the search model, the ranking accuracy of the search model will suffer.

2) In practice, when users enter a keyword with a remote control, they often enter the pinyin initials of the multimedia file's title as the keyword in order to save time.

However, judging from current feedback, using pinyin initials as the keyword makes the meaning of the keyword imprecise, so the search results contain even more noise data.

For example, suppose the user enters the keyword "CR". Besides "superman" (chao ren), the search results also include a series of titles matching "scarecrow" (cao ren; e.g., "Deadly Scarecrow"), "adult" (cheng ren; e.g., "Growing Up"), "successor" (chuan ren; e.g., "Shaolin Successor"), and so on. These titles are far from the user's actual search intent, so search results obtained with such keywords still contain a large amount of noise data.

Similarly, if search results containing popular noise data are used as sample data for training the search model, the ranking accuracy of the search model will suffer.

In view of this, a new method for building a multimedia file search ranking model is needed to overcome the above drawbacks.

Summary of the invention

The object of the present invention is to provide a training method and device for a multimedia file search ranking model, so as to improve the ranking accuracy of search results.

The specific technical solutions provided by the embodiments of the present invention are as follows:

A training method for a multimedia file search ranking model, comprising:

Determining keywords, and obtaining the search result corresponding to each keyword, wherein one search result contains at least one multimedia file corresponding to the respective keyword;

Calculating the ranking labelling function value of the search result corresponding to each keyword, and generating corresponding sample data based on the ranking labelling function value of each search result; wherein the ranking labelling function value of one search result is positively correlated, at least, with the number of times the multimedia files contained in the search result were clicked and with the degree of topic-type association between those multimedia files;

Selecting the sample data whose ranking labelling function value meets a preset condition, and carrying out model training with a preset algorithm to generate the corresponding multimedia file search ranking model.

Optionally, calculating the ranking labelling function value of one search result comprises:

Counting the cumulative click count of each multimedia file contained in the search result;

Counting the interval click-count change rate of each multimedia file contained in the search result within a specified duration;

Determining the topic-type distribution probability vector of each multimedia file in the search result, and calculating the degree of topic-type association between the multimedia files based on the topic-type distribution probability vectors;

Calculating the ranking labelling function value of the search result based on the cumulative click count of each multimedia file, the interval click-count change rate, and the degree of topic-type association.

Optionally, before the ranking labelling function value of the search result is calculated based on the cumulative click count of each multimedia file, the interval click-count change rate, and the degree of topic-type association, the method further comprises:

Taking the multimedia files in the search result whose cumulative click count is not zero as positive examples, and taking the multimedia files in the search result whose cumulative click count is zero as negative examples;

For each positive example, determining the M other multimedia files with the highest degree of topic-type association, the other multimedia files not being contained in the search result;

Selecting a set number of the other multimedia files thus obtained as positive examples to replace the negative examples, so that the ratio of positive examples to negative examples after replacement reaches a set ratio threshold.

Optionally, selecting the sample data whose ranking labelling function value meets a preset condition comprises:

Selecting the N sample data items with the highest ranking labelling function values, wherein N is a preset natural number; or,

Selecting the sample data whose ranking labelling function value reaches a set parameter threshold.

Optionally, before the sample data whose ranking labelling function value meets the preset condition is selected, any one or a combination of the following operations is further performed:

Selecting sample data that meets a preset data scale;

Selecting sample data for which the number of multimedia files contained in the keyword's search result reaches a set threshold;

Selecting sample data for which the time since release of the multimedia files contained in the keyword's search result reaches a set duration threshold;

Deleting sample data whose keyword is a single digit and/or a single letter.

Optionally, carrying out model training with a preset algorithm to generate the corresponding multimedia file search ranking model comprises:

Determining the associated features of each sample data item, wherein the associated features of one sample data item include at least keyword features, relevance features between the keyword and the multimedia files contained in the corresponding search result, attribute features of the multimedia files contained in the search result, and relevance features between the multimedia files contained in the search result;

Dividing the sample data into a training set and a test set;

Based on the training set and the corresponding associated features, carrying out iterative training of a multi-decision-tree model with a distributed gradient boosting decision tree (GBDT) algorithm to obtain a corresponding trained model;

Based on the test set and the corresponding associated features, testing the trained model and generating corresponding evaluation indices;

Adjusting the training parameters based on the evaluation indices obtained to optimize the trained model, and obtaining the final multimedia file search ranking model through repeated training.

A training device for a multimedia file search ranking model, comprising:

An acquiring unit, configured to determine keywords and obtain the search result corresponding to each keyword, wherein one search result contains at least one multimedia file corresponding to the respective keyword;

A processing unit, configured to calculate the ranking labelling function value of the search result corresponding to each keyword and to generate corresponding sample data based on the ranking labelling function value of each search result; wherein the ranking labelling function value of one search result is positively correlated, at least, with the number of times the multimedia files contained in the search result were clicked and with the degree of topic-type association between those multimedia files;

A training unit, configured to select the sample data whose ranking labelling function value meets a preset condition and to carry out model training with a preset algorithm to generate the corresponding multimedia file search ranking model.

Optionally, when calculating the ranking labelling function value of one search result, the processing unit is configured to:

Optionally, before the ranking labelling function value of the search result is calculated based on the cumulative click count of each multimedia file, the interval click-count change rate, and the degree of topic-type association, the processing unit is further configured to:

Optionally, when selecting the sample data whose ranking labelling function value meets the preset condition, the training unit is configured to:

Optionally, before the sample data whose ranking labelling function value meets the preset condition is selected, the training unit is further configured to perform any one or a combination of the following operations:

Selecting sample data that meets a preset data scale;

Optionally, when carrying out model training with a preset algorithm to generate the corresponding multimedia file search ranking model, the training unit is configured to:

Dividing the sample data into a training set and a test set;

A training device for a multimedia file search ranking model, comprising at least a processor and a memory, wherein

The processor is configured to read the program in the memory and execute the following process:

Optionally, when calculating the ranking labelling function value of one search result, the processor is configured to:

Optionally, before the ranking labelling function value of the search result is calculated based on the cumulative click count of each multimedia file, the interval click-count change rate, and the degree of topic-type association, the processor is further configured to:

Optionally, when selecting the sample data whose ranking labelling function value meets the preset condition, the processor is configured to:

Optionally, before the sample data whose ranking labelling function value meets the preset condition is selected, the processor is further configured to perform any one or a combination of the following operations:

Selecting sample data that meets a preset data scale;

Optionally, when carrying out model training with a preset algorithm to generate the corresponding multimedia file search ranking model, the processor is configured to:

Dividing the sample data into a training set and a test set;

A storage medium storing a program for training a multimedia file search ranking model, wherein the program, when run by a processor, executes the following steps:

In the embodiments of the present invention, after keywords and their corresponding search results are determined, the ranking labelling function value of the search result corresponding to each keyword is calculated and corresponding sample data is generated, wherein the ranking labelling function value is positively correlated with the number of times the multimedia files were clicked and with the degree of topic-type association between the multimedia files. The sample data whose ranking labelling function value meets a preset condition is then selected, and model training is carried out with a preset algorithm to generate the corresponding multimedia file search ranking model. In this way, the sample data can be labelled effectively, so that sample data in which the keyword and the search result share a consistent search intent can be selected. This improves the quality of the sample data, in turn improves the ranking accuracy of the trained multimedia file search ranking model, and addresses the ranking optimization problem for search results.

Brief description of the drawings

Fig. 1 is a flow diagram of training the multimedia file search ranking model in an embodiment of the present invention;

Fig. 2 is a schematic diagram of the functional structure of the training device for the multimedia file search ranking model in an embodiment of the present invention;

Fig. 3 is a schematic diagram of the physical structure of the training device for the multimedia file search ranking model in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

As shown in Fig. 1, the detailed process of establishing the multimedia file search ranking model in an embodiment of the present invention is as follows:

Step 100: Determine the set keywords, and obtain the search result corresponding to each keyword, wherein one search result contains at least one multimedia file corresponding to the respective keyword.

Step 110: Calculate the ranking labelling function value of the search result corresponding to each keyword, wherein the ranking labelling function value of one search result is positively correlated, at least, with the number of times the multimedia files contained in the search result were clicked and with the degree of topic-type association between those multimedia files.

Specifically, taking a search result x as an example, calculating the ranking labelling function value of search result x includes, but is not limited to, the following operations:

Step A: Count the cumulative click count of each multimedia file in search result x.

To ensure that the sample data is labelled reasonably and accurately, the cumulative click count of each multimedia file in search result x can be counted. The cumulative click count is the total number of times a multimedia file has been clicked between its release and the time of counting.

For example, suppose the keyword set obtained is Q = {q_1, …, q_m}, and each keyword q_i (1 ≤ i ≤ m) corresponds to one search result containing a film list [formula not reproduced in the source text]. The record of a single user click is represented as [formula not reproduced in the source text], in which the clicked film is set to 1 and all other films are set to 0, as shown in Table 1. Then, in the search result obtained for keyword q_i, the cumulative click count of each film (i.e., multimedia file) since its release is represented as [formula not reproduced in the source text]. The higher the cumulative click count, the stronger the search intent linking keyword q_i to the search result, i.e., the stronger the association.

Table 1 (qi):
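The counting in Step A can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `cumulative_clicks`, the film names, and the list-of-lists layout of the one-hot click records are assumptions made for the example.

```python
from typing import Dict, List


def cumulative_clicks(films: List[str], click_records: List[List[int]]) -> Dict[str, int]:
    """Sum one-hot click records column by column.

    Each record has one entry per film: 1 for the clicked film, 0 otherwise.
    """
    totals = [0] * len(films)
    for record in click_records:
        if len(record) != len(films):
            raise ValueError("each click record needs one entry per film")
        for j, clicked in enumerate(record):
            totals[j] += clicked
    return dict(zip(films, totals))


# Hypothetical search result for one keyword, with three recorded clicks.
films = ["film_a", "film_b", "film_c"]
records = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
clicks = cumulative_clicks(films, records)
```

Films that end with a count of zero, such as `film_c` here, are the ones later treated as negative examples.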

Step B: Count the interval click-count change rate, within a specified duration, of each multimedia file contained in search result x.

In practice, the cumulative click count of a multimedia file cannot reflect how the search results have been trending over the most recent period. The interval click-count change rate of the multimedia file within a specified duration can therefore be used to measure how its clicks are changing in the current period. For example, for a newly released film, the interval click-count change rate makes it possible to determine the film's rank in the search results quickly, so that the film is not ranked low merely because its short time since release has left it with too few cumulative clicks.

Optionally, the interval click-count change rate can be expressed by the following formula, where Δt denotes the specified duration and T denotes the start time of the interval: [formula not reproduced in the source text]
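Since the formula itself is not reproduced in this text, the sketch below assumes one plausible reading: the number of clicks falling in the interval [T, T + Δt) divided by Δt. The function name `interval_click_rate` and the representation of clicks as a list of timestamps are assumptions for illustration only.

```python
from typing import Sequence


def interval_click_rate(click_times: Sequence[float], start: float, duration: float) -> float:
    """Clicks per unit time inside [start, start + duration).

    Assumed definition: (clicks in the window) / duration.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    in_window = sum(1 for t in click_times if start <= t < start + duration)
    return in_window / duration


# Hypothetical click timestamps (in days since release) for one film.
rate = interval_click_rate([1, 2, 3, 10, 11], start=0, duration=5)
```

Under this reading a newly released film with few cumulative clicks can still score well, provided its clicks are concentrated in the recent window.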

Step C: Determine the topic-type distribution probability vector of each multimedia file contained in search result x, and calculate the degree of topic-type association between the multimedia files based on the topic-type distribution probability vectors.

Optionally, the Latent Dirichlet Allocation (LDA) topic clustering algorithm can be used to calculate the topic distribution probability vector of each multimedia file: [formula not reproduced in the source text]

Here, Topic denotes a topic, P(Topic_1) denotes the probability of topic 1, and S denotes the number of topics. The topic distribution probability vector reflects how strongly a multimedia file leans towards each topic, and can be used to measure the degree of topic-type association between two multimedia files. Therefore, the multimedia files in the search result whose cumulative click count is not 0 can be sorted by click count from high to low, and the cosine similarity between the topic distribution probability vector of each multimedia file and that of the multimedia file ranked before it can be calculated, which gives the degree of topic-type association between pairs of multimedia files. Specifically, the calculation formula is as follows: [formula not reproduced in the source text]
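The cosine-similarity step can be illustrated with a short Python sketch. It assumes the topic distribution vectors have already been produced (for example by an LDA model, which is not shown), and it reads "the multimedia file ranked before it" as the immediately preceding file in the click-sorted list. The names `cosine_similarity` and `adjacent_topic_similarity` and the tuple layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def adjacent_topic_similarity(
    files: List[Tuple[str, int, Sequence[float]]]
) -> List[Tuple[str, str, float]]:
    """files: (name, cumulative_clicks, topic_vector).

    Keeps files with non-zero clicks, sorts them by clicks (high to low),
    and compares each file with the one ranked immediately before it.
    """
    clicked = sorted((f for f in files if f[1] > 0), key=lambda f: f[1], reverse=True)
    return [
        (prev[0], cur[0], cosine_similarity(prev[2], cur[2]))
        for prev, cur in zip(clicked, clicked[1:])
    ]


# Hypothetical two-topic vectors.
pairs = adjacent_topic_similarity([
    ("a", 10, [1.0, 0.0]),
    ("b", 5, [1.0, 0.0]),
    ("c", 0, [0.0, 1.0]),   # zero clicks: excluded
    ("d", 2, [0.0, 1.0]),
])
```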

Step D: Calculate the ranking labelling function value of search result x based on the cumulative click count of each multimedia file, the interval click-count change rate, and the degree of topic-type association.

Specifically, the ranking labelling function value of search result x can be calculated with the following formula: [formula not reproduced in the source text]

Here, α and β are scale factors that keep the components at the same order of magnitude. In addition, to smooth the skewed distribution caused by the long-tail effect, the ranking labelling function value can be passed through a logarithm and rounded down to optimize the sample labelling. The final sample data output can then be expressed as: [formula not reproduced in the source text]
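The exact labelling formula is not reproduced in this text, so the following Python sketch rests on two explicit assumptions: that the three components are combined additively, with α and β scaling the change rate and the topic similarity, and that the compression step is the natural logarithm of one plus the raw value, rounded down. The function name `ranking_label` and the default factor values are hypothetical.

```python
import math


def ranking_label(cum_clicks: float, click_rate: float, topic_sim: float,
                  alpha: float = 1.0, beta: float = 1.0) -> int:
    """Illustrative label: floor(log(1 + clicks + alpha*rate + beta*sim)).

    The additive combination and the log1p/floor compression are assumptions;
    they only preserve the properties stated in the text (positive correlation
    with each component, scale factors, logarithm plus rounding down).
    """
    raw = cum_clicks + alpha * click_rate + beta * topic_sim
    return math.floor(math.log1p(max(raw, 0.0)))


# Hypothetical values: 100 clicks, 5 clicks/day recently, similarity 0.9.
label = ranking_label(100, 5, 0.9, alpha=2.0, beta=10.0)
```

The logarithm keeps a handful of extremely popular films from dominating the label range, which is the stated purpose of the compression step.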

The above expression represents the search result obtained after keyword q_i is entered; that is, keyword q_i, its search result, and the corresponding ranking labelling function value together constitute one labelled sample data item.

In this way, the sample data can be labelled separately for each keyword, yielding all the labelled sample data needed for model training.

On the other hand, in the embodiments of the present invention, a multimedia file whose cumulative click count is still 0 some time after release can be regarded as completely unrelated to the corresponding keyword and to the user's search intent, and as contributing nothing to the ranking of multimedia files. Such multimedia files can therefore be treated as the negative examples contained in the search result, and multimedia files whose cumulative click count is not 0 as the positive examples. Because search clicks follow a long-tail distribution, the number of negative examples in a search result is usually far greater than the number of positive examples, even after a long accumulation period (i.e., long after the multimedia files were released). In the embodiments of the present invention, some of the negative examples in the search result can therefore be replaced on the basis of the topic relevance of the multimedia files, so as to balance the ratio of positive to negative examples in the search result and further improve the quality of the sample data. Introducing new positive examples also increases the diversity of the recalled search results, enlarges the ranking space, and improves the performance of the ranking algorithm, which further improves the ranking accuracy of the subsequently trained multimedia file search ranking model.

Optionally, still taking the search result x corresponding to keyword q_i as an example, the replacement steps performed before the ranking labelling function value of search result x is calculated are as follows:

First, take the multimedia files in search result x whose cumulative click count is not zero as positive examples, and take the multimedia files in search result x whose cumulative click count is zero as negative examples.

Second, for each positive example, determine the M other multimedia files with the highest degree of topic-type association, wherein the other multimedia files do not correspond to search result x or keyword q_i, and M is a preset natural number.

Optionally, the LDA topic clustering algorithm can again be used: by calculating the topic distribution probability vectors, the set of other multimedia files most strongly topic-related to a positive example is obtained: [formula not reproduced in the source text], where r_l denotes one of the other multimedia files and [symbol not reproduced in the source text] denotes the cosine similarity between the two files' topic distribution probability vectors.

For example, suppose keyword q_i is "superman", the positive examples obtained are "Superman 1" and the like, the negative examples obtained are "Transforming Superman", "Prehistoric Superman", and the like, and the ratio of positive to negative examples is 1:3. Then the 2 other multimedia files (assuming M = 2) with the highest degree of topic-type association with "Superman 1" are "Justice League" and "Man of Steel".

This example uses only one positive example. In practice, the M other multimedia files with the highest degree of topic-type association are selected for every positive example. For instance, if there are 5 positive examples, a total of 5M other multimedia files are obtained, and the positive examples used to replace negative examples are chosen from them.

Third, choose a set number of the other multimedia files thus obtained as positive examples to replace the negative examples, so that the ratio of positive examples to negative examples after replacement reaches a set ratio threshold.

For example, the 2 other multimedia files (assuming M = 2) with the highest degree of topic-type association with "Superman 1", namely "Justice League" and "Man of Steel", can be used as positive examples to replace the negative examples "Transforming Superman" and "Prehistoric Superman", so that the final ratio of positive to negative examples is approximately 1:1 or 2:1.
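The replacement procedure can be sketched as a simple loop in Python. The sketch assumes the candidate files are already sorted by topic similarity (most related first) and that negatives are dropped from the end of the list; neither detail is specified in the text. The function name `rebalance` and the third negative title are hypothetical.

```python
from typing import List, Tuple


def rebalance(positives: List[str], negatives: List[str],
              candidates: List[str], target_ratio: float = 1.0) -> Tuple[List[str], List[str]]:
    """Swap negatives for topic-related candidates until
    len(positives) / len(negatives) reaches target_ratio.

    candidates: other files, most topic-related first (assumed pre-sorted).
    """
    pos, neg = list(positives), list(negatives)
    # Candidates must not already be part of the search result.
    pool = [c for c in candidates if c not in pos and c not in neg]
    while neg and pool and len(pos) / len(neg) < target_ratio:
        neg.pop()                 # drop one negative example
        pos.append(pool.pop(0))   # add the most related remaining candidate
    return pos, neg


pos, neg = rebalance(
    ["Superman 1"],
    ["Transforming Superman", "Prehistoric Superman", "Other Superman"],
    ["Justice League", "Man of Steel"],
    target_ratio=1.0,
)
```

With a 1:3 starting ratio and a 1:1 target, a single swap is enough here; a higher `target_ratio` would consume more candidates.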

Of course, when the search engine builds its index, several topic-related sets [formula not reproduced in the source text] can be generated and saved in advance. In this way, whichever multimedia file becomes a positive example after searches have accumulated over some period, the other multimedia files needed can be fetched at any time from the corresponding topic-related set [formula not reproduced in the source text] and used as positive examples to replace negative examples.

After some of the negative examples have been replaced, the newly added positive examples may not yet have a cumulative click count or an interval click-count change rate. Their initial values can therefore be labelled and ranked on the basis of the cosine similarity of the topic distribution probability vectors. After a period of accumulation, the corresponding cumulative click count, interval click-count change rate, and degree of topic-type association are calculated; finally, together with the original positive examples, they are used to calculate the ranking labelling function value of search result x, which is converted into labelled sample data.

Further, if the cumulative click count of a newly added positive example is still 0 after a fixed accumulation period, it is treated as a negative example and replaced in the same way, which is not described again here.

Step 120: Select the sample data whose ranking labelling function value meets a preset condition, and carry out model training with a preset algorithm to generate the corresponding multimedia file search ranking model.

Specifically, once the large volume of labelled sample data has been obtained, the best possible sample data can be selected for the subsequent model training. Optionally, the preset condition may be: select the N sample data items with the highest ranking labelling function values (N being a set natural number). The preset condition may also be: select the sample data whose ranking labelling function value reaches a set parameter threshold.

This is because the higher the ranking labelling function value, the clearer the correspondence in search intent between the keyword and the corresponding search result. Sample data with a higher ranking labelling function value is therefore of higher quality and does more to improve the ranking accuracy of the subsequently trained multimedia file search ranking model.
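The two selection rules can be sketched in Python as follows. The sample layout (a dict with a `label` field holding the labelling function value) and the function names are assumptions for illustration.

```python
from typing import Dict, List


def select_top_n(samples: List[Dict], n: int) -> List[Dict]:
    """Keep the n samples with the highest label value."""
    return sorted(samples, key=lambda s: s["label"], reverse=True)[:n]


def select_above(samples: List[Dict], threshold: float) -> List[Dict]:
    """Keep the samples whose label value reaches the threshold."""
    return [s for s in samples if s["label"] >= threshold]


# Hypothetical labelled samples, one per keyword.
samples = [
    {"keyword": "superman", "label": 6},
    {"keyword": "cr", "label": 1},
    {"keyword": "justice league", "label": 4},
]
top2 = select_top_n(samples, 2)
above4 = select_above(samples, 4)
```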

Further, besides screening the sample data by ranking labelling function value, in the embodiments of the present invention the sample data can also be pre-screened, before the final high-quality sample data is obtained on the basis of the ranking labelling function value, in any one or any combination of the following ways:

In a big-data environment, a very large sample data set strengthens the generalization ability of the trained model, so the more sample data and the longer the accumulation period, the better. The ways that can be used therefore include, but are not limited to:

Mode 1: Select sample data that meets a preset data scale.

For example: the total number of sample data items selected must be ≥ 1,000,000.

As another example: the number of sample data items selected for the test sample set must be ≥ 200,000.

Mode 2: Select sample data for which the number of multimedia files contained in the keyword's search result reaches a set threshold.

For example: select sample data whose search result contains ≥ 60 multimedia files.

Mode 3: Select sample data for which the time since release of the multimedia files contained in the keyword's search result reaches a set duration threshold.

For example: select sample data for which the multimedia files contained in the search result have been released for ≥ 15 days.

Mode 4: Delete sample data whose keyword is a single digit and/or a single letter.

Search logs often contain searches for a single digit or a single letter. The purpose of such searches is unclear, yet their cumulative click counts are high and the returned search results have a high hit rate. This is typical bad sample data and needs to be removed to ensure consistent search intent between keyword and search result.
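Modes 2 to 4 can be combined into a single pre-screening pass, sketched below in Python. The defaults of 60 files and 15 days come from the examples above; the sample layout (a `keyword` string plus a `release_days` list with one entry per multimedia file), the requirement that every file meet the duration threshold, and the function names are assumptions for illustration.

```python
from typing import Dict, List


def is_single_char_keyword(keyword: str) -> bool:
    """True for a keyword that is one ASCII digit or one ASCII letter."""
    k = keyword.strip()
    return len(k) == 1 and k.isascii() and k.isalnum()


def prescreen(samples: List[Dict], min_files: int = 60, min_days: int = 15) -> List[Dict]:
    """Apply modes 2-4: enough files, released long enough, no 1-char keywords."""
    kept = []
    for s in samples:
        days = s["release_days"]          # days since release, one per file
        if is_single_char_keyword(s["keyword"]):
            continue                      # mode 4
        if len(days) < min_files:
            continue                      # mode 2
        if days and min(days) < min_days:
            continue                      # mode 3 (assumed: every file qualifies)
        kept.append(s)
    return kept


# Small thresholds so the toy data is easy to read.
toy = [
    {"keyword": "superman", "release_days": [30, 20]},
    {"keyword": "a", "release_days": [30, 20]},
    {"keyword": "batman", "release_days": [30]},
    {"keyword": "hulk", "release_days": [30, 3]},
]
kept = prescreen(toy, min_files=2, min_days=15)
```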

Further, in the embodiments of the present invention, after the required high-quality sample data has been selected with the above screening strategy, model training can be carried out with a preset algorithm to generate the corresponding multimedia file search ranking model, which specifically includes:

1) Determine the associated features of each sample data item, wherein the associated features of one sample data item include at least keyword features, relevance features between the keyword and the multimedia files contained in the corresponding search result, attribute features of the multimedia files contained in the search result, and relevance features between the multimedia files contained in the search result.

Each of the above feature types can be computed during search: the features can be extracted when the search result for a keyword is obtained, during positive/negative example replacement, while the ranking labelling function value is being calculated, or after it has been calculated. After extraction they can be saved in a log file and read from it when the model is trained, which is not described in detail here.

Specifically, as shown in Table 2, the associated features of a sample data item can be expressed as a feature vector, denoted f = (f_q, f_q-d, f_d, f_d-d);

Here, as shown in Table 2, f_q denotes the keyword features, consisting at least of the length of the keyword and the number of times the keyword has appeared in historical searches.

f_q-d denotes the relevance features between the keyword and the multimedia files contained in the corresponding search result, consisting at least of the proportion of the multimedia file title taken up by the keyword, the offset of the keyword within the multimedia file title, the text similarity (BM25) between the keyword and the multimedia file, and the term frequency-inverse document frequency (TF-IDF) of the keyword with respect to the multimedia file.

f_d denotes the attribute features of the multimedia file, covering at least the multimedia file's measurements along several dimensions.

f_d-d denotes the relevance features between multimedia files, covering at least the topic distribution probability similarity between adjacent multimedia files after the multimedia files in the search result have been sorted by cumulative click count.

Table 2
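As an illustration of the feature definitions above, the following is a minimal sketch of how parts of f_q, f_q-d, and f_d-d might be computed. It is not the method of the embodiment: the function names, the dictionary keys `clicks` and `topics`, the use of cosine similarity as the similarity measure, and the descending sort order are all assumptions introduced here. BM25 and TF-IDF are left out because they require corpus-level statistics.

```python
import math

def keyword_features(keyword, history_counts):
    """f_q: keyword length and number of appearances in historical searches."""
    return [float(len(keyword)), float(history_counts.get(keyword, 0))]

def keyword_title_features(keyword, title):
    """Part of f_q-d: share of the title occupied by the keyword, and its offset."""
    offset = title.find(keyword)          # -1 when the keyword is absent
    if offset < 0 or not title:
        return [0.0, -1.0]
    return [len(keyword) / len(title), float(offset)]

def cosine(p, q):
    """Cosine similarity of two topic distributions (an assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def adjacent_topic_similarity(files):
    """f_d-d: sort by cumulative click count (descending), compare neighbours."""
    ranked = sorted(files, key=lambda f: f["clicks"], reverse=True)
    return [cosine(a["topics"], b["topics"]) for a, b in zip(ranked, ranked[1:])]
```

For a search result with three files, `adjacent_topic_similarity` returns two values, one for each pair of neighbours in the click-sorted order.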

2) Convert the sample data into a specified format.

Optionally, the sample data may be converted into the LabeledPoint format (a basic data type of MLlib) used for ranking learning in Spark MLlib (Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing; MLlib is Spark's machine learning (ML) library).
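One possible way to prepare such data, sketched here under assumptions, is to write each sample as a line of LIBSVM text, which Spark MLlib can read into labelled points via `MLUtils.loadLibSVMFile`. The helper name `to_libsvm_line` is hypothetical, and using the sequence labelling function value as the label is an assumption rather than something the text states. Within PySpark the same pair could instead be wrapped directly as `LabeledPoint(label, features)` from `pyspark.mllib.regression`.

```python
def to_libsvm_line(label, features):
    """Serialise one sample as 'label idx:value ...' with 1-based indices.

    Zero-valued features are skipped because the format is sparse.
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:g}"] + pairs)
```

Feature indices stay in ascending order because they follow the position of each value in the feature vector.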

3) Divide the converted sample data into a training set and a test set.
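A minimal sketch of such a split follows. The text does not say how the split is made; the version below assumes that all samples of one keyword should stay in the same set, so that ranking metrics can later be computed per keyword. The function name, the `keyword` dictionary key, the 20% test ratio, and the seed are hypothetical.

```python
import random

def split_by_keyword(samples, test_ratio=0.2, seed=42):
    """Split samples so that every keyword lands entirely in one of the two sets."""
    keywords = sorted({s["keyword"] for s in samples})
    rng = random.Random(seed)             # fixed seed keeps the split reproducible
    rng.shuffle(keywords)
    # Keep at least one keyword for testing whenever there is more than one.
    n_test = max(1, round(len(keywords) * test_ratio)) if len(keywords) > 1 else 0
    test_keys = set(keywords[:n_test])
    train = [s for s in samples if s["keyword"] not in test_keys]
    test = [s for s in samples if s["keyword"] in test_keys]
    return train, test
```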

4) Based on the training set and the corresponding associated features, use the distributed gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) algorithm to iteratively train a multi-decision-tree model, obtaining a training model.

Specifically, on the basis of the distributed GBDT algorithm of Spark MLlib, Lambda-based multiple additive regression trees (Lambda and Multiple Additive Regression Tree, LambdaMART) may be constructed to carry out the iterative training of the multi-decision-tree model.

Wherein, optionally, the number of algorithm iterations is 300 to 500; the depth of each MART tree is 3 layers; the learning step is 0.05; and the loss function is the L2 loss function: $L = \sum_{i=1}^{N} (y_i - F(x_i))^2$, where $y_i$ is the true value, $F(x_i)$ is the predicted value, and $N$ is the number of sample data items. The smaller the loss value, the better the training model.
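The role of the L2 loss and the learning step can be sketched as follows. The loss is taken here as the plain sum of squared errors, which is an assumption, since some definitions also divide by N. The update is deliberately simplified: a real GBDT iteration fits a regression tree (of depth 3 in the text) to the residuals, whereas this sketch lets the residuals stand in for that tree's output. Both function names are hypothetical.

```python
def l2_loss(labels, predictions):
    """Sum of squared differences between true values and predicted values."""
    return sum((y - f) ** 2 for y, f in zip(labels, predictions))

def boosting_step(labels, predictions, learning_step=0.05):
    """One simplified boosting update.

    The residuals (true value minus current prediction) replace the output of
    the regression tree that a real iteration would fit to them.
    """
    residuals = [y - f for y, f in zip(labels, predictions)]
    return [f + learning_step * r for f, r in zip(predictions, residuals)]
```

With a learning step of 0.05, each update moves every prediction 5% of the way toward its true value, which is why several hundred iterations are a plausible budget.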

5) Based on the test set and the corresponding associated features, test the generated training model and generate the corresponding evaluation indices.

Optionally, goodness of fit and NDCG may be used as evaluation indices, wherein:

Goodness of fit: the root mean squared error (Root Mean Squared Error, RMSE) is used to evaluate the goodness of fit of the LambdaMART algorithm on the test set;

Normalized discounted cumulative gain (Normalized Discounted Cumulative Gain, NDCG): NDCG is used to evaluate the ranking effect of the training model.
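The two evaluation indices can be sketched as below for a single keyword. The text does not fix the NDCG variant; this sketch assumes the common form with gain 2^rel - 1 and a log2 position discount, and the cutoff k = 10 is a hypothetical default.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between true values and predicted values."""
    n = len(labels)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(labels, predictions)) / n)

def dcg(relevances, k):
    """Discounted cumulative gain of the first k items in the given order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg(labels, predictions, k=10):
    """NDCG@k for one keyword: rank by predicted score, gain from true label."""
    order = sorted(range(len(labels)), key=lambda i: predictions[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(ranked, k) / ideal if ideal > 0 else 0.0
```

An NDCG of 1.0 means the predicted order matches the ideal order of the labels; over a test set the per-keyword values would typically be averaged.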

6) Adjust the training parameters based on the obtained evaluation indices to optimize the training model, and obtain the optimal multimedia file search ranking model through repeated training.
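The adjust-and-retrain loop of this step could, for example, take the form of a small grid search. The sketch below is an assumption about how the adjustment might be organised: `train_and_evaluate` is a hypothetical callback that trains with the given parameters and returns (NDCG, RMSE) on the test set, and apart from the values quoted in the text (300 to 500 iterations, depth 3, step 0.05) the grid values are illustrative.

```python
from itertools import product

def tune(train_and_evaluate, iterations=(300, 400, 500), depths=(3,),
         steps=(0.05, 0.1)):
    """Try every parameter combination and keep the one with the best NDCG.

    Ties on NDCG are broken in favour of the lower RMSE.
    """
    best = None
    for n_iter, depth, step in product(iterations, depths, steps):
        ndcg_value, rmse_value = train_and_evaluate(n_iter, depth, step)
        params = {"iterations": n_iter, "depth": depth, "step": step}
        candidate = (ndcg_value, -rmse_value, params)
        if best is None or candidate[:2] > best[:2]:
            best = candidate
    return best[2]
```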

Optionally, the underlying tree structure of Spark MLlib (DecisionTreeRegressionModel) may be used to convert the optimal multimedia file search ranking model into XML format, making it easy for the ElasticSearch search engine (a Lucene-based search server that provides a distributed, multi-user-capable full-text search engine) to load.
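The conversion to XML can be pictured with the following sketch. The XML layout (`ensemble`, `tree`, `split`, `leaf` elements) and the nested-dictionary representation of a tree are hypothetical stand-ins chosen for illustration; the text does not specify the actual schema or how the Spark tree structure is traversed.

```python
import xml.etree.ElementTree as ET

def node_to_xml(node):
    """Convert one tree node (a nested dict) into an XML element, recursively."""
    if "value" in node:                       # leaf: carries the prediction
        return ET.Element("leaf", value=str(node["value"]))
    split = ET.Element("split", feature=str(node["feature"]),
                       threshold=str(node["threshold"]))
    split.append(node_to_xml(node["left"]))   # branch for feature <= threshold
    split.append(node_to_xml(node["right"]))  # branch for feature > threshold
    return split

def ensemble_to_xml(trees, weights):
    """Serialise a weighted list of trees as one XML string."""
    root = ET.Element("ensemble")
    for tree, weight in zip(trees, weights):
        wrapper = ET.SubElement(root, "tree", weight=str(weight))
        wrapper.append(node_to_xml(tree))
    return ET.tostring(root, encoding="unicode")
```

Each `tree` element carries its weight so that a consumer can rebuild the additive prediction as the weighted sum of the leaf values reached in every tree.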

The obtained optimal multimedia file search ranking model is stored as an XML file in the Hadoop Distributed File System (HDFS). Then, when the relevant ElasticSearch index is updated, the optimal multimedia file search ranking model is written into a specified index field. Finally, a learning-to-rank (Learning to Rank, LTR) model call is added to the search statement. In this way, the optimal multimedia file search ranking model can be used to search for a newly input keyword and obtain the corresponding search result, which contains the currently most accurate multimedia files and presents the currently most accurate ranking among them.
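What adding an LTR model call to the search statement might look like is sketched below. This is an assumption: the text does not name the mechanism, and the sketch uses the `sltr` rescore query of the open-source Elasticsearch Learning to Rank plugin, which may differ from the index-field approach of the embodiment. The field name `title`, the parameter name `keywords`, the model name, and the window size are all hypothetical.

```python
import json

def build_ltr_search(keyword, model_name, window_size=100):
    """Build a search body that matches on title, then rescores with an LTR model."""
    return {
        "query": {"match": {"title": keyword}},
        "rescore": {
            "window_size": window_size,       # only the top hits are re-ranked
            "query": {
                "rescore_query": {
                    "sltr": {"params": {"keywords": keyword}, "model": model_name}
                }
            },
        },
    }

def to_request_body(keyword, model_name):
    """Serialise the search body as the JSON text sent to the search engine."""
    return json.dumps(build_ltr_search(keyword, model_name))
```

Rescoring only a window of top hits keeps the cost of evaluating several hundred trees bounded per query.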

Of course, the above multimedia file search ranking model (i.e., the LTR model) needs to be updated according to a set period: for example, sample data is reselected and replaced according to the set period and model training is carried out again, to ensure the timeliness of the multimedia file search ranking model.

Obviously, the multimedia file search ranking model obtained based on the associated features of the sample data can accurately guarantee consistency of search intention between a keyword and its search result, effectively improves the accuracy of the search result, and improves the ranking accuracy of the multimedia files included in the search result; that is, a learning-to-rank algorithm is effectively used to optimize the ranking results of multimedia file search.

Based on the above embodiments, referring to Fig. 2, an embodiment of the present invention provides a training device for a multimedia file search ranking model, the training device including at least:

Acquiring unit 20, configured to determine keywords and obtain the search result corresponding to each keyword, wherein one search result includes at least one multimedia file corresponding to the respective keyword;

Processing unit 21, configured to calculate the sequence labelling function value of the search result corresponding to each keyword, and to generate corresponding sample data based on the sequence labelling function value of each search result; wherein the sequence labelling function value of one search result is positively correlated at least with the click counts of the multimedia files included in the search result and with the topic type correlation degree among the multimedia files;

Training unit 22, configured to filter out the sample data whose sequence labelling function value meets a preset condition, and to carry out model training using a preset algorithm to generate the corresponding multimedia file search ranking model.

Optionally, when calculating the sequence labelling function value of one search result, the processing unit 21 is configured to:

Optionally, before calculating the sequence labelling function value of the one search result based on the cumulative click count of each multimedia file, the interval click-count change rate, and the topic type correlation degree, the processing unit 21 is further configured to:

Optionally, when filtering out the sample data whose sequence labelling function value meets the preset condition, the training unit 22 is configured to:

Optionally, before filtering out the sample data whose sequence labelling function value meets the preset condition, the training unit 22 is further configured to perform any one or a combination of the following operations:

Filter out the sample data that meets a preset data scale;

Optionally, when carrying out model training using the preset algorithm to generate the corresponding multimedia file search ranking model, the training unit 22 is configured to:

Divide the sample data into a training set and a test set;

Based on the above embodiments, referring to Fig. 3, an embodiment of the present invention provides a training device for a multimedia file search ranking model, the training device including at least:

Processor 300, configured to read the program in the memory 310 and execute the following process:

In Fig. 3, the bus architecture may include any number of interconnected buses and bridges, which specifically link together the various circuits of the one or more processors represented by the processor 300 and of the memory represented by the memory 310. The bus architecture may also link together various other circuits, such as peripheral devices, voltage regulators, and power management circuits; these are all well known in the art and are therefore not described further herein. The bus interface provides an interface. The transceiver may be a plurality of elements, that is, it may include a transmitter and a receiver, and provides a unit for communicating with various other devices over a transmission medium. For different user equipment, the user interface may also be an interface capable of connecting required devices externally or internally; the connected devices include but are not limited to a keypad, a display, a loudspeaker, a microphone, a joystick, and the like.

The processor 300 is responsible for managing the bus architecture and for general processing, and the memory 310 may store the data used by the processor 300 when performing operations.

Optionally, when calculating the sequence labelling function value of one search result, the processor 300 is configured to:

Optionally, before calculating the sequence labelling function value of the one search result based on the cumulative click count of each multimedia file, the interval click-count change rate, and the topic type correlation degree, the processor 300 is further configured to:

Optionally, when filtering out the sample data whose sequence labelling function value meets the preset condition, the processor 300 is configured to:

Optionally, before filtering out the sample data whose sequence labelling function value meets the preset condition, the processor 300 is further configured to perform any one or a combination of the following operations:

Filter out the sample data that meets a preset data scale;

Optionally, when carrying out model training using the preset algorithm to generate the corresponding multimedia file search ranking model, the processor 300 is configured to:

Divide the sample data into a training set and a test set;

Based on the same inventive concept, a storage medium is provided, which stores a program for implementing the training of a multimedia file search ranking model. When the program is run by a processor, the following steps are executed:

Based on the above embodiments, in the embodiment of the present invention, after the keywords and the corresponding search results are determined, the sequence labelling function value of the search result corresponding to each keyword is calculated and corresponding sample data is generated, wherein the sequence labelling function value is positively correlated with the click counts of the multimedia files and with the topic type correlation degree among the multimedia files; the sample data whose sequence labelling function value meets a preset condition is then filtered out, and model training is carried out using a preset algorithm to generate the corresponding multimedia file search ranking model. In this way, the labelling of the sample data can be completed effectively, so that sample data in which the search intention of the keyword is consistent with that of the search result is filtered out. This improves the quality of the sample data, which in turn effectively improves the ranking accuracy of the trained multimedia file search ranking model and solves the problem of optimizing the ranking of search results.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include these modifications and variations.

Claims

1. A training method for a multimedia file search ranking model, characterized by comprising:

determining keywords, and obtaining the search result corresponding to each keyword, wherein one search result includes at least one multimedia file corresponding to the respective keyword;

calculating the sequence labelling function value of the search result corresponding to each keyword, and generating corresponding sample data based on the sequence labelling function value of each search result; wherein the sequence labelling function value of one search result is positively correlated at least with the click counts of the multimedia files included in the search result and with the topic type correlation degree among the multimedia files;

filtering out the sample data whose sequence labelling function value meets a preset condition, and carrying out model training using a preset algorithm to generate the corresponding multimedia file search ranking model.

2. The method according to claim 1, characterized in that calculating the sequence labelling function value of one search result comprises:

counting the interval click-count change rate, within a specified duration, of each multimedia file included in the one search result;

counting the topic type distribution probability vector of each multimedia file in the one search result, and calculating the topic type correlation degree among the multimedia files based on the topic type distribution probability vectors;

calculating the sequence labelling function value of the one search result based on the cumulative click count of each multimedia file, the interval click-count change rate, and the topic type correlation degree.

3. The method according to claim 2, characterized in that, before calculating the sequence labelling function value of the one search result based on the cumulative click count of each multimedia file, the interval click-count change rate, and the topic type correlation degree, the method further comprises:

taking the multimedia files included in the one search result whose cumulative click count is not zero as positive examples, and taking the multimedia files included in the one search result whose cumulative click count is zero as negative examples;

for each positive example, determining the M other multimedia files with the highest topic type correlation degree, the other multimedia files not being included in the one search result;

selecting a set number of other multimedia files from the obtained other multimedia files as positive examples and using them to replace the negative examples, so that the ratio of positive examples to negative examples after replacement reaches a set ratio threshold.

4. The method according to claim 1, characterized in that filtering out the sample data whose sequence labelling function value meets the preset condition comprises:

5. The method according to claim 4, characterized in that, before filtering out the sample data whose sequence labelling function value meets the preset condition, any one or a combination of the following operations is further performed:

filtering out the sample data that meets a preset data scale;

filtering out the sample data for which the release time of the multimedia files included in the search result corresponding to the keyword reaches a set duration threshold;

6. The method according to any one of claims 1 to 5, characterized in that carrying out model training using the preset algorithm to generate the corresponding multimedia file search ranking model comprises:

determining the associated features of each sample data item, the associated features of one sample data item including at least the keyword feature, the relevance feature between the keyword and the multimedia files included in the corresponding search result, the attribute feature of the multimedia files included in the search result, and the relevance feature among the multimedia files included in the search result;

dividing the sample data into a training set and a test set;

based on the training set and the corresponding associated features, using the distributed gradient boosting decision tree (GBDT) algorithm to iteratively train a multi-decision-tree model, obtaining a corresponding training model;

based on the test set and the corresponding associated features, testing the generated training model and generating corresponding evaluation indices;

adjusting the training parameters based on the obtained evaluation indices to optimize the training model, and obtaining the final multimedia file search ranking model through repeated training.

7. A training device for a multimedia file search ranking model, characterized by comprising:

an acquiring unit, configured to determine keywords and obtain the search result corresponding to each keyword, wherein one search result includes at least one multimedia file corresponding to the respective keyword;

a processing unit, configured to calculate the sequence labelling function value of the search result corresponding to each keyword, and to generate corresponding sample data based on the sequence labelling function value of each search result; wherein the sequence labelling function value of one search result is positively correlated at least with the click counts of the multimedia files included in the search result and with the topic type correlation degree among the multimedia files;

a training unit, configured to filter out the sample data whose sequence labelling function value meets a preset condition, and to carry out model training using a preset algorithm to generate the corresponding multimedia file search ranking model.

8. A training device for a multimedia file search ranking model, characterized by comprising at least a processor and a memory, wherein:

the processor is configured to read the program in the memory and execute the following process:

9. A storage medium, characterized in that it stores a program for implementing the training of a multimedia file search ranking model, and when the program is run by a processor, the following steps are executed: