CN105608071A - Generation method and device for determining machine learning algorithm of head word - Google Patents

Generation method and device for determining machine learning algorithm of head word

Info

Publication number
CN105608071A
Authority
CN
China
Prior art keywords
historical search
machine learning
learning algorithm
word
centre word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510965343.6A
Other languages
Chinese (zh)
Inventor
刘鎏
伍兆盖
肖峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201510965343.6A priority Critical patent/CN105608071A/en
Publication of CN105608071A publication Critical patent/CN105608071A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and device for generating a machine learning algorithm for determining a head word. The method comprises: obtaining a plurality of historical search keywords corresponding to the same uniform resource locator (URL); generating a search keyword set corresponding to that URL; processing the historical search keywords in the search keyword set; and performing model training on the processing result to generate the machine learning algorithm. The resulting machine learning algorithm can be used to extract head words. It provides a technical basis for subsequently extracting, in a standardized and objective way, the head words of a very large number of search keywords accurately, and it saves the manpower and time that head word extraction would otherwise cost.

Description

Method and device for generating a machine learning algorithm for determining a centre word
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for generating a machine learning algorithm for determining a centre word.
Background technology
With the rapid development of networks, the internet can provide a solution, or a related solution, to almost every everyday problem, which greatly facilitates people's life and work. When a user issues a query through a web search engine, existing search engines can offer search suggestions related to the search keyword the user has entered, and the user can pick from these suggestions the keyword that best matches the query intention. The existing way of obtaining such suggestions is to extract the centre word of a search keyword by manual pre-labelling, and to recommend to the user, according to the extracted centre word, the search keywords the user is most likely to be interested in. However, manual pre-labelling is only practical for a very small number of search keywords. As search engines have come into general use, the number of search keywords has grown rapidly, and manual pre-labelling cannot meet the demand for extracting centre words from search keywords at that scale. On the one hand, manual labelling cannot extract centre words automatically, and it needs considerable manpower and time, so extraction efficiency is low. On the other hand, because different people judge the same centre word differently, their labels for it differ as well, which may cause the extracted centre word to deviate substantially from what the user is actually searching for.
Therefore, a method is needed that extracts the centre words of search keywords automatically, so that the centre words of a very large number of search keywords can be extracted efficiently and accurately.
Summary of the invention
To overcome the above technical problems, or to solve them at least in part, the following technical solution is proposed:
An embodiment of the present invention proposes a method for generating a machine learning algorithm for determining a centre word, comprising:
Obtaining multiple historical search keywords corresponding to the same URL, and generating a search keyword set corresponding to that URL;
Processing the multiple historical search keywords in the search keyword set, and performing model training on the processing result to generate the machine learning algorithm.
Preferably, obtaining multiple historical search keywords corresponding to the same URL specifically comprises:
Obtaining historical search click records of multiple users;
Extracting, from the historical search click records, the correspondence between historical search keywords and search result items and the correspondence between search result items and URLs;
Obtaining, according to these correspondences, the multiple historical search keywords corresponding to the same URL.
Preferably, extracting from the historical search click records the correspondence between historical search keywords and search result items and the correspondence between search result items and URLs specifically comprises:
Extracting the historical search keywords entered by the respective users in the historical search click records, and the correspondence between each such keyword and the search result item the user clicked after searching with it; and
Extracting the correspondence between each search result item and its URL.
Preferably, processing the multiple historical search keywords in the search keyword set, and performing model training on the processing result to generate the machine learning algorithm, specifically comprises:
Performing word segmentation on the multiple historical search keywords in the search keyword set to obtain a centre word training set comprising multiple word segments;
Performing model training based on the centre word training set to generate the machine learning algorithm.
Preferably, performing word segmentation on the historical search keywords in the search keyword set to obtain a centre word training set comprising multiple word segments comprises:
Performing word segmentation on each historical search keyword in the search keyword set to obtain multiple word segments;
Filtering the multiple word segments, and taking the filtering result as the centre word training set.
Preferably, performing model training based on the centre word training set to generate the machine learning algorithm comprises:
Representing each word segment as a multi-dimensional vector;
Extracting the feature attributes of each word segment in the centre word training set;
Performing classification training, based on the feature attributes, on the word segments represented as vectors, to generate the machine learning algorithm.
Wherein the feature attributes comprise at least one of the following:
Part-of-speech information;
Information on the relation to the corresponding historical search keyword;
TF-IDF;
Special-word information;
Entity-word information.
Another embodiment of the present invention proposes a device for generating a machine learning algorithm for determining a centre word, comprising:
An acquisition and generation module, configured to obtain multiple historical search keywords corresponding to the same URL, and to generate a search keyword set corresponding to that URL;
A processing module, configured to process the multiple historical search keywords in the search keyword set, and to perform model training on the processing result to generate the machine learning algorithm.
Preferably, the acquisition and generation module specifically comprises:
A first acquiring unit, configured to obtain historical search click records of multiple users;
An extraction unit, configured to extract, from the historical search click records, the correspondence between historical search keywords and search result items and the correspondence between search result items and URLs;
A second acquiring unit, configured to obtain, according to these correspondences, the multiple historical search keywords corresponding to the same URL.
Preferably, the extraction unit is specifically configured to:
Extract the historical search keywords entered by the respective users in the historical search click records, and the correspondence between each such keyword and the search result item the user clicked after searching with it; and
Extract the correspondence between each search result item and its URL.
Preferably, the processing module specifically comprises:
A processing unit, configured to perform word segmentation on the multiple historical search keywords in the search keyword set to obtain a centre word training set comprising multiple word segments;
A generation unit, configured to perform model training based on the centre word training set to generate the machine learning algorithm.
Preferably, the processing unit comprises:
A processing subunit, configured to perform word segmentation on each historical search keyword in the search keyword set to obtain multiple word segments;
A filtering subunit, configured to filter the multiple word segments and to take the filtering result as the centre word training set.
Preferably, the generation unit comprises:
A representation subunit, configured to represent each word segment as a multi-dimensional vector;
An extraction subunit, configured to extract the feature attributes of each word segment in the centre word training set;
A generation subunit, configured to perform classification training, based on the feature attributes, on the word segments represented as vectors, to generate the machine learning algorithm.
Wherein the feature attributes comprise at least one of the following:
Part-of-speech information;
Information on the relation to the corresponding historical search keyword;
TF-IDF;
Special-word information;
Entity-word information.
The embodiments of the present invention propose a scheme for generating a machine learning algorithm for determining a centre word. Extracting the centre word corresponding to a search keyword with the machine learning algorithm provides a technical basis for subsequently extracting, in a standardized and objective way, the centre words of a very large number of search keywords accurately, and saves the manpower and time that centre word extraction costs. In addition, the machine learning algorithm generated by the present invention applies classification training in the course of centre word extraction, so the extracted centre words better match the user's real search intention. This avoids the situation in which different people's subjective criteria for judging a centre word cause the extracted centre word to deviate substantially from the user's real search intention, and it further provides a solid technical precondition for extracting the centre words of a very large number of search keywords efficiently and accurately.
Additional aspects and advantages of the present invention are given in part in the following description; they will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of the method for generating a machine learning algorithm for determining a centre word according to one embodiment of the present invention;
Fig. 2 is a flow chart of the method for generating a machine learning algorithm for determining a centre word according to a preferred embodiment of the present invention;
Fig. 3 is an example of a concrete application scenario of the machine learning algorithm for determining a centre word generated according to one embodiment of the present invention;
Fig. 4 is a structural diagram of the device for generating a machine learning algorithm for determining a centre word according to another embodiment of the present invention;
Fig. 5 is a structural diagram of the device for generating a machine learning algorithm for determining a centre word according to another preferred embodiment of the present invention.
Detailed description of the invention
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in this description indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any of the units, and all combinations, of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the field to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as here, are not to be interpreted in an idealized or overly formal sense.
Fig. 1 is a flow chart of the method for generating a machine learning algorithm for determining a centre word according to one embodiment of the present invention.
Step S110: obtain multiple historical search keywords corresponding to the same URL, and generate a search keyword set corresponding to that URL. Step S120: process the multiple historical search keywords in the search keyword set, and perform model training on the processing result to generate the machine learning algorithm.
The embodiments of the present invention propose a scheme for generating a machine learning algorithm for determining a centre word. Extracting the centre word corresponding to a search keyword with the machine learning algorithm provides a technical basis for subsequently extracting, in a standardized and objective way, the centre words of a very large number of search keywords accurately, and saves the manpower and time that centre word extraction costs. In addition, the machine learning algorithm generated by the present invention applies classification training in the course of centre word extraction, so the extracted centre words better match the user's real search intention. This avoids the situation in which different people's subjective criteria for judging a centre word cause the extracted centre word to deviate substantially from the user's real search intention, and it further provides a solid technical precondition for extracting the centre words of a very large number of search keywords efficiently and accurately.
Step S110: obtain multiple historical search keywords corresponding to the same URL, and generate a search keyword set corresponding to that URL.
For example, the multiple historical search keywords corresponding to the same URL, such as http://waimai.baidu.com, are obtained, for instance "take-away food delivery", "take-away is made a reservation", "lunch take-away" and "ordering takeaway", and the search keyword set corresponding to http://waimai.baidu.com is generated, comprising "take-away food delivery", "take-away is made a reservation", "lunch take-away" and "ordering takeaway".
Step S120: process the multiple historical search keywords in the search keyword set, and perform model training on the processing result to generate the machine learning algorithm.
For example, users search with historical search keywords through a search engine, and each keyword returns multiple search result items, from which the user selects one. The historical search keywords for which the same search result item was selected are aggregated. Among the aggregated keywords, the centre word of the keyword with the highest PV (page view count) is labelled automatically and used as the training set for offline training. Each centre word in the training set is then represented as a multi-dimensional vector, and the machine learning algorithm is generated by offline training. When the search engine server later receives online the search keyword "take-away is made a reservation" entered by a user, the machine learning algorithm obtained by offline training extracts its centre words, such as "take-away" and "making a reservation".
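The aggregation and automatic labelling step can be sketched as follows. This is only an illustration under stated assumptions: the function name `highest_pv_keyword_per_result`, the row layout and the page view counts are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def highest_pv_keyword_per_result(click_rows):
    """Aggregate keywords by clicked result and pick the highest-PV keyword.

    click_rows: iterable of (keyword, result_item, page_views) tuples.
    The layout is an assumption made for this sketch.
    """
    pv_by_result = defaultdict(lambda: defaultdict(int))
    for keyword, result_item, page_views in click_rows:
        # Keywords that led to the same result item are aggregated together.
        pv_by_result[result_item][keyword] += page_views
    # For each result item, keep the keyword with the largest total PV.
    return {
        result_item: max(totals, key=totals.get)
        for result_item, totals in pv_by_result.items()
    }


# Hypothetical page view counts, for illustration only.
rows = [
    ("take-away food delivery", "http://waimai.baidu.com", 120),
    ("take-away is made a reservation", "http://waimai.baidu.com", 300),
    ("lunch take-away", "http://waimai.baidu.com", 80),
    ("take-away is made a reservation", "http://waimai.baidu.com", 50),
]
print(highest_pv_keyword_per_result(rows))
```

The centre word of the selected keyword would then serve as an automatically produced label for offline training.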
As shown in Fig. 2, the step of obtaining multiple historical search keywords corresponding to the same URL and generating a search keyword set corresponding to that URL specifically comprises steps S211, S212 and S213. Step S211: obtain historical search click records of multiple users. Step S212: extract, from the historical search click records, the correspondence between historical search keywords and search result items and the correspondence between search result items and URLs. Step S213: obtain, according to these correspondences, the multiple historical search keywords corresponding to the same URL.
The step of extracting these correspondences from the historical search click records specifically comprises step S2121 (not shown) and step S2122 (not shown). Step S2121: extract the historical search keywords entered by the respective users in the historical search click records, and the correspondence between each such keyword and the search result item the user clicked after searching with it. Step S2122: extract the correspondence between each search result item and its URL.
For example, historical search click records of multiple users are obtained, and the historical search keywords that the users entered, such as "take-away" and "take-away is made a reservation", are extracted from them. The search result items that the users clicked after searching with their respective keywords are extracted as well, together with the correspondence between historical search keywords and search result items. Suppose a search with the historical search keyword "take-away is made a reservation" returns search result items that include one pointing to the "Baidu take-away official website" and one pointing to the "Meituan take-away official website", and the user clicks the item "Baidu take-away official website". The search result item selected for the keyword "take-away is made a reservation" is then "Baidu take-away official website", and the correspondence between this keyword and this search result item is extracted. Next, the URL (Uniform Resource Locator) corresponding to the search result item "Baidu take-away official website", http://waimai.baidu.com, is extracted, which gives the correspondence between the historical search keyword "take-away is made a reservation" and the URL http://waimai.baidu.com. From the extracted correspondences between historical search keywords and URLs, the multiple historical search keywords corresponding to the same URL, such as http://waimai.baidu.com, are then obtained; for example, the historical search keywords corresponding to http://waimai.baidu.com include "take-away food delivery", "take-away is made a reservation", "lunch take-away" and "ordering takeaway". Finally, the search keyword set corresponding to http://waimai.baidu.com is generated, comprising "take-away food delivery", "take-away is made a reservation", "lunch take-away" and "ordering takeaway".
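Steps S211 to S213 amount to joining two mappings and grouping by URL. The sketch below shows one way this could look; the function name `build_keyword_sets`, the record layout and the result item identifiers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_keyword_sets(click_records, result_to_url):
    """Group historical search keywords by the URL of the clicked result.

    click_records: iterable of (keyword, clicked_result_item) pairs
    result_to_url: dict mapping a result item to its URL
    Both structures are assumptions made for this sketch.
    """
    keyword_sets = defaultdict(set)
    for keyword, result_item in click_records:
        url = result_to_url.get(result_item)
        if url is not None:  # skip clicks whose URL is unknown
            keyword_sets[url].add(keyword)
    return dict(keyword_sets)


# Hypothetical result item identifiers; the second URL is a placeholder.
result_to_url = {
    "baidu_takeaway_result": "http://waimai.baidu.com",
    "other_result": "http://example.com",
}
click_records = [
    ("take-away food delivery", "baidu_takeaway_result"),
    ("take-away is made a reservation", "baidu_takeaway_result"),
    ("lunch take-away", "baidu_takeaway_result"),
    ("ordering takeaway", "baidu_takeaway_result"),
    ("take-away is made a reservation", "other_result"),
]
keyword_sets = build_keyword_sets(click_records, result_to_url)
print(sorted(keyword_sets["http://waimai.baidu.com"]))
```

Using a set per URL removes duplicate keywords entered by different users, which matches the idea of a search keyword set.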
Step S120 specifically comprises step S121 (not shown) and step S122 (not shown). Step S121: perform word segmentation on the multiple historical search keywords in the search keyword set to obtain a centre word training set comprising multiple word segments. Step S122: perform model training based on the centre word training set to generate the machine learning algorithm.
Step S121 comprises step S1211 (not shown) and step S1212 (not shown). Step S1211: perform word segmentation on each historical search keyword in the search keyword set to obtain multiple word segments. Step S1212: filter the multiple word segments, and take the filtering result as the centre word training set.
The word segmentation methods include, but are not limited to:
Forward maximum matching (scanning from left to right);
Reverse maximum matching (scanning from right to left);
Minimum segmentation (minimizing the number of words cut from each sentence);
Bidirectional maximum matching (two scans, one from left to right and one from right to left).
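A minimal sketch of the forward, reverse and bidirectional maximum matching methods listed above follows. The toy vocabulary and input string are hypothetical, and the tie-breaking rule in the bidirectional variant (prefer the reverse result when both scans yield the same number of segments) is a common heuristic assumed here, not something the patent specifies.

```python
def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, always taking the longest vocabulary match."""
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback segment.
            if size == 1 or piece in vocab:
                segments.append(piece)
                i += size
                break
    return segments


def reverse_max_match(text, vocab, max_len=4):
    """Scan right to left, always taking the longest vocabulary match."""
    segments, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                segments.append(piece)
                j -= size
                break
    segments.reverse()  # restore left-to-right order
    return segments


def bidirectional_max_match(text, vocab, max_len=4):
    """Run both scans and keep the one with fewer segments (ties: reverse)."""
    fwd = forward_max_match(text, vocab, max_len)
    rev = reverse_max_match(text, vocab, max_len)
    return fwd if len(fwd) < len(rev) else rev


toy_vocab = {"abc", "ab", "cd"}  # hypothetical toy vocabulary
print(forward_max_match("abcd", toy_vocab))
print(reverse_max_match("abcd", toy_vocab))
print(bidirectional_max_match("abcd", toy_vocab))
```

The toy input shows that the two scan directions can disagree on the same string, which is why the bidirectional method compares both results.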
The filtering methods include: filtering out word segments that are not contained in a candidate vocabulary; and filtering out the word segments with the shortest length.
For example, the search keyword set corresponding to http://waimai.baidu.com comprises the historical search keywords "take-away food delivery", "take-away is made a reservation", "lunch take-away" and "ordering takeaway". Word segmentation is performed on each of these keywords to obtain multiple word segments. Segmenting "take-away food delivery" by forward maximum matching yields the word segments "take-away" and "food delivery". Likewise, segmenting "take-away is made a reservation", "lunch take-away" and "ordering takeaway" by forward maximum matching yields the word segments "take-away", "making a reservation", "crying" and "lunch". If the preset candidate vocabulary comprises "take-away", "crying" and "making a reservation", then filtering the word segments "take-away", "making a reservation", "crying" and "lunch" shows that "lunch" is not in the candidate vocabulary, so "lunch" is filtered out and the result is "take-away", "making a reservation" and "crying". The segment with the shortest length in this result, "crying", can then be filtered out as well, which leaves "take-away" and "making a reservation"; this final result is taken as the centre word training set.
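The two filtering rules in this example can be sketched as below. The function name `filter_segments` and the `drop_shortest` flag are hypothetical. Length is measured here in characters of the English strings used in this text, which happens to reproduce the example's outcome; a real system would measure the original unsegmented text.

```python
def filter_segments(segments, candidate_vocab, drop_shortest=True):
    """Keep segments found in the candidate vocabulary, then drop the shortest.

    Illustrative sketch only; the patent does not prescribe this exact
    procedure or these parameter names.
    """
    # Remove duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(segments))
    kept = [s for s in unique if s in candidate_vocab]
    if drop_shortest and kept:
        shortest = min(len(s) for s in kept)
        longest = max(len(s) for s in kept)
        # Only drop when lengths differ, so the result is never emptied.
        if shortest < longest:
            kept = [s for s in kept if len(s) > shortest]
    return kept


segments = [
    "take-away", "food delivery",
    "take-away", "making a reservation",
    "lunch", "take-away",
    "crying", "take-away",
]
candidate_vocab = {"take-away", "crying", "making a reservation"}
training_set = filter_segments(segments, candidate_vocab)
print(training_set)
```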
Step S122 comprises step S1221 (not shown), step S1222 (not shown) and step S1223 (not shown). Step S1221: represent each word segment as a multi-dimensional vector. Step S1222: extract the feature attributes of each word segment in the centre word training set. Step S1223: perform classification training, based on the feature attributes, on the word segments represented as vectors, to generate the machine learning algorithm.
The feature attributes include, but are not limited to:
Part-of-speech information;
Information on the relation to the corresponding historical search keyword;
TF-IDF;
Special-word information;
Entity-word information.
For example, the word segments obtained by filtering, such as "take-away" and "making a reservation", are each represented as a multi-dimensional vector, and each vectorized segment is labelled with a class: the segments corresponding to "take-away" and "making a reservation" are labelled as positive examples and all other words as negative examples. Next, the feature attributes of each word segment in the centre word training set are extracted. These include: part-of-speech information, which covers not only the part of speech of the word itself but also that of the preceding and following words; information on the relation to the corresponding historical search keyword; TF-IDF (Term Frequency-Inverse Document Frequency); special-word information, such as whether the segment appears in a special vocabulary; and entity-word information, such as whether it appears in an entity vocabulary. Finally, based on these feature attributes, classification training is performed on the vectorized word segments to generate the machine learning algorithm, for example by running offline classification training on the vectorized segments "take-away" and "making a reservation" with LIBLINEAR (a classifier).
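The following sketch illustrates how such feature vectors and a linear classifier might fit together. It is a stand-in, not the patent's implementation: a small hand-written logistic regression replaces LIBLINEAR, the TF-IDF formula is one common smoothed variant, and the part-of-speech tags, special vocabulary and entity vocabulary are hypothetical lookup tables.

```python
import math


def tf_idf(segment, keyword_segments, corpus):
    """Smoothed TF-IDF of a segment within one segmented keyword."""
    tf = keyword_segments.count(segment) / len(keyword_segments)
    df = sum(1 for segs in corpus if segment in segs)
    return tf * (math.log((1 + len(corpus)) / (1 + df)) + 1.0)


def features(segment, keyword_segments, corpus, pos_tags, special, entities):
    """Build one feature vector; every lookup table here is hypothetical."""
    return [
        1.0,                                                   # bias term
        1.0 if pos_tags.get(segment) == "noun" else 0.0,       # part of speech
        keyword_segments.index(segment) / len(keyword_segments),  # position in keyword
        tf_idf(segment, keyword_segments, corpus),
        1.0 if segment in special else 0.0,                    # special-word flag
        1.0 if segment in entities else 0.0,                   # entity-word flag
    ]


def train_logistic(samples, labels, epochs=500, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            weights = [w + lr * (y - p) * v for w, v in zip(weights, x)]
    return weights


def predict(weights, x):
    """Return True when the segment is classified as a centre word."""
    return sum(w * v for w, v in zip(weights, x)) >= 0.0


corpus = [
    ["take-away", "food delivery"],
    ["take-away", "making a reservation"],
    ["lunch", "take-away"],
    ["crying", "take-away"],
]
pos_tags = {"take-away": "noun", "making a reservation": "noun", "lunch": "noun"}
special, entities = {"making a reservation"}, {"take-away"}
positives = {"take-away", "making a reservation"}  # labelled positive examples

samples, labels = [], []
for segs in corpus:
    for seg in segs:
        samples.append(features(seg, segs, corpus, pos_tags, special, entities))
        labels.append(1 if seg in positives else 0)

weights = train_logistic(samples, labels)
print([predict(weights, x) for x in samples])
```

A production system would more likely call an established linear classifier library and add the part of speech of neighbouring words as further dimensions.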
In a concrete application scenario, a user enters a search keyword such as "Spring Festival grab red packet" in the search engine's input box through a terminal device, and the search engine server obtains the keyword in real time. The machine learning algorithm generated by the present scheme then extracts the centre words in the search keyword, such as "Spring Festival", "grab" and "red packet". The extracted centre words are used to determine the search suggestions associated with the keyword that are recommended to the user, such as "WeChat grab red packet", "Spring Festival grab red packet event", "Spring Festival grab red packet guide" and "Alipay Spring Festival grab red packet", as shown in Fig. 3.
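One simple way to turn extracted centre words into suggestions is to rank candidate keywords by how many centre words they contain. The patent does not state how suggestions are selected, so the scoring rule, the function name `suggest` and the candidate list below are assumptions made for illustration.

```python
def suggest(centre_words, candidates, top_k=3):
    """Rank candidate keywords by the number of centre words they contain."""
    scored = []
    for text in candidates:
        lowered = text.lower()
        hits = sum(1 for word in centre_words if word.lower() in lowered)
        if hits:  # ignore candidates that share no centre word
            scored.append((hits, text))
    # The sort is stable, so equal scores keep their original order.
    scored.sort(key=lambda pair: -pair[0])
    return [text for _, text in scored[:top_k]]


centre_words = ["Spring Festival", "grab", "red packet"]
candidates = [
    "WeChat grab red packet",
    "Spring Festival grab red packet event",
    "Spring Festival grab red packet guide",
    "Alipay Spring Festival grab red packet",
    "weather forecast",
]
print(suggest(centre_words, candidates, top_k=4))
```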
Fig. 4 is a structural diagram of the device for generating a machine learning algorithm for determining a centre word according to another embodiment of the present invention.
The acquisition and generation module 410 obtains multiple historical search keywords corresponding to the same URL, and generates a search keyword set corresponding to that URL. The processing module 420 processes the multiple historical search keywords in the search keyword set, and performs model training on the processing result to generate the machine learning algorithm.
The embodiments of the present invention propose a scheme for generating a machine learning algorithm for determining a centre word. Extracting the centre word corresponding to a search keyword with the machine learning algorithm makes centre word extraction automatic, so that the centre words of a very large number of search keywords can be extracted accurately in a standardized and objective way, which greatly reduces the manpower and time that centre word extraction costs. In addition, classification training by the machine learning algorithm in the course of centre word extraction makes the extracted centre words better match the user's real search intention. This avoids the situation in which different people's subjective criteria for judging a centre word cause the extracted centre word to deviate substantially from the user's real search intention, and achieves efficient and accurate extraction of the centre words of a very large number of search keywords. The search keywords that best match the user's interest can then be recommended according to the extracted centre words, which helps the user find the required query results quickly and accurately and improves the search experience.
The acquisition and generation module 410 obtains multiple historical search keywords corresponding to the same URL and generates a search keyword set corresponding to that URL.
For example, the multiple historical search keywords corresponding to the same URL, such as http://waimai.baidu.com, are obtained, for instance "takeaway meal delivery", "takeaway meal ordering", "lunch takeaway" and "call for takeaway", and the search keyword set corresponding to http://waimai.baidu.com is generated, comprising "takeaway meal delivery", "takeaway meal ordering", "lunch takeaway" and "call for takeaway".
The processing module 420 processes the multiple historical search keywords in the search keyword set and performs model training on the processing result to generate the machine learning algorithm.
For example, users search for each historical search keyword through the search engine, which returns multiple search result items for each keyword, and each user selects one of those result items. The historical search keywords for which the same search result item was selected are aggregated. Within each aggregated group, the centre word of the search keyword with the highest PV (Page View) is labelled automatically and used as the training set for offline training. Each centre word in the training set is then represented as a multi-dimensional vector, and the machine learning algorithm is generated by offline training. When the server of the search engine later obtains the search keyword "takeaway meal ordering" entered online by a user, the machine learning algorithm obtained by offline training extracts its centre words, such as "takeaway" and "meal ordering".
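The automatic labelling step in this example, choosing the highest-PV keyword of an aggregated group as the source of the label, can be sketched as below. The function name `highest_pv_keyword` and the PV counts are hypothetical; the text does not give concrete numbers or say how ties are resolved.

```python
def highest_pv_keyword(group_pv):
    """Return the keyword with the largest page-view count in one aggregated group."""
    return max(group_pv, key=group_pv.get)

# Hypothetical PV counts for the keywords aggregated under one search result item.
group_pv = {
    "takeaway meal delivery": 800,
    "takeaway meal ordering": 1200,
    "lunch takeaway": 300,
    "call for takeaway": 450,
}
label_source = highest_pv_keyword(group_pv)
```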
As shown in Figure 5, the acquisition and generation module specifically comprises a first acquiring unit 511, an extraction unit 512 and a second acquiring unit 513. The first acquiring unit 511 obtains the historical search click records of multiple users; the extraction unit 512 extracts, from the historical search click records, the correspondence between historical search keywords and search result items and the correspondence between search result items and URLs; the second acquiring unit 513 obtains, according to these correspondences, the multiple historical search keywords corresponding to the same URL.
Specifically, the extraction unit extracts, from the historical search click records, the historical search keywords entered by the multiple users and the correspondence between each keyword and the search result item that the user clicked after entering it; it also extracts the correspondence between each search result item and its URL.
For example, the historical search click records of multiple users are obtained, and the historical search keywords entered by those users, such as "takeaway" and "takeaway meal ordering", are extracted from the records. The search result items that the users clicked for their keywords are also extracted, together with the correspondence between keywords and search result items. Suppose a search for the historical search keyword "takeaway meal ordering" returns result items pointing to "Baidu Takeaway official website", "Meituan Takeaway official website" and so on, and the user clicks "Baidu Takeaway official website"; the correspondence between the keyword "takeaway meal ordering" and the search result item "Baidu Takeaway official website" is then extracted. Next, the URL (Uniform Resource Locator) corresponding to the search result item "Baidu Takeaway official website", http://waimai.baidu.com, is extracted, which yields the correspondence between the keyword "takeaway meal ordering" and the URL http://waimai.baidu.com. From the extracted correspondences between historical search keywords and URLs, the multiple historical search keywords corresponding to the same URL are obtained; for http://waimai.baidu.com these include "takeaway meal delivery", "takeaway meal ordering", "lunch takeaway" and "call for takeaway". Finally, the search keyword set corresponding to http://waimai.baidu.com is generated, comprising "takeaway meal delivery", "takeaway meal ordering", "lunch takeaway" and "call for takeaway".
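The two correspondences described above (keyword to clicked result item, result item to URL) can be combined into per-URL keyword sets roughly as follows. This is a minimal sketch: the function `build_keyword_sets`, the in-memory record format and the second URL are assumptions made for illustration, not details from the text.

```python
from collections import defaultdict

def build_keyword_sets(clicks, item_urls):
    """Group historical search keywords by the URL of the result item clicked."""
    keyword_sets = defaultdict(list)
    for keyword, item in clicks:
        url = item_urls[item]  # correspondence: search result item -> URL
        if keyword not in keyword_sets[url]:  # skip repeated keywords
            keyword_sets[url].append(keyword)
    return dict(keyword_sets)

# Correspondence between search result items and URLs.
# The second URL is a hypothetical placeholder.
item_urls = {
    "Baidu Takeaway official website": "http://waimai.baidu.com",
    "Meituan Takeaway official website": "http://example.com/meituan",
}
# Hypothetical click records: (keyword entered, result item clicked).
clicks = [
    ("takeaway meal ordering", "Baidu Takeaway official website"),
    ("takeaway meal delivery", "Baidu Takeaway official website"),
    ("lunch takeaway", "Baidu Takeaway official website"),
    ("call for takeaway", "Baidu Takeaway official website"),
    ("takeaway meal ordering", "Baidu Takeaway official website"),
    ("takeaway", "Meituan Takeaway official website"),
]
keyword_sets = build_keyword_sets(clicks, item_urls)
```

Lists are used instead of sets so that the keywords keep the order in which they were first seen, which makes the result easier to inspect.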
The processing module specifically comprises a processing unit (not shown) and a generation unit (not shown). The processing unit performs word segmentation on the multiple historical search keywords in the search keyword set to obtain a centre word training set comprising multiple word segments; the generation unit performs model training based on the centre word training set to generate the machine learning algorithm.
The processing unit comprises a processing subunit (not shown) and a screening subunit (not shown). The processing subunit performs word segmentation on each historical search keyword in the search keyword set to obtain multiple word segments; the screening subunit screens the word segments and takes the screening result as the centre word training set.
The word segmentation methods include but are not limited to:
Forward maximum matching (scanning from left to right);
Reverse maximum matching (scanning from right to left);
Minimum segmentation (minimising the number of words cut from each string);
Bidirectional maximum matching (two scans, one from left to right and one from right to left).
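The matching-based methods in this list can be sketched in a few lines. The code below is an illustration under assumptions: the dictionary `vocab` is hypothetical, unsegmented lowercase English strings stand in for Chinese queries, and the tie-break in the bidirectional variant (prefer fewer words, otherwise keep the reverse scan) is a common heuristic rather than something the text specifies. Minimum segmentation is not shown.

```python
def forward_max_match(text, vocab):
    """Scan left to right, always taking the longest dictionary word available."""
    max_len = max((len(w) for w in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens

def reverse_max_match(text, vocab):
    """Scan right to left, always taking the longest dictionary word available."""
    max_len = max((len(w) for w in vocab), default=1)
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]  # restore reading order

def bidirectional_max_match(text, vocab):
    """Run both scans and keep the segmentation with fewer words."""
    fwd = forward_max_match(text, vocab)
    rev = reverse_max_match(text, vocab)
    return fwd if len(fwd) < len(rev) else rev

# Hypothetical dictionary; the strings stand in for unsegmented queries.
vocab = {"takeaway", "delivery", "ordering", "lunch", "call"}
segments = forward_max_match("takeawaydelivery", vocab)
```

The single-character fallback guarantees that both scans terminate even when part of the input is not covered by the dictionary.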
The screening methods comprise: filtering out word segments that are not in the candidate vocabulary; and filtering out the word segments with the shortest length.
For example, the search keyword set corresponding to http://waimai.baidu.com comprises the historical search keywords "takeaway meal delivery", "takeaway meal ordering", "lunch takeaway" and "call for takeaway". Word segmentation is performed on each historical search keyword in this set to obtain multiple word segments. Segmenting "takeaway meal delivery" by forward maximum matching yields the word segments "takeaway" and "meal delivery"; segmenting "takeaway meal ordering", "lunch takeaway" and "call for takeaway" in the same way yields the word segments "takeaway", "meal ordering", "call" and "lunch". If the preset candidate vocabulary comprises "takeaway", "call" and "meal ordering", screening the word segments "takeaway", "meal ordering", "call" and "lunch" finds that "lunch" is not in the candidate vocabulary, so "lunch" is filtered out and the intermediate screening result is "takeaway", "meal ordering" and "call". The word segment with the shortest length in this result, "call", can then be filtered out, giving the screening result "takeaway" and "meal ordering", which is taken as the centre word training set.
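The two screening rules in this example can be expressed as a short function. This is a sketch under assumptions: the name `screen_segments` is hypothetical, English renderings stand in for the original segments, and the guard that keeps everything when all surviving segments have the same length is an addition not stated in the text.

```python
def screen_segments(segments, candidate_vocab):
    """Apply the two screening rules: vocabulary filter, then shortest-length filter."""
    # Rule 1: drop segments that are not in the candidate vocabulary
    # (dict.fromkeys removes duplicates while keeping order).
    kept = [s for s in dict.fromkeys(segments) if s in candidate_vocab]
    if not kept:
        return kept
    # Rule 2: drop the segments whose length is the shortest.
    shortest = min(len(s) for s in kept)
    longer = [s for s in kept if len(s) > shortest]
    # Assumption: if every segment has the same length, keep them all.
    return longer or kept

segments = ["takeaway", "meal ordering", "call", "lunch"]
candidate_vocab = {"takeaway", "call", "meal ordering"}
training_set = screen_segments(segments, candidate_vocab)
```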
The generation unit comprises a representation subunit (not shown), an extraction subunit (not shown) and a generation subunit (not shown). The representation subunit represents each word segment as a vector; the extraction subunit extracts the feature attributes of each word segment in the centre word training set; the generation subunit performs classification training, based on the feature attributes, on the vector-represented word segments to generate the machine learning algorithm.
The feature attributes include but are not limited to:
Part-of-speech information;
Relation to the corresponding historical search keyword;
TF-IDF;
Special-word information;
Entity-word information.
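Of the feature attributes listed above, TF-IDF is the one with a standard closed form, so a small sketch may help. The text does not say which TF-IDF variant is used; the smoothed inverse document frequency below is one common choice and is an assumption, as are the function name `tf_idf` and the treatment of each segmented keyword as one document.

```python
import math

def tf_idf(segment, document, corpus):
    """TF-IDF of a segment within one segmented keyword (document).

    tf  = occurrences of the segment in the document / document length
    idf = ln((1 + N) / (1 + df)) + 1, a smoothed variant (assumed here)
    """
    tf = document.count(segment) / len(document)
    df = sum(1 for doc in corpus if segment in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

# Each historical search keyword, already segmented, is treated as a document.
corpus = [
    ["takeaway", "meal delivery"],
    ["takeaway", "meal ordering"],
    ["lunch", "takeaway"],
    ["call", "takeaway"],
]
score_common = tf_idf("takeaway", corpus[2], corpus)
score_rare = tf_idf("lunch", corpus[2], corpus)
```

Note that a segment appearing in every keyword of the set, such as "takeaway" here, receives a lower TF-IDF than a rarer one, which is one reason the text combines TF-IDF with the other feature attributes rather than relying on it alone.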
For example, the word segments obtained by screening, such as "takeaway" and "meal ordering", are represented as multi-dimensional vectors that capture the different representations of each segment, and each vector-represented segment is given a class label: the words corresponding to the word segments "takeaway" and "meal ordering" are labelled as positive examples, and all other words as negative examples. The feature attributes of each word segment in the centre word training set are then extracted. These comprise part-of-speech information, which covers not only the part of speech of the word itself but also that of the preceding and following words; the relation to the corresponding historical search keyword; TF-IDF (Term Frequency-Inverse Document Frequency); special-word information, such as whether the word appears in a special vocabulary; and entity-word information, such as whether the word appears in an entity vocabulary. Based on these feature attributes, classification training is performed on the vector-represented word segments to generate the machine learning algorithm; for instance, offline classification training is performed on the vector-represented word segments "takeaway" and "meal ordering" with Liblinear (a classifier).
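The classification training described above can be illustrated with a small linear classifier. The text names Liblinear; to keep the sketch dependency-free, a hand-written logistic regression trained by stochastic gradient descent stands in for it, which is an explicit substitution. The three features per segment (a TF-IDF-like score, an entity-vocabulary flag and a relative length) and all their values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, epochs=500, lr=0.5):
    """Fit a logistic-regression model with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict(model, x):
    """Probability that the segment described by x is a centre word."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hypothetical feature vectors: [tf-idf, in entity vocabulary, relative length].
features = {
    "takeaway": [0.5, 1.0, 0.6],
    "meal ordering": [0.9, 1.0, 0.7],
    "call": [0.9, 0.0, 0.3],
    "lunch": [0.9, 0.0, 0.4],
}
# Positive examples (1) are centre words, negative examples (0) are not.
labels = {"takeaway": 1, "meal ordering": 1, "call": 0, "lunch": 0}

names = list(features)
model = train_logistic([features[n] for n in names], [labels[n] for n in names])
```

Calling `predict(model, features["takeaway"])` then returns a probability above 0.5, and the same call for "call" returns one below 0.5; with real data the model would be trained offline on the full centre word training set and applied online to the segments of each incoming search keyword.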
Those skilled in the art will appreciate that the present invention includes one or more devices for performing the operations described in this application. These devices may be specially designed and manufactured for the required purposes, or may comprise known devices in general-purpose computers. These devices store computer programs that are selectively activated or reconfigured. Such a computer program may be stored in a device-readable (for example, computer-readable) storage medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus. The computer-readable medium includes but is not limited to any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards or optical cards. That is, a computer-readable storage medium includes any medium that stores or transmits information in a form readable by a device (for example, a computer).
Those skilled in the art will appreciate that computer program instructions can be used to implement each block in these structural diagrams and/or block diagrams and/or flow diagrams, as well as combinations of blocks in them. Those skilled in the art will also appreciate that these computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus, so that the processor executes the scheme specified in one or more blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention.
Those skilled in the art will appreciate that the steps, measures and schemes in the various operations, methods and flows discussed in the present invention can be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and flows discussed in the present invention can also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art that correspond to the various operations, methods and flows disclosed in the present invention can also be alternated, changed, rearranged, decomposed, combined or deleted.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also fall within the protection scope of the present invention.

Claims (10)

1. A method for generating a machine learning algorithm for determining a centre word, characterized by comprising:
obtaining multiple historical search keywords corresponding to the same URL, and generating a search keyword set corresponding to said same URL;
processing the multiple historical search keywords in said search keyword set, and performing model training on the processing result to generate said machine learning algorithm.
2. The method for generating a machine learning algorithm for determining a centre word according to claim 1, characterized in that obtaining multiple historical search keywords corresponding to the same URL specifically comprises:
obtaining historical search click records of multiple users;
extracting, from said historical search click records, the correspondence between historical search keywords and search result items and the correspondence between search result items and URLs;
obtaining, according to said correspondences, the multiple historical search keywords corresponding to the same URL.
3. The method for generating a machine learning algorithm for determining a centre word according to claim 1 or 2, characterized in that extracting, from said historical search click records, the correspondence between historical search keywords and search result items and the correspondence between search result items and URLs specifically comprises:
extracting, from said historical search click records, the historical search keywords entered by the multiple users, and the correspondence between the historical search keyword each user entered and the search result item that user clicked; and
extracting the correspondence between said search result items and their corresponding URLs.
4. The method for generating a machine learning algorithm for determining a centre word according to any one of claims 1-3, characterized in that processing the multiple historical search keywords in said search keyword set and performing model training on the processing result to generate said machine learning algorithm specifically comprises:
performing word segmentation on the multiple historical search keywords in said search keyword set to obtain a centre word training set comprising multiple word segments;
performing model training based on said centre word training set to generate said machine learning algorithm.
5. The method for generating a machine learning algorithm for determining a centre word according to any one of claims 1-4, characterized in that performing word segmentation on the historical search keywords in said search keyword set to obtain a centre word training set comprising multiple word segments comprises:
performing word segmentation on each historical search keyword in said search keyword set to obtain multiple word segments;
screening said multiple word segments, and taking the screening result as the centre word training set.
6. A device for generating a machine learning algorithm for determining a centre word, characterized by comprising:
an acquisition and generation module, configured to obtain multiple historical search keywords corresponding to the same URL and to generate a search keyword set corresponding to said same URL;
a processing module, configured to process the multiple historical search keywords in said search keyword set and to perform model training on the processing result to generate said machine learning algorithm.
7. The device for generating a machine learning algorithm for determining a centre word according to claim 6, characterized in that said acquisition and generation module specifically comprises:
a first acquiring unit, configured to obtain historical search click records of multiple users;
an extraction unit, configured to extract, from said historical search click records, the correspondence between historical search keywords and search result items and the correspondence between search result items and URLs;
a second acquiring unit, configured to obtain, according to said correspondences, the multiple historical search keywords corresponding to the same URL.
8. The device for generating a machine learning algorithm for determining a centre word according to claim 6 or 7, characterized in that said extraction unit is specifically configured to extract, from said historical search click records, the historical search keywords entered by the multiple users, and the correspondence between the historical search keyword each user entered and the search result item that user clicked; and
to extract the correspondence between said search result items and their corresponding URLs.
9. The device for generating a machine learning algorithm for determining a centre word according to any one of claims 6-8, characterized in that said processing module specifically comprises:
a processing unit, configured to perform word segmentation on the multiple historical search keywords in said search keyword set to obtain a centre word training set comprising multiple word segments;
a generation unit, configured to perform model training based on said centre word training set to generate said machine learning algorithm.
10. The device for generating a machine learning algorithm for determining a centre word according to any one of claims 6-9, characterized in that said processing unit comprises:
a processing subunit, configured to perform word segmentation on each historical search keyword in said search keyword set to obtain multiple word segments;
a screening subunit, configured to screen said multiple word segments and to take the screening result as the centre word training set.
CN201510965343.6A 2015-12-21 2015-12-21 Generation method and device for determining machine learning algorithm of head word Pending CN105608071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510965343.6A CN105608071A (en) 2015-12-21 2015-12-21 Generation method and device for determining machine learning algorithm of head word


Publications (1)

Publication Number Publication Date
CN105608071A true CN105608071A (en) 2016-05-25

Family

ID=55988015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510965343.6A Pending CN105608071A (en) 2015-12-21 2015-12-21 Generation method and device for determining machine learning algorithm of head word

Country Status (1)

Country Link
CN (1) CN105608071A (en)



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160525