CN107784019A - Search term processing method and system for a search service - Google Patents

Info

Publication number
CN107784019A
Authority
CN
China
Prior art keywords
word
speech
search
participle
prior searches
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610785278.3A
Other languages
Chinese (zh)
Inventor
邓凯
陈亚
程进兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Commerce Group Co Ltd
Original Assignee
Suning Commerce Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Commerce Group Co Ltd
Priority to CN201610785278.3A
Publication of CN107784019A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The embodiments of the invention disclose a search term processing method and system for a search service, belonging to the field of search technology. The method and system can analyze search terms that carry modifiers or qualifiers or that are relatively long, improving search accuracy and the recall rate of search results and enhancing the user's search experience on e-commerce websites. The method comprises the following steps: mining and crawling product data, and extracting dictionaries from the product data; analyzing, according to the dictionaries, to obtain the part-of-speech sequences and feature lists of previous search terms; performing word segmentation on the current search term, obtaining the parts of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, and determining the weights of the segmented words; and determining the search term to be searched according to the weights of the segmented words.

Description

Search term processing method and system for a search service
Technical field
The present invention relates to the field of search technology, and in particular to a search term processing method and system for a search service.
Background
With the development of e-commerce, more and more consumers are inclined to shop online. Unlike in a physical store, online shopping requires the consumer to enter keywords for the desired goods or services, i.e. search terms, on an e-commerce website. At present, most e-commerce websites apply only simple processing to a search term before passing it to the site search engine for exact-match search. This approach requires the consumer to enter a simple, intuitive search term (for example, a product name) to obtain good search results. In other words, the search engine of an e-commerce platform places high demands on the search terms entered by users: only search terms that are clear and consist of single characters or word pairs yield good search results and recall.
If the search term entered by the consumer carries modifiers or qualifiers, or is relatively long, the search engine cannot return highly relevant products and may even return completely unrelated ones, which harms the user's search experience on the e-commerce website.
Summary of the invention
The object of the present invention is to provide a search term processing method and system for a search service that can analyze and process search terms that carry modifiers or qualifiers or that are relatively long, improving search accuracy and the recall rate of search results and enhancing the user's search experience on e-commerce websites.
To achieve the above object, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, an embodiment of the present invention provides a search term processing method for a search service, the method comprising the following steps:
mining and crawling product data, and extracting dictionaries from the product data; and analyzing, according to the dictionaries, to obtain the part-of-speech sequences and feature lists of previous search terms;
performing word segmentation on the current search term, obtaining the parts of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, and determining the weights of the segmented words; and determining the search term to be searched according to the weights of the segmented words.
With reference to the first aspect, in a first possible implementation, the mining and crawling of product data, the extraction of dictionaries from the product data, and the analysis according to the dictionaries to obtain the part-of-speech sequences and feature lists of previous search terms specifically comprise the following steps:
mining and crawling product data from the public Internet;
extracting dictionaries of various categories from the mined product data;
analyzing the part-of-speech sequences of previous search terms using the words in the dictionaries, and obtaining the feature lists of the previous search terms.
With reference to the first possible implementation of the first aspect, in a second possible implementation, analyzing the part-of-speech sequences of previous search terms comprises the following steps:
segmenting the sentences in the test set into words, and then labeling each segmented word with its part of speech;
training three parameters, namely a first parameter Pi, a second parameter A and a third parameter B, wherein the first parameter Pi represents the prior probabilities of the N parts of speech (the hidden states), the second parameter A represents the state transition probability matrix between the parts of speech of adjacent words among the N parts of speech, and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
obtaining the part-of-speech sequences of previous search terms using the three trained parameters.
With reference to the second possible implementation of the first aspect, in a third possible implementation, obtaining the feature lists of previous search terms comprises the following steps: performing word segmentation on a previous search term to split it into analysis units, assigning attributes to the analysis units, and generating the feature list of the previous search term; the feature list includes the previous search term, the literal meaning of the previous search term, and the analysis unit attributes.
With reference to the first aspect, or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation, determining the search term to be searched specifically comprises the following steps:
performing word segmentation on the current search term;
obtaining the part of speech of each segmented word of the current search term;
assigning different weights to the segmented words according to their parts of speech, and determining the search term to be searched.
In a second aspect, an embodiment of the present invention further provides a search term processing system for a search service, comprising:
an extraction and analysis module, configured to mine and crawl product data, extract dictionaries from the product data, and analyze, according to the dictionaries, to obtain the part-of-speech sequences and feature lists of previous search terms;
a determining module, configured to perform word segmentation on the current search term, obtain the parts of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, determine the weights of the segmented words, and determine the search term to be searched according to those weights.
With reference to the second aspect, in a first possible implementation, the extraction and analysis module comprises:
a mining submodule, configured to mine and crawl product data from the public Internet;
an extraction submodule, configured to extract dictionaries of various categories from the mined product data;
an analysis submodule, configured to analyze the part-of-speech sequences of previous search terms using the words in the dictionaries and to obtain the feature lists of the previous search terms.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the analysis submodule comprises:
a first word segmentation unit, configured to segment the sentences in the test set into words and then label each segmented word with its part of speech;
a training unit, configured to train three parameters, namely a first parameter Pi, a second parameter A and a third parameter B, wherein the first parameter Pi represents the prior probabilities of the N parts of speech (the hidden states), the second parameter A represents the state transition probability matrix between the parts of speech of adjacent words among the N parts of speech, and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
an analysis unit, configured to obtain the part-of-speech sequences of previous search terms using the three trained parameters.
With reference to the second possible implementation of the second aspect, in a third possible implementation, the analysis submodule further comprises:
a second word segmentation unit, configured to perform word segmentation on a previous search term to split it into analysis units, and to assign attributes to the analysis units;
a generation unit, configured to generate the feature list of the previous search term; the feature list includes the previous search term, the literal meaning of the previous search term, and the analysis unit attributes.
With reference to the second aspect, or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation, the determining module comprises:
a word segmentation submodule, configured to perform word segmentation on the current search term;
a part-of-speech analysis submodule, configured to analyze the part of speech of each segmented word of the current search term;
a determination submodule, configured to assign different weights to the segmented words according to their parts of speech and to determine the search term to be searched.
Compared with the prior art, the search term processing method and system for a search service of the embodiments of the present invention can improve search accuracy and the recall rate of search results and enhance the user's search experience on e-commerce websites. The search term processing method of the embodiments performs word segmentation on the current search term, obtains the parts of speech of the current search term according to the part-of-speech sequences and feature lists of previous search terms, and determines the weights of the segmented words; the search term to be searched is then determined according to those weights. By performing deep semantic understanding and analysis of the search term entered by the user, this embodiment determines the user's real search intent and performs a filtered search using the understood search term, greatly improving search accuracy and the recall rate of search results.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required by the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the system architecture provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the analysis method provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of step 10) of the analysis method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of step 103) of the analysis method provided by an embodiment of the present invention;
Fig. 5 is a schematic flowchart of step 20) of the analysis method provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the analysis system provided by an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the technical solutions of the present invention, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote identical or similar elements, or elements having identical or similar functions, throughout.
The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and shall not be construed as limiting it. Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural. It should further be understood that the word "comprising" used in this specification indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items. Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general-purpose dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined herein, will not be interpreted in an idealized or overly formal sense.
The method flow of this embodiment may be executed on a processing system for a search service as shown in Fig. 1, which includes a network, an extraction and analysis module, a determining module and a database. The extraction and analysis module and the determining module disclosed in this embodiment may each be a device such as a server, a workstation or a supercomputer, or a server cluster system for data processing composed of multiple servers.
The database may be a Redis database or another type of distributed database, relational database, etc. It may consist of a data server that includes storage devices together with the storage devices connected to that data server, or it may be a server cluster system for databases composed of multiple data servers and storage servers.
In this embodiment, the extraction and analysis module may be used to extract previous search terms from the search logs, for example by extracting and analyzing the search logs generated by users over the most recent week, month or other time range. The extracted search terms are first normalized by preprocessing rules into a form that can be processed correctly. The extracted search terms are then formed into dictionaries, and the part-of-speech sequences and feature lists of the previous search terms are obtained. The dictionaries, part-of-speech sequences and feature lists of the previous search terms are stored in the database, for example a Redis database, so that the semantic analysis module can read these data by accessing and querying the database.
In this embodiment, the database may be used to store the search logs generated while the system is running (for example, search logs containing the search terms sent consecutively by a user terminal within one session period) and the dictionaries, part-of-speech sequences and feature lists of previous search terms generated by the extraction and analysis module. The database may also serve as the database of the public character-library and dictionary platform and of the open e-commerce resource platform, or be connected to the databases of those platforms and exchange data with them. Alternatively, while exchanging data with the database, the extraction and analysis module may also be connected to the databases of the public character-library and dictionary platform and the open e-commerce resource platform and exchange data with them.
In this embodiment, the determining module may be implemented as a search server or a server cluster for the search service. The determining module analyzes the semantics of the current search term according to the part-of-speech sequences and feature lists of previous search terms, performs the search according to the determined semantics, and feeds the search results back to the user. The user equipment may be a standalone device or may be integrated in various media data playback devices, such as a set-top box, mobile phone, tablet personal computer, laptop computer, multimedia player, digital camera, personal digital assistant (PDA), navigation device, mobile Internet device (MID) or wearable device.
A search term processing method for a search service according to an embodiment of the present invention, as shown in Fig. 2, comprises the following two steps:
Step 10): Mine and crawl product data, and extract dictionaries from the product data; then analyze, according to the dictionaries, to obtain the part-of-speech sequences and feature lists of previous search terms.
Preferably, as shown in Fig. 3, step 10) specifically comprises steps 101) to 103).
Step 101): Mine and crawl product data from the public Internet.
Here, the public Internet includes platforms open to the public, such as public character-library and dictionary platforms and open e-commerce resource platforms. Obtaining product data from the public Internet broadens the sources of information and improves the precision of the dictionaries built subsequently. In this step, data mining and web crawler techniques can be used to mine and crawl product data from the public Internet.
Product data are data related to products and mainly include product names and product attributes. Product attributes include product descriptions such as product type, model, size, capacity and brand. For example, the product data "Panasonic (Panasonic) XQB65-Q6131 6.5 kilograms" include descriptions such as the product name, brand, model and capacity. As another example, the product data "6.5 kilograms rotary drum washing machine" include descriptions such as the product name, product type and product capacity.
Step 102): Extract dictionaries of various categories from the product data mined in step 101).
This step classifies the product data mined in step 101) and builds the dictionaries. Machine learning and data mining techniques can be used to extract dictionaries of various categories from the mined product data. The extracted dictionaries include product name, model, brand, color, material, feature, style, etc. For example, if the product data extracted in step 101) are "burberry/Ba Baoli women's five-color double-breasted belted trench coat", step 102) extracts the dictionary entries: burberry/Ba Baoli → brand, double-breasted → style, trench coat → product name. As another example, if the product data extracted in step 101) are "Ode to Joy Liu Tao Andy same-style suit vest women's spring and autumn vest jacket short style versatile 2016 summer", step 102) extracts the dictionary entries: short style → style, versatile → style, summer → season, vest jacket → product name, suit vest → product name. As a further example, if the product data extracted in step 101) are "Esky baby nightwear 3 months to 6 months", step 102) extracts the dictionary entries: Esky → brand, baby → target group, months → unit_age, nightwear → product name.
Step 103): Using the words in the dictionaries extracted in step 102), analyze the part-of-speech sequences of previous search terms and obtain the feature lists of previous search terms.
In this preferred example, the Viterbi algorithm and a hidden Markov model (HMM) can be used to find the most likely part-of-speech sequence of a previous search term, so as to add part-of-speech labels to the words in the dictionaries.
As shown in Fig. 4, adding labels to the words in the dictionaries in step 103) specifically comprises:
Step 1031): Segment the sentences in the test set into words, and then label each segmented word with its part of speech.
For example: Xiao Ming/n /adv writes/v homework/n.
Here, n denotes a noun, adv an adverb, and v a verb. The observed states are the words in the sentence, and the hidden states are the parts of speech.
Step 1032): Train three parameters: the first parameter Pi, the second parameter A and the third parameter B.
The first parameter Pi represents the prior probabilities of the N parts of speech (the hidden states). Pi is obtained directly by counting the frequency of each part of speech.
The second parameter A represents the state transition probability matrix P(St | St-1) between the parts of speech of adjacent words among the N parts of speech. P(St | St-1) equals the number of times the two parts of speech (St-1, St) occur in that order divided by the number of times the part of speech St-1 occurs.
The third parameter B represents the confusion matrix P(Ot | St) from the N parts of speech to the M words. P(Ot | St) equals the number of times the word Ot and the part of speech St occur together divided by the number of times the part of speech St occurs.
Assuming there are N parts of speech and M words in total, the first parameter Pi is a vector of length N, the second parameter A is an N*N matrix, and the third parameter B is an N*M matrix.
When a sentence is tagged with parts of speech, it must be ensured that all words obtained by segmentation are among the M words.
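The counting described in step 1032) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `train_hmm`, the dictionary-based representation of A and B, and the tiny tagged corpus are all hypothetical. Following the text, Pi holds raw part-of-speech frequencies, A divides the count of each ordered tag pair by the count of the first tag, and B divides the count of each (tag, word) pair by the count of the tag.

```python
from collections import Counter


def train_hmm(tagged_sentences):
    """Estimate Pi, A and B by counting.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (pi, a, b) where
      pi[tag]          = number of times the tag occurs,
      a[(prev, tag)]   = count(prev followed by tag) / count(prev),
      b[(tag, word)]   = count(tag together with word) / count(tag).
    """
    tag_count = Counter()
    pair_count = Counter()
    emit_count = Counter()

    for sentence in tagged_sentences:
        for i, (word, tag) in enumerate(sentence):
            tag_count[tag] += 1
            emit_count[(tag, word)] += 1
            if i > 0:
                prev_tag = sentence[i - 1][1]
                pair_count[(prev_tag, tag)] += 1

    pi = dict(tag_count)
    a = {(p, t): c / tag_count[p] for (p, t), c in pair_count.items()}
    b = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return pi, a, b


# Hypothetical two-sentence corpus, already segmented and tagged.
corpus = [
    [("Xiao Ming", "n"), ("writes", "v"), ("homework", "n")],
    [("Xiao Ming", "n"), ("takes", "v"), ("examination", "n")],
]
pi, a, b = train_hmm(corpus)
print(pi)                            # {'n': 4, 'v': 2}
print(a[("n", "v")], a[("v", "n")])  # 0.5 1.0
```

Storing A and B as sparse dictionaries keyed by pairs avoids allocating the full N*N and N*M matrices when most pairs never occur; a dense array would work equally well for small tag sets.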
Step 1033): Use the three parameters trained in step 1032) to obtain the part-of-speech sequences of previous search terms.
An example follows. Suppose there are only two parts of speech and three words. The two parts of speech are noun and verb. The three words are "examination", "Xiao Ming" and "concern". Suppose that, in the statistics of previously seen sentences, nouns occur 100 times and verbs occur 70 times.
First parameter Pi:
            Noun    Verb
Frequency   100     70
Second parameter A: the part-of-speech transition probability matrix between adjacent words. The Y-axis (rows) gives the part of speech of the word immediately to the left of the current position S, and the X-axis (columns) gives the part of speech of the word at the current position S. As the table below shows, both axes range over noun and verb, so the possible part-of-speech pairs of adjacent words are noun and noun, noun and verb, verb and noun, and verb and verb.
            Noun    Verb
Noun        0.25    0.75
Verb        0.8     0.2
Third parameter B: the emission matrix from parts of speech to words, as shown in the table below.
            Examination    Concern    Xiao Ming
Noun        0.4            0.3        1
Verb        0.6            0.7        0
Suppose the parts of speech of the two phrases "Xiao Ming examination" and "concern examination" are to be analyzed.
"Xiao Ming examination"
First step, find the likely part of speech of "Xiao Ming":
Noun frequency in the first parameter Pi * B(Xiao Ming is a noun) = 100*1 = 100
Verb frequency in Pi * B(Xiao Ming is a verb) = 70*0 = 0
First-step conclusion: "Xiao Ming" is most likely a noun.
Second step, find the likely part of speech of "examination":
100 * A(noun followed by noun) * B(examination is a noun) = 100*0.25*0.4 = 10
100 * A(noun followed by verb) * B(examination is a verb) = 100*0.75*0.6 = 45
Second-step conclusion: "examination" is most likely a verb.
Conclusion: [Xiao Ming examination] → [noun verb]
"Concern examination"
First step, find the likely part of speech of "concern":
Noun frequency in Pi * B(concern is a noun) = 100*0.3 = 30
Verb frequency in Pi * B(concern is a verb) = 70*0.7 = 49
First-step conclusion: "concern" is most likely a verb.
Second step, find the likely part of speech of "examination":
49 * A(verb followed by noun) * B(examination is a noun) = 49*0.8*0.4 = 15.68
49 * A(verb followed by verb) * B(examination is a verb) = 49*0.2*0.6 = 5.88
Second-step conclusion: "examination" is most likely a noun.
Conclusion: [concern examination] → [verb noun]
In the above steps, the Viterbi algorithm and the hidden Markov model are used to find the most likely part-of-speech sequence of a sentence, i.e. the sequence with the highest probability. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that produces the observed sequence of events. It records the most likely hidden state and its probability at each step, which avoids repeated computation as each step advances.
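The worked example can be reproduced with a short Viterbi sketch in Python. The function name `viterbi` and the dictionary layout of the parameters are assumptions made for illustration; the numbers are the Pi, A and B values given above, used as unnormalized scores exactly as in the hand calculation. One difference should be noted: the hand calculation only extends the single best tag from the previous step, whereas the sketch keeps the best path for every tag at every position, as the Viterbi algorithm does. For the two example phrases both approaches select the same tag sequence.

```python
def viterbi(words, tags, pi, a, b):
    """Return (best tag sequence, its unnormalized score).

    pi[tag]         initial score of a tag (here: its frequency)
    a[(prev, tag)]  transition probability from prev to tag
    b[(tag, word)]  emission score of word under tag
    """
    # best[t][tag] = (score of the best path ending in tag at position t,
    #                 previous tag on that path)
    best = [{tag: (pi[tag] * b.get((tag, words[0]), 0.0), None) for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] * a.get((p, tag), 0.0))
            score = best[-1][prev][0] * a.get((prev, tag), 0.0) * b.get((tag, word), 0.0)
            column[tag] = (score, prev)
        best.append(column)

    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    score = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, score


TAGS = ["noun", "verb"]
PI = {"noun": 100, "verb": 70}
A = {("noun", "noun"): 0.25, ("noun", "verb"): 0.75,
     ("verb", "noun"): 0.8, ("verb", "verb"): 0.2}
B = {("noun", "examination"): 0.4, ("noun", "concern"): 0.3, ("noun", "Xiao Ming"): 1.0,
     ("verb", "examination"): 0.6, ("verb", "concern"): 0.7, ("verb", "Xiao Ming"): 0.0}

for phrase in (["Xiao Ming", "examination"], ["concern", "examination"]):
    path, score = viterbi(phrase, TAGS, PI, A, B)
    print(phrase, "->", path, round(score, 2))
```

The first phrase is tagged [noun, verb] with a score of about 45, and the second [verb, noun] with a score of about 15.68, matching the conclusions of the worked example.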
Once step 103) has found the parts of speech of the search terms, the subsequent step 20) can more easily analyze the user's search string in real time and find the keywords in the search term.
In step 103), obtaining the feature list of a previous search term comprises: obtaining the previous search term, splitting it into analysis units (i.e. performing word segmentation), assigning attributes to the analysis units, and generating the feature list of the previous search term. The feature list includes the search term, the literal meaning of the search term, and the analysis unit attributes. Previous search terms can be extracted from the website's logs. The analysis unit attributes include attributes extracted from the literal text and hidden attributes.
The original search term previously entered by the user is divided into multiple minimal analysis units, and each minimal analysis unit is assigned attributes such as part of speech, features and possible rewrites. Preferably, a maximum matching word segmentation algorithm is used to divide the original search term entered by the user into multiple minimal analysis units. Examples of attribute assignment: when the user enters '10kg', the system automatically recognizes it as 'ten kilograms'; when the user enters the misspelling 'basket color', the system detects that this is not a word and suggests that it may be rewritten as 'blue'. Because most of the words obtained by segmentation can be found in the dictionaries of step 102), the corresponding attributes of those words are read from the dictionaries and stored together with the segmented words, for example: (Ba Baoli, brand), (women's style, style). The minimal analysis units and their attributes serve as the source data for Chinese semantic analysis.
A long product name string is broken into a list of short words; each element of the list consists of a short word and its label.
Original product name: "burberry/Ba Baoli women's five-color double-breasted belted trench coat"
List after parsing in this step: (burberry, brand), (/, symbol), (Ba Baoli, brand), (women's style, style), (five, Chinese word), (color, Chinese word), (double-breasted, style), (belt, Chinese word), (trench coat, product; main body).
Step 103) generates a feature list related to the search term; the feature list covers the literal meaning of the search term, the attributes extracted from the literal text, and hidden attributes (such as price range, color, material, model, brand and rewrites).
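Maximum matching segmentation with dictionary labels, as used to build the feature list, might look like the following Python sketch. It is an adaptation under stated assumptions: the original operates on Chinese character strings, whereas this version matches over whitespace-separated English tokens so that the example is readable. The names `max_match` and `build_feature_list`, the `DICTIONARY` contents, and the "unknown" label for unmatched tokens are all hypothetical.

```python
def max_match(tokens, dictionary, max_len=3):
    """Forward maximum matching: at each position take the longest
    run of tokens (up to max_len) that is a dictionary entry."""
    units, i = [], 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            if phrase in dictionary:
                units.append((phrase, dictionary[phrase]))
                i += size
                break
        else:
            # No dictionary entry starts here: keep the token, unlabeled.
            units.append((tokens[i], "unknown"))
            i += 1
    return units


def build_feature_list(search_term, dictionary):
    """Pair the raw search term with its labeled analysis units."""
    tokens = search_term.lower().split()
    return {"search_term": search_term, "units": max_match(tokens, dictionary)}


# Hypothetical dictionary of the kind extracted in step 102).
DICTIONARY = {
    "burberry": "brand",
    "ba baoli": "brand",
    "women's": "style",
    "double-breasted": "style",
    "trench coat": "product name",
}

features = build_feature_list(
    "burberry Ba Baoli women's five-color double-breasted belted trench coat",
    DICTIONARY,
)
for unit in features["units"]:
    print(unit)
```

Longest-first matching is what lets "trench coat" be kept as a single product-name unit instead of being split into two unlabeled tokens.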
Step 20): Perform word segmentation on the current search term, obtain the parts of speech of the current search term according to the part-of-speech sequences and feature lists of previous search terms, and determine the weights of the segmented words; then determine the search term to be searched according to the weights of the segmented words.
Step 10) above analyzes and processes past log data to obtain minimal analysis units and their attributes. The present step segments the current search term, finds the associations between the minimal analysis units, and parses the user's intent. Step 20) analyzes the current search term in real time. As shown in Fig. 5, step 20) specifically comprises the following steps:
Step 201): Perform word segmentation on the current search term.
In this step, a maximum matching word segmentation algorithm can be used to segment the search term entered by the user into multiple words. When letters appear in a new search term, the letter strings are first classified into product models, pinyin, English and units. An LALR (Look-Ahead LR) parser is used to identify models and units, and dictionary lookup is used to identify pinyin and English. The LALR parser is also used to analyze the semantic relationships between words, for example 'price no more than 500 yuan'.
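The four-way classification of letter strings can be illustrated with a small Python sketch. It deliberately swaps in a simpler technique: regular expressions stand in for the LALR parser that the text uses for models and units, while pinyin and English are identified by lookup as described. The function name `classify`, the two patterns, and the `UNITS`, `ENGLISH` and `PINYIN` word lists are hypothetical.

```python
import re

UNITS = {"kg", "g", "l", "ml", "cm", "mm"}      # hypothetical unit list
ENGLISH = {"panasonic", "burberry", "esky"}     # hypothetical English lexicon
PINYIN = {"shouji", "xiyiji"}                   # hypothetical pinyin lexicon

# A number immediately followed by letters, e.g. "10kg" or "6.5kg".
QUANTITY = re.compile(r"^(\d+(?:\.\d+)?)([a-z]+)$")
# Letters and digits mixed, optionally with hyphens, e.g. "xqb65-q6131".
MODEL = re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z0-9-]+$")


def classify(token):
    """Classify a letter-bearing token as unit, model, english or pinyin."""
    t = token.lower()
    m = QUANTITY.match(t)
    if m and m.group(2) in UNITS:
        return "unit"
    if MODEL.match(t):
        return "model"
    if t in ENGLISH:
        return "english"
    if t in PINYIN:
        return "pinyin"
    return "unknown"


for token in ("10kg", "XQB65-Q6131", "Panasonic", "shouji"):
    print(token, "->", classify(token))
```

The quantity check runs before the model check because a string such as "10kg" also satisfies the mixed letter-and-digit pattern; a grammar-based parser would resolve the same ambiguity through its rules.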
Step 202): Obtain the part of speech of each segmented word of the current search term.
In this step, the part of speech of each segmented word can be obtained with the method of step 103) above, i.e. by labeling each segmented word with its part of speech using the hidden Markov model and the Viterbi algorithm.
Step 203): According to the part of speech of each segmented word, assign different weights to the segmented words and, according to the weights, determine the search term to be searched.
When assigning weights to the segmented words, product names and brands preferably receive the largest weights. If there are several product names, the product name in the later position is taken as the main one. The weights of the other parts of speech are relatively small, and those words play a supporting role.
Based on the feature list generated in step 103) and on an understanding of the search term, the search is carried out with a more precise search term, and key attributes are given greater weight, thereby improving the search results and the recall rate.
All segmented words are first used together as the search term; if the results obtained are unsatisfactory, the segmented words with lower weights are omitted and those with higher weights are kept as the search term.
Such as:" Samsung mobile phone ", by step 103):Samsung-brand, mobile phone-product;Can by step 202) Know, Samsung-adjective, mobile phone-noun." mobile phone " weight is more than the weight of " Samsung ", and the emphasis of search is " mobile phone ".Just When beginning to search for, while retrieve " Samsung " and " mobile phone ".If not retrieving Related product, search for " mobile phone ".
In another example:" iPhone shell ", by step 103):Apple-brand, mobile phone shell-name of product;Pass through step It is rapid 202) to understand, apple-adjective, mobile phone shell-noun.The emphasis of search is " mobile phone shell ".
Again for example:" 6.5 kilograms of rotary drum washing machines of Panasonic (panasonic) XQB65-Q6131 ", can by step 103) Know:Panasonic-brand, panasonic- brands, XQB65-Q6131- models, 6.5 kilograms-capacity, rotary drum washing machine-ProductName Claim;Pass through step 202), Panasonic-adjective, panasonic- adjectives, XQB65-Q6131- nouns, 6.5 kilograms- Word, rotary drum washing machine-noun.
In this search, because " XQB65-Q61316.5 kilograms of impeller of Panasonic (panasonic) is not done washing with search term Machine " directly corresponding to result, therefore analysis system by " Panasonic's rotary drum washing machine " as new search term, scan for.
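The weighting and fallback behavior of step 203) can be sketched as below. The weight values in `TAG_WEIGHTS`, the `threshold`, the demotion factor for earlier product names, the toy `CATALOG`, and the function names are all hypothetical; the source states only that product name and brand receive the largest weights and that low-weight words are dropped when the full search is unsatisfactory. The duplicate Latin-script brand from the example is omitted for brevity.

```python
# Hypothetical attribute weights: product name and brand rank highest.
TAG_WEIGHTS = {"product": 1.0, "brand": 0.9, "model": 0.5, "capacity": 0.3}


def weigh_tokens(tagged_tokens):
    """Map [(word, attribute), ...] to [(word, weight), ...]."""
    product_positions = [i for i, (_, attr) in enumerate(tagged_tokens) if attr == "product"]
    last_product = product_positions[-1] if product_positions else None
    weighted = []
    for i, (word, attr) in enumerate(tagged_tokens):
        weight = TAG_WEIGHTS.get(attr, 0.1)
        if attr == "product" and i != last_product:
            weight *= 0.5  # with several product names, the later one prevails
        weighted.append((word, weight))
    return weighted


def search_with_fallback(tagged_tokens, search_fn, threshold=0.8):
    """Search with all words first; on an empty result keep only high-weight words."""
    weighted = weigh_tokens(tagged_tokens)
    all_words = [word for word, _ in weighted]
    results = search_fn(all_words)
    if results:
        return all_words, results
    kept = [word for word, weight in weighted if weight >= threshold]
    return kept, search_fn(kept)


# Toy catalog and search function standing in for the real search engine.
CATALOG = ["Panasonic rotary drum washing machine, 7 kilograms",
           "Samsung mobile phone"]


def toy_search(words):
    return [item for item in CATALOG if all(word in item for word in words)]


QUERY = [("Panasonic", "brand"), ("XQB65-Q6131", "model"),
         ("6.5 kilograms", "capacity"), ("rotary drum washing machine", "product")]

print(search_with_fallback(QUERY, toy_search)[0])
# -> ['Panasonic', 'rotary drum washing machine']
```

A single threshold is the simplest possible fallback policy; dropping words one at a time in ascending weight order would be an equally plausible reading of the text.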
In actual searches, a user may enter a freer search term whose meaning is less definite. The method of this embodiment of the present invention can understand the user's search statement in depth, find the focus of the search statement and the literal and hidden attributes it covers, and achieve a more accurate search effect. With the method of this embodiment, each attribute of the search term is mined in depth so that the meaning it expresses becomes clearer. Step 202) lists the part of speech of each segmented word so that search terms of different weights can be mined. Unconventional search terms or sentences are converted into search terms that the search engine can search. Some search term strings can be very long and carry very rich modifiers, and a search engine doing plain text matching on them easily returns search results of low relevance. After the original search string is parsed, the important words can be filtered out and used for searching, which improves the relevance of the results. The method of this embodiment uses big data analysis to find search terms with a low recall rate, uses machine learning to analyze them in depth and mine their attributes, strengthens the semantic understanding system, and improves the recall rate of search results.
As shown in Fig. 6, the present invention also provides a search term processing system in a search service, including:
Extraction and analysis module: for mining and crawling product data and extracting a dictionary from the product data; and for obtaining, by analysis according to the dictionary, the part-of-speech sequences and the feature list of previous search terms;
Determining module: for performing word segmentation on the current search term, obtaining the part of speech of the current search term according to the part-of-speech sequences and the feature list of the previous search terms, and determining the weights of the segmented words; and for determining the search term to be searched according to the weights of the segmented words.
In the above embodiment, preferably, the extraction and analysis module includes:
Mining submodule: for mining and crawling product data from the public Internet;
Extraction submodule: for extracting the various dictionaries from the mined product data;
Analysis submodule: for using the words in the dictionaries to analyze the part-of-speech sequences of the previous search terms and to obtain the feature list of the previous search terms.
Preferably, the analysis submodule includes:
First segmentation unit: for segmenting the sentences in the test set and then tagging each segmented word with its part of speech;
Training unit: for training three parameters, namely a first parameter Pi, a second parameter A, and a third parameter B, where the first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states; the second parameter A represents the state transition probability matrix among the N parts of speech between preceding and following words; and the third parameter B represents the confusion matrix (emission probabilities) from the N parts of speech to the M words;
Analysis unit: for using the three trained parameters to derive the part-of-speech sequences of the previous search terms.
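The three parameters handled by the training unit can be estimated from a tagged corpus by relative-frequency counting, as in the sketch below. This is one common way to train a hidden Markov model from labeled data and is an assumption here; the source does not state the estimation procedure. Pi is interpreted as the probability that a sentence starts with each part of speech, and the function name `train_hmm` and the toy `CORPUS` are hypothetical.

```python
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Estimate Pi, A and B by relative frequency from [(word, tag), ...] sentences."""
    start = Counter()
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for sentence in tagged_sentences:
        if not sentence:
            continue
        start[sentence[0][1]] += 1
        for word, tag in sentence:
            emissions[tag][word] += 1
        for (_, prev_tag), (_, next_tag) in zip(sentence, sentence[1:]):
            transitions[prev_tag][next_tag] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalize(start)                                          # first parameter Pi
    A = {tag: normalize(c) for tag, c in transitions.items()}      # second parameter A
    B = {tag: normalize(c) for tag, c in emissions.items()}        # third parameter B
    return pi, A, B


# Toy tagged corpus (assumed data, for illustration only).
CORPUS = [
    [("Samsung", "adjective"), ("mobile phone", "noun")],
    [("Panasonic", "adjective"), ("washing machine", "noun")],
    [("mobile phone", "noun"), ("shell", "noun")],
]

pi, A, B = train_hmm(CORPUS)
print(B["noun"]["mobile phone"])  # -> 0.5
```

In practice the counts would usually be smoothed so that unseen words and transitions do not receive zero probability.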
The analysis submodule also includes:
Second segmentation unit: for performing word segmentation on the previous search terms, cutting them into analysis units, and assigning attributes to the analysis units;
Generation unit: for generating the feature list of the previous search terms; the feature list includes the previous search term, the literal meaning of the previous search term, and the analysis unit attributes.
The feature list of the previous search terms is generated by the second segmentation unit and the generation unit.
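One possible shape for a feature list entry is sketched below. The field names, the `ATTRIBUTE_DICTIONARY`, and the function `build_feature_entry` are hypothetical; the source specifies only that an entry holds the previous search term, its literal meaning, and the analysis unit attributes. The literal meaning is passed in by the caller because the source does not say how it is derived.

```python
# Hypothetical attribute dictionary extracted from product data.
ATTRIBUTE_DICTIONARY = {
    "Samsung": "brand",
    "mobile phone": "product",
    "XQB65-Q6131": "model",
}


def build_feature_entry(search_term, literal_meaning, units, attribute_dictionary):
    """Build one feature list entry from an already segmented previous search term."""
    return {
        "search_term": search_term,
        "literal_meaning": literal_meaning,
        "unit_attributes": [(unit, attribute_dictionary.get(unit, "unknown"))
                            for unit in units],
    }


entry = build_feature_entry(
    "Samsung mobile phone",
    "a mobile phone of the Samsung brand",
    ["Samsung", "mobile phone"],
    ATTRIBUTE_DICTIONARY,
)
print(entry["unit_attributes"])  # -> [('Samsung', 'brand'), ('mobile phone', 'product')]
```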
Preferably, the determining module includes:
Segmentation submodule: for performing word segmentation on the current search term;
Part-of-speech analysis submodule: for analyzing the part of speech of each segmented word of the current search term;
Determination submodule: for assigning each segmented word a different weight according to its part of speech and determining the search term to be searched.
Preferably, the determining module is also used, when the search result is inaccurate, to retain the higher-weight segmented words as the search term to be searched and to perform the search again.
The system of this embodiment of the present invention can understand the user's search statement in depth, find the focus of the search statement and the literal and hidden attributes it covers, and achieve a more accurate search effect. With the system of this embodiment, each attribute of the search term is mined in depth so that the meaning it expresses becomes clearer. The part-of-speech analysis submodule analyzes the part of speech of each segmented word so that search terms of different weights can be mined. Unconventional search terms or sentences are converted into search terms that the search engine can search. Some search term strings can be very long and carry very rich modifiers, and a search engine doing plain text matching on them easily returns search results of low relevance. After the original search string is parsed, the important words can be filtered out and used for searching, which improves the relevance of the results. The system of this embodiment uses big data analysis to find search terms with a low recall rate, uses machine learning to analyze them in depth and mine their attributes, strengthens the semantic understanding system, and improves the recall rate of search results.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. The device embodiment in particular is described relatively briefly because it is substantially similar to the method embodiment; for related parts, refer to the description of the method embodiment.
A person of ordinary skill in the art will appreciate that all or part of the flows of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The foregoing is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the protection scope of the claims.

Claims (10)

1. A search term processing method in a search service, characterized in that the method comprises the following steps:
mining and crawling product data, and extracting a dictionary from the product data; and obtaining, by analysis according to the dictionary, the part-of-speech sequences and the feature list of previous search terms;
performing word segmentation on the current search term, obtaining the part of speech of the current search term according to the part-of-speech sequences and the feature list of the previous search terms, and determining the weights of the segmented words; and determining the search term to be searched according to the weights of the segmented words.
2. The method according to claim 1, characterized in that said mining and crawling product data, extracting a dictionary from the product data, and obtaining, by analysis according to the dictionary, the part-of-speech sequences and the feature list of previous search terms specifically comprises the following steps:
mining and crawling product data from the public Internet;
extracting the various dictionaries from the mined product data;
using the words in the dictionaries to analyze the part-of-speech sequences of the previous search terms and to obtain the feature list of the previous search terms.
3. The method according to claim 2, characterized in that said analyzing the part-of-speech sequences of the previous search terms comprises the following steps:
segmenting the sentences in the test set, and then tagging each segmented word with its part of speech;
training three parameters, namely a first parameter Pi, a second parameter A, and a third parameter B, wherein the first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states; the second parameter A represents the state transition probability matrix among the N parts of speech between preceding and following words; and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
using the three trained parameters to derive the part-of-speech sequences of the previous search terms.
4. The method according to claim 3, characterized in that said obtaining the feature list of the previous search terms comprises the following steps: performing word segmentation on the previous search terms, cutting them into analysis units, assigning attributes to the analysis units, and generating the feature list of the previous search terms; the feature list includes the previous search term, the literal meaning of the previous search term, and the analysis unit attributes.
5. The method according to any one of claims 1-4, characterized in that said determining the search term to be searched specifically comprises the following steps:
performing word segmentation on the current search term;
obtaining the part of speech of each segmented word of the current search term;
assigning each segmented word a different weight according to its part of speech, and determining the search term to be searched.
6. A search term processing system in a search service, characterized by comprising:
an extraction and analysis module: for mining and crawling product data and extracting a dictionary from the product data; and for obtaining, by analysis according to the dictionary, the part-of-speech sequences and the feature list of previous search terms;
a determining module: for performing word segmentation on the current search term, obtaining the part of speech of the current search term according to the part-of-speech sequences and the feature list of the previous search terms, and determining the weights of the segmented words; and for determining the search term to be searched according to the weights of the segmented words.
7. The system according to claim 6, characterized in that the extraction and analysis module comprises:
a mining submodule: for mining and crawling product data from the public Internet;
an extraction submodule: for extracting the various dictionaries from the mined product data;
an analysis submodule: for using the words in the dictionaries to analyze the part-of-speech sequences of the previous search terms and to obtain the feature list of the previous search terms.
8. The system according to claim 7, characterized in that the analysis submodule comprises:
a first segmentation unit: for segmenting the sentences in the test set and then tagging each segmented word with its part of speech;
a training unit: for training three parameters, namely a first parameter Pi, a second parameter A, and a third parameter B, wherein the first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states; the second parameter A represents the state transition probability matrix among the N parts of speech between preceding and following words; and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
an analysis unit: for using the three trained parameters to derive the part-of-speech sequences of the previous search terms.
9. The system according to claim 8, characterized in that the analysis submodule further comprises:
a second segmentation unit: for performing word segmentation on the previous search terms, cutting them into analysis units, and assigning attributes to the analysis units;
a generation unit: for generating the feature list of the previous search terms; the feature list includes the previous search term, the literal meaning of the previous search term, and the analysis unit attributes.
10. The system according to any one of claims 6-9, characterized in that the determining module comprises:
a segmentation submodule: for performing word segmentation on the current search term;
a part-of-speech analysis submodule: for analyzing the part of speech of each segmented word of the current search term;
a determination submodule: for assigning each segmented word a different weight according to its part of speech and determining the search term to be searched.
CN201610785278.3A 2016-08-30 2016-08-30 Word treatment method and system are searched in a kind of searching service Pending CN107784019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610785278.3A CN107784019A (en) 2016-08-30 2016-08-30 Word treatment method and system are searched in a kind of searching service

Publications (1)

Publication Number Publication Date
CN107784019A true CN107784019A (en) 2018-03-09

Family

ID=61451268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610785278.3A Pending CN107784019A (en) 2016-08-30 2016-08-30 Word treatment method and system are searched in a kind of searching service

Country Status (1)

Country Link
CN (1) CN107784019A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106287A (en) * 2013-03-06 2013-05-15 深圳市宜搜科技发展有限公司 Processing method and processing system for retrieving sentences by user
CN103942347A (en) * 2014-05-19 2014-07-23 焦点科技股份有限公司 Word separating method based on multi-dimensional comprehensive lexicon
CN105808526A (en) * 2016-03-30 2016-07-27 北京京东尚科信息技术有限公司 Commodity short text core word extracting method and device
CN105893533A (en) * 2016-03-31 2016-08-24 北京奇艺世纪科技有限公司 Text matching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵红丹等: "基于隐马尔科夫模型的词性标注", 《安阳师范学院学报》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083681A (en) * 2019-04-12 2019-08-02 中国平安财产保险股份有限公司 Searching method, device and terminal based on data analysis
CN110083681B (en) * 2019-04-12 2024-02-09 中国平安财产保险股份有限公司 Searching method, device and terminal based on data analysis
CN110727862A (en) * 2019-09-24 2020-01-24 苏宁云计算有限公司 Method and device for generating query strategy of commodity search
CN110727862B (en) * 2019-09-24 2022-11-08 苏宁云计算有限公司 Method and device for generating query strategy of commodity search
CN112949287A (en) * 2021-01-13 2021-06-11 平安科技(深圳)有限公司 Hot word mining method, system, computer device and storage medium
CN113836396A (en) * 2021-08-31 2021-12-24 深圳市世强元件网络有限公司 Method and system for narrowing and retrieving in industry search field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180309