CN107784019A - Method and system for processing search terms in a search service - Google Patents
- Publication number
- CN107784019A (application number CN201610785278.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
An embodiment of the invention discloses a method and system for processing search terms in a search service, belonging to the field of search technology. The method and system can analyze search terms that contain modifiers or qualifiers or that are long, thereby improving search accuracy and search-result recall and improving the user's search experience on e-commerce websites. The method comprises the following steps: mining and crawling product data, and extracting dictionaries from the product data; analyzing, according to the dictionaries, to obtain the part-of-speech sequences and feature lists of previous search terms; performing word segmentation on the current search term, obtaining the part of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, and determining the weights of the tokens; and determining the search term to be searched according to the token weights.
Description
Technical field
The present invention relates to the field of search technology, and in particular to a method and system for processing search terms in a search service.
Background
With the development of e-commerce, more and more consumers prefer to shop online. Unlike shopping in a physical store, online shopping requires the consumer to enter keywords for the desired goods or services, i.e. search terms, on an e-commerce website. At present, most e-commerce websites apply only simple processing to a search term before passing it to the site search engine for exact-match search. This approach requires the consumer to enter simple, intuitive search terms (for example, a product name) to obtain good search results. In other words, e-commerce search engines place high demands on the search terms that users enter: only clear search terms made up of single characters or paired words yield good search results and recall.
If the search term a consumer enters contains modifiers or qualifiers, or is long, the search engine cannot return highly relevant products and may even return completely unrelated ones, which harms the user's search experience on the e-commerce website.
Summary of the invention
The object of the present invention is to provide a method and system for processing search terms in a search service that can analyze and process search terms that contain modifiers or qualifiers or that are long, thereby improving search accuracy and search-result recall and improving the user's search experience on e-commerce websites.
To achieve the above object, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, an embodiment of the present invention provides a method for processing search terms in a search service, the method comprising the following steps:
mining and crawling product data, and extracting dictionaries from the product data; and analyzing, according to the dictionaries, to obtain the part-of-speech sequences and feature lists of previous search terms;
performing word segmentation on the current search term, obtaining the part of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, and determining the weights of the tokens; and determining the search term to be searched according to the token weights.
With reference to the first aspect, in a first possible implementation, mining and crawling product data, extracting dictionaries from the product data, and analyzing according to the dictionaries to obtain the part-of-speech sequences and feature lists of previous search terms specifically comprises the following steps:
mining and crawling product data from the public Internet;
extracting the various dictionaries from the mined product data;
using the words in the dictionaries, analyzing the part-of-speech sequences of previous search terms and obtaining the feature lists of the previous search terms.
With reference to the first possible implementation of the first aspect, in a second possible implementation, analyzing the part-of-speech sequences of previous search terms comprises the following steps:
segmenting the sentences in the test set into words, then tagging each token with its part of speech;
training three parameters, namely a first parameter Pi, a second parameter A and a third parameter B, where the first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states; the second parameter A represents the state transition probability matrix between the parts of speech of adjacent words, over the N parts of speech; and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
using the three trained parameters to derive the part-of-speech sequences of the previous search terms.
With reference to the second possible implementation of the first aspect, in a third possible implementation, obtaining the feature list of a previous search term comprises the following steps: performing word segmentation on the previous search term to split it into analysis units, assigning attributes to the analysis units, and generating the feature list of the previous search term. The feature list includes the previous search term, the literal meaning of the previous search term, and the analysis-unit attributes.
With reference to the first aspect, or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation, determining the search term to be searched specifically comprises the following steps:
performing word segmentation on the current search term;
obtaining the part of speech of each token after the current search term is segmented;
assigning each token a different weight according to its part of speech, and determining the search term to be searched.
In a second aspect, an embodiment of the present invention further provides a system for processing search terms in a search service, comprising:
an extraction and analysis module, configured to mine and crawl product data, extract dictionaries from the product data, and analyze, according to the dictionaries, to obtain the part-of-speech sequences and feature lists of previous search terms;
a determination module, configured to perform word segmentation on the current search term, obtain the part of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, determine the weights of the tokens, and determine the search term to be searched according to the token weights.
With reference to the second aspect, in a first possible implementation, the extraction and analysis module comprises:
a mining submodule, configured to mine and crawl product data from the public Internet;
an extraction submodule, configured to extract the various dictionaries from the mined product data;
an analysis submodule, configured to use the words in the dictionaries to analyze the part-of-speech sequences of previous search terms and to obtain the feature lists of the previous search terms.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the analysis submodule comprises:
a first segmentation unit, configured to segment the sentences in the test set into words and then tag each token with its part of speech;
a training unit, configured to train three parameters, namely a first parameter Pi, a second parameter A and a third parameter B, where the first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states; the second parameter A represents the state transition probability matrix between the parts of speech of adjacent words, over the N parts of speech; and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
an analysis unit, configured to use the three trained parameters to derive the part-of-speech sequences of the previous search terms.
With reference to the second possible implementation of the second aspect, in a third possible implementation, the analysis submodule further comprises:
a second segmentation unit, configured to perform word segmentation on a previous search term to split it into analysis units, and to assign attributes to the analysis units;
a generation unit, configured to generate the feature list of the previous search term, where the feature list includes the previous search term, the literal meaning of the previous search term, and the analysis-unit attributes.
With reference to the second aspect, or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation, the determination module comprises:
a segmentation submodule, configured to perform word segmentation on the current search term;
a part-of-speech analysis submodule, configured to analyze the part of speech of each token after the current search term is segmented;
a determination submodule, configured to assign each token a different weight according to its part of speech and to determine the search term to be searched.
Compared with the prior art, the method and system for processing search terms in a search service according to the embodiments of the present invention can improve search accuracy and search-result recall and improve the user's search experience on e-commerce websites. The method performs word segmentation on the current search term, obtains the part of speech of the current search term according to the part-of-speech sequences and feature lists of previous search terms, determines the weights of the tokens, and determines the search term to be searched according to the token weights. By performing deep semantic understanding and analysis of the search term entered by the user, this embodiment determines the user's real search intent and performs a filtered search with the understood search term, which greatly improves search accuracy and search-result recall.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed in the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the system architecture provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the analysis method provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of step 10) of the analysis method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of step 103) of the analysis method provided by an embodiment of the present invention;
Fig. 5 is a schematic flowchart of step 20) of the analysis method provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the analysis system provided by an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the technical solutions of the present invention, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions.
The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it. Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in this specification indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The wording "and/or" as used herein includes any and all combinations of one or more of the associated listed items. Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries are to be understood as having meanings consistent with their meanings in the context of the prior art and, unless specifically defined herein, are not to be interpreted in an idealized or overly formal sense.
The method flow of this embodiment may be executed on a processing system for a search service as shown in Fig. 1, which includes a network, an extraction and analysis module, a determination module and a database. The extraction and analysis module and the determination module disclosed in this embodiment may each be a device such as a server, a workstation or a supercomputer, or a server cluster system for data processing composed of multiple servers.
The database may be a Redis database or another type of database, such as a distributed database or a relational database. It may be implemented as a data server that includes storage together with storage devices connected to the data server, or as a server cluster system for a database composed of multiple data servers and storage servers.
In this embodiment, the extraction and analysis module may be used to extract previous search terms from search logs, for example by extracting and analyzing the search logs generated by users over the most recent week, month or other time range. Preprocessing rules are first applied to normalize the extracted search terms into a form that can be processed correctly. The extracted search terms are then assembled into dictionaries, and the part-of-speech sequences and feature lists of the previous search terms are obtained. The dictionaries and the part-of-speech sequences and feature lists of the previous search terms are stored in the database, for example a Redis database, so that the semantic module can read the data by accessing and querying the database.
In this embodiment, the database may be used to store the search logs generated while the system runs (for example, search logs containing the search terms sent consecutively by a user terminal within one session period), as well as the dictionaries and the part-of-speech sequences and feature lists of previous search terms generated by the extraction and analysis module. The database may also serve as the database of a public character-library and dictionary platform and of an open e-commerce resource platform, or may be connected to the databases of such platforms and exchange data with them. Alternatively, while exchanging data with the database, the extraction and analysis module may also be connected to, and exchange data with, the databases of the public character-library and dictionary platform and the open e-commerce resource platform.
In this embodiment, the determination module may be implemented as a search server or a server cluster for the search service. The determination module analyzes the semantics of the current search term according to the part-of-speech sequences and feature lists of previous search terms, performs the search according to the determined semantics, and returns the search results to the user. The user equipment may be a standalone device, or may be integrated into various media playback devices, such as a set-top box, mobile phone, tablet computer (Tablet Personal Computer), laptop computer (Laptop Computer), multimedia player, digital camera, personal digital assistant (PDA), navigation device, mobile Internet device (Mobile Internet Device, MID) or wearable device (Wearable Device).
A method for processing search terms in a search service according to an embodiment of the present invention, as shown in Fig. 2, includes the following two steps:
S10: Mine and crawl product data, and extract dictionaries from the product data; then, according to the dictionaries, analyze and obtain the part-of-speech sequences and feature lists of previous search terms.
Preferably, as shown in Fig. 3, S10 includes steps 101) to 103).
Step 101): Mine and crawl product data from the public Internet.
Here, the public Internet includes network platforms open to the public, such as public character-library and dictionary platforms and open e-commerce resource platforms. Obtaining product data from the public Internet broadens the sources of data and improves the precision of the dictionaries built later. In this step, data mining and web crawler techniques can be used to mine and crawl product data from the public Internet.
Product data is data related to products, mainly product names and product attributes. Product attributes include product descriptions such as product type, model, size, capacity and brand. For example, the product data may be "Panasonic (Panasonic) XQB65-Q6131 6.5 kg", which includes descriptions such as the product name, brand, model and capacity. As another example, the product data may be "pulsator washing machine 6.5 kg", which includes descriptions such as the product name, product type and product capacity.
Step 102): Extract the various dictionaries from the product data mined in step 101).
This step classifies the product data mined in step 101) and builds the dictionaries. Machine learning and data mining techniques can be used to extract the various dictionaries from the mined product data. The extracted dictionaries include product name, model, brand, color, material, special features, style and so on. For example, if the product data extracted in step 101) is "Burberry/Ba Baoli women's five-color double-breasted belted trench coat", step 102) extracts the dictionary entries: Burberry/Ba Baoli → brand, double-breasted → style, trench coat → product name. As another example, if the product data extracted in step 101) is "Ode to Joy Liu Tao Andy same-style suit vest, women's spring and autumn vest jacket, short style, versatile, summer 2016", step 102) extracts the dictionary entries: short style → style, versatile → style, summer → season, vest jacket → product name, suit vest → product name. As a further example, if the product data extracted in step 101) is "Esky baby nightwear, 3 months to 6 months", step 102) extracts the dictionary entries: Esky → brand, baby → target group, months → unit_age, nightwear → product name.
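The outcome of step 102) can be pictured as a mapping from each category to the set of words that belong to it. The source does not say how the machine learning extraction works, so the following is only a minimal sketch that assumes the crawled product data is already available as structured records. The record fields, the sample values and the function names `build_dictionaries` and `categories_of` are all hypothetical.

```python
from collections import defaultdict

def build_dictionaries(records):
    """Invert structured product records into {category: set of words}."""
    dictionaries = defaultdict(set)
    for record in records:
        for category, value in record.items():
            dictionaries[category].add(value.lower())
    return dict(dictionaries)

def categories_of(dictionaries, word):
    """Return the sorted categories whose dictionary contains the word."""
    word = word.lower()
    return sorted(c for c, words in dictionaries.items() if word in words)

# Hypothetical, already-structured product records (illustrative only).
records = [
    {"brand": "Burberry", "style": "double-breasted", "product name": "trench coat"},
    {"brand": "Esky", "target group": "baby", "product name": "nightwear"},
]
dictionaries = build_dictionaries(records)
print(categories_of(dictionaries, "Burberry"))
print(sorted(dictionaries["product name"]))
```

Keeping one set per category makes the later attribute lookup for a segmented word a simple membership test.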
Step 103): Using the words in the dictionaries extracted in step 102), analyze the part-of-speech sequences of previous search terms and obtain the feature lists of the previous search terms.
In this preferred example, the Viterbi algorithm and a hidden Markov model (HMM) can be used to find the most likely part-of-speech sequence of a previous search term, so that part-of-speech labels can be added to the words in the dictionaries.
As shown in Fig. 4, adding labels to the words in the dictionaries in step 103) includes:
Step 1031): Segment the sentences in the test set into words, then tag each token with its part of speech.
For example: Xiao Ming/n /adv writes/v homework/n.
Here, n denotes a noun, adv an adverb and v a verb. The observed states are the words in the sentence, and the hidden states are the parts of speech.
Step 1032): Train three parameters: the first parameter Pi, the second parameter A and the third parameter B.
The first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states. Pi is obtained directly from frequency statistics of the parts of speech.
The second parameter A represents the state transition probability matrix P(S_t | S_{t-1}) between the parts of speech of adjacent words, over the N parts of speech. P(S_t | S_{t-1}) equals the number of times the two parts of speech (S_{t-1}, S_t) occur in that order divided by the number of times part of speech S_{t-1} occurs.
The third parameter B represents the confusion matrix P(O_t | S_t) from the N parts of speech to the M words. P(O_t | S_t) equals the number of times word O_t and part of speech S_t occur together divided by the number of times part of speech S_t occurs.
Assuming there are N parts of speech and M words in total, the first parameter Pi is a vector of length N, the second parameter A is an N*N matrix, and the third parameter B is an N*M matrix.
When tagging a sentence with parts of speech, make sure that every word produced by segmentation is among the M words.
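The counting rules for Pi, A and B can be sketched as follows. This is an illustrative implementation of the frequency formulas stated in the text, not the patent's actual code. The function name `train_hmm`, the tag names and the two-sentence corpus are assumptions made for the example. As in the text, Pi is kept as raw frequency counts.

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Estimate Pi, A and B by counting over sentences of (word, tag) pairs."""
    tag_count = Counter()
    transition_count = defaultdict(Counter)  # previous tag -> current tag -> count
    emission_count = defaultdict(Counter)    # tag -> word -> count
    for sentence in tagged_sentences:
        previous_tag = None
        for word, tag in sentence:
            tag_count[tag] += 1
            emission_count[tag][word] += 1
            if previous_tag is not None:
                transition_count[previous_tag][tag] += 1
            previous_tag = tag
    # Pi: raw part-of-speech frequencies.
    pi = dict(tag_count)
    # A: count(prev, cur) / count(prev).
    a = {prev: {cur: n / tag_count[prev] for cur, n in nxt.items()}
         for prev, nxt in transition_count.items()}
    # B: count(word, tag) / count(tag).
    b = {tag: {word: n / tag_count[tag] for word, n in words.items()}
         for tag, words in emission_count.items()}
    return pi, a, b

# Hypothetical tagged corpus (illustrative only).
corpus = [
    [("Xiao Ming", "n"), ("writes", "v"), ("homework", "n")],
    [("Xiao Ming", "n"), ("takes", "v"), ("examination", "n")],
]
pi, a, b = train_hmm(corpus)
print(pi, a, b)
```

Because the denominator counts every occurrence of the previous tag, including sentence-final ones, the rows of A need not sum to 1 under this formula.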
Step 1033): Use the three parameters trained in step 1032) to derive the part-of-speech sequences of previous search terms.
An example follows. Suppose there are only 2 parts of speech and 3 words. The 2 parts of speech are noun and verb. The 3 words are "examination", "Xiao Ming" and "concern". Suppose that in the statistics of previously seen sentences, nouns occur 100 times and verbs occur 70 times.
First parameter Pi:
| Noun | Verb |
| 100 | 70 |
The second parameter A is the part-of-speech transition probability matrix between adjacent words. The Y-axis (rows) gives the part of speech of the word to the left of the current position S, and the X-axis (columns) gives the part of speech of the word at the current position S. As shown in the table below, both axes are noun and verb, so the possible part-of-speech pairs for adjacent words are noun and noun, noun and verb, verb and noun, and verb and verb.
| | Noun | Verb |
| Noun | 0.25 | 0.75 |
| Verb | 0.8 | 0.2 |
The third parameter B is the emission matrix from parts of speech to words, as shown in the table below.
| | Examination | Concern | Xiao Ming |
| Noun | 0.4 | 0.3 | 1 |
| Verb | 0.6 | 0.7 | 0 |
Suppose we analyze the parts of speech of the two phrases "Xiao Ming examination" and "concern examination".
Xiao Ming examination
Step 1: find the likely part of speech of "Xiao Ming":
Noun frequency in Pi * B(Xiao Ming is a noun) = 100*1 = 100
Verb frequency in Pi * B(Xiao Ming is a verb) = 70*0 = 0
Step 1 conclusion: "Xiao Ming" is most likely a noun.
Step 2: find the likely part of speech of "examination":
100 * A(noun followed by noun) * B(examination is a noun) = 100*0.25*0.4 = 10
100 * A(noun followed by verb) * B(examination is a verb) = 100*0.75*0.6 = 45
Step 2 conclusion: "examination" is most likely a verb.
Conclusion: [Xiao Ming examination] → [noun verb]
Concern examination
Step 1: find the likely part of speech of "concern":
Noun frequency in Pi * B(concern is a noun) = 100*0.3 = 30
Verb frequency in Pi * B(concern is a verb) = 70*0.7 = 49
Step 1 conclusion: "concern" is most likely a verb.
Step 2: find the likely part of speech of "examination":
49 * A(verb followed by noun) * B(examination is a noun) = 49*0.8*0.4 = 15.68
49 * A(verb followed by verb) * B(examination is a verb) = 49*0.2*0.6 = 5.88
Step 2 conclusion: "examination" is most likely a noun.
Conclusion: [concern examination] → [verb noun]
In the steps above, the Viterbi algorithm and a hidden Markov model (HMM) are used to find the most likely part-of-speech sequence of a sentence, i.e. the part-of-speech sequence with the highest probability. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that produces the observed sequence of events. It records the most likely hidden state and its probability at each step, which avoids recomputation as each step advances.
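The Viterbi recursion can be sketched in a few lines. The example below is illustrative rather than the patent's implementation; it reuses the Pi, A and B values from the worked example, with Pi left as raw frequencies as in the text. The function name `viterbi` and the dictionary layout are assumptions. The worked example follows only the single best state from the previous step, whereas this sketch keeps the best score for every state at each step; on these two phrases both approaches select the same tags.

```python
def viterbi(observations, states, pi, a, b):
    """Return (best_path, best_score) for a sequence of observed words."""
    # scores[s] is the best score of any tag sequence ending in state s.
    scores = {s: pi[s] * b[s].get(observations[0], 0.0) for s in states}
    backpointers = []
    for word in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best previous state for reaching s at this position.
            best_prev = max(states, key=lambda p: scores[p] * a[p][s])
            new_scores[s] = scores[best_prev] * a[best_prev][s] * b[s].get(word, 0.0)
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]

states = ["noun", "verb"]
pi = {"noun": 100, "verb": 70}
a = {"noun": {"noun": 0.25, "verb": 0.75},
     "verb": {"noun": 0.8, "verb": 0.2}}
b = {"noun": {"examination": 0.4, "concern": 0.3, "Xiao Ming": 1.0},
     "verb": {"examination": 0.6, "concern": 0.7, "Xiao Ming": 0.0}}

path_1, score_1 = viterbi(["Xiao Ming", "examination"], states, pi, a, b)
path_2, score_2 = viterbi(["concern", "examination"], states, pi, a, b)
print(path_1, path_2)
```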
Once step 103) has found the parts of speech of the search terms, the subsequent step 20) can more easily analyze the user's search string in real time and find the keywords in the search term.
In step 103), the feature list of a previous search term is obtained as follows: obtain the previous search term, split it into analysis units (i.e. perform word segmentation), assign attributes to the analysis units, and generate the feature list of the previous search term. The feature list includes the search term, the literal meaning of the search term, and the analysis-unit attributes. Previous search terms can be extracted from the website's logs. The analysis-unit attributes include attributes extracted from the literal text and hidden attributes.
The original search term previously entered by a user is divided into multiple minimal analysis units, and each minimal analysis unit is assigned attributes such as its part of speech, features and possible rewrites. Preferably, a maximum-matching segmentation algorithm is used to divide the user's original search term into the minimal analysis units. Examples of attribute assignment: when a user enters '10kg', the system automatically recognizes it as 'ten kilograms'; when a user enters 'basket color' (in Chinese, a homophone of 'blue'), the system detects that this is not a word and suggests that it may be rewritten as 'blue'. Because most of the segmented words can be found in the dictionaries from step 102), the corresponding attributes of the words are read from the dictionaries and stored together with the segmented words. For example: (Ba Baoli, brand), (women's style, style). The minimal analysis units and their attributes serve as the source data for Chinese semantic analysis.
A long product name string is broken into a list of short words, where each element of the list consists of a short word and its label.
Original product name: "Burberry/Ba Baoli women's five-color double-breasted belted trench coat"
List after parsing in this step: (burberry, brand), (/, symbol), (Ba Baoli, brand), (women's style, style), (five, Chinese word), (color, Chinese word), (double-breasted, style), (belt, Chinese word), (trench coat, product; main subject).
Step 103) thus generates a feature list related to the search term. The feature list covers the literal meaning of the search term, the attributes extracted from the literal text, and hidden attributes (such as price range, color, material, model, brand and rewrites).
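Maximum-matching segmentation with attribute lookup can be sketched as below. The patent applies it to Chinese character strings; this illustration is an adaptation that matches the longest dictionary phrase over whitespace-separated English words. The dictionary contents, the fallback label "word", the `max_len` limit and the function name `max_match` are all assumptions.

```python
def max_match(words, dictionary, max_len=3):
    """Forward maximum matching: at each position take the longest known phrase."""
    features, i = [], 0
    while i < len(words):
        # Try the longest candidate first, then shrink down to a single word.
        for size in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase in dictionary or size == 1:
                # Unknown single words fall back to the generic label "word".
                features.append((phrase, dictionary.get(phrase, "word")))
                i += size
                break
    return features

# Hypothetical dictionary built from the category dictionaries of step 102).
dictionary = {
    "burberry": "brand",
    "women's": "style",
    "double-breasted": "style",
    "trench coat": "product name",
}
query = "burberry women's five color double-breasted belt trench coat"
feature_list = max_match(query.split(), dictionary)
print(feature_list)
```

Trying the longest phrase first is what keeps "trench coat" together as one product name instead of splitting it into two unlabeled words.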
Step 20): Perform word segmentation on the current search term; obtain the part of speech of the current search term according to the part-of-speech sequences and feature lists of previous search terms, and determine the token weights; determine the search term to be searched according to the token weights.
Step 10) above analyzes and processes past log data to obtain the minimal analysis units and their attributes. This step segments the current search term, finds the associations between the minimal analysis units, and parses the user's intent. Step 20) analyzes the current search term in real time. As shown in Fig. 5, step 20) includes the following steps:
Step 201): Perform word segmentation on the current search term.
This step can use a maximum-matching segmentation algorithm to segment the search term entered by the user into multiple words. When letters appear in a new search term, the letters are first classified as a product model, pinyin, English or a unit. An LALR (Look-Ahead LR) parser is used to identify models and units, and dictionary lookup is used to identify pinyin and English. The LALR parser is also used to analyze the semantic relations between words, for example 'price no more than 500 yuan'.
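A full LALR grammar is beyond the scope of a short example, so the sketch below uses regular expressions as a simplified stand-in for the model and unit recognition, plus set lookups for pinyin and English. It is not the parser the patent describes. The unit list, the word lists, the patterns and the function name `classify` are all hypothetical.

```python
import re

# Hypothetical lookup tables; a real system would load these from its dictionaries.
UNITS = {"kg", "g", "l", "ml", "cm"}
PINYIN = {"shouji", "xiyiji"}
ENGLISH = {"panasonic", "burberry"}

QUANTITY = re.compile(r"^\d+(?:\.\d+)?([a-z]+)$")        # e.g. 10kg, 6.5kg
MODEL = re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z0-9-]+$")   # mixed letters and digits

def classify(token):
    """Classify a token that contains letters as unit, model, pinyin or english."""
    token = token.lower()
    quantity = QUANTITY.match(token)
    if quantity and quantity.group(1) in UNITS:
        return "unit"
    if MODEL.match(token):
        return "model"
    if token in PINYIN:
        return "pinyin"
    if token in ENGLISH:
        return "english"
    return "unknown"

for token in ["10kg", "XQB65-Q6131", "Panasonic", "shouji"]:
    print(token, classify(token))
```

Checking the quantity pattern before the model pattern matters, because a token such as 10kg also satisfies the looser model pattern.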
Step 202): Obtain the part of speech of each token after the current search term is segmented.
In this step, the part of speech of each token can be determined by the method of step 103) above, i.e. by tagging each token using the hidden Markov model and the Viterbi algorithm.
Step 203): According to the part of speech of each token, assign each token a different weight, and determine the search term to be searched according to the weights.
When assigning weights, product names and brands preferably receive the greatest weight. If there are multiple product names, the product name that appears later is taken as the main one. The weights of other parts of speech are relatively small, and these tokens play a supporting role.
Based on the feature list generated in step 103), and on an understanding of the search term, the search is carried out with a more precise search term, and key attributes are given greater weight, which improves search results and recall.
After all participles scan for as search term, when acquisition result is undesirable, then the relatively low participle of weight is saved
Slightly, the larger participle of weight is retained as search term.
Such as:" Samsung mobile phone ", by step 103):Samsung-brand, mobile phone-product;Can by step 202)
Know, Samsung-adjective, mobile phone-noun." mobile phone " weight is more than the weight of " Samsung ", and the emphasis of search is " mobile phone ".Just
When beginning to search for, while retrieve " Samsung " and " mobile phone ".If not retrieving Related product, search for " mobile phone ".
In another example:" iPhone shell ", by step 103):Apple-brand, mobile phone shell-name of product;Pass through step
It is rapid 202) to understand, apple-adjective, mobile phone shell-noun.The emphasis of search is " mobile phone shell ".
As a further example, for "Panasonic (panasonic) XQB65-Q6131 6.5 kg pulsator washing machine", step 103) gives: Panasonic: brand; panasonic: brand; XQB65-Q6131: model; 6.5 kg: capacity; pulsator washing machine: product name. Step 202) gives: Panasonic: adjective; panasonic: adjective; XQB65-Q6131: noun; 6.5 kg: quantity word; pulsator washing machine: noun.
In this search, because no result corresponds directly to the search term "Panasonic (panasonic) XQB65-Q6131 6.5 kg pulsator washing machine", the analysis system uses "Panasonic pulsator washing machine" as the new search term and searches again.
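The weighting and retry behaviour of step 203) can be sketched in Python as follows. This is only an illustration under assumptions: the numeric weights, the position bonus for later product names, the helper names `assign_weights` and `search_with_fallback`, and the toy catalog and `toy_search` function are hypothetical. The source states only that product names and brands receive the largest weights, that a later product name takes precedence, and that low-weight words are dropped when results are unsatisfactory.

```python
# Assumed weights per attribute; only their ordering is suggested by the examples.
ATTRIBUTE_WEIGHTS = {"product": 1.0, "brand": 0.8, "model": 0.5, "capacity": 0.3}
DEFAULT_WEIGHT = 0.1


def assign_weights(tokens):
    """tokens: list of (text, attribute) pairs in query order."""
    weighted = []
    for position, (text, attribute) in enumerate(tokens):
        weight = ATTRIBUTE_WEIGHTS.get(attribute, DEFAULT_WEIGHT)
        if attribute == "product":
            # A product name in a later position takes precedence.
            weight += 0.01 * position
        weighted.append((text, weight))
    return weighted


def search_with_fallback(tokens, search, min_results=1):
    """Search with all words first, then drop the lowest-weight word and retry."""
    weights = dict(assign_weights(tokens))
    kept = [text for text, _ in tokens]
    while kept:
        results = search(kept)
        if len(results) >= min_results:
            return kept, results
        weakest = min(kept, key=weights.get)
        kept = [text for text in kept if text != weakest]
    return [], []


# Hypothetical catalog and matcher standing in for a real search engine.
CATALOG = ["Huawei mobile phone", "Xiaomi mobile phone"]


def toy_search(terms):
    return [title for title in CATALOG if all(term in title for term in terms)]


query = [("Samsung", "brand"), ("mobile phone", "product")]
print(search_with_fallback(query, toy_search))
```

In this toy run the first search with both words finds nothing, so "Samsung" is dropped and "mobile phone" alone is searched, mirroring the first example above.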
In actual searches, users may enter freer search terms whose meaning is less explicit. The method of this embodiment can understand the user's search statement in depth, identify its focus together with the literal and hidden attributes it contains, and thereby achieve more accurate search results. The method mines each attribute of the search term in depth so that its intended meaning is expressed more clearly. Step 202) lists the part of speech of each segmented word so that search terms with different weights can be mined. Unconventional search words or sentences are converted into search terms that a search engine can search. Some search strings are very long and carry many modifiers, so a search engine that performs plain-text matching easily returns results of low relevance. After the original search string is parsed, the important words can be filtered out and used for the search, which improves the relevance of the results. The method of this embodiment uses big-data analysis to find search terms with low recall, applies machine learning for in-depth analysis, mines attributes, and strengthens the semantic understanding system, thereby improving search recall.
As shown in Fig. 6, the present invention also provides a search term processing system for a search service, including:
Extraction and analysis module: used to mine and crawl product data and extract a dictionary from the product data, and to analyze, according to the dictionary, and obtain the part-of-speech sequences and feature lists of previous search terms;
Determining module: used to perform word segmentation on the current search term, obtain the parts of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, and determine the weights of the segmented words; and to determine the search term to be searched according to the weights of the segmented words.
In the above embodiment, preferably, the extraction and analysis module includes:
Mining submodule: used to mine and crawl product data from the public Internet;
Extraction submodule: used to extract dictionaries of various types from the mined product data;
Analysis submodule: used to analyze the part-of-speech sequences of previous search terms using the words in the dictionaries, and to obtain the feature lists of the previous search terms.
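As a rough illustration of what the extraction submodule produces, the following Python sketch builds one dictionary per attribute type from already-crawled product records. The record layout (fields such as `brand`, `product`, `model`) and the function names `build_dictionaries` and `attribute_lookup` are assumptions made for this example; the source does not specify how the product data are structured.

```python
from collections import defaultdict


def build_dictionaries(products):
    """Collect the distinct values of each attribute into its own dictionary.

    products: iterable of dicts, one per crawled product record (assumed layout).
    Returns {attribute: set of words}.
    """
    dictionaries = defaultdict(set)
    for record in products:
        for attribute, value in record.items():
            if value and value.strip():
                dictionaries[attribute].add(value.strip())
    return dict(dictionaries)


def attribute_lookup(dictionaries):
    """Invert the dictionaries into {word: attribute} for later tagging.

    If a word occurs under several attributes, the attribute seen last wins.
    """
    return {word: attribute
            for attribute, words in dictionaries.items()
            for word in words}


# Hypothetical product records, for illustration only.
PRODUCTS = [
    {"brand": "Samsung", "product": "mobile phone"},
    {"brand": "Panasonic", "product": "pulsator washing machine",
     "model": "XQB65-Q6131"},
    {"brand": "Samsung ", "product": "mobile phone", "model": ""},
]

DICTIONARIES = build_dictionaries(PRODUCTS)
LOOKUP = attribute_lookup(DICTIONARIES)
print(sorted(DICTIONARIES["brand"]))
```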
Preferably, the analysis submodule includes:
First word segmentation unit: used to segment the sentences in the test set and then tag each segmented word with its part of speech;
Training unit: used to train 3 parameters, namely a first parameter Pi, a second parameter A and a third parameter B, where the first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states; the second parameter A represents the state transition probability matrix among the N parts of speech for adjacent words; and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
Analysis unit: used to derive the part-of-speech sequences of previous search terms using the 3 trained parameters.
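One common way to obtain the three parameters from a tagged corpus is simple relative-frequency counting, sketched below in Python. This is an assumption about how the training unit could work, not a statement of how it does work: the function name `train_hmm`, the corpus format, and the absence of smoothing are choices made for this example. Here B plays the role of the confusion matrix, often called the emission matrix in HMM literature.

```python
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Estimate Pi, A and B by counting over sentences of (word, tag) pairs."""
    start = Counter()                 # tag of the first word in each sentence
    transitions = defaultdict(Counter)  # previous tag -> next tag counts
    emissions = defaultdict(Counter)    # tag -> word counts
    for sentence in tagged_sentences:
        if not sentence:
            continue
        start[sentence[0][1]] += 1
        for word, tag in sentence:
            emissions[tag][word] += 1
        for (_, prev_tag), (_, next_tag) in zip(sentence, sentence[1:]):
            transitions[prev_tag][next_tag] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalize(start)
    A = {tag: normalize(counts) for tag, counts in transitions.items()}
    B = {tag: normalize(counts) for tag, counts in emissions.items()}
    return pi, A, B


# Hypothetical tagged corpus, for illustration only.
CORPUS = [
    [("Samsung", "adjective"), ("mobile phone", "noun")],
    [("Apple", "adjective"), ("phone case", "noun")],
    [("mobile phone", "noun")],
]

pi, A, B = train_hmm(CORPUS)
print(pi, A, B)
```

With so little data many probabilities are exactly 0 or 1, which is why a real system would add smoothing before decoding.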
The analysis submodule further includes:
Second word segmentation unit: used to perform word segmentation on previous search terms, cutting them into analysis units and assigning attributes to the analysis units;
Generation unit: used to generate the feature list of a previous search term; the feature list includes the previous search term, the literal meaning of the previous search term, and the attributes of the analysis units.
The feature list of a previous search term is generated by the second word segmentation unit and the generation unit.
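The cutting of a search term into analysis units with attributes can be sketched in Python with a dictionary-based longest-match segmenter. Longest-match segmentation is swapped in here as a simple stand-in, since the source does not name the segmentation algorithm; the names `segment` and `build_feature_list`, the `"unknown"` attribute, the lexicon contents, and the representation of the literal meaning as the list of unit texts are all hypothetical.

```python
def segment(query, lexicon):
    """Cut `query` into (unit, attribute) pairs using longest dictionary match.

    lexicon: {word: attribute}. Text not covered by the lexicon is emitted
    as whitespace-delimited units with the attribute "unknown".
    """
    entries = sorted((entry for entry in lexicon if entry), key=len, reverse=True)
    units = []
    i = 0
    while i < len(query):
        if query[i].isspace():
            i += 1
            continue
        match = next((e for e in entries if query.startswith(e, i)), None)
        if match is None:
            end = i
            while end < len(query) and not query[end].isspace():
                end += 1
            units.append((query[i:end], "unknown"))
            i = end
        else:
            units.append((match, lexicon[match]))
            i += len(match)
    return units


def build_feature_list(query, lexicon):
    """Assemble an assumed feature-list record for one search term."""
    units = segment(query, lexicon)
    return {
        "search_term": query,
        "literal_meaning": [text for text, _ in units],  # assumed representation
        "unit_attributes": units,
    }


# Hypothetical lexicon, for illustration only.
LEXICON = {"Samsung": "brand", "Apple": "brand",
           "mobile phone": "product", "phone case": "product"}

print(build_feature_list("Samsung mobile phone", LEXICON))
```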
Preferably, the determining module includes:
Word segmentation submodule: used to perform word segmentation on the current search term;
Part-of-speech analysis submodule: used to analyze the part of speech of each segmented word after the current search term has been segmented;
Determination submodule: used to assign each segmented word a different weight according to its part of speech and to determine the search term to be searched.
Preferably, the determining module is further used, when the search results are inaccurate, to retain the segmented words with larger weights as the search term to be searched and to perform the search again.
The system of this embodiment can understand the user's search statement in depth, identify its focus together with the literal and hidden attributes it contains, and thereby achieve more accurate search results. The system mines each attribute of the search term in depth so that its intended meaning is expressed more clearly. The part-of-speech analysis submodule analyzes the part of speech of each segmented word so that search terms with different weights can be mined. Unconventional search words or sentences are converted into search terms that a search engine can search. Some search strings are very long and carry many modifiers, so a search engine that performs plain-text matching easily returns results of low relevance. After the original search string is parsed, the important words can be filtered out and used for the search, which improves the relevance of the results. The system of this embodiment uses big-data analysis to find search terms with low recall, applies machine learning for in-depth analysis, mines attributes, and strengthens the semantic understanding system, thereby improving search recall.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. The device embodiment in particular is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment.
A person of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.
Claims (10)
1. A method for processing search terms in a search service, characterized in that the method comprises the following steps:
Mining and crawling product data, and extracting a dictionary from the product data; and analyzing, according to the dictionary, to obtain the part-of-speech sequences and feature lists of previous search terms;
Performing word segmentation on the current search term, obtaining the parts of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, and determining the weights of the segmented words; and determining the search term to be searched according to the weights of the segmented words.
2. The method according to claim 1, characterized in that the mining and crawling of product data, the extracting of a dictionary from the product data, and the analyzing, according to the dictionary, to obtain the part-of-speech sequences and feature lists of previous search terms specifically comprise the following steps:
Mining and crawling product data from the public Internet;
Extracting dictionaries of various types from the mined product data;
Analyzing the part-of-speech sequences of previous search terms using the words in the dictionaries, and obtaining the feature lists of the previous search terms.
3. The method according to claim 2, characterized in that the analyzing of the part-of-speech sequences of previous search terms comprises the following steps:
Segmenting the sentences in the test set, and then tagging each segmented word with its part of speech;
Training 3 parameters, namely a first parameter Pi, a second parameter A and a third parameter B, wherein the first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states; the second parameter A represents the state transition probability matrix among the N parts of speech for adjacent words; and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
Deriving the part-of-speech sequences of the previous search terms using the 3 trained parameters.
4. The method according to claim 3, characterized in that the obtaining of the feature lists of previous search terms comprises the following steps: performing word segmentation on a previous search term, cutting it into analysis units, assigning attributes to the analysis units, and generating the feature list of the previous search term; the feature list includes the previous search term, the literal meaning of the previous search term, and the attributes of the analysis units.
5. The method according to any one of claims 1 to 4, characterized in that the determining of the search term to be searched specifically comprises the following steps:
Performing word segmentation on the current search term;
Obtaining the part of speech of each segmented word after the current search term has been segmented;
Assigning each segmented word a different weight according to its part of speech, and determining the search term to be searched.
6. A search term processing system for a search service, characterized by comprising:
Extraction and analysis module: used to mine and crawl product data and extract a dictionary from the product data, and to analyze, according to the dictionary, and obtain the part-of-speech sequences and feature lists of previous search terms;
Determining module: used to perform word segmentation on the current search term, obtain the parts of speech of the current search term according to the part-of-speech sequences and feature lists of the previous search terms, and determine the weights of the segmented words; and to determine the search term to be searched according to the weights of the segmented words.
7. The system according to claim 6, characterized in that the extraction and analysis module includes:
Mining submodule: used to mine and crawl product data from the public Internet;
Extraction submodule: used to extract dictionaries of various types from the mined product data;
Analysis submodule: used to analyze the part-of-speech sequences of previous search terms using the words in the dictionaries, and to obtain the feature lists of the previous search terms.
8. The system according to claim 7, characterized in that the analysis submodule includes:
First word segmentation unit: used to segment the sentences in the test set and then tag each segmented word with its part of speech;
Training unit: used to train 3 parameters, namely a first parameter Pi, a second parameter A and a third parameter B, wherein the first parameter Pi represents the prior probabilities of the N parts of speech, which are the hidden states; the second parameter A represents the state transition probability matrix among the N parts of speech for adjacent words; and the third parameter B represents the confusion matrix from the N parts of speech to the M words;
Analysis unit: used to derive the part-of-speech sequences of previous search terms using the 3 trained parameters.
9. The system according to claim 8, characterized in that the analysis submodule further includes:
Second word segmentation unit: used to perform word segmentation on previous search terms, cutting them into analysis units and assigning attributes to the analysis units;
Generation unit: used to generate the feature list of a previous search term; the feature list includes the previous search term, the literal meaning of the previous search term, and the attributes of the analysis units.
10. The system according to any one of claims 6 to 9, characterized in that the determining module includes:
Word segmentation submodule: used to perform word segmentation on the current search term;
Part-of-speech analysis submodule: used to analyze the part of speech of each segmented word after the current search term has been segmented;
Determination submodule: used to assign each segmented word a different weight according to its part of speech and to determine the search term to be searched.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610785278.3A CN107784019A (en) | 2016-08-30 | 2016-08-30 | Word treatment method and system are searched in a kind of searching service |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107784019A true CN107784019A (en) | 2018-03-09 |
Family
ID=61451268
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180309 |