CN102270212A - User interest feature extraction method based on hidden semi-Markov model - Google Patents

User interest feature extraction method based on hidden semi-Markov model

Info

Publication number
CN102270212A
CN102270212A (application CN2011100881918A / CN201110088191A)
Authority
CN
China
Prior art keywords
state
hsmm
model
user
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100881918A
Other languages
Chinese (zh)
Inventor
琚春华 (Ju Chunhua)
王蓓 (Wang Bei)
章敏 (Zhang Min)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN2011100881918A priority Critical patent/CN102270212A/en
Publication of CN102270212A publication Critical patent/CN102270212A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a user interest feature extraction method based on a hidden semi-Markov model (HSMM). The aim of the invention is to provide a user interest feature extraction method that better matches real-world conditions and has stronger modeling and analysis ability. The technical scheme comprises the steps of data collection, data preprocessing, model training, and user interest feature extraction. By introducing the hidden semi-Markov model, the method extracts user interest features more accurately under complex conditions.

Description

A user interest feature extraction method based on a hidden semi-Markov model
Technical field
The present invention relates to the fields of machine learning and information extraction, and in particular to a user interest feature extraction method based on a hidden semi-Markov model. By introducing the hidden semi-Markov model, it is applicable to extracting user interest features more accurately under complex conditions.
Background technology
Since the 1960s, the theory of text information extraction has developed continuously and has become an important research branch of natural language processing. At present there are mainly three classes of information extraction models: dictionary-based models; rule-based models, such as ontology-based ones; and statistics-based models, such as the hidden Markov model (HMM).
Compared with traditional text, web pages have many distinctive characteristics: they are numerous, frequently updated, and highly varied; a large part of a page consists of structured text blocks, and pages also contain hyperlinks. Transforming unstructured natural language text into a structured information base requires the cooperation of multiple natural language processing techniques. Information extraction can be summarized as automatic word segmentation, tagging, and template filling of text, and requires a certain amount of semantic analysis. Following traditional natural language processing, a Chinese information extraction module roughly comprises word segmentation, name analysis, syntactic analysis, semantic analysis, scenario matching, consistency analysis, inference and judgment, template matching and filling, and so on. As a "bridge" between unstructured data and databases, information extraction is crucial for mining multilingual and heterogeneous web text data.
Because the HMM has a statistical foundation well suited to natural language processing, together with advantages such as robust extraction, high precision, ease of construction, and strong adaptability, it has attracted increasing attention from researchers. The hidden semi-Markov model (Hidden Semi-Markov Model, HSMM) is an extension of the HMM that overcomes the limitations of HMM modeling caused by the Markov-chain assumption. Compared with the HMM, the HSMM is better suited to describing hidden Markov processes whose state durations follow arbitrary distributions.
Summary of the invention
The technical problem to be solved by the present invention is to provide, for e-commerce websites, a user interest feature extraction method based on a hidden semi-Markov model, so that the statistics-based model better matches real-world conditions and has stronger modeling and analysis ability.
The technical solution adopted by the present invention is a user interest feature extraction method based on a hidden semi-Markov model, characterized by comprising the steps:
Step 1, data collection:
Obtain information related to user characteristics, interest features, or demands from the user's implicit behavior; the interest features are obtained from web server logs and client-side data;
Step 2, data preprocessing:
Preprocessing mainly performs data cleaning, user identification, session identification, path completion, formatting, and event recognition on the user access logs to form user session files;
Step 3, model training,
3.1. Preliminarily choose an HSMM with N states, defined by the six-tuple λ = (N, M, π, A, B, p_j(d)), where:
N is the number of states, with finite state set S = {s_1, s_2, ..., s_N};
M is the number of observation values, with finite observation set V = {v_1, v_2, ..., v_M};
π = {π_1, π_2, ..., π_N} is the initial state distribution, describing the probability that the state q_1 occupied at time t = 1 of the observation sequence O belongs to each state of the model, i.e., π_i = P(q_1 = s_i);
A = [a_ij]_{N×N} is the state transition probability matrix; for a first-order HSMM the current state q_t depends only on q_{t-1}, i.e., a_ij = P(q_t = s_j | q_{t-1} = s_i);
B = [b_j(k)] is the observation probability matrix, giving the probability of observation O_k in state s_j; it is the distribution of a random variable or random vector over the observation probability space of each state, usually represented by a Gaussian mixture:
b_j(O_k) = Σ_{g=1}^{G} ω_jg · N(O_k; μ_jg, U_jg),
where G is the number of Gaussian components each state may contain, ω_jg is the weight, μ_jg the mean, and U_jg the covariance matrix of the g-th Gaussian of state j.
The state duration density function p_j(d) gives the probability that state s_j persists for d time units, represented by a single Gaussian distribution normalized over the allowed durations:
p_j(d) = N(d | μ_j, σ_j) / Σ_{d'=1}^{D} N(d' | μ_j, σ_j), 1 ≤ d ≤ D,
where μ_j is the mean and σ_j the variance of the duration of state j, and D is the maximum state duration in time units;
3.2. Train on the preprocessed samples with the BW algorithm:
An iterative optimization algorithm is adopted. Using Lagrange multipliers, an objective function Q is constructed that contains all HSMM parameters as variables; the partial derivative of Q with respect to each variable is set to 0, from which the relations between the new HSMM parameters that maximize Q and the old ones are derived, yielding an estimate of every HSMM parameter. The functional relations between new and old parameters are iterated until the HSMM parameters no longer change appreciably;
3.3. Initialize the HSMM;
3.4. Solve for the HSMM model λ:
From the chosen observation sequence O and the initial model λ = (π, A, B, p_i(d)), the reestimation formulas yield a new set of parameters π̄, Ā, B̄, and p̄_i(d), i.e., a new model λ̄ = (π̄, Ā, B̄, p̄_i(d)). The model λ̄ obtained from the reestimation formulas describes the observation sequence O better than λ, i.e., P(O | λ̄) ≥ P(O | λ). Repeating this process progressively improves the model parameters until P(O | λ) converges, i.e., no longer increases appreciably; the λ̄ at that point is the desired HSMM model.
Step 4, user interest feature extraction:
4.1. Scan the preprocessed text whose features are to be extracted, then use typesetting and separator information such as line breaks, colons, and double spaces to convert the tagged text sequence into a sequence of marked text segments. Combined with the HSMM model λ output by the training in step 3, run the Viterbi algorithm on the test samples obtained from the text segmentation to perform user interest feature extraction;
4.2. Input the preprocessed text observation sequence O = O_1 O_2 ... O_T into the HSMM model λ and find the state label sequence with maximum probability, Q* = q*_1 q*_2 ... q*_T; the observation text marked with the target state labels is the extracted user feature content.
The data cleaning in step 2 deletes data not needed in the mining process; user identification is the process of associating requested pages with users, mainly handling the situation in which multiple users access the website through a proxy server or firewall; session identification decomposes all page requests of one user within a period of time into user sessions; path completion fills in requested pages that are missing because of local or proxy-server caching.
The beneficial effects of the invention are as follows: by governing the user's browsing behavior with state duration probabilities, the method couples the hidden states describing interest features more closely with time; using the HSMM's ability to generate multiple observation sequences, the text information is divided into several text block partitions, each partition's features corresponding to one observation sequence; and the HSMM's tolerance of missing observations is used to extract feature information exhibiting missing behavior. Experiments show that feature extraction with the HSMM achieves higher precision and recall than the HMM method, matches the real problem better, and has stronger modeling and analysis ability.
Description of drawings
Fig. 1 is the workflow diagram of the present invention.
Fig. 2 compares the comprehensive evaluation index of the HSMM and the HMM.
Embodiment
The user interest feature extraction method of this embodiment, based on a hidden semi-Markov model, comprises the following steps:
Step 1, experimental data collection
Data collection is the process of obtaining information related to user characteristics, interest features, or demands. The present invention mainly infers a user's interests from the user's implicit behavior; the interest features can be obtained from web server logs and client-side data.
Step 2, data preprocessing
Preprocessing mainly performs data cleaning, user identification, session identification, path completion, formatting, event recognition, and similar processing on the user access logs to form user session files. Data cleaning deletes data not needed in the mining process; user identification is the process of associating requested pages with users, mainly handling the situation in which multiple users access the website through a proxy server or firewall; session identification decomposes all page requests of one user within a period of time into user sessions; path completion fills in requested pages that are missing because of local or proxy-server caching.
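The session identification step described above can be sketched as follows (an illustrative Python sketch, not the patent's implementation; the patent does not fix a timeout value, so the 30-minute threshold here is an assumption):

```python
# Split one user's timestamped page requests into sessions whenever the
# gap between consecutive requests exceeds a timeout. The 30-minute
# threshold is a common convention, assumed here for illustration.
TIMEOUT = 30 * 60  # seconds

def sessionize(requests, timeout=TIMEOUT):
    """requests: list of (timestamp_sec, url) pairs, sorted by time."""
    sessions = []
    for ts, url in requests:
        # Continue the current session if the gap to its last request
        # is within the timeout; otherwise open a new session.
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, url))
        else:
            sessions.append([(ts, url)])
    return sessions

log = [(0, "/a"), (60, "/b"), (5000, "/c")]
print(sessionize(log))  # [[(0, '/a'), (60, '/b')], [(5000, '/c')]]
```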
Data preprocessing mainly consists of the following steps:
1) Web page format check: a full scan of the page removes comments and some formatting characters that are meaningless for information extraction, while keeping the important node tags as auxiliary information for extraction.
2) Feature tagging: rules are used to describe fixed features, which makes it easier for the extraction model to handle fixed states or fixed emission probabilities, reduces algorithmic complexity, and improves efficiency.
3) Word segmentation: longer character strings encountered inside a node are segmented into words; the open-source JE segmenter can be used directly, which provides a simple API and allows new words to be added.
4) Text annotation: valuable information in the text is annotated to aid recognition by the extraction model. Careful manual annotation is needed in the model training stage to obtain suitable model parameters, and it has a significant impact on the accuracy of the extracted information.
HTMLParser is used to generate a node tree structure for a preliminary blocking of the data, converting the text into a pattern more easily handled by the information extraction system. HTMLParser is an open-source Java library that supports linear or nested parsing of HTML text. Using typesetting information such as separators, the HTML-tagged web log text sequence is converted into a sequence of text segments, each block carrying an HTML status tag.
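The block-segmentation idea can be sketched with Python's standard-library `html.parser` in place of the Java HTMLParser library named in the text (class and tag names here are illustrative, not the patent's code):

```python
# Split an HTML page into (tag, text) segments, one per block-level
# element, as a stand-in for the node-tree blocking described above.
from html.parser import HTMLParser

class BlockSegmenter(HTMLParser):
    """Collects the text inside block-level tags as one segment each."""
    BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "title"}

    def __init__(self):
        super().__init__()
        self.segments = []  # list of (tag, text) pairs
        self._stack = []    # currently open block tags
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            # A new block restarts the text buffer (fine for this sketch).
            self._stack.append(tag)
            self._buf = []

    def handle_data(self, data):
        if self._stack:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._stack and tag == self._stack[-1]:
            text = "".join(self._buf).strip()
            if text:
                self.segments.append((self._stack[-1], text))
            self._stack.pop()

seg = BlockSegmenter()
seg.feed("<html><title>Shop</title><div>red shoes</div></html>")
print(seg.segments)  # [('title', 'Shop'), ('div', 'red shoes')]
```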
Step 3, model training
Before information extraction, the model must first be trained on a large amount of accurately annotated source data; in practical applications an existing model can be used directly for extraction, while the model parameters are also adjusted according to new situations encountered during extraction, which embodies the self-adjusting characteristic of the hidden Markov model. The experimental data come from the Taobao website. Users' web browsing behavior was collected; after processing the pages, 2000 user behavior text samples were obtained, of which 1500 annotated texts were randomly chosen as the training set (for model training) and the other 500 samples as the test set (for information extraction).
The HSMM differs from the continuous, semi-continuous, and discrete HMMs in that it allows the underlying process to be a semi-Markov chain, each state having a variable cycle or residence time. To overcome the shortcomings of the conventional HMM, HMM variants have appeared that use an explicit p_i(d) to represent the state duration probability distribution.
3.1. First preliminarily choose an HSMM with N states, defined by the six-tuple λ = (N, M, π, A, B, p_j(d)), where:
N is the number of states, with finite state set S = {s_1, s_2, ..., s_N};
M is the number of observation values, with finite observation set V = {v_1, v_2, ..., v_M};
π = {π_1, π_2, ..., π_N} is the initial state distribution, describing the probability that the state q_1 occupied at time t = 1 of the observation sequence O belongs to each state of the model (in a real process more sequences can be observed; for example, if a link is clicked while browsing, the URL of that link is known, and the title of the page is also available), i.e., π_i = P(q_1 = s_i);
A = [a_ij]_{N×N} is the state transition probability matrix; for a first-order HSMM the current state q_t depends only on q_{t-1}, i.e., a_ij = P(q_t = s_j | q_{t-1} = s_i);
B = [b_j(k)] is the observation probability matrix, giving the probability of observation O_k in state s_j; it is the distribution of a random variable or random vector over the observation probability space of each state, usually represented by a Gaussian mixture:
b_j(O_k) = Σ_{g=1}^{G} ω_jg · N(O_k; μ_jg, U_jg),
where G is the number of Gaussian components each state may contain, ω_jg is the weight, μ_jg the mean, and U_jg the covariance matrix of the g-th Gaussian of state j;
the state duration density function (describing the density under which a state persists in time) p_j(d) gives the probability that state s_j lasts d time units, represented by a single Gaussian normalized over the allowed durations:
p_j(d) = N(d | μ_j, σ_j) / Σ_{d'=1}^{D} N(d' | μ_j, σ_j), 1 ≤ d ≤ D,
where μ_j is the mean and σ_j the variance of the duration of state j, and D is the maximum state duration in time units.
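The six-tuple above can be sketched as a small container class (an illustrative Python sketch for scalar observations; the class and field names are assumptions, not the patent's code):

```python
# Minimal container for λ = (N, M, π, A, B, p_j(d)), with the
# Gaussian-mixture emission b_j(O_k) and the truncated, normalised
# Gaussian duration density p_j(d) written out as in the formulas above.
import math

def gauss(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

class HSMM:
    def __init__(self, pi, A, mix, dur, D):
        self.pi = pi    # initial state distribution, length N
        self.A = A      # N x N state transition matrix
        self.mix = mix  # mix[j] = list of (weight, mean, var) Gaussians
        self.dur = dur  # dur[j] = (mu_j, var_j) of the duration Gaussian
        self.D = D      # maximum state duration in time units

    def b(self, j, o):
        """Emission probability b_j(o) = sum_g w_jg N(o; mu_jg, U_jg)."""
        return sum(w * gauss(o, mu, var) for w, mu, var in self.mix[j])

    def p_dur(self, j, d):
        """Duration probability p_j(d), normalised over d = 1..D."""
        mu, var = self.dur[j]
        z = sum(gauss(k, mu, var) for k in range(1, self.D + 1))
        return gauss(d, mu, var) / z

m = HSMM(pi=[1.0, 0.0],
         A=[[0.0, 1.0], [1.0, 0.0]],
         mix=[[(1.0, 0.0, 1.0)], [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]],
         dur=[(2.0, 1.0), (3.0, 1.0)],
         D=5)
# The normalisation makes each state's durations sum to 1 over 1..D.
assert abs(sum(m.p_dur(0, d) for d in range(1, 6)) - 1.0) < 1e-9
```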
The present invention is mainly used to extract user interest behavior information, combining explicit and implicit acquisition: the main sources of user interest information are the keywords the user enters into a search engine, the pages the user browses, and the behaviors exhibited while browsing. Seven states are chosen, with state set S = {User, Keys, Title, Time, Marks, Operations, Links}; within a given time period, Marks and Operations also contain series of sub-states, with {books, savepage} ∈ Marks and {cut, copy, scroll} ∈ Operations. The structure of the user interest information state set is shown in Table 1.
Table 1. User interest information state set structure
Keywords    Search keywords entered into the search engine
Title       Title of the requested web page
Time        Residence time on the page (seconds)
Marks       Marking behavior: add bookmark (books), save page (savepage)
Operations  Operation behavior: cut (cut), copy (copy), drag the scroll bar (scroll)
Links       Link behavior: whether a hyperlink is clicked while browsing a page
3.2. Training the HSMM model
The BW (Baum-Welch) algorithm is used to train on the samples preprocessed in step 2. The Baum-Welch algorithm mainly solves the model training and parameter reestimation problem and is in fact an application of the maximum likelihood criterion. It adopts an iterative optimization procedure: using Lagrange multipliers, an objective function Q containing all HSMM parameters as variables is constructed; the partial derivative of Q with respect to each variable is set to 0, from which the relations between the new HSMM parameters that maximize Q and the old model parameters are derived, yielding an estimate of every HSMM parameter. The functional relations between new and old model parameters are iterated until the HSMM model parameters no longer change appreciably.
3.3. Initializing the HSMM
When obtaining the parameters of the HSMM from the training data set, an important problem is the choice of the initial model: different initial models yield different training results, because the algorithm finds the model parameters at a local maximum of P(O | λ). It is therefore significant to choose a good initial model so that the local maximum finally obtained is close to the global maximum, but this problem still has no perfect answer, and empirical methods are usually adopted in practice. It is generally believed that the choice of initial values for π and A has little influence; they can be chosen randomly or uniformly as long as certain stochastic constraints are satisfied. The initial value of B, however, has a larger influence on HSMM training, and a more elaborate initialization method is generally adopted. The initial model λ here can be chosen arbitrarily: since P(O | λ̄) ≥ P(O | λ) for any λ̄ produced by the reestimation formulas, λ̄ is an improvement over λ, and using λ̄ in turn as the initial value of the reestimation formulas yields a further improved model. To a certain extent this avoids the consequences of an improper choice of initial values.
3.4. Solving for the HSMM model λ
From the chosen observation sequence O and the initial model λ = (π, A, B, p_i(d)), the reestimation formulas yield a new set of parameters π̄, Ā, B̄, and p̄_i(d), i.e., a new model λ̄ = (π̄, Ā, B̄, p̄_i(d)). The model λ̄ obtained from the reestimation formulas describes the observation sequence O better than λ, i.e., P(O | λ̄) ≥ P(O | λ). Repeating this process progressively improves the model parameters until P(O | λ) converges, i.e., no longer increases appreciably; the λ̄ at that point is the desired HSMM model.
Step 4, user interest feature extraction
4.1. Scan the preprocessed text whose features are to be extracted, then use typesetting and separator information such as line breaks, colons, and double spaces to convert the tagged text sequence into a sequence of marked text segments. Combined with the HSMM model λ output by the training in step 3, run the Viterbi algorithm on the 500 test samples obtained from the text segmentation to perform user interest feature extraction.
4.2. Input the preprocessed text observation sequence O = O_1 O_2 ... O_T into the HSMM model λ and find the state label sequence with maximum probability, Q* = q*_1 q*_2 ... q*_T; the observation text marked with the target state labels is the extracted user feature content.
The Viterbi algorithm solves the problem of determining, in the sense of optimality, a state label sequence Q* given an observation sequence O = (O_1, O_2, ..., O_T) and an HSMM model λ = (π, A, B, p_i(d)).
In the test, this example uses precision (P), recall (R), and the F value as the system performance evaluation criteria. The three indices are defined as follows:
P = N_1 / (N_1 + N_2), R = N_1 / (N_1 + N_3),
where N_1 is the number of correctly identified instances, N_2 the number of instances wrongly identified as this class, and N_3 the number of instances belonging to this class but wrongly identified as other classes.
Comprehensive evaluation index:
F = (β² + 1) × P × R / (β² × P + R),
where the parameter β assigns different weights to precision P and recall R; when β is 1, precision and recall receive the same weight. We take β = 1 in the experiments.
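The three indices can be written out directly (a straightforward Python transcription of the formulas above; the function names are illustrative):

```python
# P = N1/(N1+N2), R = N1/(N1+N3), F = (beta^2+1)*P*R / (beta^2*P + R).
def precision(n1, n2):
    return n1 / (n1 + n2)

def recall(n1, n3):
    return n1 / (n1 + n3)

def f_measure(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# The "Keywords" row of Table 2 (HMM): N1 = 317, N2 = 113, N3 = 70.
p, r = precision(317, 113), recall(317, 70)
print(round(100 * p, 2), round(100 * r, 2))  # 73.72 81.91, matching the table
```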
The comprehensive evaluation index F is used to evaluate the HSMM and the HMM respectively; see Tables 2 and 3.
Table 2. Test results and performance index statistics for feature extraction with the HMM model
            N1   N2   N3   P(%)   R(%)
Keywords    317  113  70   73.72  81.91
Title       280  103  117  73.11  70.53
Time        332  98   70   77.21  82.59
Operations  295  143  62   67.35  82.63
Marks       353  92   55   79.33  86.52
Links       382  51   67   88.22  85.08
Table 3. Test results and performance index statistics for feature extraction with the HSMM model
            N1   N2   N3   P(%)   R(%)
Keywords    363  79   58   82.13  85.37
Title       308  104  88   74.76  77.78
Time        402  56   42   87.77  90.54
Operations  391  38   71   91.14  84.63
Marks       381  67   52   85.04  87.99
Links       443  27   30   94.26  93.66
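Computing the β = 1 comprehensive index for every row of Tables 2 and 3 reproduces the comparison plotted in Fig. 2 (a verification sketch over the published P and R values; the dictionaries are transcriptions of the tables):

```python
# F = 2PR/(P+R) per feature state; the HSMM F value exceeds the HMM one
# for all six states, as Fig. 2 shows.
def f1(p, r):
    return 2 * p * r / (p + r)

hmm  = {"Keywords": (73.72, 81.91), "Title": (73.11, 70.53),
        "Time": (77.21, 82.59), "Operations": (67.35, 82.63),
        "Marks": (79.33, 86.52), "Links": (88.22, 85.08)}
hsmm = {"Keywords": (82.13, 85.37), "Title": (74.76, 77.78),
        "Time": (87.77, 90.54), "Operations": (91.14, 84.63),
        "Marks": (85.04, 87.99), "Links": (94.26, 93.66)}

for k in hmm:
    assert f1(*hsmm[k]) > f1(*hmm[k]), k
    print(f"{k:10s}  HMM F={f1(*hmm[k]):5.2f}  HSMM F={f1(*hsmm[k]):5.2f}")
```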

Claims (2)

1. A user interest feature extraction method based on a hidden semi-Markov model, characterized by comprising the steps:
Step 1, data collection:
Obtain information related to user characteristics, interest features, or demands from the user's implicit behavior; the interest features are obtained from web server logs and client-side data;
Step 2, data preprocessing:
Preprocessing mainly performs data cleaning, user identification, session identification, path completion, formatting, and event recognition on the user access logs to form user session files;
Step 3, model training:
3.1. Preliminarily choose an HSMM with N states, defined by the six-tuple λ = (N, M, π, A, B, p_j(d)), where:
N is the number of states, with finite state set S = {s_1, s_2, ..., s_N};
M is the number of observation values, with finite observation set V = {v_1, v_2, ..., v_M};
π = {π_1, π_2, ..., π_N} is the initial state distribution, describing the probability that the state q_1 occupied at time t = 1 of the observation sequence O belongs to each state of the model, i.e., π_i = P(q_1 = s_i);
A = [a_ij]_{N×N} is the state transition probability matrix; for a first-order HSMM the current state q_t depends only on q_{t-1}, i.e., a_ij = P(q_t = s_j | q_{t-1} = s_i);
B = [b_j(k)] is the observation probability matrix, giving the probability of observation O_k in state s_j; it is the distribution of a random variable or random vector over the observation probability space of each state, usually represented by a Gaussian mixture:
b_j(O_k) = Σ_{g=1}^{G} ω_jg · N(O_k; μ_jg, U_jg),
where G is the number of Gaussian components each state may contain, ω_jg is the weight, μ_jg the mean, and U_jg the covariance matrix of the g-th Gaussian of state j;
the state duration density p_j(d) gives the probability that state s_j lasts d time units, represented by a single Gaussian normalized over the allowed durations:
p_j(d) = N(d | μ_j, σ_j) / Σ_{d'=1}^{D} N(d' | μ_j, σ_j), 1 ≤ d ≤ D,
where μ_j is the mean and σ_j the variance of the duration of state j, and D is the maximum state duration in time units;
3.2. Train on the preprocessed samples with the BW algorithm:
An iterative optimization algorithm is adopted. Using Lagrange multipliers, an objective function Q is constructed that contains all HSMM parameters as variables; the partial derivative of Q with respect to each variable is set to 0, from which the relations between the new HSMM parameters that maximize Q and the old ones are derived, yielding an estimate of every HSMM parameter. The functional relations between new and old parameters are iterated until the HSMM parameters no longer change appreciably;
3.3. Initialize the HSMM;
3.4. Solve for the HSMM model λ:
From the chosen observation sequence O and the initial model λ = (π, A, B, p_i(d)), the reestimation formulas yield a new set of parameters π̄, Ā, B̄, and p̄_i(d), i.e., a new model λ̄ = (π̄, Ā, B̄, p̄_i(d)). The model λ̄ obtained from the reestimation formulas describes the observation sequence O better than λ, i.e., P(O | λ̄) ≥ P(O | λ). Repeating this process progressively improves the model parameters until P(O | λ) converges, i.e., no longer increases appreciably; the λ̄ at that point is the desired HSMM model.
Step 4, user interest feature extraction:
4.1. Scan the preprocessed text whose features are to be extracted, then use typesetting and separator information such as line breaks, colons, and double spaces to convert the tagged text sequence into a sequence of marked text segments. Combined with the HSMM model λ output by the training in step 3, run the Viterbi algorithm on the test samples obtained from the text segmentation to perform user interest feature extraction;
4.2. Input the preprocessed text observation sequence O = O_1 O_2 ... O_T into the HSMM model λ and find the state label sequence with maximum probability, Q* = q*_1 q*_2 ... q*_T; the observation text marked with the target state labels is the extracted user feature content.
2. The extraction method according to claim 1, characterized in that the data cleaning in step 2 deletes data not needed in the mining process; user identification is the process of associating requested pages with users, mainly handling the situation in which multiple users access the website through a proxy server or firewall; session identification decomposes all page requests of one user within a period of time into user sessions; path completion fills in requested pages that are missing because of local or proxy-server caching.
CN2011100881918A 2011-04-07 2011-04-07 User interest feature extraction method based on hidden semi-Markov model Pending CN102270212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100881918A CN102270212A (en) 2011-04-07 2011-04-07 User interest feature extraction method based on hidden semi-Markov model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100881918A CN102270212A (en) 2011-04-07 2011-04-07 User interest feature extraction method based on hidden semi-Markov model

Publications (1)

Publication Number Publication Date
CN102270212A (en) 2011-12-07

Family

ID=45052519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100881918A Pending CN102270212A (en) 2011-04-07 2011-04-07 User interest feature extraction method based on hidden semi-Markov model

Country Status (1)

Country Link
CN (1) CN102270212A (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102438025B (en) * 2012-01-10 2015-03-25 中山大学 Indirect distributed denial-of-service attack defense method and system based on Web proxy
CN102438025A (en) * 2012-01-10 2012-05-02 中山大学 Indirect distributed denial-of-service attack defense method and system based on Web proxy
CN102999789A (en) * 2012-11-19 2013-03-27 浙江工商大学 Digital city safety precaution method based on hidden semi-Markov model
CN103020289A (en) * 2012-12-25 2013-04-03 浙江鸿程计算机系统有限公司 Method for providing personalized needs of search engine users based on log mining
CN103020289B (en) * 2012-12-25 2015-08-05 浙江鸿程计算机系统有限公司 Method for providing personalized needs of search engine users based on Web log mining
CN104123312B (en) * 2013-04-28 2018-02-16 国际商业机器公司 Data mining method and device
CN104123312A (en) * 2013-04-28 2014-10-29 国际商业机器公司 Data mining method and device
CN106803422A (en) * 2015-11-26 2017-06-06 中国科学院声学研究所 Language model re-estimation method based on long short-term memory network
CN106803422B (en) * 2015-11-26 2020-05-12 中国科学院声学研究所 Language model re-estimation method based on long short-term memory network
CN105740327B (en) * 2016-01-22 2019-04-19 天津中科智能识别产业技术研究院有限公司 Adaptive sampling method based on user preferences
CN105740327A (en) * 2016-01-22 2016-07-06 天津中科智能识别产业技术研究院有限公司 Self-adaptive sampling method based on user preferences
CN106651517B (en) * 2016-12-20 2021-11-30 广东技术师范大学 Drug recommendation method based on hidden semi-Markov model
CN106651517A (en) * 2016-12-20 2017-05-10 广东技术师范学院 Hidden semi-Markov model-based drug recommendation method
CN108303649A (en) * 2017-01-13 2018-07-20 重庆邮电大学 Cell health state recognition method
CN106685996A (en) * 2017-02-23 2017-05-17 上海万雍科技股份有限公司 Method for detecting abnormal account logins based on an HMM
CN110008334A (en) * 2017-08-04 2019-07-12 腾讯科技(北京)有限公司 Information processing method, device and storage medium
CN107808168A (en) * 2017-10-31 2018-03-16 北京科技大学 Social network user behavior prediction method based on strong and weak relationships
CN109947891A (en) * 2017-11-07 2019-06-28 北京国双科技有限公司 Document analysis method and device
WO2019128938A1 (en) * 2017-12-29 2019-07-04 北京神州绿盟信息安全科技股份有限公司 Method for extracting feature string, device, network apparatus, and storage medium
US11379687B2 (en) 2017-12-29 2022-07-05 Nsfocus Technologies Group Co., Ltd. Method for extracting feature string, device, network apparatus, and storage medium
CN108829808A (en) * 2018-06-07 2018-11-16 麒麟合盛网络技术股份有限公司 Personalized page ranking method and apparatus, and electronic device
CN108829808B (en) * 2018-06-07 2021-07-13 麒麟合盛网络技术股份有限公司 Personalized page ranking method and apparatus, and electronic device
CN109726292A (en) * 2019-01-02 2019-05-07 山东省科学院情报研究所 Text analysis method and apparatus for large-scale multilingual data
CN109933741A (en) * 2019-02-27 2019-06-25 京东数字科技控股有限公司 User network behavior feature extraction method, device and storage medium
CN109933741B (en) * 2019-02-27 2020-06-23 京东数字科技控股有限公司 Method, device and storage medium for extracting user network behavior characteristics
CN110224850A (en) * 2019-04-19 2019-09-10 北京亿阳信通科技有限公司 Telecommunication network fault early warning method, device and terminal device

Similar Documents

Publication Publication Date Title
CN102270212A (en) User interest feature extraction method based on hidden semi-Markov model
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN103544255B (en) Network public opinion analysis method based on text semantic relevance
US9009134B2 (en) Named entity recognition in query
Chen et al. A Two‐Step Resume Information Extraction Algorithm
CN103514183B (en) Information search method and system based on interactive document clustering
CN103150382B (en) Automatic short text semantic concept expansion method and system based on open knowledge base
CN103605658B (en) Search engine system based on text sentiment analysis
CN107885793A (en) Hot microblog topic analysis and prediction method and system
CN103246644B (en) Method and device for processing Internet public opinion information
CN103473280A (en) Method and device for mining comparable network language materials
CN103646112A (en) Domain adaptation method for dependency parsing based on web search
CN106844349A (en) Comment spam recognition method based on co-training
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
Wang et al. Neural related work summarization with a joint context-driven attention mechanism
CN104346382B (en) Text analysis system and method using linguistic queries
CN111325018A (en) Domain dictionary construction method based on web retrieval and new word discovery
CN106126618B (en) Name-based email address recommendation method and system
CN105677684A (en) Method for making semantic annotations on content generated by users based on external data sources
Chen et al. Toward the understanding of deep text matching models for information retrieval
CN111753540B (en) Method and system for collecting text data to perform Natural Language Processing (NLP)
CN103744830A (en) Method for identifying identity information in Excel documents based on semantic analysis
Kalita et al. An extractive approach of text summarization of Assamese using WordNet
CN109597879B (en) Service behavior relation extraction method and device based on 'citation relation' data
CN113157857A (en) Hot topic detection method, device and equipment for news

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20111207