CN104572612A - Data processing method and device - Google Patents


Info

Publication number
CN104572612A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310489328.XA
Other languages
Chinese (zh)
Inventor
程刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd


Abstract

The invention provides a data processing method and device. The method comprises: determining the feature vector words of a word to be processed; taking the set internal link words that appear in the dedicated result page of the word to be processed as candidate self internal link words of that word; calculating, according to a set recommendation score calculation method, a recommendation score for each candidate self internal link word from the feature vector words of the candidate and the feature vector words of the word to be processed; and selecting a set number of candidate self internal link words with high recommendation scores as the self internal link words related to the word to be processed. The method allows the self internal link words of a word to be mined automatically when that word is processed.

Description

Data processing method and device
Technical field
This application relates to Internet technology, and in particular to a data processing method and device.
Background art
To make this application easier to understand, the technical terms it uses are described first:
Word segmentation: cutting a sequence into individual words. The sequence may be a sequence of Chinese characters, or a sequence that mixes Chinese characters and English terms.
Knowledge base: a collection of semantic trees, where each semantic tree is made up of a group of words with the same or similar meaning.
Feature vector word: a word used to represent a feature of a document; it comprises at least one character.
Internal link word: a link, together with its descriptive text, that appears in the text of a question-and-answer community and that a user can click to jump to another page. It can serve as a feature vector word of a document.
Self internal link word: a kind of internal link word; a link, together with its descriptive text, in an entry of a given category of the knowledge base that points to another entry of the same category.
The technical terms used in this application have been described above.
In the prior art, when some data processing is performed on a word in the knowledge base (called the word to be processed), it would be useful to recommend automatically the self internal link words related to that word, so that the user can find words of interest among the recommendations without having to search again. This would improve the word access efficiency of the knowledge base and would also save the resources wasted when users access the knowledge base repeatedly. However, the prior art offers no way of mining and recommending the self internal link words related to a word to be processed. A data processing method for mining such words is therefore a technical problem in urgent need of a solution.
Summary of the invention
This application provides a data processing method and device that, when a word in the knowledge base is processed, automatically mine the self internal link words related to that word.
The technical solutions provided by this application include:
A data processing method, comprising:
Determining the feature vector words of a word to be processed;
Taking the set internal link words that appear in the dedicated result page of the word to be processed as candidate self internal link words of the word to be processed;
Determining the feature vector words of each candidate self internal link word in the same way as the feature vector words of the word to be processed are determined;
Calculating, according to a set recommendation score calculation method, the recommendation score of each candidate self internal link word from the feature vector words of that candidate and the feature vector words of the word to be processed;
Selecting a set number of candidate self internal link words with high recommendation scores as the self internal link words related to the word to be processed.
A data processing method, comprising:
Taking the words in a preset knowledge base other than the word to be processed as candidate self internal link words of the word to be processed;
Obtaining the number of times each candidate self internal link word is accessed by users within a set time;
Calculating the total number of times all words in the knowledge base are accessed by users within the set time;
Calculating, according to a set recommendation score calculation method, the recommendation score of each candidate self internal link word from the number of times that candidate is accessed by users within the set time and the total number of times all words in the knowledge base are accessed by users within the set time;
Selecting a set number of candidate self internal link words with high recommendation scores as the self internal link words related to the word to be processed.
A data processing device, comprising:
A first determining unit, configured to determine the feature vector words of a word to be processed;
A second determining unit, configured to take the set internal link words that appear in the dedicated result page of the word to be processed as candidate self internal link words of the word to be processed;
A third determining unit, configured to determine the feature vector words of each candidate self internal link word in the same way as the first determining unit determines the feature vector words of the word to be processed;
A computing unit, configured to calculate, according to a set recommendation score calculation method, the recommendation score of each candidate self internal link word from the feature vector words of that candidate and the feature vector words of the word to be processed;
A selecting unit, configured to select a set number of candidate self internal link words with high recommendation scores as the self internal link words related to the word to be processed.
A data processing device, comprising:
A determining unit, configured to take the words in a preset knowledge base other than the word to be processed as candidate self internal link words of the word to be processed;
An acquiring unit, configured to obtain the number of times each candidate self internal link word is accessed by users within a set time;
A first computing unit, configured to calculate the total number of times all words in the knowledge base are accessed by users within the set time;
A second computing unit, configured to calculate, according to a set recommendation score calculation method, the recommendation score of each candidate self internal link word from the number of times that candidate is accessed by users within the set time and the total number of times all words in the knowledge base are accessed by users within the set time;
A selecting unit, configured to select a set number of candidate self internal link words with high recommendation scores as the self internal link words related to the word to be processed.
As can be seen from the above technical solutions, the present invention determines the feature vector words of the word to be processed and its candidate self internal link words, calculates the recommendation score of each candidate from the feature vector words of the word to be processed and those of the candidate, and selects a set number of candidates with high recommendation scores as the self internal link words related to the word to be processed. In this way, the self internal link words of a word can be mined automatically when that word is processed.
Further, because the present invention can automatically recommend the self internal link words related to a word when that word is processed, the user can find words of interest among the recommended self internal link words without having to search again. This improves the word access efficiency of the knowledge base and also saves the resources wasted when users access the knowledge base repeatedly.
Brief description of the drawings
Fig. 1 is a flowchart of the method provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of feature vector word determination provided by Embodiment 1 of the present invention;
Fig. 3 is a flowchart of relevance determination provided by Embodiment 2 of the present invention;
Fig. 4 is another flowchart of feature vector word determination provided by Embodiment 1 of the present invention;
Fig. 5 is a flowchart of the method provided by Embodiment 2 of the present invention;
Fig. 6 is a structural diagram of the device provided by an embodiment of the present invention;
Fig. 7 is another structural diagram of the device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described below with reference to the drawings and specific embodiments.
The method provided by the present invention automatically mines the self internal link words related to a word when that word is processed.
The method is described below through two embodiments:
Embodiment 1:
Referring to Fig. 1, Fig. 1 is a flowchart of the method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step 101: determine the feature vector words of the word to be processed.
In the present invention, the word to be processed may comprise at least one character.
How the feature vector words of the word to be processed are determined is described in detail later and is not elaborated here.
Step 102: take the set internal link words that appear in the dedicated result page of the word to be processed as the candidate self internal link words of the word to be processed.
In the present invention, the word to be processed is a word in a preset knowledge base. When the knowledge base is set up, a dedicated result page can be created for each word in it to explain or describe that word.
On this basis, Step 102 finds the dedicated result page of the word to be processed in the knowledge base. That result page may contain words that themselves have dedicated result pages in the knowledge base. When a user triggers such a word, for example by clicking it, the page jumps automatically to that word's dedicated result page, which is why such words are called internal link words. When Step 102 finds internal link words of this kind in the dedicated result page of the word to be processed, it takes them as the candidate self internal link words of the word to be processed, so that the higher-priority candidates can later be mined and recommended to the user as the self internal link words related to the word to be processed.
Step 103: determine the feature vector words of each candidate self internal link word in the same way that Step 101 determines the feature vector words of the word to be processed.
Step 104: calculate, according to the set recommendation score calculation method, the recommendation score of each candidate self internal link word from the feature vector words of that candidate and the feature vector words of the word to be processed.
Preferably, Step 103 determines the feature vector words of the candidates and of the word to be processed in the same way so that Step 104 can calculate the recommendation score; feature vector words determined in different ways might not be comparable.
The recommendation score calculation method set in Step 104 can be chosen according to actual conditions, for example a relevance calculation method or some other method; the present invention does not specifically limit it.
Step 105: select a set number of candidate self internal link words with high recommendation scores as the self internal link words related to the word to be processed.
At this point, Steps 101 to 105 have automatically mined the self internal link words related to the word to be processed.
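The five steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `mine_self_links`, the `internal_links` mapping (word to the internal link words found on its result page), and the pluggable `feature_words` and `score` callables are hypothetical names introduced here.

```python
import heapq
from typing import Callable, Dict, List, Set


def mine_self_links(
    target: str,
    internal_links: Dict[str, Set[str]],
    feature_words: Callable[[str], List[str]],
    score: Callable[[List[str], List[str]], float],
    k: int,
) -> List[str]:
    """Return up to k candidates with the highest recommendation scores."""
    # Step 101: feature vector words of the word to be processed.
    target_features = feature_words(target)
    # Step 102: internal link words on the target's result page are the candidates.
    candidates = internal_links.get(target, set()) - {target}
    # Steps 103 and 104: same feature extraction for every candidate, then score it.
    scored = {c: score(target_features, feature_words(c)) for c in candidates}
    # Step 105: keep the k best; ties are broken by the word itself for determinism.
    return heapq.nlargest(k, scored, key=lambda c: (scored[c], c))
```

Passing the feature extractor and the scoring function as parameters mirrors the text, which leaves both the feature determination mode and the recommendation score calculation method open.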
The way Step 101 of the flow shown in Fig. 1 determines the feature vector words of the word to be processed is described below.
Preferably, either of the following two modes can be used in the present invention to determine the feature vector words of the word to be processed:
Mode 1:
In Mode 1, Step 101 determines the feature vector words of the word to be processed through the steps shown in Fig. 2:
Step 201: determine the document of the dedicated result page of the word to be processed.
As described above, every word in the knowledge base has a dedicated result page; the word to be processed is a word in the knowledge base and therefore has one as well. On entering that page, the document corresponding to it is easily identified under existing document conventions; this document is called the document of the dedicated result page of the word to be processed.
Step 202: determine a set threshold number of words that are highly relevant to the document, and take the words so determined as the feature vector words of the word to be processed.
Preferably, in Mode 1, Step 202 can be implemented by the flow shown in Fig. 3:
Step 301: perform word segmentation and noise removal on the word to be processed to obtain a processing result.
The word to be processed need not be a single Chinese character; it may consist of two or more Chinese characters and/or English terms. In that case, Step 301 can process the word to be processed using existing word segmentation and noise removal techniques to obtain the processing result.
Step 302: extract the words that satisfy a set rule from the result as descriptors.
That is, once Step 301 has produced a result, Step 302 extracts from it the words that satisfy the set rule as descriptors. The rule can be set according to actual conditions, for example extracting the verbs and/or nouns in the result as descriptors.
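Steps 301 and 302 can be illustrated with a small Python sketch. It assumes that segmentation and part-of-speech tagging have already been done by some existing tool, so the input is a list of (token, tag) pairs; the tag names "n" and "v", the `noise_words` set and the function name `extract_descriptors` are hypothetical choices made for this example.

```python
from typing import Iterable, List, Set, Tuple


def extract_descriptors(
    tagged_tokens: Iterable[Tuple[str, str]],
    noise_words: Set[str],
    keep_tags: Tuple[str, ...] = ("n", "v"),
) -> List[str]:
    """Keep nouns and verbs, drop noise words, and remove duplicates in order."""
    seen: Set[str] = set()
    descriptors: List[str] = []
    for token, tag in tagged_tokens:
        # Noise removal (Step 301) and the "verbs and/or nouns" rule (Step 302).
        if token in noise_words or tag not in keep_tags or token in seen:
            continue
        seen.add(token)
        descriptors.append(token)
    return descriptors
```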
Step 303: calculate the relevance of each descriptor to the document, and select a set threshold number of descriptors that are highly relevant to the document as the feature vector words of the word to be processed.
Preferably, as one embodiment, the BM25 relevance model can be used in the present invention to calculate the relevance of a descriptor to the document. Note that BM25 is given only as an example to aid understanding and is not intended to limit the present invention.
Let w denote any descriptor and d denote the document of the dedicated result page of the word to be processed. The BM25 algorithm is then expressed by Formula 1:
$$\mathrm{score}(w, d) = \mathrm{IDF}(w) \cdot \frac{f(w, d)\,(k_1 + 1)}{f(w, d) + k_1\left(1 - b + b\,\dfrac{|D|}{\mathrm{avgDL}}\right)}$$ (Formula 1)
where score(w, d) is the relevance between descriptor w and the document; f(w, d) is the number of occurrences of descriptor w in the dedicated result page of the word to be processed; |D| is the length of the document; avgDL is the average document length over the dedicated result pages of all words in the word category to which the word to be processed belongs; k_1 and b are set parameters of the relevance calculation; and IDF(w) is given by Formula 2:
$$\mathrm{IDF}(w) = \log\frac{N - n(w) + 0.5}{n(w) + 0.5}$$ (Formula 2)
where N is the total number of dedicated result pages of all words in the word category to which the word to be processed belongs, and n(w) is the number of those result pages that contain descriptor w.
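Formulas 1 and 2 translate directly into a few lines of Python. The sketch below is illustrative only: the function and parameter names are chosen here, the natural logarithm is an assumption (the text does not state the base), and the defaults k1 = 1.5 and b = 0.75 are the values used in the worked example that follows.

```python
import math


def idf(total_pages: int, pages_with_word: int) -> float:
    """Formula 2: inverse document frequency of a descriptor."""
    return math.log((total_pages - pages_with_word + 0.5) / (pages_with_word + 0.5))


def bm25_score(
    freq: int,
    doc_len: float,
    avg_doc_len: float,
    total_pages: int,
    pages_with_word: int,
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Formula 1: relevance of one descriptor to one document."""
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(total_pages, pages_with_word) * freq * (k1 + 1) / (freq + norm)
```

Note that IDF becomes negative when a descriptor appears in more than half of the result pages, so very common descriptors are ranked below rare ones.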
Taking the case where the word to be processed belongs to the emotion category as an example, the relevance calculation of the flow shown in Fig. 3 is described below:
Suppose the dedicated result page of the word to be processed is denoted d and its document length is 12; the knowledge base contains N = 99000 dedicated result pages for words of the emotion category; and the average document length avgDL of those result pages is 10. Following the flow shown in Fig. 3, the word to be processed is first segmented and denoised, and the words that satisfy the set rule are then extracted from the result as descriptors. Suppose the extracted descriptors are "honeymoon", "change-place-reflect" and "love". Table 1 lists, for each of these three descriptors, the number of dedicated result pages in the emotion category that contain it and its number of occurrences in the dedicated result page of the word to be processed:
Table 1
Then, according to Formula 2, IDF(honeymoon), IDF(change-place-reflect) and IDF(love) are obtained as follows:
IDF(honeymoon) = log((99000 - 54000 + 0.5) / (54000 + 0.5)) = log(0.83);
IDF(change-place-reflect) = log((99000 - 2000 + 0.5) / (2000 + 0.5)) = log(48.49);
IDF(love) = log((99000 - 90000 + 0.5) / (90000 + 0.5)) = log(0.10);
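As a quick check of the arithmetic, the following Python sketch evaluates the fraction inside the logarithm of Formula 2 for the three descriptors, using N = 99000 and the page counts 54000, 2000 and 90000 from the expressions above. The figures 0.83, 48.49 and 0.10 are the values of that fraction; the natural logarithm used for `idfs` is an assumption, since the text does not state the base. The variable names are chosen for this example.

```python
import math

TOTAL_PAGES = 99000  # N in Formula 2
pages_with = {"honeymoon": 54000, "change-place-reflect": 2000, "love": 90000}


def idf_ratio(n_w: int, total: int = TOTAL_PAGES) -> float:
    """The fraction inside the logarithm of Formula 2."""
    return (total - n_w + 0.5) / (n_w + 0.5)


# Fraction inside the log, rounded to two decimals.
ratios = {w: round(idf_ratio(n), 2) for w, n in pages_with.items()}
# IDF proper, assuming a natural logarithm.
idfs = {w: math.log(idf_ratio(n)) for w, n in pages_with.items()}
# The logarithm is monotonic, so ranking by ratio and by IDF gives the same order.
ranking = sorted(pages_with, key=lambda w: idfs[w], reverse=True)
```

Because the logarithm preserves order, "change-place-reflect" has the largest IDF and "love" the smallest whichever base is used.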
Suppose k_1 in Formula 1 is 1.5 and b is 0.75. From Formula 1 and the IDF(honeymoon), IDF(change-place-reflect) and IDF(love) values calculated above, the relevance of each of the three descriptors "honeymoon", "change-place-reflect" and "love" to document d can be obtained.
The calculation shows that "change-place-reflect" has the highest relevance to document d, "honeymoon" the second highest, and "love" the lowest. If the rule is to select only 2 feature vector words for the word to be processed, "change-place-reflect" and "honeymoon" are taken as its feature vector words.
This completes the determination of the feature vector words of the word to be processed in Mode 1.
Mode 2 is described below:
Mode 2:
Compared with Mode 1, Mode 2 does not require a relevance calculation when determining the feature vector words of the word to be processed and is therefore simpler. As shown in Fig. 4, in Mode 2, determining the feature vector words of the word to be processed in Step 101 can comprise:
Step 401: find, in the dedicated result page of the word to be processed, other words that have dedicated result pages in the preset knowledge base.
As noted above, every word in the knowledge base has a dedicated result page. On entering the dedicated result page of the word to be processed, the words in that page are therefore analysed in order against the knowledge base to find those that have dedicated result pages in the knowledge base.
Step 402: select a set threshold number of words from the words found and take them as the feature vector words of the word to be processed.
This completes the flow shown in Fig. 4.
Take the word to be processed "Zhang Weiran" as an example. Following the flow shown in Fig. 4, the words in the dedicated result page of "Zhang Weiran" are analysed in order to find other words that have dedicated result pages in the preset knowledge base. If the words found are "not abandoning", "a blue bewitching Ji", "Chen Guanxi", "happy male voice", "Moscow State University", "Korean" and so on, a set threshold number of them is chosen arbitrarily, for example the two words "not abandoning" and "a blue bewitching Ji", and the chosen words are taken as the feature vector words of the word to be processed.
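Mode 2 amounts to a membership scan of the result page against the knowledge base. The Python sketch below is one possible reading, under assumptions: the page is taken to be already segmented into `page_tokens`, `kb_words` is the set of words that have dedicated result pages, and "choosing arbitrarily" is implemented as taking the first `limit` words in page order. All names are hypothetical.

```python
from typing import Iterable, List, Set


def feature_words_mode2(
    page_tokens: Iterable[str],
    kb_words: Set[str],
    target: str,
    limit: int,
) -> List[str]:
    """Steps 401 and 402: scan the page in order and keep knowledge-base words."""
    seen: Set[str] = {target}  # the word to be processed itself is excluded
    found: List[str] = []
    for token in page_tokens:
        if token in kb_words and token not in seen:
            seen.add(token)
            found.append(token)
    # Step 402: any `limit` of the found words would do; here, the first ones.
    return found[:limit]
```

With the page of "Zhang Weiran" tokenised so that "not abandoning" and "a blue bewitching Ji" are the first two knowledge-base words encountered, a `limit` of 2 returns exactly those two words, as in the example above.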
Mode 2 has been described above.
With Mode 1 or Mode 2 above, Step 101 of the flow shown in Fig. 1 can determine the feature vector words of the word to be processed.
This completes the detailed description of Step 101 of the flow shown in Fig. 1.
Because Step 103 of the flow shown in Fig. 1 determines the feature vector words of the candidate self internal link words in the same way as for the word to be processed, when Step 101 uses Mode 1, Step 103 likewise uses Mode 1 to determine the feature vector words of each candidate; and when Step 101 uses Mode 2, Step 103 likewise uses Mode 2.
Step 104 of the flow shown in Fig. 1 is described below:
Taking a relevance calculation method as the set recommendation score calculation method, calculating in Step 104 the recommendation score of each candidate self internal link word from its feature vector words and those of the word to be processed can comprise:
For each candidate self internal link word, calculating, according to the set relevance calculation method, the relevance between all feature vector words of the candidate and all feature vector words of the word to be processed, and taking the calculated relevance as the recommendation score of the candidate.
Preferably, in the present invention, all feature vector words of the word to be processed and all feature vector words of a candidate self internal link word are usually represented as feature vector matrices. On this basis, as one embodiment of the present invention, the relevance between all feature vector words of the candidate and all feature vector words of the word to be processed is expressed by Formula 3:
$$\mathrm{recommend\_score}(x, y) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{m} \mathrm{match}(v_i, w_j)}{\mathrm{length}(v) \cdot \mathrm{length}(w)}$$ (Formula 3)
where x is the word to be processed; y is any candidate self internal link word; recommend_score(x, y) is the relevance between all feature vector words of x and all feature vector words of y; n is the number of feature vector words of x; m is the number of feature vector words of y; length(v) is the total length of all feature vector words of x; length(w) is the total length of all feature vector words of y; and match(v_i, w_j) is the degree of match between the i-th feature vector word of x and the j-th feature vector word of y, with match(v_i, w_j) = 1 when v_i equals w_j and match(v_i, w_j) = 0 otherwise.
As Formula 3 shows, when the word to be processed and the candidate self internal link word are identical, the relevance between them is 1; conversely, when they have no word in common, the relevance between them is 0.
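Formula 3 can be sketched in Python as below. The text does not define length(v) and length(w) precisely; this example assumes they are the numbers of feature vector words in each list, which is only one possible reading. The function name follows the formula; the empty-list guard is an addition made here to avoid division by zero.

```python
from typing import Sequence


def recommend_score(
    target_features: Sequence[str],
    candidate_features: Sequence[str],
) -> float:
    """Formula 3: exact-match count normalised by the two list lengths."""
    if not target_features or not candidate_features:
        return 0.0  # guard added for this sketch; not part of Formula 3
    # match(v_i, w_j) is 1 when the two words are equal and 0 otherwise.
    matches = sum(1 for v in target_features for w in candidate_features if v == w)
    return matches / (len(target_features) * len(candidate_features))
```

Under this reading the score is 1 for identical single-word lists and 0 for lists with no word in common; identical lists of several distinct words score below 1, so a different normalisation of length(·) may have been intended.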
Note that Formula 3 is only one example of a relevance calculation for the case where the set recommendation score calculation method is a relevance calculation method; it is not intended to limit the present invention. Those skilled in the art may calculate the relevance in other ways. Moreover, the set recommendation score calculation method is not limited to relevance calculation; other methods may be used and can be set according to actual requirements.
This completes the detailed description of Step 104 of the flow shown in Fig. 1.
Embodiment 1 provided by the present invention has been described above.
Embodiment 2 is described below:
Referring to Fig. 5, Fig. 5 is a flowchart of the method provided by Embodiment 2 of the present invention. Compared with Embodiment 1, Embodiment 2 does not need to calculate feature vector words for the word to be processed. Instead, it determines the self internal link words related to the word to be processed from how often the other words in the preset knowledge base are accessed by users, which is simpler than Embodiment 1. Details follow:
As shown in Fig. 5, the flow can comprise the following steps:
Step 501: take the words in the preset knowledge base other than the word to be processed as the candidate self internal link words of the word to be processed.
Step 502: obtain the number of times each candidate self internal link word is accessed by users within a set time.
Usually, when any word in the knowledge base is accessed, the knowledge base's own logging function records the time and number of accesses. Relying on this logging function, Step 502 can easily obtain the number of times each candidate self internal link word is accessed by users within the set time. The unit of the set time can be hours, days, minutes and so on; the present invention does not specifically limit it.
Step 503: calculate the total number of times all words in the knowledge base are accessed by users within the set time.
Here, "all words" comprises the candidate self internal link words and the word to be processed. Again relying on the knowledge base's logging function, Step 503 can easily obtain the number of times the word to be processed and each candidate self internal link word are accessed by users within the set time; adding these together gives the total number of times all words in the knowledge base are accessed by users within the set time.
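Steps 502 and 503 can be illustrated with a short Python sketch. It assumes, hypothetically, that the knowledge base's logging function exposes an access log as (word, timestamp) pairs; the function name `count_accesses` and its parameters are introduced here for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def count_accesses(
    log: Iterable[Tuple[str, datetime]],
    window_end: datetime,
    window: timedelta,
) -> Tuple[Counter, int]:
    """Per-word access counts (Step 502) and their sum (Step 503) in the window."""
    window_start = window_end - window
    counts = Counter(
        word for word, accessed_at in log if window_start <= accessed_at <= window_end
    )
    return counts, sum(counts.values())
```

Because the window is a `timedelta`, the set time can be expressed in minutes, hours or days without changing the function.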
Step 504: calculate, according to the set recommendation score calculation method, the recommendation score of each candidate self internal link word from the number of times that candidate is accessed by users within the set time and the total number of times all words in the knowledge base are accessed by users within the set time.
The recommendation score calculation method set in Step 504 can be chosen according to actual conditions, for example a popularity calculation method or some other method; the present invention does not specifically limit it.
Taking a popularity calculation method as the set recommendation score calculation method, the calculation in Step 504 can comprise:
For each candidate self internal link word, calculating, according to the set popularity calculation method, the popularity of the candidate from the number of times it is accessed by users within the set time and the total number of times all words in the knowledge base are accessed by users within the set time, and taking the calculated popularity as the recommendation score of the candidate.
Preferably, as one embodiment of the present invention, the popularity of a candidate self internal link word is given by the following formula:
$$\mathrm{hot\_score}(v) = \frac{\sum_{k=0}^{n} p(v)}{\sum_{k=0}^{n} p(u)}$$ (Formula 4)
where v denotes any candidate self internal link word; hot_score(v) is the popularity of candidate v; the numerator, \sum_{k=0}^{n} p(v), is the number of times candidate v is accessed by users within the set time; and the denominator, \sum_{k=0}^{n} p(u), is the total number of times all words in the knowledge base are accessed by users within the set time.
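Read as "accesses of the candidate divided by accesses of all words", Formula 4 can be sketched in Python as follows. The names `hot_scores` and `top_k` and the `access_counts` mapping are hypothetical; the zero-total guard and the alphabetical tie-break are additions made for this example.

```python
import heapq
from typing import Dict, List


def hot_scores(access_counts: Dict[str, int], target: str) -> Dict[str, float]:
    """Formula 4 for every word except the word to be processed."""
    # The denominator covers all words, including the word to be processed.
    total = sum(access_counts.values())
    if total == 0:
        return {w: 0.0 for w in access_counts if w != target}
    return {w: c / total for w, c in access_counts.items() if w != target}


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Pick the k candidates with the highest recommendation scores."""
    return heapq.nlargest(k, scores, key=lambda w: (scores[w], w))
```

Since every candidate shares the same denominator, ranking by hot_score is equivalent to ranking by raw access count; the division only normalises the scores into the range 0 to 1.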
It should be noted that, the one that above-mentioned formula 4 is just calculated for fever thermometer when the recommender score computing method set are temperature computing method is illustrated, and is not intended to limit the present invention.Those skilled in the art can also adopt other modes to calculate the degree of correlation.Further, in the present invention, the recommender score computing method of setting do not limit to temperature computing method, and it also can be additive method, specifically can arrange according to the actual requirements.
Step 505, choose the high candidate of a setting quantity recommender score from interior chain word as pending word be correlated with from interior chain word.
So far, flow process shown in Fig. 5 is completed.
By flow process shown in Fig. 5, can realize automatic mining go out pending word relevant from interior chain word.Further, preferably, when the recommender score computing method set are temperature computing method, what this pending word excavated was correlated with is also that some are by the hot word of often accessing from interior chain word.
This completes the description of embodiment 2.
As can be seen, through embodiment 1 or embodiment 2 above, the invention automatically mines the internal self-chain words of a word when that word is processed.
The method provided by the invention has been described above.
The device provided by the invention is described below:
Referring to Fig. 6, Fig. 6 is a first structural diagram of a device provided by an embodiment of the invention. The device is applied to the embodiments above. As shown in Fig. 6, the device comprises:
A first determining unit, configured to determine the feature vector words of the word to be processed;
A second determining unit, configured to take the set internal chain words appearing in the results page dedicated to the word to be processed as the candidate internal self-chain words of the word to be processed;
A third determining unit, configured to determine the feature vector words of each candidate internal self-chain word in the same manner in which the first determining unit determines the feature vector words of the word to be processed;
A computing unit, configured to calculate the recommendation score of each candidate internal self-chain word according to the set recommendation score calculation method, using the feature vector words of the word to be processed and the feature vector words of each candidate internal self-chain word;
A selecting unit, configured to select a set number of candidate internal self-chain words with the highest recommendation scores as the internal self-chain words related to the word to be processed.
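The way the five units listed above cooperate can be sketched as a single function that takes each unit's behavior as a callable. All names here (`recommend_internal_self_chain_words`, `feature_words_of`, `internal_chain_words_on_page`, `score`) and the toy data are hypothetical, introduced only to show the data flow between the units.

```python
import heapq
from typing import Callable, List, Sequence


def recommend_internal_self_chain_words(
    target: str,
    feature_words_of: Callable[[str], Sequence[str]],
    internal_chain_words_on_page: Callable[[str], Sequence[str]],
    score: Callable[[Sequence[str], Sequence[str]], float],
    top_n: int,
) -> List[str]:
    target_features = feature_words_of(target)               # first determining unit
    candidates = list(internal_chain_words_on_page(target))  # second determining unit
    features = {c: feature_words_of(c) for c in candidates}  # third determining unit
    scores = {c: score(target_features, features[c]) for c in candidates}  # computing unit
    return heapq.nlargest(top_n, candidates, key=scores.__getitem__)  # selecting unit


# Toy data (illustrative only).
FEATURES = {
    "tea": ["leaf", "drink", "china"],
    "coffee": ["bean", "drink"],
    "milk": ["cow", "drink", "white", "dairy"],
    "rock": ["stone"],
}
LINKS = {"tea": ["coffee", "milk", "rock"]}


def overlap(v: Sequence[str], w: Sequence[str]) -> float:
    # Simple exact-match overlap, normalized by the product of the list sizes.
    return sum(a == b for a in v for b in w) / (len(v) * len(w))


print(recommend_internal_self_chain_words(
    "tea", FEATURES.__getitem__, LINKS.__getitem__, overlap, top_n=2))
```

Passing the units in as callables keeps the sketch independent of how feature vector words are actually determined or how the score is defined.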
Preferably, the first determining unit determines the document of the results page dedicated to the word to be processed, determines a set threshold number of words having the highest relevance to the document, and takes the determined words as the feature vector words of the word to be processed.
Preferably, when determining the set threshold number of words having the highest relevance to the document, the first determining unit performs word segmentation and noise removal on the word to be processed to obtain a corresponding result, extracts from the result the words that satisfy a set rule as topic words, calculates the relevance of each topic word to the document, and selects the set threshold number of topic words having the highest relevance to the document as the feature vector words of the word to be processed.
Preferably, in the invention, the first determining unit calculates the relevance of each topic word to the document by a formula whose terms are defined as follows:
w denotes any topic word; d denotes the document; score(w, d) denotes the relevance between topic word w and the document; f(w, d) denotes the number of occurrences of topic word w in the results page dedicated to the word to be processed; D is the length of the document; avgDL is the average document length of all results pages dedicated to the words in the word category to which the word to be processed belongs; k_1 and b are set parameters for calculating the relevance; and IDF(w) is determined by a further formula in which n denotes the total number of results pages dedicated to the words in that word category and n(w) denotes the number of those results pages that contain topic word w.
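The terms defined above (f(w, d), D, avgDL, k_1, b, and IDF(w)) match those of the Okapi BM25 ranking function. The formula itself does not appear in this text, so the following is only a sketch that assumes the standard BM25 per-term score and the standard BM25 IDF. The function names, parameter names, and the default values k1 = 1.2 and b = 0.75 are illustrative assumptions, not values taken from the source.

```python
import math


def idf(total_pages: int, pages_with_word: int) -> float:
    # Assumed Okapi-style IDF; the source's own IDF formula may differ.
    return math.log((total_pages - pages_with_word + 0.5) / (pages_with_word + 0.5))


def bm25_term_score(
    freq: int,             # f(w, d): occurrences of topic word w in the page
    doc_len: float,        # D: length of the document
    avg_doc_len: float,    # avgDL: average length over the category's pages
    total_pages: int,      # total number of results pages in the category
    pages_with_word: int,  # n(w): pages in the category containing w
    k1: float = 1.2,       # assumed default
    b: float = 0.75,       # assumed default
) -> float:
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf(total_pages, pages_with_word) * freq * (k1 + 1.0) / (freq + norm)


print(bm25_term_score(freq=3, doc_len=120, avg_doc_len=100,
                      total_pages=1000, pages_with_word=10))
```

Under these assumptions, a topic word that never occurs in the page scores 0, and the score grows with frequency but saturates, which is why a top-k cut by score tends to favor words that are both frequent in the page and rare across the category.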
Preferably, when determining the feature vector words of the word to be processed, the first determining unit first finds, in the results page dedicated to the word to be processed, other words in the preset knowledge base that have dedicated results pages of their own, and then selects a set threshold number of the found words as the feature vector words of the word to be processed.
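A minimal sketch of this alternative is given below. The text does not say how the set threshold number of words is chosen among those found, so this example assumes they are ranked by how often they occur in the page; the function name, the plain substring matching, and that ranking rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List


def feature_words_from_knowledge_base(
    page_text: str,
    target: str,
    knowledge_base_words: Iterable[str],
    threshold: int,
) -> List[str]:
    """Pick up to `threshold` knowledge-base words that occur in the page."""
    counts: Counter = Counter()
    for word in knowledge_base_words:
        if word == target:
            continue  # only *other* words are considered
        occurrences = page_text.count(word)  # simple substring match
        if occurrences > 0:
            counts[word] = occurrences
    # Assumed selection rule: most frequent words first.
    return [word for word, _ in counts.most_common(threshold)]


page = "tea tea tea milk milk sugar"
kb = ["tea", "milk", "sugar", "coffee", "drink"]
print(feature_words_from_knowledge_base(page, "drink", kb, threshold=2))
```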
Preferably, for each candidate internal self-chain word, the computing unit calculates, according to the set relevance calculation method, the relevance between all feature vector words of the word to be processed and all feature vector words of the candidate, and uses the calculated relevance as the candidate's recommendation score.
Preferably, the computing unit calculates the relevance between all feature vector words of a candidate internal self-chain word and all feature vector words of the word to be processed by the following formula:
$$\mathrm{recommend\_score}(x, y) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{match}(v_i, w_j)}{\mathrm{length}(v) \cdot \mathrm{length}(w)};$$
where x is the word to be processed; y is any candidate internal self-chain word; recommend_score(x, y) is the relevance between all feature vector words of x and all feature vector words of y; n is the number of feature vector words of x; m is the number of feature vector words of y; length(v) is the total length of all feature vector words of x; length(w) is the total length of all feature vector words of y; and match(v_i, w_j) denotes the degree of match between the i-th feature vector word of x and the j-th feature vector word of y, with match(v_i, w_j) = 1 when v_i equals w_j and match(v_i, w_j) = 0 otherwise.
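The recommend_score formula can be illustrated with a short Python sketch. One point is an assumption: the text calls length(v) and length(w) the "total length" of the feature vector words, and this example interprets that as the number of feature vector words in each list. The function name and the handling of empty lists are likewise choices made for the example.

```python
from typing import Sequence


def recommend_score(v: Sequence[str], w: Sequence[str]) -> float:
    """Exact-match overlap between two feature-vector-word lists.

    v: feature vector words of the word to be processed (x)
    w: feature vector words of a candidate internal self-chain word (y)
    """
    if not v or not w:
        return 0.0  # assumption: no feature words means no relevance
    # Double sum of match(v_i, w_j), where match is 1 on equality, else 0.
    matches = sum(1 for vi in v for wj in w if vi == wj)
    # Assumption: length(.) is taken as the number of feature vector words.
    return matches / (len(v) * len(w))


x_features = ["a", "b", "c"]
y_features = ["b", "c", "d", "e"]
print(recommend_score(x_features, y_features))
```

Under this interpretation the score lies between 0 and 1 and is symmetric in its two arguments.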
This completes the description of the structure of the device shown in Fig. 6.
As another embodiment, the invention also provides another device that is independent of the device shown in Fig. 6. Referring to Fig. 7, Fig. 7 is another structural diagram of a device provided by an embodiment of the invention. As shown in Fig. 7, the device may comprise:
A determining unit, configured to take the words in the preset knowledge base other than the word to be processed as the candidate internal self-chain words of the word to be processed;
An acquiring unit, configured to obtain the number of times each candidate internal self-chain word was accessed by users within the set time period;
A first computing unit, configured to calculate the total number of times all words in the knowledge base were accessed by users within the set time period;
A second computing unit, configured to calculate the recommendation score of each candidate internal self-chain word according to the set recommendation score calculation method, using the number of times the candidate was accessed by users within the set time period and the total number of times all words in the knowledge base were accessed by users within that period;
A selecting unit, configured to select a set number of candidate internal self-chain words with the highest recommendation scores as the internal self-chain words related to the word to be processed.
Preferably, when calculating the recommendation score of each candidate internal self-chain word, the second computing unit calculates, for each candidate, the popularity of the candidate according to the set popularity calculation method, using the number of times the candidate was accessed by users within the set time period and the total number of times all words in the knowledge base were accessed by users within that period, and uses the calculated popularity as the candidate's recommendation score.
Preferably, the second computing unit calculates the popularity of a candidate internal self-chain word by the following formula:
$$\mathrm{hot\_score}(v) = \frac{\sum_{k=0}^{n} p(v)}{\sum_{k=0}^{n} p(u)}$$
where v denotes any candidate internal self-chain word, hot_score(v) denotes the popularity of candidate v, $\sum_{k=0}^{n} p(v)$ denotes the number of times v was accessed by users within the set time period, and $\sum_{k=0}^{n} p(u)$ denotes the total number of times all words in the knowledge base were accessed by users within that period.
This completes the description of the structure of the device shown in Fig. 7.
As can be seen from the above technical solutions, the invention determines the feature vector words of the word to be processed and its candidate internal self-chain words, calculates the recommendation score of each candidate using the feature vector words of the word to be processed and those of the candidate, and selects a set number of candidates with the highest recommendation scores as the internal self-chain words related to the word to be processed. In this way, the internal self-chain words of a word are mined automatically when that word is processed.
Furthermore, because the internal self-chain words related to a word are recommended automatically when the word is processed, users can find words of interest among the recommended internal self-chain words without having to search again themselves. This improves the efficiency of accessing words in the knowledge base and also saves the resources that would otherwise be wasted by users' frequent accesses to the knowledge base.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (17)

1. A data processing method, characterized in that the method comprises:
Determining the feature vector words of a word to be processed;
Taking the set internal chain words appearing in the results page dedicated to the word to be processed as the candidate internal self-chain words of the word to be processed;
Determining the feature vector words of each candidate internal self-chain word in the same manner as the feature vector words of the word to be processed are determined;
Calculating the recommendation score of each candidate internal self-chain word according to a set recommendation score calculation method, using the feature vector words of the candidate and the feature vector words of the word to be processed;
Selecting a set number of candidate internal self-chain words with the highest recommendation scores as the internal self-chain words related to the word to be processed.
2. The method according to claim 1, characterized in that determining the feature vector words of the word to be processed comprises:
Determining the document of the results page dedicated to the word to be processed;
Determining a set threshold number of words having the highest relevance to the document;
Taking the determined words as the feature vector words of the word to be processed.
3. The method according to claim 2, characterized in that determining the set threshold number of words having the highest relevance to the document comprises:
Performing word segmentation and noise removal on the word to be processed to obtain a corresponding result;
Extracting from the result the words that satisfy a set rule as topic words;
Calculating the relevance of each topic word to the document;
Selecting the set threshold number of topic words having the highest relevance to the document.
4. The method according to claim 3, characterized in that the relevance of each topic word to the document is calculated by a formula whose terms are defined as follows:
w denotes any topic word; d denotes the document; score(w, d) denotes the relevance between topic word w and the document; f(w, d) denotes the number of occurrences of topic word w in the results page dedicated to the word to be processed; D is the length of the document; avgDL is the average document length of all results pages dedicated to the words in the word category to which the word to be processed belongs; k_1 and b are set parameters for calculating the relevance; and IDF(w) is determined by a further formula in which n denotes the total number of results pages dedicated to the words in that word category and n(w) denotes the number of those results pages that contain topic word w.
5. The method according to claim 1, characterized in that determining the feature vector words of the word to be processed comprises:
Finding, in the results page dedicated to the word to be processed, other words in the preset knowledge base that have dedicated results pages;
Selecting a set threshold number of the found words as the feature vector words of the word to be processed.
6. The method according to claim 1, characterized in that calculating the recommendation score of each candidate internal self-chain word according to the set recommendation score calculation method, using the feature vector words of the candidate and the feature vector words of the word to be processed, comprises:
For each candidate internal self-chain word, calculating, according to a set relevance calculation method, the relevance between all feature vector words of the candidate and all feature vector words of the word to be processed, and using the calculated relevance as the candidate's recommendation score.
7. The method according to claim 6, characterized in that the relevance between all feature vector words of the candidate internal self-chain word and all feature vector words of the word to be processed is calculated by the following formula:
$$\mathrm{recommend\_score}(x, y) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{match}(v_i, w_j)}{\mathrm{length}(v) \cdot \mathrm{length}(w)};$$
where x is the word to be processed; y is any candidate internal self-chain word; recommend_score(x, y) is the relevance between all feature vector words of x and all feature vector words of y; n is the number of feature vector words of x; m is the number of feature vector words of y; length(v) is the total length of all feature vector words of x; length(w) is the total length of all feature vector words of y; and match(v_i, w_j) denotes the degree of match between the i-th feature vector word of x and the j-th feature vector word of y, with match(v_i, w_j) = 1 when v_i equals w_j and match(v_i, w_j) = 0 otherwise.
8. A data processing method, characterized in that the method comprises:
Taking the words in a preset knowledge base other than a word to be processed as the candidate internal self-chain words of the word to be processed;
Obtaining the number of times each candidate internal self-chain word was accessed by users within a set time period;
Calculating the total number of times all words in the knowledge base were accessed by users within the set time period;
Calculating the recommendation score of each candidate internal self-chain word according to a set recommendation score calculation method, using the number of times the candidate was accessed by users within the set time period and the total number of times all words in the knowledge base were accessed by users within that period;
Selecting a set number of candidate internal self-chain words with the highest recommendation scores as the internal self-chain words related to the word to be processed.
9. The method according to claim 8, characterized in that calculating the recommendation score of each candidate internal self-chain word according to the set recommendation score calculation method, using the number of times the candidate was accessed by users within the set time period and the calculated total number of times all words in the knowledge base were accessed by users within that period, comprises:
For each candidate internal self-chain word, calculating the popularity of the candidate according to a set popularity calculation method, using the number of times the candidate was accessed by users within the set time period and the total number of times all words in the knowledge base were accessed by users within that period, and using the calculated popularity as the candidate's recommendation score.
10. The method according to claim 9, characterized in that the popularity of the candidate internal self-chain word is calculated by the following formula:
$$\mathrm{hot\_score}(v) = \frac{\sum_{k=0}^{n} p(v)}{\sum_{k=0}^{n} p(u)}$$
where v denotes any candidate internal self-chain word, hot_score(v) denotes the popularity of candidate v, $\sum_{k=0}^{n} p(v)$ denotes the number of times v was accessed by users within the set time period, and $\sum_{k=0}^{n} p(u)$ denotes the total number of times all words in the knowledge base were accessed by users within that period.
11. A data processing device, characterized in that the device comprises:
A first determining unit, configured to determine the feature vector words of a word to be processed;
A second determining unit, configured to take the set internal chain words appearing in the results page dedicated to the word to be processed as the candidate internal self-chain words of the word to be processed;
A third determining unit, configured to determine the feature vector words of each candidate internal self-chain word in the same manner in which the first determining unit determines the feature vector words of the word to be processed;
A computing unit, configured to calculate the recommendation score of each candidate internal self-chain word according to a set recommendation score calculation method, using the feature vector words of the candidate and the feature vector words of the word to be processed;
A selecting unit, configured to select a set number of candidate internal self-chain words with the highest recommendation scores as the internal self-chain words related to the word to be processed.
12. The device according to claim 11, characterized in that, when determining the feature vector words of the word to be processed, the first determining unit first determines the document of the results page dedicated to the word to be processed, determines a set threshold number of words having the highest relevance to the document, and takes the determined words as the feature vector words of the word to be processed.
13. The device according to claim 12, characterized in that, when determining the set threshold number of words having the highest relevance to the document, the first determining unit performs word segmentation and noise removal on the word to be processed to obtain a corresponding result, extracts from the result the words that satisfy a set rule as topic words, calculates the relevance of each topic word to the document, and selects the set threshold number of topic words having the highest relevance to the document as the feature vector words of the word to be processed.
14. The device according to claim 11, characterized in that, when determining the feature vector words of the word to be processed, the first determining unit finds, in the results page dedicated to the word to be processed, other words in the preset knowledge base that have dedicated results pages, and selects a set threshold number of the found words as the feature vector words of the word to be processed.
15. The device according to claim 11, characterized in that, when calculating the recommendation score of each candidate internal self-chain word, the computing unit calculates, for each candidate and according to a set relevance calculation method, the relevance between all feature vector words of the word to be processed and all feature vector words of the candidate, and uses the calculated relevance as the candidate's recommendation score.
16. A data processing device, characterized in that the device comprises:
A determining unit, configured to take the words in a preset knowledge base other than a word to be processed as the candidate internal self-chain words of the word to be processed;
An acquiring unit, configured to obtain the number of times each candidate internal self-chain word was accessed by users within a set time period;
A first computing unit, configured to calculate the total number of times all words in the knowledge base were accessed by users within the set time period;
A second computing unit, configured to calculate the recommendation score of each candidate internal self-chain word according to a set recommendation score calculation method, using the number of times the candidate was accessed by users within the set time period and the total number of times all words in the knowledge base were accessed by users within that period;
A selecting unit, configured to select a set number of candidate internal self-chain words with the highest recommendation scores as the internal self-chain words related to the word to be processed.
17. The device according to claim 16, characterized in that, when calculating the recommendation score of each candidate internal self-chain word, the second computing unit calculates, for each candidate, the popularity of the candidate according to a set popularity calculation method, using the number of times the candidate was accessed by users within the set time period and the total number of times all words in the knowledge base were accessed by users within that period, and uses the calculated popularity as the candidate's recommendation score.
CN201310489328.XA 2013-10-18 2013-10-18 Data processing method and device Pending CN104572612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310489328.XA CN104572612A (en) 2013-10-18 2013-10-18 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310489328.XA CN104572612A (en) 2013-10-18 2013-10-18 Data processing method and device

Publications (1)

Publication Number Publication Date
CN104572612A true CN104572612A (en) 2015-04-29

Family

ID=53088716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310489328.XA Pending CN104572612A (en) 2013-10-18 2013-10-18 Data processing method and device

Country Status (1)

Country Link
CN (1) CN104572612A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195607A1 (en) * 2000-05-17 2008-08-14 Matsushita Electric Industrial Co., Ltd. Information recommendation apparatus and information recommendation system
CN101432714A (en) * 2004-09-14 2009-05-13 A9.Com公司 Methods and apparatus for automatic generation of recommended links
CN102063469A (en) * 2010-12-03 2011-05-18 百度在线网络技术(北京)有限公司 Method and device for acquiring relevant keyword message and computer equipment
CN102708100A (en) * 2011-03-28 2012-10-03 北京百度网讯科技有限公司 Method and device for digging relation keyword of relevant entity word and application thereof
CN103150382A (en) * 2013-03-14 2013-06-12 中国科学院计算技术研究所 Automatic short text semantic concept expansion method and system based on open knowledge base

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919347A (en) * 2021-12-14 2022-01-11 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN113919347B (en) * 2021-12-14 2022-04-05 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150429