CN103279529A - Unstructured data retrieval method and system - Google Patents


Publication number
CN103279529A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102105709A
Other languages
Chinese (zh)
Inventor
鄂海红
宋美娜
韩晶
许可
宋俊德
黎燕
毕建鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN2013102105709A
Publication of CN103279529A

Abstract

The invention provides an unstructured data retrieval method comprising the following steps: collecting user behavior data; periodically processing the user behavior data so as to merge the task attributes of the user behavior data within a predetermined time period into a task list; performing a keyword search according to a user's search request to obtain a plurality of search results; calculating a task score, an access count score and an edit duration score for each search result, wherein the task score is the similarity between the task attributes of the search result and the task attributes in the task list; calculating the data heat of the search results based on the task score, the access count score and the edit duration score; and re-ranking the search results according to the calculated data heat. The method can improve the retrieval efficiency and retrieval accuracy of unstructured data. The invention further provides an unstructured data retrieval system.

Description

Unstructured data retrieval method and system
Technical field
The present invention relates to the field of computer technology, and in particular to an unstructured data retrieval method and system.
Background
In the big data era, enterprises have accumulated large amounts of unstructured data such as office documents, PDFs and videos. These data arise from the many operations employees perform in the course of their work: some are created by the employees themselves, some come from e-mail, and some are downloaded from the network. Finding a needed file among this accumulated unstructured data takes repeated search attempts and considerable time. To improve unstructured data retrieval, much research on search ranking methods has emerged: some studies improve ranking by establishing links between files, some improve retrieval efficiency by recording the user's search history, and others improve it by letting the user add what they remember about the target data. Existing research on ranking search results is mostly carried out under the premise that "all data are equal"; it does not consider how the relationship between data and user behavior affects ranking, and existing solutions seldom address the question of data importance.
At present, the main approaches to ranking unstructured data retrieval results, in China and abroad, are: 1) proposing an algorithm that creates new links between desktop resources by mining the single-access patterns and access frequency of files; 2) proposing a learning-based ranking algorithm, whose efficiency is much higher than that of ranking algorithms based on basic file attributes; 3) proposing a desktop search method based on user memory, in which the user supplies what they remember about the target file (such as the file name or last access time) at search time to improve ranking efficiency; 4) proposing task-based location of data resources, without considering how to use tasks during retrieval; 5) proposing desktop resource location based on user activity analysis, which, besides automatically extracting user tasks, also supports fuzzy search by mining links between desktop resources.
Existing solutions for ranking unstructured data retrieval results focus mainly on the data themselves. They do not consider how the relationship between data and user behavior affects ranking, and they seldom address the question of data importance. In fact, however, every operation on data takes place within some task of the user. If the task context of the data is identified, and the search results whose task attributes are similar to the task the user is currently searching for are ranked higher, the user will clearly be more satisfied. In addition, while a user completes a job or task, some data are operated on frequently (for example, the project requirements document in a project), whereas other data are operated on only a few times (for example, a technical article from the network); that is, different data belonging to the same task differ in importance.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, one object of the present invention is to propose an unstructured data retrieval method that can improve the retrieval efficiency and retrieval accuracy of unstructured data.
Another object of the present invention is to propose an unstructured data retrieval system.
To achieve the above objects, an embodiment of the first aspect of the present invention discloses an unstructured data retrieval method comprising the following steps: collecting user behavior data; periodically processing the user behavior data to merge the task attributes of the user behavior data within a predetermined time period into a task list; performing a keyword search according to a user's search request to obtain a plurality of search results; calculating a task score, an access count score and an edit duration score for each search result, wherein the task score is the similarity between the task attributes of the search result and the task attributes in the task list; performing a data heat calculation on the plurality of search results based on the task score, the access count score and the edit duration score; and re-ranking the plurality of search results according to the data heat calculation.
The unstructured data retrieval method according to the embodiment of the present invention can improve the retrieval efficiency of unstructured data. It not only retains the accuracy of keyword retrieval but also, by taking into account factors such as the task similarity of the unstructured data, its importance within the task, the keyword matching degree, the access count and the edit duration, effectively improves retrieval accuracy and the match with the purpose of the retrieval.
In addition, the unstructured data retrieval method according to the above embodiment of the present invention may have the following additional technical features:
In some examples, the method further comprises the step of displaying the re-ranked plurality of search results.
In some examples, the user behavior data are obtained by analyzing the user's behavior log.
In some examples, the method further comprises the step of calculating an initial ranking score for each search result, wherein the data heat calculation on the plurality of search results is based on the task score, the access count score, the edit duration score and the initial ranking score.
In some examples, the data heat is calculated by the formula:
heat_score = p*taskScore*(t1 + t2*accessScore + t3*edittimeScore) + q*initScore, where p, q, t1, t2 and t3 are weights.
In some examples, p:q:t1:t2:t3 = 95:5:0.9:0.07:0.03.
In some examples, the method further comprises adjusting the weights used in the data heat calculation according to the data type of each search result and the application scenario.
In some examples, the method further comprises the steps of clustering the plurality of search results, and ranking the search results within each resulting cluster separately.
In some examples, the step of clustering the plurality of search results specifically comprises: obtaining the relevance of each of the plurality of search results to the task; and clustering the search results according to that relevance.
An embodiment of the second aspect of the present invention discloses an unstructured data retrieval system comprising: an acquisition module for collecting user behavior data; a processing module for periodically processing the user behavior data to merge the task attributes of the user behavior data within a predetermined time period into a task list, for calculating a task score, an access count score and an edit duration score for each search result, wherein the task score is the similarity between the task attributes of the search result and the task attributes in the task list, and for performing a data heat calculation on the plurality of search results based on the task score, the access count score and the edit duration score; and a retrieval module for performing a keyword search according to a user's search request to obtain the plurality of search results, and for re-ranking the plurality of search results according to the data heat calculation.
The unstructured data retrieval system according to the embodiment of the present invention can improve the retrieval efficiency of unstructured data. It not only retains the accuracy of keyword retrieval but also, by taking into account factors such as the task similarity of the unstructured data, its importance within the task, the keyword matching degree, the access count and the edit duration, effectively improves retrieval accuracy and the match with the purpose of the retrieval.
In addition, the unstructured data retrieval system according to the above embodiment of the present invention may have the following additional technical features:
In some examples, the system further comprises a display module for displaying the re-ranked plurality of search results.
In some examples, the user behavior data are obtained by analyzing the user's behavior log.
In some examples, the processing module is further used to calculate an initial ranking score for each search result, wherein the data heat calculation on the plurality of search results is based on the task score, the access count score, the edit duration score and the initial ranking score.
In some examples, the data heat is calculated by the formula:
heat_score = p*taskScore*(t1 + t2*accessScore + t3*edittimeScore) + q*initScore, where p, q, t1, t2 and t3 are weights.
In some examples, p:q:t1:t2:t3 = 95:5:0.9:0.07:0.03.
In some examples, the processing module is further used to adjust the weights used in the data heat calculation according to the data type of each search result and the application scenario.
In some examples, the system further comprises a clustering module for clustering the plurality of search results, the retrieval module then ranking the search results within each resulting cluster separately.
In some examples, the clustering of the plurality of search results by the clustering module specifically comprises: obtaining the relevance of each of the plurality of search results to the task; and clustering the search results according to that relevance.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Description of drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of an unstructured data retrieval method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of an unstructured data retrieval system according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of the retrieval process of an unstructured data retrieval system according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals denote identical or similar elements, or elements having identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
In the description of the present invention, it should be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. It should also be noted that, unless otherwise expressly specified and limited, the terms "linked" and "connected" are to be understood broadly: a connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances. In addition, in the description of the present invention, unless otherwise stated, "a plurality of" means two or more.
Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present invention belong.
The unstructured data retrieval method and system according to embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of an unstructured data retrieval method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S101: collect user behavior data.
Here, user behavior data are the data generated by the user's operations on unstructured data, for example behavior data for operations such as editing and browsing unstructured data.
In one embodiment of the present invention, the user behavior data are obtained by analyzing the user's behavior log. Specifically, the behavior data for the user's operations on unstructured data are stored in log files, referred to here as the behavior log. The user behavior data can be extracted by analyzing this behavior log.
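As a rough illustration of how access counts and edit durations might be extracted from a behavior log, the following Python sketch parses a hypothetical comma-separated log (timestamp, action, file, seconds). The log layout, the action names and the function name summarize_log are assumptions made for this example; the source does not specify a log format.

```python
from collections import defaultdict

# Hypothetical behavior log: timestamp,action,file,seconds (format assumed).
SAMPLE_LOG = """\
2013-05-30T09:00:00,open,requirements.doc,0
2013-05-30T09:05:00,edit,requirements.doc,600
2013-05-30T10:00:00,open,article.pdf,0
2013-05-30T11:00:00,edit,requirements.doc,300
"""


def summarize_log(log_text):
    """Return (access_counts, edit_seconds) per file from the log text."""
    access_counts = defaultdict(int)
    edit_seconds = defaultdict(int)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _timestamp, action, file_name, seconds = line.split(",")
        # Assumption: every logged operation counts as one access.
        access_counts[file_name] += 1
        if action == "edit":
            edit_seconds[file_name] += int(seconds)
    return dict(access_counts), dict(edit_seconds)


access_counts, edit_seconds = summarize_log(SAMPLE_LOG)
print(access_counts)
print(edit_seconds)
```

In this sketch a file that is only opened never appears in edit_seconds, so a caller would treat a missing entry as an edit duration of zero.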
Step S102: periodically process the user behavior data to merge the task attributes of the user behavior data within a predetermined time period into a task list.
Specifically, research on "human behavior dynamics" shows that a person's behavior (for example, a user's operations on unstructured data) can be regarded as the handling of a series of tasks, and that a given task tends to be completed within a concentrated period of time. A user's operations are therefore closely related to the tasks carried out recently, and user behavior data can be regarded as related to one task or a series of tasks.
Accordingly, the user behavior data are processed periodically to merge the task attributes of the user behavior data within the predetermined time period into a predefined task list. In this example the predetermined time period is, for example but not limited to, 2 days; that is, the union of the task attributes of the user interaction data (user behavior data) over those days is taken and written to the task list. For example, if user behavior data A are related to task B, task B is added to the task list.
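The merge described above can be sketched as a set union over the events that fall inside the time window. The event representation (a timestamp paired with a set of task attributes), the function name update_task_list and the sample task names are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def update_task_list(events, now, window=timedelta(days=2)):
    """Return the union of task attributes of events inside the window.

    events: iterable of (timestamp, set_of_task_attributes) pairs
            (a hypothetical representation of user behavior data).
    """
    cutoff = now - window
    tasks = set()
    for timestamp, task_attributes in events:
        if cutoff <= timestamp <= now:
            tasks.update(task_attributes)
    return tasks


now = datetime(2013, 5, 30, 12, 0)
events = [
    (datetime(2013, 5, 30, 9, 0), {"project A"}),
    (datetime(2013, 5, 29, 15, 0), {"project A", "defense PPT"}),
    (datetime(2013, 5, 20, 10, 0), {"project C"}),  # older than 2 days
]
recent_tasks = update_task_list(events, now)
print(sorted(recent_tasks))
```

Running the function on a schedule (for instance once a day) would correspond to the periodic processing of step S102; the scheduling itself is left out of the sketch.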
It should be noted that unstructured data need to be annotated with attributes in advance, for example by using the unstructured data galaxy model to characterize the data with attributes (including, for example, the task to which a file belongs (the task attribute), the file access count and the file edit duration).
For example, the attributes of unstructured data are described using the unstructured data galaxy model. As shown in Table 1, the attributes of unstructured data are defined as fi, so that the attributes identify the features of the unstructured data. Typical attributes include, for example:
Table 1
[Table 1 is provided as an image in the original publication (reference BDA00003277988300071) and is not reproduced here.]
Step S103: perform a keyword search according to the user's search request to obtain a plurality of search results.
For example, a user who is preparing the acceptance defense PPT for project A wants to consult other PPTs produced during the project, and can therefore search with the keywords "project A; PPT" to obtain a plurality of search results, each of which is a piece of unstructured data.
Step S104: calculate a task score, an access count score and an edit duration score for each search result, where the task score is the similarity between the task attributes of the search result and the task attributes in the task list. The access count score is derived, for example, from the number of times the user has accessed the unstructured data of the search result. The edit duration score is derived, for example, from the time the user has spent editing the unstructured data of the search result.
Of course, in other examples of the present invention, an initial ranking score may also be calculated for each search result, and the data heat calculation on the plurality of search results may be based on the task score, the access count score, the edit duration score and the initial ranking score. The initial ranking score can be generated from the rank of each search result in the keyword search described above; for example, the higher the rank, the higher the corresponding initial ranking score.
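The source states only that a higher keyword-search rank gives a higher initial ranking score. One possible mapping, shown here purely as an assumption, is a linear one that assigns (n - position) / n to the result at zero-based position among n results.

```python
def init_scores(ranked_file_names):
    """Map each result to an initial ranking score in (0, 1].

    Assumption: a linear mapping; the source does not fix a formula.
    """
    n = len(ranked_file_names)
    return {
        name: (n - position) / n
        for position, name in enumerate(ranked_file_names)
    }


scores = init_scores(["a.ppt", "b.ppt", "c.ppt", "d.ppt"])
print(scores)
```

Any other monotonically decreasing function of the rank would satisfy the same stated property.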
Step S105: perform the data heat calculation on the plurality of search results based on the task score, the access count score and the edit duration score.
Specifically, the data heat (that is, the data heat score) of a file (for example, a piece of unstructured data) represents the importance of that data within the task to which it belongs. The score is computed jointly from the file's access count, edit duration and task matching degree.
For example, let sim(fileTask, recentTask) be the similarity between the task attribute vector fileTask of file f_i and the recent task vector recentTask; then
taskScore = sim(fileTask, recentTask) (1)
Let the access count of file f_i be a_i, and let A = {a_j | 0 < j < n and f_i.taskScore = f_j.taskScore}; then
accessScore = a_i / 2Max A (2)
Let the edit duration of file f_i be et_i, and let ET = {et_j | 0 < j < n and f_i.taskScore = f_j.taskScore}; then
edittimeScore = et_i / 2Max ET (3)
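Formulas (2) and (3) share one shape: a file's value divided by a term built from the maximum over the files with the same task score. The sketch below reads the denominator as 2 * max, which is an assumption about the extracted notation, and the zero-maximum guard is likewise an addition for the example.

```python
def normalized_score(value, group_values):
    """Score a value against the maximum of its group.

    Assumption: the denominator in formulas (2) and (3) is read as
    2 * max(group), so scores fall in the range [0, 0.5].
    """
    peak = max(group_values)
    if peak == 0:
        return 0.0  # guard added for the sketch: nothing was accessed or edited
    return value / (2 * peak)


# Access counts of files that share the same task score (sample values).
access_counts = [4, 10, 1]
access_scores = [normalized_score(a, access_counts) for a in access_counts]
print(access_scores)
```

The same function applies to edit durations by passing the set ET instead of A.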
The data heat is then calculated as:
heat_score = p*taskScore*(t1 + t2*accessScore + t3*edittimeScore) + q*initScore (4)
where p, q, t1, t2 and t3 are weights; in one embodiment of the present invention, p:q:t1:t2:t3 = 95:5:0.9:0.07:0.03.
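Formula (4) can be written directly as a small function. The default weights below scale the stated ratio to p = 0.95 and q = 0.05 while keeping t1 = 0.9, t2 = 0.07 and t3 = 0.03; that scaling is an assumption, since the source gives only the ratio.

```python
def heat_score(task_score, access_score, edittime_score, init_score,
               p=0.95, q=0.05, t1=0.9, t2=0.07, t3=0.03):
    """Data heat per formula (4); default weights are an assumed scaling."""
    activity = t1 + t2 * access_score + t3 * edittime_score
    return p * task_score * activity + q * init_score


# A file fully matching the recent tasks, with the highest access count and
# edit duration in its group (0.5 each under the 2 * max reading) and the
# top initial rank.
example = heat_score(task_score=1.0, access_score=0.5,
                     edittime_score=0.5, init_score=1.0)
print(round(example, 4))
```

With these inputs the activity term is 0.95, so the result is 0.95 * 0.95 + 0.05, roughly 0.9525. Because taskScore multiplies the whole activity term, a file unrelated to the recent tasks keeps only the q * initScore contribution however often it was accessed.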
In addition, when the data heat is calculated with the above formula for different types of unstructured data, the weights of attribute scores such as the access count, edit duration and task matching degree can be adjusted according to the characteristics of the data and the application scenario; that is, the weights used in the data heat calculation are adjusted according to the data type of each search result and the application scenario, so as to achieve the best ranking.
As a concrete example, to determine how relevant the files in the search results are to the tasks the user has recently carried out, the user log needs to be recorded and analyzed in order to compute the recent task vector.
The set F of files the user has recently accessed is computed, and the recent task vector recentTask=(rtask1, rtask2, ...) is built from the task attributes of the files in F.
After a query is submitted, the user's query keywords are extracted and recorded as the vector userQuery=(keyw1, keyw2, ...), where keyw1 and keyw2 denote query keywords. The keyword vector userQuery is submitted to a search engine (for example, the search engine of the Windows system), which returns the initial result set InitF. Each result file has task attributes, which can be recorded as a vector fileTask=(ftask1, ftask2, ...), where ftask1 and ftask2 denote the labels of the task attributes the file has. The data heat is then computed jointly from the task score taskScore, the access count score accessScore and the edit duration score edittimeScore.
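The source does not define the similarity function sim(fileTask, recentTask). As one possible choice, assumed here only for illustration, the Jaccard similarity of the two sets of task labels can be used; cosine similarity over binary vectors would be another option.

```python
def task_similarity(file_tasks, recent_tasks):
    """Jaccard similarity between two collections of task labels.

    Assumption: Jaccard is a stand-in; the source leaves sim() unspecified.
    """
    file_set, recent_set = set(file_tasks), set(recent_tasks)
    if not file_set or not recent_set:
        return 0.0
    return len(file_set & recent_set) / len(file_set | recent_set)


file_task = ("project A", "report")
recent_task = ("project A", "defense PPT")
task_score = task_similarity(file_task, recent_task)
print(task_score)
```

Here one label out of three distinct labels is shared, so the task score is one third.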
Step S106: re-rank the plurality of search results according to the data heat calculation. For example, search results with higher data heat scores can be ranked higher.
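A minimal sketch of the re-ranking step is a descending sort on the heat score. The pair representation and the sample scores are hypothetical; Python's sort is stable, so results with equal heat keep their initial keyword-search order.

```python
def rerank(results):
    """Order results by heat score, highest first.

    results: list of (file_name, heat_score) pairs in initial search order.
    """
    ordered = sorted(results, key=lambda item: item[1], reverse=True)
    return [file_name for file_name, _heat in ordered]


initial = [
    ("article.pdf", 0.31),
    ("requirements.doc", 0.95),
    ("notes.txt", 0.31),
]
ranked = rerank(initial)
print(ranked)
```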
In addition, access count and edit duration are numeric values with a wide range between their minimum and maximum, and excessively large or small attribute values should not be allowed to unduly affect the ranking score. For example, log files generated by software have very high access counts but usually have little to do with the user's tasks.
Therefore, embodiments of the present invention may first cluster the plurality of search results and then rank the search results within each cluster separately. Specifically, the step of clustering the plurality of search results comprises: obtaining the relevance of each of the plurality of search results to the task; and clustering the search results according to that relevance. For example, an existing clustering method can be used: the result files are first divided into 3 levels according to their task relevance score, the maxima of A and ET are then taken over the search results at the same task level, and finally the access count score and the edit duration score are calculated.
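The three-level grouping can be sketched with fixed thresholds on the task score, taking the maximum access count separately within each level. The thresholds (1/3 and 2/3), the dictionary fields and the 2 * max reading of the denominator are all assumptions; the source says only that an existing clustering method may be used.

```python
from collections import defaultdict


def task_level(task_score):
    """Bucket a task score into one of 3 levels (thresholds assumed)."""
    if task_score >= 2 / 3:
        return 2
    if task_score >= 1 / 3:
        return 1
    return 0


def access_scores_by_level(files):
    """Normalize access counts against the maximum within each task level."""
    peaks = defaultdict(int)
    for f in files:
        level = task_level(f["task_score"])
        peaks[level] = max(peaks[level], f["access_count"])
    scores = {}
    for f in files:
        peak = peaks[task_level(f["task_score"])]
        scores[f["name"]] = f["access_count"] / (2 * peak) if peak else 0.0
    return scores


files = [
    {"name": "requirements.doc", "task_score": 0.9, "access_count": 40},
    {"name": "article.pdf", "task_score": 0.8, "access_count": 4},
    {"name": "app.log", "task_score": 0.1, "access_count": 5000},
]
level_scores = access_scores_by_level(files)
print(level_scores)
```

In this sample the software-generated log sits in a different level from the two project files, so its access count of 5000 does not shrink their scores; its own low task score then keeps its data heat low.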
Further, after the re-ranking, the method may also comprise the step of displaying the re-ranked plurality of search results, so that the user can conveniently view them.
The unstructured data retrieval method according to the embodiment of the present invention can improve the retrieval efficiency of unstructured data. It not only retains the accuracy of keyword retrieval but also, by taking into account factors such as the task similarity of the unstructured data, its importance within the task, the keyword matching degree, the access count and the edit duration, effectively improves retrieval accuracy and the match with the purpose of the retrieval.
Fig. 2 is a structural diagram of an unstructured data retrieval system according to an embodiment of the present invention. As shown in Fig. 2, the unstructured data retrieval system according to the embodiment of the present invention comprises an acquisition module 210, a processing module 220 and a retrieval module 230.
Specifically, with reference to the retrieval flow of the system shown in Fig. 3, the acquisition module 210 is used to collect user behavior data. Here, user behavior data are the data generated by the user's operations on unstructured data, for example behavior data for operations such as editing and browsing unstructured data.
In one embodiment of the present invention, the user behavior data are obtained by analyzing the user's behavior log. Specifically, the behavior data for the user's operations on unstructured data are stored in a behavior log store made up of log files. The user behavior data can be extracted by analyzing this behavior log. As shown in Fig. 2, as a concrete example, the behavior log, the unstructured data and the task list (that is, the recent task list) can all be stored in a storage module 240.
The processing module 220 is used to periodically process the user behavior data to merge the task attributes of the user behavior data within a predetermined time period into a task list, to calculate a task score, an access count score and an edit duration score for each search result, where the task score is the similarity between the task attributes of the search result and the task attributes in the task list, and to perform the data heat calculation on the plurality of search results based on the task score, the access count score and the edit duration score.
Specifically, research on "human behavior dynamics" shows that a person's behavior (for example, a user's operations on unstructured data) can be regarded as the handling of a series of tasks, and that a given task tends to be completed within a concentrated period of time. A user's operations are therefore closely related to the tasks carried out recently, and user behavior data can be regarded as related to one task or a series of tasks.
Accordingly, the processing module 220 periodically processes the user behavior data to merge the task attributes of the user behavior data within the predetermined time period into a predefined task list. In this example the predetermined time period is, for example but not limited to, 2 days; that is, the union of the task attributes of the user interaction data (user behavior data) over those days is taken and written to the task list. For example, if user behavior data A are related to task B, task B is added to the task list.
It should be noted that unstructured data need to be annotated with attributes in advance, for example by using the unstructured data galaxy model to characterize the data with attributes (including, for example, the task to which a file belongs (the task attribute), the file access count and the file edit duration).
For example, the attributes of unstructured data are described using the unstructured data galaxy model. As shown in Table 1, the attributes of unstructured data are defined as fi, so that the attributes identify the features of the unstructured data.
In the above example, the access count score is derived, for example, from the number of times the user has accessed the unstructured data of the search result. The edit duration score is derived, for example, from the time the user has spent editing the unstructured data of the search result.
Of course, in other examples of the present invention, the processing module 220 may also calculate an initial ranking score for each search result, in which case the data heat calculation on the plurality of search results is based on the task score, the access count score, the edit duration score and the initial ranking score. The initial ranking score can be generated from the rank of each search result in the keyword search; for example, the higher the rank, the higher the corresponding initial ranking score.
The data heat (that is, the data heat score) of a file (for example, a piece of unstructured data) represents the importance of that data within the task to which it belongs. The score is computed jointly from the file's access count, edit duration and task matching degree.
For example, let sim(fileTask, recentTask) be the similarity between the task attribute vector fileTask of file f_i and the recent task vector recentTask; then
taskScore = sim(fileTask, recentTask) (1)
Let the access count of file f_i be a_i, and let A = {a_j | 0 < j < n and f_i.taskScore = f_j.taskScore}; then
accessScore = a_i / 2Max A (2)
Let the edit duration of file f_i be et_i, and let ET = {et_j | 0 < j < n and f_i.taskScore = f_j.taskScore}; then
edittimeScore = et_i / 2Max ET (3)
The data heat is then calculated as:
heat_score = p*taskScore*(t1 + t2*accessScore + t3*edittimeScore) + q*initScore (4)
where p, q, t1, t2 and t3 are weights; in one embodiment of the present invention, p:q:t1:t2:t3 = 95:5:0.9:0.07:0.03.
In addition, when the data heat is calculated with the above formula for different types of unstructured data, the weights of attribute scores such as the access count, edit duration and task matching degree can be adjusted according to the characteristics of the data and the application scenario; that is, the processing module 220 is also used to adjust the weights used in the data heat calculation according to the data type of each search result and the application scenario, so as to achieve the best ranking.
As a concrete example, to determine how relevant the files in the search results are to the tasks the user has recently carried out, the user log needs to be recorded and analyzed in order to compute the recent task vector.
The set F of files the user has recently accessed is computed, and the recent task vector recentTask=(rtask1, rtask2, ...) is built from the task attributes of the files in F.
After a query is submitted, the user's query keywords are extracted and recorded as the vector userQuery=(keyw1, keyw2, ...), where keyw1 and keyw2 denote query keywords. The keyword vector userQuery is submitted to a search engine (for example, the search engine of the Windows system), which returns the initial result set InitF. Each result file has task attributes, which can be recorded as a vector fileTask=(ftask1, ftask2, ...), where ftask1 and ftask2 denote the labels of the task attributes the file has. The data heat is then computed jointly from the task score taskScore, the access count score accessScore and the edit duration score edittimeScore.
Retrieval module 230 is used for using key search to obtain a plurality of Search Results according to user's searching request, and a plurality of Search Results are resequenced in calculating according to the data temperature.For example search result rank that can data temperature score value is higher is forward.
For example, a user who is preparing the acceptance defense PPT for project A may want to refer to other PPTs produced during the project. The user can therefore search with the keywords "project A; PPT" and obtain a plurality of search results, where the search results are a plurality of unstructured data items.
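The re-ranking step performed by the retrieval module can be illustrated with a short sketch. It assumes each search result already carries a computed score under a hypothetical `heat_score` key, and the file names used in the test data are invented for illustration.

```python
def rerank(results, score_key="heat_score"):
    """Return search results ordered by descending heat score.

    Python's sort is stable, so results with equal scores keep the
    relative order they had in the initial result list.
    """
    return sorted(results, key=lambda r: r[score_key], reverse=True)
```

Keeping the sort stable means the search engine's initial order acts as a tie-breaker when two files have the same score.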
In addition, in the data popularity calculation, access count and edit duration are numeric values whose range between minimum and maximum is large, so the excessive influence of extremely large or small attribute values on the ranking score needs to be reduced. For example, log files generated by software have very high access counts but usually have little to do with the user's task.
Therefore, embodiments of the invention also provide a clustering module (not shown). The clustering module is used to cluster the plurality of search results, and retrieval module 230 then sorts the search results within each cluster. Specifically, the clustering performed by the clustering module comprises: obtaining the relevance of each of the plurality of search results to the task; and clustering the search results according to that relevance. For example, an existing clustering method can be used: the result files are first divided into three levels by task relevance score, the maximum values of A and ET are then taken over the results at the same task level, and finally the access count score and edit duration score are calculated respectively.
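The three-level grouping and per-level maxima described above could be sketched as follows. The thresholds 0.66 and 0.33, and the choice to compute each score by dividing A and ET by the maximum within the same level, are assumptions; the text says only that the maxima are taken and the scores are then calculated. The function names are hypothetical.

```python
from collections import defaultdict


def task_level(task_score, high=0.66, low=0.33):
    """Map a task relevance score to one of three levels (0 = most relevant).

    The thresholds are illustrative assumptions.
    """
    if task_score >= high:
        return 0
    if task_score >= low:
        return 1
    return 2


def normalize_within_levels(results):
    """Add accessScore and edittimeScore, scaled by the per-level maximum.

    Each result is a dict with keys "taskScore", "A" (access count)
    and "ET" (edit duration).
    """
    groups = defaultdict(list)
    for r in results:
        groups[task_level(r["taskScore"])].append(r)

    scored = []
    for level in sorted(groups):
        members = groups[level]
        max_a = max(r["A"] for r in members)
        max_et = max(r["ET"] for r in members)
        for r in members:
            scored.append({
                **r,
                "level": level,
                "accessScore": r["A"] / max_a if max_a else 0.0,
                "edittimeScore": r["ET"] / max_et if max_et else 0.0,
            })
    return scored
```

Under these assumptions a heavily accessed log file is compared only with other low-relevance files, so its access count cannot shrink the access count scores of files that match the user's task.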
Further, the system also comprises a display module (not shown), which is used to display the re-ranked search results.
The unstructured data retrieval system according to embodiments of the invention can improve the retrieval efficiency of unstructured data. It not only retains the accuracy of keyword-based retrieval but also, by taking into account factors such as the task similarity of the unstructured data, its importance within the task, keyword matching degree, access count and edit duration, effectively improves retrieval accuracy and how well the results match the purpose of the search.
It should be understood that the various parts of the invention can be implemented in hardware, software, firmware or a combination thereof. In the embodiments described above, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art can be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and so on.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the invention. In this specification, such schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the claims and their equivalents.

Claims (18)

1. An unstructured data retrieval method, characterized by comprising the following steps:
Collecting user behavior data;
Periodically processing the user behavior data so as to merge the task attributes of the user behavior data within a predetermined time period into a task list;
Obtaining a plurality of search results by keyword search according to a search request of the user;
Calculating a task score, an access count score and an edit duration score for each search result, wherein the task score is the similarity between the task attributes of that search result and the task attributes in the task list;
Performing a data popularity calculation on the plurality of search results based on the task score, the access count score and the edit duration score; and
Re-ranking the plurality of search results according to the data popularity calculation.
2. The method according to claim 1, characterized by further comprising the step of: displaying the plurality of search results after re-ranking.
3. The method according to claim 1, characterized in that the user behavior data is obtained by analyzing the behavior log of the user.
4. The method according to claim 1, characterized by further comprising the step of:
Calculating an initial ranking score for each search result, wherein the data popularity calculation is performed on the plurality of search results based on the task score, the access count score, the edit duration score and the initial ranking score.
5. The method according to claim 4, characterized in that the data popularity is calculated by the formula:
heat_score = p*taskScore*(t1 + t2*accessScore + t3*edittimeScore) + q*initScore,
where p, q, t1, t2 and t3 are weights.
6. The method according to claim 5, characterized in that p:q:t1:t2:t3 = 95:5:0.9:0.07:0.03.
7. The method according to claim 6, characterized by further comprising:
Adjusting the weights used in the data popularity calculation according to the data type of each search result and the application scenario.
8. The method according to claim 1, characterized by further comprising the steps of:
Clustering the plurality of search results;
Sorting the search results within each cluster separately.
9. The method according to claim 8, characterized in that the step of clustering the plurality of search results specifically comprises:
Obtaining the relevance of each of the plurality of search results to the task;
Clustering the plurality of search results according to the relevance.
10. An unstructured data retrieval system, characterized by comprising:
An acquisition module, used to collect user behavior data;
A processing module, used to periodically process the user behavior data so as to merge the task attributes of the user behavior data within a predetermined time period into a task list, to calculate a task score, an access count score and an edit duration score for each search result, wherein the task score is the similarity between the task attributes of that search result and the task attributes in the task list, and to perform a data popularity calculation on the plurality of search results based on the task score, the access count score and the edit duration score;
A retrieval module, used to obtain a plurality of search results by keyword search according to a search request of the user, and to re-rank the plurality of search results according to the data popularity calculation.
11. The system according to claim 10, characterized by further comprising:
A display module, used to display the plurality of search results after re-ranking.
12. The system according to claim 10, characterized in that the user behavior data is obtained by analyzing the behavior log of the user.
13. The system according to claim 10, characterized in that the processing module is also used to calculate an initial ranking score for each search result, wherein the data popularity calculation is performed on the plurality of search results based on the task score, the access count score, the edit duration score and the initial ranking score.
14. The system according to claim 13, characterized in that the data popularity is calculated by the formula:
heat_score = p*taskScore*(t1 + t2*accessScore + t3*edittimeScore) + q*initScore,
where p, q, t1, t2 and t3 are weights.
15. The system according to claim 14, characterized in that p:q:t1:t2:t3 = 95:5:0.9:0.07:0.03.
16. The system according to claim 15, characterized in that the processing module is also used to adjust the weights used in the data popularity calculation according to the data type of each search result and the application scenario.
17. The system according to claim 10, characterized by further comprising:
A clustering module, used to cluster the plurality of search results, the search results within each cluster being sorted separately by the retrieval module.
18. The system according to claim 17, characterized in that the clustering of the plurality of search results by the clustering module specifically comprises:
Obtaining the relevance of each of the plurality of search results to the task;
Clustering the plurality of search results according to the relevance.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102105709A CN103279529A (en) 2013-05-30 2013-05-30 Unstructured data retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013102105709A CN103279529A (en) 2013-05-30 2013-05-30 Unstructured data retrieval method and system

Publications (1)

Publication Number Publication Date
CN103279529A true CN103279529A (en) 2013-09-04

Family

ID=49062048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102105709A Pending CN103279529A (en) 2013-05-30 2013-05-30 Unstructured data retrieval method and system

Country Status (1)

Country Link
CN (1) CN103279529A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750713A (en) * 2013-12-27 2015-07-01 阿里巴巴集团控股有限公司 Method and device for sorting search results
WO2018192496A1 (en) * 2017-04-20 2018-10-25 腾讯科技(深圳)有限公司 Trend information generation method and device, storage medium and electronic device
CN109657050A (en) * 2018-12-20 2019-04-19 湖南晖龙集团股份有限公司 A kind of unstructured data retrieval ranking optimization algorithm of temperature sensitivity
CN111859150A (en) * 2020-08-03 2020-10-30 广州知弘科技有限公司 Terminal information recommendation method based on big data
CN111914876A (en) * 2020-06-10 2020-11-10 华南理工大学 Method, system, device and storage medium for measuring similarity distance between users
CN112612961A (en) * 2020-12-28 2021-04-06 完美世界(北京)软件科技发展有限公司 Information searching method and device, storage medium and computer equipment
CN113778858A (en) * 2021-08-05 2021-12-10 深圳开源互联网安全技术有限公司 Component detection method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101410830A (en) * 2003-10-24 2009-04-15 微软公司 System and method for storing and retrieving XML data encapsulated as an object in a database store
CN101599995A (en) * 2009-07-13 2009-12-09 中国传媒大学 The directory distribution method and the network architecture towards high-concurrency retrieval system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
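Han Jing (韩晶) et al.: "HotRank: a popularity-sensitive ranking algorithm for unstructured data retrieval", Application Research of Computers (《计算机应用研究》) *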

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750713A (en) * 2013-12-27 2015-07-01 阿里巴巴集团控股有限公司 Method and device for sorting search results
WO2018192496A1 (en) * 2017-04-20 2018-10-25 腾讯科技(深圳)有限公司 Trend information generation method and device, storage medium and electronic device
CN109657050A (en) * 2018-12-20 2019-04-19 湖南晖龙集团股份有限公司 A kind of unstructured data retrieval ranking optimization algorithm of temperature sensitivity
CN111914876A (en) * 2020-06-10 2020-11-10 华南理工大学 Method, system, device and storage medium for measuring similarity distance between users
CN111859150A (en) * 2020-08-03 2020-10-30 广州知弘科技有限公司 Terminal information recommendation method based on big data
CN112612961A (en) * 2020-12-28 2021-04-06 完美世界(北京)软件科技发展有限公司 Information searching method and device, storage medium and computer equipment
CN112612961B (en) * 2020-12-28 2024-02-02 完美世界(北京)软件科技发展有限公司 Information searching method, device, storage medium and computer equipment
CN113778858A (en) * 2021-08-05 2021-12-10 深圳开源互联网安全技术有限公司 Component detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination