CN104166651A

CN104166651A - Data searching method and device based on integration of data objects in same classes

Info

Publication number: CN104166651A
Application number: CN201310182427.3A
Authority: CN
Inventors: 郎皓; 欧海峰; 张丙奇; 孙健
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2013-05-16
Filing date: 2013-05-16
Publication date: 2014-11-26
Anticipated expiration: 2033-05-16
Also published as: CN107844565B; CN107844565A; CN104166651B

Abstract

The invention relates to a data search method and device based on the integration of homogeneous (same-class) data objects. The method comprises: receiving a search request from a user and searching all data objects to be searched for one or more data objects that match the request; analyzing each data object found to obtain its data labels; matching the obtained data labels; and integrating the data objects whose data labels match into a homogeneous data combination, which is returned to the user as the search result. Because the massive set of data objects is classified and integrated into homogeneous data objects, and only one of the returned homogeneous data objects is displayed on the search results page, the accuracy and recall rate of data search are improved and the diversity of the search results is increased.

Description

Method and apparatus for data search based on the integration of homogeneous data objects

Technical field

The present application relates to the field of data search, and in particular to a method and apparatus for data search based on the integration of homogeneous data objects.

Background technology

With the arrival of the cloud era, big data has attracted increasing attention. The point of big data technology lies not in holding massive amounts of data or data objects, but in collecting, processing, and organizing them within a reasonable time into the data that users need. The network holds a large amount of data, and making full use of it can bring great convenience to users' lives. A user can run a data search with a search engine to obtain the desired data. Taking data search as an example, the search engine crawls web pages on the Internet in advance and can provide a retrieval service only after preprocessing the crawled pages. The most important preprocessing step is extracting keywords from the pages; others include removing duplicate pages, word segmentation, determining the page type, analyzing hyperlinks, and calculating the importance or richness of each page.

When performing a data search, the search engine retrieves the entries most relevant to the keyword entered by the user. The number of search results matching the keyword is enormous, however, and they span every field of social life, so the quality of the search results is low: for example, they are inconvenient for the user and inaccurate.

If information integration is adopted, the search engine can select, analyze, and classify the massive set of data objects it has crawled, which narrows the scope of the data search and makes the search results more targeted. However, ambiguity in the data (for example, the same keyword corresponding to different fields) lowers the accuracy of the search results, and a keyword that has other written forms (for example, 以太网 and 乙太网, two Chinese renderings of "Ethernet") causes incomplete search results to be returned.

For example, when a data search is performed for the keyword 以太网 ("Ethernet"), results related to 以太网 appear on the search results page. The keyword 乙太网 is a different written form with the same meaning, but because no association exists between the two keywords, results related to 乙太网 cannot appear on the search results page. Part of the relevant results therefore fail to be retrieved, which lowers search result quality, in particular the recall rate (return rate).

Moreover, although the search engine has selected, analyzed, and classified the massive set of data objects, the search results page can still show multiple identical or similar data objects when results are returned, which wastes result slots. For example, if each search results page can show only 20 results and 10 of those 20 are identical or similar data objects, the user has to click through to the next page repeatedly to see different data objects.

Summary of the invention

The main purpose of the present application is to provide a method and apparatus for data search based on the integration of homogeneous data objects, so as to solve the problem of low-quality search results that arises with prior-art search engines because the volume of data is too large and no association exists between data objects.

To solve the above technical problem, the purpose of the present application is achieved through the following technical solutions:

The present application provides a method for data search based on the integration of homogeneous data objects, comprising the following steps: receiving a search request from a user, and searching all data objects to be searched for one or more data objects that match the search request; analyzing each of the one or more data objects found, to obtain the data labels of each data object; matching the obtained data labels; and integrating the one or more data objects whose data labels match into a homogeneous data combination, which is returned to the user as the search result.

Preferably, in the method according to the present application, the data labels comprise a first data label and a second data label, which identify different attribute features of a data object.

Preferably, the method according to the present application may further comprise: integrating all data objects to be searched in advance, to determine the one or more homogeneous data objects corresponding to each data object to be searched and thereby obtain a data object mapping table.

Preferably, in the method according to the present application, integrating all data objects to be searched in advance comprises: mining the second data labels in each data object and a second-data-label category distribution table; mining the second data labels in each data object to generate the set of second-data-label synonyms of all data objects; mining the first data labels in each data object to generate the set of first-data-label synonyms of all data objects; and mining the first and second data labels in each data object to generate the mapping from first data labels to second data labels.

Preferably, in the method according to the present application, second-data-label synonyms arise from multiple data objects that are under the same category, have different second data labels, and have the same first data label; first-data-label synonyms are multiple similar first data labels in the same data object.

Preferably, in the method according to the present application, mining the first and second data labels in each data object to generate the mapping from first data labels to second data labels comprises: if a data object has only one first data label, and that first data label has co-occurred with only one unique second data label, establishing a mapping between that first data label and that second data label.

Preferably, in the method according to the present application, integrating all data objects to be searched in advance comprises: extracting one or more second data labels from the same data object to obtain one or more candidate second data labels, and disambiguating the extracted candidates; extracting the first data labels from multiple data objects based on configured rules, and normalizing the extracted first data labels; normalizing second data labels, or first data labels, that are synonyms of one another; and, according to the constructed mapping from first data labels to second data labels, completing the second data label of any data object that lacks one.

Preferably, in the method according to the present application, disambiguating the one or more extracted candidate second data labels comprises: based on the category distribution table of the second data labels, obtaining the number of times a candidate second data label occurs in the category, and, if that number is greater than a preset threshold, taking the candidate as the second data label of the data object; and/or, if multiple candidate second data labels appear in one data object, selecting the candidate with the highest occurrence count in the second-data-label category distribution table as the second data label of the data object.
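The two disambiguation rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `disambiguate`, the nested-dictionary layout of the category distribution table, and the default threshold of 5 are all assumptions, and the sketch applies both rules in sequence although the text allows either one alone.

```python
from typing import Dict, List, Optional

# Hypothetical layout: category -> {second data label -> occurrence count}
CategoryTable = Dict[str, Dict[str, int]]


def disambiguate(candidates: List[str], category: str,
                 table: CategoryTable, threshold: int = 5) -> Optional[str]:
    """Choose the second data label of one data object from its candidates."""
    counts = table.get(category, {})
    # Rule 1: keep candidates that occur more than `threshold` times
    # in the object's category.
    kept = [c for c in candidates if counts.get(c, 0) > threshold]
    if not kept:
        return None
    # Rule 2: if several candidates survive, take the most frequent one.
    return max(kept, key=lambda c: counts[c])
```

With a table such as `{"shoes": {"Nike": 120, "Air": 3, "Zoom": 40}}`, the candidates `["Nike", "Air", "Zoom"]` resolve to `"Nike"`, because "Air" falls below the threshold and "Nike" is more frequent than "Zoom".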

Preferably, the method according to the present application may comprise: showing, on the search results page, one of the multiple data objects in the homogeneous data combination, wherein the homogeneous data combination comprises multiple data objects that are homogeneous with one another.

Preferably, in the method according to the present application, the homogeneous data objects may comprise: multiple data objects that, under the same category, have identical or synonymous second data labels and identical or synonymous first data labels.

The present application also provides a device for data search based on the integration of homogeneous data objects, comprising: a receiving and searching module, configured to receive a search request from a user and to search all data objects to be searched for one or more data objects that match the search request; an acquisition module, configured to analyze each of the one or more data objects found, to obtain the data labels of each data object; a matching module, configured to match the obtained data labels; and an integrating and returning module, configured to integrate the one or more data objects whose data labels match into a homogeneous data combination and return it to the user as the search result.

Preferably, in the device according to the present application, the data labels comprise a first data label and a second data label, which identify different attribute features of a data object.

Preferably, the device according to the present application may further comprise: a preprocessing module, configured to integrate all data objects to be searched in advance, to determine the one or more homogeneous data objects corresponding to each data object to be searched and thereby obtain a data object mapping table.

Preferably, in the device according to the present application, the preprocessing module is further configured to: mine the second data labels in each data object and a second-data-label category distribution table; mine the second data labels in each data object to generate the set of second-data-label synonyms of all data objects; mine the first data labels in each data object to generate the set of first-data-label synonyms of all data objects; mine the first and second data labels in each data object to generate the mapping from first data labels to second data labels; and, if a data object has only one first data label and that first data label has co-occurred with only one unique second data label, establish a mapping between that first data label and that second data label.

Preferably, in the device according to the present application, second-data-label synonyms arise from multiple data objects that are under the same category, have different second data labels, and have the same first data label; first-data-label synonyms are multiple similar first data labels in the same data object.

Preferably, in the device according to the present application, the preprocessing module is further configured to: extract one or more second data labels from the same data object to obtain one or more candidate second data labels, and disambiguate the extracted candidates; extract the first data labels from multiple data objects based on configured rules, and normalize the extracted first data labels; normalize second data labels, or first data labels, that are synonyms of one another; according to the constructed mapping from first data labels to second data labels, complete the second data label of any data object that lacks one; based on the category distribution table of the second data labels, obtain the number of times a candidate second data label occurs in the category and, if that number is greater than a preset threshold, take the candidate as the second data label of the data object; and/or, if multiple candidate second data labels appear in one data object, select the candidate with the highest occurrence count in the second-data-label category distribution table as the second data label of the data object.

Preferably, in the device according to the present application, the integrating and returning module is further configured to: show, on the search results page, one of the multiple data objects in the homogeneous data combination, wherein the homogeneous data combination comprises multiple data objects that are homogeneous with one another; the homogeneous data objects comprise multiple data objects that, under the same category, have identical or synonymous second data labels and identical or synonymous first data labels.

Compared with the prior art, the technical solution of the present application has the following beneficial effects:

The present application uses important labels or attributes of data objects, such as the first data label and the second data label, to classify and integrate the massive set of data objects in advance and to establish associations between homogeneous data objects. This improves the accuracy and recall rate of data search and thereby raises the quality of the search results.

The present application integrates the multiple homogeneous data objects returned by the search engine and shows only one of them on the search results page. The page therefore shows a wider variety of data objects, which increases the diversity of the search results and improves the user experience.

Brief description of the drawings

The drawings described here provide a further understanding of the present application and form part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for data search based on the integration of homogeneous data objects according to an embodiment of the present application;

Fig. 2 is a flowchart of the steps of the pre-integration processing of homogeneous data objects according to an embodiment of the present application;

Fig. 3 is a flowchart of the offline data mining performed on all data objects to be searched according to an embodiment of the present application;

Fig. 4 is a flowchart of the corresponding normalization and mapping processing performed on all data objects to be searched according to an embodiment of the present application; and

Fig. 5 is a structural diagram of the device for data search based on the integration of homogeneous data objects according to an embodiment of the present application.

Embodiment

The main idea of the present application is to use the data labels (attributes) contained in the data objects being searched to distinguish homogeneous data objects from non-homogeneous ones, and to use those data labels (for example, a first data label and a second data label) to integrate the massive set of data objects in the database in advance. For example, second data labels that have the same meaning but different written forms, such as 以太网 and 乙太网 (two Chinese renderings of "Ethernet"), are mapped to each other; different first data labels contained in the same data object are mapped to each other; and different data objects that have the same first data label are mapped to each other. Based on these mappings between data objects and data labels, multiple data objects that are homogeneous with one another are obtained. During a data search, thanks to this advance integration, a search request from the user, such as a keyword (key) request, retrieves not only the data objects in the database that match the keyword but also the homogeneous data objects of each match, which improves the accuracy and recall rate of the search. The multiple homogeneous data objects can also be integrated so that only one of them is shown on the search results page; the page then shows a wider variety of data objects, which increases the diversity of the search results.

To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and specific embodiments.

According to an embodiment of the present application, a method for data search based on the integration of homogeneous data objects is provided. Fig. 1 is a flowchart of this method according to an embodiment of the present application.

At step S102, a search request from a user is received and a search is performed. The search request is used to find, among all data objects to be searched, the one or more data objects that match it.

The search request can contain a keyword, a network link, or the like. The search is performed according to the request to find the data objects that match the keyword, or the one or more data objects that the network link points to. By sending the search request, the user obtains, from among many data objects, one or more data objects that match the keyword or the content that the network link represents.

The one or more data objects can be stored in a database in the form of data files. Each data object contains various data labels, such as a first data label and a second data label. "First" and "second" merely denote two entirely different features or attributes; the terms are used to tell the two apart, not to define them. The data files to be searched that are stored in the database must be organized and integrated by a suitable data structure to guarantee complete, high-quality, and efficient search; this is described below in the processing for integrating homogeneous data objects.

At step S104, each of the one or more data objects found is analyzed to obtain its data labels. The data labels comprise a first data label and a second data label, which identify different attribute features of the data object.

In other words, each of the data objects found can be analyzed to obtain its first data label and/or second data label.

Each data object contains multiple data labels, such as the title, storage location, and index number of the data object. Each data object can contain a first data label and a second data label, which characterize different aspects of the data object. Together, the first and second data labels can identify a data object, so in the embodiments of the present application they serve as the basis for integrating homogeneous data. For example, in an employee information table with the employee ID (12345) as the first data label and the name (Zhang San) as the second data label, the employee ID (12345) and the name (Zhang San) identify the employee Zhang San in the table.

Among a massive set of data objects, the first and second data labels can be used to determine the characteristics of a data object and thus its homogeneous data objects. As another example, a commodity can be taken as a data object, with its article number as the first data label and its brand word as the second data label. A commodity that has been found is analyzed to obtain its article number and brand word, which are then matched against the massive set of commodities to obtain the commodities homogeneous with it. For instance, if the article number "1111" and the brand word "Nike" identify the commodity "shell-toe sneakers", matching "1111" and "Nike" against the massive set of commodities yields multiple commodities of the same model as "shell-toe sneakers". For the step of matching the obtained first and/or second data labels against the mass data, see step S106.

At step S106, the obtained data labels, such as the first data label and/or the second data label, are matched to obtain the one or more homogeneous data objects that match each data object.

More data objects that match the search request can thus be obtained, which makes the data search more comprehensive, improves the quality of the search results, and gives the user a more convenient data search service.

At step S108, each data object and its corresponding one or more homogeneous data objects are integrated (aggregated) into a homogeneous data combination, which is returned to the user.

In other words, the one or more data objects whose data labels (the first data label and/or the second data label) match can be integrated into a homogeneous data combination and returned to the user as the search result.
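Steps S102 to S108 can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the `DataObject` fields, the substring-based `search` function, and the use of exact equality on (category, first label, second label) as the matching rule are all hypothetical, and synonym handling is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataObject:
    title: str
    category: str
    first_label: str   # e.g. an article number
    second_label: str  # e.g. a brand word


Key = Tuple[str, str, str]


def search(query: str, objects: List[DataObject]) -> List[DataObject]:
    """S102: naive keyword match on the title (stand-in for a real engine)."""
    return [o for o in objects if query.lower() in o.title.lower()]


def integrate(hits: List[DataObject],
              objects: List[DataObject]) -> Dict[Key, List[DataObject]]:
    """S104-S108: read each hit's labels, match them against all objects,
    and collect the matches into homogeneous data combinations."""
    index: Dict[Key, List[DataObject]] = defaultdict(list)
    for o in objects:
        index[(o.category, o.first_label, o.second_label)].append(o)
    groups: Dict[Key, List[DataObject]] = {}
    for h in hits:
        key = (h.category, h.first_label, h.second_label)
        groups[key] = index[key]  # includes objects the keyword missed
    return groups
```

The point of the index lookup is that a combination can contain objects whose titles never matched the keyword, which is how the approach raises recall.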

The homogeneous data combination contains multiple data objects that are homogeneous with one another, and through it the user can view each of the homogeneous data objects it contains.

In one embodiment, the search results page shows only one of the data objects in a homogeneous data combination and hides the other homogeneous data objects. When the hidden homogeneous data objects need to be shown, an operation for showing them can be triggered, for example by clicking a button.
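The display behavior described above might look like the following sketch. The names `build_result_page` and `expand`, the dictionary shape of a page entry, and the choice of the first member as the representative are assumptions made for illustration; the page size of 20 echoes the example in the background section.

```python
from typing import Dict, List


def build_result_page(groups: Dict[str, List[str]],
                      page_size: int = 20) -> List[dict]:
    """Show one representative per homogeneous data combination; hide the rest."""
    page = []
    for key, members in groups.items():
        if not members:
            continue  # an empty combination has nothing to show
        page.append({"group": key, "shown": members[0], "hidden": members[1:]})
        if len(page) == page_size:
            break
    return page


def expand(entry: dict) -> List[str]:
    """Simulates the user triggering the control that reveals hidden objects."""
    return [entry["shown"], *entry["hidden"]]
```

Because each slot on the page stands for a whole combination, 20 slots cover 20 distinct kinds of data object instead of being partly filled with near-duplicates.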

Further, the search result returns to the user the data objects that match the search request together with the one or more homogeneous data objects corresponding to each match. Returning the homogeneous data objects along with the matching data objects improves the accuracy and recall rate of the data search. Using homogeneous data combinations to aggregate data objects of the same class also lets the search results page show a wider variety of data objects, which increases the diversity of the search results.

Matching more homogeneous data objects according to the data labels, such as the first data label and/or the second data label, relies on the data structure obtained by integrating homogeneous data objects. The integration of homogeneous data objects is described in detail below.

Fig. 2 is a flowchart of the pre-integration processing of homogeneous data objects according to an embodiment of the present application.

At step S202, all data objects to be searched are integrated in advance to determine the one or more homogeneous data objects corresponding to each data object to be searched, yielding a data object mapping table.

The purpose of this pre-integration step is to obtain the one or more homogeneous data objects corresponding to each data object. Homogeneous data objects are multiple data objects that, under the same category, have identical or synonymous second data labels and identical or synonymous first data labels.

The pre-integration step first mines the massive set of data objects offline. Based on the mining results, it then extracts the second and first data labels online and performs the corresponding normalization and mapping, finally obtaining the mappings among data objects, first data labels, and second data labels. Data objects of the same class under the same category are thereby integrated together, yielding the associations among the integrated homogeneous data objects.

First, offline data mining is performed on the massive set of data objects (for example, on the order of hundreds of millions), that is, on all data objects to be searched. In a preferred approach, shown in Fig. 3, the following are mined: the second data label table, the second-data-label synonym table, the second-data-label category distribution table, the first-data-label synonym table, and the mapping table from first data labels to second data labels.

At step S302, the second data labels in each data object and the second-data-label category distribution table are mined.

The second data labels in each data object can be extracted to generate the set of second data labels of all data objects, for example as a second data label table. Based on this set, all data objects in the database are divided by category to obtain the category distribution of the second data labels of all data objects, for example as a second-data-label category distribution table. Categories in a paper database might include geography and life; categories in a commodity database might include clothing and clocks and watches.

Preferably, for the data file of each data object under a given category, all the different second data labels it contains are collected, the second data labels of all data objects are combined into the second data label set, and the number of times (or frequency) that each second data label occurs is counted. All second data labels of all data objects form the second data label table. The number of times (or frequency) that the second data labels of each data object occur under each category forms the second-data-label category distribution table, which contains multiple categories, the second data labels under each category, and the count (or frequency) of each second data label.

For example, from the file of the data object "city" under the "geography" category, three different second data labels are extracted: "school bus", "subway", and "car". All three appear in the file of the "city" object, so all three are put into the second data label set. In this way all second data labels of all data objects, whether under the same category or under different categories, are extracted to form the second data label set, which can be stored as a list. The number of times (or frequency) that each of these second data labels occurs must also be counted to form the category distribution of the second data labels. Suppose that in "city", "school bus" occurs 15 times, "subway" 10 times, and "car" 20 times, sorted in descending order, while in the file of the data object "large household items" under the "life" category, "school bus" occurs 0 times, "subway" 0 times, and "car" 20 times. Then "school bus" and "subway" belong to the "geography" category, while "car" corresponds to both "geography" and "life". The category, the second data labels under it, and the counts (or frequencies) of those labels are saved in the second-data-label category distribution table.
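The counting in this example can be reproduced with a short sketch. The function names and the input format (one record per data object, holding its category and the second data labels found in its file) are assumptions; the counts are the ones given in the example above.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_category_distribution(
        records: Iterable[Tuple[str, List[str]]]) -> Dict[str, Counter]:
    """Build category -> Counter(second data label -> occurrence count)."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for category, labels in records:
        table[category].update(labels)
    return table


def categories_of(table: Dict[str, Counter], label: str) -> List[str]:
    """List the categories in which a second data label occurs at least once."""
    return sorted(c for c, counts in table.items() if counts[label] > 0)


records = [
    ("geography", ["school bus"] * 15 + ["subway"] * 10 + ["car"] * 20),
    ("life", ["car"] * 20),
]
table = build_category_distribution(records)
# Labels under "geography" in descending order of count.
ranking = table["geography"].most_common()
```

Here `categories_of(table, "car")` lists both "geography" and "life", while "school bus" and "subway" map to "geography" only, matching the conclusion drawn in the text.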

The count or frequency of a second data label can be obtained from the data file that contains the data object by counting the occurrences of information related to the second data label, for example the second data labels contained in the attribute information of the data object or in the word segmentation result of its title.

At step S304, the second data labels in each data object can be mined to generate the set of second-data-label synonyms of all data objects, for example as a second-data-label synonym table.

Under the same category, synonym mining is performed on the second data labels of the data objects; for example, the second and first data labels of all data objects under the category are extracted. When two data objects have the same first data label but different second data labels, this is counted as one co-occurrence of a second-data-label synonym pair. For example, if data object M₁ has label a (second data label) and code B (first data label), and data object M₂ has label A (second data label) and code B (first data label), then label A and label a can be regarded as synonyms, and M₁ and M₂ constitute one co-occurrence of that synonym pair. The co-occurrence counts of all second-data-label synonym pairs can then be tallied on the basis of the second-data-label category distribution table. The pairs are sorted by co-occurrence count (or frequency) in descending order, and pairs with high counts are preferentially written to the second-data-label synonym table and saved.

This mining associates multiple data objects that, under the same category, have different second data labels but the same first data label, and thereby forms the set of second-data-label synonyms and the data objects whose second data labels are synonymous with one another. That is, if multiple data objects share the same first data label but have different second data labels, those different second data labels can be called second-data-label synonyms and can be saved in the form of a second-data-label synonym table.
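A minimal sketch of this synonym mining follows. The tuple input format, the function name, and the `min_count` cutoff are assumptions; the text only says that high-count pairs are preferred, without fixing a cutoff.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def mine_second_label_synonyms(
        objects: Iterable[Tuple[str, str, str]],
        min_count: int = 1) -> List[Tuple[Pair, int]]:
    """objects: (category, first_label, second_label), one per data object.

    Two objects in the same category that share a first label but carry
    different second labels count as one co-occurrence of a synonym pair.
    """
    by_key = defaultdict(list)
    for category, first, second in objects:
        by_key[(category, first)].append(second)

    pair_counts: Counter = Counter()
    for seconds in by_key.values():
        for a, b in combinations(seconds, 2):
            if a != b:
                pair_counts[tuple(sorted((a, b)))] += 1

    # Descending by co-occurrence count, keeping pairs at or above the cutoff.
    return [(p, n) for p, n in pair_counts.most_common() if n >= min_count]
```

Grouping by (category, first label) before pairing keeps the comparison within one category, as the text requires, and avoids comparing every object with every other object.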

Step S306: first data label mining can be performed on the first data labels of each data object to generate the first data label synonym set over all data objects, for example as a first data label synonym table.

For example, the multiple first data labels that appear in a single data object (for example in its data file) can be extracted, such as the first data labels contained in the title information of the file. If two of these first data labels have the same length and the same prefix, they are regarded as a first data label synonym pair. Finally, all first data label synonym pairs are aggregated into first data label synonym clusters (synonym sets), which can be saved in the form of a first data label synonym table. In this way, the multiple similar first data labels within the same data object form the set of first data label synonyms.

More specifically, a data object can contain multiple first data labels. For example, a data object A contains the first data labels "1110" and "1111"; these two labels have the same length and the same prefix, so "1110" and "1111" can be treated as first data label synonyms. A first data label synonym table can be formed in this way.
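The pairing rule of step S306 (same length and same prefix) and the aggregation of pairs into clusters can be sketched as below. The text does not say how long the shared prefix must be, so the `prefix_len` parameter and its default of 3 are assumptions, as are the function name and the union-find aggregation.

```python
from itertools import combinations


def first_label_synonym_clusters(labels, prefix_len=3):
    """Cluster the first labels of one data object into synonym sets.

    Two labels form a synonym pair when they have the same length and
    share their first `prefix_len` characters (an assumed parameter).
    """
    labels = list(dict.fromkeys(labels))  # de-duplicate, keep order
    parent = {label: label for label in labels}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(labels, 2):
        if len(a) == len(b) and a[:prefix_len] == b[:prefix_len]:
            parent[find(b)] = find(a)  # merge the two clusters

    clusters = {}
    for label in labels:
        clusters.setdefault(find(label), []).append(label)
    # Only clusters with at least two labels are synonym sets.
    return [group for group in clusters.values() if len(group) > 1]
```

For the labels "1110" and "1111" from the example, the sketch returns one cluster containing both.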

Step S308: the first data labels and second data labels of each data object can be mined to generate the mapping relations from first data labels to second data labels.

For example, the first data labels and second data labels are extracted from the data files of all data objects, and, according to the second data label classification distribution table, the number of times (or frequency) that the same first data label co-occurs with different second data labels in data objects is counted. If a data object has only one first data label, and this first data label has co-occurred with only one unique second data label, a mapping relation between this first data label and this second data label is established, for example as a first data label to second data label mapping table, and saved. The mapping is used to fill in second data label information that may be missing from the features of some data objects. For instance, a data object may have only a first data label and no second data label: it carries only the code "11" (first data label) and no label (second data label), but the code "11" has co-occurred with, and only with, the label "A" (second data label), so a mapping from "11" to "A" is established. Such a mapping can complete the second data label feature of the data object. As a result, when a search request is answered, the recall (also called the return rate) of the aggregated same-class data objects can be improved.

For example, data object A includes only a first data label "1110" and lacks a second data label. According to the first and second data labels extracted from all data objects, within the same category this first data label "1110" has co-occurred only with the second data label "BB" of data object B; in other words, among all data objects only data object B contains both the first data label "1110" and the second data label "BB". In this case, a mapping relation from the first data label "1110" to the second data label "BB" can be established.
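The uniqueness condition of step S308 can be sketched as follows. The dictionary layout of a data object (`first_labels`, `second_label`) and the function name are assumptions for illustration; the sketch keeps a first label only when every second label it has been seen with is the same one.

```python
from collections import defaultdict


def build_first_to_second_mapping(objects):
    """Map each first label to its single co-occurring second label.

    Each object is a dict with a list under "first_labels" and an
    optional "second_label"; this layout is hypothetical.
    """
    seen_with = defaultdict(set)
    for obj in objects:
        second = obj.get("second_label")
        if second is None:
            continue  # nothing co-occurs in an object lacking a second label
        for first in obj.get("first_labels", []):
            seen_with[first].add(second)

    # Keep only first labels that co-occurred with exactly one second label.
    return {
        first: next(iter(seconds))
        for first, seconds in seen_with.items()
        if len(seconds) == 1
    }
```

In the example above, "1110" is seen only with "BB" and is therefore mapped, whereas a first label seen with two different second labels would be left out as ambiguous.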

Further, based on the tables aggregated by the above data mining (the second data label table, the second data label synonym table, the first data label synonym table, the first data label to second data label mapping table, and the second data label classification distribution table), all data objects can be normalized and mapped online, thereby expressing the mapping relations among data objects, first data labels and second data labels.

From the above data mining results over the massive data objects, the association relations of the initially integrated homogeneous data objects can be formed, that is, the multiple data objects that, within the same category, have identical or synonymous second data labels and identical or synonymous first data labels.

Next, for a given data object (or for each data object), the second data label and the first data label are extracted from the title information and attribute information of the data object (data file), based on the sets (tables) mined offline. According to the data mining results, all data objects to be searched are then normalized (unified) and mapped accordingly, and the data objects that belong to the same class are finally integrated, as shown in Fig. 4. This further optimizes the integration of homogeneous data objects, the classification distribution table of the data objects, and the mapping relations among data objects, second data labels and first data labels.

Step S402: one or more second data labels are extracted from the same data object to obtain one or more candidate second data labels. The title information of a data object is segmented into words (the attribute information can be handled in the same way), and a segment, or a set of segments, is matched against the second data label table. If it matches a second data label in the table exactly, that second data label is taken as a candidate. For example, if a data object contains the second data labels "A", "B" and "C", and the second data label table contains only "A" and "B", then "A" and "B" are the candidate second data labels of this data object.

Based on the second data label classification distribution table, the number of times or the frequency with which the different second data labels occur is counted and sorted from high to low, and a candidate second data label with a high count or high frequency can be taken as the second data label of this data object.
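The exact matching of title segments against the second data label table in step S402 can be sketched as below. Whitespace splitting stands in for a real word segmenter, and the function name, the lower-casing and the `max_ngram` limit are all assumptions made for the example.

```python
def extract_candidate_second_labels(title, label_table, max_ngram=3):
    """Return the entries of `label_table` that occur in the title.

    The title is split on whitespace as a stand-in for word
    segmentation; runs of up to `max_ngram` consecutive segments are
    matched exactly against the (lower-cased) label table.
    """
    tokens = title.lower().split()
    candidates = []
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            fragment = " ".join(tokens[i:i + n])
            # Exact match against the table; keep first-seen order.
            if fragment in label_table and fragment not in candidates:
                candidates.append(fragment)
    return candidates
```

Candidates found this way would still need the frequency ranking described above before one of them is chosen as the second data label.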

Step S404: the one or more extracted candidate second data labels are disambiguated.

Disambiguating the candidate second data labels comprises selecting, according to the occurrence count or frequency of each candidate second data label in the classification distribution table, the candidate that satisfies a predetermined condition as the second data label of the data object.

In one specific embodiment, the category to which the data object belongs is determined and, based on the classification distribution table of second data labels, the number of times (or frequency) that a candidate second data label occurs in that category is obtained. If the count is greater than a preset threshold (for example, 1), the candidate is regarded as the second data label of this data object. In another embodiment, if multiple candidate second data labels appear in one data object, the candidate with the largest occurrence count (largest frequency) in the second data label classification distribution table is selected as the second data label of this data object. For example, the candidate second data labels of a data object are "A" and "B". According to the second data label classification distribution table, within the category of this data object "A" occurs 1000 times and "B" occurs once, so "A" can be determined to be the second data label of this data object.
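The two disambiguation embodiments of step S404 can be combined in a short sketch: candidates are first filtered by the threshold, and the most frequent survivor is kept. The `(category, label)` keying of the distribution table and the function name are assumptions; the text presents the two rules as alternatives, and chaining them is only one possible reading.

```python
def disambiguate(candidates, category, distribution, threshold=1):
    """Pick one second label for an object from its candidates.

    `distribution` maps (category, label) to an occurrence count,
    a hypothetical layout for the classification distribution table.
    Returns None when no candidate exceeds the threshold.
    """
    counts = {c: distribution.get((category, c), 0) for c in candidates}
    # First embodiment: keep candidates whose count exceeds the threshold.
    kept = [c for c in candidates if counts[c] > threshold]
    if not kept:
        return None
    # Second embodiment: among the survivors, take the most frequent.
    return max(kept, key=counts.get)
```

With counts of 1000 for "A" and 1 for "B" in the object's category, the sketch selects "A", matching the example above.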

Step S406: second data label synonym normalization. After disambiguation, the second data label extracted from the data object can be rewritten to a normalized second data label, based on the second data label synonym table mined offline. For example, if the second data label "A" occurs 500 times in the second data label classification distribution table and its synonym "B" occurs 20 times, the second data label "B" can be changed to "A".

For example, suppose the data object is a commodity; the second data label of a commodity can include its brand word. The brand word of the same commodity may be written in different ways, including synonyms and misspellings. For example, the brand word of a commodity is "new bolune"; this brand word has the synonyms "New Balance" and "new balance", the misspelled form "newbalance", and the abbreviation "nb". Based on the synonym table of brand words (the second data label synonym table) and the brand word (second data label) obtained after disambiguation, the extracted second data label (brand word) can be rewritten to a single, most suitable brand word that serves as the second data label of the commodity, for example uniformly using "New Balance".
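Synonym normalization as in step S406 can be sketched as a lookup from every label in a synonym set to one canonical form. Choosing the most frequent member as the canonical form follows the 500-versus-20 example but is otherwise an assumption, as are the function names.

```python
def build_canonical_map(synonym_groups, counts):
    """Map every label in a synonym group to the group's canonical label.

    The canonical label is taken to be the most frequent member of the
    group according to `counts` (an assumed selection rule).
    """
    canonical = {}
    for group in synonym_groups:
        best = max(group, key=lambda label: counts.get(label, 0))
        for label in group:
            canonical[label] = best
    return canonical


def normalize_label(label, canonical):
    """Rewrite a label to its canonical form; unknown labels pass through."""
    return canonical.get(label, label)
```

The same lookup would apply unchanged to first data labels once a first data label synonym table is available.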

Step S408: according to the constructed mapping relations from first data labels to second data labels, the second data label is completed for data objects that lack one.

Suppose only a first data label, and no second data label, has been extracted from a data object, that is, its second data label is missing, and the extracted first data label exactly matches a first data label in the first data label to second data label mapping table mined offline for the same category. The second data label of this data object is then obtained from that mapping table, so that the data object can be aggregated into the corresponding set of same-class data objects. Further, this second data label can be written into the data object that lacked it.
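The completion in step S408 amounts to a guarded dictionary lookup. The sketch below assumes a data object is a dict with `first_labels` and `second_label` keys and that the mapping table is a plain dict; both are illustrative choices, not the application's data model.

```python
def complete_second_label(obj, first_to_second):
    """Fill in a missing second label from the first-to-second mapping.

    Returns a new dict when a label is filled in; the input object is
    returned unchanged otherwise.
    """
    if obj.get("second_label") is not None:
        return obj  # nothing is missing
    firsts = obj.get("first_labels", [])
    # Complete only when the object's single first label is in the table.
    if len(firsts) == 1 and firsts[0] in first_to_second:
        return {**obj, "second_label": first_to_second[firsts[0]]}
    return obj
```

An object carrying only the first data label "1110" would receive "BB" if the table maps "1110" to "BB"; objects with several first labels are left alone here, which is a conservative reading of the text.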

Step S410: based on configured rules, the first data labels are extracted from multiple data objects. The first data label of a data object is extracted by applying the configured rules to the title information, attribute information and so on of its data file. For example, a regular expression can be configured to extract the first data label of a data object.

Step S412: the multiple extracted first data labels are normalized. For example, if the labels in a data object share the same main number, such as "1110", and differ only in a sub-number, such as "001" and "002", the sub-number is removed to normalize the first data label: "1110-001" and "1110-002" are both normalized to "1110", or both are normalized to the same first data label "1110-001".

Taking commodity search over massive data objects as an example, the first data label extracted from a commodity data object is its article number. The article number is split on a separator: "537889-001" is split on "-" into the two parts "537889" and "001", and the leading main article number "537889" is taken as the normalized article number.
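Steps S410 and S412 can be sketched with a regular expression and a split on the separator. The pattern below (at least four digits, optionally followed by "-" and a sub-number of at least two digits) is a hypothetical rule invented for the example; the text only says that a regular expression can be configured.

```python
import re

# Hypothetical extraction rule: main number, optional "-" sub-number.
ITEM_NO = re.compile(r"\b(\d{4,})(?:-(\d{2,}))?\b")


def extract_first_labels(title):
    """Return every article-number-like string found in the title."""
    return [match.group(0) for match in ITEM_NO.finditer(title)]


def normalize_first_label(label, separator="-"):
    """Drop the sub-number: keep only the part before the separator."""
    return label.split(separator, 1)[0]
```

For a title containing "537889-001", extraction returns that string and normalization reduces it to the main article number "537889".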

Step S414: first data label synonym normalization. After the first data labels have been normalized, the first data label extracted from the data object is rewritten, based on the first data label synonym table mined offline, so that synonyms are unified into a single first data label. For example, if the first data label "1110" occurs 500 times in the classification distribution table of the data objects and its synonym "1111" occurs 20 times, the first data label "1111" can be changed to "1110".

The online operation relies on the tables mined offline. Each data object is integrated based on the first data label and the second data label that occur most often among its data labels, and second data labels or first data labels that are synonyms of one another are normalized. According to the second data label table, the second data label synonym table, the first data label synonym table, the second data label classification distribution table and the first data label to second data label mapping table, it is determined which data objects within a category should be integrated as homogeneous data objects, and their second data labels and first data labels are unified to facilitate search matching.

Through the offline data mining and the online normalization and completion, the mapping relations among data objects, first data labels and second data labels can be obtained and assembled into a data object mapping relations table, so that homogeneous data objects can be looked up in this table during a data search.

Step S204: the data object mapping relations table obtained by the advance integration is stored. This comprises storing in a database the integrated data object mapping relations table obtained through offline mining and online completion and normalization.

During the advance integration of the data objects, the second data label table, the second data label classification distribution table, the second data label synonym table, the first data label synonym table and the first data label to second data label mapping table are formed, and these tables (sets) are stored in the database as the results of the advance integration. They can then be called at any time during a data search, which improves the processing speed of the system.

With the method of the application for integrating data objects in advance, associations are established between homogeneous data objects, and homogeneous data objects are shown in the search results. This provides the user with more complete data and improves the accuracy and recall of the data search, thereby improving the quality of the search results.

The application also provides a device for data search based on the integration of homogeneous data objects.

Fig. 5 is a structural diagram of a device for data search based on the integration of homogeneous data objects according to an embodiment of the application.

The device 500 described in the application can comprise a receiving and search module 501, an acquisition module 503, a matching module 505, and an integrating and returning module 507. Each module corresponds to the implementation of a step of the method described above.

The receiving and search module 501 is configured to receive a search request from a user and to carry out the search, wherein the search request is used to find, among all data objects to be searched, one or more data objects that match the search request.

The acquisition module 503 is configured to analyze each of the one or more data objects found and to obtain the data label of each data object, wherein the data label comprises a first data label and a second data label, which respectively identify different attribute features of a data object. The acquisition module 503 can therefore be used to obtain the first data label and/or the second data label of each data object.

The matching module 505 is configured to perform matching according to the obtained data labels (the first data label and/or the second data label), that is, to further match each of the one or more data objects found, so as to obtain the one or more homogeneous data objects that correspond to each data object.

The integrating and returning module 507 is configured to integrate the one or more data objects whose data labels (the first data label and/or the second data label) match into a homogeneous data object combination, and to return it to the user as the search result. The homogeneous data combination comprises multiple data objects that are homogeneous data objects of one another, and one of the data objects in the combination can be displayed on the search results page.
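The behaviour attributed to module 507 can be sketched as grouping the matched data objects by their normalized labels and showing one object per group. The dict layout of a search hit and the choice of the first hit in each group as the one displayed are assumptions; the text does not say how the displayed object is chosen.

```python
def integrate_results(results):
    """Group search hits into homogeneous sets.

    Each hit is a dict with "category", "first_label" and "second_label"
    keys (a hypothetical layout); labels are assumed to be normalized
    already, so equal keys mean the objects belong to the same class.
    """
    groups = {}
    for obj in results:
        key = (obj["category"], obj["first_label"], obj["second_label"])
        groups.setdefault(key, []).append(obj)
    return list(groups.values())  # groups keep first-seen order


def representatives(groups):
    """Pick one object per group to display (here simply the first hit)."""
    return [group[0] for group in groups]
```

Displaying only the representatives leaves room on the results page for objects of other classes, which is the diversity effect the description refers to.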

The device 500 described in the application further comprises a preprocessing module 509 and a storage module 511.

The preprocessing module 509 is configured to integrate all data objects to be searched in advance and to determine the one or more homogeneous data objects that correspond to each data object to be searched, so as to obtain the data object mapping relations table.

Specifically, the preprocessing module 509 performs offline data mining on all data objects to be searched, and online normalization and mapping of the data objects.

When performing the offline data mining, the preprocessing module 509 can mine the second data labels of each data object and the second data label classification distribution table.

The preprocessing module 509 can perform second data label mining on the second data label of each data object to generate the set of second data label synonyms over all data objects. The second data label synonyms comprise: within the same category, multiple data objects that have different second data labels and the same first data label.

The preprocessing module 509 can perform first data label mining on the first data labels of each data object to generate the first data label synonym set over all data objects. The first data label synonyms comprise: multiple similar first data labels in the same data object.

The preprocessing module 509 can mine the first data labels and second data labels of each data object to generate the mapping relations from first data labels to second data labels. Specifically, if a data object has only one first data label and this first data label has co-occurred with only one unique second data label, a mapping relation between this first data label and this second data label is established.

When performing the online normalization and mapping of data objects, the preprocessing module 509 is configured to extract one or more second data labels from the same data object to obtain one or more candidate second data labels, and to disambiguate the extracted candidates. Further, based on the classification distribution table of second data labels, the number of times a candidate second data label occurs in the category is obtained; if the count is greater than a preset threshold, the candidate is regarded as the second data label of the data object. In another embodiment, if multiple candidate second data labels appear in one data object, the candidate with the largest occurrence count in the second data label classification distribution table is selected as the second data label of the data object.

The preprocessing module 509 is further configured to: extract, based on configured rules, the first data labels from multiple data objects and normalize the extracted first data labels; normalize second data labels or first data labels that are synonyms of one another; and, according to the constructed mapping relations from first data labels to second data labels, complete the second data label for data objects that lack one.

The purpose of the preprocessing module 509 is to integrate all data objects to be searched (the massive data objects) in advance so as to obtain homogeneous data objects, that is, the multiple data objects that, within the same category, have identical or synonymous second data labels and identical or synonymous first data labels. In addition, during the advance integration, the mapping relations among data objects, first data labels and second data labels can be obtained and assembled into the data object mapping relations table.

The storage module 511 is configured to store the data object mapping relations table obtained by the advance integration. When a data search is carried out, the homogeneous data objects that correspond to a data object found by the search can be matched directly through this table.

In this way, key features or attributes of the data objects, such as the first data label and the second data label, are used to classify and integrate the massive data objects in advance and to establish associations between homogeneous data objects. This improves the accuracy and recall of the data search, thereby improving the quality of the search results.

Moreover, the multiple homogeneous data objects returned by the search engine are integrated, and only one of them need be shown on the search results page. The page can therefore show a wider variety of data objects, which increases the diversity of the search results and improves the user experience.

Because the embodiments of the modules of the device shown in Fig. 5 correspond to the embodiments of the steps of the method of the application, and because Figs. 1 to 4 have already been described in detail, the details of each module are not repeated here so as not to obscure the application.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are identical or similar among the embodiments, reference can be made to one another.

The application can be described in the general context of computer-executable instructions, such as program modules or units. Generally, program modules or units include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. In general, program modules or units can be implemented in software, hardware, or a combination of both. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules or units can be located in both local and remote computer storage media, including storage devices.

Finally, it should also be noted that the terms "comprise", "include" and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, commodity or device that comprises a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the existence of other identical elements in the process, method, commodity or device that comprises the element.

Those skilled in the art should understand that embodiments of the application can be provided as a method, a system or a computer program product. Therefore, the application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.

Specific examples are used herein to explain the principles and embodiments of the application; the description of the above embodiments is only intended to help in understanding the method of the application and its core idea. Meanwhile, for those of ordinary skill in the art, the specific embodiments and the scope of application may change according to the idea of the application. In summary, the content of this specification should not be construed as limiting the application.

In a typical configuration, a computing device comprises one or more processors (CPUs), input/output interfaces, network interfaces and memory. The memory may include volatile memory in a computer-readable medium, in forms such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

Claims

1. A data search method based on the integration of homogeneous data objects, characterized by comprising:

receiving a search request from a user, and searching all data objects to be searched for one or more data objects that match the search request;

analyzing each of the one or more data objects found, to obtain the data label of each data object;

matching the obtained data labels;

integrating the one or more data objects whose data labels match into a homogeneous data object combination, and returning it to the user as the search result.

2. The method according to claim 1, characterized in that the data label comprises a first data label and a second data label, which respectively identify different attribute features of a data object.

3. The method according to claim 2, characterized by further comprising: integrating all data objects to be searched in advance, to determine the one or more homogeneous data objects that correspond to each data object to be searched, so as to obtain a data object mapping relations table.

4. The method according to claim 3, characterized in that integrating all data objects to be searched in advance comprises:

mining the second data labels of each data object and the second data label classification distribution table;

performing second data label mining on the second data label of each data object, to generate the set of second data label synonyms over all data objects;

performing first data label mining on the first data labels of each data object, to generate the first data label synonym set over all data objects;

mining the first data labels and second data labels of each data object, to generate the mapping relations from first data labels to second data labels.

5. The method according to claim 4, characterized in that:

the second data label synonyms comprise: within the same category, multiple data objects that have different second data labels and the same first data label;

the first data label synonyms comprise: multiple similar first data labels in the same data object.

6. The method according to claim 4, characterized in that mining the first data labels and second data labels of each data object to generate the mapping relations from first data labels to second data labels comprises: if a data object has only one first data label and the first data label has co-occurred with only one unique second data label, establishing a mapping relation between the first data label and the second data label.

7. The method according to claim 3, characterized in that integrating all data objects to be searched in advance comprises:

extracting one or more second data labels from the same data object to obtain one or more candidate second data labels, and disambiguating the one or more extracted candidate second data labels;

extracting, based on configured rules, the first data labels from multiple data objects, and normalizing the multiple extracted first data labels;

normalizing second data labels or first data labels that are synonyms of one another;

completing, according to the constructed mapping relations from first data labels to second data labels, the second data label for data objects that lack a second data label.

8. The method according to claim 7, characterized in that disambiguating the one or more extracted candidate second data labels comprises:

obtaining, based on the classification distribution table of second data labels, the number of times the candidate second data label occurs in the category and, if the count is greater than a preset threshold, regarding the candidate as the second data label of the data object; and/or, if multiple candidate second data labels appear in one data object, selecting the second data label with the largest occurrence count in the second data label classification distribution table as the second data label of the data object.

9. The method according to claim 1, characterized by comprising:

displaying, on the search results page, one of the multiple data objects in the homogeneous data combination, wherein the homogeneous data combination comprises multiple data objects that are homogeneous data objects of one another.

10. The method according to claim 2, characterized in that the homogeneous data objects comprise: multiple data objects that, within the same category, have identical or synonymous second data labels and identical or synonymous first data labels.

11. A data search device based on the integration of homogeneous data objects, characterized by comprising:

a receiving and search module, configured to receive a search request from a user and to search all data objects to be searched for one or more data objects that match the search request;

an acquisition module, configured to analyze each of the one or more data objects found, to obtain the data label of each data object;

a matching module, configured to match the obtained data labels;

an integrating and returning module, configured to integrate the one or more data objects whose data labels match into a homogeneous data object combination, and to return it to the user as the search result.

12. The device according to claim 11, characterized in that the data label comprises a first data label and a second data label, which respectively identify different attribute features of a data object.

13. The device according to claim 12, characterized by further comprising:

a preprocessing module, configured to integrate all data objects to be searched in advance, to determine the one or more homogeneous data objects that correspond to each data object to be searched, so as to obtain a data object mapping relations table.

14. The device according to claim 13, characterized in that the preprocessing module is further configured to:

Mine the first data label and the second data label in each data object, and generate a mapping from first data labels to second data labels;

If a data object has only one first data label and that first data label co-occurs with only one unique second data label, establish the mapping between the first data label and the second data label.
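The co-occurrence rule of claim 14 can be sketched as follows. This is one possible reading, offered as an assumption: only data objects carrying exactly one first data label contribute evidence, and a mapping is kept only when that first data label has been seen with exactly one distinct second data label. The function name `build_label_mapping` and the input shape are hypothetical.

```python
from collections import defaultdict


def build_label_mapping(objects):
    """Map each first label to a second label when the pairing is unambiguous.

    `objects` is an iterable of (first_labels, second_labels) pairs, each a
    list of strings; this input shape is an assumption for illustration.
    """
    cooccurring = defaultdict(set)
    for first_labels, second_labels in objects:
        # Only objects carrying exactly one first data label contribute.
        if len(first_labels) != 1:
            continue
        cooccurring[first_labels[0]].update(second_labels)

    # Keep a first label only if it co-occurs with exactly one second label.
    return {
        first: next(iter(seconds))
        for first, seconds in cooccurring.items()
        if len(seconds) == 1
    }
```

Under this reading, a first data label seen with two different second data labels, or with none, yields no mapping.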

15. The device according to claim 14, characterized in that synonymy of second data labels comprises: under the same category, multiple data objects that have different second data labels and identical first data labels; and synonymy of first data labels comprises: multiple similar first data labels in the same data object.

16. The device according to claim 13, characterized in that the preprocessing module is further configured to:

Normalize second data labels or first data labels that are synonyms of one another;

According to the constructed mapping between first data labels and second data labels, complete the second data label for data objects that lack a second data label;

Based on a category distribution table of second data labels, obtain the number of times the candidate second data label occurs in the category, and if the number of times is greater than a preset threshold, take the candidate as the second data label of the data object; and/or

If a data object has multiple candidate second data labels, select the candidate that occurs most often in the second data label category distribution table as the second data label of the data object.
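The last two steps of claim 16 can be sketched together. The sketch assumes the category distribution table is a nested dictionary from category to second data label to occurrence count, and it combines the threshold test and the most-frequent rule in one function even though the claim joins them with "and/or". The name `complete_second_label` and its parameters are hypothetical.

```python
def complete_second_label(category, candidates, distribution, threshold=0):
    """Pick a second data label for an object that lacks one.

    `distribution` maps category -> {second_label: occurrence count}.
    Returns the most frequent candidate whose count exceeds `threshold`,
    or None when no candidate qualifies.
    """
    counts = distribution.get(category, {})
    # Threshold rule: drop candidates that are too rare in this category.
    qualifying = [c for c in candidates if counts.get(c, 0) > threshold]
    if not qualifying:
        return None
    # Most-frequent rule: among the remaining candidates, take the commonest.
    return max(qualifying, key=lambda c: counts[c])
```

Returning `None` leaves the data object without a second data label when the evidence is too weak, which is a design choice of this sketch and not something the claim states.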

17. The device according to claim 12, characterized in that the integrating and returning module is further configured to:

Display, on the search results page, one of the multiple data objects in the homogeneous data object set, wherein the homogeneous data object set comprises multiple data objects that are homogeneous data objects of one another;

The homogeneous data objects comprise: multiple data objects that, under the same category, have identical or synonymous second data labels and identical or synonymous first data labels.
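The display step of claim 17 can be illustrated with a minimal sketch. It assumes a synonym table that maps each label variant to a canonical form, and it shows the first object encountered in each set; the claim only requires that one of them be shown, so that choice is an assumption. The dictionary keys and the names `canonical` and `representatives` are hypothetical.

```python
def canonical(label, synonyms):
    """Resolve a label to its canonical form via a synonym table."""
    label = label.strip().lower()
    return synonyms.get(label, label)


def representatives(results, synonyms):
    """Return one object per homogeneous set, preserving result order.

    Each result is a dict with 'category', 'first_label', 'second_label';
    these keys are assumptions for illustration.
    """
    seen = set()
    shown = []
    for obj in results:
        # Same category plus identical-or-synonymous labels means same set.
        key = (
            obj["category"],
            canonical(obj["second_label"], synonyms),
            canonical(obj["first_label"], synonyms),
        )
        if key not in seen:
            seen.add(key)
            shown.append(obj)
    return shown
```

Objects in different categories are never merged, even when both labels agree, which follows the "under the same category" condition of the claim.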