CN106126692A - Method and device for searching sample data - Google Patents

Method and device for searching sample data

Info

Publication number
CN106126692A
Authority
CN
China
Prior art keywords
sample data
field
search
sample
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610499925.4A
Other languages
Chinese (zh)
Inventor
石鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Beijing Qianxin Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Beijing Qianxin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Beijing Qianxin Technology Co Ltd
Priority to CN201610499925.4A
Publication of CN106126692A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention discloses a method and device for searching sample data. The method comprises: collecting sample data from each data source; formatting each piece of collected sample data and storing the formatted sample data in a sample database; receiving a search word sent by a client and converting the search word into a query condition; searching the sample database for sample data that satisfies the query condition; and returning the found sample data to the client for display. With this scheme, the search is not constrained by the native data format of each data source, which can greatly improve search efficiency and reduce the manual cost of searching. Because the query condition is generated from the search word sent by the client, it can express a search on a common search feature as well as a search on a field that was not defined in advance, so the search process is flexible enough to meet different types of search needs.

Description

Method and device for searching sample data
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for searching sample data.
Background
In fields such as the day-to-day operation of antivirus software and threat intelligence mining, searching sample data plays a vital role. Background analysts frequently face two scenarios: how to find more similar sample data starting from one piece of sample data, and how to find the sample data that satisfies a given query condition. How to go quickly and conveniently from 0 to 1, and then from 1 to a whole class of sample data, has become a key factor limiting analysis efficiency.
Existing sample search schemes can search only by a known, specific, fixed feature, for example: by file MD5; by email address, domain name, or IP address; by specific conditions such as file hash, antivirus software name, or number of detections; or by file hash or the domain name of a malicious file link. These traditional schemes have the following drawbacks.
1. Query conditions are not flexible enough, which limits the means of mining sample data. For example, a user may want to search for all samples that created a powershell.exe process, but none of the search methods above supports this; likewise, a user may want to search for all mail samples with attachments, which the methods above also do not support.
2. Search results are not shown in one place. Even if the search methods above could together support flexible searching of sample data, they are independent of one another; to mine one class of sample data, an analyst has to repeat the search in each search system, which seriously scatters the analyst's attention and reduces mining efficiency.
3. Sample data from different data sources, and even from different business lines of the same company, each has its own data format. There is hardly any effective means of integrating sample data in different formats, let alone of providing a flexible search function.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for searching sample data that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for searching sample data is provided, the method comprising:
collecting sample data from each data source;
formatting each piece of collected sample data, and storing each piece of formatted sample data in a sample database;
receiving a search word sent by a client, and converting the search word into a query condition;
searching the sample database for sample data that satisfies the query condition;
returning the found sample data to the client for display.
Optionally, collecting sample data from each data source includes:
using a crawler to crawl sample data from each data source;
and/or,
using a crawler to crawl logs from each data source, using a distributed file processing framework to parse the logs of each data source in batches, and obtaining sample data from the logs of each data source.
Optionally, formatting each piece of collected sample data includes: converting each piece of collected sample data into sample data of a specified format;
converting the search word into a query condition includes: converting the search word into a query condition of the specified format.
Optionally, converting each piece of collected sample data into sample data of the specified format includes:
for each piece of sample data,
extracting from the sample data the fields that meet a preset condition;
for each extracted field, extracting the value of the field from the sample data, and forming the two-element feature corresponding to the field from the field and its value;
taking the feature set made up of the two-element features corresponding to the extracted fields as the sample data of the specified format obtained by converting the sample data.
Optionally, converting the search word into a query condition of the specified format includes:
for each search word, extracting from the search word the field that meets the preset condition, extracting the value of the field from the search word, and taking the two-element feature formed by the field and its value as the query condition of the specified format obtained by converting the search word.
Optionally, the method further includes: establishing a feature field library, the feature field library including multiple feature fields;
extracting from the sample data the fields that meet the preset condition includes: traversing the fields contained in the sample data against the feature field library, and extracting the fields that hit the feature field library;
extracting from the search word the field that meets the preset condition includes: traversing the fields contained in the search word against the feature field library, and extracting the fields that hit the feature field library.
Optionally, the feature field library includes one or more of the following feature fields:
a field indicating that the data created a specified process, a field indicating that the data contains macro code, a field indicating that the data accessed a specified website, a field representing an email address, a field representing a domain name, a field representing an IP address, and a field representing a URL.
Optionally, the method further includes:
at every preset time interval, collecting feature fields again and adding them to the feature field library, so as to update the feature field library;
after the feature field library is updated, re-executing the operation of formatting each piece of collected sample data.
Optionally, searching the sample database for sample data that satisfies the query condition includes:
traversing each piece of sample data in the sample database;
for each piece of sample data, traversing the two-element features contained in the sample data and, if a two-element feature identical to the two-element feature in the query condition exists, determining that the sample data satisfies the query condition.
Optionally, the sample database includes: a distributed file system in a distributed file processing framework.
Optionally, before the found sample data is returned to the client for display, the method further includes:
obtaining the data format suited to the client;
converting the found sample data into the data format suited to the client;
returning the found sample data to the client for display then includes: returning the found, format-converted sample data to the client for display.
According to another aspect of the present invention, a device for searching sample data is provided, the device comprising:
a sample data collection unit, adapted to collect sample data from each data source;
a sample data processing unit, adapted to format each piece of sample data collected by the sample data collection unit and store each piece of formatted sample data in a sample database;
a search interaction unit, adapted to receive a search word sent by a client, convert the search word into a query condition, and send the query condition to a search query unit;
the search query unit, adapted to search the sample database for sample data that satisfies the query condition and return it to the search interaction unit;
the search interaction unit being further adapted to return the sample data found by the search query unit to the client for display.
Optionally, the sample data collection unit is adapted to use a crawler to crawl sample data from each data source; and/or to use a crawler to crawl logs from each data source, use a distributed file processing framework to parse the logs of each data source in batches, and obtain sample data from the logs of each data source.
Optionally, the sample data processing unit is adapted to convert each piece of collected sample data into sample data of a specified format;
the search interaction unit is adapted to convert the received search word into a query condition of the specified format.
Optionally, the sample data processing unit is adapted to: for each piece of sample data, extract from the sample data the fields that meet a preset condition; for each extracted field, extract the value of the field from the sample data and form the two-element feature corresponding to the field from the field and its value; and take the feature set made up of the two-element features corresponding to the extracted fields as the sample data of the specified format obtained by converting the sample data.
Optionally, the search interaction unit is adapted to: for each search word, extract from the search word the field that meets the preset condition, extract the value of the field from the search word, and take the two-element feature formed by the field and its value as the query condition of the specified format obtained by converting the search word.
Optionally, the device further includes: a feature field library establishing unit;
the feature field library establishing unit is adapted to establish a feature field library, the feature field library including multiple feature fields;
the sample data processing unit is adapted to traverse the fields contained in the sample data against the feature field library and extract the fields that hit the feature field library;
the search interaction unit is adapted to traverse the fields contained in the search word against the feature field library and extract the fields that hit the feature field library.
Optionally, the feature field library includes one or more of the following feature fields:
a field indicating that the data created a specified process, a field indicating that the data contains macro code, a field indicating that the data accessed a specified website, a field representing an email address, a field representing a domain name, a field representing an IP address, and a field representing a URL.
Optionally, the feature field library establishing unit is further adapted to collect feature fields again at every preset time interval and add them to the feature field library, so as to update the feature field library;
the sample data processing unit is further adapted to re-execute the operation of formatting each piece of collected sample data after the feature field library is updated.
Optionally, the search query unit is adapted to traverse each piece of sample data in the sample database and, for each piece of sample data, traverse the two-element features contained in the sample data; if a two-element feature identical to the two-element feature in the query condition exists, the sample data is determined to satisfy the query condition.
Optionally, the sample database includes: a distributed file system in a distributed file processing framework.
Optionally, the search interaction unit is further adapted to, before the found sample data is returned to the client for display, obtain the data format suited to the client, convert the found sample data into that data format, and return the found, format-converted sample data to the client for display.
According to the technical scheme of the present invention, the sample data of each data source is integrated into a single sample database. During a search, the search word sent by the client is converted into a query condition, sample data is looked up in the sample database according to this query condition, and the found sample data is returned to the client as the search result for display. With this scheme, the sample data of each data source is integrated into a unified format, so the search is not constrained by the native data format of each data source. This provides a unified search interface that finds, in one pass, the sample data from every data source that satisfies the query condition, which can greatly improve search efficiency and reduce the manual cost of searching. Because the query condition is generated from the search word sent by the client, it can express a search on a common search feature as well as a search on a field that was not defined in advance, so the search process is flexible enough to meet different types of search needs.
The above is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and practiced according to the content of the description, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for searching sample data;
Fig. 2 shows a schematic diagram of a device for searching sample data;
Fig. 3 shows a schematic diagram of a device for searching sample data.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a flowchart of a method for searching sample data. As shown in Fig. 1, the method includes:
Step S110, collecting sample data from each data source.
Step S120, formatting each piece of collected sample data, and storing each piece of formatted sample data in a sample database.
Sample data from different data sources each has its own data format, which does not support a unified search process. This step therefore formats sample data of different formats into a single format, which makes the subsequent search process easier to carry out.
Step S130, receiving a search word sent by a client, and converting the search word into a query condition.
Step S140, searching the sample database for sample data that satisfies the query condition.
Step S150, returning the found sample data to the client for display.
As can be seen, the method shown in Fig. 1 integrates the sample data of each data source into a single sample database. During a search, the search word sent by the client is converted into a query condition, sample data is looked up in the sample database according to this query condition, and the found sample data is returned to the client as the search result for display. With this scheme, the sample data of each data source is integrated into a unified format, so the search is not constrained by the native data format of each data source. This provides a unified search interface that finds, in one pass, the sample data from every data source that satisfies the query condition, which can greatly improve search efficiency and reduce the manual cost of searching. Because the query condition is generated from the search word sent by the client, it can express a search on a common search feature as well as a search on a field that was not defined in advance, so the search process is flexible enough to meet different types of search needs.
In one embodiment of the invention, collecting sample data from each data source in step S110 includes: using a crawler to crawl sample data from each data source; and/or using a crawler to crawl logs from each data source, using a distributed file processing framework to parse the logs of each data source in batches, and obtaining sample data from those logs. For example, a crawler may crawl sample data from a specified website or from a specified file (such as the body of a report). Alternatively, a crawler may first crawl the logs corresponding to a data source, and the Hadoop framework may then parse the logs in batches to obtain the sample data of that data source. The concrete form of a data source is not limited: any text information that is valuable to the search process can be collected, whatever its origin.
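The log-parsing path of step S110 can be sketched as follows. This is a minimal illustration only: the log layout (a timestamp followed by `key=value` pairs), the field names, and the function names are assumptions made for the example, and plain Python stands in for the batch parsing that the text assigns to a distributed framework such as Hadoop.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical log layout: "<timestamp> key=value key=value ..."
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_log_line(line: str) -> Dict[str, str]:
    """Turn one log line into a raw sample record (field -> value)."""
    return dict(PAIR.findall(line))


def collect_from_logs(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Batch-parse log lines, skipping lines that carry no fields."""
    records = (parse_log_line(line) for line in lines)
    return [record for record in records if record]


# Illustrative log lines; the values are made up for the example.
logs = [
    "2016-06-29T10:00:00 md5=0123abcd type=Email attachment=true",
    "-- rotated --",
    "2016-06-29T10:00:05 md5=4567ef01 Domain=example.org",
]
samples = collect_from_logs(logs)
```

In a real deployment the same per-line function could serve as the map step of a batch job, with the crawler supplying the log files.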
In one embodiment of the invention, the sample database includes a distributed file system in a distributed file processing framework, such as HDFS, or an in-memory database, such as a Redis database.
In one embodiment of the invention, formatting each piece of collected sample data in step S120 includes converting each piece of collected sample data into sample data of a specified format. The conversion may proceed as follows: for each piece of sample data, extract from it the fields that meet a preset condition; for each extracted field, extract the value of that field from the sample data and form the two-element feature corresponding to that field from the field and its value; and take the feature set made up of the two-element features of all extracted fields as the sample data of the specified format. For example, suppose the fields meeting the preset condition that are extracted from a piece of sample data are field a and field b, and the value of field a extracted from the sample data is "true". The two-element feature (field a, true) formed by field a and its value is in essence metadata in key-value form.
In one case, field b is the parent field of field a, so the value of field b contains the metadata formed by field a and its value. If the value of field b is "field a:true; field c:10010", the two-element feature (field b, field a:true; field c:10010) formed by field b and its value is likewise metadata in key-value form. In the feature set {(field a, true), (field b, field a:true; field c:10010)}, the feature (field a, true) is already contained in (field b, field a:true; field c:10010), so the feature set can be simplified directly to (field b, field a:true; field c:10010) and used as the converted sample data of the specified format.
In the other case, field b and field a are parallel fields whose values are independent of each other. If the value of field b is "www.microsoft.com", the two-element feature (field b, www.microsoft.com) formed by field b and its value is likewise metadata in key-value form, and the feature set {(field a, true), (field b, www.microsoft.com)} is used as the converted sample data of the specified format. As this shows, different relationships between the fields extracted from the sample data lead to slightly different converted sample data, but in essence the result is always made up of key-value metadata, which facilitates the subsequent unified search process.
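The two cases above (parent and child fields versus parallel fields) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the dictionary representation of raw sample data, the `wanted` set standing in for the preset condition, and the names `to_features`, `field_a`, `field_b`, and `field_c` are all assumptions.

```python
from typing import Any, Dict, List, Set, Tuple

Feature = Tuple[str, Any]  # one (field, value) pair in key-value form


def to_features(sample: Dict[str, Any], wanted: Set[str]) -> List[Feature]:
    """Convert a raw sample into a list of (field, value) features.

    A wanted parent field is kept whole, so a wanted child inside it is
    not emitted a second time (the simplification described above).
    Unwanted wrapper fields are searched for wanted fields inside them.
    """
    features: List[Feature] = []
    for field, value in sample.items():
        if field in wanted:
            features.append((field, value))
        elif isinstance(value, dict):
            features.extend(to_features(value, wanted))
    return features


wanted = {"field_a", "field_b"}

# Case 1: field_b is the parent of field_a.
nested = {"field_b": {"field_a": True, "field_c": 10010}}
nested_features = to_features(nested, wanted)

# Case 2: field_a and field_b are parallel fields.
parallel = {"field_a": True, "field_b": "www.microsoft.com", "size": 3}
parallel_features = to_features(parallel, wanted)
```

In the first case the result holds a single feature for `field_b` whose value already carries `field_a`; in the second case it holds one feature per field.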
Correspondingly, converting the search word into a query condition in step S130 includes converting the search word into a query condition of the specified format. The conversion may proceed as follows: for each search word, extract from it the field that meets the preset condition, extract the value of that field from the search word, and take the two-element feature formed by the field and its value as the query condition of the specified format. One or more fields meeting the preset condition may be extracted. For example, suppose the fields "type" and "attachment" are extracted. The extracted value of "type" is "Email", giving the two-element feature (type, Email), which is metadata in key-value form; the value of "attachment" is "true", giving (attachment, true), likewise metadata in key-value form. The query condition of the specified format obtained after conversion is {(type, Email), (attachment, true)}, which expresses the intent to find sample data satisfying the condition "mail with an attachment".
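A minimal sketch of this conversion is shown below. It assumes, purely for illustration, that the client sends a list of search words and that each search word has the form "field value" separated by a space; the function name `to_query` and the set `known_fields` (standing in for the preset condition) are hypothetical.

```python
from typing import List, Set, Tuple

Feature = Tuple[str, str]  # one (field, value) pair


def to_query(search_words: List[str], known_fields: Set[str]) -> List[Feature]:
    """Convert search words into a query condition of (field, value) pairs.

    Search words whose field is not recognised, or that carry no value,
    are ignored in this sketch.
    """
    query: List[Feature] = []
    for word in search_words:
        field, _, value = word.strip().partition(" ")
        value = value.strip()
        if field in known_fields and value:
            query.append((field, value))
    return query


query = to_query(
    ["type Email", "attachment true", "colour red"],
    known_fields={"type", "attachment"},
)
```

Here `query` holds the two pairs for "type" and "attachment", matching the "mail with an attachment" example; the unrecognised word is dropped.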
As the above description of steps S120 and S130 shows, the collected sample data is unified into sample data of the specified format, and the search word received from the client is converted into a query condition of the specified format. Because the two correspond, the query condition can be used directly to check whether a piece of sample data meets the search need. That is, searching the sample database for sample data satisfying the query condition in step S140 includes: traversing each piece of sample data in the sample database; and, for each piece of sample data, traversing the two-element features it contains and, if a two-element feature identical to the two-element feature in the query condition exists, determining that the sample data satisfies the query condition. In one embodiment of the invention, the query condition includes multiple two-element features, such as the query condition {(type, Email), (attachment, true)} mentioned above. While traversing the sample database, each piece of sample data is then checked first for (type, Email) and, if that is present, for (attachment, true). If both are present, the sample data satisfies the query condition; otherwise it does not.
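The traversal in step S140 can be sketched as a simple linear scan. The in-memory dictionary used as the sample database, the sample identifiers, and the function names are assumptions for illustration; treating an empty query as matching nothing is also a choice made for this sketch, not something the text specifies.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, str]


def satisfies(sample_features: List[Feature], query: List[Feature]) -> bool:
    """True if every (field, value) pair in the query occurs in the sample."""
    if not query:
        return False  # assumption: an empty query matches nothing
    present = set(sample_features)
    return all(feature in present for feature in query)


def search(database: Dict[str, List[Feature]], query: List[Feature]) -> List[str]:
    """Traverse every sample and return the ids of those that satisfy the query."""
    return [sid for sid, features in database.items() if satisfies(features, query)]


# Hypothetical database: sample id -> list of (field, value) features.
database = {
    "s1": [("type", "Email"), ("attachment", "true")],
    "s2": [("type", "Email"), ("attachment", "false")],
    "s3": [("type", "PE"), ("attachment", "true")],
}
hits = search(database, [("type", "Email"), ("attachment", "true")])
```

Only the sample that carries both pairs is returned; a sample matching just one of them is rejected, as in the two-stage check described above.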
Further, the method shown in Fig. 1 also includes establishing a feature field library that contains multiple feature fields. Extracting the fields that meet the preset condition from the sample data then includes: traversing the fields contained in the sample data against the feature field library and extracting the fields that hit the library. Likewise, extracting the field that meets the preset condition from the search word includes: traversing the fields contained in the search word against the feature field library and extracting the fields that hit the library. The feature fields in the library may be common fixed feature fields that describe basic information about the sample data, such as fields representing a file hash value, a file name, an email address, a domain name, or a URL. They may also be more flexible, non-fixed feature fields that describe information about, or the behavior of, the sample data, such as a field indicating that the data created a specified process, a field indicating that the data contains macro code, a field indicating that the data accessed a specified website, a field indicating that the data was flagged by a specified antivirus engine, or a field indicating that the data accessed a specified dynamic domain name.
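The role of the feature field library on both sides of the search can be sketched as follows. The concrete field names in the two sets are hypothetical stand-ins for the kinds of fixed and behavioral fields listed above, and `hit_fields` is an illustrative helper, not a name from the source.

```python
from typing import Iterable, List, Set

# Hypothetical field names for the two kinds of feature fields.
FIXED_FIELDS: Set[str] = {"md5", "file_name", "email", "domain", "url"}
BEHAVIOUR_FIELDS: Set[str] = {"created_process", "contains_macro", "visited_site"}
FEATURE_FIELD_LIBRARY: Set[str] = FIXED_FIELDS | BEHAVIOUR_FIELDS


def hit_fields(fields: Iterable[str], library: Set[str]) -> List[str]:
    """Return the fields that hit the library, keeping their original order."""
    return [field for field in fields if field in library]


# The same library is applied to a sample's fields ...
sample = {"md5": "0123abcd", "created_process": "powershell.exe", "size": 1024}
sample_hits = hit_fields(sample.keys(), FEATURE_FIELD_LIBRARY)

# ... and to the tokens of a search word.
search_word = "created_process powershell.exe"
query_hits = hit_fields(search_word.split(), FEATURE_FIELD_LIBRARY)
```

Because both sides consult one library, a field added to it becomes searchable and extractable at the same time.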
As this embodiment suggests, the coverage of the feature field library directly determines how fully the search process provided by this scheme can meet the client's search needs, so the library must be expanded and updated in a timely manner. This scheme therefore further includes: at every preset time interval, collecting feature fields again and adding them to the feature field library to update it; and, after the feature field library is updated, re-executing the operation of formatting each piece of collected sample data. Specifically, this embodiment can obtain updated feature fields by collecting search words.
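One possible shape of such an update cycle is sketched below. The source only states that new feature fields can be obtained from collected search words; the frequency threshold `min_count`, the "field value" layout of search words, and the names `propose_fields` and `refresh` are assumptions. The periodic trigger itself (a scheduler calling `refresh` at each preset interval) is left out.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple


def propose_fields(search_log: List[str], library: Set[str], min_count: int = 2) -> Set[str]:
    """Fields that lead at least min_count search words and are not yet in the library."""
    counts = Counter(word.split()[0] for word in search_log if word.split())
    return {f for f, n in counts.items() if n >= min_count and f not in library}


def refresh(
    library: Set[str], raw_samples: List[Dict[str, str]], search_log: List[str]
) -> Tuple[Set[str], Optional[List[Dict[str, str]]]]:
    """Update the library and, only if it changed, re-format the raw samples."""
    new_fields = propose_fields(search_log, library)
    if not new_fields:
        return library, None
    updated = library | new_fields
    reformatted = [{k: v for k, v in s.items() if k in updated} for s in raw_samples]
    return updated, reformatted


library = {"md5"}
raw_samples = [{"md5": "aa", "Domain": "x.org", "size": "1"}]
search_log = ["Domain f3322.org", "Domain example.org", "size 1"]
updated_library, reformatted = refresh(library, raw_samples, search_log)
```

Re-formatting from the raw samples, rather than from the previously formatted ones, is what lets a newly added field appear in data that was collected before the update.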
The implementation of this scheme is illustrated with a specific example. In this example, the user wants to find all sample data that the Microsoft antivirus engine reports as Locky. The search word received from the client is "scans Microsoft result Locky", which is converted into the query condition (result, Locky) or (scans Microsoft result, Locky), where Microsoft is the parent field of result and scans is the parent field of Microsoft. The query condition means: find the result field, under the Microsoft field under the scans field, whose value is Locky. The sample database is traversed according to this query condition to check whether each piece of sample data satisfies it. The content of one piece of sample data in the specified format is as follows (the listing is not reproduced in this text):
Traversing this sample data shows that it contains "result": "RansomWin32/Locky.lrfn", that is, it contains (result, Locky); the parent of this result field is the Microsoft field, and the parent of the Microsoft field is the scans field. In other words, the sample data contains a result field, under the Microsoft field under the scans field, whose value is Locky, so it satisfies the query condition. The sample data is returned to the client, and the client obtains, according to the query condition, all other information associated with this sample data, such as its hash value "md5", its download address "download_url", the value of each field under the Microsoft field under the scans field, and the value of each field under the SUPERAntispyware field under the scans field, which meets the search need.
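The nested lookup in this example can be sketched as follows. The record below is a hypothetical reconstruction built only from the field names mentioned in the text; it is not the patent's original listing, and the placeholder strings stand for values the text does not give. Treating the last token of the search word as the value, the earlier tokens as a field path, and matching by substring are assumptions drawn from how the example reads.

```python
from typing import Any, Dict, List, Tuple


def parse_path_query(search_word: str) -> Tuple[List[str], str]:
    """Split "scans Microsoft result Locky" into (field path, value)."""
    tokens = search_word.split()
    if len(tokens) < 2:
        raise ValueError("search word needs at least a field and a value")
    return tokens[:-1], tokens[-1]


def lookup(record: Dict[str, Any], path: List[str]) -> Any:
    """Follow the field path through nested dicts; None if any step is missing."""
    node: Any = record
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def matches(record: Dict[str, Any], search_word: str) -> bool:
    """True if the field at the path is a string that contains the value."""
    path, value = parse_path_query(search_word)
    found = lookup(record, path)
    return isinstance(found, str) and value in found


# Hypothetical record; only the field names come from the text.
record = {
    "md5": "<hash not given>",
    "download_url": "<url not given>",
    "scans": {
        "Microsoft": {"result": "RansomWin32/Locky.lrfn"},
        "SUPERAntispyware": {"result": None},
    },
}
is_hit = matches(record, "scans Microsoft result Locky")
```

Once a record is judged a hit, the whole record, including fields that played no part in the query, is what gets returned to the client.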
Specifically, in one embodiment of the invention, before the found sample data is returned to the client for display, the scheme further includes: obtaining the data format adapted to the client; and converting the format of the found sample data into the data format adapted to the client. Step S150, returning the found sample data to the client for display, then includes: returning the found and format-converted sample data to the client for display.
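As an illustration of this conversion step, the sketch below renders a found sample in one of two client formats. The format names "json" and "text", the function name `to_client_format`, and the sample values are assumptions; the description does not say which formats a client may request.

```python
import json


def to_client_format(sample: dict[str, str], client_format: str) -> str:
    """Convert a found sample into the data format the client is assumed to expect."""
    if client_format == "json":
        return json.dumps(sample, sort_keys=True)
    if client_format == "text":
        return "\n".join(f"{key}: {value}" for key, value in sorted(sample.items()))
    raise ValueError(f"unsupported client format: {client_format}")


found = {"md5": "<hash>", "type": "<type>"}  # hypothetical found sample
as_text = to_client_format(found, "text")
as_json = to_client_format(found, "json")
```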
In another example, the user wants to search for all sample data that accessed the f3322.org dynamic DNS (DDNS) service. The search word received from the client is "Domain f3322.org", which is converted into the query condition in the specified format (Domain, f3322.org). The query condition means: search for a Domain field whose value is f3322.org. The sample database is traversed according to this query condition to check whether each sample data satisfies it. The content of one sample data in the specified format is:
Traversing this sample data shows that it contains "Domain": "ssxx33.f3322.org*218.244.134.107*112.213.125.52*", that is, it contains (Domain, f3322.org), a Domain field whose value is f3322.org, so it satisfies the query condition. This sample data is returned to the client, and the client obtains, according to the query condition, all other information associated with this sample data, such as the values of its hash value "md5", its type "type", and its upload time "up_load time" fields, which satisfies the search need.
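The traversal in this example might look like the sketch below. The two database records are hypothetical apart from the Domain value quoted in the text, and `search` is an illustrative name. As in the text, a record counts as a hit when its Domain value contains f3322.org, and the whole record is returned so that the associated fields are available to the client.

```python
def search(database: list[dict[str, str]], field: str, value: str) -> list[dict[str, str]]:
    """Traverse every sample and keep those whose `field` value contains `value`."""
    return [sample for sample in database if value in sample.get(field, "")]


# Hypothetical sample database; placeholder values stand in for real hashes and types.
database = [
    {
        "md5": "<hash-1>",
        "type": "<type-1>",
        "Domain": "ssxx33.f3322.org*218.244.134.107*112.213.125.52*",
    },
    {"md5": "<hash-2>", "type": "<type-2>", "Domain": "www.example.com*"},
]

field, value = "Domain f3322.org".split(maxsplit=1)
results = search(database, field, value)
```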
In other examples, it is also possible to search for sample data that contains a url field whose value contains "115.239.230.228" (for illustration only, not limitation), or to search for sample data that contains a url field whose value contains "LSQZA.swf" (for illustration only, not limitation), and so on. A user of the client can use descriptive information about any aspect of a sample data as the query condition and obtain all associated information of the sample data that satisfies it. The search process is convenient and effective, the search dimensions are broad, and search needs of different forms can be met.
When entering a search at the client, the user may enter it through a page template or through other forms of input.
Fig. 2 shows a schematic diagram of a search device for sample data. As shown in Fig. 2, the search device 200 for sample data includes:
A sample data collection unit 210, adapted to collect sample data from each data source.
A sample data processing unit 220, adapted to format each sample data collected by the sample data collection unit; each formatted sample data is stored in the sample database.
A search interaction unit 240, adapted to receive the search word sent by the client, convert the search word into a query condition, and send it to the search query unit 230.
A search query unit 230, adapted to look up, in the sample database, the sample data that satisfies the query condition and return it to the search interaction unit 240.
The search interaction unit 240 is further adapted to return the sample data found by the search query unit 230 to the client for display.
It can be seen that, through the cooperation of its units, the device shown in Fig. 2 integrates the sample data of each data source into the sample database in a unified way. During a search, the search word sent by the client is converted into a query condition, sample data is looked up in the sample database according to this query condition, and the sample data found is returned to the client as the search result and displayed. Under this scheme, the sample data of the data sources is integrated into a unified format, so the search process is not constrained by each data source's own data format. This provides a unified search interface that retrieves, in a single search across all data sources, the sample data satisfying the query condition, which can greatly improve search efficiency and reduce the manual cost of searching. In addition, the query condition is generated from the search word sent by the client and may target common search features or fields that were not defined in advance, which gives the search process considerable flexibility and meets different types of search needs.
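The cooperation of the four units can be sketched as plain classes. The class and method names mirror units 210 to 240 but are otherwise assumptions, as are the in-memory list used as the sample database, the "field value" search word layout, and the sample records.

```python
from typing import Any, Iterable

Feature = tuple[str, str]


class SampleDataCollectionUnit:
    """Unit 210: gathers raw samples from every data source."""

    def __init__(self, sources: Iterable[Iterable[dict[str, Any]]]) -> None:
        self.sources = [list(source) for source in sources]

    def collect(self) -> list[dict[str, Any]]:
        return [sample for source in self.sources for sample in source]


class SampleDataProcessingUnit:
    """Unit 220: formats raw samples and stores them in the sample database."""

    def __init__(self, database: list[set[Feature]]) -> None:
        self.database = database

    def process(self, raw_samples: list[dict[str, Any]]) -> None:
        for raw in raw_samples:
            self.database.append({(field, str(value)) for field, value in raw.items()})


class SearchQueryUnit:
    """Unit 230: looks up samples that contain the query condition."""

    def __init__(self, database: list[set[Feature]]) -> None:
        self.database = database

    def find(self, condition: Feature) -> list[set[Feature]]:
        return [sample for sample in self.database if condition in sample]


class SearchInteractionUnit:
    """Unit 240: turns a search word into a query condition and returns the results."""

    def __init__(self, query_unit: SearchQueryUnit) -> None:
        self.query_unit = query_unit

    def handle(self, search_word: str) -> list[dict[str, str]]:
        field, value = search_word.split(maxsplit=1)
        return [dict(sample) for sample in self.query_unit.find((field, value))]


# Two hypothetical data sources with one raw sample each.
database: list[set[Feature]] = []
collector = SampleDataCollectionUnit(
    [[{"md5": "<hash-1>", "type": "exe"}], [{"md5": "<hash-2>", "type": "doc"}]]
)
SampleDataProcessingUnit(database).process(collector.collect())
results = SearchInteractionUnit(SearchQueryUnit(database)).handle("type doc")
```

Only unit 240 talks to the client, so the storage layout behind unit 230 can change without affecting the search interface.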
In one embodiment of the invention, the sample data collection unit 210 is adapted to use a crawler to crawl sample data from each data source; and/or to use a crawler to crawl logs from each data source, use a distributed file processing framework to parse the logs of each data source in batches, and obtain sample data from the logs of each data source.
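The log-parsing branch can be illustrated locally. The sketch assumes each log line is a JSON object, which the description does not state, and it uses a thread pool purely as a stand-in for the distributed file processing framework; `parse_log_line` and `parse_logs` are hypothetical names.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Optional


def parse_log_line(line: str) -> Optional[dict[str, Any]]:
    """Parse one log line into a sample record, or return None if it is not usable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def parse_logs(lines: list[str]) -> list[dict[str, Any]]:
    """Parse a batch of log lines in parallel and drop the lines that failed to parse."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        parsed = list(pool.map(parse_log_line, lines))
    return [record for record in parsed if record is not None]


# Hypothetical crawled log lines.
log_lines = [
    '{"md5": "<hash-1>", "Domain": "a.example.org"}',
    "not a json line",
    '{"md5": "<hash-2>"}',
]
samples = parse_logs(log_lines)
```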
In one embodiment of the invention, the sample database includes the distributed file system of a distributed file processing framework.
In one embodiment of the invention, the search interaction unit 240 is further adapted to, before the found sample data is returned to the client for display, obtain the data format adapted to the client; convert the format of the found sample data into the data format adapted to the client; and return the found and format-converted sample data to the client for display.
In one embodiment of the invention, the sample data processing unit 220 is adapted to convert each collected sample data into sample data in a specified format. The search interaction unit 240 is adapted to convert the received search word into a query condition in the specified format.
In one embodiment of the invention, the sample data processing unit 220 is adapted to, for each sample data, extract from the sample data the fields that satisfy a preset condition; for each extracted field, extract the value of the field from the sample data and form, from the field and its value, the two-dimensional feature (a field-value pair) corresponding to the field; and take the feature set composed of the two-dimensional features corresponding to the extracted fields as the sample data in the specified format obtained by converting the sample data.
The search interaction unit 240 is then adapted to, for each search word, extract from the search word the field that satisfies the preset condition, extract the value of the field from the search word, and take the two-dimensional feature composed of the field and its value as the query condition in the specified format obtained by converting the search word.
The search query unit 230 is then adapted to traverse each sample data in the sample database; for each sample data, it traverses the two-dimensional features contained in the sample data and, if there is a two-dimensional feature identical to the two-dimensional feature in the query condition, determines that the sample data satisfies the query condition.
Fig. 3 shows a schematic diagram of a search device for sample data. The search device 300 for sample data includes: a sample data collection unit 310, a sample data processing unit 320, a search interaction unit 340, a search query unit 330, and a feature field library establishing unit 350.
The sample data collection unit 310, the sample data processing unit 320, the search interaction unit 340, and the search query unit 330 have the same functions as the corresponding sample data collection unit 210, sample data processing unit 220, search interaction unit 240, and search query unit 230 shown in Fig. 2; the identical parts are not repeated here.
The feature field library establishing unit 350 is adapted to establish a feature field library, and the feature field library includes multiple feature fields.
The sample data processing unit 320 is adapted to, for each sample data, traverse the fields contained in the sample data according to the feature field library and extract the fields that hit the feature field library; for each extracted field, extract the value of the field from the sample data and form, from the field and its value, the two-dimensional feature corresponding to the field; and take the feature set composed of the two-dimensional features corresponding to the extracted fields as the sample data in the specified format obtained by converting the sample data.
The search interaction unit 340 is adapted to, for each search word, traverse the fields contained in the search word according to the feature field library, extract the field that hits the feature field library, extract the value of the field from the search word, and take the two-dimensional feature composed of the field and its value as the query condition in the specified format obtained by converting the search word.
Specifically, the feature field library includes one or more of the following feature fields: a field indicating that the data created a specified process, a field indicating that the data contains macro code, a field indicating that the data accessed a specified website, a field representing an email address, a field representing a domain name, a field representing an IP address, and a field representing a URL address.
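The library-driven extraction on both sides, and the identity check between them, can be sketched as follows. The library entries other than Domain and url are hypothetical names for the field kinds listed above, and the rule that a field's value is the token following it in the search word is an assumption made for the illustration.

```python
from typing import Any, Optional

Feature = tuple[str, str]

# Hypothetical names for the kinds of feature fields listed in the description.
FEATURE_FIELD_LIBRARY = {"process", "macro", "website", "email", "Domain", "ip", "url"}


def sample_to_features(raw: dict[str, Any], library: set[str]) -> set[Feature]:
    """Extract the fields that hit the library and pair each with its value."""
    return {(field, str(value)) for field, value in raw.items() if field in library}


def search_word_to_condition(search_word: str, library: set[str]) -> Optional[Feature]:
    """Find the first token that hits the library; assume the next token is its value."""
    tokens = search_word.split()
    for index, token in enumerate(tokens[:-1]):
        if token in library:
            return (token, tokens[index + 1])
    return None


features = sample_to_features(
    {"md5": "<hash>", "Domain": "a.example.org", "ip": "192.0.2.1"}, FEATURE_FIELD_LIBRARY
)
condition = search_word_to_condition("Domain a.example.org", FEATURE_FIELD_LIBRARY)
satisfied = condition in features  # identical two-dimensional feature present
```

Because samples and search words are filtered through the same library, a query condition can only name fields that the formatted samples may actually carry.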
Further, the feature field library establishing unit 350 is further adapted to, at every preset time interval, collect feature fields again and add them to the feature field library, so that the feature field library is updated; the sample data processing unit 320 is then further adapted to, after the feature field library has been updated, re-execute the operation of formatting each collected sample data.
It should be noted that the embodiments of the devices shown in Fig. 2 and Fig. 3 correspond to the embodiments of the method shown in Fig. 1, which have been described in detail above and are not repeated here.
In summary, the technical solution provided by the invention integrates the sample data of each data source into the sample database in a unified way. During a search, the search word sent by the client is converted into a query condition, sample data is looked up in the sample database according to this query condition, and the sample data found is returned to the client as the search result and displayed. Under this scheme, the sample data of the data sources is integrated into a unified format, so the search process is not constrained by each data source's own data format. This provides a unified search interface that retrieves, in a single search across all data sources, the sample data satisfying the query condition, which can greatly improve search efficiency and reduce the manual cost of searching. In addition, the query condition is generated from the search word sent by the client and may target common search features or fields that were not defined in advance, which gives the search process considerable flexibility and meets different types of search needs.
It should be noted that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual device, or other equipment. Various general-purpose devices may also be used with the teachings herein. From the description above, the structure required to construct such devices is apparent. Moreover, the invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the search device for sample data according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
The invention discloses A1, a search method for sample data, wherein the method includes:
Collecting sample data from each data source;
Formatting each collected sample data, and storing each formatted sample data in a sample database;
Receiving a search word sent by a client, and converting the search word into a query condition;
Looking up, in the sample database, the sample data that satisfies the query condition;
Returning the found sample data to the client for display.
A2. The method as described in A1, wherein collecting sample data from each data source includes:
Using a crawler to crawl sample data from each data source;
And/or,
Using a crawler to crawl logs from each data source, using a distributed file processing framework to parse the logs of each data source in batches, and obtaining sample data from the logs of each data source.
A3. The method as described in A1, wherein,
Formatting each collected sample data includes: converting each collected sample data into sample data in a specified format;
Converting the search word into a query condition includes: converting the search word into a query condition in the specified format.
A4. The method as described in A3, wherein converting each collected sample data into sample data in the specified format includes:
For each sample data,
Extracting from the sample data the fields that satisfy a preset condition;
For each extracted field, extracting the value of the field from the sample data, and forming, from the field and its value, the two-dimensional feature corresponding to the field;
Taking the feature set composed of the two-dimensional features corresponding to the extracted fields as the sample data in the specified format obtained by converting the sample data.
A5. The method as described in A4, wherein converting the search word into a query condition in the specified format includes:
For each search word, extracting from the search word the field that satisfies the preset condition, extracting the value of the field from the search word, and taking the two-dimensional feature composed of the field and its value as the query condition in the specified format obtained by converting the search word.
A6. The method as described in A4 or A5, wherein,
The method further includes: establishing a feature field library, the feature field library including multiple feature fields;
Extracting from the sample data the fields that satisfy the preset condition includes: traversing the fields contained in the sample data according to the feature field library, and extracting the fields that hit the feature field library;
Extracting from the search word the field that satisfies the preset condition includes: traversing the fields contained in the search word according to the feature field library, and extracting the field that hits the feature field library.
A7. The method as described in A6, wherein the feature field library includes one or more of the following feature fields:
A field indicating that the data created a specified process, a field indicating that the data contains macro code, a field indicating that the data accessed a specified website, a field representing an email address, a field representing a domain name, a field representing an IP address, and a field representing a URL address.
A8. The method as described in A6, wherein the method further includes:
At every preset time interval, collecting feature fields again and adding them to the feature field library, so that the feature field library is updated;
After the feature field library has been updated, re-executing the operation of formatting each collected sample data.
A9. The method as described in A5, wherein looking up, in the sample database, the sample data that satisfies the query condition includes:
Traversing each sample data in the sample database;
For each sample data, traversing the two-dimensional features contained in the sample data, and, if there is a two-dimensional feature identical to the two-dimensional feature in the query condition, determining that the sample data satisfies the query condition.
A10. The method as described in A1, wherein,
The sample database includes the distributed file system of a distributed file processing framework.
A11. The method as described in A1, wherein,
Before the found sample data is returned to the client for display, the method further includes:
Obtaining the data format adapted to the client;
Converting the format of the found sample data into the data format adapted to the client;
Returning the found sample data to the client for display then includes: returning the found and format-converted sample data to the client for display.
The invention also discloses B12, a search device for sample data, wherein the device includes:
A sample data collection unit, adapted to collect sample data from each data source;
A sample data processing unit, adapted to format each sample data collected by the sample data collection unit, each formatted sample data being stored in a sample database;
A search interaction unit, adapted to receive a search word sent by a client, convert the search word into a query condition, and send it to a search query unit;
The search query unit, adapted to look up, in the sample database, the sample data that satisfies the query condition and return it to the search interaction unit;
The search interaction unit, adapted to return the sample data found by the search query unit to the client for display.
B13. The device as described in B12, wherein,
The sample data collection unit is adapted to use a crawler to crawl sample data from each data source; and/or to use a crawler to crawl logs from each data source, use a distributed file processing framework to parse the logs of each data source in batches, and obtain sample data from the logs of each data source.
B14. The device as described in B12, wherein,
The sample data processing unit is adapted to convert each collected sample data into sample data in a specified format;
The search interaction unit is adapted to convert the received search word into a query condition in the specified format.
B15. The device as described in B14, wherein,
The sample data processing unit is adapted to, for each sample data, extract from the sample data the fields that satisfy a preset condition; for each extracted field, extract the value of the field from the sample data and form, from the field and its value, the two-dimensional feature corresponding to the field; and take the feature set composed of the two-dimensional features corresponding to the extracted fields as the sample data in the specified format obtained by converting the sample data.
B16. The device as described in B14, wherein,
The search interaction unit is adapted to, for each search word, extract from the search word the field that satisfies the preset condition, extract the value of the field from the search word, and take the two-dimensional feature composed of the field and its value as the query condition in the specified format obtained by converting the search word.
B17. The device as described in B15 or B16, wherein the device further includes a feature field library establishing unit;
The feature field library establishing unit is adapted to establish a feature field library, the feature field library including multiple feature fields;
The sample data processing unit is adapted to traverse the fields contained in the sample data according to the feature field library and extract the fields that hit the feature field library;
The search interaction unit is adapted to traverse the fields contained in the search word according to the feature field library and extract the field that hits the feature field library.
B18. The device as described in B17, wherein the feature field library includes one or more of the following feature fields:
A field indicating that the data created a specified process, a field indicating that the data contains macro code, a field indicating that the data accessed a specified website, a field representing an email address, a field representing a domain name, a field representing an IP address, and a field representing a URL address.
B19. The device as described in B17, wherein,
The feature field library establishing unit is further adapted to, at every preset time interval, collect feature fields again and add them to the feature field library, so that the feature field library is updated;
The sample data processing unit is further adapted to, after the feature field library has been updated, re-execute the operation of formatting each collected sample data.
B20. The device as described in B16, wherein,
The search query unit is adapted to traverse each sample data in the sample database; for each sample data, it traverses the two-dimensional features contained in the sample data and, if there is a two-dimensional feature identical to the two-dimensional feature in the query condition, determines that the sample data satisfies the query condition.
B21. The device as described in B12, wherein the sample database includes the distributed file system of a distributed file processing framework.
B22. The device as described in B12, wherein,
The search interaction unit is further adapted to, before the found sample data is returned to the client for display, obtain the data format adapted to the client; convert the format of the found sample data into the data format adapted to the client; and return the found and format-converted sample data to the client for display.

Claims (10)

1. A search method for sample data, wherein the method includes:
Collecting sample data from each data source;
Formatting each collected sample data, and storing each formatted sample data in a sample database;
Receiving a search word sent by a client, and converting the search word into a query condition;
Looking up, in the sample database, the sample data that satisfies the query condition;
Returning the found sample data to the client for display.
2. The method of claim 1, wherein collecting sample data from each data source includes:
Using a crawler to crawl sample data from each data source;
And/or,
Using a crawler to crawl logs from each data source, using a distributed file processing framework to parse the logs of each data source in batches, and obtaining sample data from the logs of each data source.
3. The method of claim 1, wherein,
Formatting each collected sample data includes: converting each collected sample data into sample data in a specified format;
Converting the search word into a query condition includes: converting the search word into a query condition in the specified format.
4. The method of claim 3, wherein converting each collected sample data into sample data in the specified format includes:
For each sample data,
Extracting from the sample data the fields that satisfy a preset condition;
For each extracted field, extracting the value of the field from the sample data, and forming, from the field and its value, the two-dimensional feature corresponding to the field;
Taking the feature set composed of the two-dimensional features corresponding to the extracted fields as the sample data in the specified format obtained by converting the sample data.
5. The method of claim 4, wherein converting the search word into a query condition in the specified format includes:
For each search word, extracting from the search word the field that satisfies the preset condition, extracting the value of the field from the search word, and taking the two-dimensional feature composed of the field and its value as the query condition in the specified format obtained by converting the search word.
6. A search device for sample data, wherein the device includes:
A sample data collection unit, adapted to collect sample data from each data source;
A sample data processing unit, adapted to format each sample data collected by the sample data collection unit, each formatted sample data being stored in a sample database;
A search interaction unit, adapted to receive a search word sent by a client, convert the search word into a query condition, and send it to a search query unit;
The search query unit, adapted to look up, in the sample database, the sample data that satisfies the query condition and return it to the search interaction unit;
The search interaction unit, adapted to return the sample data found by the search query unit to the client for display.
7. The device of claim 6, wherein,
The sample data collection unit is adapted to use a crawler to crawl sample data from each data source; and/or to use a crawler to crawl logs from each data source, use a distributed file processing framework to parse the logs of each data source in batches, and obtain sample data from the logs of each data source.
8. The device of claim 6, wherein,
The sample data processing unit is adapted to convert each collected sample data into sample data in a specified format;
The search interaction unit is adapted to convert the received search word into a query condition in the specified format.
9. The device of claim 8, wherein,
The sample data processing unit is adapted to, for each sample data, extract from the sample data the fields that satisfy a preset condition; for each extracted field, extract the value of the field from the sample data and form, from the field and its value, the two-dimensional feature corresponding to the field; and take the feature set composed of the two-dimensional features corresponding to the extracted fields as the sample data in the specified format obtained by converting the sample data.
10. The device of claim 8, wherein,
The search interaction unit is adapted to, for each search word, extract from the search word the field that satisfies the preset condition, extract the value of the field from the search word, and take the two-dimensional feature composed of the field and its value as the query condition in the specified format obtained by converting the search word.
CN201610499925.4A 2016-06-29 2016-06-29 The searching method of a kind of sample data and device Pending CN106126692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610499925.4A CN106126692A (en) 2016-06-29 2016-06-29 The searching method of a kind of sample data and device

Publications (1)

Publication Number Publication Date
CN106126692A true CN106126692A (en) 2016-11-16

Family

ID=57284611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610499925.4A Pending CN106126692A (en) 2016-06-29 2016-06-29 The searching method of a kind of sample data and device

Country Status (1)

Country Link
CN (1) CN106126692A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569329A (en) * 2019-10-28 2019-12-13 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN111177133A (en) * 2019-12-24 2020-05-19 集奥聚合(北京)人工智能科技有限公司 Processing insertion method for multivariate data
CN113987324A (en) * 2021-10-21 2022-01-28 北京达佳互联信息技术有限公司 Data processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104298690A (en) * 2013-07-19 2015-01-21 国际商业机器公司 Method and device for building index structure for relational database table and method and device for conducting inquiring
CN104714946A (en) * 2013-12-11 2015-06-17 田鹏 Large-scale Web log analysis system based on NoSQL
CN104750810A (en) * 2015-03-30 2015-07-01 浪潮集团有限公司 Data querying and processing method based on configuration
CN105187607A (en) * 2014-06-12 2015-12-23 宇龙计算机通信科技(深圳)有限公司 Message processing method and system
CN105589936A (en) * 2015-12-11 2016-05-18 航天恒星科技有限公司 Data query method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569329A (en) * 2019-10-28 2019-12-13 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
WO2021082463A1 (en) * 2019-10-28 2021-05-06 深圳市商汤科技有限公司 Data processing method and apparatus, electronic device and storage medium
TWI755890B (en) * 2019-10-28 2022-02-21 大陸商深圳市商湯科技有限公司 Data processing method, electronic device and computer-readable storage medium
CN110569329B (en) * 2019-10-28 2022-08-02 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN111177133A (en) * 2019-12-24 2020-05-19 集奥聚合(北京)人工智能科技有限公司 Processing insertion method for multivariate data
CN113987324A (en) * 2021-10-21 2022-01-28 北京达佳互联信息技术有限公司 Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Applicant after: Beijing Qihu Technology Co., Ltd.

Applicant after: Qianxin Technology Group Co., Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Applicant before: Beijing Qihu Technology Co., Ltd.

Applicant before: BEIJING QI'ANXIN SCIENCE & TECHNOLOGY CO., LTD.

RJ01 Rejection of invention patent application after publication

Application publication date: 20161116
