CN106844553B

CN106844553B - Data detection and expansion method and device based on sample data

Info

Publication number: CN106844553B
Application number: CN201611264829.8A
Authority: CN
Inventors: 汤奇峰; 李炳辉
Original assignee: Zamplus Advertising Shanghai Co ltd
Current assignee: Zamplus Advertising Shanghai Co ltd
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2020-05-01
Anticipated expiration: 2036-12-30
Also published as: CN106844553A

Abstract

A data detection and expansion method and device based on sample data, the method includes the following steps: determining the sample data based on at least one piece of data in a database, wherein the database stores a plurality of pieces of data acquired by detecting mass data; searching in the mass data based on the sample data to obtain matched data matched with the sample data in the mass data; processing the matching data to obtain a matching rule, and updating a fingerprint database, wherein the matching rule obtained historically is stored in the fingerprint database; and performing matching extraction in the mass data based on the updated fingerprint database to obtain data matched with the matching rule in the updated fingerprint database in the mass data, and expanding the data obtained by matching to the database. The technical scheme provided by the invention can more accurately and efficiently carry out global and systematic analysis and processing on the mass data.

Description

Data detection and expansion method and device based on sample data

Technical Field

The invention relates to the technical field of internet, in particular to a data detection and expansion method and device based on sample data.

Background

With the rapid development of internet technology, the number of internet websites and of internet users in China has grown rapidly. As the user base grows and internet resources become richer, the access log data generated on the internet has expanded into mass data. How to detect, discover and expand required data from this mass data has therefore become an important task for information processors.

At present, required data is discovered and expanded from mass data mainly in two ways. The first is manual checking: the Uniform Resource Locators (URLs) that users access on each website or application (APP, for example, application software installed on a mobile phone) are manually analyzed and summarized to obtain a series of matching rules, and the matching rules are then matched against the mass data resources on the internet to extract and expand the required data. The second is Application Programming Interface (API) query: the interface of the other party is called as needed, following the documentation of the API provider, to obtain the required data.

Although these two methods can satisfy the user's desire to find and expand a specific type of data from a large amount of data to some extent, they have respective unavoidable drawbacks. For the manual data checking mode, a large amount of manpower is needed to manually perform related analysis and statistics in actual operation, and the detection and expansion efficiency is low; the API query mode depends on the document description provided by the API provider and has uncertainty.

On the other hand, the existing data discovery and expansion methods including the two methods finally obtain data on certain specific websites. However, due to the rapid expansion of the website scale in the internet and the fact that the construction modes of many websites and APPs for URLs do not establish uniform standards and rules, data acquired by the existing method is only a small part of mass data, which is not beneficial for users to perform global and systematic analysis and processing on the mass data, and affects the accuracy of data acquired by the users through detection and expansion.

Disclosure of Invention

The invention solves the technical problem that the prior art can not carry out global and systematic analysis and processing on mass data in a more accurate and efficient mode.

To solve the above technical problem, an embodiment of the present invention provides a data detection and expansion method based on sample data, including the following steps: determining the sample data based on at least one piece of data in a database, wherein the database stores a plurality of pieces of data acquired by detecting mass data; searching in the mass data based on the sample data to obtain matched data matched with the sample data in the mass data; processing the matching data to obtain a matching rule, and updating a fingerprint database, wherein the matching rule obtained historically is stored in the fingerprint database; and performing matching extraction in the mass data based on the updated fingerprint database to obtain data matched with the matching rule in the updated fingerprint database in the mass data, and expanding the data obtained by matching to the database.

Optionally, determining the sample data based on at least one piece of data in the database includes the following step: selecting a preset amount of data from the database, and taking the characteristic information of the preset amount of data as the sample data.

Optionally, the feature information includes: the feature identification codes of the preset amount of data; or a regular expression determined according to the preset amount of data.

Optionally, searching in the mass data based on the sample data to obtain matching data in the mass data that matches the sample data includes the following step: searching the mass data for data having the same characteristic information as the sample data, and taking the data having the same characteristic information as the matching data.

Optionally, when searching in the mass data based on the sample data, if a preset limiting condition exists, searching in a part of data defined by the preset limiting condition in the mass data to obtain the matching data.

Optionally, the processing the matching data to obtain the matching rule, and updating the fingerprint database includes the following steps: carrying out structuralization processing on the matched data to obtain standard data arranged according to a preset format; generating the matching rule based on the standard data and removing duplication; and updating the fingerprint database based on the matching rule after the duplication removal.

Optionally, generating the matching rule based on the standard data and removing duplication includes the following steps: converting the standard data into the matching rule according to the preset format; and removing repeated items in the matching rule obtained by conversion to obtain the matching rule after duplication removal.

Optionally, updating the fingerprint database based on the de-duplicated matching rule includes the following steps: comparing the de-duplicated matching rules with the matching rules in the fingerprint database to remove duplicate items for the second time; and updating the matching rules remaining after the second removal of duplicate items to the fingerprint database.
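The structuring, rule generation and two-stage de-duplication described above can be sketched as follows. This is a minimal illustration only: the preset format (a host, path, query tuple), the `{id}` placeholder, and the function names are assumptions made for the example and are not specified by the source.

```python
from urllib.parse import urlsplit

def structure(url):
    """Arrange a raw URL into an assumed preset format: (host, path, query)."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path, parts.query)

def to_rule(record, feature_code):
    """Convert a structured record into a rule by replacing the feature
    identification code with a hypothetical '{id}' placeholder."""
    return tuple(field.replace(feature_code, "{id}") for field in record)

def update_fingerprint_library(library, matches):
    """matches: iterable of (url, feature_code) pairs found in the mass data.
    Returns the rules that were actually new to the library."""
    # First de-duplication: a set collapses identical newly generated rules.
    new_rules = {to_rule(structure(url), code) for url, code in matches}
    # Second de-duplication: drop rules already stored in the library.
    fresh = new_rules - library
    library |= fresh  # update the library in place
    return fresh
```

Here two matches on `/item/101` and `/item/202` of the same host collapse into one rule, and a rule already present in the library is not added again.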

Optionally, the data is an internet access record.

An embodiment of the present invention further provides a data detection and expansion device based on sample data, including: the determining module is used for determining the sample data based on at least one piece of data in a database, and the database stores a plurality of pieces of data acquired by detecting mass data; the searching module is used for searching in the mass data based on the sample data so as to obtain matched data matched with the sample data in the mass data; the updating module is used for processing the matching data to obtain a matching rule and updating a fingerprint database, and the matching rule obtained historically is stored in the fingerprint database; and the extraction module is used for performing matching extraction on the mass data based on the updated fingerprint database to obtain data matched with the matching rule in the updated fingerprint database in the mass data, and expanding the data obtained by matching to the database.

Optionally, the determining module includes: a selection submodule, used for selecting a preset amount of data from the database and taking the characteristic information of the preset amount of data as the sample data.

Optionally, the searching module includes: a first searching submodule, used for searching the mass data for data having the same characteristic information as the sample data and taking the data having the same characteristic information as the matching data.

Optionally, the search module further includes a second search submodule, where the second search submodule is configured to search, when searching for the mass data based on the sample data, if a preset limiting condition exists, a part of data defined by the preset limiting condition in the mass data, so as to obtain the matching data.

Optionally, the update module includes: the processing submodule is used for carrying out structural processing on the matched data so as to obtain standard data arranged according to a preset format; the generation submodule is used for generating the matching rule based on the standard data and removing duplication; and the updating submodule is used for updating the fingerprint database based on the matching rule after the duplication is removed.

Optionally, the generating sub-module includes: the conversion unit is used for converting the standard data into the matching rule according to the preset format; and the duplication removing unit is used for removing repeated items in the converted matching rule to obtain the de-duplicated matching rule.

Optionally, the update sub-module includes: the comparison unit is used for comparing the matching rule after the duplication removal with the matching rule in the fingerprint database so as to remove repeated items for the second time; and the updating unit is used for updating the matching rule after the repeated items are removed twice to the fingerprint database.

Optionally, the data is an internet access record.

Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:

Firstly, sample data is determined according to at least one piece of data in a database, and the mass data is searched based on the sample data to detect matching data that matches the sample data. The matching data is then processed to obtain a matching rule, with which a fingerprint database is updated. Finally, matching extraction is performed on the mass data based on the updated fingerprint database to obtain the data in the mass data that matches the matching rules in the updated fingerprint database, and the data obtained by matching is expanded to the database, thereby realizing data detection and expansion based on the sample data. Compared with existing data discovery and expansion schemes, which rely mainly on manual checking or API query, the technical scheme of the embodiment of the invention generates the matching rule based on the sample data, performs matching extraction on the original data source (namely the mass data) according to the matching rule to expand the database, then determines sample data from the expanded database and repeats the steps, finally forming a closed-loop flow. With the technical scheme provided by the invention, global and systematic analysis and processing of mass data can be carried out more accurately and efficiently.

Further, a preset amount of data is selected from the database, the characteristic information of the preset amount of data is used as the sample data, the sample data is used as a template to be detected in the mass data, so that the data matched with the sample data is obtained to expand the database, the data stored in the database are ensured to be the data with the same characteristic information, and the use requirement of a user for finding and collecting specific types of data from the mass data is met.

Drawings

FIG. 1 is a flow chart of a data detection and expansion method based on sample data according to a first embodiment of the present invention;

FIG. 2 is a flow chart of a data detection and expansion method based on sample data according to a second embodiment of the present invention;

FIG. 3 is a flow chart of a data detection and expansion method based on sample data according to a third embodiment of the present invention;

FIG. 4 is a schematic diagram of a character matching tree constructed by the data detection and expansion method based on sample data according to the embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a data detection and expansion apparatus based on sample data according to a fourth embodiment of the present invention.

Detailed Description

As mentioned in the background, the existing methods for discovering and expanding data required by users from mass data are still limited to two ways of manual retrieval or API query. However, the former method needs a lot of manpower to manually analyze and count the data; the latter is not adaptable to global analysis and processing of data.

In order to solve the technical problem, according to the technical scheme, sample data is determined according to at least one piece of data in a database, the sample data is searched in mass data based on the sample data, matched data matched with the sample data is obtained by detection from the mass data, then the matched data is processed to obtain a matching rule, so that a fingerprint database is updated, finally, matching extraction is carried out on the mass data based on the updated fingerprint database, so that data matched with the matching rule in the updated fingerprint database in the mass data is obtained, the data obtained by matching is expanded to the database, and data detection and expansion based on the sample data are achieved.

Those skilled in the art understand that, with the growth in the number of internet users, the proliferation of internet sites and the rapid increase in internet bandwidth, more and more users generate more and more internet user behavior (i.e., internet access records) on more and more sites. These behaviors are recorded in log form by various data collectors and stored as data (namely, the mass data). The technical scheme of the embodiment of the invention generates the matching rule based on the sample data, performs matching extraction on the original data source (namely the mass data) according to the matching rule to expand the database, then determines sample data from the expanded database and repeats the steps, finally forming a closed-loop flow. With the technical scheme provided by the invention, global and systematic analysis and processing of mass data can be carried out more accurately and efficiently.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

FIG. 1 is a flowchart of a data detection and expansion method based on sample data according to a first embodiment of the present invention. The data may be an internet access record.

Specifically, in this embodiment, step S101 is first executed to determine the sample data based on at least one piece of data in a database, where a plurality of pieces of data obtained by probing the mass data are stored in the database. More specifically, the mass data may be historical data obtained from the internet, such as historical internet access records of all users, or internet access records of selected users during selected periods. In a preferred embodiment, the amount of sample data may be set according to the data processing capability of the hardware or software implementing the embodiment of the present invention; for example, the amount of sample data may be between 10,000 and 100,000. Preferably, the data may be represented as a Uniform Resource Locator (URL), or as one or more of a referrer URL (refer URL), a user agent, a cookie, and the like. Those skilled in the art may also devise further variations according to actual needs, which are not described herein again.

Next, step S102 is executed to search the mass data based on the sample data, so as to obtain matching data in the mass data that matches the sample data. Specifically, matching may mean that the matching data and the sample data follow the same rule. Preferably, this step may be performed simultaneously or sequentially on at least one device cluster, where a device cluster may consist of one or more coupled computers. In a preferred embodiment, the mass data may be distributed to the computers of a plurality of clusters for processing, and the matching data found by the computers in each cluster is then aggregated. For example, the distribution and aggregation of the mass data may be implemented by a MapReduce task based on the Hadoop distributed system infrastructure.
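The distribute-then-aggregate idea of step S102 can be illustrated with a plain-Python stand-in. The sketch below does not use Hadoop or any MapReduce framework; the partitioning scheme, the function names, and the simple substring test are assumptions chosen only to show the map and reduce roles.

```python
from itertools import chain

def map_partition(partition, sample_codes):
    """Map step: keep the records of one partition that contain any sample code."""
    return [rec for rec in partition if any(code in rec for code in sample_codes)]

def reduce_matches(per_partition_results):
    """Reduce step: merge the per-partition matches into a single list."""
    return list(chain.from_iterable(per_partition_results))

def distributed_search(mass_data, sample_codes, n_partitions=3):
    # Split the mass data into partitions, as if each went to a different machine.
    partitions = [mass_data[i::n_partitions] for i in range(n_partitions)]
    # Each partition is searched independently, then the results are aggregated.
    return reduce_matches(map_partition(p, sample_codes) for p in partitions)
```

In a real deployment each partition would be processed on a separate node; here the partitions are processed one after another in a single process.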

Step S103 is executed next: the matching data is processed to obtain a matching rule, and a fingerprint library is updated, where the fingerprint library stores the matching rules obtained historically. Specifically, the matching rule describes a rule that the sample data and the matching data have in common. More specifically, the fingerprint library stores the matching rules extracted from the matching data in previous executions of the technical scheme of the embodiment of the invention. Those skilled in the art understand that continuously enriching the fingerprint library benefits subsequent iterations, so that the technical scheme of the embodiment of the invention can match and obtain more data from the mass data based on the updated fingerprint library.

And finally, executing step S104, performing matching extraction in the mass data based on the updated fingerprint database to obtain data matched with the matching rule in the updated fingerprint database in the mass data, and expanding the data obtained by matching to the database. In a preferred embodiment, the mass data is processed item by item based on the updated fingerprint database, and the matching result item by item is sorted and recorded, so as to update the data obtained by matching to the database, thereby realizing effective expansion of the volume of the database.

In a variation of this embodiment, after the step S104 is executed, the step S101 may be executed again based on the expanded database, so as to generate more sample data based on the expanded database, further detect and obtain more matching data in the mass data, and finally further expand the database.

Thus, the scheme of the first embodiment is adopted, the matching rule is generated based on the sample data, the matching extraction is performed on the original data source (namely mass data) according to the matching rule so as to expand the database, then the sample data is determined from the expanded database, and the steps are repeated. By the technical scheme of the embodiment of the invention, a closed-loop iterative processing mechanism can be formed, which is beneficial to more accurately and efficiently analyzing and processing the mass data globally and systematically by a user.
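The closed loop of steps S101 to S104 can be sketched on toy data as follows. Everything concrete here is an assumption for illustration: records are plain strings, the sample data is a set of numeric feature codes, and a rule is a regular expression derived by generalizing the code inside a matched record.

```python
import re

def derive_rule(record, code):
    """Turn a matched record into a regex rule by generalizing the feature code."""
    return re.escape(record).replace(re.escape(code), r"(\d+)")

def run_closed_loop(database, mass_data, max_rounds=10):
    fingerprint_library = set()
    for _ in range(max_rounds):
        sample = set(database)                                    # S101: pick sample data
        matched = [(r, c) for r in mass_data
                   for c in sample if c in r]                     # S102: search mass data
        fingerprint_library |= {derive_rule(r, c)
                                for r, c in matched}              # S103: update library
        found = set()
        for record in mass_data:                                  # S104: match and extract
            for rule in fingerprint_library:
                m = re.fullmatch(rule, record)
                if m:
                    found.add(m.group(1))
        if found <= database:   # nothing new was discovered: the loop has converged
            break
        database |= found       # expand the database and iterate again
    return database, fingerprint_library

mass_data = [
    "a.example/item/101",
    "a.example/item/202",
    "b.example/p?id=202",
    "b.example/p?id=303",
    "c.example/about",
]
db, rules = run_closed_loop({"101"}, mass_data)
```

Starting from the single code 101, the first round learns the rule for `a.example` and discovers 202; the second round uses 202 to learn the rule for `b.example` and discovers 303; the third round finds nothing new and stops.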

Fig. 2 is a flowchart of a data detection and expansion method based on sample data according to a second embodiment of the present invention. Specifically, in this embodiment, step S201 is first executed to select a preset number of data from the database, and use the feature information of the preset number of data as the sample data. More specifically, the preset number is determined by a user according to a data processing capability of hardware or software that performs the embodiment of the present invention. Preferably, the feature information may be a feature identification code of the preset amount of data. For example, when the data is URL information of a commodity, the feature identification code may be an identification code (ID) of the commodity, and the identification code may be extracted from the URL information corresponding to the commodity.

Next, step S202 is performed to search the mass data for data having the same characteristic information as the sample data, and to take the data having the same characteristic information as the matching data. Preferably, for mass data that is also represented by URLs, the URL of each piece of mass data may be divided by structure into three matching locations (host, path, and query); one, two, or all of the matching locations are selected, and the selected matching locations are compared with the sample data, so as to search the mass data for data having the same characteristic information as the sample data. Preferably, for sample data using the feature identification code as the feature information, data having the same feature information as the sample data can be searched for in the mass data according to different matching rules.

In a preferred example, the host locations in the URLs of the mass data may be matched, and a left-side containing match may be used to search for mass data whose host location has the same feature identification code as the sample data. Preferably, a left-side containing match means that the left side of the character string at the position to be matched (i.e., the host position in the foregoing preferred example) completely matches the feature identification code of the sample data. For example, if the host location of the URL of a certain piece of data in the mass data includes the character string item_44123_abcde, the character string may be considered to completely match the sample data represented by the feature identification code item_44123, so that the piece of data and the sample data are determined to have the same feature information.
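A minimal sketch of splitting a URL into the three matching locations and applying a left-side containing match to the host is given below. It uses Python's standard `urllib.parse.urlsplit`; the function names and the `example.com` URL are hypothetical and serve only as an illustration.

```python
from urllib.parse import urlsplit

def matching_locations(url):
    """Divide a URL into the three matching locations: host, path, and query."""
    parts = urlsplit(url)
    return {"host": parts.netloc, "path": parts.path, "query": parts.query}

def left_match(url, feature_code, location="host"):
    """Left-side containing match: the selected location starts with the code."""
    return matching_locations(url)[location].startswith(feature_code)

url = "http://item_44123_abcde.example.com/list?page=1"
print(matching_locations(url))
print(left_match(url, "item_44123"))
```

The same helper can be pointed at the path or query location by changing the `location` argument.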

Step S203 is executed next, the matching data is processed to obtain matching rules, and a fingerprint library is updated, the fingerprint library storing the matching rules obtained in history. Specifically, a person skilled in the art may refer to step S103 in the embodiment shown in fig. 1, which is not described herein again. Preferably, the matching rule is used for filtering and extracting common features of a plurality of matching data.

Finally, step S204 is executed: matching extraction is performed on the mass data based on the updated fingerprint database to obtain the data in the mass data that matches the matching rules in the updated fingerprint database, and the data obtained by matching is expanded to the database. Specifically, a person skilled in the art may refer to step S104 in the embodiment shown in FIG. 1, which is not described herein again. In a preferred embodiment, all data included in the mass data is matched item by item according to the matching positions; for example, matching may be performed in the order host, path, query. Specifically, it is first determined whether the host portion of the URL of the data matches the host portion of a matching rule in the fingerprint library. If the two host portions do not match, the data is skipped and the next piece of data in the mass data is matched. If the two host portions match, the path portion of the data is then matched against the path portion of the matching rule, and when the two path portions also match, the query portion of the data is matched against the query portion of the matching rule, so as to finally determine whether the data matches the matching rule in the updated fingerprint library.

Further, if the data are determined to meet the matching condition of the matching rule, extracting a part matched with the matching rule from the data and updating the part to the database.

Further, the mass data are matched one by one based on the updated fingerprint database to determine the data matched with the matching rule in the updated fingerprint database in the mass data, and the data content of the matched part is extracted and arranged to the database, so that the volume of the database is greatly expanded.
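The host, path, query matching order with early skipping, followed by extraction of the matched part, might look like the sketch below. The rule layout (an exact host plus regular expressions for path and query) is an assumption for the example; the source does not fix a rule format.

```python
import re
from urllib.parse import urlsplit

def match_rule(url, rule):
    """Match host, then path, then query; return the extracted part or None."""
    parts = urlsplit(url)
    if parts.netloc != rule["host"]:
        return None                      # host mismatch: skip this record at once
    path_m = re.fullmatch(rule["path"], parts.path)
    if path_m is None:
        return None                      # path mismatch
    query_m = re.fullmatch(rule["query"], parts.query)
    if query_m is None:
        return None                      # query mismatch
    # Extract the first captured group found in the path or the query.
    for m in (path_m, query_m):
        if m.groups():
            return m.group(1)
    return ""                            # matched, but the rule captures nothing

def expand_database(database, mass_data, fingerprint_library):
    """Scan the mass data item by item and add every extracted part to the database."""
    for url in mass_data:
        for rule in fingerprint_library:
            part = match_rule(url, rule)
            if part:
                database.add(part)
    return database

# Hypothetical rule: item pages of one shop, with any query string allowed.
rule = {"host": "shop.example.com", "path": r"/item/(\d+)", "query": r".*"}
```

Checking the host first is the cheapest test and lets most non-matching records be discarded before any regular expression is evaluated.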

Further, in the implementation of the embodiment of the present invention, the dirty data that may be obtained during the detection and expansion may be screened in combination with manual and/or automatic computer identification to ensure the validity and accuracy of the data that is finally updated into the database.

In a variation of step S201, the feature information may also be a regular expression determined according to the preset amount of data. Those skilled in the art will appreciate that the regular expression may be used to match the characteristic information of all data randomly selected from the database, or the regular expression may be used to match the characteristic information of all data that a user wishes to detect and augment from the mass data.

For example, suppose it is desired to detect and expand the mass data using device feature identification codes as sample data, and the sample data randomly selected from the database includes a device feature identification code of a telecommunication device and a device feature identification code of a mobile device. The device feature identification code of the telecommunication device is represented based on an International Mobile Equipment Identity (IMEI), and that of the mobile device is represented based on a Mobile Equipment Identity (MEID). The two device feature identification codes have in common that both are 11-digit numbers beginning with 1, so the regular expression can be determined with reference to this common point.

Also for example, if all data randomly selected from the database is Media Access Control (MAC) addresses, the regular expression can be expressed as "/^([a-zA-Z0-9]{8}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{12})$/".
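As an illustration, a pattern with the 8-4-4-4-12 grouping quoted above can be compiled and applied with Python's `re` module. The helper name and the test strings are hypothetical.

```python
import re

# Five hyphen-separated alphanumeric groups of lengths 8, 4, 4, 4 and 12.
FEATURE_PATTERN = re.compile(
    r"^([a-zA-Z0-9]{8}-[a-zA-Z0-9]{4}-[a-zA-Z0-9]{4}-[a-zA-Z0-9]{4}-[a-zA-Z0-9]{12})$"
)

def has_feature(value):
    """Return True when the whole value conforms to the pattern."""
    return FEATURE_PATTERN.match(value) is not None
```

Because the pattern is anchored with `^` and `$`, a value with extra leading or trailing characters does not match.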

For another example, if the user wishes to detect and expand the mass data to obtain data in a specific geographic area, the regular expression may also be used to represent the specific geographic area by defining longitude and latitude.

Further, searching the mass data for data having the same characteristic information as the sample data according to different matching rules includes directly performing regular-expression matching between the portion to be matched of the data (i.e., the selected matching position) and the regular expression of the sample data. If the portion to be matched satisfies the regular expression, it is determined that the data and the sample data have the same characteristic information. For example, the regular expression of the sample data may be shop-(\d+); for a piece of data whose portion to be matched in the URL is shop-33415-23-test, it may be determined that the data and the sample data have the same characteristic information, because the portion to be matched conforms to the regular expression.
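The regular-expression example can be reproduced directly in Python. The sketch below assumes the pattern is applied with `re.search`, so that it may match anywhere in the portion to be matched; the helper name is hypothetical.

```python
import re

SAMPLE_PATTERN = re.compile(r"shop-(\d+)")

def extract_feature(portion):
    """Return the captured digits if the portion conforms to the pattern, else None."""
    m = SAMPLE_PATTERN.search(portion)
    return m.group(1) if m else None

print(extract_feature("shop-33415-23-test"))  # prints 33415
```

The captured group is the part that would be extracted and written to the database when a record matches.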

In a variation of step S202, the matching rule further includes a right-side containing match: if the right side of the character string at the position to be matched of a certain piece of data in the mass data completely matches the feature identification code of the sample data, it is determined that the data and the sample data have the same feature information. For example, if the path position of the URL of a certain piece of data in the mass data includes the character string car_shanghai_ser33456 and the feature identification code of the sample data is ser33456, it may be determined that the piece of data and the sample data have the same feature information.

In another variation of step S202, the matching rule further includes a complete-equality match: if the character string at the position to be matched of a certain piece of data in the mass data is completely equal to the feature identification code of the sample data, it is determined that the data and the sample data have the same feature information. For example, the character string shop=33415&category=23&item=test may be considered to contain a value that is completely equal to the feature identification code 33415.

In another variation of step S202, the matching rule further includes a containing match: if the character string at the position to be matched of a certain piece of data in the mass data includes the feature identification code of the sample data, it is determined that the data and the sample data have the same feature information. For example, the string shop-33415-23-test may be considered to contain the feature identification code 33415.

In a variation of step S204, when the feature information is a regular expression determined according to the preset amount of data, and the currently scanned data in the mass data has the same feature information as the sample data, the regular expression may be directly extracted from the currently scanned data, and the regular expression is updated to the fingerprint library.
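The four ways of comparing a string at the position to be matched with a feature identification code (left-side, right-side, complete equality, containing) can be summarized in one small dispatcher. The enum and function names are assumptions for this sketch.

```python
from enum import Enum

class MatchMode(Enum):
    LEFT = "left"          # the string starts with the code
    RIGHT = "right"        # the string ends with the code
    EQUAL = "equal"        # the string is exactly the code
    CONTAINS = "contains"  # the code occurs anywhere in the string

def has_same_feature(text, code, mode):
    """Compare the string at the position to be matched with a feature code."""
    if mode is MatchMode.LEFT:
        return text.startswith(code)
    if mode is MatchMode.RIGHT:
        return text.endswith(code)
    if mode is MatchMode.EQUAL:
        return text == code
    if mode is MatchMode.CONTAINS:
        return code in text
    raise ValueError(f"unknown match mode: {mode}")
```

Left, right and equality matches are stricter than a containing match, so they produce fewer false positives at the cost of missing codes embedded in the middle of a string.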

In a variation of this embodiment, when searching for the mass data based on the sample data, if a preset limiting condition exists, the step S202 searches for a partial data defined by the preset limiting condition in the mass data to obtain the matching data. Preferably, for data and sample data represented by a URL, the preset restriction condition may be a top-level domain name tld in the URL. For example, the user may select to define the top-level domain name tld for some or all of the sample data selected and determined in the step S201, and then the technical solution of the embodiment of the present invention preferably only detects and expands the data on the website where the top-level domain name tld is located to the database for the sample data defined by the top-level domain name tld when the step S202 to the step S204 are executed.

Further, the preset limiting condition may be set according to a user requirement or a data processing capability of a device that executes the technical solution of the embodiment of the present invention.

Further, the top-level domain names tld of the sample data may be the same or different. For example, among all the sample data selected and determined from the database, the top-level domain name tld of one half and that of the other half may be set to different websites, so that data detection and retrieval are performed on two websites simultaneously based on the technical solution of the embodiment of the present invention.

In a typical application scenario, when a computer executes the technical solution of the embodiment of the present invention, first, the sample data is loaded into a local memory of the computer, and when part or all of the data in the sample data has the preset top-level domain name tld, a mapping table may be constructed in the local memory, where the mapping table is used to classify and store feature information or regular expressions of one or more sample data having the same top-level domain name tld in the sample data.

Preferably, for an application scenario in which the feature information is the feature identification code, a character matching tree may be constructed for one or more sample data with the same top-level domain name tld, so as to improve matching efficiency in subsequent data detection and expansion.
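One possible in-memory layout for the mapping table and the character matching tree is sketched below: a dictionary keyed by top-level domain name tld, whose values are simple tries holding the feature identification codes. The class and function names are hypothetical, and the trie shown here supports only left-side (prefix) lookups; the tree in FIG. 4 may differ.

```python
from collections import defaultdict

class CharTree:
    """A minimal character matching tree (trie) over feature identification codes."""

    def __init__(self):
        self.children = {}
        self.terminal = False  # True when a stored code ends at this node

    def insert(self, code):
        node = self
        for ch in code:
            node = node.children.setdefault(ch, CharTree())
        node.terminal = True

    def prefix_of(self, text):
        """Return the shortest stored code that is a prefix of text, or None."""
        node = self
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                return None
            if node.terminal:
                return text[: i + 1]
        return None

def build_mapping_table(samples):
    """samples: iterable of (tld, feature_code); returns {tld: CharTree}."""
    table = defaultdict(CharTree)
    for tld, code in samples:
        table[tld].insert(code)
    return table

table = build_mapping_table([
    ("example.com", "item_44123"),
    ("example.com", "item_5"),
    ("example.org", "ser33456"),
])
```

With the tree, one pass over the characters of a candidate string checks it against all codes of a tld at once, instead of comparing it with each code in turn.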

Preferably, for the application scenario in which the characteristic information is the regular expression, the respective regular expressions of one or more sample data with the same top-level domain name tld may also be stored as a list, so as to perform the subsequent detecting and expanding steps. As a variation, the regular expression may also be determined for multiple sample data having the same top-level domain name tld.

Further, when the sample data and the mass data are both represented as URLs, step S202 preferably first processes the URLs of the sample data to obtain the top-level domain name tld corresponding to the sample data. Then, as the mass data is scanned one by one, it is determined whether the URL of the currently scanned data includes that top-level domain name tld. If it does not, the data is skipped directly; if it does, step S202 continues, and the URL of the data is compared with the feature information of the sample data at the selected matching position, so as to find in the mass data the data having the same feature information as the sample data.
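The pre-filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names `site_of`, `build_mapping_table` and `scan` are hypothetical, the "top-level domain name tld" is approximated naively as the last two labels of the host name, and the comparison "at the selected matching position" is simplified to a substring test.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Naive stand-in for the 'top-level domain name tld': last two host labels."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def build_mapping_table(samples):
    """Group feature identification codes by the tld they are restricted to."""
    table = defaultdict(list)
    for tld, feature in samples:
        table[tld].append(feature)
    return dict(table)


def scan(urls, table):
    """Yield (url, feature) for URLs whose tld is in the table and that contain a feature."""
    for url in urls:
        features = table.get(site_of(url))
        if features is None:
            continue  # tld is not among the sample tlds: skip without further matching
        for feature in features:
            if feature in url:  # simplification of position-aware matching
                yield url, feature
                break


samples = [("host.com", "item_1234"), ("host.com", "item_1368")]
table = build_mapping_table(samples)
urls = [
    "http://b.host.com/test?item_id=item_1368&a=c",
    "http://other.org/test?item_id=item_1368",
    "http://a.host.com/about.html",
]
matches = list(scan(urls, table))
```

Only the first URL survives: the second is rejected by the tld check before any feature comparison, and the third passes the tld check but contains no sample feature.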

Therefore, by adopting the scheme of the second embodiment, data having the same characteristic information as the sample data in the mass data can be detected according to the sample data, so that the data finally expanded into the database has the same characteristic information, and the actual use requirement of a user for finding and expanding the data of a specific type in the mass data is met.

Those skilled in the art understand that, in this embodiment, the step S201, the step S202, and corresponding variations can be understood as a specific implementation manner of the step S101 and the step S102 in the embodiment shown in fig. 1, and the matching workload when matching in the massive data is reduced by the preset limiting condition, and at the same time, the user is allowed to perform data detection and expansion for a specific website. Further, a user can select whether the preset limiting condition needs to be set according to actual requirements, wherein when the user does not set the preset limiting condition, all access records on the internet are used as the mass data to perform data detection and expansion (namely, whole-network search); when the user sets the preset limiting condition, the embodiment of the present invention uses the access records on one or more websites defined by the preset limiting condition as the mass data to obtain the data required by the user (i.e. specific website search).

As a variation, when the user selects the whole-network search, the embodiment of the present invention may first execute the technical solution once on each of a plurality of websites to obtain the matching rules of each website, integrate the matching rules of the plurality of websites into a universal match symbol, and then perform the whole-network search using the universal match symbol as the feature information of the sample data.

Fig. 3 is a flowchart of a data detection and expansion method based on sample data according to a third embodiment of the present invention. Specifically, in this embodiment, step S301 is first executed to select a preset number of data from the database, and use the feature information of the preset number of data as the sample data. More specifically, a person skilled in the art may refer to step S201 in the embodiment shown in fig. 2, which is not described herein again.

And then, step S302 is executed to search data having the same characteristic information as the sample data in the mass data, and use the data having the same characteristic information as the matching data. Specifically, a person skilled in the art may refer to step S202 in the embodiment shown in fig. 2, which is not described herein again.

Step S303 is executed next, and the matching data is structured to obtain standard data arranged according to a preset format. Specifically, the result of the structuring process may be represented in a table form, wherein the table records all or part of the content of the matching data by category. More specifically, the standard data may be a result obtained by arranging the contents in the table according to the preset format. In a preferred embodiment, the matching data is also represented in the form of a URL, the categories recorded in the table include a top-level domain name tld, a port (port), a matching parameter (querykey), a matching location, matching content, and a matching manner, and this step may be performed by splitting the URL of the matching data according to the categories recorded in the table, and then reordering and integrating the split results according to the preset format, where the result of reordering and integrating is the standard data.
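A minimal sketch of the structuring step, assuming the six categories named above. The `StandardData` type, the `structure` function, and the labels "equal" and "contains" for the matching manner are hypothetical names chosen for illustration, and the tld is again approximated as the last two host labels.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qsl, urlsplit


class StandardData(NamedTuple):
    tld: str
    port: Optional[int]       # None when the URL omits the port
    querykey: Optional[str]   # None unless the match is in the query
    location: str             # "host", "path" or "query"
    content: str
    manner: str               # "equal" or "contains" (hypothetical labels)


def structure(url: str, feature: str) -> Optional[StandardData]:
    """Split a matching URL into the categories of the standard data."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    tld = ".".join(host.split(".")[-2:])
    for key, value in parse_qsl(parts.query):
        if value == feature:
            return StandardData(tld, parts.port, key, "query", feature, "equal")
    if feature in parts.path:
        return StandardData(tld, parts.port, None, "path", feature, "contains")
    if feature in host:
        return StandardData(tld, parts.port, None, "host", feature, "contains")
    return None


in_query = structure("http://c.host.com:1234/test?id=item_1234", "item_1234")
in_path = structure("http://a.host.com:3567/item/item_1234/detail.html", "item_1234")
in_host = structure("http://item_1368.host.com/detai_info.html", "item_1368")
```

Because every record has the same fields in the same order, records produced from query, path and host matches can be compared and deduplicated uniformly in later steps.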

Then, the step S304 is performed, and the matching rule is generated and deduplicated based on the standard data. In a preferred embodiment, the standard data may be first converted into the matching rule according to the preset format, and then the repeated items in the converted matching rule are removed to obtain the duplicate-removed matching rule. Those skilled in the art understand that, through the processing in step S303, the standard data may include only key information required for performing subsequent matching work, and cannot be directly applied to the subsequent step, so that the standard data needs to be processed in this step, and is converted into the matching rule according to the preset format, so as to be used in the subsequent step; on the other hand, since the design of the URL of the same website generally has similarity, after all the matching rules are obtained by the conversion in this step, the duplicate removal processing may be performed on all the matching rules to remove the duplicate items in the matching rules obtained by the conversion in this step.

Step S305 is performed next, and the fingerprint database is updated based on the de-duplicated matching rules. Specifically, the updating comprises storing the de-duplicated matching rules in the fingerprint database. More specifically, the updating further includes removing, from the de-duplicated matching rules, those that repeat matching rules already in the fingerprint database. In a preferred embodiment, the de-duplicated matching rules are compared with the matching rules in the fingerprint database to remove duplicate items a second time, and the matching rules remaining after this second deduplication are then written to the fingerprint database.
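The two rounds of deduplication can be sketched as below. This is an illustrative sketch only: it assumes each matching rule is a hashable tuple and that the fingerprint database can be modeled as an in-memory set, and the function names are hypothetical.

```python
def dedupe(rules):
    """First round: remove repeated rules while keeping first-seen order."""
    return list(dict.fromkeys(rules))


def update_fingerprint_library(library: set, new_rules):
    """Second round: drop rules already in the library, then store the rest."""
    unique = dedupe(new_rules)
    fresh = [rule for rule in unique if rule not in library]
    library.update(fresh)
    return fresh


# Rule layout assumed here: (tld, port, querykey, location, content, manner)
library = {("host.com", None, "item_id", "query", r"(item_\d+)", "equal")}
converted = [
    ("host.com", None, "item_id", "query", r"(item_\d+)", "equal"),
    ("host.com", None, "item_id", "query", r"(item_\d+)", "equal"),
    ("host.com", 1234, "id", "query", r"(item_\d+)", "equal"),
]
added = update_fingerprint_library(library, converted)
```

Here only the rule keyed on `id` is new, so it is the only one written to the library.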

And finally, executing step S306, performing matching extraction in the mass data based on the updated fingerprint database to obtain data matched with the matching rule in the updated fingerprint database in the mass data, and expanding the data obtained by matching to the database. Specifically, a person skilled in the art may refer to step S104 in the embodiment shown in fig. 1, which is not described herein again.

Further, the matching rule may be understood as a combination of filtering and extracting data.

In a preferred application scenario, the top-level domain name tld and the matching parameter in the matching rule may be used to filter data. For example, when the step S305 is executed, it may first be preliminarily determined, based on the top-level domain name tld and the matching parameter, whether the currently scanned data in the mass data is worth further matching; if the top-level domain name tld of the currently scanned data does not match the one recorded in the matching rule, the currently scanned data may be rejected directly, which reduces the matching workload and improves the matching efficiency.

In another preferred application scenario, the matching manner, the matching position, and the matching content or regular expression in the matching rule may be used to extract data to finally determine whether the currently scanned data has the same characteristic information as the sample data.
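The combination of filtering and extraction can be sketched for a single rule as follows. The dictionary layout of the rule, the function name `apply_rule`, and the manner labels "equal" and "contains" are assumptions made for illustration; the tld is approximated as the last two host labels.

```python
import re
from urllib.parse import parse_qsl, urlsplit

QUERY_RULE = {
    "tld": "host.com",
    "querykey": "item_id",
    "location": "query",
    "pattern": r"(item_\d+)",
    "manner": "equal",
}
PATH_RULE = {
    "tld": "host.com",
    "querykey": None,
    "location": "path",
    "pattern": r"(item_\d+)",
    "manner": "contains",
}


def apply_rule(url, rule):
    """Return the extracted content, or None if the URL is filtered out or does not match."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Filtering: cheap checks on tld and matching parameter.
    if ".".join(host.split(".")[-2:]) != rule["tld"]:
        return None
    query = dict(parse_qsl(parts.query))
    if rule["querykey"] is not None and rule["querykey"] not in query:
        return None
    # Extraction: pick the text at the matching location, then apply the pattern.
    if rule["location"] == "query":
        text = query.get(rule["querykey"], "")
    elif rule["location"] == "path":
        text = parts.path
    else:
        text = host
    matcher = re.fullmatch if rule["manner"] == "equal" else re.search
    found = matcher(rule["pattern"], text)
    return found.group(1) if found else None


hit = apply_rule("http://b.host.com/test?item_id=item_1368&a=c", QUERY_RULE)
path_hit = apply_rule("http://a.host.com:3567/item/item_1234/detail.html", PATH_RULE)
```

Running the cheap tld and parameter checks first means the regular expression is evaluated only for URLs that could plausibly match.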

Further, the fingerprint database and the database may be stored in a computer executing the embodiment of the present invention, may also be stored in other storage devices coupled to the computer, or may also be stored in a cloud.

From the above, in the solution of the third embodiment, steps S303, S304 and S305 can be understood as a specific implementation of step S103 in the embodiment shown in fig. 1, or of step S203 in the embodiment shown in fig. 2. Through the structuring process, matching data obtained by matching in different ways acquire a highly uniform format, which facilitates subsequent processing. In addition, the deduplication in step S304 and the secondary deduplication in step S305 ensure that no duplicate matching rule appears in the fingerprint database, avoiding needless waste of storage resources.

In a typical application scenario, the data is an item sold on a website and is represented in the form of a URL. The database stores some of the commodities sold on the website, and the user wants to obtain information on the other commodities sold on the website. The user can adopt the technical scheme of the embodiment of the invention to randomly select a preset number of commodities from the commodities in the database and use the number of each selected commodity on the website as its characteristic identification code. Suppose the website is host.com (i.e., the preset limiting condition sets the top-level domain name tld to host.com) and the user selects 2 commodities in the database as the sample data. If the number of commodity A on the website is item_1234 and the number of commodity B on the website is item_1368, the sample data is item_1234 and item_1368.

When the technical scheme of the embodiment of the invention is executed to search the mass data based on the sample data, the sample data can first be loaded into the local memory of the computer executing the embodiment of the invention, and a dictionary is constructed. The key (key) of the dictionary is the top-level domain name tld of the sample data (in the present application scenario, host.com), and the value (value) of the dictionary is the character matching tree under that top-level domain name tld. Preferably, the character matching tree is constructed by splitting the character strings of all sample data into individual characters. In this application scenario, the character matching tree shown in fig. 4 can be constructed from the sample data item_1234 and item_1368.

The mass data is then scanned one by one and searched based on the character matching tree. First, it is determined whether the top-level domain name tld of the currently scanned data equals host.com: if not, the currently scanned data is skipped; if so, subsequent matching is performed on the currently scanned data.

For currently scanned data whose top-level domain name tld equals host.com, equal matching needs to be performed on the query portion of its URL (i.e. the matching location is the query, and the matching manner is equal matching). Taking the URL http://a.host.com/path/test.html?qk1=i123&qk2=item_1246&item_id=item_1234 as an example of the currently scanned data, the URL may first be split to obtain its query portion; the query portion is then split by the separators "&" and "=" to obtain a dictionary {"qk1": "i123", "qk2": "item_1246", "item_id": "item_1234"} represented as key-value pairs; the dictionary is then traversed, and each value in it is searched character by character on the character matching tree shown in fig. 4.
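The splitting of the query portion can be sketched as follows, using the dictionary values from the example above. The function name `split_query` is hypothetical; the splitting on "&" and "=" follows the description, while the use of `urllib.parse.urlsplit` to isolate the query portion is an implementation choice.

```python
from urllib.parse import urlsplit


def split_query(url: str) -> dict:
    """Split the query portion of a URL into key-value pairs on '&' and '='."""
    query = urlsplit(url).query
    return dict(pair.split("=", 1) for pair in query.split("&") if "=" in pair)


url = "http://a.host.com/path/test.html?qk1=i123&qk2=item_1246&item_id=item_1234"
pairs = split_query(url)
print(pairs)  # {'qk1': 'i123', 'qk2': 'item_1246', 'item_id': 'item_1234'}
```

For URLs with percent-encoded or repeated parameters, `urllib.parse.parse_qsl` from the standard library is a more robust alternative to manual splitting.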

For example, when the value i123 is matched, the first character i is matched successfully; matching then proceeds to the second character 1, but the child node list of character i in the character matching tree shown in fig. 4 contains only the character t and not the character 1, so the value i123 fails to match.

As another example, when the value item_1246 is matched, the first character i matches successfully; the second character t is in the child node list of character i in the character matching tree shown in fig. 4; the third character e is in the child node list of character t; and the following characters up to and including the character 1 are matched against the character matching tree shown in fig. 4 in the same way. Next, character 2 is matched: character 1 in the character matching tree has two child nodes [2,3], which include the character 2 to be matched, so matching continues with character 4 along the branch of character 2. However, the only child node of character 2 on that branch is the character 3, which is not the character 4 to be matched, so the value item_1246 also fails to match.

For another example, when the value item_1234 is matched, the foregoing matching steps determine that the value item_1234 matches the character matching tree shown in fig. 4 completely. It is therefore determined that the URL of the currently scanned data contains the sample data, and that the matching parameter is the commodity ID (item_id).
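The character-by-character matching walked through above can be sketched with a nested-dictionary tree. This is a minimal sketch: fig. 4 is not reproduced here, so the tree layout, the `"$end"` marker, and the function names `build_tree` and `match` are assumptions for illustration.

```python
def build_tree(samples):
    """Build a character matching tree: one nested dict level per character."""
    root = {}
    for sample in samples:
        node = root
        for ch in sample:
            node = node.setdefault(ch, {})
        node["$end"] = True  # marks the end of a complete sample
    return root


def match(tree, value):
    """Return (matched, index): index is the position of the first failing character."""
    node = tree
    for i, ch in enumerate(value):
        if ch not in node:
            return False, i
        node = node[ch]
    return "$end" in node, len(value)


# Dictionary keyed by tld, whose value is the tree for that tld.
trees = {"host.com": build_tree(["item_1234", "item_1368"])}
tree = trees["host.com"]

print(match(tree, "i123"))       # (False, 1): 'i' has only the child 't'
print(match(tree, "item_1246"))  # (False, 7): '2' has only the child '3'
print(match(tree, "item_1234"))  # (True, 9): complete match
```

Each lookup stops at the first character that has no corresponding child node, so values that diverge early from every sample are rejected after only a few comparisons.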

Table 1 Matching data list based on URL representation

http://a.host.com/path/test.html?qk1=i234&qk2=item_1246&item_id=item_1234
http://b.host.com/test?item_id=item_1368&a=c
http://c.host.com:1234/test?id=item_1234
http://item_1368.host.com/detai_info.html
http://a.host.com:3345/category-1234-item_1234-t12
http://a.host.com:3567/item/item_1234/detail.html

Continuing to scan the mass data may yield further matching data represented as URLs. The matching data may include the URLs shown in table 1 above.

Table 2 Standard data obtained by structuring the matching data of table 1

As shown in table 2, after the mass data has been scanned one by one based on the sample data, the matching data obtained by the search may be structured to obtain the standard data represented in the preset format. Preferably, the standard data is arranged in the order of top-level domain name tld, port (port), matching parameter (querykey), matching location, matching content, and matching manner, and any absent content is indicated by null. For example, when the port is the default value (i.e. 80), it may be omitted from the URL, and the port in the standard data is then also null. For another example, for matching data found by searching the mass data with the path as the matching location, the matching parameter of the resulting standard data is null.

Table 3 Matching rule list obtained by converting the standard data of table 2

The standard data listed in table 2 is converted into the matching rules according to the preset format, as shown in table 3, where (item_\d+) is a regular expression representing a character string that begins with item_ and is followed by one or more digits.
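One way the matching content could be generalized into such a regular expression is sketched below. The conversion rule used here (replace every run of digits with `\d+`) is an assumption for illustration, as is the function name `to_pattern`; the source states only the resulting expression, not how it is derived.

```python
import re


def to_pattern(content: str) -> str:
    """Generalize matching content by replacing each run of digits with \\d+."""
    escaped = re.escape(content)  # protect any regex metacharacters in the content
    return "(" + re.sub(r"\d+", lambda _: r"\d+", escaped) + ")"


pattern = to_pattern("item_1234")
print(pattern)  # (item_\d+)

# Both samples generalize to the same expression, so one rule remains after deduplication.
patterns = {to_pattern(s) for s in ["item_1234", "item_1368"]}
```

Because different commodity numbers collapse to one expression, rules derived from many matching URLs of the same website tend to repeat, which is why deduplication follows the conversion.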

Further, by comparing the matching rules and the matching parameters, the second row of the matching rules listed in table 3 may be removed as a duplicate. The remaining matching rules are then compared with the matching rules already in the fingerprint database, any matching rules in table 3 that repeat existing ones are removed, and the matching rules remaining after both rounds of deduplication are finally written to the fingerprint database.

Further, the updated fingerprint database is applied to the mass data again, and the mass data is rescanned in the order of host, path and query, so that the newly added matching rules can match the URLs of more commodities. The commodity URLs obtained by matching (or the parts of those URLs that satisfy the matching rules) are then added to the database, which is thereby expanded.

For example, the new matching rule for host.com, whose matching parameter is item_id, can match a URL on host.com whose query portion is item_id=test1&b=c, or the URL http://test1.host.com/path/subpath/subpath/a.html?q1=v1&q2=v2&item_id=11111; the two newly matched commodities are test1 and 11111.

Those skilled in the art understand that, by the technical solution of the embodiment of the present invention, based on the sample data item_1234 in the database, two further pieces of data, test1 and 11111, are finally added to the database. In practical applications, the technical scheme of the embodiment of the invention can discover a large amount of potential data in long-tail URLs, thereby greatly expanding the database and achieving deep mining of the mass data.

Fig. 5 is a schematic structural diagram of a data detection and expansion apparatus based on sample data according to a fourth embodiment of the present invention. Those skilled in the art understand that the data detection and expansion device 4 of the present embodiment is used to implement the method solutions in the embodiments shown in fig. 1 to fig. 4. Specifically, in this embodiment, the data detecting and expanding device 4 includes a determining module 41, configured to determine the sample data based on at least one piece of data in a database, where a plurality of pieces of data detected from mass data are stored; a searching module 42, configured to search the mass data based on the sample data to obtain matching data in the mass data, where the matching data matches the sample data; an updating module 43, configured to process the matching data to obtain a matching rule, and update a fingerprint database, where the matching rule obtained in history is stored in the fingerprint database; and an extraction module 44, configured to perform matching extraction on the mass data based on the updated fingerprint database to obtain data in the mass data that matches the matching rule in the updated fingerprint database, and expand the data obtained through matching to the database.

Further, the determining module 41 includes a selecting sub-module 411, configured to select a preset amount of data from the database, and use characteristic information of the preset amount of data as the sample data. Preferably, the feature information includes a feature identification code of the preset amount of data; or a regular expression determined according to the preset amount of data.

Further, the searching module 42 includes a first searching submodule 421, configured to search for data having the same characteristic information as the sample data in the massive data, and use the data having the same characteristic information as the matching data.

Further, the search module 42 further includes a second search submodule 422, where the second search submodule 422 is configured to, when searching for the mass data based on the sample data, if a preset limiting condition exists, search for a part of data defined by the preset limiting condition in the mass data to obtain the matching data.

Further, the updating module 43 includes a processing submodule 431, configured to perform structural processing on the matching data to obtain standard data arranged according to a preset format; a generating submodule 432, configured to generate the matching rule based on the standard data and perform deduplication; and an update submodule 433 for updating the fingerprint repository based on the de-duplicated matching rules.

Further, the generating sub-module 432 includes a converting unit 4321, configured to convert the standard data into the matching rule according to the preset format; and a duplicate removal unit 4322, configured to remove duplicate entries in the matching rule obtained through the conversion, to obtain the duplicate-removed matching rule.

Further, the update sub-module 433 includes a comparing unit 4331, configured to compare the de-duplicated matching rules with the matching rules in the fingerprint database to remove duplicate items a second time; and an updating unit 4332, configured to write the matching rules remaining after the second removal of duplicate items to the fingerprint database.

Preferably, the data is an internet access record.

For more details of the working principle and working mode of the data detection and expansion device 4, refer to the related descriptions of fig. 1 to fig. 4, which are not repeated here.

Those skilled in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, which may include ROM, RAM, magnetic disks, optical disks, and the like.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A data detection and expansion method based on sample data is characterized by comprising the following steps:

determining the sample data based on at least one piece of data in a database, wherein the database stores a plurality of pieces of data acquired by detecting mass data;

searching in the mass data based on the sample data to obtain matched data matched with the sample data in the mass data;

processing the matching data to obtain a matching rule, and updating a fingerprint database, wherein the matching rule obtained historically is stored in the fingerprint database;

performing matching extraction in the mass data based on the updated fingerprint database to obtain data matched with the matching rule in the updated fingerprint database in the mass data, and expanding the data obtained by matching to the database;

and repeating the steps of determining sample data from the database to expanding the data obtained by matching to the database.

2. The method for data detection and expansion based on sample data according to claim 1, wherein the step of determining the sample data based on at least one piece of data in the database comprises the following steps:

and selecting a preset amount of data from the database, and taking the characteristic information of the preset amount of data as the sample data.

3. The method of claim 2, wherein the characteristic information comprises:

the feature identification codes of the preset amount of data; or

a regular expression determined according to the preset amount of data.

4. The method for data detection and expansion based on sample data according to claim 2, wherein the step of searching in the mass data based on the sample data to obtain the matching data matching with the sample data in the mass data comprises the following steps:

and searching data with the same characteristic information as the sample data in the mass data, and taking the data with the same characteristic information as the matching data.

5. The method according to claim 4, wherein when searching in the mass data based on the sample data, if a preset constraint exists, searching in a part of the mass data defined by the preset constraint to obtain the matching data.

6. The method of claim 1, wherein the matching data is processed to obtain matching rules and update the fingerprint database, comprising the steps of:

carrying out structuralization processing on the matched data to obtain standard data arranged according to a preset format;

generating the matching rule based on the standard data and removing duplication;

and updating the fingerprint database based on the matching rule after the duplication removal.

7. The method of claim 6, wherein generating the matching rules based on the standard data and de-duplicating the matching rules comprises:

converting the standard data into the matching rule according to the preset format;

and removing repeated items in the matching rule obtained by conversion to obtain the matching rule after duplication removal.

8. The method of claim 6, wherein the step of updating the fingerprint database based on the de-duplicated matching rules comprises the steps of:

comparing the de-duplicated matching rules with the matching rules in the fingerprint database to remove duplicate items for the second time;

and updating the matching rule after the repeated items are removed twice to the fingerprint database.

9. The method of any one of claims 1 to 8, wherein the data is an internet access record.

10. A data probing and expansion device based on sample data, comprising:

the determining module is used for determining the sample data based on at least one piece of data in a database, and the database stores a plurality of pieces of data acquired by detecting mass data;

the searching module is used for searching in the mass data based on the sample data so as to obtain matched data matched with the sample data in the mass data;

the updating module is used for processing the matching data to obtain a matching rule and updating a fingerprint database, and the matching rule obtained historically is stored in the fingerprint database;

the extraction module is used for performing matching extraction on the mass data based on the updated fingerprint database to obtain data matched with the matching rule in the updated fingerprint database in the mass data and expanding the data obtained by matching to the database;

11. The sample data-based data detection and expansion device according to claim 10, wherein the determination module comprises:

and the selection submodule is used for selecting data with preset quantity from the database and taking the characteristic information of the data with the preset quantity as the sample data.

12. The sample data-based data detection and expansion device of claim 11, wherein the characteristic information comprises:

the feature identification codes of the preset amount of data; or

a regular expression determined according to the preset amount of data.

13. The sample data-based data detection and expansion device of claim 11, wherein the lookup module comprises:

and the first searching submodule is used for searching the data with the same characteristic information as the sample data in the mass data and taking the data with the same characteristic information as the matching data.

14. The apparatus according to claim 13, wherein the search module further comprises a second search submodule, and the second search submodule is configured to search, when searching in the mass data based on the sample data, if a preset constraint condition exists, a part of data defined by the preset constraint condition in the mass data, so as to obtain the matching data.

15. The sample data-based data probing and expansion device according to claim 10, wherein said update module comprises:

the processing submodule is used for carrying out structural processing on the matched data so as to obtain standard data arranged according to a preset format;

the generation submodule is used for generating the matching rule based on the standard data and removing duplication;

and the updating submodule is used for updating the fingerprint database based on the matching rule after the duplication is removed.

16. The sample data-based data probing and expansion device according to claim 15, wherein said generating sub-module comprises:

the conversion unit is used for converting the standard data into the matching rule according to the preset format;

and the duplication removing unit is used for removing repeated items in the converted matching rule to obtain the duplicated matching rule.

17. The sample data-based data probing and expansion device according to claim 16, wherein said update submodule comprises:

the comparison unit is used for comparing the matching rule after the duplication removal with the matching rule in the fingerprint database so as to remove repeated items for the second time;

and the updating unit is used for updating the matching rule after the repeated items are removed twice to the fingerprint database.

18. The apparatus according to any one of claims 10 to 17, wherein the data is an internet access record.