CN112822302B - Data normalization method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112822302B
Authority
CN
China
Prior art keywords
rule
network address
regular
domain name
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911127228.6A
Other languages
Chinese (zh)
Other versions
CN112822302A (en)
Inventor
郭玲
朱建新
杨雷
张晓雨
唐潜
丁娇
秦首科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201911127228.6A
Publication of CN112822302A
Application granted
Publication of CN112822302B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/30 Managing network names, e.g. use of aliases or nicknames
    • H04L61/301 Name conversion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]

Abstract

The application discloses a data normalization method, a data normalization device, electronic equipment and a storage medium, and relates to the field of data processing, in particular to the technical field of network address normalization. The specific implementation scheme is as follows: acquiring a plurality of rule word lists, wherein the rule word lists are generated by an offline module and respectively record regular expressions with different action ranges; acquiring a target network address; and, according to the plurality of rule word lists, sequentially performing character matching on the target network address in descending order of the action ranges of the recorded regular expressions to obtain a normalized network address. Because the target network address is matched against the regular expressions of a plurality of rules in sequence, the accuracy of network address normalization is improved. Because the online module does not need to generate the rule word lists, it can normalize acquired target network addresses without interruption, which improves the efficiency of network address normalization.

Description

Data normalization method and device, electronic equipment and storage medium
Technical Field
The application relates to a data processing technology, in particular to a network address normalization processing technology.
Background
When placing internet advertisements, merchants are very concerned about return on investment (ROI). When calculating ROI, the conversion data can be determined according to the network address of the landing page clicked by the user. To monitor the delivered traffic, a merchant may add a number of parameters to the landing page address, so that many network addresses actually point to the same landing page; these network addresses then need to be normalized. However, online normalization is currently inefficient.
Disclosure of Invention
The embodiment of the application provides a data normalization method, a data normalization device, electronic equipment and a storage medium, and can improve the normalization efficiency of network addresses.
The embodiment of the application provides a data normalization method, which is applied to an online module and comprises the following steps:
acquiring a plurality of rule word lists, wherein the rule word lists are generated by an offline module and respectively record regular expressions with different action ranges;
acquiring a target network address;
and, according to the plurality of rule word lists, sequentially performing character matching on the target network address in descending order of the action ranges of the recorded regular expressions to obtain a normalized network address.
According to the data normalization method provided by the embodiment of the application, the online module acquires the plurality of rule word lists and, when normalizing the target network address, matches it against the regular expressions of the rules in sequence, in descending order of the action ranges of the regular expressions in the rule word lists, which improves the accuracy of network address normalization. Because the online module does not need to generate the rule word lists, it can normalize acquired target network addresses without interruption, which improves the efficiency of network address normalization.
On the basis of the above embodiment, obtaining a plurality of rule word lists includes:
acquiring a pre-general rule, a domain name rule and a page rule generated by the offline module; the pre-general rule comprises regular expressions for abnormal characters in a network address; the domain name rule comprises regular expressions for the non-domain-name part of a domain-name-level network address; the page rule comprises regular expressions for the non-page-address part of a page-level network address; the action range of the regular expressions contained in the pre-general rule is larger than that of the regular expressions contained in the domain name rule, and the action range of the regular expressions contained in the domain name rule is larger than that of the regular expressions contained in the page rule;
correspondingly, sequentially performing character matching on the target network address according to the plurality of rule word lists, in descending order of the action ranges of the recorded regular expressions, to obtain the normalized network address comprises:
sequentially using the pre-general rule, the domain name rule and the page rule to perform character matching on the network address to obtain the normalized network address.
Through the pre-general rule, the domain name rule and the page rule, the embodiment of the application matches regular expressions from a large range to a small range, so that the finer-grained domain name rule and page rule are executed after the pre-general rule, which improves the effectiveness of the finer-grained rule word lists.
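The large-to-small matching order described above can be sketched in Python as follows. This is only an illustrative sketch: the rule contents, the host `shop.example.com`, the dictionary layout and the helper names `apply_rules` and `normalize` are all assumptions made for the example and do not come from the application.

```python
import re
from urllib.parse import urlsplit

# All rules below are hypothetical examples, written as (pattern, replacement).
# Pre-general rules: largest action range, applied to every network address.
PRE_GENERAL_RULES = [
    (r"\s+", ""),           # drop whitespace, treated here as abnormal characters
    (r"(?<!:)/{2,}", "/"),  # collapse repeated slashes but keep "//" after the scheme
]
# Domain name rules: keyed by host, applied to every address on that host.
DOMAIN_RULES = {
    "shop.example.com": [(r"(?<=[?&])utm_[a-z]+=[^&#]*&?", "")],
}
# Page rules: smallest action range, keyed by host + path.
PAGE_RULES = {
    "shop.example.com/item": [(r"\?.*$", "")],
}


def apply_rules(url, rules):
    """Apply each (pattern, replacement) pair to the address in order."""
    for pattern, replacement in rules:
        url = re.sub(pattern, replacement, url)
    return url


def normalize(url):
    """Match from the largest action range to the smallest."""
    url = apply_rules(url, PRE_GENERAL_RULES)
    parts = urlsplit(url)
    url = apply_rules(url, DOMAIN_RULES.get(parts.netloc, []))
    url = apply_rules(url, PAGE_RULES.get(parts.netloc + parts.path, []))
    return url
```

With these assumed rules, `normalize(" http://shop.example.com//item?utm_source=ad&from=feed")` yields `http://shop.example.com/item`. The sample domain rule can leave a trailing separator when the removed parameter is the last one; a practical rule list would need an additional rule for that case.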
On the basis of the above embodiment, acquiring a plurality of rule word lists further includes:
acquiring a post-general rule generated by the offline module, wherein the post-general rule comprises blacklisted network addresses;
correspondingly, after sequentially using the pre-general rule, the domain name rule and the page rule to perform character matching on the network address, the method further comprises:
performing character matching according to the post-general rule to obtain the normalized network address.
In the embodiment of the application, after the target network address has been processed with the pre-general rule, the domain name rule and the page rule in sequence, the post-general rule can be used to check whether the network address is on the blacklist, that is, whether it is a network address with a potential security risk, which improves the security of the normalization result.
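A blacklist check of this kind might look like the sketch below. The blacklist pattern, the host `malicious.example` and the function name `apply_post_general` are hypothetical, and returning `None` for a blacklisted address is an assumption; the application only states that such addresses are filtered.

```python
import re

# Hypothetical post-general rule list: patterns describing blacklisted addresses.
POST_GENERAL_BLACKLIST = [
    re.compile(r"^https?://([^/]+\.)?malicious\.example(/|$)"),
]


def apply_post_general(normalized_url):
    """Return None for a blacklisted address, otherwise the address unchanged.

    Dropping the address is an assumption made for this sketch.
    """
    for pattern in POST_GENERAL_BLACKLIST:
        if pattern.search(normalized_url):
            return None
    return normalized_url
```

The function is meant to run on the output of the earlier rule levels, so it only ever sees addresses that have already been normalized.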
On the basis of the above embodiment, after obtaining the plurality of rule word lists, the method further includes:
inquiring whether the rule word lists have been updated;
and if a rule word list has been updated, acquiring the updated rule word list from a preset update address, wherein the updated rule word list is generated by the offline module.
In the embodiment of the application, when the inquiry shows that a rule word list needs to be updated, the updated rule word list is obtained by accessing the preset update address. The online module thus obtains the updated rule word list promptly after the offline module updates it, which keeps the rule word lists current and further improves the reliability of network address normalization.
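One way to picture the update inquiry is a small client that compares a version marker before replacing its rule word lists. The sketch below is an assumption throughout: the application does not describe a version field, and the `fetch` callable is a stand-in for accessing the preset update address.

```python
class RuleListClient:
    """Online-side holder for rule word lists (hypothetical helper)."""

    def __init__(self, fetch):
        # fetch() stands in for accessing the preset update address; it is
        # assumed to return {"version": ..., "rules": {...}}.
        self._fetch = fetch
        self.version = None
        self.rules = {}

    def refresh_if_updated(self):
        """Replace the local rule word lists only when the version changed."""
        payload = self._fetch()
        if payload["version"] == self.version:
            return False  # nothing new, keep normalizing with the current lists
        self.version = payload["version"]
        self.rules = payload["rules"]
        return True
```

Because the refresh only swaps in lists that the offline module has already produced, the online side never has to pause normalization to generate rules itself.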
The embodiment of the present application further provides a data normalization method, applied to an offline module, including:
acquiring a network address;
generating a plurality of rule word lists according to the network address, wherein the rule word lists respectively record regular expressions with different action ranges;
and sending the generated rule word lists to an online module, so that the online module can normalize network addresses according to the rule word lists.
In the embodiment of the application, the offline module acquires the network address, generates the rule word lists according to it and sends them to the online module, so that the online module can normalize network addresses. Compared with having the online system both maintain the rule word lists and normalize network addresses, generating and maintaining the rule word lists in the offline module relieves the online module of that work and greatly improves its normalization efficiency. Meanwhile, the rule word lists generated by the offline module normalize network addresses at different granularities, which improves the accuracy of network address normalization.
On the basis of the above embodiment, generating a plurality of rule word lists according to the network address includes:
generating a pre-general rule, a domain name rule and a page rule according to the network address; the pre-general rule comprises regular expressions for abnormal characters in a network address; the domain name rule comprises regular expressions for the non-domain-name part of a domain-name-level network address; the page rule comprises regular expressions for the non-page-address part of a page-level network address; the action range of the regular expressions contained in the pre-general rule is larger than that of the regular expressions contained in the domain name rule, and the action range of the regular expressions contained in the domain name rule is larger than that of the regular expressions contained in the page rule.
In the embodiment of the application, the offline module generates the pre-general rule, the domain name rule and the page rule, which perform regular expression matching on network addresses at different levels and thereby provide rule word lists of different granularities.
On the basis of the above embodiment, generating a domain name rule and a page rule according to a network address includes:
acquiring an incremental network address, and sampling the incremental network address to obtain a plurality of network addresses to be identified, wherein the incremental network address comprises an incremental domain name network address and an incremental page-level network address;
performing regular expression replacement on the plurality of network addresses to be identified by using the pre-general rule to obtain preprocessed network addresses;
respectively calculating the web page signature and the web page check code of the web page pointed to by each preprocessed network address;
if the web page signatures of a plurality of preprocessed network addresses are the same, or their web page check codes are the same, removing the parameters that follow a preset symbol in those preprocessed network addresses, wherein the characters after the preset symbol are parameter information carried by the web page;
and mapping the plurality of preprocessed network addresses with the parameters removed to the same network address to obtain the domain name rule and the page rule.
In the embodiment of the application, a plurality of network addresses that actually point to the same address can be identified through the web page signature and the web page check code (MD5), so as to form the domain name rule and the page rule, which makes the address mapping in the rule word lists more accurate.
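The grouping step above can be illustrated with a short sketch. It uses only an MD5 check code of the page content (the web page signature is not modelled), takes "?" as the preset symbol, and assumes the page contents have already been fetched; the function names and the input layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def strip_params(url, symbol="?"):
    """Remove everything from the preset symbol onward ("?" is assumed here)."""
    return url.split(symbol, 1)[0]


def mine_mappings(pages):
    """pages maps each preprocessed address to the page content it returned.

    Addresses whose content has the same MD5 check code are treated as
    pointing to the same page and are mapped to one parameter-free address.
    """
    groups = defaultdict(list)
    for url, content in pages.items():
        groups[hashlib.md5(content).hexdigest()].append(url)

    mappings = {}
    for urls in groups.values():
        if len(urls) < 2:
            continue  # a single address gives no evidence of redundancy
        targets = {strip_params(u) for u in urls}
        if len(targets) == 1:  # all variants collapse to the same address
            target = targets.pop()
            for u in urls:
                mappings[u] = target
    return mappings
```

The resulting address mappings are the raw material from which domain-name-level or page-level regular expressions could then be derived; that derivation step is not shown.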
On the basis of the above embodiment, sending the generated multiple rule word lists to the online module includes:
and when the rule word list is updated, sending the updated rule word list to a preset updating address.
In the embodiment, after the rule word list is generated by the off-line module, the rule word list is uploaded to the preset updating address. When the online module determines that updating is needed, the updated regular word list can be obtained by accessing the preset updating address, so that online and offline data synchronization is realized, and the accuracy of online network address normalization is improved.
On the basis of the above embodiment, generating a plurality of rule word lists according to the network address includes:
and determining a post-general rule according to the website security, wherein the post-general rule comprises a blacklist network address.
According to the application embodiment, the offline module can edit the post-general rule according to the network address security, so that the network address contained in the post-general rule is the network address in the blacklist, and the security is improved.
On the basis of the above embodiment, after generating a plurality of rule word tables according to the network address, the method further includes:
if regular expressions in different rule word lists conflict, retaining the regular expression with the smaller action range and deleting the regular expression with the larger action range.
In the embodiment of the application, the regular expressions in the generated rule word lists can be checked for conflicts. When regular expressions conflict, the one with the smaller action range is retained and the one with the larger action range is deleted, which improves the reliability of the rule word lists.
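The application does not define what counts as a conflict. The sketch below assumes one simple definition, namely the same pattern appearing with different replacements in lists of different action range, and keeps the rule with the smallest range. The tuple layout and the `SCOPE_RANK` table are hypothetical.

```python
# Smaller rank means smaller action range (assumed ordering for this sketch).
SCOPE_RANK = {"page": 0, "domain": 1, "pre_general": 2}


def resolve_conflicts(rules):
    """rules is a list of (scope, pattern, replacement) tuples.

    A conflict is assumed to be the same pattern with different replacements;
    only the rule with the smallest action range survives a conflict.
    """
    by_pattern = {}
    for rule in rules:
        by_pattern.setdefault(rule[1], []).append(rule)

    kept = []
    for group in by_pattern.values():
        if len({replacement for _, _, replacement in group}) > 1:
            group = [min(group, key=lambda r: SCOPE_RANK[r[0]])]
        kept.extend(group)
    return kept
```

Other definitions of a conflict, such as two rules producing different results for the same sample address, would need a different check but could reuse the same keep-the-smaller-range policy.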
On the basis of the above embodiment, after generating a plurality of rule word tables according to the network address, the method further includes:
inputting a preset number of network addresses;
acquiring the number of network addresses which are correct in normalization processing and the number of network addresses which need normalization processing;
determining a normalization accuracy parameter according to the number of the network addresses which are correctly normalized and a preset number;
and determining a normalization recall parameter according to the number of the network addresses which are correct in normalization processing and the number of the network addresses which need to be normalized.
According to the application embodiment, the accuracy parameters and the recall parameters of the normalization processing can be calculated in the off-line module through the preset number of network addresses for testing, so that the normalization processing effect can be displayed for a user more intuitively, and the usability is improved.
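The two parameters follow directly from the three counts named above. A minimal sketch, with a hypothetical function name and argument names:

```python
def normalization_metrics(num_correct, num_input, num_needing):
    """Return (accuracy, recall) for a normalization test run.

    num_correct: addresses that were normalized correctly
    num_input:   the preset number of addresses fed into the test
    num_needing: addresses among them that actually needed normalization
    """
    if num_input <= 0 or num_needing <= 0:
        raise ValueError("counts must be positive")
    accuracy = num_correct / num_input
    recall = num_correct / num_needing
    return accuracy, recall
```

For instance, with 100 input addresses, 50 of which need normalization and 45 of which are normalized correctly, the accuracy parameter is 45/100 and the recall parameter is 45/50.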
The embodiment of the present application further provides a device for data normalization, which is applied to an online module, and includes:
the rule word list obtaining submodule is used for obtaining a plurality of rule word lists, wherein the rule word lists are generated by the offline module and respectively record regular expressions with different action ranges;
the target network address acquisition submodule is used for acquiring a target network address;
and the normalization submodule is used for sequentially performing character matching on the target network address according to the plurality of rule word lists, in descending order of the action ranges of the recorded regular expressions, to obtain the normalized network address.
The embodiment of the present application further provides a device for data normalization, which is applied to an offline module, and includes:
the network address acquisition submodule is used for acquiring a network address;
the rule word list generating submodule is used for generating a plurality of rule word lists according to the network address, wherein the rule word lists respectively record regular expressions with different action ranges;
and the sending submodule is used for sending the generated rule word lists to the online module so that the online module can normalize network addresses according to the rule word lists.
An embodiment of the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method.
Embodiments of the present application also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above method.
Other effects of the above optional implementations are described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be considered limiting of the present application. Wherein:
FIG. 1 is a schematic diagram of a scenario according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram of a method of data normalization according to a first embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a method of data normalization according to a second embodiment of the present application;
FIG. 4 is a schematic diagram of an apparatus for data normalization according to a third embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for data normalization according to a fourth embodiment of the present application;
FIG. 6 is a block diagram of an electronic device for implementing the method of data normalization of an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a network architecture provided in an embodiment of the present application, which includes an online module 010 and an offline module 020. The online module 010 is configured to normalize Uniform Resource Locators (URLs) online. The online module 010 obtains the rule word lists generated by the offline module 020 and sequentially performs character matching on the network address, in descending order of the action ranges of the regular expressions recorded in the rule word lists, to obtain the normalized network address. In the present application, the URL of a web page is also called a network address.
The offline module 020 is used for generating the rule word lists offline. After generating a rule word list or an updated rule word list, the offline module 020 uploads it to the network so that the online module 010 can obtain it through the network. Specifically, the offline module 020 generates rule word lists according to the input network addresses and updates its local rule word lists accordingly. The local rule word lists are then published to the network, where they are stored as the online rule word lists. The online module 010 obtains the online rule word lists as the basis for normalization. A target network address input to the online module 010 undergoes operations such as regular expression matching to yield the normalized network address.
At present, when the online module performs normalization, the online module itself is required to maintain the rule word lists. However, establishing and maintaining the rule word lists requires a large amount of computing resources, which makes online normalization inefficient. The data normalization method provided by the embodiment of the application can improve the completeness of the normalization rules, the degree of automation and the timeliness of normalization. The online module 010 provides an online network address normalization service; it can synchronize the offline normalization rule word lists through an online platform and uses the platform's highly concurrent asynchronous request capability to achieve high throughput and timeliness for URL normalization requests. The offline module mines and generates the rule word lists: the rule word lists used by the online module 010 for normalization can be generated through the fully automatic data processing provided by the application, so that the normalization rule word lists become progressively more complete. This lays a foundation for the subsequent expansion of target conversion bidding (OCPC).
In general, network address normalization rules are summarized and produced slowly, and generating the rule word lists takes far longer than normalizing a large number of network addresses. Therefore, when the online module performs both the generation of the rule word lists and the normalization, normalization is interrupted and efficiency suffers. The embodiment of the application separates the function of generating the rule word lists from the function of normalizing according to them. The online module 010 therefore does not need to spend a large amount of time generating the rule word lists, which reduces the time spent waiting for rule word list updates, improves normalization efficiency and system timeliness, and achieves a high normalization throughput. Specifically, in the data normalization method provided in the embodiment of the present application, the offline module 020 establishes and maintains the rule word lists and the online module 010 normalizes network addresses, which greatly improves network address normalization efficiency. The offline module 020 and the online module 010 can be executed by two separate servers. The offline module 020 and the online module 010 are further described in the following embodiments.
Embodiment one
Fig. 2 shows a data normalization method according to the first embodiment of the present application, which is applied to an online module and is applicable to the case of normalizing network addresses online. The method can be executed on a server and comprises the following steps:
Step 101, obtaining a plurality of rule word lists, wherein the rule word lists are generated by an offline module and respectively record regular expressions with different action ranges.
The online module receives batch processing requests for redundant network addresses (URLs) so as to rapidly produce a list of normalized network addresses.
The plurality of rule word lists may be stored in advance for the online module. For example, the rule word lists generated by the offline module are stored at a preset storage address. When the online module starts, it acquires the rule word lists from the preset storage address.
Optionally, a pre-general rule, a domain name rule and a page rule generated by the offline module are acquired; the pre-general rule comprises regular expressions for abnormal characters in a network address; the domain name rule comprises regular expressions for the non-domain-name part of a domain-name-level network address; the page rule comprises regular expressions for the non-page-address part of a page-level network address; the action range of the regular expressions contained in the pre-general rule is larger than that of the regular expressions contained in the domain name rule, and the action range of the regular expressions contained in the domain name rule is larger than that of the regular expressions contained in the page rule.
For example, the characters after the question mark symbol "?" can be removed. Consecutive slash symbols "/" can be removed through the domain name rule and the page rule to obtain normalized characters.
The action range may be the length of the network address that the regular expression applies to, or the range of network addresses that the regular expression can process. For example, the regular expressions contained in the pre-general rule may apply to all network addresses. The regular expressions contained in the domain name rule apply to domain-name-level network addresses that have been processed by the pre-general rule. The regular expressions contained in the page rule apply to page-level network addresses that have been processed by the regular expressions contained in the domain name rule.
Further, a post-general rule generated by the offline module is obtained, wherein the post-general rule comprises blacklisted network addresses.
The blacklist stores the network addresses to be blocked. The post-general rule is used to filter, based on the blacklisted network addresses, the network addresses that have passed through the pre-general rule, the domain name rule and the page rule.
Step 102, acquiring a target network address.
The requester who needs to perform the network address normalization processing can send the target network address to the online module. Or when the online module receives the network address processing task, the online module reads the data file recorded with the target network address by accessing the network resource.
Step 103, according to the plurality of rule word lists, sequentially performing character matching on the target network address in descending order of the action ranges of the recorded regular expressions to obtain the normalized network address.
Optionally, the pre-general rule, the domain name rule and the page rule are used in sequence to perform character matching on the network address to obtain the normalized network address.
The action range of the pre-general rule is larger than that of the domain name rule, and the action range of the domain name rule is larger than that of the page rule. Matching the input URL against the pre-general rule, the domain name rule and the page rule in sequence therefore proceeds from the largest action range to the smallest. This preserves the principle that fine-grained rules take priority and avoids the normalization errors caused when rules of different levels match the same address.
Regular expression matching from a large range to a small range is realized through the pre-general rule, the domain name rule and the page rule, so that the finer-grained domain name rule and page rule are executed after the pre-general rule, which improves the effectiveness of the finer-grained rule word lists.
Further, character matching is performed according to the post-general rule to obtain the normalized network address.
After the target network address has been processed with the pre-general rule, the domain name rule and the page rule in sequence, the post-general rule is used to check whether the network address is on the blacklist, that is, whether it is a network address with a potential security risk, which further improves the security of the normalization result.
Further, after obtaining the plurality of rule word lists in step 101, the method further includes:
inquiring whether the rule word lists have been updated; and if a rule word list has been updated, acquiring the updated rule word list from a preset update address, wherein the updated rule word list is generated by the offline module.
The updated rule word list may be stored in a network database. The online module may query whether an updated rule word list exists through a query statement. If one exists, the online module accesses the network database through the preset update address and acquires the updated rule word list. The updated rule word list may be one or more of the pre-general rule, the domain name rule, the page rule and the post-general rule.
When the inquiry shows that a rule word list needs to be updated, the updated rule word list is obtained by accessing the preset update address. The online module thus obtains the updated rule word list promptly after the offline module updates it, which keeps the rule word lists current and further improves the reliability of network address normalization.
According to the data normalization method provided by the embodiment of the application, the online module acquires the plurality of rule word lists and, when normalizing the target network address, matches it against the regular expressions of the rules in sequence, in descending order of the action ranges of the regular expressions in the rule word lists, which improves the accuracy of network address normalization. Because the online module does not need to generate the rule word lists, it can normalize acquired target network addresses without interruption, which improves the efficiency of network address normalization.
Example two
Fig. 3 shows a data normalization method provided in the second embodiment of the present application. The method is applied to an offline module and is suitable for generating the rule word lists used when network addresses are normalized online. The method can be executed on a server, which may be a different server from the one in the first embodiment, and can be implemented by the following steps:
Step 201, acquiring a network address.
The network address to be analyzed may be sent to the offline module periodically.
Step 202, generating a plurality of rule word lists according to the network address, wherein the rule word lists respectively record regular expressions with different action ranges.
The online module performs network address normalization based on the rule word lists generated by the offline module. A rule word list is stored in text form, and the rules stored in it are expressed as regular expressions. According to the action range of the regular expressions they contain, rules are generated at three granularities: general rules, domain name rules, and page rules.
Optionally, a pre-general rule, a domain name rule and a page rule are generated according to the network address. The pre-general rule comprises regular expressions for abnormal characters in network addresses; the domain name rule comprises regular expressions for the non-domain-name part of domain-name-level network addresses; the page rule comprises regular expressions for the non-page-address part of page-level network addresses. The action range of the regular expressions contained in the pre-general rule is larger than that of the regular expressions contained in the domain name rule, and the action range of the regular expressions contained in the domain name rule is larger than that of the regular expressions contained in the page rule.
The pre-general rule can be obtained by mining features common to network addresses. The general rules include regular expressions that replace invalid parameter names, regular expressions that replace invalid strings, and the like. Invalid characters that have already been observed can be counted, and the pre-general rule is generated from the counted invalid characters.
Optionally, handling of the position symbol "#" and of the separator "/" may be added to the pre-general rule. Because the pre-general rule is configured through a word list rather than hard-coded, subsequent iteration of the module is faster, and processing errors that affect individual sites can be repaired.
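As an illustration of configuring the pre-general rule through a word list instead of hard-coding it, the sketch below loads rules from a text form and applies them. The tab-separated "pattern, replacement" layout and the three example rules are assumptions; the text only states that the list is stored as text and that rules are regular expressions.

```python
import re

# Hypothetical text form of a pre-general rule word list:
# one rule per line, written as "pattern<TAB>replacement".
PRE_GENERAL_TEXT = "\n".join([
    "#.*$\t",           # drop the position symbol "#" and what follows it
    "(?<!:)/{2,}\t/",   # collapse repeated "/" separators, sparing "://"
    "/$\t",             # drop a trailing "/"
])


def load_rules(text):
    """Parse the text form into (compiled pattern, replacement) pairs."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pattern, replacement = line.split("\t", 1)
        rules.append((re.compile(pattern), replacement))
    return rules


def preprocess(url, rules):
    """Apply every pre-general rule in list order."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url
```

With this layout, repairing a faulty rule means editing one line of the word list rather than redeploying code.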
The offline module generates the pre-general rule, the domain name rule and the page rule, which perform regular expression matching on network addresses at different levels, so that the rule word lists normalize network addresses at different granularities.
Further, the domain name rule and the page rule can be generated according to the network address by the following steps:
Step 1) obtaining incremental network addresses, and sampling them to obtain a plurality of network addresses to be identified.
The incremental network addresses comprise incremental domain-name-level network addresses and incremental page-level network addresses. A certain number of network address URLs are acquired as input, a list of already processed network addresses is maintained, and the incremental network addresses are calculated against that list. Performing the subsequent steps on incremental network addresses rather than on the full set reduces the load that the mining policy needs to handle.
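Step 1 can be sketched as a set difference against the processed list followed by random sampling. The function names, the use of an in-memory set for the processed list, and the fixed seed are assumptions for illustration.

```python
import random


def incremental_addresses(batch, processed):
    """Return addresses in `batch` not seen before, and record them as processed."""
    # dict.fromkeys removes duplicates while keeping the input order.
    new = [url for url in dict.fromkeys(batch) if url not in processed]
    processed.update(new)
    return new


def sample_addresses(addresses, k, seed=0):
    """Draw at most k addresses to be identified (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(addresses, min(k, len(addresses)))
```

Only the addresses returned by `incremental_addresses` go on to the later mining steps, so repeated input batches add no extra load.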
Step 2) performing regular expression replacement on the plurality of network addresses to be identified by using the pre-general rule to obtain preprocessed network addresses.
The incremental network addresses are preprocessed using the same rule word list as the online module.
Step 3) calculating the webpage signature and the webpage check code of the webpage pointed to by each preprocessed network address.
Webpage content is captured by a web crawler on the online platform, and a webpage signature is generated according to a preset signature algorithm. The preset signature algorithm may be an algorithm for selecting a space with the longest name in a webpage as a signature, or an algorithm for using webpage content as a signature.
The web page check code (content) is the MD5 value of the component code on a web page, where MD5 is the Message-Digest Algorithm 5. In this process, the component code of the web page is captured by a web crawler and used to generate the web page check code.
After the web page check code and the web page signature are determined, a fusion feature (join feature) can be formed according to the web page check code and the web page signature. The fusion feature may be a key-value pair consisting of a web page check code and a web page signature.
Furthermore, measures such as splitting the network addresses and limiting the number of words can be taken to prevent the task list from becoming blocked.
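The check code and fusion feature of step 3 might be computed as in the sketch below. The MD5 check code follows the description; the signature function is only one possible reading of "using webpage content as a signature" (here, the whitespace-normalized text), and the function names are hypothetical.

```python
import hashlib


def page_check_code(component_code: str) -> str:
    """MD5 value of the page's component code, as a hex string."""
    return hashlib.md5(component_code.encode("utf-8")).hexdigest()


def page_signature(page_text: str) -> str:
    """Assumed signature: the page text with runs of whitespace collapsed."""
    return " ".join(page_text.split())


def fusion_feature(component_code: str, page_text: str):
    """Key-value pair made of the check code and the signature."""
    return (page_check_code(component_code), page_signature(page_text))
```

Two addresses whose fusion features share either element are candidates for mapping to the same address in the next step.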
Step 4) if the webpage signatures of a plurality of preprocessed network addresses are the same, or their webpage check codes are the same, removing the parameters that follow a preset symbol in those preprocessed network addresses, wherein the characters after the preset symbol are parameter information carried by the webpage.
If a plurality of network addresses have the same signature or the same webpage check code in their fusion features, the parameters in those network addresses can be removed, and the preprocessed URL replaces the original URL and is used to generate rules. The parameters may be the characters following the question mark "?". Optionally, some network addresses may become dead links after their parameters are removed, even though their different parameters correspond to the same page (or the same template). For these network addresses, if the signature or the content is the same, the network address is mapped to a fixed URL.
Step 5) mapping the plurality of preprocessed network addresses with the parameters removed to the same network address to obtain the domain name rule and the page rule.
According to this embodiment of the application, a plurality of network addresses that actually point to the same page can be identified through the webpage signature and the MD5 webpage check code, so that the domain name rule and the page rule are formed and the address mapping in the rule word list is more accurate.
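Steps 4 and 5 can be sketched as grouping addresses by signature or check code, stripping everything after the preset symbol, and mapping each group to one address. The input layout (a dictionary from address to fusion feature), the choice of the smallest stripped address as the shared target, and the function names are assumptions; turning the resulting mapping into domain name and page rules is not shown.

```python
from collections import defaultdict


def strip_parameters(url, symbol="?"):
    """Remove the preset symbol and the parameter characters after it."""
    return url.split(symbol, 1)[0]


def mine_mappings(features):
    """features: {url: (check_code, signature)} -> {url: shared target address}."""
    groups = defaultdict(list)
    for url, (check_code, signature) in features.items():
        groups[("signature", signature)].append(url)
        groups[("check_code", check_code)].append(url)

    mapping = {}
    for urls in groups.values():
        if len(urls) < 2:
            continue  # nothing shares this signature or check code
        target = min(strip_parameters(u) for u in urls)
        for url in urls:
            mapping.setdefault(url, target)
    return mapping
```

Addresses that share neither a signature nor a check code with any other address are left out of the mapping and keep their parameters.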
Optionally, the generation period of the domain name rule and the page rule may be set as a continuous 24-hour time window. Hour-level rules are generated by merging the network addresses in the window, which improves the overall timeliness of the data stream.
Further, a post-general rule is determined according to the website security, and the post-general rule comprises a blacklist network address.
The post-general rule contains blacklisted network addresses, that is, network addresses considered insecure, such as illegal network addresses. It may also be used, for example, to record faulty page rules or domain name rules that should be deleted.
The offline module edits the post-general rule according to network address security, so that the network addresses contained in the post-general rule are blacklisted network addresses, which improves security.
Step 203, sending the generated plurality of rule word lists to the online module so that the online module can normalize network addresses according to them.
Optionally, when a rule word list is updated, the updated rule word list is sent to the preset update address.
After the offline module generates a rule word list, it uploads the list to the preset update address. When the online module determines that an update is needed, it can obtain the updated rule word list by accessing the preset update address, so that online and offline data are synchronized and the accuracy of online network address normalization is improved.
Further, if regular expressions in different rule word lists conflict, the regular expression with the smaller action range is retained and the regular expression with the larger action range is deleted.
All rules are updated to a local normalization rule word list maintained by the offline module. If rules of different granularities conflict, the rule with the smaller action range is retained; the retention priority, from highest to lowest, is: page rule, domain name rule, pre-general rule.
In this embodiment of the application, the regular expressions in the generated rule word lists can be checked for conflicts. When regular expressions conflict, the one with the smaller action range is retained and the one with the larger action range is deleted, which improves the reliability of the rule word lists.
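The retention priority can be sketched as follows. The text does not define when two regular expressions conflict; this sketch assumes, purely for illustration, that rules conflict when the same pattern appears in more than one rule word list.

```python
# Smaller rank = smaller action range = retained in a conflict.
SCOPE_RANK = {"page": 0, "domain": 1, "pre_general": 2}


def resolve_conflicts(rules):
    """rules: list of (scope, pattern, replacement); keep the narrowest per pattern."""
    kept = {}
    for scope, pattern, replacement in rules:
        current = kept.get(pattern)
        if current is None or SCOPE_RANK[scope] < SCOPE_RANK[current[0]]:
            kept[pattern] = (scope, pattern, replacement)
    return list(kept.values())
```

A page rule therefore overrides a domain name rule or pre-general rule that uses the same pattern, while rules without a counterpart are kept unchanged.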
Further, inputting a preset number of network addresses; acquiring the number of network addresses which are correct in normalization processing and the number of network addresses which need normalization processing; determining a normalization accuracy parameter according to the number of the network addresses which are correctly normalized and a preset number; and determining a normalization recall parameter according to the number of the network addresses which are correct in normalization processing and the number of the network addresses which need to be normalized.
The normalization accuracy parameter represents the proportion of normalization results that are correct and is calculated as: number of network addresses normalized correctly / total number of network addresses normalized. The normalization recall parameter, also known as coverage, is calculated as: number of network addresses normalized correctly / total number of network addresses that require normalization.
A network address that needs normalization is a network address with parameters, and a network address that does not need normalization is one without parameters. The accuracy parameter and the recall parameter may be calculated as follows:
(1) Extracting network addresses from a network address library: at most 20 different page-level network addresses are extracted under the same domain name, three original network addresses of the same page level are randomly extracted as a group, and 2000 groups of network address data are extracted in total. Because normalization rules are generally consistent within a site, limiting the number of network addresses per domain name prevents any one domain name from being over-sampled and distorting the evaluation of normalization performance.
(2) Performing normalization: the online module is called to normalize the network addresses, and the normalization result returned by the online module is received. Alternatively, the offline module normalizes the network addresses with the generated rule word lists to obtain the normalization result.
(3) Calculating signatures: a signature is calculated for each pair consisting of a normalized network address and its original network address.
(4) Judging whether page-level network address normalization is correct: if the signatures of more than two pairs of normalization results in the group are the same, the normalization is correct; otherwise, the normalization is incorrect.
(5) Calculating the accuracy parameter and the recall parameter: the normalization accuracy parameter and the normalization recall parameter are calculated according to the formulas above.
According to the application embodiment, the accuracy parameters and the recall parameters of the normalization processing can be calculated in the off-line module through the preset number of network addresses for testing, so that the normalization processing effect can be displayed for a user more intuitively, and the usability is improved.
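The two ratios can be computed directly from the three counts named above. The function and parameter names below are hypothetical, and returning 0.0 when a denominator is zero is an assumption made to keep the sketch total.

```python
def normalization_metrics(correct, processed, needing):
    """Return (accuracy, recall) for a normalization test run.

    correct   -- number of network addresses normalized correctly
    processed -- total number of network addresses normalized
    needing   -- number of network addresses that require normalization
    """
    accuracy = correct / processed if processed else 0.0
    recall = correct / needing if needing else 0.0
    return accuracy, recall
```

For instance, 90 correct results out of 100 processed addresses, of which 120 addresses in the test set carried parameters, gives an accuracy of 0.9 and a recall of 0.75.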
According to the data normalization method provided by this embodiment of the application, the offline module acquires network addresses, generates the rule word lists from them, and sends the generated rule word lists to the online module so that the online module can normalize network addresses. Compared with having the online system both maintain the rule word lists and normalize network addresses, in this embodiment the rule word lists are generated and maintained by the offline module and the online module does not need to maintain them, so the normalization efficiency of the online module is greatly improved. Meanwhile, the plurality of rule word lists generated by the offline module can normalize network addresses at different granularities, which improves the accuracy of network address normalization.
Example three
Fig. 4 shows a data normalization device 300 according to a third embodiment of the present application, which is applied to an online module and includes: a rule word list obtaining sub-module 301, a target network address obtaining sub-module 302, and a normalization sub-module 303. Wherein:
the rule word list obtaining sub-module 301 is configured to obtain a plurality of rule word lists, where the plurality of rule word lists are generated by the offline module and record regular expressions with different action ranges;
a target network address obtaining submodule 302 configured to obtain a target network address;
and the normalization submodule 303 is configured to perform character matching on the target network address in turn, in descending order of the action range of the regular expressions recorded in the plurality of rule word lists, so as to obtain a normalized network address.
On the basis of the above embodiment, the rule vocabulary obtaining sub-module 301 is configured to:
acquiring a pre-general rule, a domain name rule and a page rule generated by an offline module; the pre-general rule comprises a regular expression of abnormal characters in the network address; the domain name rule comprises a regular expression of a non-domain name part in the domain name level network address; the page rule comprises a regular expression of a non-page address part in the page-level network address; the action range of the regular expression contained in the pre-general rule is larger than that of the regular expression contained in the domain name rule, and the action range of the regular expression contained in the domain name rule is larger than that of the regular expression contained in the page rule;
accordingly, the normalization submodule 303 is configured to:
and sequentially using the pre-general rule, the domain name rule and the page rule to perform character matching on the network address to obtain a normalized network address.
On the basis of the above embodiment, the rule vocabulary obtaining sub-module 301 is further configured to:
acquiring a post-general rule generated by an offline module, wherein the post-general rule comprises a blacklist network address;
correspondingly, the normalization submodule 303 is further configured to:
and performing character matching according to the post-general rule to obtain the normalized network address.
On the basis of the above embodiment, the device further includes an update sub-module, configured to:
querying whether the rule word list has been updated;
and if the rule word list has been updated, acquiring the updated rule word list from the preset update address, wherein the updated rule word list is generated by the offline module.
In the data normalization device provided in this embodiment of the application, the rule word list obtaining sub-module 301 obtains a plurality of rule word lists, and when the target network address obtained by the target network address obtaining sub-module 302 is normalized, the normalization sub-module 303 applies the rules in turn, in descending order of the action range of the regular expressions in the rule word lists, to match regular expressions against the target network address, which improves the accuracy of network address normalization. Because the online module does not need to generate the rule word lists, it can normalize incoming target network addresses without interruption, which improves normalization efficiency.
Example four
Fig. 5 shows a data normalization device 400 according to a fourth embodiment of the present application, which is applied to an offline module and includes: a network address obtaining submodule 401, a rule word list generation submodule 402, and a sending submodule 403. Wherein:
a network address obtaining submodule 401, configured to obtain a network address;
the rule word list generation submodule 402 is configured to generate a plurality of rule word lists according to the network address, where the plurality of rule word lists record regular expressions with different action ranges respectively;
the sending submodule 403 is configured to send the generated plurality of rule word lists to the online module, so that the online module performs normalization processing on a network address according to the plurality of rule word lists.
On the basis of the above embodiment, the rule vocabulary generating sub-module 402 is configured to:
generating a pre-general rule, a domain name rule and a page rule according to the network address; the pre-general rule comprises a regular expression of abnormal characters in the network address; the domain name rule comprises a regular expression of a non-domain name part in the domain name level network address; the page rule comprises a regular expression of a non-page address part in the page-level network address; the action range of the regular expression contained in the pre-general rule is larger than that of the regular expression contained in the domain name rule, and the action range of the regular expression contained in the domain name rule is larger than that of the regular expression contained in the page rule.
On the basis of the above embodiment, the rule word list generating sub-module 402 is configured to:
acquiring an incremental network address, and sampling the incremental network address to obtain a plurality of network addresses to be identified, wherein the incremental network address comprises an incremental domain name network address and an incremental page-level network address;
performing regular expression replacement on the plurality of network addresses to be identified by using the pre-general rule to obtain preprocessed network addresses;
respectively calculating the webpage signature and the webpage check code of the webpage pointed to by each preprocessed network address;
if the webpage signatures of a plurality of preprocessed network addresses are the same, or their webpage check codes are the same, removing the parameters that follow a preset symbol in those preprocessed network addresses, wherein the characters after the preset symbol are parameter information carried by the webpage;
and mapping the plurality of preprocessed network addresses with the parameters removed to the same network address to obtain a domain name rule and a page rule.
On the basis of the above embodiment, the sending sub-module 403 is configured to:
and when the rule word list is updated, sending the updated rule word list to a preset update address.
On the basis of the above embodiment, the rule vocabulary generating sub-module 402 is configured to:
and determining a post-general rule according to the website security, wherein the post-general rule comprises a blacklist network address.
On the basis of the above embodiment, the device further includes a conflict detection sub-module, configured to:
if regular expressions in different rule word lists conflict, retaining the regular expression with the smaller action range and deleting the regular expression with the larger action range.
On the basis of the above embodiment, the device further includes an evaluation parameter calculation sub-module, configured to:
inputting a preset number of network addresses;
acquiring the number of network addresses which are correct in normalization processing and the number of network addresses which need normalization processing;
determining a normalization accuracy parameter according to the number of the network addresses with correct normalization processing and a preset number;
and determining a normalization recall parameter according to the number of the network addresses which are correct in normalization processing and the number of the network addresses which need to be normalized.
In the data normalization device provided by this embodiment of the application, the network address obtaining submodule 401 obtains a network address, the rule word list generation submodule 402 generates rule word lists according to the network address, and the sending submodule 403 sends the generated rule word lists to the online module, so that the online module can normalize network addresses. Compared with having the online system both maintain the rule word lists and normalize network addresses, in this embodiment the rule word lists are generated and maintained by the offline module and the online module does not need to maintain them, so the normalization efficiency of the online module is greatly improved. Meanwhile, the plurality of rule word lists generated by the offline module can normalize network addresses at different granularities, which improves the accuracy of network address normalization.
Example six
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for the data normalization method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 501, memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 501 is taken as an example.
Memory 502 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for normalizing data provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of normalizing data provided herein.
The memory 502, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the data normalization method in the embodiments of the present application (e.g., the rule word list obtaining sub-module 301, the target network address obtaining sub-module 302, and the normalization sub-module 303 shown in fig. 4; and also, for example, the network address obtaining sub-module 401, the rule word list generation sub-module 402, and the sending sub-module 403 shown in fig. 5). By running the non-transitory software programs, instructions, and modules stored in the memory 502, the processor 501 executes the various functional applications and data processing of the server, that is, implements the data normalization method in the above method embodiments.
The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created through use of the electronic device for data normalization, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 502 optionally includes memory located remotely from the processor 501, which may be connected to the electronic device for data normalization over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of data normalization may further comprise: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for data normalization; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment of the application, the online module can obtain the plurality of rule word lists, when the target network address is subjected to normalization processing, the plurality of rules are sequentially used for matching the regular expressions to the target network address according to the sequence that the action ranges of the regular expressions in the rule word lists are from large to small, and accuracy of network address normalization processing is improved. The online module does not need to generate a regular word list, so that the online module can uninterruptedly normalize the acquired target address, and the network address normalization processing efficiency is improved.
The online module can realize regular expression matching from a large range to a small range through the preposed general rule, the domain name rule and the page rule, so that the page rule with smaller granularity is executed after the domain name rule and the preposed general rule, and further, the effectiveness of the vocabulary of the rule with smaller granularity is improved.
After processing the target network address with the pre-general rule, the domain name rule, and the page rule in sequence, the online module can use the post-general rule to check whether the address is a blacklisted network address that poses a potential security risk, which improves the security of the normalization result.
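A blacklist check of this kind might look like the sketch below. The text does not say what the online module does with a blacklisted address, so the example only reports a flag; the pattern, the host `malware.example`, and the function name `check_post_general` are hypothetical.

```python
import re

# Hypothetical post-general rule: patterns that describe blacklisted addresses.
POST_GENERAL_RULES = [
    re.compile(r"^https?://([^/]+\.)?malware\.example(/|$)"),
]


def check_post_general(normalized_url):
    """Return the address together with a flag saying whether it is blacklisted.

    How a blacklisted address is handled afterwards (dropped, marked,
    reported) is left to the caller; that policy is an assumption here.
    """
    blacklisted = any(p.search(normalized_url) for p in POST_GENERAL_RULES)
    return normalized_url, blacklisted
```

The check runs on the already normalized address, so one blacklist pattern covers every variant that the earlier rules collapse into the same form.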
When the rule word lists need to be updated, the online module can obtain the updated lists by accessing the preset update address. The online module therefore obtains the updated rule word lists promptly after the offline module updates them, which keeps the lists current and further improves the reliability of network address normalization.
The offline module can acquire network addresses, generate rule word lists from them, and send the generated lists to the online module so that the online module can normalize network addresses. Compared with an approach in which the online system both maintains the rule word lists and normalizes network addresses, the embodiments of the present application generate and maintain the rule word lists in the offline module, so the online module does not need to maintain them, which greatly improves the normalization efficiency of the online module. In addition, the multiple rule word lists generated by the offline module normalize network addresses at different granularities, which improves the accuracy of network address normalization.
The offline module generates the pre-general rule, the domain name rule, and the page rule, which perform regular expression matching on network addresses at different levels, so that the rule word lists normalize network addresses at different granularities.
The offline module can use the web page signature and the MD5 check code of each web page to identify multiple network addresses that actually point to the same page, and derive the domain name rule and the page rule from them, so that the address mappings in the rule word lists are more accurate.
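The grouping step can be illustrated with a short sketch. It assumes the page bodies have already been fetched and uses only the MD5 check code; the web page signature is not defined in this section and is omitted. The function name `derive_mappings`, the `?` default for the preset symbol, and the choice of the shortest stripped address as the common target are assumptions for the example.

```python
import hashlib
from collections import defaultdict


def derive_mappings(pages, symbol="?"):
    """Map addresses whose pages have the same MD5 to one common address.

    pages:  dict of preprocessed network address -> page body (bytes).
    symbol: preset symbol after which the address carries parameters.
    Returns a dict of original address -> common address.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[hashlib.md5(body).hexdigest()].append(url)

    mapping = {}
    for urls in groups.values():
        if len(urls) < 2:
            continue  # only one address points to this page, nothing to merge
        # Remove the parameters after the preset symbol.
        stripped = [u.split(symbol, 1)[0] for u in urls]
        # Assumption: use the shortest stripped address as the common target.
        target = min(stripped, key=lambda u: (len(u), u))
        for u in urls:
            mapping[u] = target
    return mapping


if __name__ == "__main__":
    sample = {
        "http://example.com/a?x=1": b"same page",
        "http://example.com/a?x=2": b"same page",
        "http://example.com/b": b"other page",
    }
    print(derive_mappings(sample))
```

The resulting mappings are raw material; turning them into regular expressions for the domain name rule and the page rule is a separate step that this sketch does not cover.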
After generating the rule word lists, the offline module uploads them to the preset update address. When the online module determines that an update is needed, it obtains the updated rule word lists by accessing the preset update address. This keeps online and offline data synchronized and improves the accuracy of online network address normalization.
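One way to picture the online side of this synchronization is the sketch below. The text does not state how the online module decides that an update is needed, so the version counter is an assumption, as are the payload layout, the class name `RuleStore`, and the injected `fetch` callable that stands in for whatever transport reaches the preset update address.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RuleStore:
    """Holds the rule word lists currently used by the online module."""

    fetch: Callable[[str], Dict]  # hypothetical transport: address -> payload
    update_address: str           # the preset update address
    version: int = 0              # assumed version counter of the loaded lists
    word_lists: Dict = field(default_factory=dict)

    def refresh_if_needed(self) -> bool:
        """Load newer word lists from the update address, if there are any."""
        payload = self.fetch(self.update_address)
        if payload["version"] <= self.version:
            return False  # nothing newer has been uploaded
        self.version = payload["version"]
        self.word_lists = payload["word_lists"]
        return True
```

Because the transport is injected, the same class can be exercised with an in-memory function in tests and with a real client in deployment.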
The offline module can edit the post-general rule according to network address security, so that the network addresses contained in the post-general rule are blacklisted addresses, which improves security.
The regular expressions in the generated rule word lists can be checked for conflicts. When regular expressions conflict, the one with the smaller action range is retained and the one with the larger action range is deleted, which improves the reliability of the rule word lists.
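The retention policy can be sketched as follows. The section does not define when two regular expressions conflict, so the example assumes a conflict means the same pattern string appearing in more than one rule; the `Rule` type, the numeric `scope` field, and the function name `resolve_conflicts` are likewise hypothetical.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    pattern: str      # regular expression source text
    replacement: str  # replacement text
    scope: int        # assumed rank: larger value = larger action range


def resolve_conflicts(rules: List[Rule]) -> List[Rule]:
    """Keep, for each pattern, only the rule with the smallest action range.

    Assumption: two rules conflict when they share the same pattern string.
    If conflicting rules have equal scope, the first one seen is kept.
    """
    kept = {}
    for rule in rules:
        current = kept.get(rule.pattern)
        if current is None or rule.scope < current.scope:
            kept[rule.pattern] = rule
    return list(kept.values())
```

Preferring the smaller action range means a rule written for one page overrides a broader rule that happens to use the same pattern.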
The offline module can compute accuracy and recall parameters for the normalization processing from a preset number of test network addresses, so that the effect of normalization is presented to the user more intuitively, which improves usability.
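Following the definitions given in claim 9, the two parameters reduce to simple ratios: accuracy is the number of correctly normalized addresses divided by the preset number of test addresses, and recall is the same count divided by the number of addresses that need normalization. The function and argument names in this sketch are assumptions.

```python
def normalization_metrics(correct, total_tested, needing_normalization):
    """Return (accuracy, recall) for a normalization test run.

    correct:               addresses that were normalized correctly
    total_tested:          the preset number of test addresses
    needing_normalization: test addresses that actually need normalization
    """
    if total_tested <= 0 or needing_normalization <= 0:
        raise ValueError("both denominators must be positive")
    accuracy = correct / total_tested
    recall = correct / needing_normalization
    return accuracy, recall
```

For instance, with 100 test addresses of which 90 need normalization and 80 are normalized correctly, the accuracy is 80/100 and the recall is 80/90.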
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (13)

1. A method for data normalization, applied to an online module, comprising:
acquiring a plurality of rule word lists, wherein the rule word lists are generated by an offline module and respectively record regular expressions with different action ranges;
acquiring a target network address;
according to the plurality of rule word lists, sequentially performing character matching on the network address in descending order of the action ranges of the recorded regular expressions to obtain a normalized network address;
wherein the acquiring a plurality of rule word lists comprises:
acquiring a pre-general rule, a domain name rule, and a page rule generated by the offline module; the pre-general rule comprises a regular expression for abnormal characters in a network address; the domain name rule comprises a regular expression for the non-domain-name part of a domain-name-level network address; the page rule comprises a regular expression for the non-page-address part of a page-level network address; the action range of the regular expression contained in the pre-general rule is larger than that of the regular expression contained in the domain name rule, and the action range of the regular expression contained in the domain name rule is larger than that of the regular expression contained in the page rule;
correspondingly, the sequentially performing character matching on the network address according to the plurality of rule word lists in descending order of the action ranges of the recorded regular expressions to obtain a normalized network address comprises:
performing character matching on the network address by using the pre-general rule, the domain name rule, and the page rule in sequence to obtain the normalized network address, wherein the action range is the length of the network address to which the regular expression is adapted, or the range of network addresses that the regular expression can process.
2. The method of data normalization of claim 1, wherein the acquiring a plurality of rule word lists further comprises:
acquiring a post-general rule generated by the offline module, wherein the post-general rule comprises blacklisted network addresses;
correspondingly, after the performing character matching on the network address by using the pre-general rule, the domain name rule, and the page rule in sequence, the method further comprises:
performing character matching according to the post-general rule to obtain the normalized network address.
3. The method of data normalization according to any one of claims 1-2, further comprising, after the acquiring a plurality of rule word lists:
querying whether to update the rule word lists;
and if the rule word lists are updated, acquiring the updated rule word lists according to a preset update address, wherein the updated rule word lists are generated by the offline module.
4. A method for data normalization, applied to an offline module, comprising:
acquiring a network address;
generating a plurality of rule word lists according to the network address, wherein the rule word lists respectively record regular expressions with different action ranges;
sending the generated plurality of rule word lists to an online module so that the online module can normalize network addresses according to the plurality of rule word lists;
wherein the generating a plurality of rule word lists according to the network address comprises:
generating a pre-general rule, a domain name rule, and a page rule according to the network address; the pre-general rule comprises a regular expression for abnormal characters in a network address; the domain name rule comprises a regular expression for the non-domain-name part of a domain-name-level network address; the page rule comprises a regular expression for the non-page-address part of a page-level network address; the action range of the regular expression contained in the pre-general rule is larger than that of the regular expression contained in the domain name rule, and the action range of the regular expression contained in the domain name rule is larger than that of the regular expression contained in the page rule, wherein the action range is the length of the network address to which the regular expression is adapted, or the range of network addresses that the regular expression can process.
5. The method of claim 4, wherein generating the domain name rule and the page rule according to the network address comprises:
acquiring incremental network addresses, and sampling the incremental network addresses to obtain a plurality of network addresses to be identified, wherein the incremental network addresses comprise incremental domain-name-level network addresses and incremental page-level network addresses;
performing regular expression replacement on the plurality of network addresses to be identified by using the pre-general rule to obtain preprocessed network addresses;
calculating, for each preprocessed network address, the web page signature and the web page check code of the web page to which it points;
if the web page signatures of a plurality of preprocessed network addresses are the same or their web page check codes are the same, removing the parameters after a preset symbol in the plurality of preprocessed network addresses, wherein the characters after the symbol are parameter information carried by the web page;
and mapping the plurality of preprocessed network addresses with the parameters removed to the same network address to obtain the domain name rule and the page rule.
6. The method of data normalization of claim 4, wherein the sending the generated plurality of rule word lists to an online module comprises:
when the rule word lists are updated, sending the updated rule word lists to a preset update address.
7. The method of claim 4, wherein the generating a plurality of rule word lists according to network addresses comprises:
and determining a post-general rule according to the website security, wherein the post-general rule comprises a blacklist network address.
8. The method of claim 4, further comprising, after the generating a plurality of rule word lists according to the network address:
if regular expressions among the rule word lists conflict, retaining the regular expression with the smaller action range and deleting the regular expression with the larger action range.
9. The method of data normalization of claim 4, further comprising, after the generating a plurality of rule word lists according to the network address:
inputting a preset number of network addresses;
acquiring the number of network addresses which are correct in normalization processing and the number of network addresses which need normalization processing;
determining a normalization accuracy parameter according to the number of the network addresses with correct normalization processing and the preset number;
and determining a normalization recall parameter according to the number of the network addresses with correct normalization processing and the number of the network addresses needing normalization processing.
10. An apparatus for data normalization, applied to an online module, comprising:
a rule word list acquisition submodule, configured to acquire a plurality of rule word lists, wherein the rule word lists are generated by an offline module and respectively record regular expressions with different action ranges;
a target network address acquisition submodule, configured to acquire a target network address;
a normalization submodule, configured to sequentially perform character matching on the network address according to the plurality of rule word lists, in descending order of the action ranges of the recorded regular expressions, to obtain a normalized network address;
wherein the rule word list acquisition submodule is specifically configured to:
acquire a pre-general rule, a domain name rule, and a page rule generated by the offline module; the pre-general rule comprises a regular expression for abnormal characters in a network address; the domain name rule comprises a regular expression for the non-domain-name part of a domain-name-level network address; the page rule comprises a regular expression for the non-page-address part of a page-level network address; the action range of the regular expression contained in the pre-general rule is larger than that of the regular expression contained in the domain name rule, and the action range of the regular expression contained in the domain name rule is larger than that of the regular expression contained in the page rule;
the normalization submodule is specifically configured to: perform character matching on the network address by using the pre-general rule, the domain name rule, and the page rule in sequence to obtain the normalized network address, wherein the action range is the length of the network address to which the regular expression is adapted, or the range of network addresses that the regular expression can process.
11. An apparatus for data normalization, applied to an offline module, comprising:
a network address acquisition submodule, configured to acquire a network address;
a rule word list generation submodule, configured to generate a plurality of rule word lists according to the network address, wherein the rule word lists respectively record regular expressions with different action ranges;
a sending submodule, configured to send the generated plurality of rule word lists to an online module so that the online module can normalize network addresses according to the plurality of rule word lists;
wherein the rule word list generation submodule is specifically configured to:
generate a pre-general rule, a domain name rule, and a page rule according to the network address; the pre-general rule comprises a regular expression for abnormal characters in a network address; the domain name rule comprises a regular expression for the non-domain-name part of a domain-name-level network address; the page rule comprises a regular expression for the non-page-address part of a page-level network address; the action range of the regular expression contained in the pre-general rule is larger than that of the regular expression contained in the domain name rule, and the action range of the regular expression contained in the domain name rule is larger than that of the regular expression contained in the page rule, wherein the action range is the length of the network address to which the regular expression is adapted, or the range of network addresses that the regular expression can process.
12. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
13. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
CN201911127228.6A 2019-11-18 2019-11-18 Data normalization method and device, electronic equipment and storage medium Active CN112822302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911127228.6A CN112822302B (en) 2019-11-18 2019-11-18 Data normalization method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112822302A CN112822302A (en) 2021-05-18
CN112822302B CN112822302B (en) 2023-03-24

Family

ID=75852368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911127228.6A Active CN112822302B (en) 2019-11-18 2019-11-18 Data normalization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112822302B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant