WO2022127259A1 - Data cleaning method, apparatus and device, and storage medium - Google Patents

Data cleaning method, apparatus and device, and storage medium

Info

Publication number
WO2022127259A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
cleaned
target data
target
extractor
Prior art date
Application number
PCT/CN2021/120043
Other languages
French (fr)
Chinese (zh)
Inventor
孟宪奎
程强
万月亮
Original Assignee
北京锐安科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京锐安科技有限公司
Publication of WO2022127259A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present application relate to the technical field of data cleaning, for example, to a data cleaning method, apparatus, device, and storage medium.
  • in the related art, the post data is usually forwarded directly, or the post data is word-segmented and stored, which not only consumes a large amount of storage space, but also requires the redundant data in the post data to be forwarded during data transmission.
  • Embodiments of the present application provide a data cleaning method, apparatus, device, and storage medium, so as to clean redundant data in data, save storage space, and improve data transmission efficiency.
  • an embodiment of the present application provides a data cleaning method, including:
  • the target data is decoded, and the decoded target data is screened according to multiple reference data uploaded by the client, so as to clean the data to be cleaned.
  • an embodiment of the present application further provides a data cleaning device, including:
  • a data extractor determination module configured to acquire the data to be cleaned, and determine a target data extractor corresponding to the data to be cleaned
  • a target data extraction module configured to parse the data to be cleaned, and extract target data contained in the data to be cleaned through the target data extractor, where the target data includes at least one of an attribute name, attribute data, or label text data;
  • the target data screening module is configured to decode the target data, and screen the decoded target data according to a plurality of reference data uploaded by the client, so as to clean the data to be cleaned.
  • an embodiment of the present application further provides a data cleaning device, where the data cleaning device includes:
  • one or more processors;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the data cleaning method according to any one of the embodiments of this application.
  • the embodiments of the present application further provide a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, are used to execute the data cleaning method according to any one of the embodiments of the present application.
  • FIG. 3 is a flowchart of a data cleaning method in another embodiment of the present application.
  • FIG. 6 is a composition diagram of post data in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a data cleaning apparatus in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a data cleaning device in an embodiment of the present application.
  • FIG. 1 is a flowchart of a data cleaning method in an embodiment of the present application. This embodiment is applicable to the case of filtering out redundant data in data.
  • the method can be executed by a data cleaning apparatus, which can be implemented by means of software and/or hardware and is integrated in the data cleaning device that executes the method.
  • the data cleaning device may be a server, a computer, a tablet computer, etc., which is not limited in this embodiment. Referring to FIG. 1, the method includes the following steps:
  • Step 110 Acquire the data to be cleaned, and determine a target data extractor corresponding to the data to be cleaned.
  • the data to be cleaned may be post data or get data, which is not limited in this embodiment.
  • a target data extractor corresponding to the data to be cleaned may be determined.
  • the post data can be composed of three parts: the request url (uniform resource locator), the request cookie, and the request body.
  • target data needs to be extracted from all three parts of the post data.
  • the url and the cookie are generally key-value pair data
  • the body can contain data in three formats: xml (extensible markup language), JSON (JavaScript Object Notation), and key-value pair data.
  • the data format types contained in the post data to be cleaned can be identified, and the target data extractor is selected according to those data format types to extract the target data contained in the post data.
  • for example, if it is identified that the post data to be cleaned only contains key-value pair data, a target data extractor that matches the key-value pair data can be selected to extract the key-value pair data contained in the post data; if the post data to be cleaned only contains xml data, the target data extractor that matches the xml data can be selected to extract the xml data contained in the post data; if it is identified that the post data to be cleaned contains both key-value pair data and xml data, the target data extractor that matches the key-value pair data and the target data extractor that matches the xml data can both be selected to extract the key-value pair data and the xml data contained in the post data.
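The format-driven selection described above can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the function names `detect_formats` and `select_extractors`, the `EXTRACTORS` registry, and the detection heuristics (try JSON, then xml, then key-value parsing) are all assumptions introduced here.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl


def detect_formats(parts):
    """Return the set of data format types found in the given post-data parts."""
    found = set()
    for text in parts:
        text = text.strip()
        if not text:
            continue
        if text[0] in "{[":
            try:
                json.loads(text)
                found.add("json")
                continue
            except ValueError:
                pass  # not valid JSON, fall through to the other checks
        if text.startswith("<"):
            try:
                ET.fromstring(text)
                found.add("xml")
                continue
            except ET.ParseError:
                pass  # not well-formed xml
        if parse_qsl(text):
            found.add("key_value")
    return found


# Hypothetical registry: one extractor name per data format type.
EXTRACTORS = {
    "key_value": "key_value_extractor",
    "xml": "xml_extractor",
    "json": "json_extractor",
}


def select_extractors(parts):
    """Pick every target data extractor that matches a detected format type."""
    return sorted(EXTRACTORS[fmt] for fmt in detect_formats(parts))
```

When the parts contain more than one format, more than one extractor is returned, which mirrors the case in which both the key-value pair extractor and the xml extractor are selected.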
  • Step 120 Analyze the data to be cleaned, and extract the target data contained in the data to be cleaned through a target data extractor.
  • the target data includes at least one item of attribute name, attribute data or label text data.
  • the data to be cleaned may be parsed to determine the encoding mode of the data to be cleaned.
  • the encoding mode may be a base64 encoding mode, a decoder encoding mode, or an encryption encoding mode, which is not limited in this embodiment.
  • the target data contained in the data to be cleaned can be extracted by the selected target data extractor.
  • Step 130 Decode the target data, and filter the decoded target data according to multiple reference data uploaded by the client to clean the data to be cleaned.
  • the extracted target data can be decoded according to the parsed encoding mode of the data to be cleaned, and the decoded target data is screened according to multiple reference data uploaded by the client, so as to clean the data to be cleaned and filter out redundant data contained in the data to be cleaned.
  • the plurality of reference data uploaded by the client may be data related to user requirements, for example, keywords related to user requirements, etc., which are not limited in this embodiment.
  • the target data can be compared with the reference data uploaded by the client; the target data corresponding to the reference data is retained, and the target data that does not correspond to the reference data is filtered out, so that redundant data unrelated to user needs is removed.
  • the data to be cleaned is obtained and the target data extractor corresponding to the data to be cleaned is determined; the data to be cleaned is parsed, and the target data contained in the data to be cleaned is extracted through the target data extractor, where the target data includes at least one of an attribute name, attribute data, or label text data; the target data is decoded, and the decoded target data is screened according to multiple reference data uploaded by the client to clean the data to be cleaned. In this way the redundant data in the data is cleaned, which can save storage space and improve data transmission efficiency.
  • FIG. 2 is a flowchart of a data cleaning method in another embodiment of the present application. This embodiment is a refinement of the above technical solution, and its technical solution may be combined with the solutions in one or more of the above embodiments. As shown in FIG. 2, the data cleaning method may include the following steps:
  • Step 210 Acquire data to be cleaned.
  • Step 220 Identify the data format type contained in the data to be cleaned, and determine the target data extractor according to the data format type.
  • the data format types include: key-value pair, xml, JSON, etc., which are not limited in this embodiment.
  • the data format type contained in the post data to be cleaned can be identified, and it may be one or more of the key-value pair, xml, and JSON data format types, which is not limited in this embodiment.
  • the target data extractor may be determined according to the data format type contained in the identified data to be cleaned, where the target data extractor may include a key-value pair extractor, an xml extractor, or a JSON extractor.
  • for example, if the data to be cleaned contains key-value pair data and xml data, the key-value pair extractor and the xml extractor can both be determined as target data extractors.
  • Step 230 Analyze the data to be cleaned, and extract the target data contained in the data to be cleaned by using a target data extractor.
  • extracting the target data included in the data to be cleaned by the target data extractor may include: extracting key-value pairs, xml data or JSON information contained in the data to be cleaned by the target data extractor.
  • extracting the key-value pairs, xml data or JSON information contained in the data to be cleaned by the target data extractor may include: extracting the key-value pairs contained in the data to be cleaned by the key-value pair extractor; or, extracting the xml data contained in the data to be cleaned by the xml extractor; or, extracting the JSON information contained in the data to be cleaned by the JSON extractor.
  • the extractors can also be used at the same time: the key-value pairs contained in the data to be cleaned are extracted by the key-value pair extractor, the xml data by the xml extractor, and the JSON information by the JSON extractor.
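As a rough illustration of the three extractors, the sketch below pulls attribute names, attribute data and label text out of key-value, xml and JSON input using only the Python standard library. The function names and the choice to flatten nested JSON into dotted names are assumptions made for this example; the application does not specify them.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl


def extract_key_values(text):
    """Key-value pair extractor: (attribute name, attribute data) pairs."""
    return [(name, value) for name, value in parse_qsl(text)]


def extract_xml(text):
    """xml extractor: (tag name, label text) for every element that has text."""
    root = ET.fromstring(text)
    return [(el.tag, el.text.strip()) for el in root.iter()
            if el.text and el.text.strip()]


def extract_json(text):
    """JSON extractor: flatten nested objects into (attribute name, data) pairs."""
    pairs = []

    def walk(node, prefix=""):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{prefix}.{key}" if prefix else key)
        elif isinstance(node, list):
            for item in node:
                walk(item, prefix)
        else:
            pairs.append((prefix, node))  # leaf value reached

    walk(json.loads(text))
    return pairs
```

Each function returns a flat list of pairs, so the outputs of several extractors can simply be concatenated when the data to be cleaned mixes formats.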
  • Step 240 Decode the target data, and filter the decoded target data according to multiple reference data uploaded by the client to clean the data to be cleaned.
  • FIG. 3 is a flowchart of a data cleaning method in another embodiment of the present application. This embodiment is a refinement of the above technical solution, and its technical solution may be combined with the solutions in one or more of the above embodiments. As shown in FIG. 3, the data cleaning method may include the following steps:
  • Step 310 Acquire data to be cleaned.
  • Step 320 Identify the data format type contained in the data to be cleaned, and determine the target data extractor according to the data format type.
  • Step 330 Analyze the data to be cleaned.
  • parsing the data to be cleaned may include identifying an encoding mode of the data to be cleaned, where the encoding mode of the data to be cleaned may be a base64 encoding mode, a decoder encoding mode, an encryption encoding mode, etc., which is not limited in this embodiment.
  • Step 340 Extract the target data contained in the data to be cleaned by using the target data extractor.
  • Step 350 Select a target decoder according to the parsed coding mode corresponding to the target data, and decode the target data.
  • after the target data included in the data to be cleaned is extracted by the target data extractor, a target decoder can be selected according to the encoding mode of the target data obtained by parsing, that is, the encoding mode of the data to be cleaned, and the extracted target data is decoded by the target decoder to convert it from unrecognizable characters into characters that are easy to understand, for example, from "%%" to "abc".
  • if the encoding mode of the data to be cleaned is the base64 encoding mode, that is, the encoding mode of the target data is the base64 encoding mode, the target decoder corresponding to the base64 encoding mode can be selected to decode the target data; if the encoding mode of the data to be cleaned is the decoder encoding mode, that is, the encoding mode of the target data is the decoder encoding mode, the target decoder corresponding to the decoder encoding mode can be selected to decode the target data.
  • if the encoding mode of the data to be cleaned is the encryption encoding mode, that is, the encoding mode of the target data is the encryption encoding mode, the encryption key corresponding to the encryption encoding mode can be obtained, and the target data can be decoded according to the encryption key.
  • Step 360 Screen the decoded target data according to the multiple reference data uploaded by the client to clean the data to be cleaned.
  • the data to be cleaned is parsed to determine the encoding mode of the data to be cleaned; a target decoder is selected according to the parsed encoding mode corresponding to the target data, and the target data is decoded, thereby converting the target data into a form that is easy to identify and providing a basis for the subsequent cleaning of redundant data in post data.
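Selecting a decoder by encoding mode can be sketched as a lookup table of decoder functions. Several things here are assumptions and not taken from the application: the mode names, the reading of the "decoder encoding mode" as URL percent-encoding, and the toy XOR cipher that merely stands in for whatever real cipher the encryption encoding mode uses.

```python
import base64
from urllib.parse import unquote


def decode_base64(data, key=None):
    return base64.b64decode(data).decode("utf-8")


def decode_percent(data, key=None):
    # Assumption: the "decoder encoding mode" is treated as percent-encoding.
    return unquote(data)


def decode_encrypted(data, key=None):
    # Toy XOR cipher over hex text; a placeholder for a real cipher.
    if key is None:
        raise ValueError("an encryption key is required")
    raw = bytes.fromhex(data)
    key_bytes = key.encode("utf-8")
    plain = bytes(b ^ key_bytes[i % len(key_bytes)] for i, b in enumerate(raw))
    return plain.decode("utf-8")


# Hypothetical mapping from parsed encoding mode to target decoder.
DECODERS = {
    "base64": decode_base64,
    "percent": decode_percent,
    "encrypted": decode_encrypted,
}


def decode_target_data(data, mode, key=None):
    """Select the target decoder for the parsed encoding mode and apply it."""
    try:
        decoder = DECODERS[mode]
    except KeyError:
        raise ValueError(f"unsupported encoding mode: {mode}") from None
    return decoder(data, key)
```

Only the encrypted mode consumes the key; the other decoders accept and ignore it so that all three share one call signature.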
  • FIG. 4 is a flowchart of a data cleaning method in another embodiment of the present application. This embodiment is a refinement of the above technical solution, and its technical solution may be combined with the solutions in one or more of the above embodiments. As shown in FIG. 4, the data cleaning method may include the following steps:
  • Step 410 Acquire data to be cleaned.
  • Step 420 Identify the data format type contained in the data to be cleaned, and determine the target data extractor according to the data format type.
  • Step 430 Analyze the data to be cleaned.
  • Step 440 Extract the target data contained in the data to be cleaned by using the target data extractor.
  • Step 450 Select a target decoder according to the parsed coding mode corresponding to the target data, and decode the target data.
  • Step 460 Compare the decoded target data with multiple reference data output by the data model uploaded by the client; if the first data in the target data matches the multiple reference data, keep the first data; if the second data in the target data does not match the multiple reference data, the second data is filtered out.
  • the first data in the target data and the second data in the target data are any data in the target data, which are only for the convenience of describing this embodiment, rather than limiting the embodiment of this application.
  • the decoded target data may be compared with multiple reference data output from the data model uploaded by the client. If the first data in the target data matches any reference data among the multiple reference data, for example, the similarity is greater than a set threshold (for example, 0.9, 0.85 or 0.99, etc., which is not limited in this embodiment), the first data can be retained; if the second data does not match any reference data among the multiple reference data, for example, the similarity with every reference data is less than the set threshold, it can be determined that the second data is redundant data, and it can be filtered out.
  • the data model involved in this embodiment may be a data model obtained by training on the client or a computer in advance; in this embodiment, the process of training the data model may include: labeling the sample data, where the sample data can be a large amount of post data, which is not limited in this embodiment; constructing a data model through naive Bayesian training and outputting all the data required by the user; on this basis, normalizing the output data (data standardization, case removal, etc.); and then highly aggregating the normalized data, refining the data content, and reducing the number of data items to obtain the reference data.
  • the decoded target data can be compared with multiple reference data output by the data model uploaded by the client; if the first data in the target data matches the multiple reference data, the first data is retained; if the second data in the target data does not match the multiple reference data, the second data is filtered out. This filters out the redundant data contained in the post data, which can save the storage space of post data and improve the transmission efficiency of post data.
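The threshold-based screening can be illustrated with a short sketch. The application does not say how similarity is computed; this example assumes `difflib.SequenceMatcher.ratio()` from the Python standard library as the similarity measure, and the names `similarity` and `screen` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (assumed measure, see lead-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def screen(target_data, reference_data, threshold=0.9):
    """Split target data into retained items and filtered-out redundant items."""
    kept, filtered = [], []
    for item in target_data:
        # Best similarity against any reference item; 0.0 if there are none.
        best = max((similarity(item, ref) for ref in reference_data), default=0.0)
        # Retain only when the similarity is greater than the set threshold.
        (kept if best > threshold else filtered).append(item)
    return kept, filtered
```

For instance, with reference data `["username", "email"]` and a threshold of 0.9, the item `"usernam"` is retained because its ratio to `"username"` is 14/15, while an unrelated item such as `"tracking_id"` is filtered out.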
  • the process of the data cleaning method includes:
  • Step 510 post data identification.
  • the post data identification mainly identifies the data structure of the post, including identifying the data encoding mode, identifying the cookie part data, identifying the header data of the request header, and identifying the body data of the request body.
  • identify the data type mainly including json, xml, key-value pair type data.
  • Step 520 post data extraction.
  • the post data consists of three parts: request url, cookie, and body. Each part needs to be extracted.
  • the request url data and cookie data are generally key-value pair data
  • the body contains xml, JSON, and key-value data.
  • data analysis is carried out for JSON format data: attributes and attribute data are extracted according to the characteristics of the JSON data structure, and label text data is extracted.
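Splitting post data into its url, cookie and body parts before extraction might look like the sketch below. It assumes the request is already available as three strings and uses standard-library parsers; the function name `split_post_data` and the returned dictionary layout are hypothetical.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlsplit


def split_post_data(url, cookie_header, body):
    """Return the key-value pairs of the url and cookie, plus the raw body."""
    # Key-value pairs carried in the query string of the request url.
    query_pairs = parse_qsl(urlsplit(url).query)

    # Key-value pairs carried in the Cookie header.
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    cookie_pairs = [(name, morsel.value) for name, morsel in cookie.items()]

    # The body may be xml, JSON or key-value data, so it is left unparsed
    # here and handed to whichever extractor matches its format.
    return {"url": query_pairs, "cookie": cookie_pairs, "body": body}
```

Keeping the body as raw text postpones the format decision to the extractor selection step, so this function does not need to know about xml or JSON.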
  • Step 530 Decode the data in reverse.
  • the data reverse decoding technology mainly identifies the data pattern from the extracted data.
  • the current stage mainly supports base64 encoding and ordinary decoder encoding mode data. Under the condition of identifying the corresponding data encoding mode, the standard decoder is called for decoding, so as to realize the restoration of the encoded data, thereby realizing the data restoration capability as much as possible and improving the data quality.
  • Step 540 data cleaning.
  • the data cleaning technology mainly includes two parts:
  • the value direction of the data is different.
  • the sample data is labeled according to the user's requirements, the data model is constructed through naive Bayesian training, and all the data required by the user is output. On this basis, the output data is normalized (data standardization, case removal, etc.), and then the normalized data is highly aggregated, the data content is refined, and the number of data items is reduced.
  • the number of keywords has a great influence on efficient processing. Therefore, under the premise of meeting business needs, the number of keywords should be reduced as much as possible. In this study, the keywords are highly aggregated and refined.
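The normalization and aggregation of keywords can be sketched as follows. The application does not define "highly aggregated", so this example assumes the simplest reading: keywords are trimmed, lower-cased and whitespace-collapsed, then duplicates are merged. The function names are hypothetical.

```python
def normalize(keyword):
    """Standardize a keyword: trim, remove case, collapse inner whitespace."""
    return " ".join(keyword.strip().lower().split())


def aggregate(keywords):
    """Merge keywords that normalize to the same form, keeping first-seen order."""
    seen = {}
    for keyword in keywords:
        norm = normalize(keyword)
        if norm:  # drop keywords that are empty after normalization
            seen.setdefault(norm, None)
    return list(seen)
```

Reducing the list this way keeps the number of comparisons per item small during screening, which is the efficiency concern raised above.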
  • the embodiments of the present application can analyze post data in the Internet in real time and extract valuable data, reduce distributed file (hadoop) and retrieval engine data storage, improve the retrieval engine's ability to read and write data, and extract valuable data in post data.
  • FIG. 7 is a schematic structural diagram of a data cleaning apparatus in an embodiment of the present application, and the apparatus can execute the data cleaning method involved in the above-mentioned embodiments.
  • the apparatus includes: a data extractor determination module 710, a target data extraction module 720, and a target data screening module 730.
  • the data extractor determination module 710 is configured to obtain the data to be cleaned, and to determine the target data extractor corresponding to the data to be cleaned;
  • the target data extraction module 720 is configured to parse the data to be cleaned, and extract target data contained in the data to be cleaned through the target data extractor, where the target data includes at least one of an attribute name, attribute data, or label text data;
  • the target data screening module 730 is configured to decode the target data, and screen the decoded target data according to multiple reference data uploaded by the client, so as to clean the data to be cleaned.
  • the data to be cleaned is obtained through the data extractor determination module, and the target data extractor corresponding to the data to be cleaned is determined; the target data extraction module parses the data to be cleaned and extracts, through the target data extractor, the target data contained in the data to be cleaned; the target data screening module decodes the target data and screens the decoded target data according to a plurality of reference data uploaded by the client, so as to clean the data to be cleaned. This cleans the redundant data in the data, which can save storage space and improve data transmission efficiency.
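How the three modules cooperate can be sketched as a small class whose collaborators are passed in as callables. This is only one possible arrangement under stated assumptions: the class name, the pair-based data shape, the base64 decoder and the rule that screening matches on attribute names are all choices made for this illustration.

```python
import base64
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple
from urllib.parse import parse_qsl

Pair = Tuple[str, str]
Extractor = Callable[[str], List[Pair]]


@dataclass
class DataCleaningApparatus:
    determine_extractor: Callable[[str], Extractor]  # role of module 710
    decode: Callable[[str], str]                     # decoding part of module 730
    reference_data: Set[str]                         # uploaded by the client

    def clean(self, raw: str) -> List[Pair]:
        extractor = self.determine_extractor(raw)    # module 710
        target_data = extractor(raw)                 # module 720
        decoded = [(name, self.decode(value)) for name, value in target_data]
        # Module 730: keep only pairs whose attribute name is reference data.
        return [pair for pair in decoded if pair[0] in self.reference_data]


def key_value_extractor(raw: str) -> List[Pair]:
    return parse_qsl(raw)


apparatus = DataCleaningApparatus(
    determine_extractor=lambda raw: key_value_extractor,  # single-format case
    decode=lambda value: base64.b64decode(value).decode("utf-8"),
    reference_data={"user"},
)
cleaned = apparatus.clean("user=YWxpY2U=&sid=MTIz")
```

In this usage the `sid` pair is treated as redundant and dropped, while the `user` pair is decoded and retained.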
  • a data extractor determination module 710 configured to identify the data format type contained in the data to be cleaned, and determine a target data extractor according to the data format type;
  • the data format type includes key-value pair, extensible markup language xml, or JS object notation JSON;
  • the target data extractor includes: a key-value pair extractor, an xml extractor or a JSON extractor.
  • the target data extraction module 720 is configured to extract key-value pairs, xml data or JSON information contained in the data to be cleaned through the target data extractor;
  • the target data extraction module 720 is further configured to extract the key-value pairs contained in the data to be cleaned through the key-value pair extractor;
  • the JSON information contained in the data to be cleaned is extracted by the JSON extractor.
  • the target data screening module 730 including a decoding module, is configured to select a target decoder according to the parsed encoding mode corresponding to the target data, and decode the target data;
  • the encoding mode includes: base64 encoding mode, decoder encoding mode or encryption encoding mode.
  • the decoding module is further configured to obtain an encryption key corresponding to the encryption encoding mode if the encoding mode is an encryption encoding mode, and decode the target data according to the encryption key.
  • the target data screening module 730 is configured to compare the decoded target data with a plurality of reference data output by the data model uploaded by the client;
  • if the second data in the target data does not match the plurality of reference data, the second data is filtered out.
  • the data to be cleaned involved in this embodiment is post data.
  • the data cleaning apparatus provided in the embodiment of the present application can execute the data cleaning method provided by any embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 8 is a schematic structural diagram of a data cleaning device provided by an embodiment of the application.
  • the data cleaning device includes a processor 80, a memory 81, an input device 82, and an output device 83; the number of processors 80 in the data cleaning device can be one or more, and one processor 80 is taken as an example in FIG. 8; the processor 80, the memory 81, the input device 82 and the output device 83 in the data cleaning device can be connected by a bus or other means, and connection through a bus is taken as an example in FIG. 8.
  • the memory 81 can be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the data cleaning method in the embodiments of the present application (for example, the data extractor determination module 710, the target data extraction module 720, and the target data screening module 730 in the data cleaning apparatus). The processor 80 executes various functional applications and data processing of the data cleaning device by running the software programs, instructions and modules stored in the memory 81, that is, the above-mentioned data cleaning method is implemented.
  • the memory 81 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Additionally, memory 81 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some instances, memory 81 may include memory located remotely from processor 80, which may be connected to the data cleaning device via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the input device 82 may be configured to receive input numerical or character information and to generate key signal input related to user settings and function control of the data cleaning apparatus.
  • the output device 83 may include a display device such as a display screen.
  • Embodiments of the present application further provide a storage medium containing computer-executable instructions, where the computer-executable instructions are configured to execute a data cleaning method when executed by a computer processor, and the method includes:
  • the target data is decoded, and the decoded target data is screened according to multiple reference data uploaded by the client, so as to clean the data to be cleaned.
  • in the storage medium containing computer-executable instructions provided by the embodiments of the present application, the computer-executable instructions are not limited to the above-mentioned method operations, and can also execute related operations in the data cleaning method provided in any embodiment of the present application.
  • the storage medium may be a non-transitory computer-readable storage medium.
  • the present application can be implemented by means of software and necessary general-purpose hardware, and certainly can also be implemented by hardware.
  • the technical solutions of the present application, in essence or in the parts that contribute to the related technologies, can be embodied in the form of software products, and the computer software products can be stored in a computer-readable storage medium, such as a computer floppy disk, Read-Only Memory (ROM), Random Access Memory (RAM), Flash Memory (FLASH), hard disk or CD, etc., including several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute the methods described in the various embodiments of the present application.
  • the multiple units and modules included are only divided according to functional logic, but are not limited to the above-mentioned division, as long as the corresponding functions can be realized; in addition, the specific names of the multiple functional units are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data cleaning method, apparatus and device, and a storage medium. The data cleaning method comprises: obtaining data to be cleaned, and determining a target data extractor corresponding to the data (110); parsing the data, and extracting, by means of the target data extractor, target data comprised in the data (120), the target data comprising at least one of an attribute name, attribute data, or label text data; and decoding the target data, and screening the decoded target data according to multiple pieces of reference data uploaded by a client, so as to clean the data (130).

Description

Data cleaning method, apparatus, device, and storage medium
This application claims priority to Chinese Patent Application No. 202011490975.9, filed with the China Patent Office on December 16, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of data cleaning, for example, to a data cleaning method, apparatus, device, and storage medium.
Background
With the continuous development of the Internet, many kinds of data have grown explosively. At the same time, redundant data that does not conform to specifications keeps increasing, especially post data on the Internet. Post data contains a large amount of redundant structured information, so the overall value of the data cannot be effectively realized.
In the related art, post data is usually forwarded directly, or is word-segmented and stored. This not only consumes a large amount of storage space, but also means that the redundant data in the post data has to be forwarded during data transmission.
Therefore, it is necessary to study a solution that cleans the redundant data in post data so as to save storage space and improve data transmission efficiency.
Summary
Embodiments of the present application provide a data cleaning method, apparatus, device, and storage medium, so as to clean redundant data in data, save storage space, and improve data transmission efficiency.
In a first aspect, an embodiment of the present application provides a data cleaning method, including:
acquiring data to be cleaned, and determining a target data extractor corresponding to the data to be cleaned;
parsing the data to be cleaned, and extracting, by means of the target data extractor, target data contained in the data to be cleaned, the target data including at least one of an attribute name, attribute data, or label text data;
decoding the target data, and screening the decoded target data according to multiple pieces of reference data uploaded by a client, so as to clean the data to be cleaned.
In a second aspect, an embodiment of the present application further provides a data cleaning apparatus, including:
a data extractor determination module, configured to acquire data to be cleaned and determine a target data extractor corresponding to the data to be cleaned;
a target data extraction module, configured to parse the data to be cleaned and extract, by means of the target data extractor, target data contained in the data to be cleaned, the target data including at least one of an attribute name, attribute data, or label text data;
a target data screening module, configured to decode the target data and screen the decoded target data according to multiple pieces of reference data uploaded by a client, so as to clean the data to be cleaned.
In a third aspect, an embodiment of the present application further provides a data cleaning device, the data cleaning device including:
one or more processors;
a storage apparatus, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the data cleaning method according to any one of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the data cleaning method according to any one of the embodiments of the present application.
Brief Description of the Drawings
FIG. 1 is a flowchart of a data cleaning method in an embodiment of the present application;
FIG. 2 is a flowchart of a data cleaning method in another embodiment of the present application;
FIG. 3 is a flowchart of a data cleaning method in another embodiment of the present application;
FIG. 4 is a flowchart of a data cleaning method in another embodiment of the present application;
FIG. 5 is a flowchart of a data cleaning method in another embodiment of the present application;
FIG. 6 is a composition diagram of post data in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a data cleaning apparatus in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a data cleaning device in an embodiment of the present application.
Detailed Description
The embodiments of the present application are described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the embodiments of the present application and are not intended to limit them. It should also be noted that, for ease of description, the drawings show only some, rather than all, of the structures related to the embodiments of the present application.
FIG. 1 is a flowchart of a data cleaning method in an embodiment of the present application. This embodiment is applicable to filtering out redundant data in data. The method may be executed by a data cleaning apparatus, which may be implemented in software and/or hardware and integrated in a data cleaning device that executes the method. In this embodiment, the data cleaning device may be a server, a computer, a tablet computer, or a similar device, which is not limited in this embodiment. Referring to FIG. 1, the method includes the following steps:
Step 110: Acquire data to be cleaned, and determine a target data extractor corresponding to the data to be cleaned.
In this embodiment, the data to be cleaned may be post data or get data, which is not limited in this embodiment.
In an example implementation of this embodiment, after the data to be cleaned is acquired, the target data extractor corresponding to the data to be cleaned may be determined. It should be noted that post data may consist of three parts: a request URL (uniform resource locator), a request cookie, and a request body. In this embodiment, target data needs to be extracted from all three parts. The URL and the cookie are generally key-value pair data, while the body may contain data in three data format types: XML (extensible markup language), JSON (JavaScript Object Notation), and key-value pairs.
In an example implementation of this embodiment, after the data to be cleaned (that is, the post data to be cleaned) is acquired, the data format types contained in the post data may be identified, and the target data extractor may be selected according to those data format types so as to extract the target data contained in the post data.
Exemplarily, if it is identified that the post data to be cleaned contains only key-value pair data, a target data extractor matching key-value pair data may be selected to extract the key-value pair data contained in the post data; if it is identified that the post data to be cleaned contains only XML data, a target data extractor matching XML data may be selected to extract the XML data contained in the post data; if it is identified that the post data to be cleaned contains both key-value pair data and XML data, a target data extractor matching key-value pair data and a target data extractor matching XML data may both be selected to extract the key-value pair data and the XML data contained in the post data.
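By way of illustration only, the selection of extractors by data format type described above might be sketched in Python as follows. The function names `detect_format` and `select_extractors` and the detection heuristics (leading character plus a trial parse) are assumptions made for this sketch and are not part of the disclosed method.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl


def detect_format(fragment: str) -> str:
    """Guess the data format type of one fragment of post data (heuristic)."""
    text = fragment.strip()
    if text[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except ValueError:
            pass  # looked like JSON but did not parse; keep checking
    if text.startswith("<"):
        try:
            ET.fromstring(text)
            return "xml"
        except ET.ParseError:
            pass  # looked like XML but did not parse; keep checking
    if parse_qsl(text):
        return "key_value"  # at least one name=value pair was found
    return "unknown"


def select_extractors(fragments):
    """Return the set of extractor kinds needed for the given fragments."""
    kinds = {detect_format(fragment) for fragment in fragments}
    kinds.discard("unknown")
    return kinds
```

With this sketch, a request whose parts are `"k=v"` and `"<r/>"` would select both the key-value pair extractor and the XML extractor, matching the third case in the paragraph above.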
Step 120: Parse the data to be cleaned, and extract, by means of the target data extractor, the target data contained in the data to be cleaned.
The target data includes at least one of an attribute name, attribute data, or label text data.
In an example implementation of this embodiment, after the target data extractor corresponding to the data to be cleaned is determined, the data to be cleaned may be parsed to determine its encoding mode. In this embodiment, the encoding mode of the data to be cleaned may be a base64 encoding mode, a decoder encoding mode, or an encrypted encoding mode, which is not limited in this embodiment. For example, the target data contained in the data to be cleaned may be extracted by the selected target data extractor.
Step 130: Decode the target data, and screen the decoded target data according to multiple pieces of reference data uploaded by a client, so as to clean the data to be cleaned.
In an example implementation of this embodiment, after the target data contained in the data to be cleaned is extracted by the target data extractor, the extracted target data may be decoded according to the encoding mode of the data to be cleaned obtained by parsing, and the decoded target data may be screened according to the multiple pieces of reference data uploaded by the client, so that the data to be cleaned is cleaned and the redundant data it contains is filtered out.
The multiple pieces of reference data uploaded by the client may be data related to user requirements, for example, keywords related to user requirements, which is not limited in this embodiment.
Exemplarily, after the target data is decoded into a form that is easy to recognize, the target data may be compared with the reference data uploaded by the client. Target data that corresponds to the reference data is retained, and target data that does not correspond to the reference data is filtered out. The target data is thereby screened, and redundant data unrelated to user requirements is removed.
In the solution of this embodiment, data to be cleaned is acquired and a target data extractor corresponding to the data to be cleaned is determined; the data to be cleaned is parsed, and the target data it contains is extracted by the target data extractor, the target data including at least one of an attribute name, attribute data, or label text data; the target data is decoded, and the decoded target data is screened according to multiple pieces of reference data uploaded by a client, so as to clean the data to be cleaned. Redundant data in the data is thereby cleaned, which can save storage space and improve data transmission efficiency.
FIG. 2 is a flowchart of a data cleaning method in another embodiment of the present application. This embodiment refines the above technical solution, and the technical solution in this embodiment may be combined with the example solutions in one or more of the above embodiments. As shown in FIG. 2, the data cleaning method may include the following steps:
Step 210: Acquire data to be cleaned.
Step 220: Identify the data format types contained in the data to be cleaned, and determine the target data extractor according to the data format types.
The data format types include key-value pairs, XML, JSON, and the like, which is not limited in this embodiment.
In an example implementation of this embodiment, after the data to be cleaned (that is, the post data to be cleaned) is acquired, the data format types contained in the post data may be identified. These may be one or more of the key-value pair, XML, and JSON data format types, which is not limited in this embodiment.
The target data extractor may be determined according to the identified data format types contained in the data to be cleaned, where the target data extractor may include a key-value pair extractor, an XML extractor, or a JSON extractor.
Exemplarily, if it is identified that the data to be cleaned contains the two data format types key-value pair and XML, the key-value pair extractor and the XML extractor may be determined as the target data extractors.
Step 230: Parse the data to be cleaned, and extract, by means of the target data extractor, the target data contained in the data to be cleaned.
In an example implementation of this embodiment, extracting the target data contained in the data to be cleaned by means of the target data extractor may include: extracting, by means of the target data extractor, the key-value pairs, XML data, or JSON information contained in the data to be cleaned.
Extracting the key-value pairs, XML data, or JSON information contained in the data to be cleaned by means of the target data extractor may include: extracting the key-value pairs contained in the data to be cleaned by means of the key-value pair extractor; or extracting the XML data contained in the data to be cleaned by means of the XML extractor; or extracting the JSON information contained in the data to be cleaned by means of the JSON extractor.
In an example implementation of this embodiment, it is also possible to simultaneously extract the key-value pairs contained in the data to be cleaned by means of the key-value pair extractor, the XML data by means of the XML extractor, and the JSON information by means of the JSON extractor.
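As a non-limiting sketch, the three extractors named above might be written in Python as shown below. The function names and the flat `(name, value)` output shape are assumptions chosen for illustration; the key-value pair extractor deliberately leaves values undecoded, since decoding is described as a later step.

```python
import json
import xml.etree.ElementTree as ET


def extract_key_value(text):
    """Return raw (name, value) pairs from 'a=1&b=2' style data."""
    pairs = []
    for part in text.split("&"):
        name, sep, value = part.partition("=")
        if sep:  # skip fragments that have no '='
            pairs.append((name, value))
    return pairs


def extract_xml(text):
    """Return attribute (name, value) pairs and tag text from XML data."""
    items = []
    for node in ET.fromstring(text).iter():
        items.extend(node.attrib.items())  # attribute names and data
        if node.text and node.text.strip():
            items.append((node.tag, node.text.strip()))  # tag text data
    return items


def extract_json(text):
    """Return (name, value) pairs for every scalar field in JSON data."""
    items = []

    def walk(name, value):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(key, child)
        elif isinstance(value, list):
            for child in value:
                walk(name, child)  # list elements inherit the field name
        else:
            items.append((name, value))

    walk("", json.loads(text))
    return items
```

Each function yields attribute names paired with attribute data (and, for XML, tag text), which corresponds to the target data defined in the method.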
Step 240: Decode the target data, and screen the decoded target data according to multiple pieces of reference data uploaded by the client, so as to clean the data to be cleaned.
In the solution of this embodiment, the data format types contained in the data to be cleaned are identified, and the target data extractor is determined according to the data format types; the key-value pairs, XML data, or JSON information contained in the data to be cleaned are extracted by the target data extractor, which provides a basis for subsequently cleaning the redundant data in the post data.
FIG. 3 is a flowchart of a data cleaning method in another embodiment of the present application. This embodiment refines the above technical solution, and the technical solution in this embodiment may be combined with the example solutions in one or more of the above embodiments. As shown in FIG. 3, the data cleaning method may include the following steps:
Step 310: Acquire data to be cleaned.
Step 320: Identify the data format types contained in the data to be cleaned, and determine the target data extractor according to the data format types.
Step 330: Parse the data to be cleaned.
In an example implementation of this embodiment, parsing the data to be cleaned may include identifying the encoding mode of the data to be cleaned, where the encoding mode may be a base64 encoding mode, a decoder encoding mode, an encrypted encoding mode, or the like, which is not limited in this embodiment.
Step 340: Extract, by means of the target data extractor, the target data contained in the data to be cleaned.
Step 350: Select a target decoder according to the encoding mode of the target data obtained by parsing, and decode the target data.
In an example implementation of this embodiment, after the target data contained in the data to be cleaned is extracted by the target data extractor, a target decoder may be selected according to the encoding mode of the target data obtained by parsing (that is, the encoding mode of the data to be cleaned), and the extracted target data may be decoded by the target decoder so as to convert the target data from unrecognizable characters into characters that are easy to understand, for example, converting the target data from "%%" into characters such as "abc".
Exemplarily, if the encoding mode of the data to be cleaned is identified as the base64 encoding mode, that is, the encoding mode of the target data is the base64 encoding mode, a target decoder corresponding to the base64 encoding mode may be selected to decode the target data; if the encoding mode of the data to be cleaned is identified as the decoder encoding mode, that is, the encoding mode of the target data is the decoder encoding mode, a target decoder corresponding to the decoder encoding mode may be selected to decode the target data.
In an example implementation of this embodiment, if the encoding mode is the encrypted encoding mode, an encryption key corresponding to the encrypted encoding mode is obtained, and the target data is decoded according to the encryption key.
If the encoding mode of the data to be cleaned is identified as the encrypted encoding mode, that is, the encoding mode of the target data is the encrypted encoding mode, the encryption key corresponding to the encrypted encoding mode may be obtained, and the target data may be decoded according to the encryption key.
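The selection of a decoder by encoding mode might, purely as an illustration, look like the following Python sketch. It assumes that the "decoder encoding mode" refers to URL percent-encoding, which the source does not state explicitly, and the detection rules in `identify_encoding` are simple heuristics invented for this example. The encrypted encoding mode is omitted because the source does not name a cipher or a key-retrieval mechanism.

```python
import base64
import re
from urllib.parse import unquote

_PERCENT = re.compile(r"%[0-9A-Fa-f]{2}")


def identify_encoding(value: str) -> str:
    """Heuristically identify the encoding mode of one extracted value."""
    if _PERCENT.search(value):
        return "percent"
    if len(value) >= 8 and len(value) % 4 == 0:
        try:
            # Accept only strict base64 that decodes to valid UTF-8 text.
            base64.b64decode(value, validate=True).decode("utf-8")
            return "base64"
        except ValueError:
            pass  # not base64, or not text once decoded
    return "plain"


# One decoder per encoding mode; "plain" values pass through unchanged.
DECODERS = {
    "percent": unquote,
    "base64": lambda v: base64.b64decode(v).decode("utf-8"),
    "plain": lambda v: v,
}


def decode(value: str) -> str:
    """Select the decoder that matches the identified mode and apply it."""
    return DECODERS[identify_encoding(value)](value)
```

Keeping the decoders in a lookup table means an additional mode, such as a decryption routine keyed by the obtained encryption key, could be registered without changing `decode`.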
Step 360: Screen the decoded target data according to multiple pieces of reference data uploaded by the client, so as to clean the data to be cleaned.
In the solution of this embodiment, the data to be cleaned is parsed to determine its encoding mode; a target decoder is selected according to the encoding mode of the target data obtained by parsing, and the target data is decoded so as to convert it into a form that is easy to recognize, which provides a basis for subsequently cleaning the redundant data in the post data.
FIG. 4 is a flowchart of a data cleaning method in another embodiment of the present application. This embodiment refines the above technical solution, and the technical solution in this embodiment may be combined with the example solutions in one or more of the above embodiments. As shown in FIG. 4, the data cleaning method may include the following steps:
Step 410: Acquire data to be cleaned.
Step 420: Identify the data format types contained in the data to be cleaned, and determine the target data extractor according to the data format types.
Step 430: Parse the data to be cleaned.
Step 440: Extract, by means of the target data extractor, the target data contained in the data to be cleaned.
Step 450: Select a target decoder according to the encoding mode of the target data obtained by parsing, and decode the target data.
Step 460: Compare the decoded target data with multiple pieces of reference data output by a data model uploaded by the client; if first data in the target data matches the reference data, retain the first data; if second data in the target data matches none of the reference data, filter out the second data.
The first data and the second data are each any data in the target data; the terms are used only for ease of describing this embodiment and do not limit the embodiments of the present application.
In an example implementation of this embodiment, after the target data is decoded, the decoded target data may be compared with the multiple pieces of reference data output by the data model uploaded by the client. If the first data matches any one of the pieces of reference data, for example, if the similarity is greater than a set threshold (for example, 0.9, 0.85, or 0.99, which is not limited in this embodiment), the first data may be retained. If the second data matches none of the pieces of reference data, for example, if its similarity to every piece of reference data is less than the set threshold, the second data may be determined to be redundant data and may be filtered out.
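The threshold comparison described above can be sketched in Python as follows. The source does not specify how similarity is computed; this sketch assumes, for illustration only, the ratio returned by the standard library's `difflib.SequenceMatcher`, and the names `similarity` and `screen` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: case-insensitive sequence ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def screen(target_items, references, threshold=0.9):
    """Split target data into retained items and filtered-out (redundant) items."""
    kept, dropped = [], []
    for item in target_items:
        # Retain the item if it is similar enough to any one reference.
        if any(similarity(item, ref) > threshold for ref in references):
            kept.append(item)
        else:
            dropped.append(item)
    return kept, dropped
```

For example, with references `["username", "email"]` and the default threshold of 0.9, the item `"username"` would be retained and `"sessionid"` would be treated as redundant.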
It should be noted that the data model involved in this embodiment may be a data model trained in advance on the client or on a computer. In this embodiment, the process of training the data model may include the following. Sample data may be labeled according to user requirements, where the sample data may be a large amount of post data, which is not limited in this embodiment. A data model may be constructed through naive Bayes training to output all data required by the user. The output data is then normalized (data standardization, removal of case differences, and so on), after which the normalized data is highly aggregated to refine the data content and reduce the number of data entries. The output data processed in this way is the reference data involved in the embodiments of the present application.
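The normalization and aggregation of the model output might be sketched as below. The naive Bayes training itself is not shown. The sketch assumes that "normalization" means trimming whitespace and lowercasing, and that "aggregation" means collapsing terms that become identical after normalization; both readings, and the function names, are assumptions rather than details given in the source.

```python
def normalize(term: str) -> str:
    """Standardize a term: trim, collapse inner whitespace, and lowercase."""
    return " ".join(term.split()).lower()


def aggregate(terms):
    """Collapse terms that normalize to the same text, keeping first-seen order."""
    seen = {}
    for term in terms:
        key = normalize(term)
        if key:  # drop entries that are empty after normalization
            seen.setdefault(key, None)
    return list(seen)
```

Applied to `["UserName", " username ", "E-Mail", "e-mail", ""]`, this reduces five raw entries to the two reference terms `"username"` and `"e-mail"`.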
In the solution of this embodiment, after the target data is decoded, the decoded target data may be compared with the multiple pieces of reference data output by the data model uploaded by the client; if first data in the target data matches the reference data, the first data is retained; if second data in the target data matches none of the reference data, the second data is filtered out. The redundant data contained in the post data is thereby filtered out, which can save storage space for the post data and improve its transmission efficiency.
To help those skilled in the art better understand the data cleaning method involved in this embodiment, an example is described below. Referring to FIG. 5, the process of the data cleaning method includes:
Step 510: post data identification.
Post data identification mainly identifies the data structure of the post data, including identifying the data encoding mode, the cookie data, the request header data, and the request body data. On this basis, the data type is identified (mainly JSON, XML, and key-value pair data). Identifying the data pattern simplifies the control logic of subsequent processing and makes it easy to apply the corresponding extractor to each data format.
Step 520: post data extraction.
Post data consists of three parts: the request URL, the cookie, and the body, and data needs to be extracted from each part. The request URL data and the cookie data are generally key-value pair data, while the body may be in one of three patterns: XML, JSON, or key-value pairs, as shown in FIG. 6.
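As an illustrative sketch of the three-part composition shown in FIG. 6, the following Python function separates the URL query, the cookie pairs, and the body of a request so that each can be passed to its own extractor. The function name `split_post`, its parameters, and the returned dictionary layout are hypothetical; the example URL and cookie values are invented.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit


def split_post(url: str, cookie_header: str, body: str):
    """Return the three parts of post data for separate extraction."""
    query = urlsplit(url).query  # key-value pair data carried in the URL
    cookie = SimpleCookie()
    cookie.load(cookie_header)  # parses 'name=value; name2=value2'
    cookie_pairs = [(name, morsel.value) for name, morsel in cookie.items()]
    return {"url": query, "cookie": cookie_pairs, "body": body}


parts = split_post(
    "http://example.com/login?user=a&lang=en",
    "sid=abc; theme=dark",
    '{"k": 1}',
)
```

Here `parts["url"]` holds the raw query string, `parts["cookie"]` holds the cookie name and value pairs, and `parts["body"]` is left untouched for format identification.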
In the post data extraction design, based on the results of data identification, responsibility is assigned to specific extractors, and each extractor is dedicated to extracting the corresponding type of data.
XML extractor
Parses XML-format data and, according to the structural characteristics of XML data, extracts attributes and attribute data, label text data, and so on.
Cookie extractor
Analyzes cookie data and, according to the structural characteristics of cookie data, extracts attribute names and attribute data.
JSON extractor
Parses JSON-format data and, according to the structural characteristics of JSON data, extracts attributes and attribute data, label text data, and so on.
Step 530: reverse decoding of data.
Reverse decoding mainly identifies the data pattern of the extracted data. At the current stage it mainly supports base64-encoded data and data in the ordinary decoder encoding mode. Once the corresponding encoding mode has been identified, a standard decoder is called to decode the data and restore the encoded content, so that as much data as possible is restored and data quality is improved.
In real-world data, multiple encoding patterns exist, namely JSON nested in XML or XML nested in JSON. In addition, the text value of an XML node may be encoded JSON data, a JSON field value may itself be encoded, and so on. Because data identification, data extraction, and reverse decoding are designed as components, they can be combined and applied in a nested manner, which supports a variety of complex scenarios well.
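The nested application of identification, extraction, and decoding might be sketched recursively as follows. This is only one possible arrangement under stated assumptions: base64 is the only encoding handled, detection relies on trial parsing, and the recursion limit of 5 and the function names `unwrap`, `walk_json`, and `try_base64` are arbitrary choices made for the sketch.

```python
import base64
import json
import xml.etree.ElementTree as ET


def try_base64(value):
    """Return decoded text, or None if the value is not base64-encoded UTF-8."""
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except ValueError:
        return None


def walk_json(name, value, out, depth):
    """Descend into parsed JSON and hand every scalar back to unwrap()."""
    if isinstance(value, dict):
        for key, child in value.items():
            walk_json(key, child, out, depth)
    elif isinstance(value, list):
        for child in value:
            walk_json(name, child, out, depth)
    else:
        unwrap(name, value, out, depth)


def unwrap(name, value, out, depth=0):
    """Peel nested formats and encodings, collecting leaf (name, value) pairs."""
    if not isinstance(value, str) or depth > 5:
        out.append((name, value))
        return
    text = value.strip()
    if text[:1] in ("{", "["):
        try:
            parsed = json.loads(text)
        except ValueError:
            parsed = None
        if parsed is not None:
            walk_json(name, parsed, out, depth + 1)
            return
    if text.startswith("<"):
        try:
            root = ET.fromstring(text)
        except ET.ParseError:
            root = None
        if root is not None:
            for node in root.iter():
                for key, val in node.attrib.items():
                    unwrap(key, val, out, depth + 1)
                if node.text and node.text.strip():
                    unwrap(node.tag, node.text, out, depth + 1)
            return
    decoded = try_base64(text) if len(text) >= 8 else None
    if decoded is not None:
        unwrap(name, decoded, out, depth + 1)  # decoded text may nest further
        return
    out.append((name, value))
```

Given an XML document whose node text is base64-encoded JSON, `unwrap` first applies the XML branch, then the base64 branch, then the JSON branch, so the innermost fields are collected alongside the XML attributes.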
Step 540: data cleaning.
The data cleaning technique mainly includes two parts:
1. Sample data labeling and keyword extraction:
The direction in which data is valuable differs from region to region. Sample data is labeled according to user requirements, and a data model is constructed through naive Bayes training to output all data required by the user. The output data is then normalized (data standardization, removal of case differences, and so on), after which the normalized data is highly aggregated to refine the data content and reduce the number of data entries.
2. Data matching:
Data matching finds all patterns that match within a string, for example, finding which phrases in a dictionary occur in a piece of text. Data matching uses the Aho-Corasick automaton algorithm. Its core idea is to use a finite automaton to turn character comparisons into state transitions. An AC automaton does not need to backtrack during matching, and its time complexity is O(n), that is, the time complexity is independent of the size of the dictionary. This ensures efficient processing of the data overall.
In the Aho-Corasick automaton algorithm, the number of keywords has a considerable effect on processing efficiency. Therefore, the number of keywords is reduced as far as possible while still meeting business requirements; in this study this is achieved by highly aggregating and refining the keywords.
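For readers unfamiliar with the algorithm, a compact Python sketch of an Aho-Corasick automaton is given below. It is a textbook-style construction (trie, breadth-first failure links, merged output lists) written for illustration, not the implementation used in the application; the class and method names are hypothetical.

```python
from collections import deque


class AhoCorasick:
    def __init__(self, keywords):
        # Per state: transition table, failure link, and list of matched words.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for word in keywords:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        """Add one keyword to the trie."""
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self):
        """Breadth-first pass; depth-1 states keep the root as failure link."""
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                fallback = self.fail[state]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[nxt] = self.goto[fallback].get(ch, 0)
                # Inherit matches that end at the failure state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text):
        """Return (start_index, keyword) for every match, in one pass over text."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]  # follow failure links, never re-read text
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                hits.append((i - len(word) + 1, word))
        return hits


matcher = AhoCorasick(["he", "she", "his", "hers"])
hits = matcher.search("ushers")
```

In this example `hits` is `[(1, "she"), (2, "he"), (2, "hers")]`: the text is scanned once from left to right, and overlapping keywords are reported through the merged output lists rather than by rescanning.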
本申请实施例,可以实时分析互联网中的post数据并进行有价值数据提取,降低分布式文件(hadoop)和检索引擎数据存储,提升检索引擎读写数据能力,提取post数据中的有价值数据。The embodiments of the present application can analyze post data in the Internet in real time and extract valuable data, reduce distributed file (hadoop) and retrieval engine data storage, improve the retrieval engine's ability to read and write data, and extract valuable data in post data.
FIG. 7 is a schematic structural diagram of a data cleaning apparatus in an embodiment of the present application; the apparatus can execute the data cleaning method described in the above embodiments. Referring to FIG. 7, the apparatus includes a data extractor determination module 710, a target data extraction module 720, and a target data screening module 730.
The data extractor determination module 710 is configured to acquire data to be cleaned and to determine a target data extractor corresponding to the data to be cleaned.
The target data extraction module 720 is configured to parse the data to be cleaned and to extract, through the target data extractor, the target data contained in the data to be cleaned, where the target data includes at least one of an attribute name, attribute data, or tag text data.
The target data screening module 730 is configured to decode the target data and to screen the decoded target data according to a plurality of reference data uploaded by a client, so as to clean the data to be cleaned.
In the solution of this embodiment, the data extractor determination module acquires the data to be cleaned and determines the corresponding target data extractor; the target data extraction module parses the data to be cleaned and extracts the target data it contains through the target data extractor; and the target data screening module decodes the target data and screens the decoded target data according to the plurality of reference data uploaded by the client. Redundant data is thereby cleaned out of the data, which saves storage space and improves data transmission efficiency.
The data extractor determination module 710 is configured to identify the data format type contained in the data to be cleaned and to determine the target data extractor according to the data format type.
The data format type includes key-value pairs, Extensible Markup Language (XML), or JavaScript Object Notation (JSON).
The target data extractor includes a key-value pair extractor, an XML extractor, or a JSON extractor.
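The format identification described above might look like the following sketch. The application does not say how the format type is recognised, so the heuristics here (try JSON, then XML, then look for `=`) and the function name `detect_format` are assumptions made purely for illustration.

```python
import json
import xml.etree.ElementTree as ET


def detect_format(payload: str) -> str:
    """Guess the data format type of a post body (hypothetical heuristic)."""
    text = payload.strip()
    if text.startswith(("{", "[")):
        try:
            json.loads(text)
            return "json"
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass
    if text.startswith("<"):
        try:
            ET.fromstring(text)
            return "xml"
        except ET.ParseError:
            pass
    if "=" in text:
        # Assumed form-style body such as "a=1&b=2".
        return "key_value"
    return "unknown"
```

The returned label can then be used to look up the matching extractor, so that each format is parsed by exactly one dedicated component.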
The target data extraction module 720 is configured to extract, through the target data extractor, the key-value pairs, XML data, or JSON information contained in the data to be cleaned.
The target data extraction module 720 is further configured to extract the key-value pairs contained in the data to be cleaned through the key-value pair extractor;
or to extract the XML data contained in the data to be cleaned through the XML extractor;
or to extract the JSON information contained in the data to be cleaned through the JSON extractor.
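The three extractors can be illustrated with a short sketch. It assumes that the target data is represented as a list of `(name, value)` pairs covering attribute names, attribute data, and tag text; this representation and the function names are hypothetical and do not come from the application.

```python
import json
import xml.etree.ElementTree as ET


def extract_key_value(payload):
    """Split a form-style body into (name, raw value) pairs without decoding."""
    items = []
    for pair in payload.split("&"):
        if pair:
            name, _, value = pair.partition("=")
            items.append((name, value))
    return items


def extract_xml(payload):
    """Collect attribute name/value pairs and non-empty tag text."""
    items = []
    for elem in ET.fromstring(payload).iter():
        for name, value in elem.attrib.items():
            items.append((name, value))
        text = (elem.text or "").strip()
        if text:
            items.append((elem.tag, text))
    return items


def extract_json(payload):
    """Flatten a JSON document into (key, scalar value) pairs."""
    items = []

    def walk(node, key=""):
        if isinstance(node, dict):
            for k, v in node.items():
                walk(v, k)
        elif isinstance(node, list):
            for v in node:
                walk(v, key)
        else:
            items.append((key, node))

    walk(json.loads(payload))
    return items


# Hypothetical registry keyed by the detected data format type.
EXTRACTORS = {
    "key_value": extract_key_value,
    "xml": extract_xml,
    "json": extract_json,
}
```

Values are left undecoded at this stage so that decoding remains a separate, later step.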
The target data screening module 730 includes a decoding module configured to select a target decoder according to the encoding mode that was identified for the target data during parsing, and to decode the target data.
The encoding mode includes a base64 encoding mode, a decoder encoding mode, or an encryption encoding mode.
The decoding module is further configured to, if the encoding mode is the encryption encoding mode, obtain the encryption key corresponding to the encryption encoding mode and decode the target data according to the encryption key.
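Selecting a decoder by encoding mode could be sketched as below. The application names the modes but does not define them in detail, so everything beyond base64 is an assumption: the `"percent"` mode (URL percent-decoding) is only a stand-in for some text encoding, and the XOR routine is a toy placeholder for whatever cipher an encrypted mode would really use. It is not secure and is not the application's method.

```python
import base64
from typing import Optional
from urllib.parse import unquote


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher used only as a placeholder for real decryption."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def decode_target(value: str, mode: str, key: Optional[bytes] = None) -> str:
    """Decode one extracted value according to its (assumed) encoding mode."""
    if mode == "base64":
        return base64.b64decode(value).decode("utf-8")
    if mode == "percent":  # assumed mode, not named in the application
        return unquote(value)
    if mode == "encrypted":
        if key is None:
            raise ValueError("an encryption key is required for encrypted data")
        # Assumption: ciphertext is transported as base64 text.
        return xor_cipher(base64.b64decode(value), key).decode("utf-8")
    return value  # unknown mode: leave the value unchanged
```

Keeping the key lookup outside `decode_target` means the same function works regardless of where keys are stored.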
The target data screening module 730 is configured to compare the decoded target data with the plurality of reference data output by a data model uploaded by the client.
If first data in the target data matches the plurality of reference data, the first data is retained.
If second data in the target data matches none of the plurality of reference data, the second data is filtered out.
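The retain-or-filter rule can be sketched as follows. Two points are assumptions rather than statements from the application: an item is treated as matching when at least one reference string occurs in its name or value, and target data is again represented as `(name, value)` pairs. The function name `filter_target_data` is hypothetical.

```python
def filter_target_data(items, references):
    """Split decoded (name, value) pairs into retained and filtered-out lists."""
    kept, dropped = [], []
    for name, value in items:
        text = str(value)
        # Assumed matching rule: any reference appearing in the name or value.
        if any(ref in name or ref in text for ref in references):
            kept.append((name, value))
        else:
            dropped.append((name, value))
    return kept, dropped
```

For a large reference set, the inner substring scan could be replaced by a single pass of a multi-pattern matcher over each value.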
The data to be cleaned in this embodiment is post data.
The data cleaning apparatus provided in the embodiments of the present application can execute the data cleaning method provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to that method.
FIG. 8 is a schematic structural diagram of a data cleaning device provided by an embodiment of the present application. As shown in FIG. 8, the data cleaning device includes a processor 80, a memory 81, an input device 82, and an output device 83. The data cleaning device may contain one or more processors 80; FIG. 8 takes one processor 80 as an example. The processor 80, the memory 81, the input device 82, and the output device 83 may be connected by a bus or by other means; FIG. 8 takes a bus connection as an example.
As a computer-readable storage medium, the memory 81 can store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data cleaning method in the embodiments of the present application (for example, the data extractor determination module 710, the target data extraction module 720, and the target data screening module 730 of the data cleaning apparatus). By running the software programs, instructions, and modules stored in the memory 81, the processor 80 executes the various functional applications and data processing of the data cleaning device, that is, it implements the data cleaning method described above.
The memory 81 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function; the data storage area may store data created through use of the terminal, and the like. In addition, the memory 81 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 81 may include memory located remotely from the processor 80, and such remote memory may be connected to the data cleaning device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 82 may be configured to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the data cleaning device. The output device 83 may include a display device such as a display screen.
The embodiments of the present application further provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are configured to perform a data cleaning method, the method including:
acquiring data to be cleaned, and determining a target data extractor corresponding to the data to be cleaned;
parsing the data to be cleaned, and extracting, through the target data extractor, the target data contained in the data to be cleaned, where the target data includes at least one of an attribute name, attribute data, or tag text data;
decoding the target data, and screening the decoded target data according to a plurality of reference data uploaded by a client, so as to clean the data to be cleaned.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present application, the computer-executable instructions are not limited to the method operations described above and can also perform the related operations of the data cleaning method provided in any embodiment of the present application. The storage medium may be a non-transitory computer-readable storage medium.
From the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by means of software together with the necessary general-purpose hardware, and can of course also be implemented in hardware. On this understanding, the technical solution of the present application, in essence or in the part that contributes to the related art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), a hard disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
It is worth noting that, in the above embodiment of the data cleaning apparatus, the units and modules are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are used only to distinguish them from one another and do not limit the scope of protection of the present application.

Claims (10)

  1. A data cleaning method, comprising:
    acquiring data to be cleaned, and determining a target data extractor corresponding to the data to be cleaned;
    parsing the data to be cleaned, and extracting, through the target data extractor, target data contained in the data to be cleaned, wherein the target data comprises at least one of an attribute name, attribute data, or tag text data;
    decoding the target data, and screening the decoded target data according to a plurality of reference data uploaded by a client, so as to clean the data to be cleaned.
  2. The method according to claim 1, wherein the determining a target data extractor corresponding to the data to be cleaned comprises:
    identifying the data format type contained in the data to be cleaned, and determining the target data extractor according to the data format type;
    wherein the data format type comprises key-value pairs, Extensible Markup Language (XML), or JavaScript Object Notation (JSON);
    and the target data extractor comprises a key-value pair extractor, an XML extractor, or a JSON extractor.
  3. The method according to claim 2, wherein the extracting, through the target data extractor, target data contained in the data to be cleaned comprises:
    extracting, through the target data extractor, key-value pairs, XML data, or JSON information contained in the data to be cleaned;
    wherein the extracting, through the target data extractor, key-value pairs, XML data, or JSON information contained in the data to be cleaned comprises one of the following:
    extracting the key-value pairs contained in the data to be cleaned through the key-value pair extractor;
    extracting the XML data contained in the data to be cleaned through the XML extractor; and
    extracting the JSON information contained in the data to be cleaned through the JSON extractor.
  4. The method according to claim 1, wherein the decoding the target data comprises:
    selecting a target decoder according to the encoding mode identified for the target data during parsing, and decoding the target data;
    wherein the encoding mode comprises a base64 encoding mode, a decoder encoding mode, or an encryption encoding mode.
  5. The method according to claim 4, wherein the decoding the target data further comprises:
    in response to determining that the encoding mode is the encryption encoding mode, obtaining an encryption key corresponding to the encryption encoding mode, and decoding the target data according to the encryption key.
  6. The method according to claim 1, wherein the screening the decoded target data according to a plurality of reference data uploaded by a client, so as to clean the data to be cleaned, comprises:
    comparing the decoded target data with the plurality of reference data output by a data model uploaded by the client;
    in response to determining that first data in the target data matches the plurality of reference data, retaining the first data;
    in response to determining that second data in the target data matches none of the plurality of reference data, filtering out the second data.
  7. The method according to any one of claims 1-6, wherein the data to be cleaned is post data.
  8. A data cleaning apparatus, comprising:
    a data extractor determination module, configured to acquire data to be cleaned and to determine a target data extractor corresponding to the data to be cleaned;
    a target data extraction module, configured to parse the data to be cleaned and to extract, through the target data extractor, target data contained in the data to be cleaned, wherein the target data comprises at least one of an attribute name, attribute data, or tag text data;
    a target data screening module, configured to decode the target data and to screen the decoded target data according to a plurality of reference data uploaded by a client, so as to clean the data to be cleaned.
  9. A data cleaning device, comprising:
    one or more processors;
    a storage apparatus, configured to store one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data cleaning method according to any one of claims 1-7.
  10. A storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the data cleaning method according to any one of claims 1-7.
PCT/CN2021/120043 2020-12-16 2021-09-24 Data cleaning method, apparatus and device, and storage medium WO2022127259A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011490975.9 2020-12-16
CN202011490975.9A CN112612761B (en) 2020-12-16 2020-12-16 Data cleaning method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022127259A1 true WO2022127259A1 (en) 2022-06-23

Family

ID=75240187

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120043 WO2022127259A1 (en) 2020-12-16 2021-09-24 Data cleaning method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN112612761B (en)
WO (1) WO2022127259A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115002243A (en) * 2022-08-02 2022-09-02 上海秉匠信息科技有限公司 Data processing method and device
CN115543977A (en) * 2022-09-29 2022-12-30 河北雄安睿天科技有限公司 Water supply industry data cleaning method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612761B (en) * 2020-12-16 2024-01-30 北京锐安科技有限公司 Data cleaning method, device, equipment and storage medium
CN113988282A (en) * 2021-10-26 2022-01-28 平头哥(上海)半导体技术有限公司 Programmable access engine architecture for graph neural networks and graph applications

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984625A (en) * 2018-06-19 2018-12-11 平安科技(深圳)有限公司 Information filtering method, device, computer equipment and storage medium
CN109918367A (en) * 2019-03-19 2019-06-21 北京百度网讯科技有限公司 A kind of cleaning method of structural data, device, electronic equipment and storage medium
US20190205366A1 (en) * 2018-01-04 2019-07-04 Fujitsu Limited File generation method, file generation apparatus, and non-transitory computer-readable storage medium for storing program
CN110554877A (en) * 2019-09-05 2019-12-10 北京博睿宏远数据科技股份有限公司 JSON data analysis method, device, equipment and storage medium
CN112052414A (en) * 2020-10-09 2020-12-08 腾讯科技(深圳)有限公司 Data processing method and device and readable storage medium
CN112612761A (en) * 2020-12-16 2021-04-06 北京锐安科技有限公司 Data cleaning method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239581A (en) * 2017-07-07 2017-10-10 小草数语(北京)科技有限公司 Data cleaning method and device
CN108052665B (en) * 2017-12-29 2020-05-05 深圳市中易科技有限责任公司 Data cleaning method and device based on distributed platform
CN111640040A (en) * 2020-04-07 2020-09-08 国网新疆电力有限公司 Power supply customer value evaluation method based on customer portrait technology and big data platform

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115002243A (en) * 2022-08-02 2022-09-02 上海秉匠信息科技有限公司 Data processing method and device
CN115002243B (en) * 2022-08-02 2022-11-01 上海秉匠信息科技有限公司 Data processing method and device
CN115543977A (en) * 2022-09-29 2022-12-30 河北雄安睿天科技有限公司 Water supply industry data cleaning method

Also Published As

Publication number Publication date
CN112612761A (en) 2021-04-06
CN112612761B (en) 2024-01-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21905190

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21905190

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 160224)