CN113641815A - Data screening method and device based on big data and electronic equipment

Info

Publication number: CN113641815A
Application number: CN202110845992.8A
Authority: CN
Inventors: 吴博; 朱昕宇; 刘宜帆; 周春辉
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2021-07-26
Filing date: 2021-07-26
Publication date: 2021-11-12
Anticipated expiration: 2041-07-26
Also published as: CN113641815B

Abstract

The invention relates to a data screening method, a data screening device, electronic equipment and a computer readable storage medium based on big data, wherein the method comprises the steps of obtaining screening conditions, screening data to be screened according to the screening conditions, and obtaining data documents corresponding to the screening conditions; extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information; and performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result. The data screening method based on big data can simplify the data screening operation process and improve the data screening efficiency.

Description

Data screening method and device based on big data and electronic equipment

Technical Field

The invention relates to the technical field of internet, in particular to a data screening method and device based on big data, electronic equipment and a computer readable storage medium.

Background

As big data environments develop, data accumulates rapidly. Analyzing the value contained in massive data and screening out the valuable data is very important, so data screening holds a crucial position in the whole data processing flow. For example, in the e-commerce field, data documents containing condition, date, age and product specification information are screened. The purpose of data screening is to improve the usability of the data previously collected and stored, and to facilitate later data analysis.

In the prior art, data screening is typically carried out by exporting the data to an Excel table and then screening it manually. Another data screening method disclosed in the prior art lets the user configure customized screening settings for the required configuration information in a web page and generates a corresponding data screening template, so that screening no longer requires repeated manual export and screening.

However, these data screening methods do not sort the acquired data. The acquired data is therefore redundant and difficult to inspect visually, which makes the data screening operation complex and the data screening efficiency low.

Disclosure of Invention

In view of the above, it is necessary to provide a data screening method, an apparatus, an electronic device and a computer-readable storage medium based on big data, so as to solve the problems of complex data screening operation process and low data screening efficiency of big data documents in the e-commerce field in the prior art.

In order to solve the above problems, the present invention provides a data screening method based on big data, comprising:

obtaining screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;

extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;

and performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.

Further, obtaining a screening condition, and screening data to be screened according to the screening condition, specifically comprising:

taking at least one of a character, a character string and a hypertext link as the initial screening condition and at least one of condition, date, age and product specification information as the secondary screening condition, and screening the data to be screened according to the initial screening condition and the secondary screening condition.

Further, the extracting the data document by using the inverted index to obtain the user screening information specifically includes:

the data documents are numbered, the interior of each data document is divided into a plurality of words, the inverted index is utilized to enable each word and the data document number to form a corresponding relation, and the data documents are extracted through retrieval to obtain user screening information.

Further, the obtaining of the user screening information by retrieving and extracting the data documents by using the inverted index to form a corresponding relationship between each word and the data document number specifically includes:

adopting a hash table structure to perform inverted indexing on the data documents so as to obtain the corresponding relation from words to all the numbers of the data documents containing the words;

resolving the screening condition into a plurality of words, and inquiring the serial numbers of all data documents containing the words corresponding to the screening condition according to the corresponding relation;

and taking the intersection of all the queried data document numbers to obtain the user screening information.
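The three steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `search`, the whitespace-based word splitting and the small hand-written word-to-number table are all assumptions made for the example.

```python
def search(index, screening_condition):
    """Return the numbers of the documents that contain every word of the condition."""
    # Resolve the screening condition into words (assumed: split on whitespace).
    words = screening_condition.lower().split()
    if not words:
        return set()
    # Look up the document numbers recorded for each word.
    postings = [index.get(word, set()) for word in words]
    # Keep only the document numbers that appear for all words.
    return set.intersection(*postings)


# Hypothetical word -> document-number table (a Python dict is a hash table).
index = {
    "red": {0, 2},
    "cotton": {0, 1},
    "shirt": {0, 1},
}

print(search(index, "red cotton"))  # only document 0 contains both words
```

A word that is absent from the table contributes an empty set, so any condition containing it yields no documents, which is the expected behavior of an intersection-based query.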

Further, the performing inverted indexing on the data documents by using the hash table structure to obtain the corresponding relationship from the word to all the data document numbers containing the word specifically includes:

sequentially accessing each data document, looking up the value stored in the hash table for each word in the data document, and inserting the data document number into that value, so as to form the correspondence from each word to all data document numbers containing the word.
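A minimal sketch of this index construction, assuming a Python dict as the hash table and a simple regular-expression tokenizer standing in for the parser (both are illustrative choices, not details given in the text):

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split a document into lowercase words (a simple stand-in for the parser)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of numbers of the documents that contain it."""
    index = defaultdict(set)  # hash table: word -> set of document numbers
    for doc_number, text in enumerate(documents):  # documents are numbered from 0
        for word in tokenize(text):
            # Insert the document number into the entry stored for this word.
            index[word].add(doc_number)
    return index


# Hypothetical e-commerce documents used only for illustration.
documents = [
    "red cotton shirt size M",
    "blue cotton shirt size L",
    "red wool scarf",
]
index = build_inverted_index(documents)
print(sorted(index["red"]))     # numbers of the documents containing "red"
print(sorted(index["cotton"]))  # numbers of the documents containing "cotton"
```

Using a set for each entry means a word that occurs several times in one document still records that document number only once.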

Further, the prioritizing of the cleaned user screening information according to the query condition of the pre-calibrated user specifically includes:

and acquiring the relative frequency of the query conditions of the pre-calibrated user in the data documents in the cleaned user screening information, and performing priority ordering on the cleaned user screening information according to the relative frequency.

Further, obtaining the relative frequency of the query condition of the pre-calibrated user appearing in the data document in the cleaned user screening information, and performing priority ranking on the cleaned user screening information according to the relative frequency, specifically comprising:

acquiring the relative frequency of the query condition of the pre-calibrated user in the data document in the cleaned user screening information according to the query condition and the sorting characteristic function of the pre-calibrated user, and performing priority sorting on the cleaned user screening information according to the relative frequency;

the ranking characteristic function is

Wherein q is the query condition of the pre-calibrated user, d is a data document in the cleaned user screening information, f_i(d, q) is the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d, f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d, V is the number of data documents selected according to the query condition of the pre-calibrated user, N is the number of training data documents selected as part of the cleaned user screening information, and N_t is the total number of data documents in the cleaned user screening information.

The invention also provides a data screening device based on the big data, which comprises a data screening module, an information extraction module and a priority sorting module;

the data screening module is used for acquiring screening conditions, screening the data to be screened according to the screening conditions and acquiring data documents corresponding to the screening conditions;

the information extraction module is used for extracting the data documents by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;

and the priority ordering module is used for carrying out priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.

The invention further provides an electronic device, which comprises a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the data screening method based on big data according to any technical scheme is realized.

The present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein when the computer program is executed by a processor, the method for screening data based on big data according to any of the above technical solutions is implemented.

The beneficial effects of the above embodiments are as follows. The data screening method based on big data provided by the invention screens the data to be screened according to the screening conditions input by the user to obtain the related data documents. In the specific implementation, the data documents are numbered by means of the inverted index, which makes them easy to inspect visually, and are extracted by Boolean search to obtain the user screening information. The user screening information is cleaned so that it can be checked, the cleaned user screening information is prioritized, and a chart is generated using a chart library. The data screening operation is thereby simplified and the data screening efficiency is improved.

Drawings

FIG. 1 is a schematic view of an application scenario of a big data-based data screening apparatus provided in the present invention;

FIG. 2 is a schematic flow chart illustrating an embodiment of a big data-based data screening method according to the present invention;

FIG. 3 is a diagram illustrating a Boolean search method according to an embodiment of the present invention;

FIG. 4 is a block diagram of an embodiment of a big data-based data filtering apparatus according to the present invention;

FIG. 5 is a block diagram of an embodiment of an electronic device provided in the present invention.

Detailed Description

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.

The invention provides a data screening method and device based on big data, electronic equipment and a computer readable storage medium, which are respectively described in detail below.

FIG. 1 is a schematic view of an application scenario of the big data-based data filtering apparatus provided in the present invention. The system may include a server 100, in which the big data-based data filtering apparatus is integrated, as with the server in FIG. 1.

The server 100 in the embodiment of the present invention is mainly used for:

In this embodiment of the present invention, the server 100 may be an independent server, or may be a server network or a server cluster composed of servers, for example, the server 100 described in this embodiment of the present invention includes, but is not limited to, a computer, a network host, a single network server, a plurality of network server sets, or a cloud server composed of a plurality of servers. Among them, the Cloud server is constituted by a large number of computers or web servers based on Cloud Computing (Cloud Computing).

It is to be understood that the terminal 200 used in the embodiments of the present invention may be a device that includes both receiving and transmitting hardware, i.e., a device having receiving and transmitting hardware capable of performing two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device having a single line display or a multi-line display or a cellular or other communication device without a multi-line display. The specific terminal 200 may be a desktop, a laptop, a web server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, and the like, and the type of the terminal 200 is not limited in this embodiment.

Those skilled in the art will understand that the application environment shown in fig. 1 is only one application scenario of the present invention, and does not constitute a limitation on the application scenario of the present invention, and that other application environments may further include more or fewer terminals than those shown in fig. 1, for example, only 2 terminals are shown in fig. 1, and it is understood that the big data based data filtering apparatus may further include one or more other terminals, which is not limited herein.

In addition, as shown in fig. 1, the big data based data filtering apparatus may further include a memory 200 for storing data such as conditions, dates, ages, and product specifications.

It should be noted that the scenario diagram of the big data-based data filtering apparatus shown in fig. 1 is only an example, and the big data-based data filtering apparatus and the scenario described in the embodiment of the present invention are for more clearly illustrating the technical solution of the embodiment of the present invention, and do not form a limitation on the technical solution provided in the embodiment of the present invention.

An embodiment of the invention provides a data screening method based on big data. As shown in the flow diagram of FIG. 2, the method comprises the following steps:

step S201, obtaining screening conditions, screening data to be screened according to the screening conditions, and obtaining data documents corresponding to the screening conditions;

step S202, extracting the data documents by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;

and S203, performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated user to obtain a priority ordering result.

In a specific embodiment, the user screening information is cleaned as follows: a Bayesian formula or a decision tree method is used to delete and fill in data documents in the user screening information, and the resulting data documents are transferred to a data warehouse in the data document format of that warehouse, yielding the cleaned user screening information.
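The text does not spell out how the Bayesian formula is applied, so the following is only one possible reading, sketched under stated assumptions: records with too many empty fields are deleted, and a missing categorical field is filled with the value that is most probable given one other field. The function `clean`, its parameters and the sample records are hypothetical.

```python
from collections import Counter


def clean(records, target, given, max_missing=1):
    """Delete sparse records, then fill missing `target` values using Bayes' rule."""
    # Deletion: drop records with more than `max_missing` empty fields.
    kept = [r for r in records
            if sum(v is None for v in r.values()) <= max_missing]

    # Estimate counts from the records where both fields are present.
    complete = [r for r in kept
                if r[target] is not None and r[given] is not None]
    prior = Counter(r[target] for r in complete)              # count of each value v
    joint = Counter((r[target], r[given]) for r in complete)  # count of each (v, c)

    def most_probable(c):
        # P(v | c) is proportional to P(c | v) * P(v), which equals
        # joint[(v, c)] / total, so the largest joint count wins.
        return max(prior, key=lambda v: joint[(v, c)])

    cleaned = []
    for r in kept:
        r = dict(r)  # copy so the input records are left untouched
        if r[target] is None and r[given] is not None and prior:
            r[target] = most_probable(r[given])  # filling
        cleaned.append(r)
    return cleaned


records = [
    {"category": "shirt", "age": "adult", "spec": "M"},
    {"category": "shirt", "age": "adult", "spec": "L"},
    {"category": "toy", "age": "child", "spec": "S"},
    {"category": "shirt", "age": None, "spec": "M"},  # age is filled in
    {"category": None, "age": None, "spec": "S"},     # too sparse: deleted
]
cleaned = clean(records, target="age", given="category")
```

Transferring the cleaned records into the data warehouse format is left out here, since the warehouse schema is not described.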

As a preferred embodiment, the obtaining of the screening condition and the screening of the data to be screened according to the screening condition specifically include:

In a specific embodiment, a user input module and a condition screening module provide a user input text box and a screening condition box, respectively. The user enters keywords, key character strings or key hyperlinks in the input text box as the initial screening condition and checks conditions in the screening condition box as the secondary screening condition. The data to be screened is screened accordingly, and the data documents corresponding to the screening conditions are obtained.
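The two-stage screening can be illustrated with a short sketch. The `Document` class, the `screen` function and the attribute names are assumptions introduced for the example; the text only states that a keyword-style initial condition is combined with checked secondary conditions.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    attributes: dict = field(default_factory=dict)


def screen(documents, keyword, secondary):
    """Keep documents that contain the keyword and match every checked condition."""
    # Initial screening: keyword typed into the input text box.
    initial = [d for d in documents if keyword.lower() in d.text.lower()]
    # Secondary screening: conditions checked in the screening condition box.
    return [
        d for d in initial
        if all(d.attributes.get(name) == value for name, value in secondary.items())
    ]


docs = [
    Document("Red cotton shirt", {"condition": "new", "age": "adult"}),
    Document("Red cotton shirt", {"condition": "used", "age": "adult"}),
    Document("Blue wool scarf", {"condition": "new", "age": "adult"}),
]
selected = screen(docs, "shirt", {"condition": "new"})
```

An empty dictionary of secondary conditions leaves the result of the initial screening unchanged.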

As a preferred embodiment, the extracting the data document by using the inverted index to obtain the user filtering information specifically includes:

In a specific embodiment, each data document is numbered and sequentially arranged from 0, each data document is divided into a plurality of words by using a parser, and each word and the data document number form a corresponding relation by using an inverted index, so that the acquired data documents are prevented from being excessively redundant and are convenient to observe visually.

As a preferred embodiment, the obtaining of the user screening information by retrieving and extracting the data document by using the inverted index to make each word form a corresponding relationship with the data document number specifically includes:

The data document is extracted by a boolean search method.

As a preferred embodiment, the performing inverted indexing on the data documents by using the hash table structure to obtain the corresponding relationship from a word to all data document numbers containing the word specifically includes:

In a specific embodiment, a schematic diagram of the Boolean search method is shown in FIG. 3, in which a Boolean search is performed over the data documents containing screening condition A and the data documents containing screening condition B. The Boolean search comprises: numbering all data documents in advance, where the numbers may follow the format category + time node + number; each time a screening condition is selected, taking the user query as a screening node to obtain the data to be screened; screening the data to be screened to obtain all data document numbers that meet the screening conditions selected by the user; and taking the intersection of the screened data document numbers to obtain the user screening information.
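As a small illustration of the numbering format mentioned above, the sketch below composes a document number from a category, a time node and a running number. The hyphen separator, the date layout and the four-digit zero padding are assumptions; the text only names the three components.

```python
from datetime import date


def make_document_number(category, time_node, serial):
    """Compose a document number as category + time node + running number."""
    return f"{category}-{time_node:%Y%m%d}-{serial:04d}"


number = make_document_number("shirt", date(2021, 7, 26), 12)
print(number)  # shirt-20210726-0012
```

Because the category and time node lead the number, sorting the numbers as strings groups documents by category and then by date.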

As a preferred embodiment, the prioritizing the cleaned user screening information according to the pre-calibrated query condition of the user specifically includes:

It should be noted that, in this embodiment, prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user prevents an excessive number of data documents in the cleaned user screening information from hindering visual inspection, and improves the efficiency of screening the cleaned user screening information.

As a preferred embodiment, the method for obtaining the relative frequency of the query condition of the pre-calibrated user appearing in the data document in the cleaned user screening information and performing the priority ranking on the cleaned user screening information according to the relative frequency specifically includes:

the ranking characteristic function is

Wherein q is the query condition of the pre-calibrated user, d is a data document in the cleaned user screening information, f_i(d, q) is the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d, f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d, V is the number of data documents selected according to the query condition of the pre-calibrated user, N is the number of training data documents selected as part of the cleaned user screening information, and N_t is the total number of data documents in the cleaned user screening information.

In a specific embodiment, the query condition of the pre-calibrated user is q = {W1, W2, ..., Ws}, and the data documents in the cleaned user screening information are taken as candidate documents d = {d1, d2, ..., dk}, where q is the screening condition and d is the screened data documents. A score is calculated for q and each candidate document: s_k = score(q, d_k). f_i(d, q) is set to the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d, expressed as

Wherein f_i(d, q) is the ranking feature function, understood as the weight of the i-th word of the query condition q in the candidate document d, f_t(t_i, d) is the relative frequency with which the word t_i occurs in the candidate document d, V is the number of data documents selected according to the query condition of the pre-calibrated user, N is the number of training data documents selected as part of the cleaned user screening information, N_t is the total number of data documents in the cleaned user screening information, and the denominator is a normalization factor. The value of f_i(d, q) is reflected in the score s_k used for ordering: the larger the value, the higher the priority.
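The ordering step can be sketched in a few lines. The scoring used here is an assumption, not the ranking feature function of the invention, whose normalized form is not reproduced in the text: each candidate document is scored by summing the relative frequencies of the query words in that document, and the candidates are sorted by descending score. The names `relative_frequency`, `score` and `prioritize` are hypothetical.

```python
def relative_frequency(word, words):
    """Share of the document's words that are equal to `word`."""
    return words.count(word) / len(words) if words else 0.0


def score(query_words, document_text):
    """Assumed score s_k: sum of the relative frequencies of the query words."""
    words = document_text.lower().split()
    return sum(relative_frequency(w.lower(), words) for w in query_words)


def prioritize(query_words, documents):
    """Return (document index, score) pairs, highest score first."""
    scores = [score(query_words, d) for d in documents]
    order = sorted(range(len(documents)), key=lambda k: scores[k], reverse=True)
    return [(k, scores[k]) for k in order]


# Hypothetical candidate documents d_1 ... d_k.
documents = [
    "red shirt blue shirt green scarf",
    "cotton shirt",
    "wool scarf",
]
ranking = prioritize(["cotton", "shirt"], documents)
for k, s in ranking:
    print(k, round(s, 3))
```

In this example the second document ranks first because both query words make up its entire text, and the third document ranks last because it contains neither word.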

In another embodiment, the generating of the chart by the template generating module for the priority ranking result includes obtaining a screening condition, obtaining a screening data document, forming a mapping relationship between the screening condition and the screening data document, and calling a chart library to generate the chart through the mapping relationship.

An embodiment of the invention provides a data screening device based on big data. As shown in the structural block diagram of FIG. 4, the device comprises a data screening module 401, an information extraction module 402 and a prioritization module 403;

the data screening module 401 is configured to obtain a screening condition, screen the data to be screened according to the screening condition, and obtain a data document corresponding to the screening condition;

the information extraction module 402 is configured to extract the data document by using the inverted index to obtain user screening information, and clean the user screening information to obtain cleaned user screening information;

the prioritization module 403 is configured to prioritize the cleaned user screening information according to a pre-calibrated query condition of the user, so as to obtain a prioritization result.

The data screening module 401 includes a user input module, a condition screening module and an information uploading module. The user input module and the condition screening module provide a user input text box and a screening condition box, respectively; the user enters a keyword, a key character string or a key hyperlink in the input text box as the initial screening condition and checks conditions in the screening condition box as the secondary screening condition, thereby producing the screening condition. The information uploading module is used for uploading the obtained screening condition to the cloud database;

the information extraction module 402 performs the information extraction operation in the cloud database, including obtaining the screening condition information uploaded by the information uploading module, performing inverted indexing according to the screening condition information to extract data documents, performing a Boolean search on the extracted data documents to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;

the prioritization module 403 includes a prioritization module, a template generation module, and an information acquisition module, where the information acquisition module is configured to acquire data documents in the cleaned user screening information, the prioritization module is configured to prioritize the data documents in the cleaned user screening information according to a pre-calibrated user query condition, and the template generation module is configured to generate a chart for a prioritization result, so as to complete data screening.
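The division of work among the three modules can be sketched as a simple composition. All class names, method names and the simplified behavior of each module are illustrative assumptions: screening is reduced to a substring match, cleaning to removing empty and duplicate documents, and prioritization to sorting by the relative frequency of one query word.

```python
class DataScreeningModule:
    def run(self, documents, condition):
        """Keep the documents that mention the screening condition."""
        return [d for d in documents if condition.lower() in d.lower()]


class InformationExtractionModule:
    def run(self, documents):
        """Cleaning stand-in: drop empty documents and exact duplicates."""
        seen, cleaned = set(), []
        for d in documents:
            text = d.strip()
            if text and text not in seen:
                seen.add(text)
                cleaned.append(text)
        return cleaned


class PrioritizationModule:
    def run(self, documents, query_word):
        """Sort by the relative frequency of the query word, highest first."""
        def relative_frequency(d):
            words = d.lower().split()
            return words.count(query_word.lower()) / len(words)
        return sorted(documents, key=relative_frequency, reverse=True)


class DataScreeningDevice:
    """Chains the three modules in the order given by the method."""

    def __init__(self):
        self.screening = DataScreeningModule()
        self.extraction = InformationExtractionModule()
        self.prioritization = PrioritizationModule()

    def run(self, documents, condition):
        screened = self.screening.run(documents, condition)
        cleaned = self.extraction.run(screened)
        return self.prioritization.run(cleaned, condition)


documents = ["cotton shirt", "cotton cotton shirt", "wool scarf", "cotton shirt ", ""]
result = DataScreeningDevice().run(documents, "cotton")
print(result)
```

Keeping each module behind a single `run` method makes it straightforward to replace one stage, for example the cleaning step, without touching the others.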

Corresponding to the data screening method based on big data, the present invention also provides an electronic device, as shown in FIG. 5, which may be a mobile terminal, a desktop computer, a notebook, a palm computer, a server, or another computing device. The electronic device comprises a processor 10, a memory 20 and a display 30.

The memory 20 may in some embodiments be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The memory 20 may also be an external storage device of the computer device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device. Further, the memory 20 may also include both an internal storage unit and an external storage device of the computer device. The memory 20 is used for storing application software installed in the computer device and various data, such as program codes installed in the computer device. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a big data based data filtering method program 40, and the big data based data filtering method program 40 can be executed by the processor 10, so as to implement the big data based data filtering method according to the embodiments of the present invention.

The processor 10 may be, in some embodiments, a Central Processing Unit (CPU), microprocessor or other data Processing chip for executing program codes stored in the memory 20 or Processing data, such as executing a big data based data screening method.

The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the computer device and for displaying a visual user interface. The components 10-30 of the computer device communicate with each other via a system bus.

In one embodiment, when processor 10 executes big-data based data screening method program 40 in memory 20, the following steps are implemented:

The present embodiment also provides a computer-readable storage medium, on which a big data based data filtering method program is stored, and when executed by a processor, the big data based data filtering method program implements the following steps:

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A data screening method based on big data is characterized by comprising the following steps:

2. The big data-based data screening method according to claim 1, wherein the obtaining of the screening condition and the screening of the data to be screened according to the screening condition specifically include:

3. The big data-based data screening method according to claim 1, wherein the extracting the data documents by using the inverted index to obtain the user screening information specifically comprises:

4. The big data-based data screening method according to claim 3, wherein the obtaining of the user screening information by retrieving and extracting the data documents by using the inverted index to make each word form a corresponding relationship with the data document number specifically comprises:

5. The big data-based data screening method according to claim 4, wherein the data documents are inversely indexed by using the hash table structure to obtain correspondence from a word to all data document numbers containing the word, and the method specifically comprises:

sequentially accessing each data document, looking up the value stored in the hash table for each word in the data document, and inserting the data document number into that value, so as to determine the correspondence from each word to all data document numbers containing the word.

6. The big-data-based data screening method according to claim 1, wherein the prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user specifically comprises:

7. The big-data-based data screening method according to claim 6, wherein the method includes the steps of obtaining a relative frequency of a query condition of a pre-calibrated user appearing in a data document in the cleaned user screening information, and performing priority ranking on the cleaned user screening information according to the relative frequency, and specifically includes:

the ranking characteristic function is

8. A data screening device based on big data is characterized by comprising a data screening module, an information extraction module and a priority sorting module;

9. An electronic device comprising a processor and a memory, wherein the memory stores a computer program, and the computer program, when executed by the processor, implements the big-data based data filtering method according to any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the big data based data filtering method according to any one of claims 1 to 7.