CN113641815A - Data screening method and device based on big data and electronic equipment - Google Patents

Data screening method and device based on big data and electronic equipment

Info

Publication number
CN113641815A
Authority
CN
China
Prior art keywords
data
screening
user
information
cleaned
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110845992.8A
Other languages
Chinese (zh)
Other versions
CN113641815B (en)
Inventor
吴博
朱昕宇
刘宜帆
周春辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN202110845992.8A
Publication of CN113641815A
Application granted
Publication of CN113641815B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a big data-based data screening method and device, an electronic device and a computer-readable storage medium. The method comprises: obtaining screening conditions and screening the data to be screened according to the screening conditions to obtain the data documents corresponding to the screening conditions; extracting the data documents by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information; and prioritizing the cleaned user screening information according to the pre-calibrated query conditions of the user to obtain a prioritization result. The method simplifies the data screening operation process and improves data screening efficiency.

Description

Data screening method and device based on big data and electronic equipment
Technical Field
The invention relates to the technical field of internet, in particular to a data screening method and device based on big data, electronic equipment and a computer readable storage medium.
Background
With the development of the big data environment, data accumulate rapidly. Analyzing the value contained in massive data and screening out the valuable data is therefore very important, and data screening occupies a crucial position in the whole data processing flow. For example, in the e-commerce field, data documents containing condition, date, age and product specification information are screened. The purpose of data screening is to improve the usability of the data previously collected and stored, and to facilitate later data analysis.
In the prior art, data screening is carried out by exporting data to an Excel spreadsheet and then screening it manually. Another data screening method disclosed in the prior art performs customized screening configuration of the required configuration information in a web page and generates a corresponding data screening template, so that data can be screened without repeated manual export and screening.
However, these data screening methods do not sort the acquired data. The acquired data are redundant and difficult to inspect visually, so the data screening operation process is complex and the data screening efficiency is low.
Disclosure of Invention
In view of the above, it is necessary to provide a data screening method, an apparatus, an electronic device and a computer-readable storage medium based on big data, so as to solve the problems of complex data screening operation process and low data screening efficiency of big data documents in the e-commerce field in the prior art.
In order to solve the above problems, the present invention provides a data screening method based on big data, comprising:
obtaining screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.
Further, obtaining a screening condition, and screening data to be screened according to the screening condition, specifically comprising:
and screening the data to be screened according to the initial screening condition and the secondary screening condition by taking at least one of the characters, the character strings and the hypertext links as the initial screening condition and at least one of the conditions, the date, the age and the product specification information as the secondary screening condition.
Further, the extracting the data document by using the inverted index to obtain the user screening information specifically includes:
the data documents are numbered, the interior of each data document is divided into a plurality of words, the inverted index is utilized to enable each word and the data document number to form a corresponding relation, and the data documents are extracted through retrieval to obtain user screening information.
Further, the obtaining of the user screening information by retrieving and extracting the data documents by using the inverted index to form a corresponding relationship between each word and the data document number specifically includes:
adopting a hash table structure to perform inverted indexing on the data documents so as to obtain the corresponding relation from words to all the numbers of the data documents containing the words;
resolving the screening condition into a plurality of words, and inquiring the serial numbers of all data documents containing the words corresponding to the screening condition according to the corresponding relation;
and taking intersection sets of all the inquired data document numbers to obtain user screening information.
Further, the performing inverted indexing on the data documents by using the hash table structure to obtain the corresponding relationship from the word to all the data document numbers containing the word specifically includes:
and sequentially accessing each data document, looking up the value stored in the hash table for each word in the data document, and inserting the data document number into that value, so as to form the corresponding relation from the word to all data document numbers containing the word.
Further, the prioritizing of the cleaned user screening information according to the pre-calibrated query condition of the user specifically includes:
and acquiring the relative frequency of the query conditions of the pre-calibrated user in the data documents in the cleaned user screening information, and performing priority ordering on the cleaned user screening information according to the relative frequency.
Further, obtaining the relative frequency of the query condition of the pre-calibrated user appearing in the data document in the cleaned user screening information, and performing priority ranking on the cleaned user screening information according to the relative frequency, specifically comprising:
acquiring the relative frequency of the query condition of the pre-calibrated user in the data document in the cleaned user screening information according to the query condition and the sorting characteristic function of the pre-calibrated user, and performing priority sorting on the cleaned user screening information according to the relative frequency;
the ranking characteristic function is
[Equation image BDA0003180576470000031; the formula is not reproduced in the text]
wherein q is the pre-calibrated query condition of the user, d is a data document in the cleaned user screening information, f_i(d, q) is the relative frequency with which the i-th word of the query condition q occurs in the data document d, f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d, V is the number of data documents selected according to the pre-calibrated query condition of the user, N is the number of training data documents selected as part of the cleaned user screening information, and N_t is the total number of data documents in the cleaned user screening information.
The invention also provides a data screening device based on the big data, which comprises a data screening module, an information extraction module and a priority sorting module;
the data screening module is used for acquiring screening conditions, screening the data to be screened according to the screening conditions and acquiring data documents corresponding to the screening conditions;
the information extraction module is used for extracting the data documents by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and the priority ordering module is used for carrying out priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.
The invention further provides an electronic device, which comprises a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the data screening method based on big data according to any technical scheme is realized.
The present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein when the computer program is executed by a processor, the method for screening data based on big data according to any of the above technical solutions is implemented.
The beneficial effects of the above embodiments are as follows: the big data-based data screening method provided by the invention screens the data to be screened according to screening conditions input by a user to obtain the relevant data documents. In the specific implementation, the data documents are numbered and indexed with an inverted index, which makes them easy to inspect visually; the data documents are extracted by Boolean search to obtain user screening information; the user screening information is cleaned so that it can be checked; and the cleaned user screening information is prioritized and a chart is generated with a chart library. The data screening operation process is thereby simplified and the data screening efficiency is improved.
Drawings
FIG. 1 is a schematic view of an application scenario of a big data-based data screening apparatus provided in the present invention;
FIG. 2 is a schematic flow chart illustrating an embodiment of a big data-based data screening method according to the present invention;
FIG. 3 is a diagram illustrating a Boolean search method according to an embodiment of the present invention;
FIG. 4 is a block diagram of an embodiment of a big data-based data screening apparatus according to the present invention;
FIG. 5 is a block diagram of an embodiment of an electronic device provided in the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
The invention provides a data screening method and device based on big data, electronic equipment and a computer readable storage medium, which are respectively described in detail below.
FIG. 1 is a schematic view of an application scenario of the big data-based data screening apparatus provided in the present invention. The system may include a server 100, and the big data-based data screening apparatus is integrated in the server 100, as shown in FIG. 1.
The server 100 in the embodiment of the present invention is mainly used for:
obtaining screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.
In this embodiment of the present invention, the server 100 may be an independent server, or may be a server network or a server cluster composed of servers, for example, the server 100 described in this embodiment of the present invention includes, but is not limited to, a computer, a network host, a single network server, a plurality of network server sets, or a cloud server composed of a plurality of servers. Among them, the Cloud server is constituted by a large number of computers or web servers based on Cloud Computing (Cloud Computing).
It is to be understood that the terminal 200 used in the embodiments of the present invention may be a device that includes both receiving and transmitting hardware, i.e., a device having receiving and transmitting hardware capable of performing two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device having a single line display or a multi-line display or a cellular or other communication device without a multi-line display. The specific terminal 200 may be a desktop, a laptop, a web server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, and the like, and the type of the terminal 200 is not limited in this embodiment.
Those skilled in the art will understand that the application environment shown in fig. 1 is only one application scenario of the present invention, and does not constitute a limitation on the application scenario of the present invention, and that other application environments may further include more or fewer terminals than those shown in fig. 1, for example, only 2 terminals are shown in fig. 1, and it is understood that the big data based data filtering apparatus may further include one or more other terminals, which is not limited herein.
In addition, as shown in fig. 1, the big data based data filtering apparatus may further include a memory 200 for storing data such as conditions, dates, ages, and product specifications.
It should be noted that the scenario diagram of the big data-based data filtering apparatus shown in fig. 1 is only an example, and the big data-based data filtering apparatus and the scenario described in the embodiment of the present invention are for more clearly illustrating the technical solution of the embodiment of the present invention, and do not form a limitation on the technical solution provided in the embodiment of the present invention.
An embodiment of the invention provides a big data-based data screening method. As shown in the flow diagram of FIG. 2, the method comprises the following steps:
step S201, obtaining screening conditions, screening data to be screened according to the screening conditions, and obtaining data documents corresponding to the screening conditions;
step S202, extracting the data documents by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and step S203, performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated user to obtain a priority ordering result.
In a specific embodiment, the user screening information is cleaned as follows: a Bayesian formula or a decision tree method is used to delete data documents from, and fill in missing values of, the user screening information, and the resulting data documents are transferred to a data warehouse in the data document format of that warehouse to obtain the cleaned user screening information.
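The delete-and-fill step can be illustrated with a minimal sketch. The source names a Bayesian formula or a decision tree but gives no details, so the sketch below swaps in a much simpler stand-in: records with too many missing fields are deleted, and the remaining gaps are filled with the most frequent observed value of the field. The function name `clean`, the `max_missing` threshold and the field names are assumptions made for illustration only.

```python
from collections import Counter


def clean(records, fields, max_missing=1):
    """Delete sparse records, then fill remaining gaps with each field's mode.

    This mode-based fill is a simplified stand-in for the Bayesian or
    decision-tree filling mentioned in the text, not the patented method.
    """
    # Deletion: drop records that lack more than max_missing fields.
    kept = [dict(r) for r in records
            if sum(r.get(f) is None for f in fields) <= max_missing]
    # Filling: replace each missing value with the field's most common value.
    for f in fields:
        observed = [r[f] for r in kept if r.get(f) is not None]
        if not observed:
            continue  # no observed value to fill from
        mode = Counter(observed).most_common(1)[0][0]
        for r in kept:
            if r.get(f) is None:
                r[f] = mode
    return kept


# Hypothetical e-commerce records with missing values.
records = [
    {"condition": "new", "age": 25, "spec": "M"},
    {"condition": None, "age": 25, "spec": "L"},
    {"condition": None, "age": None, "spec": None},
    {"condition": "new", "age": 30, "spec": "M"},
]
cleaned = clean(records, ["condition", "age", "spec"])
```

Here the third record is deleted because all three fields are missing, and the missing `condition` of the second record is filled with the value observed in the other kept records.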
As a preferred embodiment, the obtaining of the screening condition and the screening of the data to be screened according to the screening condition specifically include:
and screening the data to be screened according to the initial screening condition and the secondary screening condition by taking at least one of the characters, the character strings and the hypertext links as the initial screening condition and at least one of the conditions, the date, the age and the product specification information as the secondary screening condition.
In a specific embodiment, a user input module and a condition screening module respectively provide a user input text box and a screening condition box. The user enters keywords, key character strings or key hypertext links in the input text box as the initial screening conditions and checks conditions in the screening condition box as the secondary screening conditions. The data to be screened are then screened to obtain the data documents corresponding to the screening conditions.
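The two-stage screening described above can be sketched as follows. The record layout `DataDocument`, the function names, and the rule that a document must contain every keyword are assumptions made for illustration; the source does not specify how the initial and secondary conditions are matched.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataDocument:
    """Hypothetical record layout; the source does not define one."""
    text: str
    condition: str
    day: date
    age: int
    spec: str


def initial_screen(docs, keywords):
    """Keep documents whose text contains every keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [d for d in docs if all(k in d.text.lower() for k in lowered)]


def secondary_screen(docs, **checked):
    """Keep documents whose fields equal every checked secondary condition."""
    return [d for d in docs
            if all(getattr(d, name) == value for name, value in checked.items())]


docs = [
    DataDocument("red cotton shirt", "new", date(2021, 7, 1), 25, "M"),
    DataDocument("red wool scarf", "used", date(2021, 7, 2), 30, "one size"),
    DataDocument("blue cotton shirt", "new", date(2021, 7, 1), 25, "L"),
]
# Initial condition: keyword "red"; secondary condition: condition == "new".
hits = secondary_screen(initial_screen(docs, ["red"]), condition="new")
```

Running the initial screen first keeps the secondary screen cheap, since it only inspects documents that already matched the text conditions.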
As a preferred embodiment, the extracting the data document by using the inverted index to obtain the user filtering information specifically includes:
the data documents are numbered, the interior of each data document is divided into a plurality of words, the inverted index is utilized to enable each word and the data document number to form a corresponding relation, and the data documents are extracted through retrieval to obtain user screening information.
In a specific embodiment, each data document is numbered and sequentially arranged from 0, each data document is divided into a plurality of words by using a parser, and each word and the data document number form a corresponding relation by using an inverted index, so that the acquired data documents are prevented from being excessively redundant and are convenient to observe visually.
As a preferred embodiment, the obtaining of the user screening information by retrieving and extracting the data document by using the inverted index to make each word form a corresponding relationship with the data document number specifically includes:
adopting a hash table structure to perform inverted indexing on the data documents so as to obtain the corresponding relation from words to all the numbers of the data documents containing the words;
resolving the screening condition into a plurality of words, and inquiring the serial numbers of all data documents containing the words corresponding to the screening condition according to the corresponding relation;
and taking intersection sets of all the inquired data document numbers to obtain user screening information.
The data documents are extracted by a Boolean search method.
As a preferred embodiment, the performing inverted indexing on the data documents by using the hash table structure to obtain the corresponding relationship from a word to all data document numbers containing the word specifically includes:
and sequentially accessing each data document, looking up the value stored in the hash table for each word in the data document, and inserting the data document number into that value, so as to form the corresponding relation from the word to all data document numbers containing the word.
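A minimal sketch of a hash-table inverted index, assuming a Python dictionary as the hash table and a simple regular-expression tokenizer in place of the parser mentioned in the text. The names `tokenize` and `build_inverted_index` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (a stand-in for the parser)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map each word to the ascending list of document numbers containing it."""
    index = defaultdict(list)
    # Documents are numbered from 0 in the order they are visited.
    for doc_id, text in enumerate(documents):
        # set() ensures a document number is inserted once per word.
        for word in set(tokenize(text)):
            index[word].append(doc_id)
    return dict(index)


documents = ["red cotton shirt", "red wool scarf", "blue cotton shirt"]
index = build_inverted_index(documents)
```

Because the documents are visited in numbering order, each posting list is already sorted, which keeps later intersection steps simple.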
In a specific embodiment, a schematic diagram of the Boolean search method is shown in FIG. 3, in which a Boolean search is performed over the data documents containing screening condition A and the data documents containing screening condition B. The Boolean search comprises: numbering all the data documents in advance, where a number may take the format category + time node + serial number; each time the user selects a screening condition, taking the user query as a screening node to obtain the data to be screened; screening the data to be screened to obtain the numbers of all data documents that satisfy the screening conditions selected by the user; and taking the intersection of the resulting data document numbers to obtain the user screening information.
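The intersection step of the Boolean search can be sketched as below, assuming the index is a dictionary from word to a list of document numbers. The function name `boolean_and` and the sample index are hypothetical, and plain integers are used instead of the category + time node + serial number format for brevity.

```python
def boolean_and(index, query_words):
    """Return the numbers of the documents that contain every query word."""
    # A word absent from the index has an empty posting set.
    postings = [set(index.get(w, ())) for w in query_words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical index: word -> numbers of the documents containing it.
index = {
    "red": [0, 1],
    "cotton": [0, 2],
    "shirt": [0, 2],
    "scarf": [1],
}
both = boolean_and(index, ["red", "cotton"])
```

In this sample, condition A ("red") matches documents 0 and 1 and condition B ("cotton") matches documents 0 and 2, so only document 0 remains after the intersection.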
As a preferred embodiment, the prioritizing the cleaned user screening information according to the pre-calibrated query condition of the user specifically includes:
and acquiring the relative frequency of the query conditions of the pre-calibrated user in the data documents in the cleaned user screening information, and performing priority ordering on the cleaned user screening information according to the relative frequency.
It should be noted that, in this embodiment, prioritizing the cleaned user screening information according to the pre-calibrated query condition of the user prevents an excessive number of data documents in the cleaned user screening information from hindering visual inspection, and improves the efficiency of screening the cleaned user screening information.
As a preferred embodiment, the method for obtaining the relative frequency of the query condition of the pre-calibrated user appearing in the data document in the cleaned user screening information and performing the priority ranking on the cleaned user screening information according to the relative frequency specifically includes:
acquiring the relative frequency of the query condition of the pre-calibrated user in the data document in the cleaned user screening information according to the query condition and the sorting characteristic function of the pre-calibrated user, and performing priority sorting on the cleaned user screening information according to the relative frequency;
the ranking characteristic function is
[Equation image BDA0003180576470000091; the formula is not reproduced in the text]
wherein q is the pre-calibrated query condition of the user, d is a data document in the cleaned user screening information, f_i(d, q) is the relative frequency with which the i-th word of the query condition q occurs in the data document d, f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d, V is the number of data documents selected according to the pre-calibrated query condition of the user, N is the number of training data documents selected as part of the cleaned user screening information, and N_t is the total number of data documents in the cleaned user screening information.
In a specific embodiment, the pre-calibrated query condition of the user is q = {W1, W2, ..., Ws}, and the data documents in the cleaned user screening information are taken as candidate documents d = {d1, d2, ..., dk}, where q is the screening condition and d is the screened data documents. A score is calculated for q and each candidate document d_k: s_k = score(q, d_k). f_i(d, q) is set to the relative frequency with which the i-th word of the query condition q occurs in the data document d, expressed as
[Equation image BDA0003180576470000101; the formula is not reproduced in the text]
wherein f_i(d, q) is the ranking characteristic function, understood as the weight of the i-th word of the query condition q in the candidate document d; f_t(t_i, d) is the relative frequency with which the word t_i occurs in the candidate document d; V is the number of data documents selected according to the pre-calibrated query condition of the user; N is the number of training data documents selected as part of the cleaned user screening information; N_t is the total number of data documents in the cleaned user screening information; and the denominator is a normalization factor. The value of f_i(d, q) is expressed by the score s_k used for ordering: the larger the value, the higher the priority.
In another embodiment, a template generating module generates a chart from the priority ranking result. This comprises obtaining the screening conditions, obtaining the screened data documents, forming a mapping relationship between the screening conditions and the screened data documents, and calling a chart library to generate the chart from the mapping relationship.
An embodiment of the invention provides a big data-based data screening device. As shown in the structural block diagram of FIG. 4, the device comprises a data screening module 401, an information extraction module 402 and a prioritization module 403;
the data screening module 401 is configured to obtain a screening condition, screen the data to be screened according to the screening condition, and obtain a data document corresponding to the screening condition;
the information extraction module 402 is configured to extract the data document by using the inverted index to obtain user screening information, and clean the user screening information to obtain cleaned user screening information;
the prioritization module 403 is configured to prioritize the cleaned user screening information according to a pre-calibrated query condition of the user, so as to obtain a prioritization result.
The data screening module 401 includes a user input module, a condition screening module and an information uploading module. The user input module and the condition screening module respectively provide a user input text box and a screening condition box; the user enters a keyword, a key character string or a key hypertext link in the input text box as the initial screening condition and checks conditions in the screening condition box as the secondary screening condition, which together form the screening conditions. The information uploading module uploads the obtained screening conditions to the cloud database;
the information extraction module 402 performs the information extraction operation in the cloud database, which includes obtaining the screening condition information uploaded by the information uploading module, extracting data documents by inverted indexing according to the screening condition information, performing a Boolean search on the extracted data documents to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
the prioritization module 403 includes a priority ordering submodule, a template generation module and an information acquisition module, where the information acquisition module is configured to acquire the data documents in the cleaned user screening information, the priority ordering submodule is configured to prioritize those data documents according to the pre-calibrated query condition of the user, and the template generation module is configured to generate a chart from the prioritization result, so as to complete the data screening.
Corresponding to the big data-based data screening method, the present invention also provides an electronic device, shown in FIG. 5, which may be a mobile terminal, a desktop computer, a notebook, a palm computer, a server, or another computing device. The electronic device comprises a processor 10, a memory 20 and a display 30.
The memory 20 may in some embodiments be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The memory 20 may also be an external storage device of the computer device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device. Further, the memory 20 may also include both an internal storage unit and an external storage device of the computer device. The memory 20 is used for storing application software installed in the computer device and various data, such as program codes installed in the computer device. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a big data-based data screening method program 40, which can be executed by the processor 10 to implement the big data-based data screening method according to the embodiments of the present invention.
The processor 10 may be, in some embodiments, a Central Processing Unit (CPU), microprocessor or other data Processing chip for executing program codes stored in the memory 20 or Processing data, such as executing a big data based data screening method.
The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the computer device and for displaying a visual user interface. The components 10-30 of the computer device communicate with each other via a system bus.
In one embodiment, when processor 10 executes big-data based data screening method program 40 in memory 20, the following steps are implemented:
obtaining screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.
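The three steps above can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the function names, the substring-based screening, the whitespace/duplicate cleaning rule, and the single-word query are all assumptions made for the example, and the inverted-index extraction is omitted here for brevity.

```python
from collections import Counter


def screen(documents, condition):
    """Step 1 (assumed rule): keep documents whose text contains the condition."""
    return {num: text for num, text in documents.items() if condition in text}


def clean(screened):
    """Step 2 (assumed rule): normalize whitespace, drop empty and duplicate texts."""
    seen, cleaned = set(), {}
    for num, text in screened.items():
        normalized = " ".join(text.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned[num] = normalized
    return cleaned


def prioritize(cleaned, query_word):
    """Step 3: order document numbers by the relative frequency of the query word."""
    def relative_frequency(text):
        words = text.lower().split()
        return Counter(words)[query_word.lower()] / len(words)

    return sorted(cleaned, key=lambda num: relative_frequency(cleaned[num]), reverse=True)


def run_pipeline(documents, condition, query_word):
    """Chain the three steps: screen, clean, prioritize."""
    return prioritize(clean(screen(documents, condition)), query_word)
```

For example, `run_pipeline({1: "ship route  ship", 2: "ship route ship", 3: "port schedule"}, "ship", "ship")` screens out document 3, drops document 2 as a duplicate of document 1 after whitespace normalization, and returns the remaining document number.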
The present embodiment also provides a computer-readable storage medium on which a big-data-based data screening method program is stored; when executed by a processor, the big-data-based data screening method program implements the following steps:
obtaining screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A data screening method based on big data is characterized by comprising the following steps:
obtaining screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and performing priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.
2. The big data-based data screening method according to claim 1, wherein the obtaining of the screening condition and the screening of the data to be screened according to the screening condition specifically include:
taking at least one of a character, a character string, and a hypertext link as an initial screening condition, taking at least one of a condition, a date, an age, and product specification information as a secondary screening condition, and screening the data to be screened according to the initial screening condition and the secondary screening condition.
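The two-stage screening of claim 2 can be illustrated roughly as follows. The `Record` type, its fields, and the parameter names `earliest` and `max_age` are hypothetical choices for this sketch; the claim itself only names the kinds of conditions (strings and links first, then attributes such as date and age).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical record shape used only for this illustration."""
    text: str
    published: date
    age: int


def initial_screen(records, keywords):
    """Initial screening: keep records whose text contains at least one keyword string."""
    return [r for r in records if any(k in r.text for k in keywords)]


def secondary_screen(records, earliest=None, max_age=None):
    """Secondary screening: apply optional date and age limits (None means not applied)."""
    kept = []
    for r in records:
        if earliest is not None and r.published < earliest:
            continue
        if max_age is not None and r.age > max_age:
            continue
        kept.append(r)
    return kept
```

Running the cheap string match first narrows the candidate set before the attribute comparisons are evaluated, which is one plausible reason for splitting the conditions into two stages.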
3. The big data-based data screening method according to claim 1, wherein the extracting the data documents by using the inverted index to obtain the user screening information specifically comprises:
numbering the data documents, dividing the content of each data document into a plurality of words, using the inverted index to form a correspondence between each word and the data document numbers, and retrieving and extracting the data documents to obtain the user screening information.
4. The big data-based data screening method according to claim 3, wherein using the inverted index to form a correspondence between each word and the data document numbers, and retrieving and extracting the data documents to obtain the user screening information, specifically comprises:
adopting a hash table structure to perform inverted indexing on the data documents so as to obtain the corresponding relation from words to all the numbers of the data documents containing the words;
parsing the screening condition into a plurality of words, and querying, according to the corresponding relation, the numbers of all data documents containing each word of the screening condition;
and taking the intersection of all the queried sets of data document numbers to obtain the user screening information.
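The query step of claim 4 can be sketched as below, assuming the inverted index is already available as a Python `dict` mapping each word to a `set` of data document numbers. Whitespace tokenization and lowercasing are simplifying assumptions; real Chinese-language text would need a proper word segmenter.

```python
def query_documents(index, condition):
    """Return the numbers of documents that contain every word of the condition."""
    words = condition.lower().split()
    if not words:
        return set()
    # A word missing from the index contributes an empty posting set,
    # which makes the overall intersection empty.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)
```

With `index = {"ship": {1, 2, 4}, "route": {1, 4}}`, the call `query_documents(index, "ship route")` intersects the two posting sets and yields the documents that contain both words.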
5. The big data-based data screening method according to claim 4, wherein the data documents are inversely indexed by using the hash table structure to obtain correspondence from a word to all data document numbers containing the word, and the method specifically comprises:
and sequentially accessing each data document, acquiring the value corresponding to each word of the data document in the hash table, and inserting the data document number into that value, so as to determine the corresponding relation from each word to the numbers of all data documents containing the word.
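A minimal sketch of the index construction in claim 5, using Python's built-in `dict` (a hash table) with a `set` of document numbers as the value for each word. The function name and the whitespace tokenization are assumptions for illustration only.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of numbers of the documents that contain it.

    `documents` maps a document number to its text.
    """
    index = defaultdict(set)
    for doc_number, text in documents.items():  # visit each document in turn
        for word in text.lower().split():       # assumed tokenization: whitespace
            # Look up the word's value in the hash table and
            # insert the current document number into it.
            index[word].add(doc_number)
    return dict(index)
```

Building the index takes one pass over all words of all documents; afterwards a lookup of any single word is an average constant-time hash table access.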
6. The big-data-based data screening method according to claim 1, wherein the prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user specifically comprises:
and acquiring the relative frequency of the query conditions of the pre-calibrated user in the data documents in the cleaned user screening information, and performing priority ordering on the cleaned user screening information according to the relative frequency.
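The relative-frequency ordering of claim 6 might look like the following sketch. Here relative frequency is taken to be the count of a word divided by the total number of words in the document, and a multi-word query is scored by summing the per-word relative frequencies; both choices are assumptions for illustration and are not the ranking characteristic function of claim 7.

```python
from collections import Counter


def relative_frequency(word, text):
    """Occurrences of `word` divided by the total word count of `text`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return Counter(words)[word.lower()] / len(words)


def rank_by_relative_frequency(documents, query):
    """Return document numbers ordered from highest to lowest score."""
    query_words = query.lower().split()
    scores = {
        number: sum(relative_frequency(w, text) for w in query_words)
        for number, text in documents.items()
    }
    return sorted(scores, key=lambda number: scores[number], reverse=True)
```

Dividing by document length keeps long documents from ranking first merely because they contain more words overall.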
7. The big-data-based data screening method according to claim 6, wherein acquiring the relative frequency with which the query condition of the pre-calibrated user appears in the data documents in the cleaned user screening information, and performing priority ordering on the cleaned user screening information according to the relative frequency, specifically comprises:
acquiring the relative frequency of the query condition of the pre-calibrated user in the data document in the cleaned user screening information according to the query condition and the sorting characteristic function of the pre-calibrated user, and performing priority sorting on the cleaned user screening information according to the relative frequency;
the ranking characteristic function is
Figure FDA0003180576460000021
wherein q is the query condition of the pre-calibrated user, d is a data document in the cleaned user screening information, f_i(d, q) is the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d, f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d, V is the number of data documents selected according to the query condition of the pre-calibrated user, N is the number of training data documents selected as part of the cleaned user screening information, and N_t is the total number of data documents in the cleaned user screening information.
8. A data screening device based on big data is characterized by comprising a data screening module, an information extraction module and a priority sorting module;
the data screening module is used for acquiring screening conditions, screening the data to be screened according to the screening conditions and acquiring data documents corresponding to the screening conditions;
the information extraction module is used for extracting the data documents by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and the priority ordering module is used for carrying out priority ordering on the cleaned user screening information according to the query conditions of the pre-calibrated users to obtain a priority ordering result.
9. An electronic device comprising a processor and a memory, wherein the memory stores a computer program, and the computer program, when executed by the processor, implements the big-data-based data screening method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the big-data-based data screening method according to any one of claims 1 to 7.
CN202110845992.8A 2021-07-26 2021-07-26 Data screening method and device based on big data and electronic equipment Active CN113641815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110845992.8A CN113641815B (en) 2021-07-26 2021-07-26 Data screening method and device based on big data and electronic equipment

Publications (2)

Publication Number Publication Date
CN113641815A true CN113641815A (en) 2021-11-12
CN113641815B CN113641815B (en) 2023-06-13

Family

ID=78418374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110845992.8A Active CN113641815B (en) 2021-07-26 2021-07-26 Data screening method and device based on big data and electronic equipment

Country Status (1)

Country Link
CN (1) CN113641815B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361042A (en) * 2014-10-29 2015-02-18 中国建设银行股份有限公司 Information retrieval method and device
CN107341221A (en) * 2017-06-28 2017-11-10 百度在线网络技术(北京)有限公司 Foundation, associative search method, apparatus, equipment and the storage medium of index structure
CN108680163A (en) * 2018-04-25 2018-10-19 武汉理工大学 A kind of unmanned boat route search system and method based on topological map
US20190057159A1 (en) * 2017-08-15 2019-02-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, server, and storage medium for recalling for search
CN110377558A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Document searching method, device, computer equipment and storage medium
CN110727663A (en) * 2019-09-09 2020-01-24 光通天下网络科技股份有限公司 Data cleaning method, device, equipment and medium
WO2020142649A1 (en) * 2019-01-04 2020-07-09 Proofpoint, Inc. System and method for scalable file filtering using wildcards
CN111522905A (en) * 2020-04-15 2020-08-11 武汉灯塔之光科技有限公司 Document searching method and device based on database
CN112540986A (en) * 2020-12-07 2021-03-23 吴娟 Dynamic indexing method and system for quick combined query of big electric power data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
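Zhang Yingjie (张英杰): "Research and Application of a Feature Selection Method Based on Document-Level Term-Frequency Reordering" (基于文档层词频重排序的特征选择方法的研究与应用), CNKI (《知网》), pages 1 - 58 *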

Also Published As

Publication number Publication date
CN113641815B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
EP2823410B1 (en) Entity augmentation service from latent relational data
US8082264B2 (en) Automated scheme for identifying user intent in real-time
CN109766438A (en) Biographic information extracting method, device, computer equipment and storage medium
US9946753B2 (en) Method and system for document indexing and data querying
US11100121B1 (en) Systems and methods for electronically mining intellectual property
CN113822067A (en) Key information extraction method and device, computer equipment and storage medium
US20130232157A1 (en) Systems and methods for processing unstructured numerical data
US8631097B1 (en) Methods and systems for finding a mobile and non-mobile page pair
CN111813905B (en) Corpus generation method, corpus generation device, computer equipment and storage medium
WO2009009192A2 (en) Adaptive archive data management
CN107844493B (en) File association method and system
CN110232126B (en) Hot spot mining method, server and computer readable storage medium
CN108427702B (en) Target document acquisition method and application server
CN110334343B (en) Method and system for extracting personal privacy information in contract
CN113407785A (en) Data processing method and system based on distributed storage system
CN111400323A (en) Data retrieval method, system, device and storage medium
CN113886708A (en) Product recommendation method, device, equipment and storage medium based on user information
CN113220672A (en) Military and civil fusion policy information database system
CN110569419A (en) question-answering system optimization method and device, computer equipment and storage medium
CN106570196B (en) Video program searching method and device
CN114117242A (en) Data query method and device, computer equipment and storage medium
CN113641815B (en) Data screening method and device based on big data and electronic equipment
JP2012104051A (en) Document index creating device
Joglekar et al. Search engine optimization using unsupervised learning
EP2026216A1 (en) Data processing method, computer program product and data processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant