CN113505273B - Data sorting method, device, equipment and medium based on repeated data screening - Google Patents

Data sorting method, device, equipment and medium based on repeated data screening

Info

Publication number
CN113505273B
Authority
CN
China
Prior art keywords
data
result sequence
classification
classification result
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110566211.1A
Other languages
Chinese (zh)
Other versions
CN113505273A (en
Inventor
李珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202110566211.1A
Publication of CN113505273A
Application granted
Publication of CN113505273B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/904 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the field of intelligent decision making, and discloses a data sorting method based on repeated data screening, which comprises the following steps: performing relevance screening and sorting on a preset resource data set according to a received query request to obtain a query result sequence; performing label classification on the query result sequence to obtain a first classification result sequence; performing relevance classification on the first classification result sequence to obtain a second classification result sequence; screening the second classification result sequence for repeated data, and performing an exponential down-weighting calculation on the screened repeated data to obtain a third classification result sequence; and sorting all the resource data in the third classification result sequence according to the relevance score corresponding to each resource data to obtain a target query result sequence. The invention also relates to blockchain technology, and the query result sequence can be stored in a blockchain node. The invention also provides a data sorting device, equipment and medium based on repeated data screening. The invention can improve the efficiency of data sorting.

Description

Data sorting method, device, equipment and medium based on repeated data screening
Technical Field
The present invention relates to the field of intelligent decision making, and in particular, to a method and apparatus for sorting data based on repeated data screening, an electronic device, and a readable storage medium.
Background
Currently, data sorting is very widely applied in the field of data retrieval and data recommendation. In such search and recommendation scenarios, the searched or recommended data is generally scored for relevance, and all the data is displayed in a descending order of score from high to low.
However, retrieved or recommended data are usually abundant and often contain duplicates, so current data sorting methods tend to display identical or similar data stacked together. Such stacks of similar content occupy a large amount of display space, which makes useful information hard to find and lowers the efficiency of data sorting.
Disclosure of Invention
The invention provides a data sorting method, a device, electronic equipment and a computer readable storage medium based on repeated data screening, and mainly aims to improve the efficiency of data sorting.
In order to achieve the above object, the present invention provides a data sorting method based on repeated data screening, including:
carrying out relevance screening and sorting on a preset resource data set according to a received query request to obtain a query result sequence;
performing label classification on the query result sequence to obtain a first classification result sequence;
carrying out relevance classification on the first classification result sequence to obtain a second classification result sequence;
screening the second classification result sequence for repeated data, and performing an exponential down-weighting calculation on the screened repeated data to obtain a third classification result sequence;
sorting all the resource data in the third classification result sequence according to the relevance scores corresponding to each resource data to obtain a target query result sequence;
and sending the target query result sequence to the terminal equipment corresponding to the query request.
Optionally, the performing relevance screening and sorting on the preset resource data set according to the query request to obtain a query result sequence, including:
extracting a query field in the query request, and converting the query field into a vector to obtain a query vector;
converting each resource data in the resource data set into a vector to obtain a corresponding resource vector;
calculating the relevance of the query vector and the resource vector to obtain a corresponding relevance score;
screening the resource data in the resource data set whose relevance score is greater than a preset relevance value to obtain an initial query result sequence;
and sorting all the resource data in the initial query result sequence according to the corresponding relevance scores to obtain the query result sequence.
Optionally, the performing relevance classification on the first classification result sequence to obtain a second classification result sequence includes:
constructing a score interval according to the query result sequence;
and classifying the first classification result sequence by using the score interval to obtain the second classification result sequence.
Optionally, the constructing a score interval according to the query result sequence includes:
screening the maximum correlation score of the query result sequence to obtain first interval data;
screening the minimum correlation score of the query result sequence to obtain second interval data;
averaging the first interval data and the second interval data to obtain third interval data;
and constructing two continuous intervals by taking the first interval data, the second interval data and the third interval data as interval endpoint values to obtain the score interval.
Optionally, the screening the second classification result sequence for repeated data, and performing an exponential down-weighting calculation on the screened repeated data to obtain a third classification result sequence, includes:
coding each resource data in the second classification result sequence by using a preset algorithm to obtain a corresponding data code;
calculating the text distance of any two data codes in all data codes corresponding to the second classification result sequence;
determining the text distance smaller than a preset threshold value as a similar text distance;
performing association classification on the resource data corresponding to all similar text distances in the second classification result sequence to obtain a repeated data list;
and performing an exponential down-weighting calculation on the resource data in the repeated data list corresponding to the second classification result sequence to obtain the third classification result sequence.
Optionally, the performing association classification on the resource data corresponding to all the similar text distances in the second classification result sequence to obtain a repeated data list includes:
taking resource data corresponding to all similar text distances in the second classification result sequence as nodes to carry out tree classification to obtain a classification tree;
and sorting all the resource data corresponding to the classification tree according to the relevance score corresponding to each resource data to obtain the repeated data list.
Optionally, the performing an exponential down-weighting calculation on the resource data in the repeated data list corresponding to the second classification result sequence to obtain the third classification result sequence includes:
performing an exponential calculation on the preset sorting positions and the relevance scores corresponding to all the resource data in the repeated data list corresponding to the second classification result sequence to obtain corresponding updated relevance scores;
and replacing the corresponding relevance score by the updated relevance score to obtain the third classification result sequence.
In order to solve the above problems, the present invention further provides a data sorting device based on repeated data screening, the device comprising:
the data classification module is used for carrying out correlation screening and sorting on a preset resource data set according to the received query request to obtain a query result sequence; performing label classification on the query result sequence to obtain a first classification result sequence; carrying out relevance classification on the first classification result sequence to obtain a second classification result sequence;
the data screening module is used for screening the second classification result sequence for repeated data, and performing an exponential down-weighting calculation on the screened repeated data to obtain a third classification result sequence;
the data sorting module is used for sorting all the resource data in the third classification result sequence according to the relevance score corresponding to each resource data to obtain a target query result sequence; and sending the target query result sequence to the terminal equipment corresponding to the query request.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
a memory storing at least one computer program; and
a processor that executes the computer program stored in the memory to implement the above data sorting method based on repeated data screening.
In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the above-mentioned data sorting method based on repeated data screening.
According to the embodiment of the invention, relevance screening and sorting are performed on the preset resource data set according to the received query request to obtain a query result sequence. Label classification is performed on the query result sequence to obtain a first classification result sequence, so that data with different labels are separated and similar data are not displayed bundled together. Relevance classification is performed on the first classification result sequence to obtain a second classification result sequence, so that the data under each label are divided into high-relevance and low-relevance groups, which yields a more balanced ordering. The second classification result sequence is screened for repeated data, and an exponential down-weighting calculation is performed on the screened repeated data to obtain a third classification result sequence. All the resource data in the third classification result sequence are sorted according to their relevance scores to obtain a target query result sequence; because the relevance scores of repeated data have been reduced, similar data are no longer bundled together, the displayed results are more varied, and the efficiency of data sorting is improved. Finally, the target query result sequence is sent to the terminal equipment corresponding to the query request. Therefore, the data sorting method, the device, the electronic equipment and the readable storage medium based on repeated data screening improve the efficiency of data sorting.
Drawings
FIG. 1 is a flowchart of a data sorting method based on repeated data screening according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a data sorting apparatus based on repeated data screening according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an internal structure of an electronic device for implementing a data sorting method based on repeated data screening according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiment of the application provides a data sorting method based on repeated data screening. The execution subject of the data sorting method based on repeated data screening includes, but is not limited to, at least one of a server, a terminal, and other electronic devices that can be configured to execute the method provided by the embodiment of the application. In other words, the data sorting method based on repeated data screening may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to fig. 1, a flowchart of a data sorting method based on repeated data screening according to an embodiment of the present invention is shown, where in the embodiment of the present invention, the data sorting method based on repeated data screening includes:
S1, carrying out relevance screening and sorting on a preset resource data set according to a received query request to obtain a query result sequence;
In an embodiment of the present invention, the query request includes a query field; the resource data set is a set containing different resource data, and the resource data may be consultation data, product data, activity data, and the like.
In detail, in the embodiment of the invention, a query field in the query request is extracted, and the query field is converted into a vector to obtain a query vector; converting each resource data in the resource data set into a vector to obtain a corresponding resource vector; calculating the relevance of the query vector and the resource vector to obtain a corresponding relevance score; data screening is carried out on the resource data set according to the relevance score, and an initial query result sequence is obtained; and sequencing all the resource data in the initial query result sequence according to the corresponding relevance score to obtain the query result sequence.
Optionally, in the embodiment of the present invention, vector conversion may be performed by using a preset Word2vec model trained through transfer learning on professional domain knowledge texts (such as teaching materials and training materials). Further, in the embodiment of the invention, the resource data in the resource data set whose relevance score is greater than a preset relevance value are screened out to obtain the initial query result sequence.
Optionally, the embodiment of the present invention may calculate the relevance with the following formula:
[formula not reproduced in the source text]
where $X_i$ denotes the i-th element of the query vector X, $Y_i$ denotes the i-th element of the resource vector Y, n denotes the number of elements in each vector, and Sim denotes the relevance score of the query vector X and the resource vector Y.
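The scoring, screening and sorting described for S1 can be sketched as follows. This is only an illustration under stated assumptions: the relevance formula itself is not reproduced in this text, so cosine similarity is assumed here as one common measure that fits the symbols $X_i$, $Y_i$, n and Sim. The function names, the toy vectors and the threshold 0.5 are hypothetical.

```python
import math

def cosine_sim(x, y):
    """Assumed relevance measure: cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a zero vector is treated as unrelated
    return dot / (norm_x * norm_y)

def query_result_sequence(query_vec, resource_vecs, min_score):
    """Score every resource, keep those above the preset value, sort descending."""
    scored = [(name, cosine_sim(query_vec, vec)) for name, vec in resource_vecs.items()]
    kept = [(name, score) for name, score in scored if score > min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical two-dimensional vectors standing in for Word2vec output.
resources = {"r1": [1.0, 0.0], "r2": [1.0, 1.0], "r3": [0.0, 1.0]}
ranked = query_result_sequence([1.0, 0.0], resources, min_score=0.5)
```

Here `r3` is orthogonal to the query and is screened out, while `r1` and `r2` remain in descending order of score.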
In another embodiment of the present invention, the query result sequence may be stored in a blockchain node, and the high throughput characteristic of the blockchain is utilized to improve the data access efficiency.
S2, carrying out label classification on the query result sequence to obtain a first classification result sequence;
in the embodiment of the invention, the first classification result sequence contains resource data with different attribute categories, and in order to better order the resource data in the query result sequence, the query result sequence is subjected to label classification to obtain the first classification result sequence.
In detail, in the implementation of the present invention, the label classification of the query result sequence includes: labeling each resource data in the query result sequence by using a preset label classification model to obtain a label query result sequence; and classifying all the resource data in the tag query result sequence according to different tags to obtain the corresponding first classification result sequence.
In the embodiment of the invention, different resource data in the query result sequence may carry different labels, and each type of label corresponds to one first classification result sequence, so label classification of the query result sequence yields a plurality of first classification result sequences. For example, if all the resource data in the query result sequence are labeled with 4 types of labels, each type of label corresponds to one first classification result sequence, and 4 first classification result sequences are obtained in total.
In the embodiment of the invention, the label classification model may be a deep learning model constructed by a Bert network.
In detail, before each resource data in the query result sequence is labeled by using the preset label classification model, the embodiment of the present invention further includes: acquiring a historical resource data set; labeling the historical resource data set with preset labels to obtain a training set, where in one application scenario the preset labels may include information labels, credit card product labels, insurance product labels, mall commodity labels, preferential activity labels, and the like; and iteratively training a pre-constructed deep learning model by using the training set to obtain the label classification model.
Optionally, in another embodiment of the present invention, each resource data in the query result sequence already carries a corresponding label, so the preset label classification model is not needed to label the resource data; the resource data corresponding to the same resource category label in the query result sequence are grouped together to obtain the corresponding first classification result sequence.
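The grouping step of S2 can be sketched as follows, assuming the labels are already attached to each resource (as in the embodiment that does not need a label classification model). The record layout, the label names and the function name are hypothetical, and no Bert model is involved in this sketch.

```python
from collections import defaultdict

def classify_by_label(query_results):
    """Group resources by label; each group is one first classification result sequence."""
    groups = defaultdict(list)
    for item in query_results:
        groups[item["label"]].append(item)  # input order is kept within each group
    return dict(groups)

# Hypothetical query result sequence, already sorted by relevance score.
query_results = [
    {"id": "a", "label": "information", "score": 0.9},
    {"id": "b", "label": "credit_card", "score": 0.8},
    {"id": "c", "label": "information", "score": 0.6},
]
first_sequences = classify_by_label(query_results)
```

With two distinct labels in the input, two first classification result sequences are produced.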
S3, carrying out relevance classification on the first classification result sequence to obtain a second classification result sequence;
In the embodiment of the present invention, as known from S1 above, each resource data in the first classification result sequence has a relevance score. To ensure that all the resource data in the first classification result sequence are ordered more evenly, and to prevent resource data with high relevance scores from crowding out resource data with low relevance scores and thereby affecting the accuracy of data sorting, relevance classification is performed on the first classification result sequence.
In detail, in the embodiment of the present invention, performing relevance classification on the first classification result sequence includes: and constructing a score interval according to the query result sequence, and classifying the first classification result sequence by using the score interval to obtain the second classification result sequence.
Further, in the embodiment of the present invention, constructing a score interval according to the query result sequence includes: screening out the maximum relevance score and the minimum relevance score of the query result sequence, and averaging the maximum relevance score and the minimum relevance score to obtain an average relevance score; and constructing two consecutive intervals by taking the maximum relevance score, the minimum relevance score and the average relevance score as interval endpoint values to obtain the score interval. The resource data in each first classification result sequence are then classified according to the interval into which their relevance scores fall. In another embodiment of the present invention, all the resource data in the first classification result sequence are additionally sorted by relevance score. The number of second classification result sequences corresponding to each first classification result sequence is determined by the number of intervals contained in the score interval; for example, if the score interval contains two intervals, each first classification result sequence corresponds to 2 second classification result sequences.
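The interval construction and the relevance classification can be sketched as follows. The source does not say which interval the average score itself belongs to, so assigning it to the high interval is an assumption; the function names and the sample scores are hypothetical.

```python
def build_score_intervals(query_scores):
    """Two consecutive intervals whose endpoints are the min, the average of min and max, and the max."""
    highest = max(query_scores)
    lowest = min(query_scores)
    average = (highest + lowest) / 2
    return (lowest, average), (average, highest)

def classify_by_relevance(first_sequence, intervals):
    """Split one first classification result sequence into high and low groups."""
    (_, average), _ = intervals
    # Assumption: a score equal to the average endpoint goes to the high group.
    high = [item for item in first_sequence if item[1] >= average]
    low = [item for item in first_sequence if item[1] < average]
    return high, low

# Scores of the whole query result sequence (hypothetical values).
query_scores = [10, 8, 6, 4]
intervals = build_score_intervals(query_scores)
# One first classification result sequence as (resource, score) pairs.
high, low = classify_by_relevance([("a", 10), ("c", 6)], intervals)
```

Each first classification result sequence thus yields two second classification result sequences, one per interval.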
S4, screening the second classification result sequence for repeated data, and performing an exponential down-weighting calculation on the screened repeated data to obtain a third classification result sequence;
In the embodiment of the invention, displaying repeated or similar resource data bundled together narrows the variety of data displayed. To prevent this, the second classification result sequence is screened for repeated data, and an exponential down-weighting calculation is performed on the screened repeated data to obtain a third classification result sequence, where the repeated data include identical or similar data.
In detail, in the embodiment of the present invention, screening the second classification result sequence for repeated data and performing an exponential down-weighting calculation on the screened repeated data to obtain a third classification result sequence includes: coding each resource data in the second classification result sequence by using a preset algorithm to obtain a data code corresponding to each resource data; calculating the text distance of any two data codes in all data codes corresponding to the second classification result sequence; determining the text distance smaller than a preset threshold value as a similar text distance; performing association classification on the resource data corresponding to all similar text distances in the second classification result sequence to obtain a repeated data list; and performing an exponential down-weighting calculation on the resource data in the repeated data list corresponding to the second classification result sequence to obtain the third classification result sequence.
Optionally, in the embodiment of the present invention, the preset algorithm is a simhash algorithm.
In detail, in the embodiment of the present invention, performing association classification on the resource data corresponding to all similar text distances in the second classification result sequence to obtain a repeated data list includes: taking the resource data corresponding to all similar text distances in the second classification result sequence as nodes to carry out tree classification to obtain a classification tree; and sorting all the resource data corresponding to the classification tree according to the relevance score corresponding to each resource data to obtain a corresponding repeated data list. For example, if the text distance between A and B is a similar text distance, the text distance between A and C is a similar text distance, and the text distance between B and E is a similar text distance, then a classification tree is constructed with A as the node of the first layer, B and C as the nodes of the second layer, and E as the node of the third layer.
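The A, B, C, E example can be sketched as a breadth-first traversal over the similar pairs. The source does not state how the root is chosen or how the tree is stored, so passing the root explicitly and representing the tree as a list of layers are assumptions; the relevance scores are hypothetical.

```python
from collections import defaultdict

def build_layers(similar_pairs, root):
    """Arrange resources linked by similar text distances into tree layers."""
    neighbours = defaultdict(list)
    for u, v in similar_pairs:
        neighbours[u].append(v)
        neighbours[v].append(u)
    layers, seen, frontier = [], {root}, [root]
    while frontier:
        layers.append(frontier)
        next_frontier = []
        for node in frontier:
            for nb in neighbours[node]:
                if nb not in seen:  # each resource appears in the tree once
                    seen.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return layers

pairs = [("A", "B"), ("A", "C"), ("B", "E")]
layers = build_layers(pairs, root="A")

# Hypothetical relevance scores; the repeated data list is the tree's
# resources sorted by score in descending order.
scores = {"A": 10, "B": 8, "C": 9, "E": 7}
repeated_list = sorted((n for layer in layers for n in layer),
                       key=scores.get, reverse=True)
```

The layers reproduce the example in the text: A on the first layer, B and C on the second, and E on the third.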
Further, in order to avoid bundling of repeated data, in the embodiment of the present invention, performing the exponential down-weighting calculation on the resource data in the repeated data list includes: performing an exponential calculation on the preset sorting positions and the relevance scores corresponding to all the resource data in the repeated data list corresponding to the second classification result sequence to obtain the corresponding updated relevance scores.
Further, the embodiment of the invention replaces the corresponding relevance score by the updated relevance score to obtain the third classification result sequence.
Optionally, the preset ordering position is the second one.
Optionally, in the embodiment of the present invention, the following formula is used to perform the exponential calculation:
$$N = a^{\lg i} \cdot C_i$$
where a is a preset sorting parameter, preferably a = 0.5; $C_i$ is the relevance score corresponding to the i-th resource data in the repeated data list; i is the sorting number of the resource data in the repeated data list; and N is the updated relevance score of the i-th resource data in the repeated data list.
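A minimal sketch of this calculation follows, assuming that lg is the base-10 logarithm and that the sorting number i starts at 1. The function name and the sample scores are hypothetical.

```python
import math

def decay_scores(repeated_list, a=0.5):
    """Apply N = a**lg(i) * C_i to a repeated data list sorted by score."""
    updated = []
    for i, (name, score) in enumerate(repeated_list, start=1):
        factor = a ** math.log10(i)  # i = 1 gives factor 1, so the top item is unchanged
        updated.append((name, factor * score))
    return updated

# Hypothetical repeated data list as (resource, relevance score) pairs.
repeated = [("A", 10.0), ("C", 9.0), ("B", 8.0)]
updated = decay_scores(repeated)
```

With a = 0.5 the first item keeps its score, every later item is scaled by a factor below 1, and the tenth item would be scaled by exactly 0.5, so duplicates sink gradually rather than being removed.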
Optionally, in the embodiment of the present invention, the text distance of the two data codes is a hamming distance of the corresponding two data codes.
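The coding and distance steps can be sketched as follows. This is a simplified simhash written for illustration only: whitespace tokenization, MD5 as the token hash, equal token weights and the 64-bit code length are all assumptions, and Chinese text would need a word segmenter instead of `split()`. The threshold value is hypothetical.

```python
import hashlib
from itertools import combinations

def simhash(text, bits=64):
    """Simplified simhash: each token votes +1 or -1 on every bit position."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")

def similar_pairs(codes, threshold):
    """Pairs of resources whose code distance is below the preset threshold."""
    return [(p, q) for (p, cp), (q, cq) in combinations(codes.items(), 2)
            if hamming(cp, cq) < threshold]

# Hand-made 4-bit codes keep the example easy to check by eye.
codes = {"A": 0b0000, "B": 0b0001, "C": 0b1111}
pairs = similar_pairs(codes, threshold=2)
```

Only A and B differ in fewer than two bits, so they form the single similar pair that would be passed on to the association classification.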
S5, sorting all the resource data in the third classification result sequence according to the relevance score corresponding to each resource data to obtain a target query result sequence;
In the embodiment of the present invention, as can be seen from the above description, there are a plurality of second classification result sequences, and therefore a plurality of third classification result sequences. Further, in the embodiment of the present invention, all the resource data in the third classification result sequences are sorted according to the relevance score corresponding to each resource data to obtain a target query result sequence. For example, suppose there are two third classification result sequences: one contains resource data A and resource data B, where the relevance score of A is 10 and the relevance score of B is 8; the other contains resource data C and resource data D, where the relevance score of C is 9 and the relevance score of D is 7. All the resource data in the third classification result sequences are then A, B, C and D, and sorting them according to relevance score gives the target query result sequence [A, C, B, D].
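The merge in S5 can be sketched with the scores from the example (A = 10, B = 8, C = 9, D = 7). The function name and the data layout are hypothetical.

```python
def target_sequence(third_sequences):
    """Merge all third classification result sequences and sort by score, descending."""
    merged = [item for seq in third_sequences for item in seq]
    ranked = sorted(merged, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked]

third_sequences = [
    [("A", 10), ("B", 8)],
    [("C", 9), ("D", 7)],
]
result = target_sequence(third_sequences)  # ['A', 'C', 'B', 'D']
```

Because the down-weighted scores are compared across all sequences at once, resources from different labels and intervals interleave in the final list.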
S6, the target query result sequence is sent to the terminal equipment corresponding to the query request.
In detail, in the embodiment of the present invention, the target query result sequence is sent to the terminal device corresponding to the query request, where the terminal device includes intelligent terminals such as computers, tablets and mobile phones. For example, if the user initiates a query request on mobile phone A, the target query result sequence is sent to mobile phone A so that the user can view it conveniently.
FIG. 2 is a functional block diagram of a data sorting apparatus according to the present invention based on repeated data screening.
The data sorting device 100 based on repeated data screening according to the present invention may be installed in an electronic device. Depending on the functions implemented, the data sorting device based on repeated data screening may include a data classification module 101, a data screening module 102, and a data sorting module 103. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
The data classification module 101 is configured to perform relevance screening and sorting on a preset resource data set according to a received query request, so as to obtain a query result sequence; performing label classification on the query result sequence to obtain a first classification result sequence; carrying out relevance classification on the first classification result sequence to obtain a second classification result sequence;
In an embodiment of the present invention, the query request includes a query field; the resource data set is a set containing different resource data, and the resource data may be consultation data, product data, activity data, and the like.
In detail, in the embodiment of the present invention, the data classification module 101 extracts a query field in the query request, and converts the query field into a vector to obtain a query vector; converting each resource data in the resource data set into a vector to obtain a corresponding resource vector; calculating the relevance of the query vector and the resource vector to obtain a corresponding relevance score; data screening is carried out on the resource data set according to the relevance score, and an initial query result sequence is obtained; and sequencing all the resource data in the initial query result sequence according to the corresponding relevance score to obtain the query result sequence.
Optionally, in the embodiment of the present invention, the data classification module 101 may perform vector conversion by using a Word2vec model trained through transfer learning on preset texts of professional domain knowledge (e.g., teaching materials and training materials). Further, in the embodiment of the present invention, the data classification module 101 selects the resource data in the resource data set whose relevance score is greater than a preset relevance value to obtain the initial query result sequence.
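The screening and sorting step described above can be sketched as follows. This is a minimal illustration that assumes the relevance scores have already been computed; the function name `screen_and_sort`, the resource identifiers, the threshold value and the descending sort order are all assumptions made for the example and are not taken from the patent text.

```python
from typing import Dict, List, Tuple


def screen_and_sort(scores: Dict[str, float],
                    min_relevance: float) -> List[Tuple[str, float]]:
    """Keep resources scoring above the threshold, then sort by score."""
    # Screening: keep only resources whose score exceeds the preset value.
    initial = [(rid, s) for rid, s in scores.items() if s > min_relevance]
    # Sorting: highest relevance first (descending order is assumed here).
    return sorted(initial, key=lambda item: item[1], reverse=True)


# Hypothetical relevance scores for four resources.
scores = {"news_1": 0.91, "card_7": 0.42, "promo_3": 0.77, "shop_9": 0.15}
query_result_sequence = screen_and_sort(scores, min_relevance=0.4)
print(query_result_sequence)
```

Only "shop_9" falls below the hypothetical threshold of 0.4, so it is dropped before sorting.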
Alternatively, the data classification module 101 according to the embodiment of the present invention may calculate the correlation with the following formula:
where $X_i$ is the i-th element of the query vector X, $Y_i$ is the i-th element of the resource vector Y, n is the number of elements in each vector, and Sim is the relevance score of the query vector X and the resource vector Y.
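The formula itself is not reproduced in this text. One common measure that is consistent with the symbols described (element-wise terms over two vectors of equal length, yielding a single score Sim) is cosine similarity; the sketch below uses it purely as an assumption, and the function name `cosine_similarity` is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of elements")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_x * norm_y)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```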
In the embodiment of the present invention, the first classification result sequence includes resource data with different attribute categories, and in order to better order the resource data in the query result sequence, the data classification module 101 performs label classification on the query result sequence to obtain the first classification result sequence.
In another embodiment of the present invention, the query result sequence may be stored in a blockchain node, and the high throughput characteristic of the blockchain is utilized to improve the data access efficiency.
In detail, in the implementation of the present invention, the data classification module 101 performs tag classification on the query result sequence, including: labeling each resource data in the query result sequence by using a preset label classification model to obtain a label query result sequence; and classifying all the resource data in the tag query result sequence according to different tags to obtain the corresponding first classification result sequence.
In the embodiment of the invention, different resource data in the query result sequence carry different labels, and each type of label corresponds to one first classification result sequence, so label classification of the query result sequence yields a plurality of first classification result sequences. For example, if all the resource data in the query result sequence are marked with 4 types of labels, each type of label corresponds to one first classification result sequence, and 4 first classification result sequences are obtained in total.
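Grouping by label can be sketched as below. The sketch assumes each resource already has a label (for instance from a classifier); the function name `classify_by_label`, the label strings and the scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Item = Tuple[str, float]  # (resource id, relevance score)


def classify_by_label(query_results: List[Item],
                      labels: Dict[str, str]) -> Dict[str, List[Item]]:
    """Split the query result sequence into one sequence per label."""
    groups: Dict[str, List[Item]] = defaultdict(list)
    for rid, score in query_results:
        # Appending in input order preserves the ordering within each label.
        groups[labels[rid]].append((rid, score))
    return dict(groups)


query_results = [("r1", 0.9), ("r2", 0.8), ("r3", 0.7), ("r4", 0.6)]
labels = {"r1": "information", "r2": "credit_card",
          "r3": "information", "r4": "insurance"}
first_sequences = classify_by_label(query_results, labels)
print(first_sequences)
```

With three distinct labels in the input, three first classification result sequences are produced.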
In the embodiment of the invention, the label classification model may be a deep learning model constructed by a Bert network.
In detail, before the data classification module 101 in the embodiment of the present invention labels each resource data in the query result sequence by using a preset label classification model, the method further includes: acquiring a historical resource data set; marking the historical resource data set with preset labels to obtain a training set, where in one application scenario the preset labels may include information class labels, credit card product class labels, insurance product class labels, mall commodity class labels, preferential activity class labels, and the like; and performing iterative training on the pre-constructed deep learning model by using the training set to obtain the label classification model.
Optionally, in another embodiment of the present invention, each resource data in the query result sequence already includes a corresponding label, so a preset label classification model is not required to label each resource data in the query result sequence; the data classification module 101 gathers the resource data corresponding to the same resource category label in the query result sequence to obtain a corresponding first classification result sequence.
In the embodiment of the present invention, each resource data in the first classification result sequence has a relevance score. To ensure that all the resource data in the first classification result sequence are ordered more uniformly, and to prevent resource data with high relevance scores from interfering with resource data with low relevance scores in subsequent processing and affecting the accuracy of data ordering, the data classification module 101 performs relevance classification on the first classification result sequence to obtain the second classification result sequence.
In detail, in the embodiment of the present invention, the data classification module 101 performs relevance classification on the first classification result sequence, including: and constructing a score interval according to the query result sequence, and classifying the first classification result sequence by using the score interval to obtain the second classification result sequence.
Further, in the embodiment of the present invention, the data classification module 101 constructs a score interval according to the query result sequence, including: selecting the maximum relevance score and the minimum relevance score of the query result sequence, and averaging them to obtain an average relevance score; and constructing two consecutive intervals with the maximum relevance score, the minimum relevance score and the average relevance score as interval endpoint values to obtain the score interval. The number of second classification result sequences corresponding to each first classification result sequence is determined by the number of intervals included in the score interval; for example, when the score interval includes two intervals, each first classification result sequence corresponds to 2 second classification result sequences.
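A minimal sketch of the score-interval construction and the resulting split is given below. The function names are hypothetical, and the treatment of a score that falls exactly on the average (placed in the upper interval here) is an assumption, since the text does not specify which interval owns the shared endpoint.

```python
from typing import List, Tuple

Item = Tuple[str, float]  # (resource id, relevance score)
Interval = Tuple[float, float]


def build_score_intervals(query_scores: List[float]) -> Tuple[Interval, Interval]:
    """Two consecutive intervals: [min, average] and [average, max]."""
    lowest, highest = min(query_scores), max(query_scores)
    average = (lowest + highest) / 2
    return (lowest, average), (average, highest)


def classify_by_interval(first_sequence: List[Item],
                         intervals: Tuple[Interval, Interval]
                         ) -> Tuple[List[Item], List[Item]]:
    """Split one first classification result sequence by the shared endpoint."""
    average = intervals[0][1]
    # Assumption: a score equal to the average goes to the upper interval.
    upper = [item for item in first_sequence if item[1] >= average]
    lower = [item for item in first_sequence if item[1] < average]
    return upper, lower


intervals = build_score_intervals([10, 8, 4, 2])
upper, lower = classify_by_interval([("r1", 10), ("r3", 4), ("r5", 6)], intervals)
print(intervals, upper, lower)
```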
In another embodiment of the present invention, the data classification module 101 performs relevance classification on the first classification result sequence, including: sorting all the resource data in the first classification result sequence according to the corresponding relevance score to obtain a standard first classification result sequence; and classifying the data in the standard first classification result sequence according to a preset sorting percentage to obtain the second classification result sequence. For example, if the preset sorting percentage is 50% and the standard first classification result sequence contains 10 resource data, the first 50% of the resource data (the first 5) are placed in one class and the remaining resource data in another, giving the corresponding second classification result sequences.
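The percentage-based alternative can be sketched as follows. The function name `split_by_percentage` is hypothetical, and truncating the cut position to an integer when the percentage does not divide the list evenly is an assumption; the text only covers the even case.

```python
from typing import List, Tuple

Item = Tuple[str, float]  # (resource id, relevance score)


def split_by_percentage(first_sequence: List[Item],
                        top_fraction: float = 0.5
                        ) -> Tuple[List[Item], List[Item]]:
    """Sort by score, then cut the sorted list at the given fraction."""
    ranked = sorted(first_sequence, key=lambda item: item[1], reverse=True)
    # Assumption: the cut position is truncated to an integer.
    cut = int(len(ranked) * top_fraction)
    return ranked[:cut], ranked[cut:]


items = [(f"r{i}", float(i)) for i in range(1, 11)]  # 10 hypothetical resources
top_half, rest = split_by_percentage(items, top_fraction=0.5)
print(top_half)
print(rest)
```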
The data screening module 102 is configured to perform repeated data screening on the second classification result sequence, and perform exponential downgrading calculation on the screened repeated data to obtain a third classification result sequence;
In the embodiment of the present invention, in order to prevent repeated or similar resource data from being displayed bundled together, the data screening module 102 screens the repeated data in the second classification result sequence and performs exponential downgrading calculation on the screened repeated data to obtain a third classification result sequence, where the repeated data includes identical or similar data.
In detail, in the embodiment of the present invention, the data screening module 102 performs repeated data screening on the second classification result sequence, and performs exponential downgrading calculation on the screened repeated data to obtain a third classification result sequence, including: encoding each resource data in the second classification result sequence by using a preset algorithm to obtain a data code corresponding to each resource data; calculating the text distance between any two data codes among all data codes corresponding to the second classification result sequence; determining a text distance smaller than a preset threshold value as a similar text distance; performing association classification on the resource data corresponding to all similar text distances in the second classification result sequence to obtain a repeated data list; and performing exponential downgrading calculation on the resource data in the repeated data list corresponding to the second classification result sequence to obtain the third classification result sequence.
Optionally, in the embodiment of the present invention, the preset algorithm is the SimHash algorithm.
In detail, in the embodiment of the present invention, the data screening module 102 performs association classification on the resource data corresponding to all similar text distances in the second classification result sequence to obtain a repeated data list, including: taking the resource data corresponding to all similar text distances in the second classification result sequence as nodes to carry out tree classification to obtain a classification tree; and sorting all the resource data corresponding to the classification tree according to the relevance score corresponding to each resource data to obtain a corresponding repeated data list. For example, if the text distance between A and B is a similar text distance, the text distance between A and C is a similar text distance, and the text distance between B and E is a similar text distance, then the corresponding classification tree is constructed with A as the node of the first layer of the classification tree, B and C as the nodes of the second layer, and E as the node of the third layer.
Further, in order to avoid bundling the repeated data, in an embodiment of the present invention, the data screening module 102 performs exponential downgrading calculation on the resource data in the repeated data list, including: performing an exponential calculation on the preset ordering position and the relevance score corresponding to each resource data in the repeated data list corresponding to the second classification result sequence to obtain the corresponding updated relevance score.
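The A/B/C/E example can be sketched as a breadth-first walk over the "similar" pairs, followed by a sort on relevance score. The function names, the choice of root and the scores below are assumptions for illustration; the text does not state how the root of the classification tree is chosen.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def tree_layers(similar_pairs: List[Tuple[str, str]], root: str) -> List[List[str]]:
    """Arrange resources linked by similar text distances into tree layers."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in similar_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    layers: List[List[str]] = []
    seen = {root}
    frontier = [root]
    while frontier:
        layers.append(sorted(frontier))
        next_frontier: List[str] = []
        for node in frontier:
            for other in sorted(neighbors[node]):
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return layers


def duplicate_list(layers: List[List[str]], scores: Dict[str, float]) -> List[str]:
    """Flatten the tree and order its members by relevance score, highest first."""
    members = [node for layer in layers for node in layer]
    return sorted(members, key=lambda node: scores[node], reverse=True)


layers = tree_layers([("A", "B"), ("A", "C"), ("B", "E")], root="A")
repeated = duplicate_list(layers, {"A": 0.9, "B": 0.7, "C": 0.8, "E": 0.6})
print(layers, repeated)
```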
Further, in the embodiment of the present invention, the data filtering module 102 replaces the corresponding relevance score with the updated relevance score to obtain the third classification result sequence.
Optionally, the preset ordering position is the second one.
Optionally, in the embodiment of the present invention, the data screening module 102 performs the exponential calculation according to the following formula:
$$N = a^{\lg i} \cdot C_i$$
where $a$ is a preset ordering parameter, preferably $a = 0.5$; $C_i$ is the relevance score corresponding to the i-th resource data in the repeated data list; $i$ is the sequence number of the resource data in the repeated data list; and $N$ is the updated relevance score of the i-th resource data in the repeated data list.
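Reading the formula as N = a^(lg i) · C_i, the downgrade can be sketched as below. Two assumptions are made: lg is taken as the base-10 logarithm, and the sequence number i starts at 1, so that the first entry in the repeated data list keeps its score unchanged. The function name `downgrade` is hypothetical.

```python
import math
from typing import List


def downgrade(duplicate_scores: List[float], a: float = 0.5) -> List[float]:
    """Apply N = a ** lg(i) * C_i to scores listed in duplicate-list order."""
    return [
        (a ** math.log10(i)) * c  # i = 1 gives a ** 0 == 1, i.e. no change
        for i, c in enumerate(duplicate_scores, start=1)
    ]


updated = downgrade([10.0] * 10, a=0.5)
print([round(score, 3) for score in updated])
```

Under these assumptions the penalty grows slowly with position: the second entry keeps roughly 81% of its score, and the tenth entry keeps exactly half.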
Optionally, in the embodiment of the present invention, the text distance of the two data codes is a hamming distance of the corresponding two data codes.
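A compact sketch of SimHash encoding with Hamming-distance comparison follows. It is illustrative only: whitespace tokenization, MD5 as the per-token hash, unweighted tokens and a 64-bit code are all assumptions (Chinese text would need a word segmenter), and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def simhash(text: str, bits: int = 64) -> int:
    """Unweighted SimHash over whitespace tokens (assumed tokenization)."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for b in range(bits):
            weights[b] += 1 if (token_hash >> b) & 1 else -1
    code = 0
    for b in range(bits):
        if weights[b] > 0:
            code |= 1 << b
    return code


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def similar_pairs(codes: Dict[str, int], threshold: int) -> List[Tuple[str, str]]:
    """All pairs whose Hamming distance is below the preset threshold."""
    ids = sorted(codes)
    return [(p, q) for i, p in enumerate(ids) for q in ids[i + 1:]
            if hamming_distance(codes[p], codes[q]) < threshold]


codes = {"A": 0b1111, "B": 0b1110, "C": 0b0000}  # hand-made codes for clarity
print(similar_pairs(codes, threshold=2))
print(hamming_distance(simhash("gold credit card offer"),
                       simhash("gold credit card offer")))
```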
The data sorting module 103 is configured to sort all the resource data in the third classification result sequence according to the relevance score corresponding to each resource data, so as to obtain a target query result sequence; and sending the target query result sequence to the terminal equipment corresponding to the query request.
In the embodiment of the present invention, as can be seen from the above description, there are a plurality of second classification result sequences and therefore a plurality of third classification result sequences. The data sorting module 103 sorts all the resource data in all the third classification result sequences according to the relevance score corresponding to each resource data to obtain the target query result sequence. For example, suppose there are two third classification result sequences: one contains resource data A and B, with relevance scores of 10 and 8 respectively, and the other contains resource data C and D, with relevance scores of 9 and 7 respectively. All the resource data in the third classification result sequences are then A, B, C and D, and sorting them by relevance score gives the target query result sequence [A, C, B, D].
In detail, in the embodiment of the present invention, the target query result sequence is sent to the terminal device corresponding to the query request, where the terminal device includes intelligent terminals such as computers, tablets, and mobile phones. For example, if the user initiates a query request on mobile phone A, the target query result sequence is sent to mobile phone A so that the user can view it conveniently.
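The final merge-and-sort step can be sketched with scores A = 10, B = 8, C = 9 and D = 7. The function name `merge_and_sort` is hypothetical.

```python
from typing import List, Tuple

Item = Tuple[str, float]  # (resource id, relevance score)


def merge_and_sort(third_sequences: List[List[Item]]) -> List[str]:
    """Pool every third classification result sequence and sort by score."""
    merged = [item for sequence in third_sequences for item in sequence]
    ranked = sorted(merged, key=lambda item: item[1], reverse=True)
    return [rid for rid, _ in ranked]


target = merge_and_sort([[("A", 10), ("B", 8)], [("C", 9), ("D", 7)]])
print(target)
```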
Fig. 3 is a schematic structural diagram of an electronic device implementing a data sorting method based on repeated data screening according to the present invention.
The electronic device may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program stored in the memory 11 and executable on the processor 10, such as a data sorting program based on repeated data screening.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. The memory 11 may in other embodiments also be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only for storing application software installed in an electronic device and various types of data, such as codes of a data sorting program based on repeated data screening, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, executes programs or modules (e.g., a data sorting program based on repeated data screening, etc.) stored in the memory 11 by running or executing the programs or modules, and invokes data stored in the memory 11 to perform various functions of the electronic device and process the data.
The communication bus 12 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The communication bus 12 is arranged to enable connection and communication between the memory 11 and the at least one processor 10, etc. For ease of illustration, the bus is shown in the figures with only one bold line, but this does not mean that there is only one bus or one type of bus.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 is not limiting of the electronic device and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power source (such as a battery) for supplying power to the respective components. Preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power source may also include one or more direct current or alternating current power supplies, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which are not described herein.
Optionally, the communication interface 13 may comprise a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), typically used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the communication interface 13 may further comprise a user interface, which may be a display or an input unit such as a keyboard, or a standard wired or wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, as appropriate, and is used for displaying information processed in the electronic device and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The data sorting program based on repeated data screening stored by the memory 11 in the electronic device is a combination of a plurality of computer programs, which when run in the processor 10, can realize:
carrying out correlation screening and sorting on a preset resource data set according to a received query request to obtain a query result sequence;
performing label classification on the query result sequence to obtain a first classification result sequence;
carrying out relevance classification on the first classification result sequence to obtain a second classification result sequence;
screening the repeated data of the second classification result sequence, and performing exponential descending calculation on the screened repeated data to obtain a third classification result sequence;
sorting all the resource data in the third classification result sequence according to the relevance scores corresponding to each resource data to obtain a target query result sequence;
and sending the target query result sequence to the terminal equipment corresponding to the query request.
In particular, the specific implementation method of the processor 10 on the computer program may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
Further, the modules/units integrated in the electronic device, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. The computer readable medium may be non-volatile or volatile. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
Embodiments of the present invention may also provide a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, may implement:
carrying out correlation screening and sorting on a preset resource data set according to a received query request to obtain a query result sequence;
performing label classification on the query result sequence to obtain a first classification result sequence;
carrying out relevance classification on the first classification result sequence to obtain a second classification result sequence;
screening the repeated data of the second classification result sequence, and performing exponential descending calculation on the screened repeated data to obtain a third classification result sequence;
sorting all the resource data in the third classification result sequence according to the relevance scores corresponding to each resource data to obtain a target query result sequence;
and sending the target query result sequence to the terminal equipment corresponding to the query request.
Further, the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, each data block containing information on a batch of network transactions, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (7)

1. A method for sorting data based on repeated data screening, the method comprising:
carrying out correlation screening and sorting on a preset resource data set according to a received query request to obtain a query result sequence;
performing label classification on the query result sequence to obtain a first classification result sequence;
carrying out relevance classification on the first classification result sequence to obtain a second classification result sequence;
screening the repeated data of the second classification result sequence, and performing exponential descending calculation on the screened repeated data to obtain a third classification result sequence;
sorting all the resource data in the third classification result sequence according to the relevance scores corresponding to each resource data to obtain a target query result sequence;
the target query result sequence is sent to terminal equipment corresponding to the query request;
The step of performing relevance screening and sorting on the preset resource data set according to the query request to obtain a query result sequence comprises the following steps: extracting a query field in the query request, and converting the query field into a vector to obtain a query vector; converting each resource data in the resource data set into a vector to obtain a corresponding resource vector; calculating the relevance of the query vector and the resource vector to obtain a corresponding relevance score; screening the resource data with the relevance score larger than the preset relevance in the resource data set to obtain an initial query result sequence; sequencing all resource data in the initial query result sequence according to the corresponding relevance score to obtain the query result sequence;
the step of classifying the correlation degree of the first classification result sequence to obtain a second classification result sequence comprises the following steps: constructing a score interval according to the query result sequence; classifying the first classification result sequence by using the score interval to obtain the second classification result sequence;
the constructing a score interval according to the query result sequence comprises the following steps: screening the maximum correlation score of the query result sequence to obtain first interval data; screening the minimum correlation score of the query result sequence to obtain second interval data; average calculation is carried out on the first interval data and the second interval data to obtain third interval data; and constructing two continuous intervals by taking the first interval data, the second interval data and the third interval data as interval endpoint values to obtain the score interval.
2. The method for sorting data based on repeated data screening according to claim 1, wherein the step of performing repeated data screening on the second classification result sequence and performing exponential downgrading calculation on the screened repeated data to obtain a third classification result sequence comprises:
coding each resource data in the second classification result sequence by using a preset algorithm to obtain a corresponding data code;
calculating the text distance of any two data codes in all data codes corresponding to the second classification result sequence;
determining the text distance smaller than a preset threshold value as a similar text distance;
performing association classification on the resource data corresponding to all similar text distances in the second classification result sequence to obtain a repeated data list;
and carrying out index descending calculation on the resource data in the repeated data list corresponding to the second classification result sequence to obtain the third classification result sequence.
3. The method for sorting data based on repeated data filtering according to claim 2, wherein said performing association classification on resource data corresponding to all similar text distances in the second classification result sequence to obtain a repeated data list includes:
Taking resource data corresponding to all similar text distances in the second classification result sequence as nodes to carry out tree classification to obtain a classification tree;
and sequencing all the resource data corresponding to the classification tree according to the relevance score corresponding to each resource data to obtain the repeated data list.
4. The method for sorting data based on repeated data filtering according to claim 2, wherein the performing an exponential downgrading calculation on the resource data in the repeated data list corresponding to the second classification result sequence to obtain the third classification result sequence includes:
performing index calculation on the preset ordering positions and the relevance scores corresponding to all the resource data in the repeated data list corresponding to the second classification result sequence to obtain corresponding updated relevance scores;
and replacing the corresponding relevance score by the updated relevance score to obtain the third classification result sequence.
5. A data sorting apparatus based on repeated data screening for implementing the data sorting method based on repeated data screening according to any one of claims 1 to 4, comprising:
The data classification module is used for carrying out correlation screening and sorting on a preset resource data set according to the received query request to obtain a query result sequence; performing label classification on the query result sequence to obtain a first classification result sequence; carrying out relevance classification on the first classification result sequence to obtain a second classification result sequence;
the data screening module is used for carrying out repeated data screening on the second classification result sequence, and carrying out exponential descending calculation on the screened repeated data to obtain a third classification result sequence;
the data sorting module is used for sorting all the resource data in the third classification result sequence according to the relevance score corresponding to each resource data to obtain a target query result sequence; and sending the target query result sequence to the terminal equipment corresponding to the query request.
6. An electronic device, the electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the repeated data screening-based data ordering method of any one of claims 1 to 4.
7. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the repeated data screening based data sorting method according to any one of claims 1 to 4.
CN202110566211.1A 2021-05-24 2021-05-24 Data sorting method, device, equipment and medium based on repeated data screening Active CN113505273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110566211.1A CN113505273B (en) 2021-05-24 2021-05-24 Data sorting method, device, equipment and medium based on repeated data screening

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110566211.1A CN113505273B (en) 2021-05-24 2021-05-24 Data sorting method, device, equipment and medium based on repeated data screening

Publications (2)

Publication Number Publication Date
CN113505273A (en) 2021-10-15
CN113505273B (en) 2023-08-22

Family

ID=78008663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110566211.1A Active CN113505273B (en) 2021-05-24 2021-05-24 Data sorting method, device, equipment and medium based on repeated data screening

Country Status (1)

Country Link
CN (1) CN113505273B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445207B (en) * 2022-04-11 2022-07-26 广东企数标普科技有限公司 Tax administration system based on digital RMB
CN114943021B (en) 2022-07-20 2022-11-08 之江实验室 TB-level incremental data screening method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477554A (en) * 2009-01-16 2009-07-08 西安电子科技大学 User interest based personalized meta search engine and search result processing method
CN109207606A (en) * 2018-09-26 2019-01-15 西南民族大学 The screening technique in the site SSR for paternity identification and application
CN109598307A (en) * 2018-12-06 2019-04-09 北京达佳互联信息技术有限公司 Data screening method, apparatus, server and storage medium
CN110046298A (en) * 2019-04-24 2019-07-23 中国人民解放军国防科技大学 Query word recommendation method and device, terminal device and computer readable medium
CN110378560A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Arbitrator's data screening method, apparatus, computer equipment and storage medium
CN111008321A (en) * 2019-11-18 2020-04-14 广东技术师范大学 Recommendation method and device based on logistic regression, computing equipment and readable storage medium
CN111859057A (en) * 2020-09-22 2020-10-30 上海冰鉴信息科技有限公司 Data feature processing method and data feature processing device
CN112328657A (en) * 2020-11-03 2021-02-05 中国平安人寿保险股份有限公司 Feature derivation method, feature derivation device, computer equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824855B (en) * 2015-01-09 2019-12-13 阿里巴巴集团控股有限公司 Method and device for screening and classifying data objects and electronic equipment

Also Published As

Publication number Publication date
CN113505273A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN113157927B (en) Text classification method, apparatus, electronic device and readable storage medium
CN112528616B (en) Service form generation method and device, electronic equipment and computer storage medium
CN113505273B (en) Data sorting method, device, equipment and medium based on repeated data screening
CN112860905A (en) Text information extraction method, device and equipment and readable storage medium
CN114781832A (en) Course recommendation method and device, electronic equipment and storage medium
CN114491047A (en) Multi-label text classification method and device, electronic equipment and storage medium
CN113886708A (en) Product recommendation method, device, equipment and storage medium based on user information
CN115018588A (en) Product recommendation method and device, electronic equipment and readable storage medium
CN113658002B (en) Transaction result generation method and device based on decision tree, electronic equipment and medium
CN113868528A (en) Information recommendation method and device, electronic equipment and readable storage medium
CN111930897B (en) Patent retrieval method, device, electronic equipment and computer-readable storage medium
CN113268665A (en) Information recommendation method, device and equipment based on random forest and storage medium
CN112633988A (en) User product recommendation method and device, electronic equipment and readable storage medium
CN116339882B (en) Office system collaborative display method, device, equipment and medium based on Internet of things
CN116578696A (en) Text abstract generation method, device, equipment and storage medium
CN113626605B (en) Information classification method, device, electronic equipment and readable storage medium
CN113591881B (en) Intention recognition method and device based on model fusion, electronic equipment and medium
CN113435308B (en) Text multi-label classification method, device, equipment and storage medium
CN113656690B (en) Product recommendation method and device, electronic equipment and readable storage medium
CN113343102A (en) Data recommendation method and device based on feature screening, electronic equipment and medium
CN113705201A (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN113064984A (en) Intention recognition method and device, electronic equipment and readable storage medium
CN113268614A (en) Label system updating method and device, electronic equipment and readable storage medium
CN113344674A (en) Product recommendation method, device, equipment and storage medium based on user purchasing power
CN113592606B (en) Product recommendation method, device, equipment and storage medium based on multiple decisions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant