CN112948370A - Data classification method and device and computer equipment - Google Patents
- Publication number
- CN112948370A
- Authority
- CN
- China
- Prior art keywords
- data
- characteristic value
- values
- characteristic
- classified
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data classification method, which comprises the following steps: acquiring data to be classified; calculating M characteristic values of the data to be classified according to a characteristic value calculation rule; comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence; and when the M characteristic values are all included in a first characteristic value data table, classifying the data to be classified into first class data corresponding to the first characteristic value data table. The invention also provides a data classification device, computer equipment and a computer readable storage medium. Because only the compact characteristic values of the data to be classified are compared with the characteristic value data tables, rather than the data itself, the invention reduces the amount of data processed, shortens the classification time and improves efficiency.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data classification method and apparatus, a computer device, and a computer-readable storage medium.
Background
In the prior art, data classification generally proceeds by creating data classes from existing data and then comparing the data to be classified with all of the data in each data class, one by one, to determine whether the data to be classified belongs to that class. However, classifying by enumerating every item of every existing class requires a huge amount of computation, which consumes substantial computer processing resources, takes a long time, and is inefficient.
Disclosure of Invention
In view of this, the present invention provides a data classification method, an apparatus, a computer device, and a computer-readable storage medium, which can solve the problems that a large amount of computer processing resources are required to be consumed and time is consumed in the data classification process.
First, to achieve the above object, the present invention provides a data classification method, including:
acquiring data to be classified; calculating M characteristic values of the data to be classified according to a characteristic value calculation rule; comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence, wherein the characteristic value database at least comprises a first characteristic value data table, and the first characteristic value data table is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule; and when the M characteristic values are all included in the first characteristic value data table, classifying the data to be classified into first class data corresponding to the first characteristic value data table.
In one example, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions; or dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
In one example, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and a boolean value of 1 is assigned to each characteristic value in the bloom filter.
In one example, the characteristic value database further includes at least a second characteristic value data table, wherein the sequentially comparing the M characteristic values with each of the characteristic value data tables in the characteristic value database includes: sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table; when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
In one example, the method further comprises: when the M characteristic values are not completely included in the first bloom filter and not completely included in the second bloom filter, judging that the data to be classified does not belong to the existing class data; and returning a warning of classification failure.
In addition, to achieve the above object, the present invention also provides a data classification apparatus, comprising:
an acquisition module, configured to acquire data to be classified; a calculation module, configured to calculate M characteristic values of the data to be classified according to a characteristic value calculation rule; a comparison module, configured to compare the M characteristic values with each characteristic value data table in a characteristic value database in sequence, where the characteristic value database at least includes a first characteristic value data table, and the first characteristic value data table is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule; and a classification module, configured to classify the data to be classified into first class data corresponding to the first characteristic value data table when the M characteristic values are all included in the first characteristic value data table.
In one example, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions; or dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
In one example, the characteristic value data table stores all characteristic values of the same category data in a bloom filter manner, each characteristic value has a boolean value of 1 in a corresponding storage order in the bloom filter, the characteristic value database further includes at least a second characteristic value data table, and the comparison module is further configured to: sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table; when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
Further, the present invention also proposes a computer device, which includes a memory and a processor, wherein the memory stores a computer program that can be run on the processor, and the computer program implements the steps of the data classification method as described above when being executed by the processor.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of the data classification method as described above.
Compared with the prior art, the data classification method, the data classification device, the computer equipment and the computer readable storage medium provided by the invention calculate M characteristic values of the data to be classified according to the characteristic value calculation rule after the data to be classified is acquired; compare the M characteristic values with each characteristic value data table in a characteristic value database in sequence; and, when the M characteristic values are all included in the first characteristic value data table, classify the data to be classified into first class data corresponding to the first characteristic value data table. Because only the compact characteristic values of the data to be classified are compared with the characteristic value data tables, the amount of data processed is reduced, the time is shortened, and efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of an application environment of an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a data classification method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a specific embodiment of the process of comparing the M characteristic values with each of the characteristic value data tables in the characteristic value database in turn in step S204 of FIG. 2;
FIG. 4 is a schematic illustration of the effect of the step shown in FIG. 3;
FIG. 5 is a diagram of an alternative hardware architecture for the computer device of the present invention;
FIG. 6 is a block diagram of a data classification apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an application environment according to an embodiment of the present invention. Referring to fig. 1, the computer device 1 is connected to a user terminal and a data server, receives data to be classified sent by the user terminal, and classifies the data to be classified according to a characteristic value database stored in the data server. In the present embodiment, the computer device 1 can be used as a terminal device such as a server, a mobile phone, a user portable device, a PC, and the like. In other embodiments, the computer device 1 may also be a stand-alone functional module, and then attached to a data server or a user terminal to implement the function of data classification. Of course, in this embodiment, the characteristic value database is disposed on the data server, and in other embodiments, the characteristic value database may also be disposed on the computer device 1, which is not limited herein.
FIG. 2 is a flowchart illustrating a data classification method according to an embodiment of the present invention. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is made by taking a computer device as an execution subject.
As shown in fig. 2, the data classification method may include steps S200 to S206, in which:
Step S200, acquiring data to be classified.
Specifically, after the computer device 1 is connected to a user terminal, when a user has data to be classified, the data to be classified is sent to the computer device 1 through the user terminal, and then the computer device 1 receives the data to be classified. Of course, in other embodiments, the computer device 1 may also provide an interactive interface, then receive a classification request of a user for the data to be classified stored on the computer device 1 through the interactive interface, and then obtain the data to be classified from the storage unit of the computer device 1 itself.
Step S202, calculating M characteristic values of the data to be classified according to a characteristic value calculation rule.
Specifically, after the computer device 1 acquires the data to be classified, M feature values of the data to be classified are calculated according to a preset feature value calculation rule. In one embodiment, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions, where each hash function computes a corresponding hash value, namely a feature value, from the data to be classified. That is, the computer device 1 calculates M feature values of the data to be classified by M different hash functions set in advance, and associates the M feature values with the data to be classified.
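The first calculation rule can be sketched as follows. This is only an illustrative sketch: the text does not name specific hash functions, so the salted SHA-256 construction, the 64-bit truncation, and the function name `feature_values` are all assumptions made for the example.

```python
import hashlib
from typing import List


def feature_values(data: bytes, m: int) -> List[int]:
    """Compute m feature values of `data` with m different hash functions.

    Assumption: the "M different hash functions" are modeled by prefixing
    the data with a distinct 4-byte salt before hashing with SHA-256.
    """
    values = []
    for i in range(m):
        digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
        # Keep the first 8 bytes so each feature value is a 64-bit integer.
        values.append(int.from_bytes(digest[:8], "big"))
    return values


if __name__ == "__main__":
    print(feature_values(b"data to be classified", 3))
```

The same input always yields the same M values, which is what allows them to be compared later against tables built with the same rule.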
In another embodiment, the feature value calculation rule includes: dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions. For example, when the data to be classified is large, it may be divided into M parts, and the hash value of each part is then calculated in turn with the M preset hash functions, yielding M corresponding feature values. How the data is divided can be chosen according to the characteristics of the data to be classified: for example, video data can be divided by video duration, and text data can be divided by paragraph. In summary, for different data classifications, the computer device 1 may calculate M feature values of the data to be classified according to a preset feature value calculation rule.
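The partition-based rule might look like the sketch below. The equal-size byte split is an assumption chosen for simplicity (the text suggests content-aware splits such as video duration or paragraphs), and `split_into_parts` and `part_feature_values` are hypothetical names.

```python
import hashlib
from typing import List


def split_into_parts(data: bytes, m: int) -> List[bytes]:
    """Split `data` into m contiguous parts of near-equal size (assumed policy)."""
    base, extra = divmod(len(data), m)
    parts, start = [], 0
    for i in range(m):
        # The first `extra` parts receive one additional byte.
        end = start + base + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def part_feature_values(data: bytes, m: int) -> List[int]:
    """Hash the i-th part with the i-th (salted SHA-256) hash function."""
    values = []
    for i, part in enumerate(split_into_parts(data, m)):
        digest = hashlib.sha256(i.to_bytes(4, "big") + part).digest()
        values.append(int.from_bytes(digest[:8], "big"))
    return values


if __name__ == "__main__":
    print(split_into_parts(b"abcdefghij", 3))  # [b'abcd', b'efg', b'hij']
    print(part_feature_values(b"abcdefghij", 3))
```

Each hash function only reads one part here, so the work per function shrinks as M grows, which is the motivation the text gives for large data.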
Step S204, comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence. The characteristic value database includes at least a first characteristic value data table, which is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule.
Step S206, when all the M feature values are included in the first feature value data table, classifying the data to be classified into first class data corresponding to the first feature value data table.
In this embodiment, after the computer device 1 calculates M feature values of the data to be classified, the M feature values are sent to the data server, and the data server is requested to compare the M feature values with each feature value data table in the feature value database in sequence. Of course, in other embodiments, the computer device 1 may also obtain the characteristic value database from the data server, and then directly compare the M characteristic values with each characteristic value data table in the characteristic value database in sequence. Each characteristic value data table is obtained by calculating the characteristic values of data of the same category according to the characteristic value calculation rule.
When the computer device 1 determines, by comparison, that the M characteristic values are included in the first characteristic value data table, it is determined that the data to be classified is included in the existing data corresponding to the first characteristic value data table, and therefore, the data to be classified is classified into the first category data corresponding to the first characteristic value data table. And finally, returning the classification result to the user terminal.
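Steps S204 and S206 can be illustrated with a minimal sketch in which each characteristic value data table is modeled as a plain Python set. The dictionary layout, the category names, and the function name `classify` are assumptions for illustration, not the storage format described in the text.

```python
from typing import Dict, Iterable, Optional, Set


def classify(features: Iterable[int], database: Dict[str, Set[int]]) -> Optional[str]:
    """Return the first category whose table contains all feature values.

    `database` maps a category name to its characteristic value data table.
    Tables are checked in insertion order, mirroring "in sequence".
    """
    feats = list(features)
    for category, table in database.items():
        if all(value in table for value in feats):
            return category
    return None  # no table contains every feature value


if __name__ == "__main__":
    db = {"first": {11, 12, 13, 14}, "second": {21, 22, 23}}
    print(classify([11, 13], db))  # first
    print(classify([21, 23], db))  # second
    print(classify([11, 21], db))  # None
```

Only M membership tests per table are needed, regardless of how many original data items the category holds.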
In an exemplary embodiment, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and a boolean value of 1 is assigned to each characteristic value in the bloom filter. As shown in fig. 3, when the characteristic value database further includes a second characteristic value data table, the comparing the M characteristic values with each characteristic value data table in the characteristic value database in sequence in step S204 includes steps S300 to S304:
Step S300, sequentially querying whether the boolean values of the storage orders corresponding to the M characteristic values are all 1 in the first bloom filter and the second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table.
Step S302, determining that the M characteristic values are all included in the first characteristic value data table or the second characteristic value data table when the boolean values of the storage orders corresponding to the M characteristic values in the first bloom filter or the second bloom filter are all 1.
Step S304, when the M characteristic values are not completely included in the first characteristic value data table or the second characteristic value data table, judging that the data to be classified does not belong to the existing class data, and returning a warning of classification failure.
Specifically, when the characteristic value data tables are stored as bloom filters, the characteristic value database consists of a plurality of bloom filters. Therefore, after the computer device 1 calculates the M feature values of the data to be classified, the M feature values are compared with each bloom filter in sequence to determine whether they are all included in any one bloom filter. In this embodiment, a bloom filter is a storage unit of a specific size stored in array form; it comprises storage orders, that is, positions in the array, and a boolean value of 1 or 0 at each storage order. The computer device 1 therefore sequentially queries, in the first bloom filter and the second bloom filter corresponding to the first feature value data table and the second feature value data table, whether the boolean values of the storage orders corresponding to the M feature values are all 1. When the boolean values of the storage orders corresponding to the M feature values in the first bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table; likewise, when they are all 1 in the second bloom filter, it is determined that the M feature values are all included in the second feature value data table. When the M feature values are completely included in neither the first feature value data table nor the second feature value data table, it is determined that the data to be classified does not belong to the existing class data, and a warning of classification failure is returned.
Referring to fig. 4, the computer device 1 compares M feature values of the data to be classified with the bloom filter 1 and the bloom filter 2 in sequence, and determines whether the M feature values exist in the bloom filter 1 or the bloom filter 2: in fig. 4(A), when the M feature values do not exist in bloom filter 1 but exist in bloom filter 2, the data is classified into the second class data; in fig. 4(B), when the M feature values exist in neither the bloom filter 1 nor the bloom filter 2, a classification failure is indicated, and the data to be classified does not belong to the existing class data.
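The two-filter lookup of steps S300 to S304 can be sketched as below. The array size, the salted SHA-256 hashing, the modulo mapping from a feature value to a storage position, and all names (`BloomFilter`, `feature_values`, `classify`) are assumptions for the example; the text does not specify them.

```python
import hashlib
from typing import List, Optional, Sequence, Tuple


def feature_values(data: bytes, m: int) -> List[int]:
    """Assumed rule: m salted SHA-256 hashes, each truncated to 64 bits."""
    return [
        int.from_bytes(hashlib.sha256(i.to_bytes(4, "big") + data).digest()[:8], "big")
        for i in range(m)
    ]


class BloomFilter:
    """Fixed-size array of 0/1 values indexed by storage position."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.bits = bytearray(size)  # every position starts at 0

    def add(self, values: Sequence[int]) -> None:
        for v in values:
            self.bits[v % self.size] = 1  # set the position for this value to 1

    def contains_all(self, values: Sequence[int]) -> bool:
        return all(self.bits[v % self.size] == 1 for v in values)


def classify(data: bytes, filters: Sequence[Tuple[str, BloomFilter]], m: int) -> Optional[str]:
    feats = feature_values(data, m)
    for name, bloom in filters:  # queried in order: first filter, then second
        if bloom.contains_all(feats):
            return name
    return None  # caller should report a classification failure


if __name__ == "__main__":
    first, second = BloomFilter(1 << 20), BloomFilter(1 << 20)
    first.add(feature_values(b"cat photo", 3))
    second.add(feature_values(b"dog photo", 3))
    filters = [("first", first), ("second", second)]
    print(classify(b"dog photo", filters, 3))
    print(classify(b"unknown", filters, 3))
```

As a general property of bloom filters, a lookup can report a false positive (all queried positions happen to be 1 for data that was never added) but never a false negative, so a larger array relative to the number of stored values lowers the chance of misclassification.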
As can be seen from the above, the data classification method provided in this embodiment acquires data to be classified and calculates M feature values of the data to be classified according to a feature value calculation rule; compares the M feature values with each characteristic value data table in a characteristic value database in sequence; and, when the M feature values are all included in the first characteristic value data table, classifies the data to be classified into first class data corresponding to the first characteristic value data table. Because only the compact characteristic values of the data to be classified are compared with the characteristic value data tables, the amount of data processed is reduced, the time is shortened, and efficiency is improved.
In addition, the present invention also provides a computer device, which is shown in fig. 5 and is a schematic diagram of an optional hardware architecture of the computer device of the present invention.
In this embodiment, the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which may be communicatively connected to each other through a system bus. The computer device 1 is connected to a network (not shown in fig. 5) through the network interface 13, and is connected to a server (not shown in fig. 5) through the network for data interaction. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another communication network.
It is noted that fig. 5 only shows the computer device 1 with components 11-13, but it is to be understood that not all shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the computer device 1, such as a hard disk or a memory of the computer device 1. In other embodiments, the memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, and the like provided in the computer device 1. Of course, the memory 11 may also comprise both an internal storage unit of the computer device 1 and an external storage device thereof. In this embodiment, the memory 11 is generally used for storing an operating system installed in the computer device 1 and various types of application software, such as program codes of the barrier application, program codes of the data classification apparatus 200, and the like. Furthermore, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally used for controlling the overall operation of the computer device 1, such as performing data interaction or communication related control and processing. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, run an application program of the data classification apparatus 200, which is not limited herein.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is generally used for establishing a communication connection between the computer device 1 and a user terminal and a data server.
In this embodiment, when the data classification apparatus 200 is installed and run in the computer device 1, it acquires the data to be classified and calculates M feature values of the data to be classified according to the feature value calculation rule; compares the M characteristic values with each characteristic value data table in a characteristic value database in sequence; and, when the M characteristic values are all included in the first characteristic value data table, classifies the data to be classified into first class data corresponding to the first characteristic value data table. Because only the compact characteristic values of the data to be classified are compared with the characteristic value data tables, the amount of data processed is reduced, the time is shortened, and efficiency is improved.
The hardware structure and functions of the computer apparatus of the present invention have been described in detail so far. Hereinafter, various embodiments of the present invention will be proposed based on the above-described computer apparatus.
Referring to FIG. 6, a block diagram of a data classification apparatus 200 according to an embodiment of the invention is shown.
In this embodiment, the data classification apparatus 200 includes a series of computer program instructions stored on the memory 11, which when executed by the processor 12, can implement the data classification function of the embodiment of the present invention. In some embodiments, the data classification apparatus 200 may be divided into one or more modules based on the particular operations implemented by the portions of the computer program instructions. For example, in fig. 6, the data classification apparatus 200 may be divided into an acquisition module 201, a calculation module 202, a comparison module 203, and a classification module 204. Wherein:
The acquisition module 201 is configured to acquire data to be classified.
Specifically, after the computer device is connected to the user terminal, when the user has data to be classified, the data to be classified is sent to the computer device through the user terminal, and then the acquisition module 201 receives the data to be classified. Of course, in other embodiments, the computer device may also provide an interactive interface, then receive a classification request of a user for the data to be classified stored on the computer device through the interactive interface, and then the acquisition module 201 obtains the data to be classified from the storage unit of the computer device itself.
The calculation module 202 is configured to calculate M feature values of the data to be classified according to a feature value calculation rule.
Specifically, after the acquisition module 201 acquires the data to be classified, the calculation module 202 calculates M feature values of the data to be classified according to a preset feature value calculation rule. In one embodiment, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions, where each hash function computes a corresponding hash value, namely a feature value, from the data to be classified. That is, the calculation module 202 calculates M feature values of the data to be classified by M different hash functions set in advance, and associates the M feature values with the data to be classified.
In another embodiment, the feature value calculation rule includes: dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions. For example, when the data to be classified is large, it may be divided into M parts, and the hash value of each part is then calculated in turn with the M preset hash functions, yielding M corresponding feature values. How the data is divided can be chosen according to the characteristics of the data to be classified: for example, video data can be divided by video duration, and text data can be divided by paragraph. In short, for different data classifications, the calculation module 202 may calculate M feature values of the data to be classified according to a preset feature value calculation rule.
The comparison module 203 is configured to compare the M characteristic values with each characteristic value data table in a characteristic value database in sequence. Wherein the characteristic value database includes at least a first characteristic value data table which is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule.
The classifying module 204 is configured to classify the data to be classified into first class data corresponding to the first characteristic value data table when all the M characteristic values are included in the first characteristic value data table.
In this embodiment, after the calculating module 202 calculates the M feature values of the data to be classified, the comparing module 203 sends the M feature values to the data server and requests the data server to compare the M feature values with each characteristic value data table in the characteristic value database in sequence. Of course, in other embodiments, the comparing module 203 may also obtain the characteristic value database from the data server and then directly compare the M characteristic values with each characteristic value data table in sequence. Each characteristic value data table is obtained by calculating the characteristic values of data of the same category according to the characteristic value calculation rule.
When the comparison module 203 determines through comparison that the M characteristic values are all included in the first characteristic value data table, the data to be classified is considered to belong to the existing data corresponding to that table, and the classification module 204 therefore classifies the data to be classified as the first class data corresponding to the first characteristic value data table. Finally, the classification result is returned to the user terminal.
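The sequential comparison can be illustrated with a small sketch in which each characteristic value data table is modeled as a plain Python set. The category names, the sample values, and the function name `classify` are hypothetical and chosen only for illustration.

```python
from typing import Optional


def classify(values: list[int], tables: dict[str, set[int]]) -> Optional[str]:
    """Return the first category whose table contains every value, else None."""
    for category, table in tables.items():  # tables are checked in order
        if all(v in table for v in values):
            return category
    return None


tables = {
    "category_1": {11, 22, 33, 44},
    "category_2": {55, 66, 77},
}

print(classify([11, 33], tables))  # category_1
print(classify([11, 55], tables))  # None
```

A match requires all M values to appear in a single table; values spread across different tables, as in the second call, do not count as a match.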
In an exemplary embodiment, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and the boolean value in the storage order corresponding to each characteristic value in the bloom filter is set to 1. When the characteristic value database further includes a second characteristic value data table, the comparison module 203 is further configured to sequentially query, in a first bloom filter and a second bloom filter corresponding to the first and second characteristic value data tables respectively, whether the boolean values in the storage orders corresponding to the M characteristic values are all 1; when those boolean values are all 1 in the first bloom filter or in the second bloom filter, it is determined that the M characteristic values are all included in the first characteristic value data table or the second characteristic value data table, respectively. The classification module 204 is further configured to, when the M characteristic values are not all included in the first characteristic value data table and not all included in the second, determine that the data to be classified does not belong to any existing category and return a classification failure warning.
Specifically, when the characteristic value data tables are implemented as bloom filters, the feature value database is a collection of bloom filters. Therefore, after the calculating module 202 calculates the M feature values of the data to be classified, the comparing module 203 compares the M feature values with each bloom filter in sequence and determines whether the M feature values are all included in any one bloom filter. In this embodiment, a bloom filter is a fixed-size storage unit organized as an array; each entry has a storage order, that is, its position within the storage unit, and a boolean value of 1 or 0 at that position. The comparison module 203 therefore sequentially queries, in the first bloom filter and the second bloom filter corresponding to the first and second feature value data tables, whether the boolean values in the storage orders corresponding to the M feature values are all 1. When those boolean values are all 1 in the first bloom filter, the comparison module 203 determines that the M feature values are all included in the first feature value data table, and likewise for the second bloom filter and the second feature value data table. When the comparison module 203 determines that the M feature values are not all included in the first feature value data table and not all included in the second, the classification module 204 determines that the data to be classified does not belong to any existing category and returns a classification failure warning.
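A bloom filter of this kind can be sketched as a fixed-size list of 0/1 values, where each of the M feature values selects one storage position. The class below is a minimal illustration under stated assumptions: the array size of 1024, the index-prefixed SHA-256 hash functions, and the names `BloomFilter`, `add`, and `might_contain` are not taken from the specification.

```python
import hashlib


class BloomFilter:
    """Fixed-size array of 0/1 values addressed by M hashed positions."""

    def __init__(self, size: int = 1024, num_hashes: int = 3) -> None:
        self.size = size
        self.num_hashes = num_hashes  # plays the role of M
        self.bits = [0] * size

    def _positions(self, item: bytes) -> list[int]:
        # Assumption: each feature value is reduced modulo the array size
        # to obtain its storage position.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.size)
        return positions

    def add(self, item: bytes) -> None:
        """Set the value at every position of `item` to 1."""
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item: bytes) -> bool:
        """True only if the value at every position of `item` is 1."""
        return all(self.bits[p] == 1 for p in self._positions(item))


category_filter = BloomFilter()
category_filter.add(b"known sample")
print(category_filter.might_contain(b"known sample"))  # True
```

Because different items can map to the same positions, a general bloom filter can occasionally report membership for an item that was never added, while an item that was added is never reported as absent.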
Referring to Fig. 4, the comparison module 203 compares the M feature values of the data to be classified with bloom filter 1 and bloom filter 2 in sequence, and determines whether the M feature values exist in bloom filter 1 or bloom filter 2. In Fig. 4(A), when the comparing module 203 determines that the M feature values are not present in bloom filter 1 but are present in bloom filter 2, the classifying module 204 classifies the data to be classified into the second category data. In Fig. 4(B), when the comparison module 203 determines that the M feature values exist in neither bloom filter 1 nor bloom filter 2, the classification module 204 indicates that the classification fails and that the data to be classified does not belong to any existing category.
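The two cases of Fig. 4 can be walked through with a toy example. The 8-entry filters and the storage positions below are invented for illustration and do not reproduce the figure's actual contents; the function names are hypothetical.

```python
from typing import Optional


def all_bits_set(bits: list[int], positions: list[int]) -> bool:
    """True if the boolean value at every given position is 1."""
    return all(bits[p] == 1 for p in positions)


def classify_with_filters(
    positions: list[int], filters: list[tuple[str, list[int]]]
) -> Optional[str]:
    """Query each bloom filter in sequence; None means classification failed."""
    for category, bits in filters:
        if all_bits_set(bits, positions):
            return category
    return None


filter_1 = [0, 1, 0, 0, 1, 0, 0, 1]
filter_2 = [1, 0, 1, 0, 0, 1, 0, 0]
filters = [("first category", filter_1), ("second category", filter_2)]

# Case like Fig. 4(A): absent from filter 1, present in filter 2.
print(classify_with_filters([0, 2, 5], filters))  # second category

# Case like Fig. 4(B): present in neither filter.
print(classify_with_filters([0, 3, 6], filters))  # None
```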
As can be seen from the above, after acquiring the data to be classified, the computer device calculates M feature values of the data to be classified according to the feature value calculation rule; compares the M characteristic values with each characteristic value data table in a characteristic value database in sequence; and, when the M characteristic values are all included in the first characteristic value data table, classifies the data to be classified into the first class data corresponding to the first characteristic value data table. In this way, only the compact characteristic values of the data to be classified need to be compared with the characteristic value data tables, rather than the data itself, which reduces the amount of data processed, shortens the classification time, and improves efficiency.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A method of data classification, the method comprising:
acquiring data to be classified;
calculating M characteristic values of the data to be classified according to a characteristic value calculation rule;
comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence, wherein the characteristic value database at least comprises a first characteristic value data table, and the first characteristic value data table is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule;
and when the M characteristic values are included in the first characteristic value data table, classifying the data to be classified into first class data corresponding to the first characteristic value data table.
2. The data classification method of claim 1, wherein the feature value calculation rule comprises:
calculating M hash values of the data to be classified through M different hash functions; or
dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
3. The data classification method according to claim 1 or 2, characterized in that the characteristic value data table stores all characteristic values of the same category data in a bloom filter, each characteristic value having a boolean value of 1 in a storage order corresponding to the bloom filter.
4. The data classification method according to claim 3, wherein the characteristic value database further includes at least a second characteristic value data table, and wherein the sequentially comparing the M characteristic values with each of the characteristic value data tables in the characteristic value database includes:
sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table;
when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
5. The data classification method of claim 4, characterized in that the method further comprises:
when the M characteristic values are not completely included in the first bloom filter and not completely included in the second bloom filter, judging that the data to be classified does not belong to the existing class data;
and returning a warning of classification failure.
6. An apparatus for classifying data, the apparatus comprising:
the acquisition module is used for acquiring data to be classified;
the calculation module is used for calculating M characteristic values of the data to be classified according to a characteristic value calculation rule;
a comparison module, configured to compare the M feature values with each feature value data table in a feature value database in sequence, where the feature value database at least includes a first feature value data table, and the first feature value data table is a set of all feature values of the same category data calculated by the feature value calculation rule;
and the classification module is used for classifying the data to be classified into first class data corresponding to the first characteristic value data table when the M characteristic values are included in the first characteristic value data table.
7. The data classification apparatus of claim 6, wherein the feature value calculation rule comprises:
calculating M hash values of the data to be classified through M different hash functions; or
dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
8. The data classification apparatus according to claim 6, wherein the characteristic value data table stores all characteristic values of the same category data in a bloom filter, each characteristic value has a boolean value of 1 in a corresponding storage order in the bloom filter, the characteristic value database further includes at least a second characteristic value data table, and the comparison module is further configured to:
sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table;
when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory having stored thereon a computer program executable on the processor, the computer program, when executed by the processor, implementing the steps of the data classification method according to any one of claims 1-5.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executable by at least one processor to cause the at least one processor to perform the steps of the data classification method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911175983.1A CN112948370B (en) | 2019-11-26 | 2019-11-26 | Data classification method and device and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112948370A (en) | 2021-06-11 |
CN112948370B (en) | 2023-04-11 |
Family
ID=76225198
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911175983.1A Active CN112948370B (en) | 2019-11-26 | 2019-11-26 | Data classification method and device and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112948370B (en) |
Patent Citations (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102203773A (en) * | 2008-09-19 | 2011-09-28 | 甲骨文国际公司 | Hash join using collaborative parallel filtering in intelligent storage with offloaded bloom filters |
US20100269024A1 (en) * | 2009-04-18 | 2010-10-21 | Fang Hao | Method and apparatus for multiset membership testing using combinatorial bloom filters |
CN101923568A (en) * | 2010-06-23 | 2010-12-22 | 北京星网锐捷网络技术有限公司 | Method for increasing and canceling elements of Bloom filter and Bloom filter |
CN102253991A (en) * | 2011-05-25 | 2011-11-23 | 北京星网锐捷网络技术有限公司 | Uniform resource locator (URL) storage method, web filtering method, device and system |
CN103761494A (en) * | 2014-01-10 | 2014-04-30 | 清华大学 | Method and system for identifying lost tag of RFID system |
US20150356196A1 (en) * | 2014-06-04 | 2015-12-10 | International Business Machines Corporation | Classifying uniform resource locators |
CN105843931A (en) * | 2016-03-30 | 2016-08-10 | 广州酷狗计算机科技有限公司 | Classification method and device |
CN106096042A (en) * | 2016-06-28 | 2016-11-09 | 乐视控股(北京)有限公司 | Data message sorting technique and system |
CN110019785A (en) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | A kind of file classification method and device |
CN108021605A (en) * | 2017-10-30 | 2018-05-11 | 北京奇艺世纪科技有限公司 | A kind of keyword classification method and apparatus |
CN109784351A (en) * | 2017-11-10 | 2019-05-21 | 财付通支付科技有限公司 | Data classification method, disaggregated model training method and device |
CN107911315A (en) * | 2017-11-17 | 2018-04-13 | 成都西加云杉科技有限公司 | Packet classification method and the network equipment |
CN107967322A (en) * | 2017-11-23 | 2018-04-27 | 努比亚技术有限公司 | Document classification display methods, mobile terminal and computer-readable recording medium |
CN110083666A (en) * | 2018-01-25 | 2019-08-02 | 丰田自动车株式会社 | Server unit, Information Collection System, formation gathering method and recording medium |
CN108304882A (en) * | 2018-02-07 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of image classification method, device and server, user terminal, storage medium |
WO2019154262A1 (en) * | 2018-02-07 | 2019-08-15 | 腾讯科技(深圳)有限公司 | Image classification method, server, user terminal, and storage medium |
CN108259811A (en) * | 2018-04-03 | 2018-07-06 | 北京理工大学 | A kind of the covert timing channel device and its construction method of package location adjustment of classifying |
CN110390011A (en) * | 2018-04-12 | 2019-10-29 | 北京京东尚科信息技术有限公司 | The method and apparatus of data classification |
CN108763952A (en) * | 2018-05-03 | 2018-11-06 | 阿里巴巴集团控股有限公司 | A kind of data classification method, device and electronic equipment |
CN110362580A (en) * | 2019-07-25 | 2019-10-22 | 重庆市筑智建信息技术有限公司 | BIM (building information modeling) construction engineering data retrieval optimization classification method and system thereof |
Non-Patent Citations (2)
Title |
---|
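Li Rui; Li Jinguo; Chen Hao: "Research on secure classification protocols in two-tiered sensor networks" *
Rao Wen; Chen Xu: "Optimization and application of massive data query technology based on Bloom filters" *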
Also Published As
Publication number | Publication date |
---|---|
CN112948370B (en) | 2023-04-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||