CN112948370A - Data classification method and device and computer equipment - Google Patents

Data classification method and device and computer equipment

Info

Publication number
CN112948370A
Authority
CN
China
Prior art keywords
data
characteristic value
values
characteristic
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911175983.1A
Other languages
Chinese (zh)
Other versions
CN112948370B (en)
Inventor
唐君行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN201911175983.1A
Publication of CN112948370A
Application granted
Publication of CN112948370B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data classification method comprising the following steps: acquiring data to be classified; calculating M characteristic values of the data to be classified according to a characteristic value calculation rule; comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence; and, when the M characteristic values are all included in a first characteristic value data table, classifying the data to be classified as first category data corresponding to the first characteristic value data table. The invention also provides a data classification device, computer equipment, and a computer-readable storage medium. Because only the compact characteristic values of the data to be classified are compared with the characteristic value data tables, the amount of data processed is reduced, which shortens classification time and improves efficiency.

Description

Data classification method and device and computer equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data classification method and apparatus, a computer device, and a computer-readable storage medium.
Background
In the prior art, data classification generally proceeds by creating data categories from existing data and then comparing the data to be classified with all data in each category one by one to determine whether it belongs to that category. However, enumerating every item in each existing category requires a huge amount of computation, which consumes substantial computer processing resources, takes a long time, and is inefficient.
Disclosure of Invention
In view of this, the present invention provides a data classification method, an apparatus, a computer device, and a computer-readable storage medium, which can solve the problems that a large amount of computer processing resources are required to be consumed and time is consumed in the data classification process.
First, to achieve the above object, the present invention provides a data classification method, including:
acquiring data to be classified; calculating M characteristic values of the data to be classified according to a characteristic value calculation rule; comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence, wherein the characteristic value database at least comprises a first characteristic value data table, and the first characteristic value data table is a set of all characteristic values of data of the same category calculated by the characteristic value calculation rule; and when the M characteristic values are all included in the first characteristic value data table, classifying the data to be classified as first category data corresponding to the first characteristic value data table.
In one example, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions; or dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
In one example, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and a boolean value of 1 is assigned to each characteristic value in the bloom filter.
In one example, the characteristic value database further includes at least a second characteristic value data table, wherein the sequentially comparing the M characteristic values with each of the characteristic value data tables in the characteristic value database includes: sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table; when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
In one example, the method further comprises: when the M characteristic values are not completely included in the first bloom filter and not completely included in the second bloom filter, judging that the data to be classified does not belong to the existing class data; and returning a warning of classification failure.
In addition, to achieve the above object, the present invention also provides a data classification apparatus, comprising:
the acquisition module is used for acquiring data to be classified; the calculation module is used for calculating M characteristic values of the data to be classified according to a characteristic value calculation rule; a comparison module, configured to compare the M feature values with each feature value data table in a feature value database in sequence, where the feature value database at least includes a first feature value data table, and the first feature value data table is a set of all feature values of the same category data calculated by the feature value calculation rule; and the classification module is used for classifying the data to be classified into first class data corresponding to the first characteristic value data table when the M characteristic values are included in the first characteristic value data table.
In one example, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions; or dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
In one example, the characteristic value data table stores all characteristic values of the same category data in a bloom filter manner, each characteristic value has a boolean value of 1 in a corresponding storage order in the bloom filter, the characteristic value database further includes at least a second characteristic value data table, and the comparison module is further configured to: sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table; when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
Further, the present invention also proposes a computer device, which includes a memory and a processor, wherein the memory stores a computer program that can be run on the processor, and the computer program implements the steps of the data classification method as described above when being executed by the processor.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of the data classification method as described above.
Compared with the prior art, the data classification method, the data classification device, the computer equipment, and the computer-readable storage medium provided by the invention calculate M characteristic values of the data to be classified according to the characteristic value calculation rule after the data to be classified is acquired; compare the M characteristic values with each characteristic value data table in a characteristic value database in sequence; and, when the M characteristic values are all included in the first characteristic value data table, classify the data to be classified as first category data corresponding to the first characteristic value data table. In this way, only the compact characteristic values of the data to be classified are compared with the characteristic value data tables, so that the amount of data processed is reduced, the time is shortened, and the efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of an application environment of an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a data classification method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a specific embodiment of step S204 of FIG. 2, in which the M characteristic values are compared in turn with each characteristic value data table in the characteristic value database;
FIG. 4 is a schematic illustration of the effect of the step shown in FIG. 3;
FIG. 5 is a diagram of an alternative hardware architecture for the computer device of the present invention;
FIG. 6 is a block diagram of a data classification apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an application environment according to an embodiment of the present invention. Referring to fig. 1, the computer device 1 is connected to a user terminal and a data server, receives data to be classified sent by the user terminal, and classifies the data to be classified according to a characteristic value database stored in the data server. In the present embodiment, the computer device 1 can be used as a terminal device such as a server, a mobile phone, a user portable device, a PC, and the like. In other embodiments, the computer device 1 may also be a stand-alone functional module, and then attached to a data server or a user terminal to implement the function of data classification. Of course, in this embodiment, the characteristic value database is disposed on the data server, and in other embodiments, the characteristic value database may also be disposed on the computer device 1, which is not limited herein.
FIG. 2 is a flowchart illustrating a data classification method according to an embodiment of the present invention. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is made by taking a computer device as an execution subject.
As shown in fig. 2, the data classification method may include steps S200 to S206, in which:
step S200, acquiring data to be classified.
Specifically, after the computer device 1 is connected to a user terminal, when a user has data to be classified, the data to be classified is sent to the computer device 1 through the user terminal, and then the computer device 1 receives the data to be classified. Of course, in other embodiments, the computer device 1 may also provide an interactive interface, then receive a classification request of a user for the data to be classified stored on the computer device 1 through the interactive interface, and then obtain the data to be classified from the storage unit of the computer device 1 itself.
Step S202, calculating M characteristic values of the data to be classified according to a characteristic value calculation rule.
Specifically, after the computer device 1 acquires the data to be classified, M feature values of the data to be classified are calculated according to a preset feature value calculation rule. In one embodiment, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions, where each hash function computes a corresponding hash value, that is, a feature value, from the data to be classified. In other words, the computer device 1 calculates M feature values of the data to be classified using M different preset hash functions and associates the M feature values with the data to be classified.
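The first calculation rule can be illustrated with a minimal Python sketch. The patent does not name specific hash functions, so the salted SHA-256 construction, the function name `hash_feature_values`, and the `table_size` parameter are assumptions made only for illustration.

```python
import hashlib
from typing import List


def hash_feature_values(data: bytes, m: int, table_size: int) -> List[int]:
    """Return M feature values for `data`, one per (salted) hash function."""
    values = []
    for i in range(m):
        # Assumption: prefixing the index i to the input stands in for
        # "M different hash functions"; the source names no concrete functions.
        digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
        # Reduce the digest to a storage position in [0, table_size).
        values.append(int.from_bytes(digest[:8], "big") % table_size)
    return values


# Hypothetical usage: three feature values for one item to be classified.
features = hash_feature_values(b"example video bytes", m=3, table_size=1024)
```

Each returned value is deterministic for a given input, so the same item always maps to the same M storage positions.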
Of course, in another embodiment, the feature value calculation rule includes: dividing the data to be classified into M parts, and calculating the hash value of each of the M parts through M hash functions. For example, when the data to be classified is large, it may be divided into M parts, and the feature value of each part is then calculated in turn with the M preset hash functions, yielding M corresponding feature values. How the data is divided can be tailored to the characteristics of the data to be classified: for example, video data can be divided by duration, and text data can be divided by paragraph. In summary, for different classification tasks, the computer device 1 may calculate M feature values of the data to be classified according to a preset feature value calculation rule.
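The second calculation rule can be sketched in the same way. The equal-length byte split below is only one possible division and is an assumption; the text suggests content-aware splits (by video duration or by paragraph). The helper names `split_into_parts` and `part_feature_values` are hypothetical.

```python
import hashlib
from typing import List


def split_into_parts(data: bytes, m: int) -> List[bytes]:
    """Split `data` into M contiguous parts of near-equal length."""
    base, extra = divmod(len(data), m)
    parts, start = [], 0
    for i in range(m):
        # The first `extra` parts receive one additional byte each.
        end = start + base + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def part_feature_values(data: bytes, m: int, table_size: int) -> List[int]:
    """Hash the i-th part with the i-th (salted) hash function."""
    values = []
    for i, part in enumerate(split_into_parts(data, m)):
        # Assumption: an index-salted SHA-256 plays the role of hash function i.
        digest = hashlib.sha256(i.to_bytes(4, "big") + part).digest()
        values.append(int.from_bytes(digest[:8], "big") % table_size)
    return values
```

For instance, `split_into_parts(b"abcdefg", 3)` yields the parts `b"abc"`, `b"de"`, and `b"fg"`, each of which is then hashed separately.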
Step S204, comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence. The characteristic value database includes at least a first characteristic value data table, which is the set of all characteristic values of data of the same category calculated by the characteristic value calculation rule.
Step S206, when all the M feature values are included in the first feature value data table, classifying the data to be classified into first class data corresponding to the first feature value data table.
In this embodiment, after the computer device 1 calculates the M feature values of the data to be classified, it sends the M feature values to the data server and requests the data server to compare the M feature values with each feature value data table in the feature value database in sequence. Of course, in other embodiments, the computer device 1 may also obtain the characteristic value database from the data server and then directly compare the M characteristic values with each characteristic value data table in sequence. Each characteristic value data table is obtained by calculating the characteristic values of data of the same category according to the characteristic value calculation rule.
When the computer device 1 determines, by comparison, that the M characteristic values are included in the first characteristic value data table, it is determined that the data to be classified is included in the existing data corresponding to the first characteristic value data table, and therefore, the data to be classified is classified into the first category data corresponding to the first characteristic value data table. And finally, returning the classification result to the user terminal.
In an exemplary embodiment, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and a boolean value of 1 is assigned to each characteristic value in the bloom filter. As shown in fig. 3, when the characteristic value database further includes a second characteristic value data table, the comparing the M characteristic values with each characteristic value data table in the characteristic value database in sequence in step S204 includes steps S300 to S304:
Step S300, sequentially querying whether the boolean values of the storage orders corresponding to the M characteristic values are all 1 in the first bloom filter and the second bloom filter, which correspond to the first characteristic value data table and the second characteristic value data table respectively.
Step S302, when the boolean values of the storage orders corresponding to the M characteristic values are all 1 in the first bloom filter or in the second bloom filter, determining that the M characteristic values are all included in the first characteristic value data table or the second characteristic value data table, respectively.
Step S304, when the M characteristic values are not all included in the first characteristic value data table and not all included in the second characteristic value data table, judging that the data to be classified does not belong to any existing category data, and returning a warning of classification failure.
Specifically, when each characteristic value data table is stored as a bloom filter, the characteristic value database comprises a plurality of bloom filters. Therefore, after the computer device 1 calculates the M feature values of the data to be classified, the M feature values are compared with each bloom filter in sequence to determine whether they are all included in any bloom filter. In this embodiment, a bloom filter is a storage unit of a specific size stored in array form; the storage unit includes storage orders and a boolean value at each storage order, where the storage order is the position within the array and the boolean value is either 1 or 0. Therefore, the computer device 1 sequentially queries whether the boolean values of the storage orders corresponding to the M feature values are all 1 in the first bloom filter and the second bloom filter corresponding to the first feature value data table and the second feature value data table. When the boolean values of the storage orders corresponding to the M feature values are all 1 in the first bloom filter, the M feature values are determined to be all included in the first feature value data table, and likewise for the second bloom filter and the second feature value data table. When the M feature values are not all included in the first feature value data table and not all included in the second feature value data table, the data to be classified is judged not to belong to any existing category data, and a warning of classification failure is returned.
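A characteristic value data table stored as a bloom filter, as described above, can be sketched as a fixed-size array of 0/1 values. The class name `BloomTable`, the array size, and the salted SHA-256 hashing are assumptions for illustration; the patent specifies none of them.

```python
import hashlib
from typing import List


class BloomTable:
    """One characteristic value data table kept as a bloom filter bit array."""

    def __init__(self, size: int, m: int) -> None:
        self.size = size          # number of storage orders (array positions)
        self.m = m                # number of feature values per item
        self.bits = [0] * size    # boolean values, all 0 initially

    def positions(self, data: bytes) -> List[int]:
        """Compute the M storage orders for `data` (assumed salted SHA-256)."""
        return [
            int.from_bytes(
                hashlib.sha256(i.to_bytes(4, "big") + data).digest()[:8], "big"
            ) % self.size
            for i in range(self.m)
        ]

    def add(self, data: bytes) -> None:
        """Record an existing item of this category: set its M positions to 1."""
        for p in self.positions(data):
            self.bits[p] = 1

    def contains_all(self, positions: List[int]) -> bool:
        """True when every queried storage order holds the boolean value 1."""
        return all(self.bits[p] == 1 for p in positions)


table = BloomTable(size=1024, m=3)
table.add(b"known item of the first category")
hit = table.contains_all(table.positions(b"known item of the first category"))
```

Note that a bloom filter can report a hit for an item that was never added (a false positive), so a match indicates probable rather than certain membership; an item that was added is always found.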
Referring to fig. 4, the computer device 1 compares M feature values of the data to be classified with the bloom filter 1 and the bloom filter 2 in sequence, and determines whether the M feature values exist in the bloom filter 1 or the bloom filter 2: in fig. 4(a), when the M feature values do not exist in bloom filter 1 but exist in bloom filter 2, they are classified into the second class data; in fig. 4(B), when the M feature values do not exist in the bloom filter 1 or the bloom filter 2, the classification failure is indicated, and the data to be classified does not belong to the existing class data.
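The sequential comparison of Fig. 4 can be sketched as a loop over the filters in order. Here each filter is represented directly as a list of 0/1 values, and the function name `classify_positions`, the category labels, and the example bit patterns are hypothetical.

```python
from typing import Dict, List, Optional


def classify_positions(
    positions: List[int], filters: Dict[str, List[int]]
) -> Optional[str]:
    """Return the first category whose filter holds 1 at every position.

    `filters` maps a category name to its bit array; dict insertion order
    gives the comparison order. None signals classification failure.
    """
    for category, bits in filters.items():
        if all(bits[p] == 1 for p in positions):
            return category
    return None


# Hypothetical 8-position filters standing in for bloom filters 1 and 2.
filters = {
    "first": [1, 0, 0, 1, 0, 0, 0, 0],
    "second": [0, 1, 1, 0, 1, 0, 0, 0],
}
case_a = classify_positions([1, 2, 4], filters)  # as in Fig. 4(A)
case_b = classify_positions([0, 5], filters)     # as in Fig. 4(B)
```

In this sketch `case_a` is `"second"` because the positions miss filter 1 but are all set in filter 2, and `case_b` is `None`, which a caller would turn into the classification failure warning.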
As can be seen from the above, with the data classification method provided in this embodiment, after the data to be classified is acquired, M feature values of the data to be classified are calculated according to a feature value calculation rule; the M feature values are compared with each feature value data table in a feature value database in sequence; and, when the M feature values are all included in the first feature value data table, the data to be classified is classified as first category data corresponding to the first feature value data table. In this way, only the compact feature values of the data to be classified are compared with the feature value data tables, so that the amount of data processed is reduced, the time is shortened, and the efficiency is improved.
In addition, the present invention also provides a computer device, which is shown in fig. 5 and is a schematic diagram of an optional hardware architecture of the computer device of the present invention.
In this embodiment, the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which may be communicatively connected to each other through a system bus. The computer device 1 is connected to a network (not shown in fig. 5) through the network interface 13, and is connected to a server (not shown in fig. 5) through the network for data interaction. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another communication network.
It is noted that fig. 5 only shows the computer device 1 with components 11-13, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the computer device 1, such as a hard disk or a memory of the computer device 1. In other embodiments, the memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device 1. Of course, the memory 11 may also comprise both an internal storage unit of the computer device 1 and an external storage device thereof. In this embodiment, the memory 11 is generally used for storing an operating system installed in the computer device 1 and various types of application software, such as program codes of the barrier application, program codes of the data classification apparatus 200, and the like. Furthermore, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally used for controlling the overall operation of the computer device 1, such as performing data interaction or communication related control and processing. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, run an application program of the data classification apparatus 200, which is not limited herein.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is generally used for establishing a communication connection between the computer device 1 and a user terminal and a data server.
In this embodiment, when the data classification apparatus 200 is installed and run on the computer device 1, it can acquire the data to be classified and calculate M feature values of the data to be classified according to the feature value calculation rule; compare the M feature values with each feature value data table in a feature value database in sequence; and, when the M feature values are all included in the first feature value data table, classify the data to be classified as first category data corresponding to the first feature value data table. In this way, only the compact feature values of the data to be classified are compared with the feature value data tables, so that the amount of data processed is reduced, the time is shortened, and the efficiency is improved.
The hardware structure and functions of the computer apparatus of the present invention have been described in detail so far. Hereinafter, various embodiments of the present invention will be proposed based on the above-described computer apparatus.
Referring to FIG. 6, a block diagram of a data classification apparatus 200 according to an embodiment of the invention is shown.
In this embodiment, the data classification apparatus 200 includes a series of computer program instructions stored on the memory 11, which, when executed by the processor 12, can implement the data classification function of the embodiment of the present invention. In some embodiments, the data classification apparatus 200 may be divided into one or more modules based on the particular operations implemented by the portions of the computer program instructions. For example, in fig. 6, the data classification apparatus 200 may be divided into an obtaining module 201, a calculating module 202, a comparison module 203, and a classifying module 204. Wherein:
the obtaining module 201 is configured to obtain data to be classified.
Specifically, after the computer device is connected to the user terminal, when the user has data to be classified, the data to be classified is sent to the computer device through the user terminal, and then the obtaining module 201 receives the data to be classified. Of course, in other embodiments, the computer device may also provide an interactive interface, then receive a classification request of a user for the data to be classified stored on the computer device through the interactive interface, and then the obtaining module 201 obtains the data to be classified from the storage unit of the computer device itself.
The calculating module 202 is configured to calculate M feature values of the data to be classified according to a feature value calculation rule.
Specifically, after the obtaining module 201 obtains the data to be classified, the calculating module 202 calculates M feature values of the data to be classified according to a preset feature value calculation rule. In one embodiment, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions, where each hash function computes a corresponding hash value, that is, a feature value, from the data to be classified. In other words, the calculating module 202 calculates M feature values of the data to be classified using M different preset hash functions and associates the M feature values with the data to be classified.
Of course, in another embodiment, the feature value calculation rule includes: dividing the data to be classified into M parts, and calculating the hash value of each of the M parts through M hash functions. For example, when the data to be classified is large, it may be divided into M parts, and the feature value of each part is then calculated in turn with the M preset hash functions, yielding M corresponding feature values. How the data is divided can be tailored to the characteristics of the data to be classified: for example, video data can be divided by duration, and text data can be divided by paragraph. In short, for different classification tasks, the calculating module 202 may calculate M feature values of the data to be classified according to a preset feature value calculation rule.
The comparison module 203 is configured to compare the M characteristic values with each characteristic value data table in a characteristic value database in sequence. The characteristic value database includes at least a first characteristic value data table, which is the set of all characteristic values calculated, by the characteristic value calculation rule, from data of the same category.
The classifying module 204 is configured to classify the data to be classified into first class data corresponding to the first characteristic value data table when all the M characteristic values are included in the first characteristic value data table.
In this embodiment, after the calculating module 202 calculates the M characteristic values of the data to be classified, the comparison module 203 sends the M characteristic values to the data server and requests the data server to compare them with each characteristic value data table in the characteristic value database in sequence. In other embodiments, the comparison module 203 may instead obtain the characteristic value database from the data server and then compare the M characteristic values with each characteristic value data table directly. Each characteristic value data table is obtained by calculating, according to the characteristic value calculation rule, the characteristic values of data belonging to the same category.
When the comparison module 203 determines through comparison that the M characteristic values are all included in the first characteristic value data table, the data to be classified is considered to belong to the existing data corresponding to the first characteristic value data table, and the classification module 204 therefore classifies the data to be classified as the first category data corresponding to the first characteristic value data table. Finally, the classification result is returned to the user terminal.
In an exemplary embodiment, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and the boolean value at the storage order corresponding to each characteristic value in the bloom filter is set to 1. When the characteristic value database further includes a second characteristic value data table, the comparison module 203 is further configured to query, in sequence in a first bloom filter and a second bloom filter corresponding respectively to the first characteristic value data table and the second characteristic value data table, whether the boolean values at the storage orders corresponding to the M characteristic values are all 1; when the boolean values at the storage orders corresponding to the M characteristic values are all 1 in the first bloom filter or in the second bloom filter, it is determined that the M characteristic values are all included in the first characteristic value data table or the second characteristic value data table, respectively. The classification module 204 is further configured to determine, when the M characteristic values are completely included in neither the first characteristic value data table nor the second characteristic value data table, that the data to be classified does not belong to the existing category data, and to return a warning of classification failure.
Specifically, when the characteristic value data tables are implemented as bloom filters, the characteristic value database consists of a plurality of bloom filters. Therefore, after the calculating module 202 calculates the M characteristic values of the data to be classified, the comparison module 203 compares the M characteristic values with each bloom filter in sequence and determines whether the M characteristic values are all included in any one bloom filter. In this embodiment, a bloom filter is a storage unit of a fixed size stored in array form; the storage unit comprises storage orders and a boolean value at each storage order, where a storage order is a position in the arrangement of the storage unit and each boolean value is either 1 or 0. The comparison module 203 therefore queries, in sequence in the first bloom filter and the second bloom filter corresponding respectively to the first characteristic value data table and the second characteristic value data table, whether the boolean values at the storage orders corresponding to the M characteristic values are all 1. When the boolean values at the storage orders corresponding to the M characteristic values in the first bloom filter are all 1, the comparison module 203 determines that the M characteristic values are all included in the first characteristic value data table; when the comparison module 203 determines that the M characteristic values are completely included in neither the first characteristic value data table nor the second characteristic value data table, the classification module 204 determines that the data to be classified does not belong to the existing category data and returns a warning of classification failure.
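The storage-order lookup described above can be sketched as follows. The specification does not state how a characteristic value is mapped to a storage order, so the sketch assumes the position is the value modulo the filter size; the class name `CategoryBloomFilter`, the filter size of 1024, and the sample values are all hypothetical. One byte per position is used for readability, whereas a real filter would normally pack bits.

```python
from typing import Iterable


class CategoryBloomFilter:
    """Fixed-size array of 0/1 values holding the characteristic values of one category."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.bits = bytearray(size)  # every storage order starts at 0

    def _position(self, value: int) -> int:
        # Assumption: the storage order is the characteristic value modulo the array size.
        return value % self.size

    def add(self, values: Iterable[int]) -> None:
        """Set the boolean value at each corresponding storage order to 1."""
        for value in values:
            self.bits[self._position(value)] = 1

    def contains_all(self, values: Iterable[int]) -> bool:
        """True only if every queried storage order holds 1."""
        return all(self.bits[self._position(v)] == 1 for v in values)


first_filter = CategoryBloomFilter(1024)
first_filter.add([3, 17, 1029])               # 1029 % 1024 == 5
print(first_filter.contains_all([3, 17, 1029]))  # True
print(first_filter.contains_all([3, 4]))         # False: position 4 is still 0
print(first_filter.contains_all([5]))            # True, although 5 was never added
```

The last query shows the general bloom filter trade-off: a lookup can report a value that was never stored when positions collide, but it never misses a value that was stored.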
Referring to Fig. 4, the comparison module 203 compares the M characteristic values of the data to be classified with bloom filter 1 and bloom filter 2 in sequence, and determines whether the M characteristic values exist in bloom filter 1 or bloom filter 2. In Fig. 4(A), when the comparison module 203 determines that the M characteristic values are not present in bloom filter 1 but are present in bloom filter 2, the classification module 204 classifies the data to be classified into the second category data. In Fig. 4(B), when the comparison module 203 determines that the M characteristic values exist in neither bloom filter 1 nor bloom filter 2, the classification module 204 reports that the classification has failed and that the data to be classified does not belong to the existing category data.
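The two cases of Fig. 4 can be reproduced with a short sketch that checks each filter in order and stops at the first one containing all M characteristic values. As before, the modulo mapping from value to storage order is an assumption, and the helper names `make_filter` and `classify`, the filter size of 64, and the sample values are hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple


def make_filter(size: int, known_values: Iterable[int]) -> bytearray:
    """Build a 0/1 array with a 1 at the storage order of every known characteristic value."""
    bits = bytearray(size)
    for value in known_values:
        bits[value % size] = 1  # assumed mapping from value to storage order
    return bits


def classify(
    values: Sequence[int], filters: Sequence[Tuple[str, bytearray]]
) -> Optional[str]:
    """Return the first category whose filter holds 1 at every queried position, else None."""
    for category, bits in filters:
        if all(bits[v % len(bits)] == 1 for v in values):
            return category
    return None  # caller should report a classification failure


filters = [
    ("first category", make_filter(64, [1, 2, 3])),
    ("second category", make_filter(64, [10, 20, 30])),
]
print(classify([10, 20, 30], filters))  # second category, as in Fig. 4(A)
print(classify([40, 50, 60], filters))  # None, as in Fig. 4(B)
```

Returning `None` stands in for the warning of classification failure; how that warning is delivered to the user terminal is left open here.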
As can be seen from the above, after acquiring the data to be classified, the computer device calculates M characteristic values of the data to be classified according to the characteristic value calculation rule, compares the M characteristic values with each characteristic value data table in a characteristic value database in sequence, and, when the M characteristic values are all included in the first characteristic value data table, classifies the data to be classified into the first category data corresponding to the first characteristic value data table. In this way, only the compact characteristic values of the data to be classified need to be compared with the characteristic value data tables, rather than the data itself, which reduces the amount of data processed, shortens the classification time, and improves efficiency.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
The above description covers only preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any equivalent structure or equivalent process modification made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of protection of the present invention.

Claims (10)

1. A method of data classification, the method comprising:
acquiring data to be classified;
calculating M characteristic values of the data to be classified according to a characteristic value calculation rule;
comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence, wherein the characteristic value database at least comprises a first characteristic value data table, and the first characteristic value data table is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule;
and when the M characteristic values are included in the first characteristic value data table, classifying the data to be classified into first class data corresponding to the first characteristic value data table.
2. The data classification method of claim 1, wherein the characteristic value calculation rule comprises:
calculating M hash values of the data to be classified through M different hash functions; or
dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
3. The data classification method according to claim 1 or 2, characterized in that the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and the boolean value at the storage order corresponding to each characteristic value in the bloom filter is 1.
4. The data classification method according to claim 3, wherein the characteristic value database further includes at least a second characteristic value data table, and wherein the sequentially comparing the M characteristic values with each of the characteristic value data tables in the characteristic value database includes:
sequentially inquiring, in a first bloom filter and a second bloom filter corresponding respectively to the first characteristic value data table and the second characteristic value data table, whether the boolean values at the storage orders corresponding to the M characteristic values are all 1;
when the boolean values at the storage orders corresponding to the M characteristic values in the first bloom filter or the second bloom filter are all 1, determining that the M characteristic values are all included in the first characteristic value data table or the second characteristic value data table.
5. The data classification method of claim 4, characterized in that the method further comprises:
when the M characteristic values are not completely included in the first bloom filter and not completely included in the second bloom filter, judging that the data to be classified does not belong to the existing class data;
and returning a warning of classification failure.
6. An apparatus for classifying data, the apparatus comprising:
the acquisition module is used for acquiring data to be classified;
the calculation module is used for calculating M characteristic values of the data to be classified according to a characteristic value calculation rule;
the comparison module is used for comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence, wherein the characteristic value database at least comprises a first characteristic value data table, and the first characteristic value data table is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule;
and the classification module is used for classifying the data to be classified into first class data corresponding to the first characteristic value data table when the M characteristic values are included in the first characteristic value data table.
7. The data classification apparatus of claim 6, wherein the characteristic value calculation rule comprises:
calculating M hash values of the data to be classified through M different hash functions; or
dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
8. The data classification apparatus according to claim 6, wherein the characteristic value data table stores all characteristic values of the same category data in a bloom filter, the boolean value at the storage order corresponding to each characteristic value in the bloom filter is 1, the characteristic value database further includes at least a second characteristic value data table, and the comparison module is further configured to:
sequentially inquire, in a first bloom filter and a second bloom filter corresponding respectively to the first characteristic value data table and the second characteristic value data table, whether the boolean values at the storage orders corresponding to the M characteristic values are all 1;
when the boolean values at the storage orders corresponding to the M characteristic values in the first bloom filter or the second bloom filter are all 1, determine that the M characteristic values are all included in the first characteristic value data table or the second characteristic value data table.
9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory having stored thereon a computer program executable on the processor, the computer program, when executed by the processor, implementing the steps of the data classification method according to any one of claims 1-5.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executable by at least one processor to cause the at least one processor to perform the steps of the data classification method according to any one of claims 1-5.
CN201911175983.1A 2019-11-26 2019-11-26 Data classification method and device and computer equipment Active CN112948370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911175983.1A CN112948370B (en) 2019-11-26 2019-11-26 Data classification method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN112948370A true CN112948370A (en) 2021-06-11
CN112948370B CN112948370B (en) 2023-04-11

Family

ID=76225198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911175983.1A Active CN112948370B (en) 2019-11-26 2019-11-26 Data classification method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN112948370B (en)

Also Published As

Publication number Publication date
CN112948370B (en) 2023-04-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant