CN112948370B - Data classification method and device and computer equipment - Google Patents

Data classification method and device and computer equipment

Info

Publication number
CN112948370B
Authority
CN
China
Prior art keywords
data
values
characteristic value
characteristic
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911175983.1A
Other languages
Chinese (zh)
Other versions
CN112948370A (en
Inventor
唐君行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN201911175983.1A
Publication of CN112948370A
Application granted
Publication of CN112948370B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data classification method, which comprises the following steps: acquiring data to be classified; calculating M characteristic values of the data to be classified according to a characteristic value calculation rule; comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence; and when the M characteristic values are all included in a first characteristic value data table, classifying the data to be classified into first class data corresponding to the first characteristic value data table. The invention also provides a data classification device, computer equipment and a computer-readable storage medium. Because only the compact characteristic values of the data to be classified are compared with the characteristic value data tables, rather than the data itself, the amount of data processed is greatly reduced, the time is shortened, and the efficiency is improved.

Description

Data classification method and device and computer equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data classification method and apparatus, a computer device, and a computer-readable storage medium.
Background
In the prior art, data classification generally proceeds by creating data classes from existing data and then comparing the data to be classified with all data in each data class one by one to determine whether it belongs to that class. However, classifying by enumerating every item of every existing class requires a huge amount of computation, which consumes substantial computer processing resources, takes a long time, and is inefficient.
Disclosure of Invention
In view of this, the present invention provides a data classification method, an apparatus, a computer device, and a computer-readable storage medium, which can solve the problems that a large amount of computer processing resources are required to be consumed and time is consumed in the data classification process.
First, to achieve the above object, the present invention provides a data classification method, including:
acquiring data to be classified; calculating M characteristic values of the data to be classified according to a characteristic value calculation rule; comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence, wherein the characteristic value database at least comprises a first characteristic value data table, and the first characteristic value data table is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule; and when the M characteristic values are all included in the first characteristic value data table, classifying the data to be classified into first class data corresponding to the first characteristic value data table.
In one example, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions; or dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
In one example, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and a boolean value of 1 is assigned to each characteristic value in the bloom filter.
In one example, the characteristic value database further includes at least a second characteristic value data table, wherein the sequentially comparing the M characteristic values with each of the characteristic value data tables in the characteristic value database includes: sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table; when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
In one example, the method further comprises: when the M characteristic values are not completely included in the first bloom filter and not completely included in the second bloom filter, judging that the data to be classified does not belong to the existing class data; and returning a warning of classification failure.
In addition, to achieve the above object, the present invention also provides a data classification apparatus, comprising:
the acquisition module is used for acquiring data to be classified; the calculation module is used for calculating M characteristic values of the data to be classified according to a characteristic value calculation rule; a comparison module, configured to compare the M feature values with each feature value data table in a feature value database in sequence, where the feature value database at least includes a first feature value data table, and the first feature value data table is a set of all feature values of the same category data calculated by the feature value calculation rule; and the classification module is used for classifying the data to be classified into first class data corresponding to the first characteristic value data table when the M characteristic values are included in the first characteristic value data table.
In one example, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions; or dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
In one example, the characteristic value data table stores all characteristic values of the same category data in a bloom filter manner, each characteristic value has a boolean value of 1 in a corresponding storage order in the bloom filter, the characteristic value database further includes at least a second characteristic value data table, and the comparison module is further configured to: sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table; when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
Further, the present invention also proposes a computer device, which includes a memory and a processor, wherein the memory stores a computer program that can be run on the processor, and the computer program implements the steps of the data classification method as described above when being executed by the processor.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of the data classification method as described above.
Compared with the prior art, the data classification method, the data classification device, the computer equipment and the computer-readable storage medium of the present invention can, after the data to be classified is obtained, calculate M characteristic values of the data to be classified according to the characteristic value calculation rule; compare the M characteristic values with each characteristic value data table in a characteristic value database in sequence; and, when the M characteristic values are all included in the first characteristic value data table, classify the data to be classified into first class data corresponding to the first characteristic value data table. In this way, only the compact characteristic values of the data to be classified are compared with the characteristic value data tables, so that the amount of data processed is greatly reduced, the time is shortened, and the efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of an application environment of an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a data classification method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a specific embodiment of the process of comparing the M characteristic values with each of the characteristic value data tables in the characteristic value database in turn in step S204 of FIG. 2;
FIG. 4 is a schematic illustration of the effect of the step shown in FIG. 3;
FIG. 5 is a diagram of an alternative hardware architecture for the computer device of the present invention;
FIG. 6 is a block diagram of a data classification apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", etc. in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, such a combination should be considered not to exist and is not within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an application environment according to an embodiment of the present invention. Referring to fig. 1, the computer device 1 is connected to a user terminal and a data server, receives data to be classified sent by the user terminal, and classifies the data to be classified according to a characteristic value database stored in the data server. In the present embodiment, the computer device 1 can be used as a terminal device such as a server, a mobile phone, a user portable device, a PC, and the like. In other embodiments, the computer device 1 may also be a stand-alone functional module, and then attached to a data server or a user terminal to implement the function of data classification. Of course, in this embodiment, the characteristic value database is disposed on the data server, and in other embodiments, the characteristic value database may also be disposed on the computer device 1, which is not limited herein.
FIG. 2 is a flowchart illustrating a data classification method according to an embodiment of the present invention. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is made by taking a computer device as an execution subject.
As shown in fig. 2, the data classification method may include steps S200 to S206, in which:
step S200, acquiring data to be classified.
Specifically, after the computer device 1 is connected to a user terminal, when a user has data to be classified, the data to be classified is sent to the computer device 1 through the user terminal, and then the computer device 1 receives the data to be classified. Of course, in other embodiments, the computer device 1 may also provide an interactive interface, then receive a classification request of a user for the data to be classified stored on the computer device 1 through the interactive interface, and then obtain the data to be classified from the storage unit of the computer device 1 itself.
Step S202, calculating M characteristic values of the data to be classified according to a characteristic value calculation rule.
Specifically, after the computer device 1 acquires the data to be classified, M feature values of the data to be classified are calculated according to a preset feature value calculation rule. In one embodiment, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions, where each hash function computes a corresponding hash value, namely a feature value, from the data to be classified. That is, the computer device 1 calculates M feature values of the data to be classified by M different preset hash functions, and associates the M feature values with the data to be classified.
Of course, in another specific embodiment, the feature value calculation rule includes: dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions. For example, when the data to be classified belongs to large-capacity data, the data to be classified may be divided into M parts, and then the characteristic values of the data to be classified are calculated sequentially according to M preset hash functions, so as to obtain M corresponding characteristic values. The process of dividing the data to be classified can be set differently according to the characteristics of the data to be classified, for example, in the process of classifying video data, the data to be classified can be divided according to the duration of video; in the process of classifying the text data, the data to be classified can be divided according to paragraphs. In summary, for different data classifications, the computer device 1 may calculate M feature values of the data to be classified according to a preset feature value calculation rule.
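The two feature value calculation rules described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the source does not name any concrete hash function, so a SHA-256 digest salted with a seed stands in for the "M different hash functions", and the names hash_with_seed, feature_values_whole and feature_values_split are hypothetical. The equal-length byte split is likewise an assumption; the text suggests splitting video by duration or text by paragraph.

```python
import hashlib
from typing import List


def hash_with_seed(data: bytes, seed: int) -> int:
    """One member of a family of hash functions (assumption: seeded SHA-256)."""
    digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
    return int.from_bytes(digest[:8], "big")


def feature_values_whole(data: bytes, m: int) -> List[int]:
    """Rule 1: apply M different hash functions to the whole data."""
    return [hash_with_seed(data, seed) for seed in range(m)]


def feature_values_split(data: bytes, m: int) -> List[int]:
    """Rule 2: divide the data into M parts and hash part i with hash function i."""
    part_len = -(-len(data) // m)  # ceiling division, so exactly M slices cover the data
    parts = [data[i * part_len:(i + 1) * part_len] for i in range(m)]
    return [hash_with_seed(part, seed) for seed, part in enumerate(parts)]
```

Either function returns exactly M integers, which is the only property the later comparison step relies on.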
Step S204, comparing the M characteristic values with each characteristic value data table in a characteristic value database in sequence. Wherein the characteristic value database includes at least a first characteristic value data table which is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule.
Step S206, when all the M feature values are included in the first feature value data table, classifying the data to be classified into first class data corresponding to the first feature value data table.
In this embodiment, after the computer device 1 calculates M feature values of the data to be classified, the M feature values are sent to the data server, and the data server is requested to compare the M feature values with each feature value data table in a feature value database in sequence. Of course, in other embodiments, the computer device 1 may also obtain the characteristic value database from the data server, and then directly compare the M characteristic values with each characteristic value data table in the characteristic value database in sequence. Each characteristic value data table is obtained by calculating the characteristic values of data of the same category according to the characteristic value calculation rule.
When the computer device 1 determines that the M feature values are included in the first feature value data table by comparison, the data to be classified is considered to be included in the existing data corresponding to the first feature value data table, and thus, the data to be classified is classified into the first category data corresponding to the first feature value data table. And finally, returning the classification result to the user terminal.
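Steps S204 and S206 can be illustrated with a minimal sketch in which each characteristic value data table is modelled as a plain Python set. The function name classify and the dictionary layout of the database are assumptions made for illustration; the source does not prescribe a data structure at this point.

```python
from typing import Dict, Iterable, Optional, Set


def classify(values: Iterable[int], database: Dict[str, Set[int]]) -> Optional[str]:
    """Return the first category whose table contains all M values, else None."""
    values = list(values)
    for category, table in database.items():  # tables are visited in insertion order
        if all(v in table for v in values):
            return category
    return None


# Hypothetical database with two characteristic value data tables.
database = {"first": {1, 2, 3, 4}, "second": {5, 6, 7}}
result = classify([5, 7], database)
```

Here result is "second", because both values appear in the second table but not both in the first; a mixed input such as [1, 5] matches no single table and yields None.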
In an exemplary embodiment, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and a boolean value of 1 is assigned to each characteristic value in the bloom filter. As shown in fig. 3, when the characteristic value database further includes a second characteristic value data table, the comparing the M characteristic values with each characteristic value data table in the characteristic value database in sequence in step S204 includes steps S300 to S304:
and step S300, sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in the first bloom filter and the second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table.
Step S302, when the boolean values of the storage orders corresponding to the M characteristic values in the first bloom filter or the second bloom filter are all 1, determining that the M characteristic values are all included in the first characteristic value data table or the second characteristic value data table.
Step S304, when the M characteristic values are not completely included in the first characteristic value data table or the second characteristic value data table, judging that the data to be classified does not belong to the existing class data, and returning a warning of classification failure.
Specifically, when each feature value data table is implemented as a bloom filter, the feature value database consists of a plurality of bloom filters. Therefore, after the computer device 1 calculates the M feature values of the data to be classified, the M feature values are compared with each bloom filter in sequence to determine whether the M feature values are all included in any one bloom filter. In this embodiment, a bloom filter is a fixed-size array of storage units; each storage unit has a storage order, that is, its position in the array, and holds a boolean value of 1 or 0. Therefore, the computer device 1 sequentially queries, in the first bloom filter and the second bloom filter corresponding to the first feature value data table and the second feature value data table, whether the boolean values at the storage orders corresponding to the M feature values are all 1. When the boolean values at the storage orders corresponding to the M feature values in the first bloom filter are all 1, it is determined that all the M feature values are included in the first feature value data table; and when the M feature values are not completely included in the first feature value data table and not completely included in the second feature value data table, it is determined that the data to be classified does not belong to the existing class data, and a warning of classification failure is returned.
Referring to fig. 4, the computer device 1 compares M feature values of the data to be classified with the bloom filter 1 and the bloom filter 2 in sequence, and determines whether the M feature values exist in the bloom filter 1 or the bloom filter 2: in fig. 4 (a), when the M feature values do not exist in bloom filter 1 but exist in bloom filter 2, they are classified into second class data; in fig. 4 (B), when the M feature values do not exist in bloom filter 1 or bloom filter 2, the classification is not successful, and the data to be classified does not belong to the existing class data.
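The bloom filter comparison of steps S300 to S304 and the two cases of fig. 4 can be sketched as below. The class BloomFilter, the function classify_with_filters, the array size of 1024 and the use of four seeded SHA-256 hashes are all illustrative assumptions, not details taken from the source. Each feature value is reduced modulo the array size to obtain its storage order.

```python
import hashlib
from typing import List, Optional, Sequence


class BloomFilter:
    """Fixed-size array of 0/1 values indexed by storage order."""

    def __init__(self, size: int = 1024, num_hashes: int = 4) -> None:
        self.size = size
        self.num_hashes = num_hashes  # plays the role of M
        self.bits = [0] * size

    def _positions(self, item: bytes) -> List[int]:
        # M feature values, each mapped to a storage order in the array.
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.size)
        return positions

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        # True only when the boolean values at all M storage orders are 1.
        return all(self.bits[pos] == 1 for pos in self._positions(item))


def classify_with_filters(item: bytes, filters: Sequence[BloomFilter]) -> Optional[int]:
    """Query the filters in sequence; return the 1-based class index or None."""
    for index, bloom in enumerate(filters, start=1):
        if bloom.might_contain(item):
            return index
    return None  # corresponds to the classification failure warning
```

A bloom filter never misses an item that was added, but it can report a false positive, so an "all 1" result should be read as "probably in this class"; a larger array or a different number of hash functions changes that probability.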
As can be seen from the above, the data classification method provided in this embodiment can, after acquiring the data to be classified, calculate M feature values of the data to be classified according to a feature value calculation rule; compare the M feature values with each feature value data table in a feature value database in sequence; and, when the M feature values are all included in the first feature value data table, classify the data to be classified into first class data corresponding to the first feature value data table. In this way, only the compact feature values of the data to be classified are compared with the feature value data tables, so that the amount of data processed is greatly reduced, the time is shortened, and the efficiency is improved.
In addition, the present invention also provides a computer device, which is shown in fig. 5 and is a schematic diagram of an optional hardware architecture of the computer device of the present invention.
In this embodiment, the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which may be communicatively connected to each other through a system bus. The computer device 1 is connected to a network (not shown in fig. 5) through the network interface 13, and is connected to a server (not shown in fig. 5) through the network for data interaction. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another communication network.
It is noted that fig. 5 only shows the computer device 1 with components 11-13, but it is to be understood that not all of the shown components are required, and that more or fewer components may be implemented instead.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the computer device 1, such as a hard disk or a memory of the computer device 1. In other embodiments, the memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card provided on the computer device 1. Of course, the memory 11 may also comprise both an internal storage unit of the computer device 1 and an external storage device thereof. In this embodiment, the memory 11 is generally used for storing an operating system installed in the computer device 1 and various types of application software, such as program codes of the barrier application, and program codes of the data classification apparatus 200. Furthermore, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is typically used for controlling the overall operation of the computer device 1, such as performing data interaction or communication related control and processing. In this embodiment, the processor 12 is configured to run a program code stored in the memory 11 or process data, for example, an application program of the data classification apparatus 200, which is not limited herein.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is generally used for establishing a communication connection between the computer device 1 and a user terminal and a data server.
In this embodiment, when the data classification apparatus 200 is installed and run in the computer device 1, it can acquire data to be classified and calculate M feature values of the data to be classified according to a feature value calculation rule; compare the M feature values with each feature value data table in a feature value database in sequence; and, when the M feature values are all included in the first feature value data table, classify the data to be classified into first class data corresponding to the first feature value data table. In this way, only the compact feature values of the data to be classified are compared with the feature value data tables, so that the amount of data processed is greatly reduced, the time is shortened, and the efficiency is improved.
The hardware structure and functions of the computer apparatus of the present invention have been described in detail so far. Hereinafter, various embodiments of the present invention will be proposed based on the above-described computer apparatus.
Referring to FIG. 6, a block diagram of a data classification apparatus 200 according to an embodiment of the invention is shown.
In this embodiment, the data classification apparatus 200 includes a series of computer program instructions stored on the memory 11, which when executed by the processor 12, can implement the data classification function of the embodiment of the present invention. In some embodiments, the data classification apparatus 200 may be divided into one or more modules based on the particular operations implemented by the portions of the computer program instructions. For example, in fig. 6, the data classification apparatus 200 may be divided into an acquisition module 201, a calculation module 202, a comparison module 203, and a classification module 204. Wherein:
the obtaining module 201 is configured to obtain data to be classified.
Specifically, after the computer device is connected to the user terminal, when the user has data to be classified, the data to be classified is sent to the computer device through the user terminal, and then the obtaining module 201 receives the data to be classified. Of course, in other embodiments, the computer device may also provide an interactive interface, then receive a classification request of the user for the data to be classified stored on the computer device through the interactive interface, and then the obtaining module 201 obtains the data to be classified from the storage unit of the computer device itself.
The calculating module 202 is configured to calculate M feature values of the data to be classified according to a feature value calculation rule.
Specifically, after the obtaining module 201 obtains the data to be classified, the calculating module 202 calculates M feature values of the data to be classified according to a preset feature value calculation rule. In one embodiment, the feature value calculation rule includes: calculating M hash values of the data to be classified through M different hash functions, where each hash function computes a corresponding hash value, namely a feature value, from the data to be classified. That is, the calculating module 202 calculates M feature values of the data to be classified by M different preset hash functions, and associates the M feature values with the data to be classified.
Of course, in another embodiment, the feature value calculation rule includes: dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions. For example, when the data to be classified belongs to large-capacity data, the data to be classified may be divided into M parts, and then the characteristic values of the data to be classified are calculated sequentially according to M preset hash functions, so as to obtain M corresponding characteristic values. The process of dividing the data to be classified can be set in a differentiation manner according to the characteristics of the data to be classified, for example, in the process of classifying video data, the data to be classified can be divided according to the video duration; in the process of classifying the text data, the data to be classified can be divided according to paragraphs. In short, for different data classifications, the calculating module 202 may calculate M feature values of the data to be classified according to a preset feature value calculating rule.
The comparison module 203 is configured to compare the M characteristic values with each characteristic value data table in a characteristic value database in sequence. Wherein the characteristic value database includes at least a first characteristic value data table which is a set of all characteristic values of the same category data calculated by the characteristic value calculation rule.
The classifying module 204 is configured to classify the data to be classified into first class data corresponding to the first characteristic value data table when all the M characteristic values are included in the first characteristic value data table.
In this embodiment, after the calculating module 202 calculates the M feature values of the data to be classified, the comparison module 203 sends the M feature values to the data server and requests the data server to compare them in sequence with each characteristic value data table in the characteristic value database. In other embodiments, the comparison module 203 may instead obtain the characteristic value database from the data server and then compare the M feature values directly with each characteristic value data table in sequence. Each characteristic value data table is obtained by calculating, according to the feature value calculation rule, the characteristic values of data of the same category.
When the comparison module 203 determines through comparison that the M feature values are all included in the first characteristic value data table, the data to be classified is considered to belong to the existing data corresponding to that table, so the classification module 204 classifies the data to be classified into the first class data corresponding to the first characteristic value data table. Finally, the classification result is returned to the user terminal.
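The comparison and classification steps can be sketched with each characteristic value data table modeled as a plain Python set. The set representation, the function name `classify`, and the category labels are assumptions for illustration; this passage of the description does not prescribe a storage structure.

```python
from typing import Optional


def classify(features: list[int], tables: dict[str, set[int]]) -> Optional[str]:
    """Return the first category whose table contains all features, else None."""
    # Dictionaries preserve insertion order, so tables are compared in sequence.
    for category, table in tables.items():
        if all(value in table for value in features):
            return category
    return None


# Hypothetical tables holding characteristic values of two categories.
tables = {
    "first class": {11, 12, 13, 14},
    "second class": {21, 22, 23, 24},
}
result = classify([12, 14], tables)
```

A mixed input such as `[12, 21]` matches neither table completely, so the function returns `None`.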
In an exemplary embodiment, the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and the boolean value at the storage order corresponding to each characteristic value in the bloom filter is 1. When the characteristic value database further includes a second characteristic value data table, the comparison module 203 is further configured to query in sequence, in a first bloom filter and a second bloom filter corresponding respectively to the first characteristic value data table and the second characteristic value data table, whether the boolean values at the storage orders corresponding to the M characteristic values are all 1; when they are all 1 in the first bloom filter or in the second bloom filter, the M characteristic values are determined to be all included in the first characteristic value data table or the second characteristic value data table, respectively. The classification module 204 is further configured to, when the M characteristic values are completely included in neither the first characteristic value data table nor the second characteristic value data table, determine that the data to be classified does not belong to any existing category data, and return a warning of classification failure.
Specifically, when the characteristic value data tables are implemented as bloom filters, the characteristic value database consists of a plurality of bloom filters. Therefore, after the calculating module 202 calculates the M feature values of the data to be classified, the comparison module 203 compares the M feature values with each bloom filter in sequence and determines whether the M feature values are all included in any one bloom filter. In this embodiment, a bloom filter is a storage unit of a specific size stored in array form; the storage unit includes storage orders and a boolean value at each storage order, where a storage order is a position in the arrangement of the storage unit and the boolean value is either 1 or 0. Accordingly, the comparison module 203 queries in sequence, in the first bloom filter and the second bloom filter corresponding respectively to the first characteristic value data table and the second characteristic value data table, whether the boolean values at the storage orders corresponding to the M feature values are all 1. When the boolean values at the storage orders corresponding to the M feature values in the first bloom filter are all 1, the comparison module 203 determines that the M feature values are all included in the first characteristic value data table, and likewise for the second bloom filter and the second characteristic value data table; when the comparison module 203 determines that the M feature values are completely included in neither the first characteristic value data table nor the second characteristic value data table, the classification module 204 determines that the data to be classified does not belong to any existing category data and returns a warning of classification failure.
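The storage unit described above can be sketched as a fixed-size array of 0/1 values. Mapping a feature value to its storage order by taking the value modulo the array size, as well as the class and method names, are assumptions of this sketch; the description does not specify how a feature value is mapped to a position.

```python
class BloomFilter:
    """Fixed-size array of boolean values (0 or 1), indexed by storage order."""

    def __init__(self, size: int = 1024) -> None:
        self.size = size
        self.bits = bytearray(size)  # every position starts at 0

    def position(self, value: int) -> int:
        # Assumption: the storage order is the feature value modulo the size.
        return value % self.size

    def add(self, features: list[int]) -> None:
        """Set the boolean value to 1 at the position of each feature value."""
        for value in features:
            self.bits[self.position(value)] = 1

    def contains_all(self, features: list[int]) -> bool:
        """True only if every corresponding boolean value is 1."""
        return all(self.bits[self.position(v)] == 1 for v in features)


bloom = BloomFilter(size=16)
bloom.add([3, 20, 37])  # sets positions 3, 4 and 5
hit = bloom.contains_all([3, 20, 37])
miss = bloom.contains_all([3, 6])
```

Distinct feature values can share a position (in this sketch, 19 maps to the same position as 3), so a bloom filter can report a value as included even though it was never stored; a larger array makes such collisions less likely.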
Referring to Fig. 4, the comparison module 203 compares the M feature values of the data to be classified with bloom filter 1 and bloom filter 2 in sequence, and determines whether the M feature values exist in bloom filter 1 or bloom filter 2. In Fig. 4(A), when the comparison module 203 determines that the M feature values are not present in bloom filter 1 but are present in bloom filter 2, the classification module 204 classifies the data to be classified into the second category data. In Fig. 4(B), when the comparison module 203 determines that the M feature values exist in neither bloom filter 1 nor bloom filter 2, the classification module 204 reports that the classification has failed and that the data to be classified does not belong to any existing category data.
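The two cases of Fig. 4 can be sketched as a sequential query over two filters. The array contents, the positions queried, the helper names, and the wording of the warning are all hypothetical; the positions stand for the storage orders already derived from the M feature values.

```python
def make_bits(size: int, ones: list[int]) -> bytearray:
    """Build a 0/1 array with the given positions set to 1."""
    bits = bytearray(size)
    for position in ones:
        bits[position] = 1
    return bits


def classify_by_filters(positions: list[int],
                        filters: list[tuple[str, bytearray]]) -> str:
    """Query the filters in order; return the first match or a failure warning."""
    for category, bits in filters:
        if all(bits[p] == 1 for p in positions):
            return category
    return "classification failed: data does not belong to an existing category"


filters = [
    ("first category", make_bits(8, [0, 1, 2])),   # bloom filter 1
    ("second category", make_bits(8, [3, 4, 5])),  # bloom filter 2
]
case_a = classify_by_filters([3, 5], filters)  # absent from filter 1, present in filter 2
case_b = classify_by_filters([2, 3], filters)  # completely present in neither filter
```

In case (B) each filter holds one of the two positions, but neither holds both, so the query ends with the failure warning.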
As can be seen from the above, after acquiring the data to be classified, the computer device calculates M feature values of the data to be classified according to the feature value calculation rule; compares the M feature values in sequence with each characteristic value data table in the characteristic value database; and, when the M feature values are all included in the first characteristic value data table, classifies the data to be classified into the first class data corresponding to the first characteristic value data table. In this way, only the simple feature values of the data to be classified need to be compared with the characteristic value data tables, which greatly reduces the amount of data processed, shortens the classification time, and improves efficiency.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A method of data classification, the method comprising:
acquiring data to be classified;
calculating M characteristic values of the data to be classified according to a characteristic value calculation rule;
comparing the M characteristic values with each characteristic value data table in a characteristic value data base in sequence, wherein the characteristic value data base at least comprises a first characteristic value data table which is a set of all characteristic values of the same category of data calculated according to the characteristic value calculation rule;
when the M characteristic values are included in the first characteristic value data table, classifying the data to be classified into first class data corresponding to the first characteristic value data table;
wherein the feature value calculation rule includes:
calculating M hash values of the data to be classified through M different hash functions; or
dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
2. The data classification method according to claim 1, wherein the characteristic value data table stores all characteristic values of the same category data in a bloom filter, and each characteristic value corresponds to a boolean value of 1 in the bloom filter.
3. The data classification method according to claim 2, wherein the characteristic value database further includes at least a second characteristic value data table, and wherein the sequentially comparing the M characteristic values with each of the characteristic value data tables in the characteristic value database includes:
sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table;
when the boolean values of the storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are all 1, it is determined that the M feature values are all included in the first feature value data table or the second feature value data table.
4. The data classification method of claim 3, characterized in that the method further comprises:
when the M characteristic values are not completely included in the first bloom filter and not completely included in the second bloom filter, judging that the data to be classified do not belong to the existing class data;
and returning a warning of classification failure.
5. An apparatus for classifying data, the apparatus comprising:
the acquisition module is used for acquiring data to be classified;
the calculation module is used for calculating M characteristic values of the data to be classified according to a characteristic value calculation rule;
a comparison module, configured to compare the M feature values with each feature value data table in a feature value data base in sequence, where the feature value data base at least includes a first feature value data table, and the first feature value data table is a set of all feature values of the same category data calculated according to the feature value calculation rule;
a classification module, configured to classify the data to be classified into first class data corresponding to the first characteristic value data table when all the M characteristic values are included in the first characteristic value data table;
wherein the feature value calculation rule includes:
calculating M hash values of the data to be classified through M different hash functions; or
dividing the data to be classified into M parts, and respectively calculating the hash values of the M parts through M hash functions.
6. The data classification apparatus according to claim 5, wherein the characteristic value data table stores all characteristic values of the same category data in a bloom filter, each characteristic value has a boolean value of 1 in a corresponding storage order in the bloom filter, the characteristic value database further includes at least a second characteristic value data table, and the comparison module is further configured to:
sequentially inquiring whether the Boolean values of the storage orders corresponding to the M characteristic values are all 1 in a first bloom filter and a second bloom filter corresponding to the first characteristic value data table and the second characteristic value data table;
when all of boolean values in storage order corresponding to the M feature values in the first bloom filter or the second bloom filter are 1, it is determined that all of the M feature values are included in the first feature value data table or the second feature value data table.
7. A computer arrangement, characterized in that the computer arrangement comprises a memory, a processor, the memory having stored thereon a computer program being executable on the processor, the computer program, when being executed by the processor, realizing the steps of the data classification method according to any one of claims 1-4.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executable by at least one processor to cause the at least one processor to perform the steps of the data classification method according to any one of claims 1-4.
CN201911175983.1A 2019-11-26 2019-11-26 Data classification method and device and computer equipment Active CN112948370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911175983.1A CN112948370B (en) 2019-11-26 2019-11-26 Data classification method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911175983.1A CN112948370B (en) 2019-11-26 2019-11-26 Data classification method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN112948370A CN112948370A (en) 2021-06-11
CN112948370B true CN112948370B (en) 2023-04-11

Family

ID=76225198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911175983.1A Active CN112948370B (en) 2019-11-26 2019-11-26 Data classification method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN112948370B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390011A (en) * 2018-04-12 2019-10-29 北京京东尚科信息技术有限公司 The method and apparatus of data classification

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5484471B2 (en) * 2008-09-19 2014-05-07 オラクル・インターナショナル・コーポレイション Storage-side storage request management
US8266506B2 (en) * 2009-04-18 2012-09-11 Alcatel Lucent Method and apparatus for multiset membership testing using combinatorial bloom filters
CN101923568B (en) * 2010-06-23 2013-06-19 北京星网锐捷网络技术有限公司 Method for increasing and canceling elements of Bloom filter and Bloom filter
CN102253991B (en) * 2011-05-25 2014-07-30 北京星网锐捷网络技术有限公司 Uniform resource locator (URL) storage method, web filtering method, device and system
CN103761494B (en) * 2014-01-10 2017-02-01 清华大学 Method and system for identifying lost tag of RFID system
US9569522B2 (en) * 2014-06-04 2017-02-14 International Business Machines Corporation Classifying uniform resource locators
CN105843931A (en) * 2016-03-30 2016-08-10 广州酷狗计算机科技有限公司 Classification method and device
CN106096042A (en) * 2016-06-28 2016-11-09 乐视控股(北京)有限公司 Data message sorting technique and system
CN110019785B (en) * 2017-09-29 2022-03-01 北京国双科技有限公司 Text classification method and device
CN108021605A (en) * 2017-10-30 2018-05-11 北京奇艺世纪科技有限公司 A kind of keyword classification method and apparatus
CN109784351B (en) * 2017-11-10 2023-03-24 财付通支付科技有限公司 Behavior data classification method and device and classification model training method and device
CN107911315B (en) * 2017-11-17 2020-09-11 成都西加云杉科技有限公司 Message classification method and network equipment
CN107967322B (en) * 2017-11-23 2021-09-21 努比亚技术有限公司 File classification display method, mobile terminal and computer readable storage medium
JP6699676B2 (en) * 2018-01-25 2020-05-27 トヨタ自動車株式会社 Server device, information collection system, and program
CN108304882B (en) * 2018-02-07 2022-03-04 腾讯科技(深圳)有限公司 Image classification method and device, server, user terminal and storage medium
CN108259811B (en) * 2018-04-03 2020-06-05 北京理工大学 Time hidden channel device for packet position classification adjustment and construction method thereof
CN108763952B (en) * 2018-05-03 2022-04-05 创新先进技术有限公司 Data classification method and device and electronic equipment
CN110362580B (en) * 2019-07-25 2021-09-24 重庆市筑智建信息技术有限公司 BIM (building information modeling) construction engineering data retrieval optimization classification method and system thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
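李睿; 李晋国; 陈浩. Research on secure classification protocols in two-tiered sensor networks. 通信学报 (Journal on Communications). 2015, (No. 02), full text. *
饶文; 陈旭. Optimization and application of massive data query technology based on the Bloom filter. 微型电脑应用 (Microcomputer Applications). 2018, (No. 02), full text. *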

Also Published As

Publication number Publication date
CN112948370A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN110309125B (en) Data verification method, electronic device and storage medium
CN108427705B (en) Electronic device, distributed system log query method and storage medium
CN107844634B (en) Modeling method of multivariate general model platform, electronic equipment and computer readable storage medium
CN109474578B (en) Message checking method, device, computer equipment and storage medium
CN108416485B (en) User identity recognition method, electronic device and computer readable storage medium
CN109672888B (en) Picture compression method, equipment and computer readable storage medium
CN108415925B (en) Electronic device, data call log generation and query method and storage medium
CN110599354B (en) Online checking method, online checking system, computer device and computer readable storage medium
CN108470045B (en) Electronic device, data chain archiving method and storage medium
CN111177129A (en) Label system construction method, device, equipment and storage medium
CN110457255B (en) Method, server and computer readable storage medium for archiving data
CN111414395B (en) Data processing method, system and computer equipment
CN112560939B (en) Model verification method and device and computer equipment
CN114356898A (en) Data storage method and device, electronic equipment and storage medium
CN112948370B (en) Data classification method and device and computer equipment
CN113656098A (en) Configuration acquisition method and system
CN110166530B (en) Processing method based on micro-service return value, electronic device and computer equipment
CN107844520A (en) Electronic installation, vehicle data introduction method and storage medium
CN110852893A (en) Risk identification method, system, equipment and storage medium based on mass data
CN113259154B (en) Method and device for informing middle station data verification, computer equipment and storage medium
CN113448747B (en) Data transmission method, device, computer equipment and storage medium
CN109902098A (en) Similar cases are searched and sort method, server and computer readable storage medium
CN112328641B (en) Multi-dimensional data aggregation method and device and computer equipment
CN112130936B (en) Data calling method, device, equipment and storage medium based on polling
CN108415922B (en) Database modification method and application server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant