CN109299260B - Data classification method, device and computer readable storage medium - Google Patents
- Publication number
- CN109299260B (application CN201811147293.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- natural language
- code data
- field value
- calculation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
A data classification method, apparatus, and computer-readable storage medium, the method comprising: acquiring natural language data; preprocessing the natural language data to obtain code data corresponding to each piece of natural language data; dividing each piece of code data into n pieces of tag data, where n ≥ 2; and, according to the bit order of the n pieces of tag data, grouping code data that share identical tag data at the same bit order into a full set. With this scheme, both the computation time and the computation cost of calculating the similarity between code data are reduced.
Description
Technical Field
The present invention relates to the field of data processing, and in particular, to a data classification method and apparatus, and a computer-readable storage medium.
Background
With the rapid development of Internet technology, digitized information in every industry is growing quickly and the storage space occupied by data keeps increasing; processing, mining, and applying massive data have become a vital capability in the competition among today's technology enterprises.
After collecting massive data, a technology enterprise generally needs to process it: convert natural language into data a computer can recognize and eliminate large amounts of similar data, so as to avoid wasting time and cost on repeated computation.
In the prior art, similarity is calculated by processing the acquired natural language data into binary data, treating all the binary data as a single set, and computing the similarity between each piece of binary data in the set and every other piece. With this scheme, however, x pieces of binary data require x(x+1)/2 calculations; when the amount of data in the set is very large, the computation takes too long and costs too much.
Disclosure of Invention
The invention addresses the technical problem that unreasonable data classification leads to excessive computation time and high cost when processing the data in a set.
To solve the foregoing technical problem, an embodiment of the present invention provides a data classification method, including: acquiring natural language data; preprocessing the natural language data to obtain code data corresponding to each piece of natural language data; dividing each piece of code data into n pieces of tag data, where n ≥ 2; and, according to the bit order of the n pieces of tag data, grouping code data that share identical tag data at the same bit order into a full set.
Optionally, a natural language field value corresponding to each piece of natural language data is obtained; word segmentation is performed on each natural language field value and the corresponding keywords are extracted; a hash value is obtained for each keyword; and similarity calculation is performed on the keyword hash values to obtain the code data corresponding to each piece of natural language data.
Optionally, the similarity calculation includes at least one of: weighting calculation, combining calculation and dimension reduction calculation.
Optionally, the code data is a SimHash signature.
Optionally, similarity calculation is performed on the hash value of the keyword corresponding to each natural language field value to obtain code data with a preset number of bits for each piece of natural language data.
Optionally, code data that share m identical pieces of tag data at the same bit order are treated as a full set, where m < n.
The present invention also provides a data classification apparatus, comprising: an acquisition unit configured to acquire natural language data; a processing unit configured to preprocess the natural language data to obtain code data corresponding to each piece of natural language data; a dividing unit configured to divide each piece of code data into n pieces of tag data, where n ≥ 2; and a classification unit configured to group code data that share identical tag data at the same bit order into a full set, according to the bit order of the n pieces of tag data.
Optionally, the processing unit is configured to obtain a natural language field value corresponding to each piece of natural language data; perform word segmentation on each natural language field value and extract the corresponding keywords; obtain a hash value for each keyword; and perform similarity calculation on the keyword hash values to obtain the code data corresponding to each piece of natural language data.
Optionally, the similarity calculation includes at least one of: weighting calculation, combining calculation and dimension reduction calculation.
Optionally, the code data is a SimHash signature.
Optionally, the processing unit is configured to perform similarity calculation on the keyword hash values to obtain code data with a preset number of bits for each piece of natural language data.
Optionally, the classification unit is configured to treat code data that share m identical pieces of tag data at the same bit order as a full set, where m < n.
The present invention also provides a computer readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed, perform the steps of any of the above-described data classification methods.
The invention also provides a data classification device comprising a memory storing computer instructions and a processor, wherein the processor, when running the computer instructions, performs the steps of any of the data classification methods described above.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
the acquired natural language data is preprocessed to obtain code data for each piece of natural language data; the code data is divided into n pieces of tag data; code data sharing the same tag data at the same bit order are treated as a full set; and the code data is thereby classified into several full sets. During similarity calculation, only the similarity between code data within each full set needs to be computed, which greatly reduces computation time and cost.
Drawings
Fig. 1 is a schematic flow chart of a data classification method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a data classification apparatus according to an embodiment of the present invention.
Detailed Description
After collecting massive data, a technology enterprise generally needs to process it: convert natural language into data a computer can recognize and eliminate large amounts of similar data, so as to avoid wasting time and cost on repeated computation.
In the prior art, similarity is calculated by processing the acquired natural language data into binary data, treating all the binary data as a single set, and computing the similarity between each piece of binary data in the set and every other piece. With this scheme, however, x pieces of binary data require x(x+1)/2 calculations; when the amount of data in the set is very large, the computation takes too long and costs too much.
In the embodiment of the invention, the acquired natural language data is preprocessed to obtain code data for each piece of natural language data; the code data is divided into n pieces of tag data; code data sharing the same tag data at the same bit order are treated as a full set; and the code data is thereby classified into several full sets. During similarity calculation, only the similarity between code data within each full set needs to be computed, greatly reducing computation time and cost.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Referring to fig. 1, the data classification method provided by the embodiment of the present invention includes the following steps:
step S101, natural language data is acquired.
In a specific implementation, the natural language data may be obtained from an Internet platform or from a database. In practice, the user may choose the source of the natural language data according to actual requirements; the invention does not limit the acquisition source.
Step S102, preprocessing the natural language data to obtain code data corresponding to each piece of natural language data.
In a specific implementation, since the computer cannot directly process the natural language data, the natural language data may be preprocessed to convert the natural language data into computer-readable code data.
In a specific implementation, computer-readable code data is usually represented as binary data; the user may also choose the format of the code data according to actual requirements.
Step S103, dividing each piece of code data into n pieces of tag data.
In a specific implementation, n is greater than or equal to 2; its value, and the way each piece of code data is divided, are chosen by the user according to actual requirements.
For example, for the code data 10011100 with n = 4, the corresponding tag data may be 10, 01, 11, and 00, or 100, 111, 0, and 0.
In a specific implementation, every piece of code data is usually partitioned in the same way. For example, the first piece of code data 10011100 yields the tag data 10, 01, 11, and 00; the second piece 11100011 yields 11, 10, 00, and 11.
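The partition step above can be sketched as follows. This is a minimal illustration, assuming an equal-width split with any remainder bits spread over the leading blocks; the patent leaves the partition scheme to the user, so the function name and the splitting rule are illustrative, not part of the claimed method.

```python
def split_into_tags(code: str, n: int) -> list[str]:
    """Split a bit string into n contiguous blocks of tag data."""
    if n < 2 or n > len(code):
        raise ValueError("require 2 <= n <= len(code)")
    size, rem = divmod(len(code), n)
    tags, start = [], 0
    for i in range(n):
        # leading blocks absorb one extra bit each when len(code) % n != 0
        end = start + size + (1 if i < rem else 0)
        tags.append(code[start:end])
        start = end
    return tags

print(split_into_tags("10011100", 4))  # ['10', '01', '11', '00']
print(split_into_tags("11100011", 4))  # ['11', '10', '00', '11']
```

Both pieces of code data from the example above are partitioned the same way, which is what makes their tag data comparable by bit order in step S104.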
Step S104, grouping code data that share identical tag data at the same bit order into a full set.
In a specific implementation, the order of the bits corresponding to the n pieces of tag data is determined.
For example, code data A is divided, in its own bit order, into the four pieces of tag data 11, 00, 11, and 00, and code data B into 11, 11, 00, and 11. Since the first-order tag data of A and B are the same, A and B can be placed in the same full set.
In a specific implementation, the similarity between code data is calculated over data at the same bit order; code data that end up in different full sets therefore have no similarity, or a similarity too low to be of reference value.
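The grouping of step S104 can be sketched as a bucketing pass over pre-split code data. This is an assumed implementation of the m = 1 case only (one matching block suffices); the names `build_full_sets` and the dict-of-tags input shape are mine, not the patent's.

```python
from collections import defaultdict

def build_full_sets(tagged: dict[str, list[str]]) -> dict[tuple, set[str]]:
    """Group code data sharing the same tag value at the same bit order."""
    buckets = defaultdict(set)
    for name, tags in tagged.items():
        for order, tag in enumerate(tags):
            buckets[(order, tag)].add(name)  # key: (bit order, tag value)
    # a full set only matters when it holds at least two pieces of code data
    return {k: v for k, v in buckets.items() if len(v) > 1}

# code data A and B from the example above, already split into tag data
tagged = {"A": ["11", "00", "11", "00"], "B": ["11", "11", "00", "11"]}
full_sets = build_full_sets(tagged)
print(sorted(full_sets[(0, "11")]))  # ['A', 'B'] share the first-order tag
```

Only pairs that land in a common bucket are compared in the subsequent similarity calculation.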
By dividing full sets in this way, the computer only needs to calculate the similarity between code data within each full set. If X pieces of code data in total are divided into 4 full sets containing X1, X2, X3, and X4 pieces respectively, the number of similarity calculations is X1(X1+1)/2 + X2(X2+1)/2 + X3(X3+1)/2 + X4(X4+1)/2, which is less than calculating the similarity directly between all pieces of code data.
In a specific implementation, X1 + X2 + X3 + X4 = X in general; however, the same piece of code data may be placed in more than one full set, in which case X1 + X2 + X3 + X4 may be slightly larger than X.
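A quick arithmetic check of the savings, using the document's count of x(x+1)/2 calculations for x items; the concrete bucket sizes below are hypothetical, chosen only to make the comparison visible.

```python
def comparisons(x: int) -> int:
    """Number of similarity calculations for x items, per the document."""
    return x * (x + 1) // 2

X = 1000
buckets = [400, 300, 200, 100]  # hypothetical X1..X4, summing to X
direct = comparisons(X)                         # all data in one set
bucketed = sum(comparisons(b) for b in buckets)  # per-full-set totals
print(direct, bucketed)  # 500500 vs 150500
```

Even with only four full sets the count drops to well under a third; finer partitions reduce it further, at the risk of splitting genuinely similar pairs across sets.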
In the embodiment of the invention, to convert the natural language data into code data, a natural language field value is first extracted from the natural language data.
In the embodiment of the invention, word segmentation is performed on the natural language field value: brackets and their contents are removed, spaces are deleted, and English is converted to lower case; stop words are then removed, the keywords in the field value are extracted, and the importance of each keyword is determined.
In the embodiment of the invention, after the keywords in the natural language field value are obtained, the keywords can be used as feature vectors, the hash value of each feature vector is calculated, and similarity calculation is then performed on the keyword hash values to obtain the code data corresponding to each piece of natural language data.
In an embodiment of the present invention, the similarity calculation includes at least one of: weighting calculation, combining calculation and dimension reduction calculation.
For example, take the natural language field value "An employee of Area 51 in the United States says there are 9 flying saucers inside and that gray aliens have been seen." Word segmentation yields "United States (4), Area 51 (5), employee (3), says (1), inside (2), has (1), 9 (3), flying saucers (5), gray (4), aliens (5)", where the number in parentheses indicates the importance of the word in the sentence; the larger the number, the more important the word.
The hash value of each keyword is then calculated; for example, the hash algorithm maps "United States" to 100101 and "Area 51" to 101011. In this way the text string becomes a string of numbers.
Weighting calculation: a weighted numeric string is formed from each keyword's hash value and its weight. The hash value of "United States" is 100101; weighted by 4 it becomes "4 -4 -4 4 -4 4". The hash value of "Area 51" is 101011; weighted by 5 it becomes "5 -5 5 -5 5 5".
Merging calculation: the weighted strings of all keywords are accumulated bit by bit into a single string. For "4 -4 -4 4 -4 4" ("United States") and "5 -5 5 -5 5 5" ("Area 51"), accumulating each bit gives "4+5, -4-5, -4+5, 4-5, -4+5, 4+5" = "9 -9 1 -1 1 9". Only two keywords are accumulated here as an example; in practice the strings of all words are accumulated.
Dimension-reduction calculation: the "9 -9 1 -1 1 9" obtained by the merging calculation is converted into a 01 string, forming the code data. Each bit greater than 0 is recorded as 1 and each bit less than 0 as 0, giving "101011".
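The weighting, merging, and dimension-reduction steps above can be sketched end to end. The 6-bit hashes and the weights 4 and 5 are taken from the document's example; the function name and the (bit string, weight) input format are assumptions of this sketch, and a real SimHash implementation would first hash each keyword itself.

```python
def simhash(weighted_hashes: list[tuple[str, int]]) -> str:
    """weighted_hashes: (bit string, weight) pairs of equal length."""
    bits = len(weighted_hashes[0][0])
    totals = [0] * bits
    for h, w in weighted_hashes:
        for i, b in enumerate(h):
            # weighting: bit 1 contributes +w, bit 0 contributes -w;
            # merging: contributions accumulate into totals per bit order
            totals[i] += w if b == "1" else -w
    # dimension reduction: positive total -> 1, otherwise 0
    return "".join("1" if t > 0 else "0" for t in totals)

# "United States": hash 100101, weight 4; "Area 51": hash 101011, weight 5
print(simhash([("100101", 4), ("101011", 5)]))  # 101011
```

The intermediate totals here are [9, -9, 1, -1, 1, 9], matching the merged string in the text, and the final signature matches the "101011" of the dimension-reduction step.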
In the embodiment of the invention, the code data is a SimHash signature.
In the embodiment of the invention, when similarity calculation is performed on the keyword hash values, the number of bits of the resulting code data can be preset to facilitate subsequent operations and the division of tag data; the specific number of bits is determined by the user according to the actual situation.
In the embodiment of the invention, after code data is divided into n pieces, code data that share m identical pieces of tag data at the same bit order are treated as a full set, where m < n.
For example, code data C is divided into the three pieces of tag data 11, 00, 11; code data D into 11, 00, 10; and code data E into 11, 11, 01. When m = 1, the first-order tag data of C, D, and E are all 11, so C, D, and E can be placed in the same full set. When m = 2, the first- and second-order tag data of C and D match, so C and D can be placed in the same full set.
In a specific implementation, the value of m can be determined by a user according to actual conditions.
In a specific implementation, after code data is classified in this way, code data whose tag data differ at the same bit order, or match in too few positions, end up in different full sets, and the similarity between code data in different full sets is low. Therefore, to simplify the database by eliminating highly similar code data, only the similarity between code data within each single full set needs to be calculated, which greatly reduces the amount of computation and the cost compared with the prior-art scheme.
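The m-of-n membership test behind the example of code data C, D, and E can be sketched as a simple pairwise check; the function name and the list-of-blocks representation are mine, not the patent's.

```python
from itertools import combinations

def share_m_tags(a: list[str], b: list[str], m: int) -> bool:
    """True when at least m tag blocks match at the same bit order."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches >= m

C, D, E = ["11", "00", "11"], ["11", "00", "10"], ["11", "11", "01"]
for m in (1, 2):
    for (na, a), (nb, b) in combinations(zip("CDE", [C, D, E]), 2):
        if share_m_tags(a, b, m):
            print(f"m={m}: {na} and {nb} share a full set")
```

With m = 1 every pair qualifies (all three share the first-order tag 11); with m = 2 only C and D do, reproducing the grouping in the example. Raising m makes full sets smaller and the similarity pass cheaper, at the cost of separating weakly similar pairs.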
Referring to fig. 2, the present invention further provides a data classification apparatus 20, including:
an acquisition unit 201 for acquiring natural language data;
the processing unit 202 is configured to perform preprocessing on the natural language data, and acquire code data corresponding to each piece of natural language data;
a dividing unit 203 for dividing each piece of code data into n pieces of tag data, respectively; n is more than or equal to 2;
the classification unit 204 is configured to group code data that share identical tag data at the same bit order into a full set, according to the bit order of the n pieces of tag data.
In this embodiment of the present invention, the processing unit 202 is configured to obtain a natural language field value corresponding to each piece of natural language data; perform word segmentation on each natural language field value and extract the corresponding keywords; obtain a hash value for each keyword; and perform similarity calculation on the keyword hash values to obtain the code data corresponding to each piece of natural language data.
In an embodiment of the present invention, the similarity calculation includes at least one of: weighting calculation, combining calculation and dimension reduction calculation.
In the embodiment of the invention, the code data is a SimHash signature.
In this embodiment of the present invention, the processing unit 202 is configured to perform similarity calculation on the keyword hash values to obtain code data with a preset number of bits for each piece of natural language data.
In this embodiment of the present invention, the classification unit 204 is configured to treat code data that share m identical pieces of tag data at the same bit order as a full set, where m < n.
The present invention also provides a computer readable storage medium having stored thereon computer instructions which, when executed, perform the steps of any of the data classification methods described above.
The invention also provides a data classification device, which comprises a memory and a processor, wherein the memory is stored with computer instructions, and the processor executes the steps of any one of the data classification methods when the computer instructions are executed.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (14)
1. A method of data classification, comprising:
acquiring natural language data;
preprocessing the natural language data to obtain code data corresponding to each piece of natural language data;
dividing each piece of code data into n pieces of tag data, where n ≥ 2;
grouping code data that share identical tag data at the same bit order into a full set, according to the bit order of the n pieces of tag data;
and respectively calculating the similarity between the code data in each full-volume set.
2. The data classification method according to claim 1, wherein the preprocessing the natural language data to obtain code data corresponding to each piece of natural language data includes:
acquiring a natural language field value corresponding to each piece of natural language data;
performing word segmentation processing on each natural language field value, and extracting corresponding keywords;
obtaining a hash value of a keyword corresponding to each natural language field value;
and performing similarity calculation on the hash value of the keyword corresponding to each natural language field value to obtain code data corresponding to each natural language.
3. The data classification method according to claim 2, characterized in that the similarity calculation comprises at least one of: weighting calculation, combining calculation and dimension reduction calculation.
4. The data classification method according to claim 2, characterized in that the code data is a SimHash signature.
5. The data classification method according to claim 2, wherein the performing similarity calculation on the hash value of the keyword corresponding to each natural language field value to obtain code data corresponding to each natural language field value includes:
and performing similarity calculation on the hash value of the keyword corresponding to each natural language field value to obtain code data of a preset digit corresponding to each natural language.
6. The data classification method according to claim 1, wherein the grouping of code data that share identical tag data at the same bit order into a full set comprises:
treating code data that share m identical pieces of tag data at the same bit order as a full set, where m < n.
7. A data classification apparatus, comprising:
an acquisition unit configured to acquire natural language data;
the processing unit is used for preprocessing the natural language data to obtain code data corresponding to each piece of natural language data;
a dividing unit configured to divide each piece of code data into n pieces of tag data, respectively; n is more than or equal to 2;
a classification unit configured to group code data that share identical tag data at the same bit order into a full set, according to the bit order of the n pieces of tag data;
and the calculating unit is used for respectively calculating the similarity between the code data in each full set.
8. The data classification device according to claim 7, wherein the processing unit is configured to obtain a natural language field value corresponding to each piece of natural language data; performing word segmentation processing on each natural language field value, and extracting corresponding keywords; obtaining a hash value of a keyword corresponding to each natural language field value; and performing similarity calculation on the hash value of the keyword corresponding to each natural language field value to obtain code data corresponding to each natural language.
9. The data classification apparatus of claim 8, wherein the similarity calculation comprises at least one of: weighting calculation, combining calculation and dimension reduction calculation.
10. The data classification apparatus according to claim 8, wherein the code data is a SimHash signature.
11. The data classification device according to claim 8, wherein the processing unit is configured to perform similarity calculation on the hash value of the keyword corresponding to each natural language field value to obtain code data of a preset number of bits corresponding to each natural language.
12. The data classification device according to claim 7, wherein the classification unit is configured to treat code data that share m identical pieces of tag data at the same bit order as a full set, where m < n.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the data classification method according to any one of claims 1 to 6.
14. A data classification device comprising a memory and a processor, the memory having stored thereon a computer program, characterized in that the processor performs the steps of the data classification method according to any one of claims 1 to 6 when the computer program is run.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811147293.0A CN109299260B (en) | 2018-09-29 | 2018-09-29 | Data classification method, device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811147293.0A CN109299260B (en) | 2018-09-29 | 2018-09-29 | Data classification method, device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109299260A CN109299260A (en) | 2019-02-01 |
CN109299260B true CN109299260B (en) | 2021-01-19 |
Family
ID=65161121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811147293.0A Active CN109299260B (en) | 2018-09-29 | 2018-09-29 | Data classification method, device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299260B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111724187A (en) * | 2019-03-21 | 2020-09-29 | 上海晶赞融宣科技有限公司 | DMP audience data real-time processing method and device and computer readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20120011395A (en) * | 2010-07-29 | 2012-02-08 | 에스케이커뮤니케이션즈 주식회사 | Method and System for Analyzing Document using Structure of Word/Stopword, Record Medium |
CN103123618A (en) * | 2011-11-21 | 2013-05-29 | 北京新媒传信科技有限公司 | Text similarity obtaining method and device |
CN105279272A (en) * | 2015-10-30 | 2016-01-27 | 南京未来网络产业创新有限公司 | Content aggregation method based on distributed web crawlers |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9244937B2 (en) * | 2013-03-15 | 2016-01-26 | International Business Machines Corporation | Efficient calculation of similarity search values and digest block boundaries for data deduplication |
CN105095162A (en) * | 2014-05-19 | 2015-11-25 | 腾讯科技(深圳)有限公司 | Text similarity determining method and device, electronic equipment and system |
CN106294350B (en) * | 2015-05-13 | 2019-10-11 | 阿里巴巴集团控股有限公司 | A kind of text polymerization and device |
CN107644010B (en) * | 2016-07-20 | 2021-05-25 | 阿里巴巴集团控股有限公司 | Text similarity calculation method and device |
CN106873964A (en) * | 2016-12-23 | 2017-06-20 | 浙江工业大学 | A kind of improved SimHash detection method of code similarities |
CN108573045B (en) * | 2018-04-18 | 2021-12-24 | 同方知网数字出版技术股份有限公司 | Comparison matrix similarity retrieval method based on multi-order fingerprints |
- 2018-09-29: application CN201811147293.0A filed in China; granted as CN109299260B (Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20120011395A (en) * | 2010-07-29 | 2012-02-08 | 에스케이커뮤니케이션즈 주식회사 | Method and System for Analyzing Document using Structure of Word/Stopword, Record Medium |
CN103123618A (en) * | 2011-11-21 | 2013-05-29 | 北京新媒传信科技有限公司 | Text similarity obtaining method and device |
CN105279272A (en) * | 2015-10-30 | 2016-01-27 | 南京未来网络产业创新有限公司 | Content aggregation method based on distributed web crawlers |
Also Published As
Publication number | Publication date |
---|---|
CN109299260A (en) | 2019-02-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107229627B (en) | Text processing method and device and computing equipment | |
CN106570128A (en) | Mining algorithm based on association rule analysis | |
CN112417028B (en) | Wind speed time sequence characteristic mining method and short-term wind power prediction method | |
WO2014068990A1 (en) | Relatedness determination device, permanent physical computer-readable medium for same, and relatedness determination method | |
US11874866B2 (en) | Multiscale quantization for fast similarity search | |
CN111597297A (en) | Article recall method, system, electronic device and readable storage medium | |
CN110674865A (en) | Rule learning classifier integration method oriented to software defect class distribution unbalance | |
WO2016157275A1 (en) | Computer and graph data generation method | |
CN113986950A (en) | SQL statement processing method, device, equipment and storage medium | |
CN113836896A (en) | Patent text abstract generation method and device based on deep learning | |
CN112818117A (en) | Label mapping method, system and computer readable storage medium | |
CN110413998B (en) | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof | |
CN113505583B (en) | Emotion reason clause pair extraction method based on semantic decision graph neural network | |
CN109299260B (en) | Data classification method, device and computer readable storage medium | |
CN111125329B (en) | Text information screening method, device and equipment | |
CN109902162B (en) | Text similarity identification method based on digital fingerprints, storage medium and device | |
CN116503608A (en) | Data distillation method based on artificial intelligence and related equipment | |
CN116257601A (en) | Illegal word stock construction method and system based on deep learning | |
JP5824429B2 (en) | Spam account score calculation apparatus, spam account score calculation method, and program | |
Desai et al. | Analysis of Health Care Data Using Natural Language Processing | |
CN113705873B (en) | Construction method of film and television work score prediction model and score prediction method | |
CN117251574B (en) | Text classification extraction method and system based on multi-feature data fusion | |
CN114118085B (en) | Text information processing method, device and equipment | |
KR102317205B1 (en) | Method and apparatus for estimating parameters of compression algorithm | |
CN107122392B (en) | Word stock construction method, search requirement identification method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||