CN116881799A

CN116881799A - Method for classifying cigarette production data

Info

Publication number: CN116881799A
Application number: CN202310862754.7A
Authority: CN
Inventors: 李新建; 邹鑫灏; 陈小虎; 严智; 谢超; 郭著松; 崔书方; 潘伟; 刘艳超; 侯毓; 程婉君
Original assignee: China Tobacco Hubei Industrial LLC
Current assignee: China Tobacco Hubei Industrial LLC
Priority date: 2023-07-13
Filing date: 2023-07-13
Publication date: 2023-10-13

Abstract

The invention discloses a method for classifying cigarette production data, comprising the following steps: S1, extracting representative words of each business system and taking them as core words; S2, forming the business system core word vectors used to calculate the association distance between every two business systems: using a bag-of-words (BOW) model, the number of times each core word occurs in each business system forms that system's core word vector; S3, normalizing the business system core word vectors; S4, calculating the association distance between business systems; S5, drawing a two-dimensional distance distribution diagram of the business systems; S6, clustering the business systems: a K-means clustering algorithm based on Euclidean distance clusters the systems into K classes in the determined two-dimensional plane; S7, determining the classification result by selecting an appropriate K value. The method uses horizontal (cross-field) association analysis, performs distance calculation, clustering and classification on a two-dimensional plane, and displays the classification regions visually on that plane, which supports determining the final K value and the classification result.

Description

Method for classifying cigarette production data

Technical Field

The invention relates to the technical field of data security, in particular to a method for classifying production data of cigarettes.

Background

Data classification is the primary task of data security management, and each industry classifies its data according to its own business characteristics and data. In the digital transformation of tobacco manufacturing enterprises, the main business processes have been informatized and digitalized, and the information they generate is gradually becoming, in various forms, an important digital asset of the enterprise. Meanwhile, industrial data grows more complex and diverse as application scenarios multiply, and industrial data that has not been classified faces potential security risks, from technology to management, when it is transferred between different businesses. Security threats such as data leakage at tobacco manufacturing enterprises harm not only the interests of the enterprise but also, to some extent, social production and national security. How to guide cigarette manufacturing enterprises to standardize the classified management of industrial data and effectively guarantee industrial data security is an urgent problem.

Therefore, a method for classifying the cigarette production data is provided.

Disclosure of Invention

The invention aims to provide a method for classifying cigarette production data that addresses the insufficient automation and granularity of data classification in tobacco enterprises and the heavy manual effort it requires.

In order to achieve the above purpose, the present invention provides the following technical solutions: the method for classifying the cigarette production data is characterized by comprising the following steps of:

S1, extracting representative words of each business system and taking them as core words;

S2, forming the business system core word vectors used to calculate the association distance between every two business systems: using a bag-of-words (BOW) model, the number of times each core word occurs in each business system forms that system's core word vector;

S3, normalizing the business system core word vectors, wherein $d_i$ denotes the weight of word $i$ and only valuable words are retained, $c_i$ denotes the number of times word $i$ appears in the fields of the business system, and the denominator is the total count of all useful words; the formula is:

$$d_i = \frac{c_i}{\sum_{j} c_j}$$

S4, calculating the association distance between business systems as the Euclidean distance between their core word vectors:

$$D(X_i, X_j) = \sqrt{\sum_{k} \left(x_{ik} - x_{jk}\right)^2}$$

where $x_{ik}$ is the $k$-th component of the normalized core word vector $X_i$;

S5, drawing a two-dimensional distance distribution diagram of the business systems: one system is selected as the center point of the two-dimensional plane; the distance between each other system and this system is used as the length of the connecting line; points with equal line lengths are arranged on the same circle, with more distant points placed farther out; all systems are placed on the two-dimensional plane step by step; each point is a business system and each line is a business system association distance line;

S6, clustering the business systems: a K-means clustering algorithm based on Euclidean distance clusters the systems into K classes in the determined two-dimensional plane;

S7, determining the classification result: an appropriate K value is selected to determine the classification result.

Further, representative words associated with the business system, such as time, place, person, action and result, are taken as core words.

Further, in step S2, the representative words of business system 1 include name, mobile phone number, tobacco leaf origin, tobacco leaf price, shredding, temperature and logistics;

the representative words of business system 2 include name, home address, mobile phone number, tobacco leaf origin, tobacco leaf price, rolling and packing, and activity;

the representative words of business system 3 include mobile phone number, tobacco leaf origin, tobacco leaf price, rolling and packing, and activity;

the representative words of business system 4 include name, mobile phone number, tobacco leaf origin, tobacco leaf price, and shredding (which occurs twice);

the representative words of business system 5 include name, mobile phone number, rolling and packing, and activity;

in step S2, any two of the five business systems may be selected.

Further, in step S2, a bag of words is constructed from two of the business systems, business system 1 and business system 2:

Dictionary = {1: "name", 2: "mobile phone number", 3: "tobacco leaf origin", 4: "tobacco leaf price", 5: "shredding", 6: "home address", 7: "rolling and packing", 8: "activity", 9: "temperature", 10: "logistics"}.

Further, in step S6,

When K=5 is chosen, clustering yields: 1) a development data field, 2) a production data field, 3) a management data field, 4) an operation and maintenance data field, 5) an external data field.

When K=6 is chosen, clustering yields: 1) a development data field, 2) a production data field, 3) an administration data field, 4) a management data field, 5) a support data field, 6) an external data field.

When K=7 is chosen, clustering yields: 1) a development data field, 2) a production data field, 3) a production data field, 4) an administration data field, 5) quality process control data, 6) plant operation control data, 7) production process control data.

Further, in step S7,

K=6 is finally selected based on business judgment, and clustering yields: 1) a development data field, 2) a production data field, 3) an administration data field, 4) a management data field, 5) a support data field, 6) an external data field.

Further, in step S5, if some connecting lines cannot all be satisfied at once when connecting several points on the plane, the connection with the larger value is discarded.

Compared with the prior art, the invention has the beneficial effects that:

(1) The method departs from the traditional approach of classifying by isolated field definitions: it uses horizontal (cross-field) association analysis, performs distance calculation, clustering and classification on a two-dimensional plane, and presents the result visually.

(2) Clustering uses the K-means algorithm based on Euclidean distance, which works on the relative distances between points, so choosing a different origin for the two-dimensional coordinates does not produce a different result.

(3) Adjusting the K value yields different classification results; the classification regions can be displayed intuitively on the two-dimensional plane, providing a reference for determining the final K value and the classification result.

Drawings

FIG. 1 is a flow chart of the classification of production data according to an embodiment of the present invention;

FIG. 2 is a two-dimensional distance distribution diagram of a business system according to an embodiment of the present invention;

FIG. 3 is a diagram of normalized core word vectors of a business system according to an embodiment of the present invention;

FIG. 4 is a business system associated distance graph of an embodiment of the present invention;

FIG. 5 is a two-dimensional distance distribution diagram of a business system according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIGS. 1-5, the present invention provides the following technical solution: a method for classifying cigarette production data.

As shown in FIG. 1, the steps are as follows:

1. Extract representative words of the business system as core words. Fields without entity content, such as status and yes/no flag fields, are removed. Representative words associated with the business system, such as time, place, person, action and result, are taken as core words. The extraction objects are data fields taken from the data descriptions of the business system, such as database table fields, transmission interface APIs, message fields, JSON or XML data packets, and data files, for example: name, age, tobacco leaf origin, baking time.

2. Form the business system core word vectors used to calculate the association distance between every two business systems. A bag-of-words (BOW) model is used: the number of times each core word occurs in each business system forms that system's core word vector.

For example, consider two business systems:

Business system 1 representative words: name, mobile phone number, tobacco leaf origin, tobacco leaf price, shredding, temperature, mobile phone number, logistics.

Business system 2 representative words: name, home address, mobile phone number, tobacco leaf origin, tobacco leaf price, rolling and packing, activity.

Based on these two business systems, a dictionary (the bag of words) is constructed:

Dictionary = {1: "name", 2: "mobile phone number", 3: "tobacco leaf origin", 4: "tobacco leaf price", 5: "shredding", 6: "home address", 7: "rolling and packing", 8: "activity", 9: "temperature", 10: "logistics"}.

The dictionary contains 10 distinct words. Using the dictionary index numbers, each of the two business systems can be represented by a 10-dimensional vector, in which the number of times a word appears in the fields of a business system is an integer from 0 to n (n being a positive integer):

1) X₁ = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]

2) X₂ = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

Each element of the vector is the number of times the corresponding dictionary word appears in the business system. Note that the vector does not record the order in which words appear in the original system: order is ignored and only the number of occurrences is kept.
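The counting step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the English word labels are one rendering of the translated terms, and `bow_vector` is a hypothetical helper name.

```python
from collections import Counter

# Representative words of the two example business systems (from the text).
system_1 = ["name", "mobile phone number", "tobacco leaf origin",
            "tobacco leaf price", "shredding", "temperature",
            "mobile phone number", "logistics"]
system_2 = ["name", "home address", "mobile phone number",
            "tobacco leaf origin", "tobacco leaf price",
            "rolling and packing", "activity"]

# Dictionary in the index order used by the example (positions 1..10).
DICTIONARY = ["name", "mobile phone number", "tobacco leaf origin",
              "tobacco leaf price", "shredding", "home address",
              "rolling and packing", "activity", "temperature", "logistics"]


def bow_vector(words, dictionary):
    """Count how often each dictionary word occurs; word order is ignored."""
    counts = Counter(words)
    return [counts[word] for word in dictionary]


x1 = bow_vector(system_1, DICTIONARY)
x2 = bow_vector(system_2, DICTIONARY)
print(x1)
print(x2)
```

Because "mobile phone number" is listed twice for business system 1, its count is 2, which matches the second element of X₁ in the example.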

3. Normalize the business system core word vectors. $d_i$ denotes the weight of word $i$, and only valuable words are retained; $c_i$ denotes the number of times word $i$ appears in the fields of the business system, and the denominator is the total count of all useful words. The formula is:

$$d_i = \frac{c_i}{\sum_{j} c_j}$$

The core word vectors of the two example business systems above are normalized as follows:

1) X₁ = [0.125, 0.25, 0.125, 0.125, 0.125, 0, 0, 0, 0.125, 0.125]

2) X₂ = [0.14, 0.14, 0.14, 0.14, 0.00, 0.14, 0.14, 0.14, 0.00, 0.00]
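The normalization can be checked with a short Python sketch. It assumes, as the worked numbers indicate, that each count is divided by the sum of all counts in the vector (8 for X₁ and 7 for X₂); `normalize` is a hypothetical helper name.

```python
def normalize(counts):
    """Divide every count by the total count so the weights sum to 1."""
    total = sum(counts)
    if total == 0:
        # A system with no useful words gets an all-zero vector.
        return [0.0] * len(counts)
    return [c / total for c in counts]


x1 = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
x2 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

n1 = normalize(x1)
n2 = normalize(x2)
print(n1)
print([round(v, 2) for v in n2])
```

The values listed for X₂ in the text are 1/7 rounded to two decimals.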

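4. Calculate the association distance between business systems as the Euclidean distance between their core word vectors:

$$D(X_i, X_j) = \sqrt{\sum_{k} \left(x_{ik} - x_{jk}\right)^2}$$

where $x_{ik}$ is the $k$-th component of the normalized core word vector $X_i$.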
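As a hedged illustration of this step, the following Python sketch computes the Euclidean distance between the two normalized example vectors. The function name `euclidean_distance` is an assumption made for the example; the input values are the normalized vectors X₁ and X₂ from the worked example, with X₂ kept as exact sevenths rather than the rounded 0.14.

```python
import math


def euclidean_distance(u, v):
    """Square root of the summed squared component differences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


n1 = [0.125, 0.25, 0.125, 0.125, 0.125, 0, 0, 0, 0.125, 0.125]
n2 = [1 / 7, 1 / 7, 1 / 7, 1 / 7, 0, 1 / 7, 1 / 7, 1 / 7, 0, 0]

d12 = euclidean_distance(n1, n2)
print(round(d12, 3))
```

A smaller distance means the two business systems share more of their core words in similar proportions.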

5. Draw the two-dimensional distance distribution diagram of the business systems. One system is selected as the center point of the two-dimensional plane. The distance between each other system and this system is used as the length of the connecting line; points with equal line lengths are arranged on the same circle, more distant points are placed farther out, and all systems are placed on the plane step by step, as shown in FIG. 2. Each point is a business system and each line is a business system association distance line. (If some connecting lines cannot all be satisfied at once when connecting several points on the plane, the connection with the larger value is discarded.)
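One possible reading of this placement rule is sketched below in Python. It is an illustration under several assumptions and not the patent's drawing procedure: only the distance to the chosen center is preserved, the angular spacing on each circle and the per-ring rotation are invented for the sketch, the rule for discarding larger connections is not modeled, and the function name `radial_layout` and the input distances are placeholders.

```python
import math


def radial_layout(distances_to_center, tolerance=1e-9):
    """Return {name: (x, y)} with the center system implied at (0, 0).

    Each system sits at a radius equal to its distance from the center.
    Systems whose distances agree within `tolerance` share one circle and
    are spread evenly around it.
    """
    rings = []  # (radius, [names]) ordered from inner to outer circle
    ordered = sorted(distances_to_center.items(), key=lambda kv: kv[1])
    for name, dist in ordered:
        if rings and abs(rings[-1][0] - dist) <= tolerance:
            rings[-1][1].append(name)
        else:
            rings.append((dist, [name]))

    positions = {}
    for ring_index, (radius, names) in enumerate(rings):
        # Assumption: rotate each ring slightly so that points on different
        # circles do not all fall on the same ray.
        start = ring_index * math.pi / 4
        for k, name in enumerate(names):
            angle = start + 2 * math.pi * k / len(names)
            positions[name] = (radius * math.cos(angle),
                               radius * math.sin(angle))
    return positions


# Placeholder distances from a chosen center system (illustrative only).
example = {"B": 0.30, "C": 0.30, "D": 0.45, "E": 0.60}
positions = radial_layout(example)
for name, (x, y) in positions.items():
    print(name, round(x, 3), round(y, 3))
```

In this sketch B and C share the inner circle and end up diametrically opposite each other, while D and E sit on their own larger circles.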

6. Cluster the business systems. A K-means clustering algorithm based on Euclidean distance is used: the closer two objects are, the greater their similarity. The systems are finally clustered into K classes in the two-dimensional plane determined in FIG. 2.
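For readers unfamiliar with K-means, a minimal pure-Python sketch on two-dimensional points follows. It is a generic textbook version, not the patent's implementation: the initial centroids are simply the first K points (an assumption chosen to keep the run deterministic), and the sample coordinates are placeholders rather than positions from FIG. 2.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Cluster 2-D points into k groups; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: start from the first k points; real use would try
    # several different starts.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Placeholder coordinates: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels, centroids = kmeans(points, k=3)
print(labels)
```

Running the same function with different values of `k` gives the alternative groupings that step 7 compares.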

7. Determine the classification result. K=6 is finally selected based on business judgment, and clustering yields: 1) a development data field, 2) a production data field, 3) an administration data field, 4) a management data field, 5) a support data field, 6) an external data field.

The method has the following characteristics:

Classification by business system clustering: (1) representative words of the business system are extracted as core words; (2) word vectors are formed using the bag-of-words model; (3) clustering uses the K-means algorithm; (4) the classification results are examined under different K values.

Visual display of the classification result on a two-dimensional plane: (1) one system is selected as the center point of the two-dimensional plane; the distance between each other system and this system is used as the line length, points with equal line lengths are arranged on the same circle, and more distant points are placed farther out; each point is a business system and each line is a business system association distance line. (2) Different K values are selected to obtain different clustering results, which are presented visually.

For example, consider five business systems:

1. Extract representative words of the business systems as core words.

Business system 1 representative words: name, mobile phone number, tobacco leaf origin, tobacco leaf price, shredding, temperature, mobile phone number, logistics;

Business system 2 representative words: name, home address, mobile phone number, tobacco leaf origin, tobacco leaf price, rolling and packing, activity;

Business system 3 representative words: mobile phone number, tobacco leaf origin, tobacco leaf price, rolling and packing, activity;

Business system 4 representative words: name, mobile phone number, tobacco leaf origin, tobacco leaf price, shredding, shredding;

Business system 5 representative words: name, mobile phone number, rolling and packing, activity;

Any two of the five business systems may be selected.

2. Form the business system core word vectors used to calculate the association distance between every two business systems.

Dictionary = {1: "name", 2: "mobile phone number", 3: "tobacco leaf origin", 4: "tobacco leaf price", 5: "shredding", 6: "home address", 7: "rolling and packing", 8: "activity", 9: "temperature", 10: "logistics"}.

1) X₁ = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]

2) X₂ = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

3) X₃ = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]

4) X₄ = [1, 1, 1, 1, 2, 0, 0, 0, 0, 0]

5) X₅ = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]

3. Normalize the business system core word vectors; the result is shown in FIG. 3.

4. Calculate the business system association distances D(X_i, X_j); the result is shown in FIG. 4.

5. Draw the two-dimensional distance distribution diagram of the business systems, as shown in FIG. 5.

6. Cluster the business systems.

With K=1, all systems fall into one class. With K=4, X₂ and X₃ fall into the same class and the other three systems form one class each.

7. Determine the classification result. An appropriate K value is selected to determine the classification result; K=4 is recommended for this example.
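The five-system example can be retraced numerically with the sketch below. It normalizes the five count vectors given above and computes all pairwise Euclidean distances in the original 10-dimensional space. This is an illustration only: it does not reproduce the values of FIG. 3 or FIG. 4, which are not included in this text, and it works on the vectors directly rather than on the two-dimensional plane of FIG. 5.

```python
import math
from itertools import combinations

# Count vectors X1..X5 from the worked example.
counts = {
    "X1": [1, 2, 1, 1, 1, 0, 0, 0, 1, 1],
    "X2": [1, 1, 1, 1, 0, 1, 1, 1, 0, 0],
    "X3": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
    "X4": [1, 1, 1, 1, 2, 0, 0, 0, 0, 0],
    "X5": [1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
}


def normalize(vec):
    """Divide each count by the vector's total count."""
    total = sum(vec)
    return [c / total for c in vec]


normalized = {name: normalize(vec) for name, vec in counts.items()}

# Euclidean distance for every pair of systems.
distances = {
    (a, b): math.dist(normalized[a], normalized[b])
    for a, b in combinations(counts, 2)
}

closest = min(distances, key=distances.get)
for pair, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(pair, round(d, 3))
print("closest pair:", closest)
```

Under this calculation X₂ and X₃ are the closest pair, which is consistent with the statement that they share a class when K=4.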

Classification by data field labels looks at the attributes of each field in isolation, so it sees only a single field at a time. It does not capture the sensitive data that arises horizontally when several fields are combined, nor the sensitivity reached vertically as the volume of data accumulates. The method explores association analysis across multiple fields horizontally: it constructs business system core word vectors, calculates the similarity distance between business systems, and performs clustering classification on a two-dimensional plane.

Unlike classification of tobacco enterprise data based on the semantics of isolated fields, the method uses semantic association analysis, based on a large model, to examine and compute on general personnel data together with tobacco rolling-and-packing and shredding data; the resulting classification is finer than traditional classification and better supports the sharing and flow of business data.

The method explores a change to the traditional approach of classifying by isolated field definitions: based on a large model, it combines horizontal field semantic association analysis with vertical accumulation of field weights for external sharing, and carries over the manual classification baseline in the vertical direction to perform clustering classification in three-dimensional space.

Traditional manual classification and labeling involves a huge workload, is hard to remember and is not fine-grained. It looks at fields in isolation and relies on business personnel's interpretation and understanding of each field. However, the data of all business systems carries meaning in a way similar to language and sentences, so fields cannot be viewed in isolation: during data generation, storage and sharing, specific data is selected and presented purposefully and meaningfully. For example, consider a query for the product qualification rate of the shredding workshop of a cigarette factory for 2 months. Although the final output is a single percentage, it involves associated fields such as time, total product quantity and qualified product quantity. Traditional classification labels look only at field definitions and do not perform weighted analysis of the associated fields.

Primary classification: the horizontal multi-field association analysis flow for cigarette production data is shown in FIG. 1. Through multi-field association analysis, a clustering algorithm is used to extract the representative words (each associated with a number of other words) of the main business systems of the tobacco industry as core words. Based on word association, first-order, second-order and third-order associations are then extended: for example, given that A is associated with B, B with C, and C with D, the mutual weight attenuation is calculated to obtain the association between A and D. All words are aggregated to the core words as far as possible to produce an industry word list, and the clustering algorithm is applied to the industry word list to obtain the data classification.
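The text does not say how the weight attenuation along a chain such as A, B, C, D is computed. Purely as an assumption for illustration, the sketch below multiplies the direct association weights along the chain, so that each extra hop weakens the derived association. The function name `chained_association` and the weights are hypothetical and do not come from the patent.

```python
def chained_association(weights):
    """Multiply direct association weights (each in [0, 1]) along a chain.

    Assumption: attenuation is multiplicative, so a longer chain can only
    lower the derived association strength.
    """
    strength = 1.0
    for w in weights:
        if not 0.0 <= w <= 1.0:
            raise ValueError("association weights must lie in [0, 1]")
        strength *= w
    return strength


# Hypothetical direct association weights between adjacent words.
direct = {("A", "B"): 0.8, ("B", "C"): 0.5, ("C", "D"): 0.5}

a_to_d = chained_association(
    [direct[("A", "B")], direct[("B", "C")], direct[("C", "D")]]
)
print(a_to_d)
```

Other attenuation rules, such as a fixed decay per hop, would fit the description equally well; the choice here only makes the idea concrete.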

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. The method for classifying the cigarette production data is characterized by comprising the following steps of:

S3, normalizing the business system core word vectors; wherein $d_i$ denotes the weight of word $i$ and only valuable words are retained, $c_i$ denotes the number of times word $i$ appears in the fields of the business system, and the denominator is the total count of all useful words; the formula is:

$$d_i = \frac{c_i}{\sum_{j} c_j}$$

2. The method for classifying cigarette production data according to claim 1, wherein: representative words associated with the business system, such as time, place, person, action and result, are taken as core words.

3. The method for classifying cigarette production data according to claim 1 or 2, wherein: in step S2, the representative words of business system 1 include name, mobile phone number, tobacco leaf origin, tobacco leaf price, shredding, temperature and logistics; the representative words of business system 2 include name, home address, mobile phone number, tobacco leaf origin, tobacco leaf price, rolling and packing, and activity;

in step S2, any two of the five business systems may be selected.

4. The method for classifying cigarette production data according to claim 3, wherein: in step S2, a bag of words is constructed from two business systems, business system 1 and business system 2:

5. The method for classifying cigarette production data according to claim 3, wherein: in step S6,

6. The method for classifying cigarette production data according to claim 5, wherein: in step S7,

7. The method for classifying cigarette production data according to claim 1, wherein: in step S5, if some connecting lines cannot all be satisfied at once when connecting several points on the plane, the connection with the larger value is discarded.