CN115357769B - Government affair data labeling system and method - Google Patents

Government affair data labeling system and method

Info

Publication number
CN115357769B
CN115357769B
Authority
CN
China
Prior art keywords
data
layer
training
sample
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210922987.7A
Other languages
Chinese (zh)
Other versions
CN115357769A (en)
Inventor
严洪涛
张军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Zhiwei Digital Technology Co ltd
Original Assignee
Wuxi Zhiwei Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Zhiwei Digital Technology Co ltd filed Critical Wuxi Zhiwei Digital Technology Co ltd
Priority to CN202210922987.7A
Publication of CN115357769A
Application granted
Publication of CN115357769B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/906: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a government affair data labeling system and method. With the development of the Internet of Things and big data technology, the information volume of government affair data grows ever larger, and the data formats and types produced by different business systems differ, so that manual identification is almost impossible to complete. According to the invention, the original data is sampled, a preset type mark is attached, the result is input into a training model, and the data category is output after the training model classifies it; this realizes automatic classification and identification of mass data and lays a foundation for subsequent data retrieval applications. The method is equally applicable to blind classification, blind identification and data type subdivision of other kinds of data information, or of data information generated by other kinds of business systems.

Description

Government affair data labeling system and method
Technical Field
The invention relates to the technical fields of the Internet of Things and big data, and in particular to a government affair data labeling system and method.
Background
With the development of the Internet of Things and big data technology, there are ever more types of data and ever larger data volumes, so automatic classification of unknown data becomes increasingly important. Traditional data classification mainly standardizes data through interface docking, database docking and manual identification. These approaches are inefficient, intrude heavily on the original business systems, and can hardly achieve uniform standardization across the board: each business system keeps its own standard, temporary conversion is needed whenever systems are docked, and effective global data governance is impossible.
For government affair data, the number of actual types is not large, but because each government affair information system is developed independently, the same item of government affair data is marked inconsistently across places, time periods and systems; standards are not uniform and formats may also differ. For example, system A may name the identity card number field sfz, system B userID, and system C "old identity card number"; likewise, systems A, B and C may each represent a date of birth in a different format. A unified classification mark is lacking. When a system D needs related information, unless it knows exactly how the different systems represent that information, it can only check item by item whether useful information exists, and it cannot determine whether similar information exists in other systems; a means of blind and automatic identification of unknown data is therefore missing. In addition, with the development of informatization, the huge mass of data lacks marks, and relying entirely on the huge workload of manual marking makes uniform standardization across the board nearly impossible.
Disclosure of Invention
In view of the above, the present invention provides a system and a method for labeling government affair data, which can effectively solve the above-mentioned problems in the prior art.
The government affair data labeling system and method of the invention sample the original data, attach a preset type mark, feed the result into a training model, and output the data category after the training model classifies it. The data labels comprise the data categories output by the training model. Original data sampling includes intercepting a number of segments of the original data. The preset types comprise numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos, texts and other types, where "other types" means every type other than numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos and texts. The training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, by empirical estimation or by training model self-learning estimation; empirical estimation means manually estimating each weight parameter value of the model from personal experience. This step is completed before step S1 is executed, but it need not be executed before every run of step S1: once the weight parameter estimation is completed it can support many subsequent rounds of data labeling, and all that is required before labeling is a suitable set of weight parameter estimates;
S1: collecting data, namely collecting service system data to obtain various data, wherein the service system data comprises streaming or non-streaming data, structured or unstructured data, document data, Internet data and the like;
S2: data preprocessing, including sampling the original data and attaching a preset type mark; the original data sampling includes intercepting the first d consecutive bits of an original data unit. An original data unit comprises any one or more of a data file, a document, a database, a data table, similar fields of a data table (whose class may be unknown), and the like. Attaching the preset type mark means prepending a preset type identification code of several bits to the sampled bits; for example, 8 preset types can be represented by a 3-bit binary number. Since a typical data file occupies several KB, several MB or even more, performing feature recognition on all the data of a file would give the most accurate feature information, but the computation and time required could be astronomical. The value of d therefore balances the computational load and efficiency against the effectiveness of feature extraction: it should be neither too large nor too small. Too large a value makes the computation excessive; too small a value, while it may represent the data features of part of the original data unit, may fail to capture valid features of data items with longer format fields or of image data;
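The preprocessing of step S2 can be illustrated with a short Python sketch (the helper names, the padding of units shorter than d bits, and the example codes for 8 preset types are illustrative assumptions, not fixed by the invention):

```python
# Sketch of step S2: take the first d consecutive bits of an original
# data unit and prepend a b-bit preset type identification code.
TYPE_CODES = {  # illustrative 3-bit codes for 8 preset types
    "number": "000", "chinese": "001", "english": "010", "mixed": "011",
    "picture": "100", "video": "101", "text": "110", "other": "111",
}

def preprocess(raw: bytes, preset_type: str, d: int = 800) -> list[int]:
    """Return the m = d + b bit vector that is fed to the training model."""
    bits = "".join(f"{byte:08b}" for byte in raw)[:d]
    bits = bits.ljust(d, "0")  # pad units shorter than d bits (an assumption)
    return [int(c) for c in TYPE_CODES[preset_type] + bits]

x = preprocess(b"1990-05-17", "number")  # len(x) == 803 when d = 800, b = 3
```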
S3: data analysis, calculating the classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
For each group of input preprocessed sampling data, the total number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark. For each group of inputs, let x_i be the i-th bit value of the input data after the preset type mark has been attached; x_i takes the binary value 0 or 1, 1 ≤ i ≤ m (here i indexes the input data only). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; the output layer is not counted in the total number of layers. Layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K. The weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1}.

The i-th input value of the first layer is x_i^{(1)} = x_i, and the i-th intermediate training value of the second layer is x_i^{(2)} = g((Θ_i^{(1)})^T X^{(1)}). In general, the i-th intermediate training value of layer l+1 is x_i^{(l+1)}, 1 ≤ i ≤ s_{l+1}, with

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values (also called intermediate quantities), and g is the activation function, taken as the sigmoid g(z) = 1/(1 + e^{-z}), consistent with the probability thresholds of step S4.

Computing layer by layer yields the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K;
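The layer-by-layer computation of step S3 can be sketched in a few lines of Python (a minimal illustration assuming NumPy; the function names are ours, not the patent's):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Step S3: layer-by-layer computation x^(l+1) = g(Theta^(l) X^(l)).

    x      : length-m 0/1 input vector X^(1)
    thetas : weight matrices; thetas[l-1] has shape (s_{l+1}, s_l),
             its i-th row being (Theta_i^(l))^T
    Returns the K final-layer output values x^(L+1).
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = sigmoid(theta @ a)
    return a
```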
S4: outputting the classification result, namely the data label h. When x_i^{(L+1)} ≤ K_1, take h_i = 0; when x_i^{(L+1)} ≥ K_2, take h_i = 1; when K_1 < x_i^{(L+1)} < K_2, the data is abnormal: it does not belong to any trained data category, or the classification model needs to be retrained. Here K_1 is the error range for probability 0 and K_2 is the error range for probability 1. The classification result is h = [h_1, h_2, …, h_K]; relative to the preset types, the classification result is also called the subdivided data type;
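The thresholding of step S4 translates directly into code (the default K_1 = 0.1 and K_2 = 0.9 are the values used in Embodiment 1; raising an exception for the abnormal case is our choice of error handling):

```python
def to_label(outputs, k1=0.1, k2=0.9):
    """Step S4: output <= K1 gives h_i = 0, output >= K2 gives h_i = 1;
    anything strictly between flags the data as abnormal."""
    h = []
    for o in outputs:
        if o <= k1:
            h.append(0)
        elif o >= k2:
            h.append(1)
        else:
            raise ValueError("abnormal data: not a trained category, "
                             "or the classification model needs retraining")
    return h
```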
S5: labeling and archiving the data, labeling the original data unit corresponding to the sampled data with the classification result h, and archiving and storing the labeled original data unit.
Steps S1 to S5 are repeated for each group of original data units, completing the labeling of all the original data.
Preferably, the training model self-learning estimation in step S0 includes a neural network training model parameter estimation method, including the following steps:
S01: acquiring sample data. Let the total number of samples be N, and let the data label of the n-th sample, y^{(n)} = [y_1^{(n)}, y_2^{(n)}, …, y_K^{(n)}], be known: y_k^{(n)} = 1 when the sample belongs to the k-th class, and y_k^{(n)} = 0 otherwise, where 1 ≤ n ≤ N and 1 ≤ k ≤ K;
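The label convention of step S01 is one-hot encoding; a small illustration (classes are 1-indexed as in the text):

```python
import numpy as np

def one_hot(k: int, K: int) -> np.ndarray:
    """y^(n) for a sample of class k: y_k^(n) = 1, all other entries 0."""
    y = np.zeros(K)
    y[k - 1] = 1.0
    return y

y = one_hot(3, 20)  # a sample of the 3rd of K = 20 classes
```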
S02: sample data preprocessing, including sampling the sample data and attaching a preset type mark; the sample data sampling includes intercepting the first d consecutive bits of a sample data unit. A sample data unit comprises any one or more of a data file, a document, a database, a data table, similar fields of a data table (whose class may be unknown), and the like. Attaching the preset type mark means prepending a preset type identification code of several bits to the sampled bits;
S03: sample training, calculating the training values of each layer of the neural network model.

For each group of input preprocessed sample sampling data, the number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark; x_i is the i-th bit value of the input data after the preset type mark has been attached, taking the binary value 0 or 1, 1 ≤ i ≤ m (here i indexes the input data only). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; the output layer is not counted in the total number of layers. Layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K. The weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1}.

Input the sample data and the initial weight parameter values, with x_i^{(1)} = x_i. The i-th intermediate value of layer l+1 is x_i^{(l+1)}, 1 ≤ i ≤ s_{l+1}, with

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values (also called intermediate quantities).

Compute layer by layer until the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K, are obtained, and record the training result of the n-th sample as h^{(n)} = [h_1^{(n)}, h_2^{(n)}, …, h_K^{(n)}], 1 ≤ n ≤ N.

Step S03 is repeated for each group of input preprocessed sample sampling data;
S04: calculating the cost function. For all training results and weight parameters, calculate the cost function J(Θ), where Θ is the set of all θ_{ij}^{(l)}:

J(Θ) = -(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_k^{(n)} ln h_k^{(n)} + (1 - y_k^{(n)}) ln(1 - h_k^{(n)}) ] + (λ/(2N)) Σ_{l=1}^{L} Σ_{i=1}^{s_{l+1}} Σ_{j=1}^{s_l} (θ_{ij}^{(l)})²,

where λ is the deviation penalty parameter, 0 ≤ λ ≤ 1; λ is an empirical parameter and is generally small. The cost function is essentially the sum of three terms: the positive deviation expectation and the negative deviation expectation of the probability, and the deviation penalty term;
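With J(Θ) in the form above, step S04 can be sketched as follows (H holds the N×K training results h_k^{(n)}, Y the labels y_k^{(n)}; the default lam = 0.001 is the embodiment's value, and the small eps guard against log 0 is our addition):

```python
import numpy as np

def cost(H, Y, thetas, lam=0.001):
    """Step S04: cross-entropy data term plus the lambda-weighted penalty."""
    N = H.shape[0]
    eps = 1e-12  # numerical guard, an implementation choice
    data_term = -np.sum(Y * np.log(H + eps)
                        + (1.0 - Y) * np.log(1.0 - H + eps)) / N
    penalty = lam / (2.0 * N) * sum(np.sum(t ** 2) for t in thetas)
    return data_term + penalty
```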
S05: optimizing the weight parameters. Over all weight parameters θ_{ij}^{(l)}, optimize J(Θ) so that it is minimized, i.e.

Θ = argmin J(Θ).

The optimization includes automatic optimization by general machine learning software, or traversal optimization over weight parameter combinations: for each θ_{ij}^{(l)} in turn, steps S03 and S04 are executed repeatedly over its range of values with a designated step length, and the θ_{ij}^{(l)} minimizing J(Θ) is taken as the optimization result.
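The traversal variant of step S05 amounts to a coordinate-wise grid search; a sketch follows (the value range and step length are illustrative, H_of is an assumed callback that re-runs step S03 for the current weights, and cost is the step S04 sketch above):

```python
import numpy as np

def traverse_optimize(thetas, H_of, Y, grid=np.arange(-1.0, 1.01, 0.1)):
    """Step S05 traversal: vary one theta_ij^(l) at a time over the grid,
    re-run S03/S04, and keep the value that minimizes J(Theta)."""
    for theta in thetas:
        for idx in np.ndindex(theta.shape):
            best_v, best_j = theta[idx], np.inf
            for v in grid:
                theta[idx] = v
                j = cost(H_of(thetas), Y, thetas)  # S03 forward + S04 cost
                if j < best_j:
                    best_v, best_j = v, j
            theta[idx] = best_v
    return thetas
```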
In a second aspect, the government affair data labeling system comprises at least six modules connected in sequence: parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling; the six modules execute steps S0, S1, S2, S3, S4 and S5 respectively. The government affair data labeling system is essentially a computer system.
Further, the parameter estimation module comprises five modules: sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization; the five modules execute steps S01, S02, S03, S04 and S05 respectively.
Advantages and beneficial effects of the invention: the invention provides a government affair data labeling system and method offering automatic classification and identification of unknown-class, mass government affair data; it can reduce a large amount of manual identification workload and can even complete identification work that is infeasible to do manually. After identification, each piece of original data carries a corresponding data type label, which facilitates subsequent data retrieval and professional applications. The method is equally applicable to blind classification, blind identification and data type subdivision of other kinds of data information, or of data generated by other kinds of business systems.
Drawings
FIG. 1 is a flow chart of a method for tagging government affair data;
FIG. 2 is a flow chart of a weight parameter optimizing estimation method;
FIG. 3 is a schematic block diagram of a neural network training model;
FIG. 4 is a schematic block diagram of the government affair data labeling system.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and examples. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Example 1: government affair data labeling method
As shown in FIG. 1, the government affair data labeling system and method sample the original data, attach a preset type mark, feed the result into the training model, and output the data category after the training model classifies it. The data labels comprise the data categories output by the training model; original data sampling includes intercepting a number of segments of the original data. The preset types comprise numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos, texts and other types, where "other types" means every type other than the seven just listed. The training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, by empirical estimation or by training model self-learning estimation; empirical estimation means manually estimating each weight parameter value of the model from personal experience. In this embodiment the weight parameters are estimated by self-learning with the neural network training model; the initial values for the estimation can be empirical values, which speeds up the optimization;
S1: collecting data, namely collecting service system data to obtain various data, wherein the service system data comprises streaming or non-streaming data, structured or unstructured data, document data, Internet data and the like;
S2: data preprocessing, including sampling the original data and attaching a preset type mark; the original data sampling includes intercepting the first d consecutive bits of an original data unit. An original data unit comprises any one or more of a data file, a document, a database, a data table, similar fields of a data table (whose class may be unknown), and the like. Attaching the preset type mark means prepending a preset type identification code of several bits to the sampled bits. Since a typical data file occupies several KB, several MB or even more, performing feature recognition on all the data of a file would give the most accurate feature information, but the computation and time required could be astronomical. The value of d therefore balances the computational load and efficiency against the effectiveness of feature extraction: too large a value makes the computation excessive, while too small a value, although it may represent the data features of part of the original data unit, may fail to capture valid features of data items with longer format fields or of image data. The host computer of this embodiment is a workstation or server with relatively strong computing power, and d = 800 bits is generally taken. The preset general types number, Chinese character, English character, mixture of numbers and characters, picture, video, text and other are marked in sequence by the 3-bit binary numbers 000, 001, 010, 011, 100, 101, 110 and 111;
S3: data analysis, calculating the classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
for each set of input preprocessed sample data, the total number of data bits is m bits, m=d+b, where b is the number of bits of the preset type flag, in this embodiment b=3, d=800, m=803; for each set of inputs, record x i Taking binary 0 or 1, i is more than or equal to 1 and less than or equal to m for the ith bit value of the input data after the addition of a preset type mark (i is only used for inputting the data at the place); the neural network training model has L training layers, wherein the first layer is an input layer, the last layer is connected with an output layer, the output layer is not calculated into the total layer number, and each layer has s l The number of the activation items is equal to or more than 1 and equal to or less than L, and the number of the activation items of the last layer is consistent with the expected total classification number K, namely s 1 =m,s L =k; the present embodiment uses a 5-layer neural network training model (no output layer is calculated), i.e., l=5, and s 1 =s 2 =s 3 =s 4 =m=803,s 5 =k=20; the weight parameters of the training transfer from the jth position of the layer I to the (l+1) th position of the training model are as follows
Figure BDA0003778489240000091
1≤j≤s l ,1≤i≤s l+1
First layer ith bit input value
Figure BDA0003778489240000092
Second layer ith intermediate training value +.>
Figure BDA0003778489240000093
The ith intermediate training value of the first layer is +.>
Figure BDA0003778489240000094
1≤i≤s l There is
Figure BDA0003778489240000095
Wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0003778489240000096
for the weight parameter vector when all intermediate training values of the first layer are transferred to the ith training value of the first layer (1) layer, the superscript T is matrix transposition, and the +.>
Figure BDA0003778489240000097
X l Intermediate training values, otherwise known as intermediate quantities,
Figure BDA0003778489240000098
the final layer output value is obtained after the layer-by-layer calculation
Figure BDA0003778489240000099
1≤i≤K;
S4: outputting the classification result, namely the data label h. When x_i^{(L+1)} ≤ K_1, take h_i = 0; when x_i^{(L+1)} ≥ K_2, take h_i = 1; when K_1 < x_i^{(L+1)} < K_2, the data is abnormal: it does not belong to any trained data category, or the classification model needs to be retrained. Here K_1 is the error range for probability 0 and K_2 is the error range for probability 1. The classification result is h = [h_1, h_2, …, h_K]; relative to the preset types, the classification result is also called the subdivided data type. If the preset number type is subdivided, this embodiment further comprises three subclasses: identity information, position information and statistical information; the subdivided data classes depend mainly on the classification requirements of the actual data. This embodiment takes K_1 = 0.1 and K_2 = 0.9, and the classification result h is a 1×20 matrix;
S5: labeling and archiving the data, labeling the original data unit corresponding to the sampled data with the classification result h, and archiving and storing the labeled original data unit.
Steps S1 to S5 are repeated for each group of original data units, completing the labeling of all the original data.
Further, as shown in fig. 2 and 3, the training model self-learning estimation in step S0 includes a neural network training model parameter estimation method, which includes the following steps:
S01: acquiring sample data. Let the total number of samples be N, and let the data label of the n-th sample, y^{(n)} = [y_1^{(n)}, y_2^{(n)}, …, y_K^{(n)}], be known: y_k^{(n)} = 1 when the sample belongs to the k-th class, and y_k^{(n)} = 0 otherwise, where 1 ≤ n ≤ N and 1 ≤ k ≤ K;
S02: sample data preprocessing, including sampling the sample data and attaching a preset type mark; the sample data sampling includes intercepting the first d consecutive bits of a sample data unit. A sample data unit comprises any one or more of a data file, a document, a database, a data table, similar fields of a data table (whose class may be unknown), and the like. Attaching the preset type mark means prepending a preset type identification code of several bits to the sampled bits;
S03: sample training, calculating the training values of each layer of the neural network model.

For each group of input preprocessed sample sampling data, the number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark; in this embodiment b = 3, d = 800 and m = 803. x_i is the i-th bit value of the input data after the preset type mark has been attached, taking the binary value 0 or 1, 1 ≤ i ≤ m (here i indexes the input data only). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; the output layer is not counted in the total number of layers. Layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K. This embodiment uses a 5-layer neural network training model (the output layer is not counted), i.e. L = 5, with s_1 = s_2 = s_3 = s_4 = m = 803 and s_5 = K = 20. The weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1}.

Input the sample data and the initial weight parameter values. On the first trial all initial values are set to 0.5; during subsequent debugging, the weight parameter results of one trial are used as the initial values for the next, which is considered more reasonable and accelerates the test and debugging process. With x_i^{(1)} = x_i, the i-th intermediate value of layer l+1 is x_i^{(l+1)}, 1 ≤ i ≤ s_{l+1}, with

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values (also called intermediate quantities).

Compute layer by layer until the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K, are obtained, and record the training result of the n-th sample as h^{(n)} = [h_1^{(n)}, h_2^{(n)}, …, h_K^{(n)}], 1 ≤ n ≤ N.

Step S03 is repeated for each group of input preprocessed sample sampling data;
S04: calculating the cost function. For all training results and weight parameters, calculate the cost function J(Θ), where Θ is the set of all θ_{ij}^{(l)}:

J(Θ) = -(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_k^{(n)} ln h_k^{(n)} + (1 - y_k^{(n)}) ln(1 - h_k^{(n)}) ] + (λ/(2N)) Σ_{l=1}^{L} Σ_{i=1}^{s_{l+1}} Σ_{j=1}^{s_l} (θ_{ij}^{(l)})²,

where λ is the deviation penalty parameter, 0 ≤ λ ≤ 1; λ is an empirical parameter and is generally small, taken as 0.001 in this embodiment. The cost function is essentially the sum of three terms: the positive deviation expectation and the negative deviation expectation of the probability, and the deviation penalty term;
In this embodiment the total number of samples is N = 300,000: 200,000 are used for model training and 100,000 for model testing after training; when the test result is unsatisfactory, the parameters are adjusted and training and testing are repeated;
S05: optimizing the weight parameters. Over all weight parameters θ_{ij}^{(l)}, optimize J(Θ) so that it is minimized, i.e.

Θ = argmin J(Θ).

The optimization includes automatic optimization by general machine learning software, or traversal optimization over weight parameter combinations: for each θ_{ij}^{(l)} in turn, steps S03 and S04 are executed repeatedly over its range of values with a designated step length, and the θ_{ij}^{(l)} minimizing J(Θ) is taken as the optimization result. This embodiment adopts general neural network training software (such as the TensorFlow open-source machine learning platform); the principle is shown in FIG. 3 and FIG. 4. After the number of training layers and the activation items of each layer are defined, the software automatically performs machine learning and outputs the optimized weight parameters Θ and the value of the cost function J(Θ).
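For the embodiment's model shape (an input of m = 803 bits, layers 2 to 4 of 803 activation items each, layer 5 of K = 20 outputs, sigmoid activations throughout), a TensorFlow/Keras declaration might look like the following sketch. The optimizer, the L2 factor standing in for λ, and the loss name are our assumptions, and Keras's binary cross-entropy matches the data term of J(Θ) only up to averaging constants:

```python
import tensorflow as tf

reg = tf.keras.regularizers.l2(0.001)  # stands in for the lambda penalty
model = tf.keras.Sequential([
    # layers 2-4: 803 activation items each (layer 1 is the 803-bit input)
    tf.keras.layers.Dense(803, activation="sigmoid",
                          input_shape=(803,), kernel_regularizer=reg),
    tf.keras.layers.Dense(803, activation="sigmoid", kernel_regularizer=reg),
    tf.keras.layers.Dense(803, activation="sigmoid", kernel_regularizer=reg),
    # layer 5: K = 20 outputs
    tf.keras.layers.Dense(20, activation="sigmoid", kernel_regularizer=reg),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_train, Y_train, validation_data=(X_test, Y_test))
```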
In this embodiment the weight parameters are estimated by self-learning with the neural network training model. In the actual estimation process the sample data are divided into two parts: one part is used to train the weight parameter estimate, the other to test the usability of the estimation result. All sample data are pre-classified and known, and the classification method and system are considered practically usable when the test pass rate exceeds 90%. When the automatic classification result of the test data disagrees with the pre-assigned class, the case is generally analyzed manually to find the cause, and the model parameters are adjusted if necessary. Inaccurate classification results are mainly caused by an insufficient number of sampled bits, which cannot clearly represent the features of the original data; other possible causes are that the original data do not match the intended subdivision classes, or that the sampled original data contain uncertain interference terms.
This embodiment generally does not retrain after the test training is completed, since retraining may mean that previous classification results need to be reclassified; this requires case-by-case analysis. To ensure the availability of the classification system, this embodiment requires a test pass rate above 99%, after which all parameters and weights can be frozen. When special cases cannot be classified, the data can be analyzed manually to find the cause; apart from a temporary adjustment of the number of sampling bits, no other parameters are changed without special need, and such cases are classified as problem data.
Example 2: government affair data tagging system
As shown in FIG. 1, the government affair data labeling system comprises at least six modules connected in sequence: parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling; the six modules execute steps S0, S1, S2, S3, S4 and S5 respectively. The government affair data labeling system is essentially a computer system.
Further, as shown in FIG. 2, the parameter estimation module comprises five modules: sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization; the five modules execute steps S01, S02, S03, S04 and S05 respectively.
The working principle of the system is shown in FIG. 4: model training uses sample data, and once the training results are obtained, the model is applied to actual data.
For the automatic classification method and system of government affair data in Embodiments 1 and 2, the weight parameters are generally not adjusted once put into actual application. Model training to re-estimate the weight parameters is performed mainly when a new large class of original data needs to be subdivided, whose subdivision standard may differ from that of the original government affair data. In that case not only must the model be retrained, but parameters such as the number of sampling bits, the preset types and the expected total number of classes must also be reset. The method of the invention can thus in fact be applied to the automatic subdivision of other kinds of data information.
The basic principle of the invention is as follows: determine the number of expected data labels according to the total number of possible subdivision categories of the government affair data; preset general data types according to the possible informatized representations of the original data units, such as numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos and texts; collect explicitly classified original data as training samples to train the model and obtain reasonable weight parameters; and classify and identify the original data units to be classified with the trained model, archiving them after labeling is completed.
Applications of the invention: classify the data with the model and establish labels for normally classifiable data to ensure usability; perform deep analysis on data that cannot be matched and attempt fuzzy matching; clean duplicated data; archive dirty or polluted data, feed back to the related business systems, and attempt to obtain standard business data a second time; and finally archive and review the data that still cannot be matched. The system background performs statistical analysis on the matched label data, can provide change monitoring and data interface services, and realizes intelligent data retrieval and supervision. Data with unsatisfactory classification results can on the one hand be classified directly as problem data or other types, and on the other hand be analyzed manually, retraining the model if necessary. For other kinds of data information, or data generated by other kinds of business systems, applying the method to further subdivide data categories mainly requires corresponding design of the model parameters, such as the number of sampling bits, the preset types and the expected total number of classes.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art can make modifications and variations, including to the selection and initialization of parameters, without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A government affair data labeling method, characterized in that original data is sampled, a preset type mark is attached, the result is input into a training model, and the data category is output after the training model classifies it; the data labels comprise the data categories output by the training model; the original data sampling includes intercepting a number of segments of the original data; the preset types comprise numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos and texts; the training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, including empirical estimation or training model self-learning estimation, wherein the empirical estimation comprises manually estimating each weight parameter value of the model according to personal experience;
S1: collecting data, namely collecting service system data, wherein the service system data comprises streaming or non-streaming data, structured or unstructured data, document data and Internet data;
S2: data preprocessing, comprising sampling the original data and attaching a preset type mark, wherein the original data sampling comprises intercepting the first d consecutive bits of an original data unit; the original data unit comprises any one or more of data files, documents, databases, data tables and similar fields of data tables; attaching the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
S3: data analysis, calculating the classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
for each group of input preprocessed sampling data, the total number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark; for each group of inputs, x_i denotes the i-th bit value of the input data after the preset type mark has been attached; the neural network training model has L training layers, wherein the first layer is an input layer and the last layer is connected to an output layer; layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K; the weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1};

the i-th input value of the first layer is x_i^{(1)} = x_i; the i-th intermediate training value of the second layer is x_i^{(2)} = g((Θ_i^{(1)})^T X^{(1)}); the i-th intermediate training value of layer l+1 is

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values of the model;

the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K, are obtained after the layer-by-layer calculation;
S4: outputting the classification result, namely the data label h: when x_i^{(L+1)} ≤ K_1, take h_i = 0; when x_i^{(L+1)} ≥ K_2, take h_i = 1; when K_1 < x_i^{(L+1)} < K_2, the data is abnormal, not belonging to the trained data categories, or the classification model needs to be retrained; wherein K_1 is the error range for probability 0 and K_2 is the error range for probability 1; the classification result is h = [h_1, h_2, …, h_K];
S5: labeling and archiving the data, labeling the original data unit corresponding to the sampled data by using the classification result h, and archiving and storing the labeled original data unit.
2. The method for labeling government affair data according to claim 1, wherein the training model self-learning estimation in step S0 comprises a neural network training model parameter estimation method, comprising the steps of:
S01: acquiring sample data, the total number of samples being N, the data label of the n-th sample, y^{(n)} = [y_1^{(n)}, y_2^{(n)}, …, y_K^{(n)}], being known: y_k^{(n)} = 1 when the sample belongs to the k-th class and y_k^{(n)} = 0 otherwise, wherein 1 ≤ n ≤ N and 1 ≤ k ≤ K;
S02: sample data preprocessing, including sampling the sample data and attaching a preset type mark, wherein the sample data sampling includes intercepting the first d consecutive bits of a sample data unit; the sample data unit comprises any one or more of data files, documents, databases, data tables and similar fields of data tables; attaching the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
S03: sample training, calculating the training values of each layer of the neural network model:

for each group of input preprocessed sample sampling data, the number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark; x_i is the i-th bit value of the input data after the preset type mark has been attached; the neural network training model has L training layers, wherein the first layer is an input layer and the last layer is connected to an output layer; layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K; the weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1};

the sample data and the initial weight parameter values are input, with x_i^{(1)} = x_i; the i-th intermediate value of layer l+1 is

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values of the model;

the calculation proceeds layer by layer until the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K, are obtained, and the training result of the n-th sample is recorded as h^{(n)} = [h_1^{(n)}, h_2^{(n)}, …, h_K^{(n)}], 1 ≤ n ≤ N;
step S03 is repeated for each group of input preprocessed sample sampling data;
S04: calculating the cost function, calculating the cost function J(Θ) for all training results and weight parameters, Θ being the set of all θ_{ij}^{(l)}:

J(Θ) = -(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_k^{(n)} ln h_k^{(n)} + (1 - y_k^{(n)}) ln(1 - h_k^{(n)}) ] + (λ/(2N)) Σ_{l=1}^{L} Σ_{i=1}^{s_{l+1}} Σ_{j=1}^{s_l} (θ_{ij}^{(l)})²,

wherein λ is the deviation penalty parameter, 0 ≤ λ ≤ 1;
S05: optimizing the weight parameters: over all weight parameters θ_{ij}^{(l)}, optimizing J(Θ) so that J(Θ) is minimized, i.e.

Θ = argmin J(Θ);

the optimizing comprises automatic optimizing by general machine learning software or traversal optimizing over weight parameter combinations; the weight parameter combination traversal optimizing comprises, for each θ_{ij}^{(l)} one by one, repeatedly executing steps S03 and S04 over its range of values with a designated step length, and taking the θ_{ij}^{(l)} that minimizes J(Θ) as the optimizing result.
3. A government affair data labeling system, characterized by comprising at least six modules connected in sequence: parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling, the six modules respectively executing steps S0, S1, S2, S3, S4 and S5 of claim 1 in sequence.
4. The government affair data labeling system according to claim 3, wherein the parameter estimation module comprises five modules: sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization, the five modules respectively executing steps S01, S02, S03, S04 and S05 of claim 2 in sequence.
CN202210922987.7A 2022-08-02 2022-08-02 Government affair data labeling system and method Active CN115357769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210922987.7A CN115357769B (en) 2022-08-02 2022-08-02 Government affair data labeling system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210922987.7A CN115357769B (en) 2022-08-02 2022-08-02 Government affair data labeling system and method

Publications (2)

Publication Number Publication Date
CN115357769A CN115357769A (en) 2022-11-18
CN115357769B true CN115357769B (en) 2023-07-04

Family

ID=84033026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210922987.7A Active CN115357769B (en) 2022-08-02 2022-08-02 Government affair data labeling system and method

Country Status (1)

Country Link
CN (1) CN115357769B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110756A (en) * 2019-04-09 2019-08-09 北京中科智营科技发展有限公司 A kind of data classification optimization method and optimization device
CN113505222A (en) * 2021-06-21 2021-10-15 山东师范大学 Government affair text classification method and system based on text circulation neural network
CN114091472A (en) * 2022-01-20 2022-02-25 北京零点远景网络科技有限公司 Training method of multi-label classification model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060195782A1 (en) * 2005-02-28 2006-08-31 Microsoft Corporation Method and system for classifying and displaying tables of information
US11817993B2 (en) * 2015-01-27 2023-11-14 Dell Products L.P. System for decomposing events and unstructured data
US11334795B2 (en) * 2020-03-14 2022-05-17 DataRobot, Inc. Automated and adaptive design and training of neural networks


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Application of Knowledge Extraction for the Government Affairs Domain; Hu Tiantian; China Master's Theses Full-text Database, Social Sciences I; p. G110-2 *

Also Published As

Publication number Publication date
CN115357769A (en) 2022-11-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant