CN113643116B - Company classification method based on financial voucher data and computer-readable medium

Info

Publication number: CN113643116B
Application number: CN202110969456.9A
Authority: CN
Inventors: 戴悦; 王耀左
Original assignee: Cosco Shipping Technology Beijing Co Ltd
Current assignee: Cosco Shipping Technology Beijing Co Ltd
Priority date: 2021-08-23
Filing date: 2021-08-23
Publication date: 2023-10-27
Anticipated expiration: 2041-08-23
Also published as: CN113643116A; ZA202109633B

Abstract

A company classification method based on financial voucher data, comprising the steps of: S1: grouping the original financial voucher detail data of a single company to obtain first sample data on the monthly amounts incurred per accounting subject and second sample data on the posting frequency, wherein the column indexes of the first sample data and of the second sample data each comprise three dimensions: time, accounting subject, and debit/credit flag; S2: aggregating the first sample data of all companies to obtain first total sample data, and aggregating the second sample data of all companies to obtain second total sample data; S3: merging the first total sample data and the second total sample data of all companies to obtain initial sample data of all companies; S4: determining, according to a preset number of classes, a number of initial cluster center points equal to the number of classes; S5: performing classification according to the number of classes, the initial cluster center points, and the initial sample data, and determining the cluster label of each class and the companies contained in each class.

Description

Company classification method based on financial voucher data and computer-readable medium

Technical Field

The application belongs to the technical field of financial voucher data processing, and in particular relates to a company classification method based on financial voucher data and to a computer-readable medium.

Background

In the prior art, companies are generally classified according to their size or scope of business. Such classification is subjective to a certain degree, lacks objectivity, and yields results of little practical reference value.

Disclosure of Invention

In view of this, in one aspect, some embodiments disclose a company classification method based on financial voucher data. Specifically, the method includes the steps of:

S1: grouping the original financial voucher detail data of a single company to obtain first sample data on the monthly amounts incurred per accounting subject and second sample data on the posting frequency, wherein the column index of the first sample data comprises three dimensions: time, accounting subject, and debit/credit flag, and the column index of the second sample data comprises the same three dimensions;

S2: aggregating the first sample data of all companies to obtain first total sample data, and aggregating the second sample data of all companies to obtain second total sample data;

S3: merging the first total sample data and the second total sample data of all companies to obtain initial sample data of all companies;

S4: determining, according to a preset number of classes, a number of initial cluster center points equal to the number of classes;

S5: performing classification according to the number of classes, the initial cluster center points, and the initial sample data, and determining the cluster label of each class and the companies contained in each class.

Further, some embodiments disclose a company classification method based on financial voucher data in which, in step S1, the first sample data is a (1×n) matrix and the second sample data is a (1×n) matrix, where n is the number of column indexes.

In step S2, the first total sample data is an (m×N) matrix and the second total sample data is an (m×N) matrix, where m is the number of companies, N is the number of column indexes, equal to the number of column indexes in the union of the column indexes of all companies, and m and N are natural numbers.

In step S3, the first total sample data and the second total sample data are standardized and then merged column-wise to obtain the initial sample data as an (m×2N) matrix, where m is the number of companies and 2N is the total number of column indexes.

In step S4, the preset number of classes is k, and determining the initial cluster center points specifically includes:

S401: calculating, for each company, the sum of squares of the data under all column indexes in the first total sample data, taking the company with the largest sum of squares as the first initial cluster center point, and removing the sample data of the first initial cluster center point from the initial sample data;

S402: calculating the distance between the sample data of each company remaining in the initial sample data after step S401 and the sample data of the first initial cluster center point, taking the company with the largest distance as the second initial cluster center point, and removing its sample data from the initial sample data;

S403: calculating the distances between the sample data of each company remaining in the initial sample data after step S402 and the sample data of the first and second initial cluster center points, taking the company with the largest distance as the third initial cluster center point, and removing its sample data from the initial sample data;

S404: continuing in the same way until k initial cluster center points, equal in number to the number of classes k, are obtained, forming the set of initial cluster center points.

In some embodiments, determining the initial cluster center points more specifically includes:

S401: calculating, for each company, the sum of squares $P_i$ of the data under all column indexes in the (m×N) first total sample data matrix, taking the company with the largest $P_i$ as the first initial cluster center point, and removing its (1×2N) sample data matrix from the initial sample data, whereupon the initial sample data becomes an ((m-1)×2N) matrix; wherein $P_i$ is given by formula (1):

$$P_i = \sum_{l=1}^{N} \left(x_i y_l\right)^2 \tag{1}$$

In formula (1), $x_i y_l$ denotes the entry in the i-th row and l-th column of the (m×N) first total sample data matrix, l is a natural number from 1 to N, i is a natural number from 1 to m, m is the number of companies, and N is the number of column indexes;

S402: calculating the distance between the sample data of each company in the ((m-1)×2N) initial sample data matrix and the (1×2N) sample data matrix of the first initial cluster center point, taking the company with the largest distance as the second initial cluster center point, and removing its (1×2N) sample data matrix from the initial sample data, whereupon the initial sample data becomes an ((m-2)×2N) matrix;

wherein the distance $L_{1-i}$ is calculated by formula (2):

In formula (2), $x_1 y_l$ denotes the entry in the l-th column of the (1×2N) sample data matrix of the first initial cluster center point; $x_i y_l$ denotes the entry in the i-th row and l-th column of the ((m-1)×2N) initial sample data matrix; l is a natural number from 1 to 2N; i is a natural number from 1 to m-1; and $L_{1-i}$ is the distance between the sample data of the first initial cluster center point and the i-th data sample in the initial sample data;

S403: calculating the distances between the sample data of each company in the ((m-2)×2N) initial sample data matrix remaining after step S402 and the sample data of the first and second initial cluster center points, taking the company with the largest distance as the third initial cluster center point, and removing its sample data from the initial sample data, whereupon the initial sample data becomes an ((m-3)×2N) matrix;

wherein the distance $L_{j-i}$ between two sample data is calculated by formula (3):

In formula (3), $x_j y_l$ denotes the entry in the l-th column of the (1×2N) matrix of sample data j, and $x_i y_l$ denotes the entry in the l-th column of the (1×2N) matrix of sample data i; l is a natural number from 1 to 2N; i is a natural number from 1 to m-2; j is a natural number from 1 to m-2; and $L_{j-i}$ is the distance between sample data j and sample data i;

S404: continuing in the same way until k initial cluster center points, equal in number to the number of classes k, are obtained, forming the set of initial cluster center points as a (k×2N) matrix, where k is a natural number smaller than m.

Some embodiments disclose a company classification method based on financial voucher data in which step S5 specifically includes: calculating the distance between each sample in the initial sample data and each initial cluster center point, assigning the company represented by each sample to the class of the company represented by its nearest initial cluster center point, and in this way dividing all companies into the preset number of classes.

Some embodiments disclose a company classification method based on financial voucher data in which step S5 specifically includes:

S501: calculating the distance between each sample in the (m×2N) initial sample data matrix and each sample in the (k×2N) matrix of initial cluster center points, and assigning each company to the class of the initial cluster center point nearest to its sample data, so that all companies are divided into k classes, wherein the sample data of the companies in each class form a (c×2N) matrix and c is the number of companies in that class;

S502: calculating the mean of each column of each (c×2N) matrix, the means forming a (1×2N) matrix;

S503: assembling the (1×2N) matrices formed in step S502 into a new (k×2N) matrix of cluster center points; if this new matrix is identical to the (k×2N) matrix of cluster center points used in step S501, the clustering algorithm ends; otherwise, proceeding to step S504;

S504: calculating the distance between each sample in the (m×2N) initial sample data matrix and each sample in the new (k×2N) matrix of cluster center points, and assigning each company to the class of the new cluster center point nearest to its sample data, so that all companies are divided into k classes, wherein the sample data of the companies in each class form a (c×2N) matrix and c is the number of companies in that class;

S505: if the classes obtained in step S504 are the same as those of step S501 and each class contains the same companies, ending the classification; otherwise, repeating the clustering calculation from step S501.
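The iteration in steps S501 to S505 can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the Euclidean distance, the `max_rounds` safeguard, the rule of keeping the old center when a class becomes empty, the function names, and the four two-dimensional points are all assumptions introduced here.

```python
import math


def euclidean(p, q):
    """Distance between two samples (assumed Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, centers):
    """S501 / S504: index of the nearest center for every sample."""
    return [min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points]


def update(points, labels, centers):
    """S502: column-wise mean of the samples in each class."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(list(old))  # assumption: keep an empty class's center
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers


def cluster(points, centers, max_rounds=100):
    centers = [list(c) for c in centers]
    labels = assign(points, centers)                    # S501
    for _ in range(max_rounds):                         # safeguard, not in the text
        new_centers = update(points, labels, centers)   # S502, S503
        if new_centers == centers:                      # S503: centers unchanged
            break
        new_labels = assign(points, new_centers)        # S504
        centers = new_centers
        if new_labels == labels:                        # S505: classes unchanged
            break
        labels = new_labels
    return labels, centers


# Hypothetical samples and initial centers, for illustration only.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = cluster(points, centers=[[0, 0], [10, 11]])
print(labels)
print(centers)
```

With these illustrative points the first two samples end up in class 0 and the last two in class 1.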

In another aspect, some embodiments disclose a computer-readable medium containing computer-executable instructions that, when executed by a data processing device, carry out the company classification method based on financial voucher data.

According to the company classification method based on financial voucher data disclosed in the embodiments of the application, company financial data is processed along several dimensions, such as monthly amounts incurred, posting frequency, and accounting subject content, and companies are clustered on the resulting data samples. The classification result therefore reflects the type characteristics of each company more objectively and has greater practical reference value.

Drawings

FIG. 1 is a schematic flow chart of a company classification method based on financial voucher data.

Detailed Description

As used herein, the word "embodiment" does not imply that any embodiment described as "exemplary" is necessarily preferred or advantageous over other embodiments. Unless otherwise specified, performance index testing in the examples of the present application was performed using conventional testing methods in the art. It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure.

Unless otherwise defined, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; other test methods and techniques not specifically mentioned in the present application are those commonly used by those skilled in the art.

The terms "substantially" and "about" are used herein to describe small fluctuations. For example, they may refer to less than or equal to ±5%, such as less than or equal to ±2%, such as less than or equal to ±1%, such as less than or equal to ±0.5%, such as less than or equal to ±0.2%, such as less than or equal to ±0.1%, such as less than or equal to ±0.05%. Numerical data presented or represented herein in a range format is used only for convenience and brevity and should therefore be interpreted flexibly to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range. For example, a numerical range of "1 to 5%" should be interpreted to include not only the explicitly recited values of 1% to 5%, but also include individual values and sub-ranges within the indicated range. Thus, individual values, such as 2%, 3.5% and 4%, and subranges, such as 1% to 3%, 2% to 4% and 3% to 5%, etc., are included in this numerical range. The same principle applies to ranges reciting only one numerical value. Moreover, such an interpretation applies regardless of the breadth of the range or the characteristics being described.

In this document, including the claims, transitional phrases such as "comprising", "including", "carrying", "having", "containing", "involving", and the like are to be construed as open-ended, i.e., to mean "including, but not limited to". Only the transitional phrases "consisting of ..." and "composed of ..." are closed.

Numerous specific details are set forth in the following examples in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In the examples, some methods, means, instruments, devices, etc. well known to those skilled in the art are not described in detail in order to highlight the gist of the present application.

Provided there is no conflict, the technical features disclosed in the embodiments of the application may be combined arbitrarily, and the resulting technical solutions fall within the disclosure of the embodiments of the application.

In some embodiments, as shown in FIG. 1, a company classification method based on financial voucher data includes the steps of:

S1: grouping the original financial voucher detail data of a single company to obtain first sample data on the monthly amounts incurred per accounting subject and second sample data on the posting frequency, wherein the column index of the first sample data comprises three dimensions: time, accounting subject, and debit/credit flag, and the column index of the second sample data comprises the same three dimensions. In general, the first sample data may be a (1×n) matrix, expressed as $(x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1n})$, where $x_{1i}$ is the amount corresponding to column index i, each column index is a data format comprising time, accounting subject, and debit/credit flag, and n is the total number of column indexes. Likewise, the second sample data may be a (1×n) matrix, expressed as $(y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1n})$, where $y_{1j}$ is the frequency corresponding to column index j, each column index is a data format comprising time, accounting subject, and debit/credit flag, and n is the total number of column indexes. The original financial voucher detail data of each company is grouped in the same way, yielding first sample data and second sample data for each company.

S2: aggregating the first sample data of all companies to obtain first total sample data, and aggregating the second sample data of all companies to obtain second total sample data. In general, the accounting subjects used by different companies differ to some extent. When financial voucher data is aggregated at the level of first-level subjects, most first-level subjects overlap between companies, but a small number always differ, so the column indexes differ from company to company. The column indexes of all companies can therefore be merged: the resulting union of column indexes covers the first-level subjects of every company, so that all of them are taken into account during classification, which improves the objectivity and practicality of the classification result. During the merge, data for column indexes that a company lacks is generally filled with zeros, so that every company has the same total number of column indexes. Aggregating the first sample data of all companies yields the first total sample data on monthly amounts incurred per accounting subject as an (m×N) matrix, as represented by formula (4):

$$\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1N} \\ x_{21} & x_{22} & \cdots & x_{2N} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mN} \end{pmatrix} \tag{4}$$

where x represents the monthly amount incurred per accounting subject, m is the number of companies, N is the number of column indexes, equal to the number of column indexes in the union of the column indexes of all companies, and m and N are natural numbers.

By the same method and process, the second total sample data on posting frequency is obtained as an (m×N) matrix, as represented by formula (5):

$$\begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1N} \\ y_{21} & y_{22} & \cdots & y_{2N} \\ \vdots & \vdots & & \vdots \\ y_{m1} & y_{m2} & \cdots & y_{mN} \end{pmatrix} \tag{5}$$

where y represents the frequency, m is the number of companies, N is the number of column indexes, equal to the number of column indexes in the union of the column indexes of all companies, and m and N are natural numbers.

S3: merging the first total sample data and the second total sample data of all companies to obtain initial sample data of all companies. For example, the first total sample data represented by matrix (4) is merged with the second total sample data represented by matrix (5) to obtain the initial sample data as an (m×2N) matrix, as represented by formula (6):

$$\begin{pmatrix} x_{11} & \cdots & x_{1N} & y_{11} & \cdots & y_{1N} \\ \vdots & & \vdots & \vdots & & \vdots \\ x_{m1} & \cdots & x_{mN} & y_{m1} & \cdots & y_{mN} \end{pmatrix} \tag{6}$$

In general, the first sample data and the second sample data differ in order of magnitude, and because business volume and company size vary greatly between companies, the first sample data of different companies often differ in magnitude as well, as may the second sample data. So that all data in all samples carry reasonable weight in the classification, the first total sample data and the second total sample data can be standardized. Standardization largely removes the differences in magnitude; the standardized first and second total sample data are then merged column-wise, and the resulting (m×2N) initial sample data matrix is on a single scale. Company size, time, accounting subject, debit/credit flag, amount incurred, and posting frequency can then each exert a reasonable influence on the classification, which gives the result greater practical guiding value.

S4: determining, according to a preset number of classes, a number of initial cluster center points equal to the number of classes. Specifically, if the number of classes is set to k, determining the initial cluster center points includes:

S401: calculating, for each company, the sum of squares of the data under all column indexes in the first total sample data, taking the company with the largest sum of squares as the first initial cluster center point, and removing the sample data of the first initial cluster center point from the initial sample data;

S402: calculating the distance between the sample data of each company remaining in the initial sample data after step S401 and the sample data of the first initial cluster center point, taking the company with the largest distance as the second initial cluster center point, and removing its sample data from the initial sample data;

S403: calculating the distances between the sample data of each company remaining in the initial sample data after step S402 and the sample data of the first and second initial cluster center points, taking the company with the largest distance as the third initial cluster center point, and removing its sample data from the initial sample data;

S404: continuing in the same way until k initial cluster center points, equal in number to the number of classes k, are obtained, forming the set of initial cluster center points.

As an optional embodiment, in the company classification method based on financial voucher data, determining the initial cluster center points specifically includes:

S401: calculating, for each company, the sum of squares $P_i$ of the data under all column indexes in the (m×N) first total sample data matrix, taking the company with the largest $P_i$ as the first initial cluster center point, and removing its (1×2N) sample data matrix from the initial sample data, whereupon the initial sample data becomes an ((m-1)×2N) matrix; wherein $P_i$ is calculated by formula (1) disclosed herein;

S402: calculating the distance between the sample data of each company in the ((m-1)×2N) initial sample data matrix and the (1×2N) sample data matrix of the first initial cluster center point, taking the company with the largest distance as the second initial cluster center point, and removing its (1×2N) sample data matrix from the initial sample data, whereupon the initial sample data becomes an ((m-2)×2N) matrix; wherein the distance $L_{1-i}$ is calculated by formula (2) disclosed herein;

S403: calculating the distances between the sample data of each company in the ((m-2)×2N) initial sample data matrix remaining after step S402 and the sample data of the first and second initial cluster center points, taking the company with the largest distance as the third initial cluster center point, and removing its sample data from the initial sample data, whereupon the initial sample data becomes an ((m-3)×2N) matrix; wherein the distance $L_{j-i}$ between two sample data is calculated by formula (3) disclosed herein.

S5: performing classification according to the number of classes, the initial cluster center points, and the initial sample data, and determining the cluster label of each class and the companies contained in each class. In general, the distance between each sample in the initial sample data and each initial cluster center point may be calculated, the company represented by each sample is assigned to the class of the company represented by its nearest initial cluster center point, and all companies are thereby divided into the preset number of classes. As an optional embodiment, step S5 may specifically include:

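S501: calculating the distance between each sample in the (m×2N) initial sample data matrix and each sample in the (k×2N) matrix of initial cluster center points, and assigning each company to the class of the initial cluster center point nearest to its sample data, so that all companies are divided into k classes, wherein the sample data of the companies in each class form a (c×2N) matrix and c is the number of companies in that class;

S502: calculating the mean of each column of each (c×2N) matrix, the means forming a (1×2N) matrix;

S503: assembling the (1×2N) matrices formed in step S502 into a new (k×2N) matrix of cluster center points; if this new matrix is identical to the (k×2N) matrix of cluster center points used in step S501, the clustering algorithm ends; otherwise, proceeding to step S504;

S504: calculating the distance between each sample in the (m×2N) initial sample data matrix and each sample in the new (k×2N) matrix of cluster center points, and assigning each company to the class of the new cluster center point nearest to its sample data, so that all companies are divided into k classes, wherein the sample data of the companies in each class form a (c×2N) matrix and c is the number of companies in that class;

S505: if the classes obtained in step S504 are the same as those of step S501 and each class contains the same companies, ending the classification; otherwise, repeating the clustering calculation from step S501.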

In another aspect, some embodiments disclose a computer-readable medium containing computer-executable instructions that, when executed by a data processing device, carry out the company classification method based on financial voucher data. Generally, computer program instructions or code for performing the operations of some embodiments of the present disclosure can be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Further exemplary details are described below in connection with the embodiments.

Example 1

Example 1 below illustrates the process of obtaining the initial sample data.

Voucher data of a single company

Table 1 List of original voucher details of company A

Table 1 lists the original voucher details of company A. Both the debit and the credit side are shown; when the amounts are summed by debit/credit flag and the absolute values are taken, the two sides balance. For example, voucher line items 2 and 3 are debit entries with amounts totaling 200000, and voucher line item 1 is a credit entry with an amount of 200000; the absolute values are equal. The posting date is the date of the financial transaction, the general ledger subject is the detail subject code, and the currency amount is the amount incurred for that detail line.

Financial data of all companies

Company A and company B are used as examples below. Table 2 shows all the financial data of company A and company B.

Table 2 Financial data of company A and company B

First sample data of company A

Taking "posting year-month_first-level subject code_debit/credit flag" as the column index, the first sample data of company A on the monthly amounts incurred per accounting subject is determined as shown in Table 3, where 202106_1002_H and 202106_2121_S are the two column indexes, A is the company code, and the first sample data forms a (1×2) matrix:

Table 3 First sample data of company A

Second sample data of company A

Taking "posting year-month_first-level subject code_debit/credit flag" as the column index, the second sample data of company A on the posting frequency is determined as shown in Table 4, where 202106_1002_H and 202106_2121_S are the two column indexes, A is the company code, and the second sample data forms a (1×2) matrix:

Table 4 Second sample data of company A

Company code	202106_1002_H	202106_2121_S
A	1	2
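The grouping of step S1 for company A can be sketched in Python as follows. The voucher lines are illustrative stand-ins for Table 1: the posting dates, the detail subject codes, and the split of the 200000 debit total across line items 2 and 3 are assumptions, as is the rule that the first-level subject code is the first four characters of the detail subject code. The function name `group_voucher_lines` is hypothetical.

```python
from collections import defaultdict

# Illustrative voucher lines for company A (assumed values, see above).
# Each line: (posting date, detail subject code, debit/credit flag, amount)
voucher_lines = [
    ("2021-06-15", "100201", "H", 200000),  # line item 1, credit
    ("2021-06-15", "212101", "S", 150000),  # line item 2, debit (assumed split)
    ("2021-06-15", "212102", "S", 50000),   # line item 3, debit (assumed split)
]


def group_voucher_lines(lines):
    """Return (amounts, counts), both keyed by 'YYYYMM_subject_flag'."""
    amounts = defaultdict(float)
    counts = defaultdict(int)
    for date, detail_code, flag, amount in lines:
        year_month = date[:4] + date[5:7]
        first_level = detail_code[:4]  # assumption: first four characters
        key = f"{year_month}_{first_level}_{flag}"
        amounts[key] += abs(amount)    # first sample data: monthly amount
        counts[key] += 1               # second sample data: posting frequency
    return dict(amounts), dict(counts)


amounts, counts = group_voucher_lines(voucher_lines)
print(amounts)
print(counts)
```

Under these assumptions the counts agree with Table 4: one posting under 202106_1002_H and two under 202106_2121_S.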

First sample data of company B

Taking "posting year-month_first-level subject code_debit/credit flag" as the column index, the first sample data of company B on the monthly amounts incurred per accounting subject is determined as shown in Table 5, where 202107_1003_H and 202107_2121_S are the two column indexes, B is the company code, and the first sample data forms a (1×2) matrix:

Table 5 First sample data of company B

Company code	202107_1003_H	202107_2121_S
B	2000	2000

Second sample data of company B

Taking "posting year-month_first-level subject code_debit/credit flag" as the column index, the second sample data of company B on the posting frequency is determined as shown in Table 6, where 202107_1003_H and 202107_2121_S are the two column indexes, B is the company code, and the second sample data forms a (1×2) matrix:

Table 6 Second sample data of company B

Company code	202107_1003_H	202107_2121_S
B	1	1

First total sample data of all companies

The first sample data of company A is combined with the first sample data of company B to obtain the first total sample data of companies A and B, as shown in Table 7, where 202106_1002_H, 202106_2121_S, 202107_1003_H, and 202107_2121_S are the four column indexes, A and B are the company codes, and the first total sample data forms a (2×4) matrix:

Table 7 First total sample data of all companies

Company code	202106_1002_H	202106_2121_S	202107_1003_H	202107_2121_S
A	200000	200000	0	0
B	0	0	2000	2000

Second total sample data of all companies

The second sample data of company A is combined with the second sample data of company B to obtain the second total sample data of companies A and B, as shown in Table 8, where 202106_1002_H, 202106_2121_S, 202107_1003_H, and 202107_2121_S are the four column indexes, A and B are the company codes, and the second total sample data forms a (2×4) matrix:

Table 8 Second total sample data of all companies

Company code	202106_1002_H	202106_2121_S	202107_1003_H	202107_2121_S
A	1	2	0	0
B	0	0	1	1
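The union of column indexes with zero-filling, which turns Tables 4 and 6 into Table 8, can be sketched as follows. The function name `build_total_matrix` and the choice to sort the merged column indexes alphabetically are assumptions made for this illustration.

```python
def build_total_matrix(per_company):
    """Align per-company rows on the union of column indexes, zero-filling gaps."""
    # Union of all column indexes; sorting is an assumed ordering rule.
    columns = sorted({key for row in per_company.values() for key in row})
    companies = list(per_company)
    matrix = [[per_company[code].get(col, 0) for col in columns]
              for code in companies]
    return companies, columns, matrix


second_samples = {
    "A": {"202106_1002_H": 1, "202106_2121_S": 2},  # Table 4
    "B": {"202107_1003_H": 1, "202107_2121_S": 1},  # Table 6
}

companies, columns, matrix = build_total_matrix(second_samples)
print(columns)
for code, row in zip(companies, matrix):
    print(code, row)
```

The same function applied to the first sample data of both companies gives the layout of Table 7.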

Normalization of the first total sample data

The first total sample data of all companies shown in Table 7 is normalized to obtain the normalized first total sample data shown in Table 9, where 202106_1002_H, 202106_2121_S, 202107_1003_H, and 202107_2121_S are the four column indexes, A and B are the company codes, and the normalized first total sample data forms a (2×4) matrix:

Table 9 Normalized first total sample data

Normalization of the second total sample data

The second total sample data of all companies shown in Table 8 is normalized to obtain the normalized second total sample data shown in Table 10, where 202106_1002_H, 202106_2121_S, 202107_1003_H, and 202107_2121_S are the four column indexes, A and B are the company codes, and the normalized second total sample data forms a (2×4) matrix:

Table 10 Normalized second total sample data

Company code	202106_1002_H	202106_2121_S	202107_1003_H	202107_2121_S
A	0.301511345	1.507556723	-0.904534034	-0.904534034
B	-1	-1	1	1
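The text does not state the normalization formula. The values in Table 10 are, however, reproduced by standardizing each company's row to zero mean and unit population standard deviation, so the sketch below assumes that rule. The function name `standardize_row` and the handling of constant rows are likewise assumptions.

```python
from statistics import mean, pstdev


def standardize_row(row):
    """Per-row z-score using the population standard deviation (assumed rule)."""
    mu = mean(row)
    sigma = pstdev(row)
    if sigma == 0:
        return [0.0 for _ in row]  # assumption: a constant row maps to zeros
    return [(value - mu) / sigma for value in row]


table_8 = {"A": [1, 2, 0, 0], "B": [0, 0, 1, 1]}
normalized = {code: standardize_row(row) for code, row in table_8.items()}

for code, row in normalized.items():
    print(code, [round(value, 9) for value in row])
```

For company A the row mean is 0.75 and the population standard deviation is about 0.829, which gives the entries 0.3015, 1.5076, -0.9045, -0.9045 listed in Table 10.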

Initial sample data

The normalized first total sample data (Table 9) and the normalized second total sample data (Table 10) are combined to obtain the initial sample data listed in Table 11 below, wherein 202106_1002_H_value, 202106_2121_S_value, 202107_1003_H_value, 202107_2121_S_value, 202106_1002_H_count, 202106_2121_S_count, 202107_1003_H_count and 202107_2121_S_count are the eight column indexes, A and B are the company codes, and the initial sample data form a 2×8 matrix. Column indexes representing amounts are suffixed with _value and column indexes representing frequencies are suffixed with _count, so that amount columns and frequency columns never share a column index and no column index is repeated:

Table 11 Initial sample data
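The column merging with suffixes can be sketched as below. The helper `merge_samples` is hypothetical, and the amount values are zero placeholders used only to show the layout; they are not the Table 9 values. The frequency values are those of Table 10.

```python
def merge_samples(amounts, counts, columns):
    """Column-merge amount and frequency rows, suffixing labels so none repeats."""
    merged_columns = [c + "_value" for c in columns] + [c + "_count" for c in columns]
    merged = {}
    for code in amounts:
        row = list(amounts[code]) + list(counts[code])
        merged[code] = dict(zip(merged_columns, row))
    return merged_columns, merged


columns = ["202106_1002_H", "202106_2121_S", "202107_1003_H", "202107_2121_S"]

# Placeholder amounts for illustration only; these are NOT the Table 9 values.
placeholder_amounts = {"A": [0.0, 0.0, 0.0, 0.0], "B": [0.0, 0.0, 0.0, 0.0]}

# Normalized frequencies from Table 10.
counts = {
    "A": [0.301511345, 1.507556723, -0.904534034, -0.904534034],
    "B": [-1.0, -1.0, 1.0, 1.0],
}

merged_columns, initial_sample = merge_samples(placeholder_amounts, counts, columns)
print(merged_columns)
print(initial_sample["B"]["202107_1003_H_count"])
```

Without the suffixes, the four amount columns and the four frequency columns would carry identical labels and could not coexist in one matrix.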

Example 2

Classification of companies based on financial credential data

In Example 2, five companies A, B, C, D and E are classified. There are four classification column indexes, denoted x₁, x₂, x₃ and x₄. The initial sample data of the five companies are shown in Table 12:

Table 12 Initial sample data

Company code	x₁	x₂	x₃	x₄
A	5.1	3.5	1.4	0.2
B	4.9	3.0	1.4	0.2
C	4.7	3.2	1.3	0.2
D	4.6	3.1	1.5	0.2
E	5.0	3.6	1.4	0.2

The initial sample data listed in Table 12 are normalized to obtain the normalized initial sample data listed in Table 13.

Table 13 Normalized initial sample data

Company code	x₁	x₂	x₃	x₄
A	1.351023	0.5033223	-0.60928488	-1.24506042
B	1.43136473	0.3542982	-0.55270519	-1.23295773
C	1.35847229	0.49136232	-0.60697698	-1.24285762
D	1.35865503	0.45288501	-0.51326968	-1.29827036
E	1.3119249	0.56225353	-0.61580149	-1.25837695

For the data matrix of companies A, B, C, D and E, the sum of squares of the data corresponding to the column indexes x₁, x₂, x₃ and x₄ is calculated for each company according to formula (1) disclosed herein, and company A is obtained as the first clustering initial center point. The distance between the data sample of each of the companies B, C, D and E and the data sample of company A is then calculated according to formula (2), and company B, which has the largest distance, is determined as the second clustering initial center point. The data samples of the clustering initial center points A and B form the clustering initial center point set sample data listed in Table 14.

Table 14 Clustering initial center point set sample data

Grouping	Company code	x₁	x₂	x₃	x₄
0	A	1.351023	0.5033223	-0.60928488	-1.24506042
1	B	1.43136473	0.3542982	-0.55270519	-1.23295773
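The selection of the two initial center points can be sketched as follows. Formulas (1) and (2) are not reproduced in this passage, so two assumptions are made: the sum of squares is taken over the Table 12 values (on the row-normalized data of Table 13 every company would score the same, up to rounding), and the distance is the sum of squared differences, which agrees with the figures in Table 18. The function names are illustrative.

```python
def sum_of_squares(row):
    """Sum of squared entries of one company's row (assumed form of formula (1))."""
    return sum(v * v for v in row)


def squared_distance(a, b):
    """Sum of squared differences between two rows (assumed form of formula (2))."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_two_initial_centers(raw, normalized):
    """Return the codes of the first and second clustering initial center points."""
    # First center: the company with the largest sum of squares.
    first = max(raw, key=lambda code: sum_of_squares(raw[code]))
    # Second center: the remaining company farthest from the first center.
    remaining = [code for code in normalized if code != first]
    second = max(
        remaining,
        key=lambda code: squared_distance(normalized[first], normalized[code]),
    )
    return [first, second]


table_12 = {
    "A": [5.1, 3.5, 1.4, 0.2],
    "B": [4.9, 3.0, 1.4, 0.2],
    "C": [4.7, 3.2, 1.3, 0.2],
    "D": [4.6, 3.1, 1.5, 0.2],
    "E": [5.0, 3.6, 1.4, 0.2],
}
table_13 = {
    "A": [1.351023, 0.5033223, -0.60928488, -1.24506042],
    "B": [1.43136473, 0.3542982, -0.55270519, -1.23295773],
    "C": [1.35847229, 0.49136232, -0.60697698, -1.24285762],
    "D": [1.35865503, 0.45288501, -0.51326968, -1.29827036],
    "E": [1.3119249, 0.56225353, -0.61580149, -1.25837695],
}

centers = pick_two_initial_centers(table_12, table_13)
print(centers)
```

Under these assumptions the sketch selects A and then B, matching Table 14.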

The sample data represented by company A are assigned to group 0 and the sample data represented by company B are assigned to group 1. The distances between every company in Table 13 and the data samples of group 0 and group 1 in Table 14 are then calculated according to formula (3) disclosed herein; the results are shown in Table 15 below.

Table 15 Distances between all companies and group 0 and group 1

In Table 15, the first column is the distance from group 0: companies C, D and E are closer to company A, so C, D and E are classified into group 0 together with A. The second column is the distance from group 1: companies A, C, D and E are all farther from company B, so B alone is classified into group 1.

Determining a new set of clustering initial center points.

Companies A, C, D and E, as group 0, form a data set, and each column of the data set is averaged. The data set and the calculated averages are listed in Table 16.

Table 16 Group 0 data set and average values

Company code	x₁	x₂	x₃	x₄
A	1.351023	0.5033223	-0.60928488	-1.24506042
C	1.35847229	0.49136232	-0.60697698	-1.24285762
D	1.35865503	0.45288501	-0.51326968	-1.29827036
E	1.3119249	0.56225353	-0.61580149	-1.25837695
Average value	1.345018805	0.50245579	-0.586333258	-1.261141338

The bottom row of Table 16 gives, for each column index, the average of the data of the four companies A, C, D and E.

Group 1 contains only one data point, B, so the data corresponding to each column index of B are taken as the averages of group 1.

The averages listed in Table 16, taken as the data of group 0, are combined with the data of group 1 in Table 14 to form the new clustering initial center point set sample data shown in Table 17.

Table 17 New clustering initial center point set sample data

Grouping	x₁	x₂	x₃	x₄
0	1.345018805	0.50245579	-0.586333258	-1.261141338
1	1.43136473	0.3542982	-0.55270519	-1.23295773
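The update of the center points by column averaging can be sketched as follows, using the Table 13 rows and the grouping obtained above. The helper `column_means` is an illustrative name.

```python
def column_means(rows):
    """Average each column over the rows that belong to one group."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


table_13 = {
    "A": [1.351023, 0.5033223, -0.60928488, -1.24506042],
    "B": [1.43136473, 0.3542982, -0.55270519, -1.23295773],
    "C": [1.35847229, 0.49136232, -0.60697698, -1.24285762],
    "D": [1.35865503, 0.45288501, -0.51326968, -1.29827036],
    "E": [1.3119249, 0.56225353, -0.61580149, -1.25837695],
}

# Grouping after the first assignment: A, C, D, E in group 0 and B in group 1.
groups = {0: ["A", "C", "D", "E"], 1: ["B"]}

# New center of each group = column-wise mean of its members.
new_centers = {
    g: column_means([table_13[code] for code in members])
    for g, members in groups.items()
}

for g, center in new_centers.items():
    print(g, [round(v, 9) for v in center])
```

For a group with a single member, the column means are simply that member's own values, which is why group 1 keeps the row of company B.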

The distances between every company in Table 13 and the data samples of group 0 and group 1 in Table 17 are calculated again according to formula (3) disclosed herein; the results are shown in Table 18 below.

Table 18 Distances between all companies in Table 13 and group 0 and group 1 in Table 17

Company code	Distance from group 0	Distance from group 1
A	0.000822	0.032011
B	0.031331	0
C	0.001065	0.027143
D	0.00936	0.020827
E	0.005547	0.062139

In Table 18, the first column is the distance from group 0 and the second column is the distance from group 1. Companies A, C, D and E are closer to group 0 than to group 1, so they are classified into group 0; company B is closer to group 1 and is classified into group 1 alone.
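The figures in Table 18 agree, to the six decimals shown, with a distance defined as the sum of squared differences over the four column indexes, without a square root. Formula (3) itself is not reproduced in this passage, so that form is an inference from the numbers rather than a quotation. A minimal sketch:

```python
def squared_distance(a, b):
    """Sum of squared differences between two rows (inferred form of formula (3))."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


table_13 = {
    "A": [1.351023, 0.5033223, -0.60928488, -1.24506042],
    "B": [1.43136473, 0.3542982, -0.55270519, -1.23295773],
    "C": [1.35847229, 0.49136232, -0.60697698, -1.24285762],
    "D": [1.35865503, 0.45288501, -0.51326968, -1.29827036],
    "E": [1.3119249, 0.56225353, -0.61580149, -1.25837695],
}
table_17 = {
    0: [1.345018805, 0.50245579, -0.586333258, -1.261141338],
    1: [1.43136473, 0.3542982, -0.55270519, -1.23295773],
}

# Distance of every company to the center of group 0 and of group 1.
table_18 = {
    code: [squared_distance(row, table_17[g]) for g in (0, 1)]
    for code, row in table_13.items()
}

# Each company joins the group whose center is nearest.
nearest_group = {code: dists.index(min(dists)) for code, dists in table_18.items()}

for code, dists in table_18.items():
    print(code, [round(d, 6) for d in dists], "-> group", nearest_group[code])
```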

At this point, companies A, B, C, D and E are classified into two classes: A, C, D and E are assigned to group 0, and B is assigned to group 1. The grouping result no longer changes, so the classification calculation ends. The classification results of companies A, B, C, D and E are shown in Table 19.

Table 19 Classification results of companies A, B, C, D and E

Company code	Group number
A	0
B	1
C	0
D	0
E	0
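The whole of Example 2 can be replayed as a short loop: assign each company to its nearest center, recompute the centers as column means, and stop when the grouping no longer changes. The sketch below starts from the Table 14 centers and assumes the sum-of-squared-differences distance that matches Table 18. The function names, the `max_iter` cap and the rule of keeping the old center for an empty group are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Sum of squared differences between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(samples, centers):
    """Map each company code to the index of its nearest center."""
    return {
        code: min(range(len(centers)), key=lambda g: squared_distance(row, centers[g]))
        for code, row in samples.items()
    }


def update_centers(samples, labels, centers):
    """Recompute each center as the column-wise mean of its members."""
    new_centers = []
    for g, old in enumerate(centers):
        members = [samples[code] for code, lab in labels.items() if lab == g]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # Assumption: an empty group keeps its previous center.
            new_centers.append(old)
    return new_centers


def cluster(samples, centers, max_iter=100):
    """Iterate assignment and update until the grouping stops changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(samples, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(samples, labels, centers)
    return labels, centers


table_13 = {
    "A": [1.351023, 0.5033223, -0.60928488, -1.24506042],
    "B": [1.43136473, 0.3542982, -0.55270519, -1.23295773],
    "C": [1.35847229, 0.49136232, -0.60697698, -1.24285762],
    "D": [1.35865503, 0.45288501, -0.51326968, -1.29827036],
    "E": [1.3119249, 0.56225353, -0.61580149, -1.25837695],
}

# Initial centers from Table 14: company A (group 0) and company B (group 1).
labels, final_centers = cluster(table_13, [table_13["A"], table_13["B"]])
print(labels)
```

Starting from these centers, the grouping is already stable after the first update, which is why the example ends after one recomputation of the center points.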

The technical details disclosed in the technical solutions and embodiments of the application merely illustrate the inventive concept of the application and do not limit its technical solutions. All conventional changes, substitutions or combinations of the disclosed technical details share the same inventive concept as the application and fall within the scope of the claims of the application.

Claims

1. A method of classifying companies based on financial document data, the method comprising the steps of:

S1: grouping the original financial voucher detail data of a single company to obtain first sample data on the monthly occurrence amount of each subject and second sample data on the accounting frequency, wherein the column index of the first sample data comprises the three dimensions of time, subject and loan mark, and the column index of the second sample data comprises the three dimensions of time, subject and loan mark;

2. The method according to claim 1, wherein in step S1, the first sample data is a 1×n matrix and the second sample data is a 1×n matrix, where n is the number of column indexes and is a natural number.

3. The method according to claim 1, wherein in step S2, the first total sample data is an m×N matrix and the second total sample data is an m×N matrix, where m is the number of companies, N is the number of column indexes, equal to the number of column indexes in the union of the column indexes of all companies, and m and N are both natural numbers.

4. The method for classifying companies based on financial document data according to claim 3, wherein in step S3, after the first total sample data and the second total sample data are normalized, column merging is performed to obtain an m×2N matrix of initial sample data, where m is the number of companies and 2N is the total number of column indexes.

5. The method for classifying companies based on financial document data according to claim 1, wherein in step S4, the preset classification number is k, and determining the initial center point of clustering specifically includes:

S402: calculating the distance between the sample data of each company remaining in the initial sample data, after the sample data of the first clustering initial center point has been moved out in step S401, and the sample data of the first clustering initial center point; taking the company with the largest distance as the second clustering initial center point; and moving the corresponding sample data of the second clustering initial center point out of the initial sample data;

S403: calculating the distances between the sample data of each company remaining in the total sample data, after the sample data of the second clustering initial center point has been moved out in step S402, and the sample data of the first and second clustering initial center points; taking the company with the largest distance as the third clustering initial center point; and moving the corresponding sample data out of the initial sample data;

6. The method for classifying companies based on financial document data according to claim 4, wherein in step S4, the preset classification number is k, and determining the initial center point of clustering specifically includes:

S401: calculating the sum of squares $P_i$ of the data corresponding to all column indexes of each company in the m×N matrix of the first total sample data, taking the company with the largest $P_i$ as the first clustering initial center point, and moving its 1×2N sample data matrix out of the initial sample data, whereby the initial sample data becomes an (m-1)×2N matrix; $P_i$ is calculated as follows:

$$P_i = \sum_{l=1}^{N} (x_i y_l)^2$$

wherein $x_i y_l$ denotes the data in the i-th row and l-th column of the m×N matrix of the first total sample data, l is a natural number from 1 to N, and i is a natural number from 1 to m;

S402: calculating the distance between the sample data of each company in the (m-1)×2N matrix of initial sample data and the 1×2N sample data matrix of the first clustering initial center point, taking the company with the largest distance as the second clustering initial center point, and moving its 1×2N sample data matrix out of the initial sample data, whereby the initial sample data becomes an (m-2)×2N matrix;

wherein the distance $L_{1-i}$ is calculated as follows:

$$L_{1-i} = \sum_{l=1}^{2N} (x_1 y_l - x_i y_l)^2$$

wherein $x_1 y_l$ denotes the data in the l-th column of the 1×2N matrix of the first clustering initial center point sample data; $x_i y_l$ denotes the data in the i-th row and l-th column of the (m-1)×2N matrix of initial sample data; l is a natural number from 1 to 2N, and i is a natural number from 1 to m-1; $L_{1-i}$ is the distance between the first clustering initial center point sample data and the i-th data sample in the initial sample data;

S403: calculating the distances between the sample data of each company in the (m-2)×2N matrix of initial sample data and the sample data of the first and second clustering initial center points, taking the company with the largest distance as the third clustering initial center point, and moving the corresponding sample data out of the initial sample data, whereby the initial sample data becomes an (m-3)×2N matrix;

wherein the distance $L_{j-i}$ between two sample data is calculated according to the following formula:

$$L_{j-i} = \sum_{l=1}^{2N} (x_j y_l - x_i y_l)^2$$

wherein $x_j y_l$ denotes the data in the l-th column of the 1×2N matrix of sample data j, and $x_i y_l$ denotes the data in the l-th column of the 1×2N matrix of sample data i; l is a natural number from 1 to 2N, i is a natural number from 1 to m-2, and j is a natural number from 1 to m-2; $L_{j-i}$ is the distance between sample data j and sample data i;

S404: and so on, until k clustering initial center points, equal in number to the classification number k, are obtained, forming a k×2N matrix of clustering initial center points, where k is a natural number smaller than m.

7. The method for classifying companies based on financial document data according to claim 1, wherein said step S5 comprises: calculating the distance between each sample data in the initial sample data and each sample data in the clustering initial center points, classifying the company represented by each sample data into the class of the clustering initial center point closest to it, and finally dividing all companies into the preset number of classes.

8. The method for classifying companies based on financial document data according to claim 6, wherein said step S5 comprises:

S501: calculating the distance between each sample data in the m×2N matrix of initial sample data and each sample data in the k×2N matrix of clustering initial center points, and classifying the company corresponding to each sample data into the class of the clustering initial center point nearest to it, so that all companies are divided into k classes, wherein the sample data of each class of companies form a c×2N matrix and c denotes the number of companies in that class;

S502: calculating the average of each column of data in each c×2N matrix, the averages forming a 1×2N matrix;

S503: forming a new k×2N matrix of clustering initial center points from the 1×2N matrices formed in step S502; if the new k×2N matrix of clustering initial center points obtained in step S503 is the same as the k×2N matrix of clustering initial center points in step S501, the clustering algorithm ends; otherwise, step S504 is entered;

S504: calculating the distance between each sample data in the m×2N matrix of initial sample data and each sample data in the new k×2N matrix of clustering initial center points, and classifying the company corresponding to each sample data into the class of the new clustering initial center point nearest to it, so that all companies are divided into k classes, wherein the sample data of each class of companies form a c×2N matrix and c denotes the number of companies in that class;

9. A computer-readable medium containing computer-executable instructions, characterized in that the computer-executable instructions, when processed by a data processing device, perform the company classification method based on financial document data as claimed in any one of claims 1 to 8.