CN113643116B - Company classification method based on financial evidence data and computer readable medium


Publication number
CN113643116B
Authority
CN
China
Prior art keywords
sample data
initial
companies
data
center point
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110969456.9A
Other languages
Chinese (zh)
Other versions
CN113643116A (en)
Inventor
戴悦
王耀左
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cosco Shipping Technology Beijing Co Ltd
Original Assignee
Cosco Shipping Technology Beijing Co Ltd
Application filed by Cosco Shipping Technology Beijing Co Ltd
Priority to CN202110969456.9A
Publication of CN113643116A
Priority to ZA2021/09633A (ZA202109633B)
Application granted
Publication of CN113643116B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12 Accounting
    • G06Q40/125 Finance or payroll
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A company classification method based on financial voucher data comprises the following steps. S1: grouping the original financial voucher detail data of a single company to obtain first sample data on the monthly occurrence amount of each subject and second sample data on the accounting frequency, wherein the column index of the first sample data and the column index of the second sample data each comprise three dimensions: time, subject, and debit and credit mark. S2: summarizing the first sample data of all companies to obtain first total sample data, and summarizing the second sample data of all companies to obtain second total sample data. S3: combining the first total sample data and the second total sample data of all companies to obtain initial sample data of all companies. S4: determining, according to a preset classification number, a number of initial clustering center points equal to the classification number. S5: classifying according to the classification number, the initial clustering center points and the initial sample data, and determining the cluster label of each class and the companies contained in each class.

Description

Company classification method based on financial evidence data and computer readable medium
Technical Field
The application belongs to the technical field of financial voucher data processing, and particularly relates to a company classification method based on financial voucher data and a computer readable medium.
Background
In the prior art, companies are generally classified by company scale or business scope. This approach is subjective, and the resulting classifications have little practical reference value.
Disclosure of Invention
In view of this, in one aspect, some embodiments disclose a company classification method based on financial voucher data. Specifically, the method includes the following steps:
S1: grouping the original financial voucher detail data of a single company to obtain first sample data on the monthly occurrence amount of each subject and second sample data on the accounting frequency, wherein the column index of the first sample data comprises three dimensions (time, subject, and debit and credit mark), and the column index of the second sample data comprises the same three dimensions;
S2: summarizing the first sample data of all companies to obtain first total sample data, and summarizing the second sample data of all companies to obtain second total sample data;
S3: combining the first total sample data and the second total sample data of all companies to obtain initial sample data of all companies;
S4: determining, according to a preset classification number, a number of initial clustering center points equal to the classification number;
S5: classifying according to the classification number, the initial clustering center points and the initial sample data, and determining the cluster label of each class and the companies contained in each class.
Further, in some embodiments of the company classification method based on financial voucher data, in step S1 the first sample data is a (1×n) matrix and the second sample data is a (1×n) matrix, where n is the number of column indexes.
In step S2, the first total sample data is an (m×N) matrix and the second total sample data is an (m×N) matrix, where m is the number of companies, N is the number of column indexes in the union of the column indexes of all companies, and m and N are natural numbers.
In step S3, the first total sample data and the second total sample data are standardized and then merged column-wise to obtain the initial sample data as an (m×2N) matrix, where m is the number of companies and 2N is the total number of column indexes.
In step S4, the preset classification number is k, and determining the initial clustering center points specifically includes:
S401: calculating, for each company, the sum of squares of the data corresponding to all column indexes in the first total sample data, taking the company with the largest sum of squares as the first initial clustering center point, and removing the sample data corresponding to the first initial clustering center point from the initial sample data;
S402: calculating the distance between the sample data of each company remaining in the initial sample data after step S401 and the sample data of the first initial clustering center point, taking the company with the largest distance as the second initial clustering center point, and removing its sample data from the initial sample data;
S403: calculating the distance between the sample data of each company remaining in the initial sample data after step S402 and the sample data of the first and second initial clustering center points, taking the company with the largest distance as the third initial clustering center point, and removing its sample data from the initial sample data;
S404: and so on, until k initial clustering center points, equal in number to the classification number k, are obtained and form the set of initial clustering center points.
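Steps S401 to S404 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the Euclidean metric, the reading of "largest distance" as the largest minimum distance to the centers already chosen, and the names `euclidean` and `initial_centers` are all assumptions, and the sample values are invented.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length rows (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def initial_centers(amounts, samples, k):
    """Return the row indices of k initial clustering center points.

    amounts: m rows of N values (first total sample data).
    samples: m rows of 2N values (initial sample data).
    """
    remaining = list(range(len(samples)))
    # S401: the company with the largest sum of squares is the first center.
    first = max(remaining, key=lambda i: sum(v * v for v in amounts[i]))
    chosen = [first]
    remaining.remove(first)
    # S402-S404: repeatedly take the remaining company farthest from the
    # centers chosen so far (assumption: largest minimum distance).
    while len(chosen) < k:
        farthest = max(
            remaining,
            key=lambda i: min(euclidean(samples[i], samples[c]) for c in chosen),
        )
        chosen.append(farthest)
        remaining.remove(farthest)
    return chosen

# Invented data: 4 companies, N = 2 column indexes.
amounts = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0], [4.0, 5.0]]
samples = [[1.0, 0.0, 0.1, 0.0], [5.0, 5.0, 0.9, 1.0],
           [0.0, 1.0, 0.0, 0.2], [4.0, 5.0, 1.0, 0.8]]
centers = initial_centers(amounts, samples, 2)
print(centers)
```

Because each chosen company is removed from the candidate list, no company can be selected twice, which matches the "moving out" of sample data described in the steps.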
More specifically, in step S4, the preset classification number is k, and determining the initial clustering center points specifically includes:
S401: calculating, for each company, the sum of squares P_i of the data corresponding to all column indexes in the first total sample data (m×N) matrix, taking the company with the largest P_i as the first initial clustering center point, and removing its sample data, a (1×2N) matrix, from the initial sample data, which thereby becomes an ((m-1)×2N) matrix; wherein P_i is given by formula (1):
$$P_i = \sum_{l=1}^{N} \left(x_i y_l\right)^2 \qquad (1)$$
in formula (1), x_i y_l represents the data in the i-th row and l-th column of the first total sample data (m×N) matrix, l is a natural number from 1 to N, i is a natural number from 1 to m, m is the number of companies, and N is the number of column indexes;
S402: calculating the distance between the sample data of each company in the initial sample data ((m-1)×2N) matrix and the (1×2N) matrix of sample data of the first initial clustering center point, taking the company with the largest distance as the second initial clustering center point, and removing its sample data, a (1×2N) matrix, from the initial sample data, which thereby becomes an ((m-2)×2N) matrix;
wherein the distance L_{1-i} is calculated according to formula (2);
in formula (2), x_1 y_l represents the data in the l-th column of the (1×2N) matrix of the first initial clustering center point; x_i y_l represents the data in the i-th row and l-th column of the initial sample data ((m-1)×2N) matrix; l is a natural number from 1 to 2N, and i is a natural number from 1 to m-1; L_{1-i} is the distance between the sample data of the first initial clustering center point and the i-th data sample in the initial sample data;
S403: calculating the distance between the sample data of each company in the initial sample data ((m-2)×2N) matrix remaining after step S402 and the sample data of the first and second initial clustering center points, taking the company with the largest distance as the third initial clustering center point, and removing its sample data from the initial sample data, which thereby becomes an ((m-3)×2N) matrix;
wherein the distance L_{j-i} between two sample data is calculated according to formula (3);
in formula (3), x_j y_l represents the data in the l-th column of the (1×2N) matrix of sample data j, and x_i y_l represents the data in the l-th column of the (1×2N) matrix of sample data i; l is a natural number from 1 to 2N, i is a natural number from 1 to m-2, and j is a natural number from 1 to m-2; L_{j-i} is the distance between sample data j and sample data i;
S404: and so on, until k initial clustering center points, equal in number to the classification number k, are obtained and form the initial clustering center point set as a (k×2N) matrix, where k is a natural number smaller than m.
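The text refers to distance formulas (2) and (3) without reproducing them here. Assuming the distance is the ordinary Euclidean distance, which the source does not state, they could be written as follows, with x_i y_l again denoting the entry in row i and column l. Formula (2) is then the special case j = 1 of formula (3).

```latex
% Assumed Euclidean form of formula (2): distance from the first center to sample i
L_{1-i} = \sqrt{\sum_{l=1}^{2N} \left(x_1 y_l - x_i y_l\right)^2}

% Assumed Euclidean form of formula (3): distance between sample j and sample i
L_{j-i} = \sqrt{\sum_{l=1}^{2N} \left(x_j y_l - x_i y_l\right)^2}
```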
In some embodiments of the company classification method based on financial voucher data, step S5 specifically includes: calculating the distance between each sample in the initial sample data and each initial clustering center point, assigning the company represented by each sample to the class of the nearest initial clustering center point, and thereby dividing all companies into the preset number of classes.
In some embodiments of the company classification method based on financial voucher data, step S5 specifically includes:
S501: calculating the distance between each sample in the initial sample data (m×2N) matrix and each sample in the initial clustering center point (k×2N) matrix, and assigning the company corresponding to each sample to the class of the nearest initial clustering center point, so that all companies are divided into k classes, wherein the sample data of each class form a (c×2N) matrix, and c is the number of companies in that class;
S502: calculating the mean of each column of data in each (c×2N) matrix, the means forming a (1×2N) matrix;
S503: the (1×2N) matrices formed in step S502 form a new clustering center point set as a (k×2N) matrix; if this new (k×2N) matrix is the same as the clustering center point (k×2N) matrix used in step S501, the clustering algorithm ends; otherwise, step S504 is entered;
S504: calculating the distance between each sample in the initial sample data (m×2N) matrix and each sample in the new clustering center point (k×2N) matrix, and assigning the company corresponding to each sample to the class of the nearest new clustering center point, so that all companies are divided into k classes, wherein the sample data of each class form a (c×2N) matrix, and c is the number of companies in that class;
S505: if the classes obtained in step S504 are the same as those obtained in step S501, with the same companies in each class, the classification ends; otherwise, the next round of clustering is repeated from step S501.
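The loop in steps S501 to S505 resembles the familiar k-means iteration. The sketch below is one possible reading, not the authors' code: the Euclidean metric, the `max_iter` safety cap, keeping the old center when a class becomes empty, and the function names are assumptions, and the data are invented.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length rows (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(samples, centers):
    """S501/S504: label each sample with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: euclidean(row, centers[c]))
            for row in samples]

def column_means(rows):
    """S502: mean of each column of one class's (c x 2N) matrix."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def cluster(samples, centers, max_iter=100):
    """Iterate until the centers (S503) or the class membership (S505) stop changing."""
    labels = assign(samples, centers)
    for _ in range(max_iter):  # max_iter is a safety cap, not part of the source
        new_centers = []
        for c in range(len(centers)):
            members = [row for row, lab in zip(samples, labels) if lab == c]
            # Assumption: an empty class keeps its previous center.
            new_centers.append(column_means(members) if members else centers[c])
        if new_centers == centers:          # S503: centers unchanged
            break
        new_labels = assign(samples, new_centers)  # S504
        centers = new_centers
        if new_labels == labels:            # S505: same companies in every class
            break
        labels = new_labels
    return labels, centers

# Invented data: 4 companies, 2N = 2 columns, k = 2 initial centers.
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = cluster(samples, [[0.0, 0.0], [10.0, 11.0]])
print(labels, centers)
```

The returned `labels` list plays the role of the cluster label of each company, and grouping company codes by label gives the companies contained in each class.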
In another aspect, some embodiments disclose a computer-readable medium containing computer-executable instructions that, when executed by a data processing device, perform the company classification method based on financial voucher data.
The company classification method based on financial voucher data disclosed in the embodiments of the application processes company financial data along several dimensions, such as monthly occurrence amount, accounting frequency and subject content, and clusters the companies on the resulting data samples. The classification result therefore reflects the type characteristics of each company more objectively and has greater practical reference value.
Drawings
FIG. 1 is a schematic flow chart of the company classification method based on financial voucher data.
Detailed Description
The word "embodiment" as used herein does not imply that any embodiment described as "exemplary" is preferred or advantageous over other embodiments. Unless otherwise specified, performance index tests in the examples of the present application were performed using conventional testing methods in the art. It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure.
Unless otherwise defined, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; other test methods and techniques not specifically mentioned in the present application are those commonly used by those skilled in the art.
The terms "substantially" and "about" are used herein to describe small fluctuations. For example, they may refer to less than or equal to ±5%, such as less than or equal to ±2%, such as less than or equal to ±1%, such as less than or equal to ±0.5%, such as less than or equal to ±0.2%, such as less than or equal to ±0.1%, such as less than or equal to ±0.05%. Numerical data presented or represented herein in a range format is used only for convenience and brevity and should therefore be interpreted flexibly to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range. For example, a numerical range of "1 to 5%" should be interpreted to include not only the explicitly recited values of 1% to 5%, but also include individual values and sub-ranges within the indicated range. Thus, individual values, such as 2%, 3.5% and 4%, and subranges, such as 1% to 3%, 2% to 4% and 3% to 5%, etc., are included in this numerical range. The same principle applies to ranges reciting only one numerical value. Moreover, such an interpretation applies regardless of the breadth of the range or the characteristics being described.
In this document, including the claims, conjunctions such as "comprising," "including," "carrying," "having," "containing," "involving," and the like are to be construed as open-ended, i.e., to mean "including, but not limited to." Only the conjunctions "consisting of ..." and "composed of ..." are closed conjunctions.
Numerous specific details are set forth in the following examples in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In the examples, some methods, means, instruments, devices, etc. well known to those skilled in the art are not described in detail in order to highlight the gist of the present application.
Provided there is no conflict, the technical features disclosed in the embodiments of the application can be combined arbitrarily, and the resulting technical solutions fall within the disclosure of the embodiments of the application.
In some embodiments, as shown in FIG. 1, a company classification method based on financial voucher data includes the following steps:
S1: grouping the original financial voucher detail data of a single company to obtain first sample data on the monthly occurrence amount of each subject and second sample data on the accounting frequency, wherein the column index of the first sample data comprises three dimensions (time, subject, and debit and credit mark) and the column index of the second sample data comprises the same three dimensions. In general, the first sample data may be a (1×n) matrix, expressed as (x_11, x_12, ..., x_1i, ..., x_1n), where x_1i is the amount corresponding to column index i, each column index is a data format comprising the time, subject, and debit and credit mark, and n is the total number of column indexes. Likewise, the second sample data may be a (1×n) matrix, expressed as (y_11, y_12, ..., y_1j, ..., y_1n), where y_1j is the frequency corresponding to column index j, each column index is a data format comprising the time, subject, and debit and credit mark, and n is the total number of column indexes. The same grouping is applied to the original financial voucher detail data of each company, yielding the first sample data and second sample data of each company.
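As a rough illustration of the grouping in S1, the Python sketch below sums amounts and counts postings per column index. The input layout (posting date, first-level subject code, mark, amount), the function name `group_vouchers`, and the voucher values are hypothetical; the column index format `YYYYMM_subject_mark` mirrors indexes such as 202106_1002_h that appear in Example 1.

```python
from collections import defaultdict

def group_vouchers(lines):
    """Group voucher lines into per-column amount totals and posting counts.

    Each line is (posting_date 'YYYY-MM-DD', first_level_subject, mark, amount).
    Amounts are assumed to be supplied as absolute values.
    """
    amounts = defaultdict(float)
    counts = defaultdict(int)
    for posting_date, subject, mark, amount in lines:
        column = f"{posting_date[:4]}{posting_date[5:7]}_{subject}_{mark}"
        amounts[column] += amount   # first sample data: monthly occurrence amount
        counts[column] += 1         # second sample data: accounting frequency
    return dict(amounts), dict(counts)

# Hypothetical voucher lines for one company (illustrative values only).
lines = [
    ("2021-06-03", "1002", "h", 200000.0),
    ("2021-06-03", "2121", "s", 120000.0),
    ("2021-06-03", "2121", "s", 80000.0),
]
first_sample, second_sample = group_vouchers(lines)
print(first_sample)
print(second_sample)
```

Each returned dictionary corresponds to one (1×n) row, with the dictionary keys serving as the column indexes.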
S2: summarizing the first sample data of all companies to obtain first total sample data, and summarizing the second sample data of all companies to obtain second total sample data. In general, the accounting details used by different companies differ somewhat. When financial voucher data are summarized at the level of first-level subjects, most first-level subjects coincide across companies, but a small number always differ, so the column indexes differ between companies. The column indexes of all companies can therefore be merged; the resulting union of column indexes covers the first-level subjects of every company, so that all of them are taken into account during classification, which improves the objectivity and practicality of the classification results. During merging, any column index that a company lacks is generally filled with zero as a missing value, so that every company has the same total number of column indexes. Summarizing the first sample data of all companies yields the first total sample data, an (m×N) matrix of monthly occurrence amounts, as represented by formula (4):
$$\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1N} \\ x_{21} & x_{22} & \cdots & x_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mN} \end{pmatrix} \qquad (4)$$
wherein x represents the monthly occurrence amount of a subject, m is the number of companies, N is the number of column indexes in the union of the column indexes of all companies, and m and N are natural numbers;
in the same way, the second total sample data, an (m×N) matrix of frequencies, is obtained, as represented by formula (5):
$$\begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1N} \\ y_{21} & y_{22} & \cdots & y_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ y_{m1} & y_{m2} & \cdots & y_{mN} \end{pmatrix} \qquad (5)$$
wherein y represents the frequency, m is the number of companies, N is the number of column indexes in the union of the column indexes of all companies, and m and N are natural numbers;
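The union of column indexes with zero filling can be sketched as follows. The dictionary-per-company input, the sorted ordering of companies and columns, the name `build_total_matrix`, and the figures for company B are assumptions made for illustration.

```python
def build_total_matrix(per_company):
    """Stack per-company sample data into an m x N matrix.

    per_company: {company_code: {column_index: value}}.
    Returns (companies, columns, matrix); missing values are filled with 0.0.
    """
    companies = sorted(per_company)
    # N is the size of the union of all companies' column indexes.
    columns = sorted(set().union(*(d.keys() for d in per_company.values())))
    matrix = [[per_company[c].get(col, 0.0) for col in columns]
              for c in companies]
    return companies, columns, matrix

# Hypothetical first sample data for two companies.
per_company = {
    "A": {"202106_1002_h": 200000.0, "202106_2121_s": 200000.0},
    "B": {"202106_1002_h": 50000.0, "202106_5502_s": 50000.0},
}
companies, columns, matrix = build_total_matrix(per_company)
print(columns)
print(matrix)
```

Applying the same function to the frequency dictionaries would give the second total sample data with an identical column order, provided the same union of column indexes is used for both.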
S3: combining the first total sample data and the second total sample data of all companies to obtain initial sample data of all companies; for example, the first total sample data represented by matrix (4) is combined with the second total sample data represented by matrix (5) to obtain the initial sample data as an (m×2N) matrix, as represented by formula (6):
$$\begin{pmatrix} x_{11} & \cdots & x_{1N} & y_{11} & \cdots & y_{1N} \\ x_{21} & \cdots & x_{2N} & y_{21} & \cdots & y_{2N} \\ \vdots & & \vdots & \vdots & & \vdots \\ x_{m1} & \cdots & x_{mN} & y_{m1} & \cdots & y_{mN} \end{pmatrix} \qquad (6)$$
In general, the first sample data and the second sample data differ in magnitude, and because business volume and company size vary greatly between companies, values within the first sample data, or within the second sample data, also often differ in magnitude. So that all data in all samples have a reasonable influence on the classification, the first total sample data and the second total sample data can be standardized and normalized, which largely eliminates the differences in magnitude. The standardized and normalized first and second total sample data are then merged column-wise, so that the resulting initial sample data (m×2N) matrix is on the same order of magnitude throughout. Company size, time, subject, debit and credit mark, occurrence amount, occurrence frequency and the like then each have a reasonable influence on the classification, making the classification result of more practical guiding significance.
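The source does not say which standardization is used. The sketch below assumes a per-column z-score (subtract the column mean, divide by the population standard deviation) followed by a column-wise merge; the function names and the numbers are illustrative only.

```python
import math

def standardize_columns(matrix):
    """Z-score each column; a constant column becomes all zeros."""
    scaled_cols = []
    for col in zip(*matrix):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        scaled_cols.append([(v - mean) / std if std else 0.0 for v in col])
    # Transpose back to rows.
    return [list(row) for row in zip(*scaled_cols)]

def merge_columns(x, y):
    """Place the N columns of y to the right of the N columns of x."""
    return [rx + ry for rx, ry in zip(x, y)]

# Invented totals for m = 2 companies and N = 2 column indexes.
amounts = [[100.0, 0.0], [300.0, 0.0]]   # first total sample data
counts = [[1.0, 2.0], [3.0, 2.0]]        # second total sample data
initial = merge_columns(standardize_columns(amounts),
                        standardize_columns(counts))
print(initial)
```

After scaling, the amount columns and the frequency columns sit on comparable scales, so neither block dominates a distance computed over the merged (m×2N) rows.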
S4: determining, according to a preset classification number, a number of initial clustering center points equal to the classification number; specifically, if the classification number is set to k, determining the initial clustering center points includes:
S401: calculating, for each company, the sum of squares of the data corresponding to all column indexes in the first total sample data, taking the company with the largest sum of squares as the first initial clustering center point, and removing the sample data corresponding to the first initial clustering center point from the initial sample data;
S402: calculating the distance between the sample data of each company remaining in the initial sample data after step S401 and the sample data of the first initial clustering center point, taking the company with the largest distance as the second initial clustering center point, and removing its sample data from the initial sample data;
S403: calculating the distance between the sample data of each company remaining in the initial sample data after step S402 and the sample data of the first and second initial clustering center points, taking the company with the largest distance as the third initial clustering center point, and removing its sample data from the initial sample data;
S404: and so on, until k initial clustering center points, equal in number to the classification number k, are obtained and form the set of initial clustering center points.
As an optional embodiment, in the company classification method based on financial voucher data, determining the initial clustering center points specifically includes:
S401: calculating, for each company, the sum of squares P_i of the data corresponding to all column indexes in the first total sample data (m×N) matrix, taking the company with the largest P_i as the first initial clustering center point, and removing its sample data, a (1×2N) matrix, from the initial sample data, which thereby becomes an ((m-1)×2N) matrix; wherein P_i is calculated according to formula (1) disclosed herein;
S402: calculating the distance between the sample data of each company in the initial sample data ((m-1)×2N) matrix and the (1×2N) matrix of sample data of the first initial clustering center point, taking the company with the largest distance as the second initial clustering center point, and removing its sample data, a (1×2N) matrix, from the initial sample data, which thereby becomes an ((m-2)×2N) matrix; wherein the distance L_{1-i} is calculated according to formula (2) disclosed herein;
S403: calculating the distance between the sample data of each company in the initial sample data ((m-2)×2N) matrix remaining after step S402 and the sample data of the first and second initial clustering center points, taking the company with the largest distance as the third initial clustering center point, and removing its sample data from the initial sample data, which thereby becomes an ((m-3)×2N) matrix; wherein the distance L_{j-i} between two sample data is calculated according to formula (3) disclosed herein;
S404: and so on, until k initial clustering center points, equal in number to the classification number k, are obtained and form the initial clustering center point set as a (k×2N) matrix, where k is a natural number smaller than m.
S5: classifying according to the classification number, the initial clustering center points and the initial sample data, and determining the cluster label of each class and the companies contained in each class. In general, the distance between each sample in the initial sample data and each initial clustering center point may be calculated, the company represented by each sample is assigned to the class of the nearest initial clustering center point, and all companies are thereby divided into the preset number of classes. As an alternative embodiment, step S5 may specifically include:
S501: calculating the distance between each sample in the initial sample data (m×2N) matrix and each sample in the initial clustering center point (k×2N) matrix, and assigning the company corresponding to each sample to the class of the nearest initial clustering center point, so that all companies are divided into k classes, wherein the sample data of each class form a (c×2N) matrix, and c is the number of companies in that class;
S502: calculating the mean of each column of data in each (c×2N) matrix, the means forming a (1×2N) matrix;
S503: the (1×2N) matrices formed in step S502 form a new clustering center point set as a (k×2N) matrix; if this new (k×2N) matrix is the same as the clustering center point (k×2N) matrix used in step S501, the clustering algorithm ends; otherwise, step S504 is entered;
S504: calculating the distance between each sample in the initial sample data (m×2N) matrix and each sample in the new clustering center point (k×2N) matrix, and assigning the company corresponding to each sample to the class of the nearest new clustering center point, so that all companies are divided into k classes, wherein the sample data of each class form a (c×2N) matrix, and c is the number of companies in that class;
S505: if the classes obtained in step S504 are the same as those obtained in step S501, with the same companies in each class, the classification ends; otherwise, the next round of clustering is repeated from step S501.
In another aspect, some embodiments disclose a computer-readable medium containing computer-executable instructions that, when executed by a data processing device, perform the company classification method based on financial voucher data. Generally, computer program instructions or code for performing the operations of some embodiments of the present disclosure can be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++ and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Further exemplary details are described below in connection with the embodiments.
Example 1
Example 1 illustrates the process of obtaining the initial sample data.
Single voucher data of a single company
Table 1 List of original voucher details of company A
Table 1 shows the list of original voucher details of company A. Both the debit side and the credit side are displayed; when the amounts are summed by debit/credit indicator and the absolute value of each sum is taken, the two sides balance. For example, voucher line items 2 and 3 are debit entries whose amounts sum to 200000, and voucher line item 1 is a credit entry whose amount sums to 200000; after taking absolute values, the two are equal. The accounting date is the date of the financial transaction record, the general ledger subject is the detail subject code, and the currency amount is the occurrence amount of the detail line.
All financial data of all companies
Company A and company B are described below as examples. Table 2 shows all the financial data of company A and company B.
Table 2 Financial data of company A and company B
First sample data of company A
With the year and month of the accounting date, the first-level subject code, and the debit/credit indicator as the column index, the first sample data of company A regarding the monthly occurrence amount of each subject is determined as listed in Table 3, wherein 202106_1002_H and 202106_2121_S are the two column indexes, A is the company code, and the first sample data forms a (1 × 2) order matrix:
Table 3 First sample data of company A
Second sample data of company A
With the year and month of the accounting date, the first-level subject code, and the debit/credit indicator as the column index, the second sample data of company A regarding billing frequency is determined as listed in Table 4, wherein 202106_1002_H and 202106_2121_S are the two column indexes, A is the company code, and the second sample data forms a (1 × 2) order matrix:
Table 4 Second sample data of company A
Company code | 202106_1002_H | 202106_2121_S
A | 1 | 2
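The grouping that produces the first and second sample data can be sketched as follows. Table 1 is not reproduced in this text, so the voucher lines below are hypothetical: the dates, the detail subject codes, the sign convention for credit amounts, the split of the debit total into 150000 and 50000, and the rule that the first four digits of a detail subject code give the first-level subject code are all assumptions chosen only to match the totals described above (200000 on each side, one credit line, two debit lines).

```python
from collections import defaultdict

# Hypothetical voucher lines for company A.
# Each tuple: (accounting date, detail subject code, debit/credit flag, amount).
voucher_lines = [
    ("2021-06-15", "100201", "H", -200000.0),  # line item 1 (credit side)
    ("2021-06-15", "212101", "S", 150000.0),   # line item 2 (debit side)
    ("2021-06-15", "212102", "S", 50000.0),    # line item 3 (debit side)
]


def build_sample_data(lines):
    amount = defaultdict(float)  # first sample data: monthly occurrence amount
    count = defaultdict(int)     # second sample data: billing frequency
    for date, subject, flag, value in lines:
        # Column index: year-month, first-level subject code, debit/credit flag.
        key = f"{date[:4]}{date[5:7]}_{subject[:4]}_{flag}"
        amount[key] += value
        count[key] += 1
    # The absolute value is taken after summing, as described for Table 1.
    return {k: abs(v) for k, v in amount.items()}, dict(count)


first_sample, second_sample = build_sample_data(voucher_lines)
```

With these assumed inputs the counts agree with Table 4, and the amounts agree with the company A row of Table 7.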
First sample data of company B
With the year and month of the accounting date, the first-level subject code, and the debit/credit indicator as the column index, the first sample data of company B regarding the monthly occurrence amount of each subject is determined as listed in Table 5, wherein 202107_1003_H and 202107_2121_S are the two column indexes, B is the company code, and the first sample data forms a (1 × 2) order matrix:
Table 5 First sample data of company B
Company code | 202107_1003_H | 202107_2121_S
B | 2000 | 2000
Second sample data of company B
With the year and month of the accounting date, the first-level subject code, and the debit/credit indicator as the column index, the second sample data of company B regarding billing frequency is determined as listed in Table 6, wherein 202107_1003_H and 202107_2121_S are the two column indexes, B is the company code, and the second sample data forms a (1 × 2) order matrix:
Table 6 Second sample data of company B
Company code | 202107_1003_H | 202107_2121_S
B | 1 | 1
First total sample data of all companies
The first sample data of company A and the first sample data of company B are combined to obtain the first total sample data of company A and company B, as listed in Table 7, wherein 202106_1002_H, 202106_2121_S, 202107_1003_H, and 202107_2121_S are the four column indexes, A and B are the company codes, and the first total sample data forms a (2 × 4) order matrix:
Table 7 First total sample data of all companies
Company code | 202106_1002_H | 202106_2121_S | 202107_1003_H | 202107_2121_S
A | 200000 | 200000 | 0 | 0
B | 0 | 0 | 2000 | 2000
Second total sample data of all companies
The second sample data of company A and the second sample data of company B are combined to obtain the second total sample data of company A and company B, as listed in Table 8, wherein 202106_1002_H, 202106_2121_S, 202107_1003_H, and 202107_2121_S are the four column indexes, A and B are the company codes, and the second total sample data forms a (2 × 4) order matrix:
Table 8 Second total sample data of all companies
Company code | 202106_1002_H | 202106_2121_S | 202107_1003_H | 202107_2121_S
A | 1 | 2 | 0 | 0
B | 0 | 0 | 1 | 1
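The combination step can be sketched as a union of column indexes with missing entries filled with 0, which is what Tables 7 and 8 show. The helper name `combine` and the first-seen column ordering are assumptions made for illustration.

```python
def combine(per_company):
    # Build the union of column indexes, keeping first-seen order.
    columns = []
    for row in per_company.values():
        for col in row:
            if col not in columns:
                columns.append(col)
    # A company with no data for a column index gets 0 in that column.
    matrix = {code: [row.get(col, 0) for col in columns]
              for code, row in per_company.items()}
    return columns, matrix


# Per-company first sample data (company A values as in Table 7, company B as in Table 5).
first_by_company = {
    "A": {"202106_1002_H": 200000, "202106_2121_S": 200000},
    "B": {"202107_1003_H": 2000, "202107_2121_S": 2000},
}
total_columns, first_total = combine(first_by_company)
```

Running the same helper on the frequency data of Tables 4 and 6 yields the rows of Table 8.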
First total sample data normalization
The first total sample data of all companies listed in Table 7 is normalized to obtain the normalized first total sample data listed in Table 9, wherein 202106_1002_H, 202106_2121_S, 202107_1003_H, and 202107_2121_S are the four column indexes, A and B are the company codes, and the normalized first total sample data forms a (2 × 4) order matrix:
Table 9 Normalized first total sample data
Second total sample data normalization
The second total sample data of all companies listed in Table 8 is normalized to obtain the normalized second total sample data listed in Table 10, wherein 202106_1002_H, 202106_2121_S, 202107_1003_H, and 202107_2121_S are the four column indexes, A and B are the company codes, and the normalized second total sample data forms a (2 × 4) order matrix:
Table 10 Normalized second total sample data
Company code | 202106_1002_H | 202106_2121_S | 202107_1003_H | 202107_2121_S
A | 0.301511345 | 1.507556723 | -0.904534034 | -0.904534034
B | -1 | -1 | 1 | 1
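This section does not state the normalization formula. The values in Table 10 are, however, consistent with standardizing each company's row to zero mean and unit population standard deviation, and the sketch below works under that assumption. The function name `normalize_row` is hypothetical.

```python
import math


def normalize_row(row):
    # Row-wise standardization: subtract the row mean, divide by the
    # population standard deviation of the row (divisor = number of columns).
    # A row whose values are all equal would give a zero divisor; that case
    # is not handled in this sketch.
    mean = sum(row) / len(row)
    std = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
    return [(v - mean) / std for v in row]


# Second total sample data from Table 8.
second_total = {"A": [1, 2, 0, 0], "B": [0, 0, 1, 1]}
second_normalized = {code: normalize_row(row) for code, row in second_total.items()}
```

Applied to Table 8, this reproduces the rows of Table 10 to the printed precision.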
Initial sample data
The normalized first total sample data (Table 9) and the normalized second total sample data (Table 10) are combined to obtain the initial sample data listed in Table 11, wherein 202106_1002_H_value, 202106_2121_S_value, 202107_1003_H_value, 202107_2121_S_value, 202106_1002_H_count, 202106_2121_S_count, 202107_1003_H_count, and 202107_2121_S_count are the eight column indexes, A and B are the company codes, and the initial sample data forms a (2 × 8) order matrix. Column indexes representing amounts carry the suffix _value and column indexes representing frequencies carry the suffix _count, so that the amount and frequency columns have distinct names and no column index is repeated:
Table 11 Initial sample data
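The column merge with suffixes can be sketched as follows. Tables 9 and 11 are not reproduced in this text, so the amount block below is a placeholder, computed here by standardizing each row of Table 7 to zero mean and unit population standard deviation (an assumption); the frequency block is taken from Table 10. The helper name `merge_columns` is hypothetical.

```python
base_columns = ["202106_1002_H", "202106_2121_S", "202107_1003_H", "202107_2121_S"]

# Placeholder amount block (assumed, see lead-in); not copied from Table 9.
first_norm = {"A": [1.0, 1.0, -1.0, -1.0], "B": [-1.0, -1.0, 1.0, 1.0]}
# Frequency block as listed in Table 10.
second_norm = {
    "A": [0.301511345, 1.507556723, -0.904534034, -0.904534034],
    "B": [-1.0, -1.0, 1.0, 1.0],
}


def merge_columns(columns, amounts, counts):
    # Suffix the two blocks so every column index stays unique.
    header = [c + "_value" for c in columns] + [c + "_count" for c in columns]
    rows = {code: amounts[code] + counts[code] for code in amounts}
    return header, rows


initial_header, initial_sample = merge_columns(base_columns, first_norm, second_norm)
```

The result has one row per company and 2N = 8 columns, matching the (2 × 8) shape stated above.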
Example 2
Classification of companies based on financial voucher data
In Example 2, five companies A, B, C, D, E are classified using four column indexes, denoted x1, x2, x3, x4. The initial sample data of the five companies is listed in Table 12:
Table 12 Initial sample data
Company code | x1 | x2 | x3 | x4
A | 5.1 | 3.5 | 1.4 | 0.2
B | 4.9 | 3.0 | 1.4 | 0.2
C | 4.7 | 3.2 | 1.3 | 0.2
D | 4.6 | 3.1 | 1.5 | 0.2
E | 5.0 | 3.6 | 1.4 | 0.2
The initial sample data listed in Table 12 is normalized to obtain the normalized initial sample data listed in Table 13.
Table 13 Normalized initial sample data
Company code | x1 | x2 | x3 | x4
A | 1.351023 | 0.5033223 | -0.60928488 | -1.24506042
B | 1.43136473 | 0.3542982 | -0.55270519 | -1.23295773
C | 1.35847229 | 0.49136232 | -0.60697698 | -1.24285762
D | 1.35865503 | 0.45288501 | -0.51326968 | -1.29827036
E | 1.3119249 | 0.56225353 | -0.61580149 | -1.25837695
For the data matrix of companies A, B, C, D, E, the sum of squares of the data corresponding to the column indexes x1, x2, x3, x4 is calculated for each company according to formula (1) disclosed herein, and company A is obtained as the first clustering initial center point. The distance between the data sample of each of companies B, C, D, E and the data sample of company A is then calculated according to formula (2), and company B, which has the largest distance, is determined as the second clustering initial center point. The data samples of the clustering initial center points A and B form the clustering initial center point set sample data listed in Table 14.
Table 14 Clustering initial center point set sample data
Group | Company code | x1 | x2 | x3 | x4
0 | A | 1.351023 | 0.5033223 | -0.60928488 | -1.24506042
1 | B | 1.43136473 | 0.3542982 | -0.55270519 | -1.23295773
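The selection of the two initial center points can be sketched as follows, under several assumptions. The sum of squares of formula (1) is evaluated on the Table 12 values, because every standardized row in Table 13 has the same sum of squares and could not single out company A. The distance of formula (2) is taken to be the sum of squared differences on the Table 13 values. For more than two center points the sketch sums the distances to the seeds already chosen, which the text does not specify. The name `pick_seeds` is hypothetical.

```python
raw = {  # Table 12
    "A": [5.1, 3.5, 1.4, 0.2],
    "B": [4.9, 3.0, 1.4, 0.2],
    "C": [4.7, 3.2, 1.3, 0.2],
    "D": [4.6, 3.1, 1.5, 0.2],
    "E": [5.0, 3.6, 1.4, 0.2],
}
norm = {  # Table 13
    "A": [1.351023, 0.5033223, -0.60928488, -1.24506042],
    "B": [1.43136473, 0.3542982, -0.55270519, -1.23295773],
    "C": [1.35847229, 0.49136232, -0.60697698, -1.24285762],
    "D": [1.35865503, 0.45288501, -0.51326968, -1.29827036],
    "E": [1.3119249, 0.56225353, -0.61580149, -1.25837695],
}


def sq_dist(a, b):
    # Assumed distance: sum of squared differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_seeds(raw, norm, k):
    # First seed: company with the largest sum of squares.
    seeds = [max(raw, key=lambda c: sum(v * v for v in raw[c]))]
    while len(seeds) < k:
        rest = [c for c in norm if c not in seeds]
        # Next seed: company farthest from the seeds chosen so far
        # (summed distance is an assumption for k > 2).
        seeds.append(max(rest, key=lambda c: sum(sq_dist(norm[c], norm[s]) for s in seeds)))
    return seeds


seeds = pick_seeds(raw, norm, 2)
```

Under these assumptions the sketch selects A and then B, the same center points as Table 14.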
The sample data of company A is designated group 0 and the sample data of company B is designated group 1. The distances between every company in Table 13 and the data samples of group 0 and group 1 in Table 14 are calculated according to formula (3) disclosed herein, and the results are listed in Table 15.
Table 15 Distances between all companies and group 0 and group 1
In Table 15, the first column gives the distance from group 0: companies C, D, and E are closer to A, so C, D, and E are classified into group 0 together with A. The second column gives the distance from group 1: A, C, D, and E are all farther from B, so B alone is classified into group 1.
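Table 15 is not reproduced in this text. The assignment it describes can be sketched as below, assuming formula (3) is the sum of squared differences; that assumption reproduces the group 1 distances in Table 18, where the group 1 center point is still the sample of company B.

```python
# Normalized samples from Table 13.
samples = {
    "A": [1.351023, 0.5033223, -0.60928488, -1.24506042],
    "B": [1.43136473, 0.3542982, -0.55270519, -1.23295773],
    "C": [1.35847229, 0.49136232, -0.60697698, -1.24285762],
    "D": [1.35865503, 0.45288501, -0.51326968, -1.29827036],
    "E": [1.3119249, 0.56225353, -0.61580149, -1.25837695],
}
# Table 14: group 0 is seeded by company A, group 1 by company B.
centers = [samples["A"], samples["B"]]


def sq_dist(a, b):
    # Assumed distance: sum of squared differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))


# One distance per group for every company, then the index of the smaller one.
distances = {c: [sq_dist(row, ctr) for ctr in centers] for c, row in samples.items()}
groups = {c: d.index(min(d)) for c, d in distances.items()}
```

The resulting grouping puts A, C, D, and E in group 0 and B in group 1, as stated above.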
Determining a new set of clustering initial center points
Companies A, C, D, and E, which make up group 0, form a data set, and each column of the data set is averaged. The data set and the resulting averages are listed in Table 16.
Table 16 Group 0 data set and averages
Company code | x1 | x2 | x3 | x4
A | 1.351023 | 0.5033223 | -0.60928488 | -1.24506042
C | 1.35847229 | 0.49136232 | -0.60697698 | -1.24285762
D | 1.35865503 | 0.45288501 | -0.51326968 | -1.29827036
E | 1.3119249 | 0.56225353 | -0.61580149 | -1.25837695
Average | 1.345018805 | 0.50245579 | -0.586333258 | -1.261141338
The last row of Table 16 gives, for each column index, the average of the data of the four companies A, C, D, and E.
Group 1 contains only one data point, B, so the data of B under each column index is taken as the group 1 average.
The averages listed in Table 16 are used as the data of group 0 and combined with the data of group 1 in Table 14 to form the new clustering initial center point set sample data listed in Table 17.
Table 17 New clustering initial center point set sample data
Group | x1 | x2 | x3 | x4
0 | 1.345018805 | 0.50245579 | -0.586333258 | -1.261141338
1 | 1.43136473 | 0.3542982 | -0.55270519 | -1.23295773
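The center point update is a column-wise mean over the members of each group. A minimal sketch using the values of Tables 13 and 14 follows; the helper name `column_means` is hypothetical.

```python
def column_means(rows):
    # Average each column over all rows of one group.
    return [sum(col) / len(rows) for col in zip(*rows)]


group0 = [  # companies A, C, D, E (Table 16)
    [1.351023, 0.5033223, -0.60928488, -1.24506042],
    [1.35847229, 0.49136232, -0.60697698, -1.24285762],
    [1.35865503, 0.45288501, -0.51326968, -1.29827036],
    [1.3119249, 0.56225353, -0.61580149, -1.25837695],
]
group1 = [  # company B only, so its mean is the row itself
    [1.43136473, 0.3542982, -0.55270519, -1.23295773],
]

new_centers = [column_means(group0), column_means(group1)]
```

The first entry of `new_centers` matches the Average row of Table 16 to the printed precision, and the second is the row of company B, so together they give Table 17.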
The distances between every company in Table 13 and the data samples of group 0 and group 1 in Table 17 are calculated again according to formula (3) disclosed herein, and the results are listed in Table 18.
Table 18 Distances between all companies in Table 13 and group 0 and group 1 in Table 17
Company code | Distance from group 0 | Distance from group 1
A | 0.000822 | 0.032011
B | 0.031331 | 0
C | 0.001065 | 0.027143
D | 0.00936 | 0.020827
E | 0.005547 | 0.062139
In Table 18, the first column gives the distance from group 0: companies A, C, D, and E are closer to group 0, so they remain together in group 0. The second column gives the distance from group 1: A, C, D, and E are all farther from group 1, so B alone is assigned to group 1.
At this point, companies A, B, C, D, E are divided into two classes, with A, C, D, and E assigned to group 0 and B assigned to group 1. The grouping result is unchanged from the previous round, so the classification calculation ends. The classification results of companies A, B, C, D, E are listed in Table 19.
Table 19 Classification results of companies A, B, C, D, E
Company code | Group number
A | 0
B | 1
C | 0
D | 0
E | 0
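The second round can be checked numerically. The sketch below recomputes the distances from each Table 13 sample to the two center points of Table 17, assuming the distance is the sum of squared differences, and confirms that the grouping equals the previous one. Variable names are hypothetical.

```python
samples13 = {  # Table 13
    "A": [1.351023, 0.5033223, -0.60928488, -1.24506042],
    "B": [1.43136473, 0.3542982, -0.55270519, -1.23295773],
    "C": [1.35847229, 0.49136232, -0.60697698, -1.24285762],
    "D": [1.35865503, 0.45288501, -0.51326968, -1.29827036],
    "E": [1.3119249, 0.56225353, -0.61580149, -1.25837695],
}
centers17 = [  # Table 17
    [1.345018805, 0.50245579, -0.586333258, -1.261141338],
    [1.43136473, 0.3542982, -0.55270519, -1.23295773],
]
previous = {"A": 0, "B": 1, "C": 0, "D": 0, "E": 0}  # grouping from the first round


def sq_dist(a, b):
    # Assumed distance: sum of squared differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))


table18 = {c: [sq_dist(row, ctr) for ctr in centers17] for c, row in samples13.items()}
current = {c: d.index(min(d)) for c, d in table18.items()}
converged = current == previous  # stopping condition: grouping unchanged
```

Rounded to six decimals, the computed distances agree with Table 18, and `current` is the grouping listed in Table 19.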
The company classification method based on financial voucher data disclosed in the embodiments of the present application processes company financial voucher data along several dimensions, such as monthly occurrence amount, billing frequency, and subject content, and clusters the companies on the resulting data samples, so that the classification result reflects the type characteristics of each company more objectively and has greater practical reference value.
The technical solutions and technical details disclosed in the embodiments of the present application merely illustrate the inventive concept of the application and do not limit its technical solutions. All conventional changes, substitutions, or combinations of the technical details disclosed herein share the same inventive concept and fall within the scope of the claims of the application.

Claims (9)

1. A company classification method based on financial voucher data, the method comprising the following steps:
S1: grouping the original financial voucher detail data of a single company to obtain first sample data regarding the monthly occurrence amount of each subject and second sample data regarding accounting frequency, wherein the column index of the first sample data comprises the three dimensions of time, subject, and debit/credit indicator, and the column index of the second sample data comprises the three dimensions of time, subject, and debit/credit indicator;
S2: summarizing the first sample data of all companies to obtain first total sample data, and summarizing the second sample data of all companies to obtain second total sample data;
S3: combining the first total sample data and the second total sample data of all companies to obtain initial sample data of all companies;
S4: determining, according to a preset classification number, a number of clustering initial center points equal to the classification number;
S5: classifying according to the classification number, the clustering initial center points, and the initial sample data, and determining the cluster label of each class and the companies contained in each class.
2. The method according to claim 1, wherein in step S1 the first sample data is a 1 × n order matrix and the second sample data is a 1 × n order matrix, where n is the number of column indexes and is a natural number.
3. The method according to claim 1, wherein in step S2 the first total sample data is an m × N order matrix and the second total sample data is an m × N order matrix, where m is the number of companies, N is the number of column indexes, equal to the number of column indexes in the union of the column indexes of all companies, and m and N are both natural numbers.
4. The company classification method based on financial voucher data according to claim 3, wherein in step S3 the first total sample data and the second total sample data are each normalized and then merged by column to obtain an m × 2N order matrix of initial sample data, where m is the number of companies and 2N is the total number of column indexes.
5. The company classification method based on financial voucher data according to claim 1, wherein in step S4 the preset classification number is k, and determining the clustering initial center points specifically comprises:
S401: calculating, for each company in the first total sample data, the sum of squares of the data corresponding to all column indexes, taking the company with the largest sum of squares as the first clustering initial center point, and moving the sample data corresponding to the first clustering initial center point out of the initial sample data;
S402: calculating the distance between the sample data of each company in the initial sample data remaining after step S401 and the sample data of the first clustering initial center point, taking the company with the largest distance as the second clustering initial center point, and moving the sample data corresponding to the second clustering initial center point out of the initial sample data;
S403: calculating the distance between the sample data of each company in the total sample data remaining after step S402 and the sample data of the first clustering initial center point and of the second clustering initial center point, taking the company with the largest distance as the third clustering initial center point, and moving its sample data out of the initial sample data;
S404: and so on, until k clustering initial center points, equal in number to the classification number k, are obtained, forming the set of clustering initial center points.
6. The company classification method based on financial voucher data according to claim 4, wherein in step S4 the preset classification number is k, and determining the clustering initial center points specifically comprises:
S401: calculating the sum of squares P_i of the data corresponding to all column indexes of each company in the m × N order matrix of first total sample data, taking the company with the largest P_i as the first clustering initial center point, and moving its 1 × 2N order matrix of sample data out of the initial sample data, whereby the initial sample data becomes an (m-1) × 2N order matrix; P_i is calculated as follows:
wherein x_i y_l represents the data in row i and column l of the m × N order matrix of first total sample data, l is a natural number from 1 to N, and i is a natural number from 1 to m;
S402: calculating the distance between the sample data of each company in the (m-1) × 2N order matrix of initial sample data and the 1 × 2N order matrix of sample data of the first clustering initial center point, taking the company with the largest distance as the second clustering initial center point, and moving its 1 × 2N order matrix of sample data out of the initial sample data, whereby the initial sample data becomes an (m-2) × 2N order matrix;
wherein the distance L_{1-i} is calculated as follows:
wherein x_1 y_l represents the data in column l of the 1 × 2N order matrix of the first clustering initial center point sample data; x_i y_l represents the data in row i and column l of the (m-1) × 2N order matrix of initial sample data; l is a natural number from 1 to 2N, and i is a natural number from 1 to m-1; L_{1-i} is the distance between the first clustering initial center point sample data and the i-th data sample in the initial sample data;
S403: calculating the distance between the sample data of each company in the (m-2) × 2N order matrix of initial sample data and the sample data of the first clustering initial center point and of the second clustering initial center point, taking the company with the largest distance as the third clustering initial center point, and moving its sample data out of the initial sample data, whereby the initial sample data becomes an (m-3) × 2N order matrix;
wherein the distance L_{j-i} between two sample data is calculated according to the following formula:
wherein x_j y_l represents the data in column l of the 1 × 2N order matrix of sample data j, and x_i y_l represents the data in column l of the 1 × 2N order matrix of sample data i; l is a natural number from 1 to 2N, i is a natural number from 1 to m-2, and j is a natural number from 1 to m-2; L_{j-i} is the distance between sample data j and sample data i;
S404: and so on, until k clustering initial center points, equal in number to the classification number k, are obtained, forming a k × 2N order matrix of clustering initial center points, where k is a natural number smaller than m.
7. The company classification method based on financial voucher data according to claim 1, wherein step S5 comprises: calculating the distance between each sample in the initial sample data and each sample in the clustering initial center points, assigning the company represented by each sample to the class of the company represented by the nearest clustering initial center point, and finally dividing all companies into the preset number of classes.
8. The company classification method based on financial voucher data according to claim 6, wherein step S5 comprises:
S501: calculating the distance between each sample in the m × 2N order matrix of initial sample data and each sample in the k × 2N order matrix of clustering initial center points, assigning each company to the class of the clustering initial center point nearest to its sample data, so that all companies are divided into k classes, wherein the sample data of each class of companies form a c × 2N order matrix, and c represents the number of companies in the respective class;
S502: calculating the average of each column of data in each c × 2N order matrix, the averages forming a 1 × 2N order matrix;
S503: the 1 × 2N order matrices formed in step S502 form a new k × 2N order matrix of clustering initial center points; if this new matrix is the same as the k × 2N order matrix of clustering initial center points in step S501, the clustering algorithm ends; otherwise, the method proceeds to step S504;
S504: calculating the distance between each sample in the m × 2N order matrix of initial sample data and each sample in the k × 2N order matrix of new clustering initial center points, assigning each company to the class of the new clustering initial center point nearest to its sample data, so that all companies are divided into k classes, wherein the sample data of each class of companies form a c × 2N order matrix, and c represents the number of companies in the respective class;
S505: if the classes obtained in step S504 are the same as the classes in step S501 and each class contains the same companies, the classification ends; otherwise, the next round of clustering calculation is repeated from step S501.
9. A computer-readable medium containing computer-executable instructions, characterized in that the computer-executable instructions, when processed by a data processing device, perform the company classification method based on financial voucher data according to any one of claims 1 to 8.
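The formulas referred to in steps S401 to S403 of claim 6 do not appear in this text. A possible reconstruction, offered only as a reading aid, follows from the verbal definitions given there. The sum of squares is determined by its description; the two distances are assumed to be sums of squared differences without a square root, an assumption that reproduces the distances listed in Table 18.

```latex
% Assumed reconstruction; not copied from the original formula images.
P_i     = \sum_{l=1}^{N}  \left( x_i y_l \right)^2                % sum of squares, S401
L_{1-i} = \sum_{l=1}^{2N} \left( x_1 y_l - x_i y_l \right)^2      % distance to first center point, S402
L_{j-i} = \sum_{l=1}^{2N} \left( x_j y_l - x_i y_l \right)^2      % distance between samples j and i, S403
```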
CN202110969456.9A 2021-08-23 2021-08-23 Company classification method based on financial evidence data and computer readable medium Active CN113643116B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110969456.9A CN113643116B (en) 2021-08-23 2021-08-23 Company classification method based on financial evidence data and computer readable medium
ZA2021/09633A ZA202109633B (en) 2021-08-23 2021-11-26 Method for classifying companies based on financial voucher data, and computer-readable medium

Publications (2)

Publication Number Publication Date
CN113643116A CN113643116A (en) 2021-11-12
CN113643116B true CN113643116B (en) 2023-10-27

Family

ID=78423413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110969456.9A Active CN113643116B (en) 2021-08-23 2021-08-23 Company classification method based on financial evidence data and computer readable medium

Country Status (2)

Country Link
CN (1) CN113643116B (en)
ZA (1) ZA202109633B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020066026A (en) * 2001-02-08 2002-08-14 주식회사 아이앤아이오 Enterprise Management System and method thereof
CN110796159A (en) * 2019-09-12 2020-02-14 国网浙江省电力有限公司杭州供电公司 Power data classification method and system based on k-means algorithm
CN111639516A (en) * 2019-03-01 2020-09-08 埃森哲环球解决方案有限公司 Analysis platform based on machine learning
CN112634003A (en) * 2020-12-29 2021-04-09 深圳行智互动科技有限公司 Method for generating financial record from user actual operation data based on network
CN112750023A (en) * 2020-12-16 2021-05-04 苏宁消费金融有限公司 Consumption financial user income estimation method based on factor analysis
CN112905863A (en) * 2021-03-19 2021-06-04 青岛檬豆网络科技有限公司 Automatic customer classification method based on K-Means clustering
CN113051462A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 Multi-classification model training method, system and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346371A (en) * 2013-07-31 2015-02-11 Sap欧洲公司 Business integration system report driven by in-memory database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
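Classification and Evaluation of the Operating Performance of Listed Companies; Yang Shanlin; Jiang Bing; Mathematics in Practice and Theory (No. 01); 29-34 *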

Also Published As

Publication number Publication date
CN113643116A (en) 2021-11-12
ZA202109633B (en) 2022-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant