CN106649890B - Data storage method and device - Google Patents

Publication number
CN106649890B
Authority
CN
China
Prior art keywords
data
vector
classification model
type
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710066733.9A
Other languages
Chinese (zh)
Other versions
CN106649890A (en)
Inventor
程力
王云
仇瑜
马超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuiyun Network Technology Service Co ltd
Original Assignee
Shuiyun Network Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shuiyun Network Technology Service Co ltd
Priority to CN201710066733.9A
Publication of CN106649890A
Application granted
Publication of CN106649890B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12 Accounting
    • G06Q40/125 Finance or payroll
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The application discloses a data storage method and device. One embodiment of the method comprises: acquiring feature information of data to be stored, wherein the feature information comprises at least one of the following items: the name of a data table entry in the data table to which the data belongs, statistical feature information indicating statistical features of the data, and a keyword; converting the feature information into an input vector of a data classification model, and inputting the input vector into the data classification model to obtain an output vector indicating the type of the data, wherein the data classification model is trained in advance in a supervised manner using training samples, and each training sample comprises the feature information of stored data and the labeled type of that stored data; and storing the data in a storage area corresponding to the type. The method saves storage space while allowing data to be stored quickly.

Description

Data storage method and device
Technical Field
The present application relates to the field of computer technologies, in particular to the field of internet technologies, and more particularly to a data storage method and apparatus.
Background
Data storage covers the collection, storage, retrieval, processing, transformation and transmission of data. In existing data storage, especially in the financial and tax fields, data features and the data types corresponding to those features are first defined manually according to business needs, and the data is then stored accordingly to facilitate subsequent financial accounting.
However, existing data storage systems applied to the financial and tax fields cannot analyze and process unstructured data. In addition, because different financial accounting systems differ greatly, data features and matching rules must be defined repeatedly, once for each accounting system, before data can be stored. This increases the complexity of data storage, occupies a large amount of storage space, and reduces the efficiency with which the data is used.
Disclosure of Invention
It is an object of the present application to provide an improved data storage method and apparatus to solve the technical problems mentioned in the background section above.
In a first aspect, the present application provides a data storage method, where the method includes: acquiring characteristic information of data to be stored, wherein the characteristic information comprises at least one of the following items: the name of a data table item in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword; converting the feature information into an input vector of a data classification model, and inputting the input vector into a data classification model to obtain an output vector indicating a type of the data, wherein the data classification model is generated by training in a supervised manner by using a training sample in advance, and the training sample comprises: the characteristic information of the stored data and the marked type of the stored data; and storing the data in a storage area corresponding to the type.
In some embodiments, the data classification model is a decision tree model.
In some embodiments, the data is data in a data table, and the feature information includes the name of the data table entry in the data table to which the data belongs and the statistical feature information; and converting the feature information into an input vector of a data classification model and inputting the input vector into the data classification model to obtain an output vector indicating the type of the data includes: generating a data table feature vector corresponding to the feature information, wherein the data table feature vector comprises a component representing the name of the data table entry in the data table to which the data belongs and a component representing the statistical feature information; generating a first input vector of the data classification model that comprises, in order, the data table feature vector and a zero vector; and inputting the first input vector into the data classification model to obtain an output vector indicating the type of the data.
In some embodiments, the statistical feature information includes: association information indicating an association relationship between data table entries, the average length of the data, the maximum length of the data, the minimum length of the data, and the types of characters in the data.
In some embodiments, the data is text data, and the feature information is a keyword; and converting the feature information into an input vector of a data classification model and inputting the input vector into the data classification model to obtain an output vector indicating the type of the data includes: generating a keyword feature vector corresponding to the feature information, wherein each keyword corresponds to one component of the keyword feature vector; generating a second input vector of the data classification model that comprises, in order, a zero vector and the keyword feature vector; and inputting the second input vector into the data classification model to obtain an output vector indicating the type of the data.
In a second aspect, the present application provides a data storage device, the device comprising: an obtaining unit configured to obtain feature information of data to be stored, the feature information including at least one of: the name of a data table entry in the data table to which the data belongs, statistical feature information indicating statistical features of the data, and a keyword; an input unit configured to convert the feature information into an input vector of a data classification model and input the input vector into the data classification model to obtain an output vector indicating the type of the data, the data classification model being trained in advance in a supervised manner using training samples, each training sample including the feature information of stored data and the labeled type of that stored data; and a storage unit configured to store the data in a storage area corresponding to the type.
In some embodiments, the data classification model is a decision tree model.
In some embodiments, the data is data in a data table, the feature information includes the name of the data table entry in the data table to which the data belongs and the statistical feature information, and the input unit includes: a data table feature vector generation subunit configured to generate a data table feature vector corresponding to the feature information, where the data table feature vector includes a component representing the name of the data table entry in the data table to which the data belongs and a component representing the statistical feature information; a first input vector generation subunit configured to generate a first input vector of the data classification model that includes, in order, the data table feature vector and a zero vector; and an output vector generation subunit configured to input the first input vector into the data classification model to obtain an output vector indicating the type of the data.
In some embodiments, the statistical feature information includes: association information indicating an association relationship between data table entries, the average length of the data, the maximum length of the data, the minimum length of the data, and the types of characters in the data.
In some embodiments, the data is text data, the feature information is a keyword, and the input unit includes: the keyword feature vector generating subunit is configured to generate keyword feature vectors corresponding to the feature information, wherein each keyword in the keyword feature vectors corresponds to one component; a second input vector generation subunit configured to generate a second input vector of the data classification model sequentially including a zero vector and the keyword feature vector; and the output vector generating subunit is configured to input the second input vector to a data classification model, so as to obtain an output vector indicating the type of the data.
According to the data storage method and device, the feature information of the data to be stored is obtained, the feature information is converted into an input vector and input into the data classification model trained in a supervised manner, and the data is stored in the storage area corresponding to the data type indicated by the output vector of the model. The data is thereby classified effectively by data type, and storage space in the data storage area is saved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a data storage method according to the present application;
FIG. 3 is a flow diagram of yet another embodiment of a data storage method according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a data storage device according to the present application;
FIG. 5 is a block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the data storage method or data storage apparatus of the present application may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired communication links, wireless communication links, or fiber optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various client applications installed thereon, such as a web browser application, a data accounting type application, a financial reporting type application, a search type application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background data processing server providing data support for applications running on the terminal devices 101, 102, 103, or may be a server collecting data from various data sources. The background data processing server can analyze and process the data acquired from the data source, and store and feed back the processing result to the terminal equipment.
It should be noted that the data storage method provided by the embodiment of the present application is generally executed by the server 105, and accordingly, the data storage device is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram 200 of one embodiment of a data storage method according to the present application is shown. The data storage method comprises the following steps:
Step 201, obtaining characteristic information of data to be stored.
In this embodiment, an electronic device (for example, a server shown in fig. 1) on which the data storage method operates may acquire data source information of data to be stored in a wired connection manner or a wireless connection manner, and acquire the data to be stored according to the data source information. Here, the data source refers to an original medium providing desired data or a database supported by a storage device. Data source information refers to information needed to establish a database connection. In obtaining data to be stored based on the data source information, the data to be stored may be obtained from a network, a database, or an application associated with a financial system.
When the data to be stored is acquired from the database, the electronic device may find the corresponding database connection relationship by providing the correct data source name to the server supporting the database, and further acquire the data to be stored from the corresponding data source.
When data to be stored is acquired from a financial system of an enterprise, the data source information may include financial internal information and external information, wherein the internal information may include various business processing data and various document data, and the external information may include various laws and regulations, market information, and the like.
In this embodiment, after the server acquires the data to be stored from the data source, the characteristic information of the data to be stored may be further acquired, where the characteristic information of the data to be stored includes at least one of the following: the name of the data table entry in the data table to which the data to be stored belongs, the statistical characteristic information indicating the statistical characteristic of the data, and the keyword. Here, the data table may be disposed in the database, and is used for storing the data to be stored. One data table may set a name, which may be, for example, a department name, a cost, an employee, and the like. The statistical characteristics can be the number of data, the length of the data and the like. When the data to be stored is text data, the feature information may be a keyword indicating the text content. For example, when the text data is "scientific research expenditure in department a", the keywords may be "department a" and "scientific research expenditure".
In some optional implementations of the embodiment, the statistical characteristic information includes association information indicating an association relationship between the data table entries, an average value of lengths of the data, a maximum value of the lengths of the data, a minimum value of the lengths of the data, and types of characters in the data.
As an example, the server first obtains data to be stored from a plurality of data sources. Then, the server may further obtain names of data entries in data tables to which the data to be stored belong in the database, for example, a name of a data entry in a data table to which one of the data to be stored belongs in the database is "department wage", and a name of a data entry in a data table to which another one of the data to be stored belongs in the database is "performance wage". The server may further obtain statistical characteristic information of the data to be stored, for example, the server may obtain an average value of data lengths of data of "department wage", and may also obtain a minimum value and a maximum value of data lengths of data of "performance wage".
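As a minimal sketch of how the statistical characteristic information described above might be computed for one data table entry, the following Python example derives the average, maximum, and minimum data length and the character types present in the data. The function names, the four character classes, and the sample wage values are illustrative assumptions and do not come from the application.

```python
from statistics import mean


def char_types(value: str) -> set[str]:
    """Return the coarse character classes present in one value (assumed classes)."""
    kinds = set()
    for ch in value:
        if ch.isdigit():
            kinds.add("digit")
        elif ch.isalpha():
            kinds.add("letter")
        elif ch.isspace():
            kinds.add("space")
        else:
            kinds.add("symbol")
    return kinds


def statistical_features(values: list[str]) -> dict:
    """Compute length statistics and character types for the values of one entry."""
    lengths = [len(v) for v in values]
    kinds = set()
    for v in values:
        kinds |= char_types(v)
    return {
        "avg_len": mean(lengths),
        "max_len": max(lengths),
        "min_len": min(lengths),
        "char_types": sorted(kinds),
    }


# Hypothetical values stored under a "department wage" entry.
features = statistical_features(["5200.00", "4800.50", "12000"])
print(features)
```

Association information between data table entries is omitted here because the application does not say how it is encoded.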
Step 202, converting the characteristic information into an input vector of the data classification model, and inputting the input vector into the data classification model to obtain an output vector indicating the type of the data.
In this embodiment, based on the feature information of the data to be stored acquired in step 201, the server may construct a multidimensional vector representing a plurality of features of the data to be stored and use it as the input vector of the data classification model. The input vector includes a component representing the name of a data table entry, a component representing statistical features of the data, and a component representing a keyword. The input vector is then input into the data classification model to obtain an output vector indicating the type of the data to be stored. The output vector may include a type component for each preset data type and a matching degree component between the data to be stored and that type. The matching degree represents the strength of the correspondence between the data to be stored and a data type. In general, the higher the matching degree, the greater the probability that the data to be stored belongs to that type.
The type of data may include a character string data type for representing names of various things such as a department name, a document name, may also include a data type for representing numbers such as an integer, a floating point, a positive number, a negative number, may also include a data type for representing a date and time, may also include a data type for representing money, and the like.
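One way to read an output vector of this kind can be sketched as follows, under the assumption that the vector holds one matching degree per preset data type and that the type with the highest matching degree is chosen. The label list and the example scores are hypothetical; the application does not fix the layout of the output vector.

```python
# Assumed preset data types, in the same order as the output vector components.
TYPE_LABELS = ["string", "number", "datetime", "money"]


def predicted_type(output_vector: list[float]) -> tuple[str, float]:
    """Return the type with the highest matching degree and that degree."""
    if len(output_vector) != len(TYPE_LABELS):
        raise ValueError("output vector length must match the number of types")
    best = max(range(len(output_vector)), key=output_vector.__getitem__)
    return TYPE_LABELS[best], output_vector[best]


label, score = predicted_type([0.05, 0.80, 0.05, 0.10])
print(label, score)
```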
The data classification model may be used to describe a correspondence between data to be stored (e.g., data in a data table) and a type of the data (e.g., a data type representing a number). The data classification model is trained by a machine learning method in a supervised manner, using as training samples the feature information of stored data, the labeled type of the stored data that matches that feature information, and the matching degree between the feature information of the stored data and the type of the stored data.
The supervised learning mode can be carried out through the following steps:
First, the stored data is used as a training sample, and the server acquires the feature information of the stored data. For example, when the stored data is data in a database, since a plurality of data tables exist in the database, the server may obtain the name of the data table entry of the stored data, the types of characters in the stored data, and the like; when the stored data is text data, the server may acquire keywords of the stored data as the feature information.
Second, a data type label is set for the stored data. The label may be, for example, a data type indicating a number, a data type indicating a date, a data type indicating text, or the like.
Third, a matching degree between the data type of the stored data and the feature information of the stored data is established based on the data type label and the feature information of the stored data. Since each stored data sample has at least one item of feature information and corresponds to one data type label, the server can calculate the matching degree between the data type of the stored data and the feature information of the stored data according to a set algorithm.
Finally, the data classification model is trained by a machine learning method based on the feature information of the stored data, the labeled type of the stored data that matches that feature information, and the matching degree between the feature information of the stored data and the type of the stored data.
The machine learning method may include a neural network, a genetic algorithm, and the like.
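The supervised procedure above can be illustrated with a deliberately simple stand-in. The sketch below trains a nearest-centroid classifier on labeled feature vectors; this is not the model named in the application (which mentions neural networks, genetic algorithms, and, in some embodiments, a decision tree) but a minimal substitute chosen to keep the example self-contained. The three-component feature layout and the sample values are assumptions.

```python
from collections import defaultdict
from math import dist


def train(samples):
    """Learn one centroid per labeled type from (feature_vector, type_label) pairs."""
    grouped = defaultdict(list)
    for vector, label in samples:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(model, vector):
    """Return the type whose centroid lies closest to the given feature vector."""
    return min(model, key=lambda label: dist(model[label], vector))


# Hypothetical feature layout: [average length, has digits, has letters].
training_samples = [
    ([7.0, 1.0, 0.0], "number"),
    ([5.0, 1.0, 0.0], "number"),
    ([10.0, 0.0, 1.0], "string"),
    ([14.0, 0.0, 1.0], "string"),
]
model = train(training_samples)
print(classify(model, [6.5, 1.0, 0.0]))
```

The distance to each centroid plays a role loosely comparable to the matching degree in the text: a smaller distance corresponds to a stronger match.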
This step is illustrated by taking data of the type "department name" as an example. The same field goes by different names in different application scenarios: different systems may each use their own term for a department, yet all of these terms denote a "department name". Therefore, when the data to be stored in a system is named with any of these terms, the feature information related to that name acquired in step 201 may be converted into an input vector of the data classification model and input into the data classification model for matching, so as to obtain an output vector indicating the type of the data to be stored. The server may then determine from the output vector that the type of the data to be stored is "department name".
Step 203, storing the data in a storage area corresponding to the type of the data indicated by the output vector.
In this embodiment, the type to which the data belongs may be determined from the output vector of the data classification model obtained in step 202, so that the data is stored in the storage area corresponding to that type. To manage data conveniently, effectively, and uniformly in a server or a client, storage areas are usually set up per data type. After the server determines the type of the data to be stored from the output vector, it may first check whether a storage area has already been set up for that data type. If so, the data to be stored can be stored directly in the storage area corresponding to the type; if not, the server may create a new storage area in which to store it.
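The look-up-or-create behavior of step 203 can be sketched as follows, assuming storage areas are modeled as in-memory lists keyed by data type. The class name `TypedStore` and its method are hypothetical; a real system would map each type to a table, partition, or directory instead.

```python
class TypedStore:
    """Keeps one storage area per data type, creating areas on demand."""

    def __init__(self):
        self.areas: dict[str, list] = {}

    def store(self, data, data_type: str) -> bool:
        """Store data under its type; return True if a new area was created."""
        created = data_type not in self.areas
        self.areas.setdefault(data_type, []).append(data)
        return created


store = TypedStore()
first = store.store("Finance Dept", "department name")   # no area yet: one is created
second = store.store("R&D Dept", "department name")      # the existing area is reused
print(first, second, store.areas)
```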
According to the data storage method provided by this embodiment of the application, the feature information of the data to be stored is obtained; the feature information is then converted into an input vector of the pre-trained data classification model and input into the model to obtain an output vector indicating the type of the data; and finally the data is stored in the storage area corresponding to the data type indicated by the output vector. The data to be stored is thereby classified, which improves the efficiency of data storage and saves storage space.
With further reference to FIG. 3, a flow 300 of yet another embodiment of a data storage method is illustrated. The process 300 of the data storage method includes the following steps:
Step 301, obtaining characteristic information of data to be stored.
Data can be divided into various types; according to whether it can be logically expressed by a two-dimensional table structure, data can be divided into structured data and unstructured data. Structured data, i.e., row data, can be represented by a uniform structure, such as numbers, symbols, and traditional data models; unstructured data refers to data in which the field length is variable and the record of each field can be composed of repeatable or non-repeatable sub-fields. Unstructured data includes video, audio, documents, text, pictures, various reports, images, office documents, and the like. In a financial system, there is a large amount of data in data tables, namely structured data, whose feature information can be represented by the data length, the types of characters in the data, and the like; there is also a large amount of text data, whose feature information can be represented by keywords.
In this embodiment, the electronic device (for example, the server shown in FIG. 1) on which the data storage method runs may acquire the feature information of the data to be stored through a wired or wireless connection. When the data to be stored is data in a data table, the feature information includes at least one of the following items: the name of the data table entry in the data table to which the data belongs, and statistical feature information indicating statistical features of the data. The statistical feature information further includes association information indicating an association relationship between data table entries, the average length of the data, the maximum length of the data, the minimum length of the data, and the types of characters in the data. When the data to be stored is text data, the feature information includes a keyword.
In this embodiment, when the data to be stored is text data, a natural language processing method or a recurrent neural network model may be used to perform word segmentation on the text data, so as to determine the keywords in the text data.
Step 302, generating a data table feature vector corresponding to the feature information.
In this embodiment, the server may generate a data table feature vector from the feature information, acquired in step 301, of the data to be stored in the data table, where the data table feature vector includes a component representing the name of the data table entry in the data table to which the data belongs and a component representing the statistical feature information. As an example, in one system, the data "B" to be stored is "employee information". Items of "employee information" such as "sex" and "age" may be stored in a data table named "basic information of employee", or may be stored by associating them with a "department information" data table through a primary key and foreign key relationship. The feature vector corresponding to the data "B" to be stored then consists of a component indicating the name of the data table entry to which the "employee information" belongs, a component indicating the association with "department information", and a component indicating the average length of the employee information data.
Step 303, generating a first input vector of the data classification model that comprises, in order, the data table feature vector and a zero vector.
The input vector of the data classification model mainly comprises two parts: a data table feature vector and a keyword feature vector. When the data to be stored is data table data, i.e., structured data, the keyword feature vector part can be represented as a zero vector; when the data to be stored is text data, i.e., unstructured data, the data table feature vector part can be represented as a zero vector.
In this embodiment, when the data to be stored is determined in step 301 to be data in a data table, the server may generate a first input vector of the data classification model from the data table feature vector determined in step 302, where the first input vector comprises, in order, that data table feature vector and a zero vector.
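The two-part layout of the input vector can be sketched as follows. The dimensions `TABLE_DIM` and `KEYWORD_DIM` and both function names are illustrative assumptions; the application does not state the size of either part. The sketch also shows the complementary layout used for text data, in which the data table part is the zero vector.

```python
TABLE_DIM = 4    # assumed number of data table feature components
KEYWORD_DIM = 3  # assumed number of keyword feature components


def first_input_vector(table_features: list[float]) -> list[float]:
    """Structured data: data table feature vector followed by a zero vector."""
    if len(table_features) != TABLE_DIM:
        raise ValueError("unexpected data table feature vector length")
    return list(table_features) + [0.0] * KEYWORD_DIM


def second_input_vector(keyword_features: list[float]) -> list[float]:
    """Text data: zero vector followed by the keyword feature vector."""
    if len(keyword_features) != KEYWORD_DIM:
        raise ValueError("unexpected keyword feature vector length")
    return [0.0] * TABLE_DIM + list(keyword_features)


table_input = first_input_vector([1.0, 6.5, 7.0, 5.0])
text_input = second_input_vector([1.0, 1.0, 0.0])
print(table_input)
print(text_input)
```

Padding with zeros keeps both kinds of input at the same length, so a single model can accept either.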
Step 304, generating a keyword feature vector corresponding to the feature information.
In this embodiment, when the data to be stored is text data, the feature information of the text data is its keywords, so in this step a keyword feature vector may be generated from the keywords of the text data, where each keyword corresponds to one component of the keyword feature vector. The keyword feature vector may be generated by using a vector space model, which is a conventional technique and is not described here. As an example, a system may contain a large amount of unstructured text data such as documents and contracts. When the data to be stored is the "company C contract", the server acquires the keywords "company C" and "contract" as its feature information and generates one keyword component corresponding to "company C" and another corresponding to "contract".
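A minimal vector space model for this step can be sketched as a term-count vector over a fixed vocabulary, with one component per keyword. The vocabulary below, including the extra term "invoice", is an assumption made for illustration; the application does not specify the vocabulary or any term weighting.

```python
from collections import Counter

# Assumed fixed vocabulary; each entry owns one component of the vector.
VOCABULARY = ["company C", "contract", "invoice"]


def keyword_vector(keywords: list[str]) -> list[float]:
    """Count how often each vocabulary keyword occurs among the extracted keywords."""
    counts = Counter(keywords)
    return [float(counts[term]) for term in VOCABULARY]


vec = keyword_vector(["company C", "contract"])
print(vec)
```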
Step 305, generating a second input vector of the data classification model that comprises, in sequence, a zero vector and the keyword feature vector.
In this embodiment, when step 301 determines that the data to be stored is text data, the server may generate the second input vector of the data classification model from the keyword feature vector determined in step 304. The second input vector comprises, in sequence, a zero vector and that keyword feature vector.
Step 306, inputting the input vector into the data classification model to obtain an output vector indicating the type of the data.
In this embodiment, the server may input the first input vector and the second input vector, determined in steps 303 and 305 respectively, into the data classification model and obtain, for each, an output vector indicating the type of the data. The output vector may include, for each preset data type, a type component and a component giving the degree of matching between the data to be stored and that type. Here, the data classification model may first determine from the input vector whether the data to be stored is data in a data table or text data, and then process the two kinds of data separately, generating an output vector for the first input vector and for the second input vector respectively. For example, when the server inputs an input vector generated from data "X" to be stored, the data classification model may determine, from the data table feature components and the zero vector in the input vector, that "X" is data in a data table and that its data type is "data type related to numbers", so the model outputs an output vector corresponding to "data type related to numbers". As another example, when the server inputs an input vector generated from data "Y" to be stored, the data classification model may determine, from the zero vector and the keyword feature components in the input vector, that "Y" is text data and that its data type is "character type", so the model outputs an output vector corresponding to "character type".
In this embodiment, the data classification model is trained in advance in a supervised manner on training samples. Optionally, the data classification model is a decision tree model. Machine learning methods for decision tree models are well known and widely studied and applied, and are not described again here.
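Supervised training of a decision tree on labeled input vectors can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the Gini-based greedy splitting, the four-component input layout (two data table components followed by two keyword components), the training samples, and the type labels. For brevity the tree returns a type label directly rather than an output vector.

```python
from collections import Counter
from typing import List, Optional, Tuple

Sample = Tuple[List[float], str]  # (input vector, labeled type)


def gini(labels: List[str]) -> float:
    """Gini impurity of a list of type labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(samples: List[Sample]) -> Optional[Tuple[int, float]]:
    """Return the (component index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini([y for _, y in samples])
    for i in range(len(samples[0][0])):
        for t in sorted({x[i] for x, _ in samples}):
            left = [y for x, y in samples if x[i] <= t]
            right = [y for x, y in samples if x[i] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(samples)
            if score < best_score:  # accept only a strict improvement
                best, best_score = (i, t), score
    return best


def train(samples: List[Sample]):
    """Build a tree: a leaf is a type label, a node is (index, threshold, left, right)."""
    split = best_split(samples)
    if split is None:
        return Counter(y for _, y in samples).most_common(1)[0][0]
    i, t = split
    left = [s for s in samples if s[0][i] <= t]
    right = [s for s in samples if s[0][i] > t]
    return (i, t, train(left), train(right))


def predict(tree, x: List[float]) -> str:
    """Walk from the root to a leaf and return the leaf's type label."""
    while isinstance(tree, tuple):
        i, t, left, right = tree
        tree = left if x[i] <= t else right
    return tree


# Hypothetical training samples: [avg length, is numeric, kw "company C", kw "contract"].
training = [
    ([11.0, 1.0, 0.0, 0.0], "number-related"),
    ([9.0, 1.0, 0.0, 0.0], "number-related"),
    ([0.0, 0.0, 1.0, 1.0], "contract"),
    ([0.0, 0.0, 0.0, 1.0], "contract"),
]
tree = train(training)
predicted = predict(tree, [10.0, 1.0, 0.0, 0.0])
```

Because the zero vector occupies a fixed position in each input vector, a split on one of those components is enough for the tree to tell data table data from text data in this toy example.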
Step 307, storing the data in a storage area corresponding to the type of the data indicated by the output vector.
In this embodiment, according to the output vector of the data classification model obtained in step 306, the type to which the data belongs may be determined, so that the data is stored in the storage area corresponding to the type.
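The step of reading the type from the output vector and storing the data accordingly can be sketched as follows. This sketch assumes that the output vector holds one matching-degree component per preset type and that the largest component wins; the type names, the in-memory dictionary standing in for storage areas, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical preset types, one per output vector component.
TYPES = ["number-related", "character", "contract"]


def type_from_output(output_vector: Sequence[float], types: Sequence[str] = TYPES) -> str:
    """Pick the preset type whose matching-degree component is largest."""
    if len(output_vector) != len(types):
        raise ValueError("output vector must have one component per preset type")
    best_index = max(range(len(types)), key=lambda i: output_vector[i])
    return types[best_index]


def store(data: object, output_vector: Sequence[float],
          storage_areas: Dict[str, List[object]]) -> str:
    """Append the data to the storage area of its indicated type and return that type."""
    data_type = type_from_output(output_vector)
    storage_areas.setdefault(data_type, []).append(data)
    return data_type


areas: Dict[str, List[object]] = {}
chosen = store("X", [0.9, 0.1, 0.0], areas)
```

In a real deployment each storage area would presumably be a table, partition, or directory rather than a Python list.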
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the process 300 of the data storage method in this embodiment divides the data to be stored into structured data and unstructured data, that is, data in a data table and text data, and inputs the two kinds of data into the data classification model respectively for matching. The data classification model processes the two kinds of data separately to obtain an output vector indicating the type of the data in the data table and an output vector indicating the type of the text data, so that the data is classified more quickly and effectively, the speed of data storage is increased, and the space for storing the data is reduced.
With further reference to fig. 4, as an implementation of the method shown in the above figures, the present application provides an embodiment of a data storage device, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 4, the data storage device 400 of the present embodiment includes: an obtaining unit 401, an input unit 402, and a storage unit 403. The obtaining unit 401 is configured to obtain feature information of data to be stored, where the feature information includes at least one of the following: the name of a data table entry in the data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword. The input unit 402 is configured to convert the feature information into an input vector of a data classification model and input the input vector into the data classification model to obtain an output vector indicating the type of the data, where the data classification model is generated in advance by supervised training on training samples, and each training sample includes the feature information of stored data and the labeled type of the stored data. The storage unit 403 is configured to store the data in the storage area corresponding to the type.
In this embodiment, specific processing of the obtaining unit 401, the input unit 402, and the storage unit 403 of the data storage device 400 and the technical effects thereof can refer to the related descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of this embodiment, the data is data in a data table, the feature information includes the name of a data table entry in the data table to which the data belongs and statistical characteristic information, and the input unit 402 includes: a data table feature vector generating subunit 4021 configured to generate a data table feature vector corresponding to the feature information, where the data table feature vector includes a component representing the name of the data table entry in the data table to which the data belongs and a component representing the statistical characteristic information; a first input vector generating subunit 4022 configured to generate a first input vector of the data classification model that comprises, in sequence, the data table feature vector and a zero vector; and an output vector generating subunit 4025 configured to input the first input vector into the data classification model and obtain an output vector indicating the type of the data.
In some optional implementations of this embodiment, the statistical characteristic information includes: association information indicating an association relationship between data table entries, the average length of the data, the maximum length of the data, the minimum length of the data, and the types of characters in the data.
In some optional implementations of this embodiment, the data is text data, the feature information is a keyword, and the input unit 402 includes: a keyword feature vector generating subunit 4023 configured to generate a keyword feature vector corresponding to the feature information, where each keyword corresponds to one component of the keyword feature vector; a second input vector generating subunit 4024 configured to generate a second input vector of the data classification model that comprises, in sequence, a zero vector and the keyword feature vector; and the output vector generating subunit 4025 configured to input the second input vector into the data classification model and obtain an output vector indicating the type of the data.
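The length and character-type statistics listed above can be computed for the values of one data table entry as in the sketch below. The three character classes ("digit", "letter", "other"), the dictionary keys, and the sample values are assumptions; the association information between data table entries is omitted because the source does not say how it is derived.

```python
from typing import Dict, List, Set


def char_types(value: str) -> Set[str]:
    """Classify each character of a value as digit, letter, or other."""
    kinds: Set[str] = set()
    for ch in value:
        if ch.isdigit():
            kinds.add("digit")
        elif ch.isalpha():
            kinds.add("letter")
        else:
            kinds.add("other")
    return kinds


def statistical_features(values: List[str]) -> Dict[str, object]:
    """Average, maximum, and minimum length plus the character types seen."""
    if not values:
        raise ValueError("at least one value is required")
    lengths = [len(v) for v in values]
    kinds: Set[str] = set()
    for v in values:
        kinds |= char_types(v)
    return {
        "avg_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
        "min_length": min(lengths),
        "char_types": sorted(kinds),
    }


# Hypothetical values of a single data table entry (column).
stats = statistical_features(["12345", "678", "90ab"])
```

To feed these statistics to the model, each value would still need a numeric encoding, for example one 0/1 component per character class.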
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing a server according to embodiments of the present application is shown.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, and the like. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, and the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, an input unit, and a storage unit. Here, the names of the units do not constitute a limitation to the units themselves in some cases, and for example, the acquisition unit may also be described as a "unit that acquires characteristic information of data to be stored".
As another aspect, the present application also provides a non-volatile computer storage medium, which may be the non-volatile computer storage medium included in the apparatus in the above-described embodiments, or may be a non-volatile computer storage medium that exists separately and is not incorporated into the terminal. The non-volatile computer storage medium stores one or more programs that, when executed by a device, cause the device to: acquire feature information of data to be stored, where the feature information includes at least one of the following: the name of a data table entry in the data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword; convert the feature information into an input vector of a data classification model and input the input vector into the data classification model to obtain an output vector indicating the type of the data, where the data classification model is generated in advance by supervised training on training samples, and each training sample includes the feature information of stored data and the labeled type of the stored data; and store the data in a storage area corresponding to the type.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (8)

1. A method of data storage, the method comprising:
acquiring characteristic information of data to be stored, wherein the characteristic information comprises at least one of the following items: the name of a data table entry in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword;
converting the feature information into an input vector of a data classification model, inputting the input vector into a data classification model, and obtaining an output vector indicating the type of the data, wherein the data classification model is generated based on training in a supervised manner by utilizing a training sample in advance, and the training sample comprises: the characteristic information of the stored data, the type of the stored data marked;
storing the data in a storage area corresponding to the type;
wherein, the data is data in a data table, and the characteristic information comprises: the name and the statistical characteristic information of a data table item in a data table to which the data belongs; and
converting the feature information into an input vector of a data classification model and inputting the input vector into the data classification model, and obtaining an output vector indicating the type of the data comprises:
generating a data table feature vector corresponding to the feature information, wherein the data table feature vector comprises: a component representing the name of a data table entry in a data table to which the data belongs, and a component representing statistical characteristic information;
generating a first input vector of the data classification model sequentially comprising the data table feature vector and a zero vector;
and inputting the first input vector into a data classification model to obtain an output vector indicating the type of the data.
2. The method of claim 1, wherein the data classification model is a decision tree model.
3. The method of claim 1, wherein the statistical characteristic information comprises: association information indicating an association relationship between the data table entries, an average value of lengths of the data, a maximum value of lengths of the data, a minimum value of lengths of the data, and types of characters in the data.
4. The method according to claim 2, wherein the data is text data, and the feature information is a keyword; and
converting the feature information into an input vector of a data classification model and inputting the input vector into the data classification model, and obtaining an output vector indicating the type of the data comprises:
generating keyword feature vectors corresponding to the feature information, wherein each keyword in the keyword feature vectors corresponds to one component;
generating a second input vector of the data classification model sequentially comprising the zero vector and the keyword feature vector;
and inputting the second input vector into a data classification model to obtain an output vector indicating the type of the data.
5. A data storage device, characterized in that the device comprises:
an obtaining unit configured to obtain feature information of data to be stored, the feature information including at least one of: the name of a data table entry in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword;
an input unit configured to convert the feature information into an input vector of a data classification model and input the input vector to the data classification model, resulting in an output vector indicating a type of the data, the data classification model being generated based on training in a supervised manner using a training sample in advance, the training sample including: the characteristic information of the stored data, the type of the stored data marked;
the storage unit is configured to store the data in a storage area corresponding to the type;
wherein, the data is data in a data table, and the characteristic information comprises: the name of the data table entry in the data table to which the data belongs and the statistical characteristic information; and the input unit includes:
the data table feature vector generating subunit is configured to generate a data table feature vector corresponding to the feature information, where the data table feature vector includes: a component representing the name of a data table entry in a data table to which the data belongs, and a component representing statistical characteristic information;
the first input vector generating subunit is configured to generate a first input vector of the data classification model sequentially including the data table feature vector and a zero vector;
and the output vector generation subunit is configured to input the first input vector to a data classification model to obtain an output vector indicating the type of the data.
6. The apparatus of claim 5, wherein the data classification model is a decision tree model.
7. The apparatus of claim 5, wherein the statistical characteristic information comprises: association information indicating an association relationship between the data table entries, an average value of lengths of the data, a maximum value of lengths of the data, a minimum value of lengths of the data, and types of characters in the data.
8. The apparatus according to claim 6, wherein the data is text data, the feature information is a keyword, and the input unit includes:
the keyword feature vector generating subunit is configured to generate keyword feature vectors corresponding to the feature information, wherein each keyword in the keyword feature vectors corresponds to one component;
a second input vector generation subunit configured to generate a second input vector of the data classification model sequentially including a zero vector and the keyword feature vector;
and the output vector generation subunit is configured to input the second input vector to a data classification model to obtain an output vector indicating the type of the data.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710066733.9A CN106649890B (en) 2017-02-07 2017-02-07 Data storage method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710066733.9A CN106649890B (en) 2017-02-07 2017-02-07 Data storage method and device

Publications (2)

Publication Number Publication Date
CN106649890A CN106649890A (en) 2017-05-10
CN106649890B true CN106649890B (en) 2020-07-14

Family

ID=58845975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710066733.9A Expired - Fee Related CN106649890B (en) 2017-02-07 2017-02-07 Data storage method and device

Country Status (1)

Country Link
CN (1) CN106649890B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578014B (en) * 2017-09-06 2020-11-03 上海寒武纪信息科技有限公司 Information processing apparatus and method
CN107679544A (en) * 2017-08-04 2018-02-09 平安科技(深圳)有限公司 Automatic data matching method, electronic equipment and computer-readable recording medium
CN109951509A (en) * 2017-12-21 2019-06-28 航天信息股份有限公司 A kind of cloud storage dispatching method, device, electronic equipment and storage medium
CN108427725B (en) * 2018-02-11 2021-08-03 华为技术有限公司 Data processing method, device and system
CN108763277B (en) * 2018-04-10 2023-04-18 平安科技(深圳)有限公司 Data analysis method, computer readable storage medium and terminal device
CN108563783B (en) * 2018-04-25 2022-04-12 张艳 Financial analysis management system and method based on big data
CN108763952B (en) * 2018-05-03 2022-04-05 创新先进技术有限公司 Data classification method and device and electronic equipment
CN109144999B (en) * 2018-08-02 2021-06-08 东软集团股份有限公司 Data positioning method, device, storage medium and program product
CN112732601A (en) * 2018-08-28 2021-04-30 中科寒武纪科技股份有限公司 Data preprocessing method and device, computer equipment and storage medium
CN109271356A (en) * 2018-09-03 2019-01-25 中国平安人寿保险股份有限公司 Log file formats processing method, device, computer equipment and storage medium
CN112988884B (en) * 2019-12-17 2024-04-12 中国移动通信集团陕西有限公司 Big data platform data storage method and device
CN111626057B (en) * 2020-07-28 2020-10-30 南京中孚信息技术有限公司 Official document judgment method and judgment system based on named entity
CN111881869B (en) * 2020-08-04 2023-04-18 浪潮云信息技术股份公司 Hierarchical storage method and system based on gesture data
CN112199694A (en) * 2020-09-30 2021-01-08 杭州云链趣链数字科技有限公司 Standardized bill processing method and device, electronic device and storage medium
CN113515680A (en) * 2021-04-20 2021-10-19 建信金融科技有限责任公司 Financial monitoring data processing method and device
CN116432238B (en) * 2023-06-05 2023-09-08 全中半导体(深圳)有限公司 Data storage method and device and storage chip

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866333A (en) * 2009-12-24 2010-10-20 金蝶软件(中国)有限公司 Worksheet self-defining method and adapter engine
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
CN102073704A (en) * 2010-12-24 2011-05-25 华为终端有限公司 Text classification processing method, system and equipment
US8903182B1 (en) * 2012-03-08 2014-12-02 Google Inc. Image classification
CN104881424A (en) * 2015-03-13 2015-09-02 国家电网公司 Regular expression-based acquisition, storage and analysis method of power big data
CN106126502A (en) * 2016-07-07 2016-11-16 四川长虹电器股份有限公司 A kind of emotional semantic classification system and method based on support vector machine

Also Published As

Publication number Publication date
CN106649890A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN106649890B (en) Data storage method and device
US11243990B2 (en) Dynamic document clustering and keyword extraction
US20190311025A1 (en) Methods and systems for modeling complex taxonomies with natural language understanding
CN107436875B (en) Text classification method and device
CN109492772B (en) Method and device for generating information
CN107797982B (en) Method, device and equipment for recognizing text type
CN107145485B (en) Method and apparatus for compressing topic models
US10606910B2 (en) Ranking search results using machine learning based models
CN106354856B (en) Artificial intelligence-based deep neural network enhanced search method and device
US11436446B2 (en) Image analysis enhanced related item decision
CN113434716B (en) Cross-modal information retrieval method and device
US11100252B1 (en) Machine learning systems and methods for predicting personal information using file metadata
CN110059172B (en) Method and device for recommending answers based on natural language understanding
CN111723180A (en) Interviewing method and device
US20210349920A1 (en) Method and apparatus for outputting information
CN105159898A (en) Searching method and searching device
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
CN113139558B (en) Method and device for determining multi-stage classification labels of articles
CN109902152B (en) Method and apparatus for retrieving information
US11328218B1 (en) Identifying subjective attributes by analysis of curation signals
CN114691850A (en) Method for generating question-answer pairs, training method and device of neural network model
Khan et al. Multimodal rule transfer into automatic knowledge based topic models
CN111274383B (en) Object classifying method and device applied to quotation
CN110110199B (en) Information output method and device
US20210295036A1 (en) Systematic language to enable natural language processing on technical diagrams

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200714

Termination date: 20220207