CN106649890B - Data storage method and device - Google Patents
Data storage method and device
- Publication number
- CN106649890B CN201710066733.9A
- Authority
- CN
- China
- Prior art keywords
- data
- vector
- classification model
- type
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/12—Accounting
- G06Q40/125—Finance or payroll
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The application discloses a data storage method and device. One embodiment of the above method comprises: acquiring characteristic information of data to be stored, wherein the characteristic information comprises at least one of the following items: the name of a data table item in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword; converting the feature information into an input vector of a data classification model, and inputting the input vector into a data classification model to obtain an output vector indicating a type of the data, wherein the data classification model is generated by training in a supervised manner by using a training sample in advance, and the training sample comprises: the characteristic information of the stored data and the marked type of the stored data; and storing the data in a storage area corresponding to the type. The method can save the storage space and simultaneously can quickly store data.
Description
Technical Field
The present application relates to the field of computer technologies, specifically to the field of internet technologies, and in particular to a data storage method and apparatus.
Background
Data storage is the collection, storage, retrieval, processing, transformation and transmission of data. In existing data storage, especially in the financial and tax fields, data features and the data types corresponding to those features are first defined manually according to business needs and then stored, so as to facilitate subsequent financial accounting.
However, existing data storage systems applied in the financial and tax fields cannot analyze and process unstructured data. In addition, because different financial accounting systems differ greatly, data features and matching rules must be defined repeatedly for storage under each accounting system, which increases the complexity of data storage, occupies a large amount of storage space, and reduces the utilization efficiency of the data.
Disclosure of Invention
It is an object of the present application to provide an improved data storage method and apparatus to solve the technical problems mentioned in the background section above.
In a first aspect, the present application provides a data storage method, where the method includes: acquiring characteristic information of data to be stored, wherein the characteristic information comprises at least one of the following items: the name of a data table item in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword; converting the feature information into an input vector of a data classification model, and inputting the input vector into a data classification model to obtain an output vector indicating a type of the data, wherein the data classification model is generated by training in a supervised manner by using a training sample in advance, and the training sample comprises: the characteristic information of the stored data and the marked type of the stored data; and storing the data in a storage area corresponding to the type.
In some embodiments, the data classification model is a decision tree model.
In some embodiments, the data is data in a data table, and the feature information includes: the name of the data table entry in the data table to which the data belongs and the statistical feature information; and converting the feature information into an input vector of a data classification model and inputting the input vector into the data classification model to obtain an output vector indicating the type of the data includes: generating a data table feature vector corresponding to the feature information, wherein the data table feature vector comprises: a component representing the name of the data table entry in the data table to which the data belongs, and a component representing the statistical feature information; generating a first input vector of the data classification model, the first input vector comprising, in order, the data table feature vector and a zero vector; and inputting the first input vector into the data classification model to obtain an output vector indicating the type of the data.
In some embodiments, the statistical feature information includes: association information indicating an association relationship between data table entries, an average value of the lengths of the data, a maximum value of the lengths of the data, a minimum value of the lengths of the data, and the types of characters in the data.
In some embodiments, the data is text data, and the feature information is a keyword; and converting the feature information into an input vector of a data classification model and inputting the input vector into the data classification model to obtain an output vector indicating the type of the data includes: generating a keyword feature vector corresponding to the feature information, wherein each keyword corresponds to one component of the keyword feature vector; generating a second input vector of the data classification model, the second input vector comprising, in order, a zero vector and the keyword feature vector; and inputting the second input vector into the data classification model to obtain an output vector indicating the type of the data.
In a second aspect, the present application provides a data storage device, the device comprising: an obtaining unit configured to obtain feature information of data to be stored, the feature information including at least one of: the name of a data table entry in a data table to which the data belongs, statistical feature information indicating statistical features of the data, and a keyword; an input unit configured to convert the feature information into an input vector of a data classification model and input the input vector into the data classification model to obtain an output vector indicating a type of the data, the data classification model being generated in advance by supervised training using training samples, the training samples including: the feature information of stored data and the labeled type of the stored data; and a storage unit configured to store the data in a storage area corresponding to the type.
In some embodiments, the data classification model is a decision tree model.
In some embodiments, the data is data in a data table, the feature information includes the name of the data table entry in the data table to which the data belongs and the statistical feature information, and the input unit includes: a data table feature vector generation subunit configured to generate a data table feature vector corresponding to the feature information, where the data table feature vector includes: a component representing the name of the data table entry in the data table to which the data belongs, and a component representing the statistical feature information; a first input vector generation subunit configured to generate a first input vector of the data classification model, the first input vector including, in order, the data table feature vector and a zero vector; and an output vector generation subunit configured to input the first input vector into the data classification model to obtain an output vector indicating the type of the data.
In some embodiments, the statistical feature information includes: association information indicating an association relationship between data table entries, an average value of the lengths of the data, a maximum value of the lengths of the data, a minimum value of the lengths of the data, and the types of characters in the data.
In some embodiments, the data is text data, the feature information is a keyword, and the input unit includes: the keyword feature vector generating subunit is configured to generate keyword feature vectors corresponding to the feature information, wherein each keyword in the keyword feature vectors corresponds to one component; a second input vector generation subunit configured to generate a second input vector of the data classification model sequentially including a zero vector and the keyword feature vector; and the output vector generating subunit is configured to input the second input vector to a data classification model, so as to obtain an output vector indicating the type of the data.
According to the data storage method and device provided by the present application, the feature information of the data to be stored is acquired, the feature information is converted into an input vector and input into a data classification model trained in a supervised manner, and the data is stored in the storage area corresponding to the data type indicated by the output vector of the data classification model, so that the data is effectively classified by data type and the storage space of the data storage area is saved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a data storage method according to the present application;
FIG. 3 is a flow diagram of yet another embodiment of a data storage method according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a data storage device according to the present application;
FIG. 5 is a block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments in the present application, and the features of those embodiments, may be combined with each other where they do not conflict. The present application will be described in detail below with reference to the attached drawings and in conjunction with the embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the data storage method or data storage apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various client applications installed thereon, such as a web browser application, a data accounting type application, a financial reporting type application, a search type application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background data processing server providing data support for applications running on the terminal devices 101, 102, 103, or may be a server collecting data from various data sources. The background data processing server can analyze and process the data acquired from the data source, and store and feed back the processing result to the terminal equipment.
It should be noted that the data storage method provided by the embodiment of the present application is generally executed by the server 105, and accordingly, the data storage device is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram 200 of one embodiment of a data storage method according to the present application is shown. The data storage method comprises the following steps:
Step 201, acquiring feature information of data to be stored.
In this embodiment, the electronic device on which the data storage method runs (for example, the server shown in fig. 1) may acquire data source information of the data to be stored through a wired or wireless connection, and acquire the data to be stored according to the data source information. Here, a data source refers to the original medium that provides the desired data, or a database supported by a storage device. Data source information refers to the information needed to establish a database connection. When acquiring the data to be stored based on the data source information, the data may be obtained from a network, a database, or an application associated with a financial system.
When the data to be stored is acquired from the database, the electronic device may find the corresponding database connection relationship by providing the correct data source name to the server supporting the database, and further acquire the data to be stored from the corresponding data source.
When data to be stored is acquired from a financial system of an enterprise, the data source information may include financial internal information and external information, wherein the internal information may include various business processing data and various document data, and the external information may include various laws and regulations, market information, and the like.
In this embodiment, after the server acquires the data to be stored from the data source, it may further acquire the feature information of the data to be stored, where the feature information includes at least one of the following: the name of the data table entry in the data table to which the data to be stored belongs, statistical feature information indicating statistical features of the data, and a keyword. Here, the data table may be located in the database and is used for storing the data to be stored. A name may be set for a data table entry, for example "department name", "cost", "employee", and the like. The statistical features may be the number of data items, the length of the data, and the like. When the data to be stored is text data, the feature information may be a keyword indicating the text content. For example, when the text data is "scientific research expenditure in department A", the keywords may be "department A" and "scientific research expenditure".
In some optional implementations of the embodiment, the statistical characteristic information includes association information indicating an association relationship between the data table entries, an average value of lengths of the data, a maximum value of the lengths of the data, a minimum value of the lengths of the data, and types of characters in the data.
As an example, the server first obtains data to be stored from a plurality of data sources. The server may then obtain the names of the data table entries to which the data to be stored belong in the database. For example, one item of data to be stored may belong to a data table entry named "department wage", and another may belong to a data table entry named "performance wage". The server may further obtain statistical feature information of the data to be stored. For example, it may obtain the average data length of the "department wage" data, and the minimum and maximum data lengths of the "performance wage" data.
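The length statistics and character types mentioned above can be computed directly from the values stored under one data table entry. The following is a minimal sketch in Python, not the patented implementation: the function names, the four coarse character classes, and the sample wage values are all assumptions made for illustration.

```python
from statistics import mean


def char_types(value: str) -> set[str]:
    """Return the coarse character classes present in one value (assumed classes)."""
    kinds = set()
    for ch in value:
        if ch.isdigit():
            kinds.add("digit")
        elif ch.isalpha():
            kinds.add("letter")
        elif ch.isspace():
            kinds.add("space")
        else:
            kinds.add("symbol")
    return kinds


def statistical_features(values: list[str]) -> dict:
    """Summarise the non-empty list of values stored under one data table entry."""
    lengths = [len(v) for v in values]
    kinds = set()
    for v in values:
        kinds |= char_types(v)
    return {
        "avg_length": mean(lengths),
        "max_length": max(lengths),
        "min_length": min(lengths),
        "char_types": sorted(kinds),
    }


# Hypothetical values under a "department wage" entry.
features = statistical_features(["5200.00", "4800.50", "10000"])
```

Association information between data table entries is omitted here because the text does not say how it is encoded.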
Step 202, converting the feature information into an input vector of a data classification model, and inputting the input vector into the data classification model to obtain an output vector indicating the type of the data.
In this embodiment, based on the feature information of the data to be stored acquired in step 201, the server may construct a multidimensional vector representing a plurality of features of the data to be stored and use it as the input vector of the data classification model. The input vector includes a component representing the name of a data table entry, a statistical feature component representing statistical features of the data, and a feature component representing a keyword. The input vector is then input into the data classification model to obtain an output vector indicating the type of the data to be stored. The output vector may include a type component for each preset data type and a matching degree component between the data to be stored and that data type. The matching degree expresses the strength of the correspondence between the data to be stored and a data type. In general, the higher the matching degree, the greater the probability that the data to be stored belongs to that type.
The type of data may include a character string data type for representing names of various things such as a department name, a document name, may also include a data type for representing numbers such as an integer, a floating point, a positive number, a negative number, may also include a data type for representing a date and time, may also include a data type for representing money, and the like.
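Since the output vector described above pairs each preset type with a matching degree, the type of the data can be read off by taking the component with the highest matching degree. The sketch below assumes a fixed, hypothetical ordering of four preset types and an output vector that holds one matching degree per type; neither detail is specified in the text.

```python
# Hypothetical preset types, in the order of the output vector components.
TYPES = ["string", "number", "datetime", "money"]


def best_type(output_vector: list[float], types: list[str] = TYPES) -> tuple[str, float]:
    """Return the type with the highest matching degree and that degree."""
    if len(output_vector) != len(types):
        raise ValueError("output vector must have one component per preset type")
    index = max(range(len(types)), key=lambda i: output_vector[i])
    return types[index], output_vector[index]


# Assumed matching degrees produced by some classification model.
label, degree = best_type([0.05, 0.80, 0.05, 0.10])
```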
The data classification model may be used to describe the correspondence between data to be stored (e.g., data in a data table) and the type of the data (e.g., a data type representing a number). The data classification model is trained by a machine learning method in a supervised manner, using as training samples the feature information of stored data, the labeled type of the stored data matching that feature information, and the matching degree between the feature information and the type of the stored data.
The supervised learning may be carried out through the following steps:
First, the stored data is used as a training sample, and the server acquires the feature information of the stored data. For example, when the stored data is data in a database, which contains a plurality of data tables, the server may obtain the name of the data table entry of the stored data, the types of characters in the stored data, and the like; when the stored data is text data, the server may acquire keywords of the stored data as the feature information.
Second, a data type label is set for the stored data. The label may be, for example, a data type indicating a number, a data type indicating a date, a data type indicating text, or the like.
Third, a matching degree between the data type of the stored data and the feature information of the stored data is established based on the data type label and the feature information. Since each stored data sample has at least one item of feature information and corresponds to one data type label, the server can calculate the matching degree between the data type and the feature information of the stored data according to a set algorithm.
Finally, the data classification model is trained by a machine learning method based on the feature information of the stored data, the labeled type of the stored data matching that feature information, and the matching degree between the feature information and the type of the stored data.
The machine learning method may include a neural network, a genetic algorithm, and the like.
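The text leaves the "set algorithm" for the matching degree unspecified. One simple possibility, shown here purely as an assumption, is the relative co-occurrence frequency: for each feature token, the fraction of labeled samples containing that token that carry a given type label. The token format (`name:...`, `chars:...`) and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def matching_degrees(samples):
    """Map each feature token to {type label: share of samples with that label}.

    samples: iterable of (feature_tokens, type_label) pairs taken from stored,
    already labeled data.
    """
    pair_counts = defaultdict(Counter)
    for tokens, label in samples:
        for token in set(tokens):  # count each token once per sample
            pair_counts[token][label] += 1
    degrees = {}
    for token, counts in pair_counts.items():
        total = sum(counts.values())
        degrees[token] = {label: n / total for label, n in counts.items()}
    return degrees


# Hypothetical labeled training samples.
samples = [
    (["name:dept", "chars:letter"], "string"),
    (["name:wage", "chars:digit"], "number"),
    (["name:bonus", "chars:digit"], "number"),
    (["name:code", "chars:digit"], "string"),
]
degrees = matching_degrees(samples)
```

Here the token `chars:digit` matches the type `number` with degree 2/3 and `string` with degree 1/3, which could then be supplied to the training step alongside the labels.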
This step is explained by taking "department name" as an example of the data to be stored. The "department name" item is labeled differently in different application scenarios: several systems may each use a different term for it, yet all of these terms denote a "department name". Therefore, when the data to be stored in a system carries any one of these labels, the feature information related to that label acquired in step 201 may be converted into an input vector of the data classification model and input into the data classification model for matching, so as to obtain an output vector indicating the type of the data to be stored. The server may then determine from the output vector that the type of the data to be stored is "department name".
Step 203, storing the data in a storage area corresponding to the type.
In this embodiment, the type to which the data belongs may be determined from the output vector of the data classification model obtained in step 202, so that the data is stored in the storage area corresponding to the type. To manage data conveniently and uniformly in a server or a client, storage areas are usually set up per data type. After the server determines the type of the data to be stored from the output vector, it may first check whether a preset storage area exists for that data type. If so, the data to be stored can be stored directly in the storage area corresponding to the type; if not, the server can create a new storage area for it.
According to the data storage method provided by this embodiment of the application, the feature information of the data to be stored is acquired, the feature information is converted into an input vector of the pre-trained data classification model and input into the data classification model to obtain an output vector indicating the type of the data, and finally the data is stored in the storage area corresponding to the data type indicated by the data classification model, so that the data to be stored is classified, the storage efficiency of the data is improved, and the storage space of the data is saved.
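The lookup-or-create behaviour of the storage step can be sketched with an in-memory mapping from data type to storage area. This is only an illustration under stated assumptions: the class name `TypedStore`, the use of Python lists as storage areas, and the sample department names are all hypothetical.

```python
class TypedStore:
    """Keeps one storage area (here, a list) per data type."""

    def __init__(self):
        self.areas: dict[str, list] = {}

    def store(self, data, data_type: str) -> bool:
        """Store data under its type; return True if a new area had to be created."""
        created = data_type not in self.areas
        if created:
            self.areas[data_type] = []  # no preset area for this type yet
        self.areas[data_type].append(data)
        return created


store = TypedStore()
first = store.store("Finance Dept", "department name")   # creates the area
second = store.store("R&D Dept", "department name")      # reuses the area
```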
With further reference to FIG. 3, a flow 300 of yet another embodiment of a data storage method is illustrated. The process 300 of the data storage method includes the following steps:
Step 301, acquiring feature information of data to be stored.
Existing data can be divided into various types. According to whether the data can be logically expressed by a two-dimensional table structure, it can be divided into structured data and unstructured data. Structured data, i.e., row-oriented data, can be represented by a uniform structure, such as numbers, symbols, and traditional data models. Unstructured data refers to data whose field length is variable and in which the record of each field can be composed of repeatable or non-repeatable sub-fields; it includes video, audio, documents, text, pictures, various reports, images, office documents, and the like. A financial system contains a large amount of data in data tables, namely structured data, whose feature information can be represented by the data length, the types of characters in the data, and the like; it also contains a large amount of text data, whose feature information can be represented by keywords.
In this embodiment, the electronic device on which the data storage method runs (for example, the server shown in fig. 1) may acquire the feature information of the data to be stored through a wired or wireless connection. When the data to be stored is data in a data table, the feature information includes at least one of the following items: the name of the data table entry in the data table to which the data belongs, and statistical feature information indicating statistical features of the data. The statistical feature information further includes association information indicating an association relationship between data table entries, an average value of the lengths of the data, a maximum value of the lengths of the data, a minimum value of the lengths of the data, and the types of characters in the data. When the data to be stored is text data, the feature information includes a keyword.
In this embodiment, when the data to be stored is text data, a natural language processing method or a recurrent neural network model may be used to perform word segmentation on the text data, so as to determine the keywords in the text data.
Step 302, generating a data table feature vector corresponding to the feature information.
In this embodiment, based on the feature information of the data table data acquired in step 301, the server may generate a data table feature vector from the feature information of the data to be stored, where the data table feature vector includes a component representing the name of the data table entry in the data table to which the data belongs, and a component representing the statistical feature information. As an example, in one system, the data "B" to be stored is "employee information". Items of "employee information" such as "sex" and "age" may be stored in the "basic information of employee" data table, or may be stored by relating them to the "department information" data table through a primary key and foreign key relationship. The feature vector corresponding to the data "B" to be stored then consists of a component indicating the name of the data table entry to which the "employee information" belongs, a component indicating the association with "department information", and a component indicating the average length of the employee information data.
The input vector of the data classification model mainly consists of two parts, namely a data table feature vector and a keyword feature vector. When the data to be stored is data table data, i.e., structured data, the keyword feature vector can be represented as a zero vector; when the data to be stored is text data, i.e., unstructured data, the data table feature vector can be represented as a zero vector.
Step 303, generating a first input vector of the data classification model, the first input vector comprising, in order, the data table feature vector and a zero vector.
In this embodiment, having determined in step 301 that the data to be stored is data in a data table, the server may further generate a first input vector of the data classification model from the data table feature vector determined in step 302, where the first input vector includes, in order, the data table feature vector determined in step 302 and the zero vector.
Step 304, generating a keyword feature vector corresponding to the feature information.
In this embodiment, when the data to be stored is text data, the feature information of the text data is a keyword, so in this step a keyword feature vector may be generated from the keyword information corresponding to the text data, where each keyword corresponds to one component of the keyword feature vector. The keyword feature vector may be generated by using a vector space model, which is a conventional technology and is not described here again. As an example, a system may contain a large amount of unstructured text data such as documents and contracts. When the data to be stored is the "company contract C", the server obtains keywords such as "company C" and "contract" as the feature information of the "company contract C", and generates a keyword component corresponding to "company C" and a keyword component corresponding to "contract" respectively.
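As a rough illustration of "each keyword corresponds to one component", the sketch below builds a term-count vector over a fixed vocabulary. This is a simple, unweighted variant of a vector space model (no TF-IDF or similar weighting), and the vocabulary itself is an assumption made for the example.

```python
def keyword_vector(keywords: list[str], vocabulary: list[str]) -> list[int]:
    """Return one count per vocabulary term; keywords outside the vocabulary are ignored."""
    counts = {term: 0 for term in vocabulary}
    for keyword in keywords:
        if keyword in counts:
            counts[keyword] += 1
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary; each position is one component of the keyword feature vector.
VOCABULARY = ["company C", "contract", "invoice", "expenditure"]
vec = keyword_vector(["company C", "contract"], VOCABULARY)
```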
Step 305, generating a second input vector of the data classification model, the second input vector comprising, in order, a zero vector and the keyword feature vector.
In this embodiment, having determined in step 301 that the data to be stored is text data, the server may further generate a second input vector of the data classification model from the keyword feature vector determined in step 304, where the second input vector includes, in order, the zero vector and the keyword feature vector determined in step 304.
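The layout of the first and second input vectors can be sketched as two fixed-width halves, with the unused half filled with zeros so that both kinds of data share one input dimension. The widths `TABLE_DIM` and `KEYWORD_DIM` and the sample feature values are assumptions for illustration only.

```python
TABLE_DIM = 4    # assumed width of the data table feature vector
KEYWORD_DIM = 3  # assumed width of the keyword feature vector


def first_input_vector(table_features: list[float]) -> list[float]:
    """Structured data: data table feature vector followed by a zero vector."""
    if len(table_features) != TABLE_DIM:
        raise ValueError("unexpected data table feature vector width")
    return list(table_features) + [0.0] * KEYWORD_DIM


def second_input_vector(keyword_features: list[float]) -> list[float]:
    """Text data: zero vector followed by the keyword feature vector."""
    if len(keyword_features) != KEYWORD_DIM:
        raise ValueError("unexpected keyword feature vector width")
    return [0.0] * TABLE_DIM + list(keyword_features)


v1 = first_input_vector([1.0, 6.3, 7.0, 5.0])
v2 = second_input_vector([1.0, 1.0, 0.0])
```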
Step 306, inputting the first input vector or the second input vector into the data classification model to obtain an output vector indicating the type of the data.
In this embodiment, the server may input the first input vector and the second input vector of the data classification model, determined in steps 303 and 305 respectively, into the data classification model to obtain an output vector indicating the type of the data. The output vector may include a type component for each preset data type and a matching degree component between the data to be stored and that data type. Here, the data classification model may first determine from the input vector whether the data to be stored is data in a data table or text data, and then process the two kinds of data separately, thereby generating an output vector for the first input vector and for the second input vector respectively. For example, when the server inputs an input vector generated from data "X" to be stored into the data classification model, the data classification model may determine, based on the data table feature components and the zero vector in the input vector, that the data "X" is data in a data table, and determine that its data type is a "data type related to numbers"; the data classification model then outputs an output vector corresponding to the "data type related to numbers". As another example, when the server inputs an input vector generated from data "Y" to be stored into the data classification model, the data classification model may determine, based on the zero vector and the keyword feature components in the input vector, that the data "Y" is text data, and determine that its data type is "character type"; the data classification model then outputs an output vector corresponding to the "character type".
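One way to tell the two kinds of input apart, consistent with the zero-vector layout described above, is to check which half of the input vector is all zeros. The text does not say how the model makes this determination internally, so the function below is an assumed, explicit stand-in; `table_dim` is the hypothetical width of the data table half.

```python
def data_kind(input_vector: list[float], table_dim: int) -> str:
    """Classify an input vector as 'table', 'text', or 'unknown' by its zero half."""
    table_part = input_vector[:table_dim]
    keyword_part = input_vector[table_dim:]
    table_is_zero = not any(table_part)
    keyword_is_zero = not any(keyword_part)
    if keyword_is_zero and not table_is_zero:
        return "table"  # data table features followed by a zero vector
    if table_is_zero and not keyword_is_zero:
        return "text"   # zero vector followed by keyword features
    return "unknown"    # both halves zero, or both halves populated


kind_x = data_kind([1.0, 6.3, 0.0, 0.0], table_dim=2)
kind_y = data_kind([0.0, 0.0, 1.0, 1.0], table_dim=2)
```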
In this embodiment, the data classification model is trained in advance in a supervised manner on training samples. Optionally, the data classification model is a decision tree model. Machine learning methods for decision tree models are well known and widely studied and applied, and are not described in detail here.
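For readers unfamiliar with decision trees, the following is a small sketch of supervised training on labeled input vectors. It uses a greedy split on Gini impurity, which is one common way to build a decision tree and is not necessarily the method used in the embodiment. The sample vectors, the labels, and all function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple, Union

Vector = List[float]


class Leaf:
    def __init__(self, label: str) -> None:
        self.label = label


class Node:
    def __init__(self, feature: int, threshold: float,
                 left: "Tree", right: "Tree") -> None:
        self.feature = feature
        self.threshold = threshold
        self.left = left    # subtree for value <= threshold
        self.right = right  # subtree for value > threshold


Tree = Union[Leaf, Node]


def gini(labels: List[str]) -> float:
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(xs: List[Vector], ys: List[str]) -> Optional[Tuple[int, float]]:
    """Return the (feature, threshold) that most reduces impurity, if any."""
    best, best_score = None, gini(ys)
    for f in range(len(xs[0])):
        for t in sorted({x[f] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[f] <= t]
            right = [y for x, y in zip(xs, ys) if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def train(xs: List[Vector], ys: List[str]) -> Tree:
    """Recursively build a tree from labeled training samples."""
    split = best_split(xs, ys)
    if split is None:
        return Leaf(Counter(ys).most_common(1)[0][0])
    f, t = split
    left = [(x, y) for x, y in zip(xs, ys) if x[f] <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x[f] > t]
    return Node(f, t,
                train([x for x, _ in left], [y for _, y in left]),
                train([x for x, _ in right], [y for _, y in right]))


def predict(tree: Tree, x: Vector) -> str:
    """Walk the tree from the root to a leaf and return its label."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.label


# Hypothetical training samples: feature vectors of stored data and labeled types.
samples = [[1.0, 6.0, 0.0, 0.0], [1.0, 8.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
labels = ["number", "number", "character", "character"]
tree = train(samples, labels)
```

A call such as `predict(tree, [1.0, 7.0, 0.0, 0.0])` then returns the label of the leaf reached by the new vector.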
In this embodiment, according to the output vector of the data classification model obtained in step 306, the type to which the data belongs may be determined, so that the data is stored in the storage area corresponding to the type.
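The step of mapping an output vector to a storage area can be sketched as follows. This sketch assumes that the output vector holds one score per preset type and that the highest score determines the type; the type names, the `TypedStore` class, and the in-memory lists standing in for storage areas are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

TYPES = ["number", "character"]  # hypothetical preset data types


def type_from_output(output_vector: List[float]) -> str:
    """Pick the type whose component in the output vector is largest."""
    if len(output_vector) != len(TYPES):
        raise ValueError("output vector must have one component per type")
    best = max(range(len(TYPES)), key=lambda i: output_vector[i])
    return TYPES[best]


class TypedStore:
    """Keeps one storage area (here, a list) per data type."""

    def __init__(self) -> None:
        self.areas: Dict[str, List[str]] = defaultdict(list)

    def store(self, data: str, output_vector: List[float]) -> str:
        data_type = type_from_output(output_vector)
        self.areas[data_type].append(data)
        return data_type


store = TypedStore()
stored_as = store.store("12345", [0.9, 0.1])
```

In a real system the storage areas would be tables, partitions, or files rather than Python lists.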
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the process 300 of the data storage method in this embodiment divides the data to be stored into structured data and unstructured data, that is, data in a data table and text data, and inputs the two kinds of data into the data classification model respectively for matching. The data classification model processes the two kinds of data separately to obtain an output vector indicating the type of the data in the data table and an output vector indicating the type of the text data, so that the data is classified more quickly and effectively, the speed of data storage is increased, and the space required for storing the data is reduced.
With further reference to fig. 4, as an implementation of the method shown in the above figures, the present application provides an embodiment of a data storage device, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 4, the data storage device 400 of the present embodiment includes: an obtaining unit 401, an input unit 402, and a storage unit 403. The obtaining unit 401 is configured to obtain feature information of data to be stored, where the feature information includes at least one of the following: the name of a data table entry in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword. The input unit 402 is configured to convert the feature information into an input vector of a data classification model and input the input vector into the data classification model to obtain an output vector indicating a type of the data, where the data classification model is generated in advance by supervised training using training samples, and each training sample includes: the feature information of stored data and the labeled type of the stored data. The storage unit 403 is configured to store the data in the storage area corresponding to the type.
In this embodiment, specific processing of the obtaining unit 401, the input unit 402, and the storage unit 403 of the data storage device 400 and the technical effects thereof can refer to the related descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of this embodiment, the data is data in a data table, the feature information includes the name of a data table entry in the data table to which the data belongs and statistical characteristic information, and the input unit 402 includes: a data table feature vector generation subunit 4021 configured to generate a data table feature vector corresponding to the feature information, where the data table feature vector includes a component representing the name of the data table entry in the data table to which the data belongs and a component representing the statistical characteristic information; a first input vector generation subunit 4022 configured to generate a first input vector of the data classification model that comprises, in sequence, the data table feature vector and a zero vector; and an output vector generation subunit 4025 configured to input the first input vector into the data classification model and obtain an output vector indicating the type of the data.
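One possible way to build the data table feature vector and the first input vector is sketched below. The one-hot encoding of the entry name, the list `KNOWN_ENTRY_NAMES`, and the length `KEYWORD_DIM` are assumptions made for illustration; the embodiment does not state how the name component is encoded.

```python
from typing import List

KNOWN_ENTRY_NAMES = ["amount", "customer_name", "date"]  # hypothetical names
KEYWORD_DIM = 3  # assumed length of the keyword part (all zeros here)


def data_table_feature_vector(entry_name: str, avg_len: float,
                              max_len: int, min_len: int) -> List[float]:
    """Name component (assumed one-hot) followed by statistical components."""
    name_part = [1.0 if entry_name == n else 0.0 for n in KNOWN_ENTRY_NAMES]
    stats_part = [float(avg_len), float(max_len), float(min_len)]
    return name_part + stats_part


def first_input_vector(table_features: List[float]) -> List[float]:
    """Data table feature vector followed by a zero vector, in that order."""
    return list(table_features) + [0.0] * KEYWORD_DIM


features = data_table_feature_vector("amount", 4.5, 8, 1)
vector = first_input_vector(features)
```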
In some optional implementations of this embodiment, the statistical characteristic information includes: association information indicating an association relationship between data table entries, the average length of the data, the maximum length of the data, the minimum length of the data, and the types of characters in the data.
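The length and character-type statistics listed above can be computed along the following lines. This is a sketch only: the function names and the four character categories are assumptions, and the association information between entries is omitted because the embodiment does not say how it is derived.

```python
from typing import Dict, List, Set


def char_types(value: str) -> Set[str]:
    """Return the kinds of characters that occur in one value."""
    kinds: Set[str] = set()
    for ch in value:
        if ch.isdigit():
            kinds.add("digit")
        elif ch.isalpha():
            kinds.add("letter")
        elif ch.isspace():
            kinds.add("space")
        else:
            kinds.add("other")
    return kinds


def length_statistics(values: List[str]) -> Dict[str, object]:
    """Average, maximum, and minimum length plus character types of a column."""
    if not values:
        raise ValueError("at least one value is required")
    lengths = [len(v) for v in values]
    kinds: Set[str] = set()
    for v in values:
        kinds |= char_types(v)
    return {
        "avg_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
        "min_length": min(lengths),
        "char_types": sorted(kinds),
    }


stats = length_statistics(["120", "4500", "7"])
```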
In some optional implementations of this embodiment, the data is text data, the feature information is a keyword, and the input unit 402 includes: a keyword feature vector generation subunit 4023 configured to generate a keyword feature vector corresponding to the feature information, where each keyword corresponds to one component of the keyword feature vector; a second input vector generation subunit 4024 configured to generate a second input vector of the data classification model that comprises, in sequence, a zero vector and the keyword feature vector; and an output vector generation subunit 4025 configured to input the second input vector into the data classification model and obtain an output vector indicating the type of the data.
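A keyword feature vector with one component per keyword, and the second input vector built from it, might look like the sketch below. The keyword list, the use of occurrence counts as component values, whitespace tokenization, and the length `TABLE_DIM` are all assumptions for illustration.

```python
from typing import List

KEYWORDS = ["invoice", "salary", "contract"]  # hypothetical keyword vocabulary
TABLE_DIM = 6  # assumed length of the data table part (all zeros here)


def keyword_feature_vector(text: str) -> List[float]:
    """One component per keyword; here the value is the occurrence count."""
    words = text.lower().split()
    return [float(words.count(k)) for k in KEYWORDS]


def second_input_vector(text: str) -> List[float]:
    """Zero vector followed by the keyword feature vector, in that order."""
    return [0.0] * TABLE_DIM + keyword_feature_vector(text)


kw = keyword_feature_vector("Salary invoice for March salary")
vec = second_input_vector("Salary invoice for March salary")
```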
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing a server according to embodiments of the present application is shown.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, and the like. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an obtaining unit, an input unit, and a storage unit. In some cases the names of the units do not limit the units themselves; for example, the obtaining unit may also be described as a "unit that obtains feature information of data to be stored".
As another aspect, the present application also provides a non-volatile computer storage medium, which may be the non-volatile computer storage medium included in the apparatus in the above-described embodiments, or may be a non-volatile computer storage medium that exists separately and is not incorporated into the apparatus. The non-volatile computer storage medium stores one or more programs that, when executed by a device, cause the device to: acquire feature information of data to be stored, where the feature information includes at least one of the following: the name of a data table entry in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword; convert the feature information into an input vector of a data classification model and input the input vector into the data classification model to obtain an output vector indicating a type of the data, where the data classification model is generated in advance by supervised training using training samples, and each training sample includes: the feature information of stored data and the labeled type of the stored data; and store the data in a storage area corresponding to the type.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
Claims (8)
1. A method of data storage, the method comprising:
acquiring characteristic information of data to be stored, wherein the characteristic information comprises at least one of the following items: the name of a data table entry in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword;
converting the feature information into an input vector of a data classification model, inputting the input vector into a data classification model, and obtaining an output vector indicating the type of the data, wherein the data classification model is generated based on training in a supervised manner by utilizing a training sample in advance, and the training sample comprises: the characteristic information of the stored data, the type of the stored data marked;
storing the data in a storage area corresponding to the type;
wherein, the data is data in a data table, and the characteristic information comprises: the name and the statistical characteristic information of a data table item in a data table to which the data belongs; and
converting the feature information into an input vector of a data classification model and inputting the input vector into the data classification model, and obtaining an output vector indicating the type of the data comprises:
generating a data table feature vector corresponding to the feature information, wherein the data table feature vector comprises: a component representing the name of a data table entry in a data table to which the data belongs, and a component representing statistical characteristic information;
generating a first input vector of a data classification model sequentially comprising the characteristic vector and the zero vector of the data table;
and inputting the first input vector into a data classification model to obtain an output vector indicating the type of the data.
2. The method of claim 1, wherein the data classification model is a decision tree model.
3. The method of claim 1, wherein the statistical characteristic information comprises: association information indicating an association relationship between the data table entries, an average value of lengths of the data, a maximum value of lengths of the data, a minimum value of lengths of the data, and types of characters in the data.
4. The method according to claim 2, wherein the data is text data, and the feature information is a keyword; and
converting the feature information into an input vector of a data classification model and inputting the input vector into the data classification model, and obtaining an output vector indicating the type of the data comprises:
generating keyword feature vectors corresponding to the feature information, wherein each keyword in the keyword feature vectors corresponds to one component;
generating a second input vector of the data classification model sequentially comprising the zero vector and the keyword feature vector;
and inputting the second input vector into a data classification model to obtain an output vector indicating the type of the data.
5. A data storage device, characterized in that the device comprises:
an obtaining unit configured to obtain feature information of data to be stored, the feature information including at least one of: the name of a data table entry in a data table to which the data belongs, statistical characteristic information indicating statistical characteristics of the data, and a keyword;
an input unit configured to convert the feature information into an input vector of a data classification model and input the input vector into the data classification model, resulting in an output vector indicating a type of the data, the data classification model being generated based on training in a supervised manner using a training sample in advance, the training sample including: the characteristic information of the stored data, the type of the stored data marked;
the storage unit is configured to store the data in a storage area corresponding to the type;
wherein the data is data in a data table, and the characteristic information comprises: the name and the statistical characteristic information of the data table entry in the data table to which the data belongs; and the input unit includes:
the data table feature vector generating subunit is configured to generate a data table feature vector corresponding to the feature information, where the data table feature vector includes: a component representing the name of a data table entry in a data table to which the data belongs, and a component representing statistical characteristic information;
the first input vector generating subunit is configured to generate a first input vector of a data classification model sequentially including the feature vector and the zero vector of the data table;
and the output vector generation subunit is configured to input the first input vector to a data classification model to obtain an output vector indicating the type of the data.
6. The apparatus of claim 5, wherein the data classification model is a decision tree model.
7. The apparatus of claim 5, wherein the statistical characteristic information comprises: association information indicating an association relationship between the data table entries, an average value of lengths of the data, a maximum value of lengths of the data, a minimum value of lengths of the data, and types of characters in the data.
8. The apparatus according to claim 6, wherein the data is text data, the feature information is a keyword, and the input unit includes:
the keyword feature vector generating subunit is configured to generate keyword feature vectors corresponding to the feature information, wherein each keyword in the keyword feature vectors corresponds to one component;
a second input vector generation subunit configured to generate a second input vector of the data classification model sequentially including a zero vector and the keyword feature vector;
and the output vector generation subunit is configured to input the second input vector to a data classification model to obtain an output vector indicating the type of the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710066733.9A CN106649890B (en) | 2017-02-07 | 2017-02-07 | Data storage method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710066733.9A CN106649890B (en) | 2017-02-07 | 2017-02-07 | Data storage method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649890A (en) | 2017-05-10 |
CN106649890B (en) | 2020-07-14 |
Family
ID=58845975
Family Applications (1)
Application Number | Legal Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201710066733.9A | Expired - Fee Related | CN106649890B (en) | 2017-02-07 | 2017-02-07 | Data storage method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649890B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107578014B (en) * | 2017-09-06 | 2020-11-03 | 上海寒武纪信息科技有限公司 | Information processing apparatus and method |
CN107679544A (en) * | 2017-08-04 | 2018-02-09 | 平安科技(深圳)有限公司 | Automatic data matching method, electronic equipment and computer-readable recording medium |
CN109951509A (en) * | 2017-12-21 | 2019-06-28 | 航天信息股份有限公司 | A kind of cloud storage dispatching method, device, electronic equipment and storage medium |
CN108427725B (en) * | 2018-02-11 | 2021-08-03 | 华为技术有限公司 | Data processing method, device and system |
CN108763277B (en) * | 2018-04-10 | 2023-04-18 | 平安科技(深圳)有限公司 | Data analysis method, computer readable storage medium and terminal device |
CN108563783B (en) * | 2018-04-25 | 2022-04-12 | 张艳 | Financial analysis management system and method based on big data |
CN108763952B (en) * | 2018-05-03 | 2022-04-05 | 创新先进技术有限公司 | Data classification method and device and electronic equipment |
CN109144999B (en) * | 2018-08-02 | 2021-06-08 | 东软集团股份有限公司 | Data positioning method, device, storage medium and program product |
CN112732601A (en) * | 2018-08-28 | 2021-04-30 | 中科寒武纪科技股份有限公司 | Data preprocessing method and device, computer equipment and storage medium |
CN109271356A (en) * | 2018-09-03 | 2019-01-25 | 中国平安人寿保险股份有限公司 | Log file formats processing method, device, computer equipment and storage medium |
CN112988884B (en) * | 2019-12-17 | 2024-04-12 | 中国移动通信集团陕西有限公司 | Big data platform data storage method and device |
CN111626057B (en) * | 2020-07-28 | 2020-10-30 | 南京中孚信息技术有限公司 | Official document judgment method and judgment system based on named entity |
CN111881869B (en) * | 2020-08-04 | 2023-04-18 | 浪潮云信息技术股份公司 | Hierarchical storage method and system based on gesture data |
CN112199694A (en) * | 2020-09-30 | 2021-01-08 | 杭州云链趣链数字科技有限公司 | Standardized bill processing method and device, electronic device and storage medium |
CN113515680A (en) * | 2021-04-20 | 2021-10-19 | 建信金融科技有限责任公司 | Financial monitoring data processing method and device |
CN116432238B (en) * | 2023-06-05 | 2023-09-08 | 全中半导体(深圳)有限公司 | Data storage method and device and storage chip |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101866333A (en) * | 2009-12-24 | 2010-10-20 | 金蝶软件(中国)有限公司 | Worksheet self-defining method and adapter engine |
CN102033964A (en) * | 2011-01-13 | 2011-04-27 | 北京邮电大学 | Text classification method based on block partition and position weight |
CN102073704A (en) * | 2010-12-24 | 2011-05-25 | 华为终端有限公司 | Text classification processing method, system and equipment |
US8903182B1 (en) * | 2012-03-08 | 2014-12-02 | Google Inc. | Image classification |
CN104881424A (en) * | 2015-03-13 | 2015-09-02 | 国家电网公司 | Regular expression-based acquisition, storage and analysis method of power big data |
CN106126502A (en) * | 2016-07-07 | 2016-11-16 | 四川长虹电器股份有限公司 | A kind of emotional semantic classification system and method based on support vector machine |
Also Published As
Publication number | Publication date |
---|---|
CN106649890A (en) | 2017-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649890B (en) | Data storage method and device | |
US11243990B2 (en) | Dynamic document clustering and keyword extraction | |
US20190311025A1 (en) | Methods and systems for modeling complex taxonomies with natural language understanding | |
CN107436875B (en) | Text classification method and device | |
CN109492772B (en) | Method and device for generating information | |
CN107797982B (en) | Method, device and equipment for recognizing text type | |
CN107145485B (en) | Method and apparatus for compressing topic models | |
US10606910B2 (en) | Ranking search results using machine learning based models | |
CN106354856B (en) | Artificial intelligence-based deep neural network enhanced search method and device | |
US11436446B2 (en) | Image analysis enhanced related item decision | |
CN113434716B (en) | Cross-modal information retrieval method and device | |
US11100252B1 (en) | Machine learning systems and methods for predicting personal information using file metadata | |
CN110059172B (en) | Method and device for recommending answers based on natural language understanding | |
CN111723180A (en) | Interviewing method and device | |
US20210349920A1 (en) | Method and apparatus for outputting information | |
CN105159898A (en) | Searching method and searching device | |
CN113837307A (en) | Data similarity calculation method and device, readable medium and electronic equipment | |
CN113139558B (en) | Method and device for determining multi-stage classification labels of articles | |
CN109902152B (en) | Method and apparatus for retrieving information | |
US11328218B1 (en) | Identifying subjective attributes by analysis of curation signals | |
CN114691850A (en) | Method for generating question-answer pairs, training method and device of neural network model | |
Khan et al. | Multimodal rule transfer into automatic knowledge based topic models | |
CN111274383B (en) | Object classifying method and device applied to quotation | |
CN110110199B (en) | Information output method and device | |
US20210295036A1 (en) | Systematic language to enable natural language processing on technical diagrams |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20200714 Termination date: 20220207 |