CN117971819B - Management method and system for automatically collecting flow data - Google Patents
- Publication number
- CN117971819B (application number CN202410371733.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- mapping value
- mapping
- flow data
- pipeline
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A management method for automatically collecting flow data comprises: obtaining multiple pieces of flow data; identifying the type name and type information of each piece of flow data and generating associated mapping values based on the type information; and, based on the mapping values of a piece of flow data, determining whether its associated flow data set is entirely in the cache. If not, the flow data is stored in the cache; if so, the accuracy of the identification is judged from the flow data and its associated flow data set, and all of them are stored together in a relational database. The invention adopts a new method for automatically collecting flow data, which differs from conventional approaches that identify flow data only with rigid regular expressions or with natural language processing. The method has very high recognition accuracy.
Description
Technical Field
The present invention belongs to the technical field of data analysis and, more specifically, relates to a management method and system for automatically collecting flow data.
Background
In modern economic activity, flow data such as bank transaction records, electronic payment information, and communication call records plays a vital role. This data reflects the financial status, consumption habits, and communication patterns of individuals and businesses. However, because the information usually appears as unstructured text, integrating, analyzing, and managing it is complex and time-consuming. Businesses and individuals need to collect and integrate flow data from many sources to support key activities such as decision making, financial management, and customer service.
At present, techniques for automatically collecting flow data rely mainly on regular expressions, natural language processing (NLP), data mining, and machine learning. By implementing entity recognition and pattern matching algorithms, a system can identify key information in the text, such as dates, amounts, and the parties involved. Text classification and sentiment analysis are also used to analyze the data further and give users insight. Combined, these techniques yield systems that process and aggregate flow data automatically and give users a clearer, better organized financial picture.
Despite significant progress, current techniques for automatically collecting flow data still have shortcomings. First, rigid regular expressions struggle with complex and non-standardized text. Second, natural language processing does not yet understand implicit meaning and contextual information deeply enough, which limits the accuracy of information extraction and the range of applications.
Summary of the Invention
To remedy the deficiencies of the prior art described above, the present invention proposes a management method and system for automatically collecting flow data.
The present invention adopts the following technical solution.
A first aspect of the present invention discloses a management method for automatically collecting flow data, comprising steps 1 to 3.
Step 1: obtain multiple pieces of flow data.
Step 2: identify the type name and type information of each piece of flow data, and generate associated mapping values based on the type information.
Step 3: based on the mapping values of the flow data, determine whether its associated flow data set is entirely in the cache. If not, store the flow data in the cache; if so, judge the accuracy of the identification based on the flow data and its associated flow data set, and store them together in a relational database.
Further, before the multiple pieces of flow data are obtained, they are preprocessed; the preprocessing includes data cleaning and standardization.
Further, generating the associated mapping values based on the type information specifically comprises: generating the associated mapping values based on the counterparty information or the transaction platform information in the type information.
Further, in step 2, the flow data is identified and segmented by means of regular expressions.
Further, generating the associated mapping values based on the type information specifically comprises: generating a first mapping value based on the transaction time in the type information, and generating multiple second mapping values based on multiple scarce keywords in the counterparty information or the transaction platform information in the type information, where a scarce keyword is a single Chinese character or a single word.
Further, in step 3, determining whether the associated flow data set is entirely in the cache based on the mapping values of the flow data specifically comprises steps 3.1 to 3.3.
Step 3.1: based on the first mapping value, obtain the first successor mapping value.
Step 3.2: based on the first mapping value of the flow data, obtain all flow data under the first mapping value and the first successor mapping value as the flow data set to be compared.
Step 3.3: compare the second mapping values of the flow data, in turn, with the second mapping values of each piece of flow data in the set to be compared; if the similarity between the two is greater than or equal to a preset similarity threshold, that piece of flow data is determined to be associated flow data.
Further, the process of selecting scarce keywords may specifically comprise steps S101 to S103.
Step S101: obtain every keyword in the counterparty information or the transaction platform information in the type information of the flow data.
Step S102: compute the mapping value corresponding to each keyword according to Huffman coding.
Step S103: sort the mapping values from high to low and select the n largest as the multiple second mapping values, where n is the number of second mapping values.
A second aspect of the present invention discloses a management system for automatically collecting flow data, which applies the method of the first aspect and comprises: a data acquisition module, a logic judgment module, a first data storage module, and a second data storage module.
The data acquisition module is used to obtain multiple pieces of flow data.
The logic judgment module is used to identify the type name and type information of each piece of flow data and generate associated mapping values based on the type information; to determine, based on the mapping values of the flow data, whether its associated flow data set is entirely in the cache; and to judge the accuracy of the identification based on the flow data and its associated flow data set.
The first data storage module is used to store the flow data in the cache.
The second data storage module is used to store the flow data together with its associated flow data set in the relational database.
A third aspect of the present invention discloses a terminal comprising a processor and a storage medium, characterized in that:
the storage medium is used to store instructions;
the processor is used to operate according to the instructions so as to execute the steps of the method of the first aspect.
A fourth aspect of the present invention discloses a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method of the first aspect.
Compared with the prior art, the present invention has the following advantages:
The invention adopts a new method for automatically collecting flow data, which differs from conventional approaches that identify flow data only with rigid regular expressions or with natural language processing. The method has very high recognition accuracy.
Brief Description of the Drawings
FIG. 1 is a flow chart of a management method for automatically collecting flow data according to an embodiment of the present invention.
Detailed Description
The present application is further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the present invention more clearly and do not limit the scope of protection of the present application.
The present invention discloses a management method for automatically collecting flow data which, as shown in FIG. 1, may include steps 1 to 3.
Step 1: obtain multiple pieces of flow data.
The flow data may be transaction data from bank accounts, electronic payment platforms (for example Alipay or WeChat), or other communication service providers. The flow data is text data in a variety of formats and structures.
In some embodiments, the multiple pieces of flow data are preprocessed before they are obtained; the preprocessing includes data cleaning and standardization. Data cleaning removes irrelevant information and reduces the volume of flow data. Standardization puts all flow data into a uniform format and identifies the key information in it, such as transaction date, amount, transaction parties, and transaction description, so that it can be stored later.
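The preprocessing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the list of accepted date layouts, and the choice of integer cents for amounts are all assumptions introduced here.

```python
import re
from datetime import datetime
from decimal import Decimal


def clean_lines(raw_text: str) -> list:
    """Illustrative cleaning: collapse whitespace and drop empty lines."""
    lines = []
    for line in raw_text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return lines


def standardize_date(text: str) -> str:
    """Normalize a few assumed date layouts to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date layout: {text!r}")


def standardize_amount(text: str) -> int:
    """Convert an amount such as '1,234.50' to integer cents."""
    return int(Decimal(text.replace(",", "")) * 100)
```

Storing amounts as integer cents avoids floating-point rounding when records are later compared; a real system would pick whatever representation its database schema requires.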
Step 2: identify the type name and type information of each piece of flow data, and generate associated mapping values based on the type information.
The entity information may include the transaction time, transaction amount, transaction type, own-party information, counterparty information, transaction platform information, order number, and so on.
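As an illustration of regular-expression identification of these fields, the sketch below parses one record line. The line layout (time, counterparty, amount, order number, separated by spaces) and the field names are hypothetical; the patent does not specify a record format.

```python
import re
from typing import Optional

# Hypothetical layout: "<date> <time> <counterparty> <amount> <order number>"
RECORD_PATTERN = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<counterparty>.+?)\s+"
    r"(?P<amount>-?\d+(?:\.\d{1,2})?)\s+"
    r"(?P<order_no>[A-Z0-9]+)$"
)


def parse_record(line: str) -> Optional[dict]:
    """Return the named fields of one record line, or None if it does not match."""
    match = RECORD_PATTERN.match(line.strip())
    return match.groupdict() if match else None
```

A line that does not fit the pattern, such as a page footer, yields `None`, which is exactly the rigidity of preset regular expressions that the description goes on to discuss.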
In a first embodiment of the present invention, the associated mapping values may be generated from the counterparty information or the transaction platform information in the type information. Suppose a piece of flow data X contains own-party information "x0" and counterparty information "x1"; its only associated mapping value is then the one corresponding to "x1". Suppose a piece of flow data Y contains own-party information "y0", counterparty information "y1", and transaction platform information "y2"; its associated mapping values are then the two mapping values corresponding to "y1" and "y2" respectively.
In the first embodiment, the mapping values may be generated with a hash function; the specific formula is not described further here.
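A minimal sketch of the first embodiment follows. The patent only says that a hash function may be used; SHA-256 truncated to 16 hexadecimal characters, and the field names `counterparty` and `platform`, are assumptions made for illustration.

```python
import hashlib


def mapping_value(text: str) -> str:
    """Hash one field to a short hexadecimal key (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def mapping_values(record: dict) -> list:
    """One mapping value per counterparty / platform field present in the record."""
    values = []
    for field in ("counterparty", "platform"):
        if record.get(field):
            values.append(mapping_value(record[field]))
    return values
```

With this sketch, a record like X (counterparty only) gets one mapping value and a record like Y (counterparty and platform) gets two, matching the example in the text.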
Step 3: based on the mapping values of the flow data, determine whether its associated flow data set is entirely in the cache. If not, store the flow data in the cache; if so, judge the accuracy of the identification based on the flow data and its associated flow data set, and store them together in a relational database.
In some embodiments, a Redis database may be chosen as the cache.
It can be understood that if the flow data is flow data X, its associated flow data set contains at least one piece of flow data; if the flow data is flow data Y, its associated flow data set contains at least two. In step 3, storing together in the relational database means not only storing the flow data itself in the relational database but also moving its associated flow data set from the cache to the relational database.
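The control flow of step 3 can be sketched as below, under stated assumptions: a plain dictionary stands in for the cache (Redis in some embodiments), a list stands in for the relational database, and "associated" is taken to mean "cached under one of the same mapping values". The accuracy check itself is left as a comment because the patent does not spell it out at this point.

```python
def _purge(cache: dict, records: list) -> None:
    """Remove the given records from every cache entry, dropping empty entries."""
    for key in list(cache):
        remaining = [r for r in cache[key] if r not in records]
        if remaining:
            cache[key] = remaining
        else:
            del cache[key]


def process(record: dict, values: list, cache: dict, database: list) -> bool:
    """Return True if the record and its associated set were moved to the database."""
    if values and all(cache.get(v) for v in values):
        associated = []
        for v in values:
            for r in cache[v]:
                if r not in associated:
                    associated.append(r)
        # The accuracy check (comparing record with associated) would go here.
        _purge(cache, associated)
        database.append(record)
        database.extend(associated)
        return True
    # Some associated flow data is still missing: keep the record in the cache.
    for v in values:
        cache.setdefault(v, []).append(record)
    return False
```

In this sketch a record with two mapping values is moved only once both values already have cached partners, which mirrors the "at least 2" condition for flow data Y.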
In step 3, taking flow data Y as an example, the at least two pieces of flow data S and T associated with its two mapping values should be the same flow data as Y; only the information on the transaction parties and the transaction platform differs.
In modern data processing practice, automatically identifying and collecting flow data is an important and challenging task. Rigid, mechanical identification of the entity information in each piece of flow data, for example by quickly matching it with preset regular expressions, is a common approach, but it has inherent limitations. In particular, the order and the kinds of entity types contained in flow data may vary, which affects the consistency and accuracy of processing. In addition, problems during data acquisition, such as errors introduced when pages are converted, may cause flow records to be wrongly merged or split, or mixed with unrelated information, further reducing the accuracy of automated processing. Even if automatic collection succeeds 99% of the time, the subsequent work of finding and fixing the gaps may still be costly and offset the efficiency gained from automation.
In other embodiments, techniques based on natural language processing (NLP) are introduced and can raise the rate at which entities in the text are recognized. However, their recognition accuracy is even lower than that of regular expressions, and natural language processing cannot solve the data delimitation problem either. That is, the prior art has difficulty determining precisely which content belongs to a particular flow record and which content belongs to no flow data at all.
In the embodiments of the present invention, step 3 judges the accuracy of the identification by cross-validation, that is, based on the flow data and its associated flow data set, and this readily addresses both recognition accuracy and data delimitation. Step 2 can therefore use regular expressions to identify and segment the flow data quickly. The steps above nevertheless still have defects. First and most importantly, the purpose of the mapping values is to classify and identify flow data accurately, yet the mapping values are themselves uniquely determined by the type information in the flow data (for example, the counterparty information or the transaction platform information). Once the regular-expression identification is wrong, the mapping values are wrong too, which is essentially a chicken-and-egg problem. Second, flow data is usually processed in batches (the own-party information is usually the same within a batch), so the first batch must be stored in the cache in its entirety; since the cache is usually main memory, it may not have room for a very large volume of data.
For this reason, in a second embodiment of the present invention, generating the associated mapping values based on the type information specifically comprises: generating a first mapping value based on the transaction time in the type information, and generating multiple second mapping values based on multiple scarce keywords in the counterparty information or the transaction platform information in the type information, where a scarce keyword (and a keyword, as used below) is a single Chinese character or a single word.
The advantage of using the transaction time to generate a mapping value is that the format of the transaction time is fixed and flow data is usually ordered by transaction time, so the records immediately before and after a given record can be used for cross-validation to obtain the accurate transaction time.
More specifically, the flow data and its associated flow data may differ in transaction time by some time difference, so the first mapping value should cover the flow data of an entire mapping interval. In the second embodiment, the first mapping value may therefore be a timestamp associated with the transaction time, given by the following formula:
[formula not preserved in the source text]
where the formula uses a remainder operator, the timestamp of the transaction time, and an integer greater than the time difference.
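Because the original formula is not preserved, the sketch below shows only one plausible reading of it, and should be treated as an assumption: the first mapping value is taken to be the start of the interval containing the timestamp, computed as `timestamp - (timestamp % interval)`. The names `timestamp` and `interval` are introduced here; `interval` plays the role of the integer greater than the time difference.

```python
def first_mapping_value(timestamp: int, interval: int) -> int:
    """Assumed form: start of the mapping interval that contains the timestamp."""
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    return timestamp - (timestamp % interval)
```

Under this reading, all transactions whose timestamps fall in the same interval share one first mapping value, which is consistent with the bucket analogy used in the description.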
It can be understood that the second embodiment essentially applies the idea of bucket sort: a first mapping value is first generated from the transaction time (in essence a mapping interval, each interval being regarded as a bucket) to determine roughly which bucket the flow data falls in; the scarce keywords are then used, for example through a hash table, to determine the specific position of the flow data.
Accordingly, in step 3, determining whether the associated flow data set is entirely in the cache based on the mapping values of the flow data specifically comprises steps 3.1 to 3.3.
Step 3.1: based on the first mapping value, obtain the first successor mapping value.
The first successor mapping value is the mapping value that immediately follows the first mapping value. [formula not preserved in the source text]
Step 3.2: based on the first mapping value of the flow data, obtain all flow data under the first mapping value and the first successor mapping value as the flow data set to be compared.
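Steps 3.1 and 3.2 can be sketched with an in-memory bucket structure. The class name `BucketCache`, the `timestamp` field, and the assumption that the successor bucket starts exactly one interval later are all illustrative choices, since the source formulas are missing.

```python
from collections import defaultdict


class BucketCache:
    """Cached records grouped by the start of their time interval (assumed scheme)."""

    def __init__(self, interval: int):
        self.interval = interval
        self._buckets = defaultdict(list)

    def bucket_of(self, timestamp: int) -> int:
        return timestamp - (timestamp % self.interval)

    def add(self, record: dict) -> None:
        self._buckets[self.bucket_of(record["timestamp"])].append(record)

    def candidates(self, record: dict) -> list:
        """Records in the record's own bucket and in the successor bucket."""
        first = self.bucket_of(record["timestamp"])
        successor = first + self.interval  # step 3.1 under the stated assumption
        return self._buckets.get(first, []) + self._buckets.get(successor, [])
```

Restricting the comparison to two buckets keeps the candidate set small, so the more expensive keyword comparison of step 3.3 runs on a few records rather than on the whole cache.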
Step 3.3: compare the second mapping values of the flow data, in turn, with the second mapping values of each piece of flow data in the set to be compared; if the similarity between the two is greater than or equal to a preset similarity threshold, that piece of flow data is determined to be associated flow data.
It can be understood that, in the cache, the multiple second mapping values of the associated flow data should likewise be sorted from high to low to form a vector. In general, the total number n of second mapping values should be at least 5, and the preset similarity threshold may be set to 80%. The similarity between two pieces of flow data is the ratio of the number of second mapping values they share to the total number of second mapping values.
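The similarity measure defined above can be sketched as follows. The function names are introduced here, and the sketch assumes that both records carry the same number n of second mapping values and that "shared" means set intersection.

```python
def similarity(a: list, b: list) -> float:
    """Shared second mapping values divided by the total number n."""
    if not a or len(a) != len(b):
        raise ValueError("both records need the same non-zero number of values")
    return len(set(a) & set(b)) / len(a)


def is_associated(a: list, b: list, threshold: float = 0.8) -> bool:
    """Apply the preset similarity threshold (80% in the description)."""
    return similarity(a, b) >= threshold
```

With n = 5 and a threshold of 80%, two records are treated as associated when at least four of their five second mapping values coincide.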
It should be noted that, in step 3.3, when the similarity is greater than or equal to the preset threshold, the two pieces of flow data usually need to be compared further to confirm that they are associated. This further check amounts to increasing the number of second mapping values that are compared, so that a value of n that was too small does not lead to an imprecise judgment; correspondingly, the similarity threshold should be raised as the number of second mapping values grows. The specific process is not described further.
To prevent keywords that are common across the flow data from being selected as scarce keywords, in the embodiments of the present invention the process of selecting scarce keywords may specifically comprise steps S101 to S103.
Step S101: obtain every keyword in the counterparty information or the transaction platform information in the type information of the flow data.
Step S102: compute the mapping value corresponding to each keyword according to Huffman coding.
Step S103: sort the mapping values from high to low and select the n largest as the multiple second mapping values, where n is the number of second mapping values.
It can be understood that the keywords corresponding to the n largest second mapping values are the scarce keywords.
It should be noted that Huffman coding is used in step S102 not to compress data but because its core idea is to assign bit sequences (codes) of unequal length according to the frequency or probability with which each keyword occurs. In Huffman coding, the most frequent keyword receives the shortest code and infrequent keywords receive longer codes. The mapping value of each keyword in step S102 is therefore the code assigned to that keyword by the Huffman coding.
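Steps S101 to S103 can be sketched as below. The patent does not say how Huffman codes are ordered "from high to low"; this sketch assumes the ordering is by code length, so that the rarest keywords rank first. The keyword frequencies are assumed to come from a reference corpus of flow data, and the function names are hypothetical.

```python
import heapq
from itertools import count


def huffman_codes(frequencies: dict) -> dict:
    """Build Huffman codes: frequent symbols get short codes, rare ones long codes."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {symbol: "0" for symbol in frequencies}
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(freq, next(tiebreak), {symbol: ""}) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def scarce_keywords(keywords: list, frequencies: dict, n: int) -> list:
    """Pick the n keywords with the longest Huffman codes (assumed ranking rule)."""
    codes = huffman_codes(frequencies)
    unique = list(dict.fromkeys(keywords))  # de-duplicate, keep first-seen order
    unique.sort(key=lambda k: len(codes[k]), reverse=True)
    return unique[:n]
```

Every keyword passed to `scarce_keywords` must appear in `frequencies`; a production version would need a policy for unseen keywords, which are arguably the scarcest of all.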
相应的,本发明还公开了一种自动汇集流水数据的管理系统,包括:数据获取模块、逻辑判断模块、第一数据存储模块与第二数据存储模块;Correspondingly, the present invention also discloses a management system for automatically collecting flow data, including: a data acquisition module, a logic judgment module, a first data storage module and a second data storage module;
数据获取模块用于获取多份流水数据;The data acquisition module is used to obtain multiple copies of flow data;
逻辑判断模块用于识别每一条流水数据的类型名称与类型信息,并基于类型信息,生成关联的映射值;以及基于所述流水数据的映射值,判断其关联的流水数据集合是否都在缓存中;以及基于流水数据及其关联的流水数据集合,以判断识别的准确性;The logic judgment module is used to identify the type name and type information of each piece of pipeline data, and generate an associated mapping value based on the type information; and based on the mapping value of the pipeline data, determine whether its associated pipeline data set is in the cache; and based on the pipeline data and its associated pipeline data set, determine the accuracy of the recognition;
第一数据存储模块用于将流水数据存储至缓存中;The first data storage module is used to store the running data in the cache;
第二数据存储模块用于将综合存入关联性数据库中。The second data storage module is used to store the synthesis into a relational database.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,该计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本发明所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。Those of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiments can be implemented by instructing related hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage medium. When the computer program is executed, it can include the processes of the embodiments of the above-mentioned methods. Among them, any reference to memory, storage, database or other media used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory.
作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双倍数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。The technical features of the above embodiments may be arbitrarily combined. To make the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present invention, and all of these fall within its scope of protection. Therefore, the scope of protection of this patent shall be determined by the appended claims.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410371733.XA (CN117971819B) | 2024-03-29 | 2024-03-29 | Management method and system for automatically collecting flow data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117971819A (en) | 2024-05-03 |
| CN117971819B (en) | 2024-05-31 |
Family
ID=90859993
Family Applications (1)
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202410371733.XA (CN117971819B) | Active | 2024-03-29 | 2024-03-29 | Management method and system for automatically collecting flow data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117971819B (en) |
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110020303A (en) * | 2017-11-24 | 2019-07-16 | 腾讯科技(深圳)有限公司 | Determine the alternative method, apparatus and storage medium for showing content |
| CN109146663A (en) * | 2018-07-20 | 2019-01-04 | 中国邮政储蓄银行股份有限公司 | The processing method and system of flowing water information |
| CN111538733A (en) * | 2020-06-04 | 2020-08-14 | 南京金鼎嘉崎信息科技有限公司 | Multidimensional data comprehensive analysis system and analysis method thereof |
| US10911583B1 (en) * | 2020-07-09 | 2021-02-02 | Inside Packet Ltd. | System and method for processing a network message |
| CN113901768A (en) * | 2021-09-30 | 2022-01-07 | 平安国际智慧城市科技股份有限公司 | Standard file generation method, device, equipment and storage medium |
| CN114841806A (en) * | 2022-03-17 | 2022-08-02 | 浙江邦盛科技股份有限公司 | A pattern matching system under real-time transfer link |
| CN115526722A (en) * | 2022-09-30 | 2022-12-27 | 中电金信软件有限公司 | Method and device for processing transaction of making positive |
Non-Patent Citations (2)
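| Title |
|---|
| Hong Y et al., "Use of satellite remote sensing data in the mapping of global landslide susceptibility", Natural Hazards, 2007-03-13, pp. 245-256 * |
| Zhou Yun (周赟), "Research and Implementation of an Adaptive Arithmetic Codec Based on JPEG2000", China Master's Theses Full-text Database, Information Science and Technology, 2008-06-15, I135-107 * |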
Also Published As
| Publication number | Publication date |
|---|---|
| CN117971819A (en) | 2024-05-03 |
Similar Documents
| Publication | Title |
|---|---|
| CN110457302B (en) | Intelligent structured data cleaning method |
| US10783367B2 (en) | System and method for data extraction and searching |
| CN108256074B (en) | Verification processing method and device, electronic equipment and storage medium |
| US20220351302A1 (en) | Transaction data processing systems and methods |
| AU2024216290A1 (en) | Massive scale heterogeneous data ingestion and user resolution |
| WO2019109918A1 (en) | Abstract text generation method, computer readable storage medium and computer device |
| CN110428322A (en) | A kind of adaptation method and device of business datum |
| US11880394B2 (en) | System and method for machine learning architecture for interdependence detection |
| CN111858467B (en) | File data processing method, device, equipment and medium based on artificial intelligence |
| CN112286934B (en) | Database table import method, device, equipment and medium |
| CN113139876A (en) | Risk model training method and device, computer equipment and readable storage medium |
| CN114139490A (en) | A method, device and device for automatic data preprocessing |
| CN112559526A (en) | Data table export method and device, computer equipment and storage medium |
| CN112749167A (en) | Method and device for determining broken link data and nonvolatile storage medium |
| CN116166641A (en) | Mapping method, device, equipment and medium for data model and data standard |
| CN111400187A (en) | Parameter dynamic verification system and method based on customized data source |
| CN114896955A (en) | Data report processing method and device, computer equipment and storage medium |
| CN117391086A (en) | Bid participation information extraction method, device, equipment and medium |
| CN117971819B (en) | Management method and system for automatically collecting flow data |
| US12039596B1 (en) | Automated rule-based electronic transaction data analysis |
| CN119829773A (en) | Method and system for extracting attributes of document parties |
| CN119722264A (en) | Customer risk report generation method, device, electronic device, medium and product |
| CN111126966A (en) | Bill auditing method and device, computer equipment and computer-readable storage medium |
| CN118886405A (en) | A method, device, equipment and medium for generating intelligent financial statements |
| CN114580398B (en) | Text information extraction model generation method, text information extraction method and device |
Legal Events
| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CP03 | Change of name, title or address | Address after: 210012, No. 11 Ruhua Road, Yuhuatai District, Nanjing City, Jiangsu Province, Room 207, 208, 209, Huashen Building. Patentee after: Guotou Intelligence (Nanjing) Information Technology Co.,Ltd. Country or region after: China. Address before: Room 207, 208, 209, Huashen Building, No. 11 Software Avenue, Yuhuatai District, Nanjing City, Jiangsu Province. Patentee before: NANJING JINDING JIAQI INFORMATION TECHNOLOGY CO.,LTD. Country or region before: China. |