CN113392075A

CN113392075A - Multithreading collaborative file batch naming method

Info

Publication number: CN113392075A
Application number: CN202110729518.9A
Authority: CN
Inventors: 朱咸超; 卢道; 王文斌; 蔡梦洁; 郭琪; 李征
Original assignee: Shenzhen Penglai Industrial Technology Co ltd
Current assignee: Shenzhen Penglai Industrial Technology Co ltd
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-14
Anticipated expiration: 2041-06-29
Also published as: CN113392075B

Abstract

The invention relates to the technical field of file processing, and in particular to a multithreaded collaborative method for naming files in batches. The method comprises the following steps. Step S1: extract keywords according to the characteristics of each type of material, use them as an initial keyword library, and store them in the corresponding file label attribute library, where each keyword record holds the keyword name, the keyword length and the file type to which the keyword belongs. Step S2: set the number of threads according to the actual hardware configuration of the client, and perform keyword label attribute matching with multithreading so that computer resources are used to the greatest possible extent.

Description

Multithreading collaborative file batch naming method

Technical Field

The invention relates to the technical field of file processing, in particular to a multithreading collaborative file batch naming method.

Background

When non-performing assets in the banking industry are disposed of through judicial means, large volumes of paper materials must be scanned, manually identified and classified from the scans, and then renamed so that they can be submitted to the court system. Identifying and classifying these materials takes a great deal of time, and purely manual processing is extremely inefficient. In addition, the document recognition currently available can only recognize ID card materials; it cannot recognize pictures of materials such as the application form, the contract and the charter in a bank's credit card file. To address the high labor cost of the material review and naming steps, an intelligent segmented multithreaded recognition method based on the application form, card-use contract and charter material types is proposed.

Disclosure of Invention

The invention aims to provide a multithreaded collaborative method for naming files in batches, so as to solve the problem of naming, identifying and classifying large numbers of scanned paper materials when non-performing assets in the banking industry are subject to judicial disposal.

To achieve the above purpose, the present invention provides the following technical solution, which is broadly applicable to file naming, identification and classification: a method for naming files in batches based on multithreaded collaboration, comprising the following steps:

step S1, extracting corresponding keywords according to the characteristics of each type of material to serve as an initial keyword library and storing the keywords in a corresponding file label attribute library;

wherein each keyword record corresponds to: keyword name, keyword length and file type of the keyword;

step S2, setting the number of threads according to the actual hardware configuration of the client, and performing keyword label attribute matching with multithreading so that computer resources are used to the greatest possible extent;

step S3, the recognition result returned by the Baidu OCR interface is parsed and sorted, then stored and submitted to a matching queue;

wherein: the Baidu OCR recognition interface returns a data packet in JSON format; FastJson is a set of open-source JSON processing tools from Alibaba; the result is parsed into a HashMap object through FastJson, the specific recognition contents are obtained from the HashMap object, and after all the recognition contents are extracted, they are stored in the matching queue;

the matching queue is a data set backed by a List, in which all contents to be matched are stored in order;

after each match is completed, the matching queue destroys the matched data; the file type can be determined from the final number of matches and the types of the matched keywords;

step S4, obtaining the total number of keywords from the matching queue in step S3, and then batching the matching keywords according to the number of the available threads, wherein the number of the keywords in each batch is as follows: total number of keywords/number of threads;

step S5, matching the text content distributed to each thread in step S4 against the keyword library of step S1, wherein the number of matching operations for each batch is: total number of keywords × number of keywords in each batch; multithreading improves the matching efficiency, and the matched keywords are finally collected;

step S6, in the matching process of step S5, matching once according to the 100% matching rule: if all keywords are matched successfully, the match is directly marked as successful; if not all keywords are matched, the successfully matched keywords are recorded and the matching result is returned;

step S7, for the information in the matching result of step S6 that was not matched successfully, matching again according to a specific strategy rule, and terminating the matching if it still fails;

step S8, after step S7 is finished, automatically assigning the file the name corresponding to the label attribute according to the file type of the successfully matched keywords, and moving the file to the folder to which it belongs.
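As a minimal sketch of the renaming and moving in step S8, the Java fragment below builds a target path from the matched file type and moves the scanned picture there. The class name `FileNamer`, the folder layout `<root>/<fileType>/` and the naming pattern `<fileType>_<index><extension>` are illustrative assumptions; the method itself does not prescribe a naming pattern.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: the folder layout and name pattern are assumptions.
final class FileNamer {

    /** Builds <root>/<fileType>/<fileType>_<index><ext>, keeping the source extension. */
    static Path targetFor(Path root, String fileType, int index, Path source) {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot >= 0 ? name.substring(dot) : "";
        return root.resolve(fileType).resolve(fileType + "_" + index + ext);
    }

    /** Renames the scanned file after its matched type and moves it into that type's folder. */
    static Path renameAndMove(Path source, Path root, String fileType, int index)
            throws IOException {
        Path target = targetFor(root, fileType, index, source);
        Files.createDirectories(target.getParent());
        return Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For example, under these assumptions a scan `img001.jpg` matched as `contract` with index 3 would end up at `<root>/contract/contract_3.jpg`.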

Preferably, in step S6, all keywords that were not fully matched but have partial matching records are reviewed, and if they are confirmed to be related to the current file, a new matching rule and strategy are formulated.

Preferably, when the number of threads is set in step S2, a thread pool is created through the constructor of ThreadPoolExecutor, and the number of core threads, the maximum number of threads, and the maximum survival time of idle threads in excess of corePoolSize are set at creation.

Preferably, the number of core threads is determined as follows: each task takes tasktime seconds to process, so each thread can process 1/tasktime tasks per second; if the system must process tasks tasks per second, the number of threads required is tasks/(1/tasktime), i.e., tasks × tasktime threads.

Preferably, the maximum number of threads applies when the system load reaches its maximum: the core threads can no longer process all tasks on time, and the number of threads must be increased.

Preferably, the maximum survival time of idle threads in excess of corePoolSize governs how the number of threads grows and shrinks;

specifically, when the load falls the number of threads can be reduced: if a thread has been idle for keepAliveTime, it exits; by default the thread pool keeps at least corePoolSize threads.

Preferably, after the OCR recognition in step S3, the system splits the result evenly into N segments according to the number of items returned by OCR, and assigns the segments to N threads of the system so that they are matched simultaneously;

where N is the maximum number of threads supported by the client.

Compared with the prior art, the invention has the beneficial effects that:

With OCR recognition and the intelligent segmented multithreaded batch naming method, the invention can identify and name, picture by picture, materials such as credit card application forms whose formats differ from bank to bank and contracts whose content varies widely across multiple versions. The time taken is only 1/N of that previously required, and the effective utilization of system resources is improved by 90 percent compared with before. In addition, establishing a keyword library and a file label attribute library to match the attributes of different files raises the matching success rate and lowers the file naming error rate. Furthermore, when not all keywords are matched successfully, the matching rules and strategies can be changed promptly through manual intervention, which makes matching more flexible.

Drawings

FIG. 1 is a flow chart of the method steps of the present invention;

FIG. 2 is a diagram of a file tag property library according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1 and fig. 2, an embodiment of the present invention includes:

a method for naming files in batches based on multithreading collaboration comprises the following steps;

wherein: the Baidu OCR recognition interface returns a data packet in JSON format; FastJson is a set of open-source JSON processing tools from Alibaba; the result is parsed into a HashMap object through FastJson, the specific recognition contents are obtained from the HashMap object, and after all the recognition contents are extracted, they are stored in the matching queue.
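The extraction of recognition contents from the parsed map into the matching queue might look like the sketch below. It starts from a map that has already been produced by the JSON parser, so it uses only the Java standard library. The key names `words_result` and `words` are assumptions about the shape of the OCR response, and the class name `MatchingQueueBuilder` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the key names below are assumed, not taken from the method.
final class MatchingQueueBuilder {

    /** Collects every recognized text line, in order, into a List-backed matching queue. */
    static List<String> toMatchingQueue(Map<String, Object> parsed) {
        List<String> queue = new ArrayList<>();
        Object results = parsed.get("words_result"); // assumed key: list of recognized lines
        if (results instanceof List) {
            for (Object item : (List<?>) results) {
                if (item instanceof Map) {
                    Object words = ((Map<?, ?>) item).get("words"); // assumed key: line text
                    if (words != null) {
                        queue.add(words.toString());
                    }
                }
            }
        }
        return queue;
    }
}
```

Because the queue is an ordinary `ArrayList`, the recognized lines keep the order in which the OCR interface returned them.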

after each match is completed, the matching queue destroys the matched data. The file type can be determined from the final number of matches and the types of the matched keywords;
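One way to read this rule is sketched below: each queue entry that contains a library keyword is counted toward that keyword's file type and then removed from the queue, and the type with the most hits is returned. Choosing the type by the highest hit count, and using substring containment as the match test, are assumptions made for illustration; `TypeResolver` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: "most hits wins" is an assumed interpretation.
final class TypeResolver {

    /**
     * Removes matched entries from the (mutable) queue and returns the file type
     * with the most keyword hits, or null if nothing matched.
     */
    static String resolve(List<String> queue, Map<String, String> keywordToType) {
        Map<String, Integer> hitsPerType = new HashMap<>();
        Iterator<String> it = queue.iterator();
        while (it.hasNext()) {
            String text = it.next();
            for (Map.Entry<String, String> entry : keywordToType.entrySet()) {
                if (text.contains(entry.getKey())) {
                    hitsPerType.merge(entry.getValue(), 1, Integer::sum);
                    it.remove(); // the queue destroys data once it has been matched
                    break;
                }
            }
        }
        String bestType = null;
        int bestHits = 0;
        for (Map.Entry<String, Integer> entry : hitsPerType.entrySet()) {
            if (entry.getValue() > bestHits) {
                bestHits = entry.getValue();
                bestType = entry.getKey();
            }
        }
        return bestType;
    }
}
```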

step S4, extracting the keyword list from the matching queue of step S3, and then dividing the keywords to be matched into batches according to the number of available threads, wherein the number of keywords in each batch is: total number of keywords/number of threads;
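The batching in step S4 can be illustrated with a short partitioning routine. The sketch rounds the batch size up (ceiling division) so that no keyword is dropped when the total is not an exact multiple of the thread count; that rounding choice and the name `KeywordBatcher` are assumptions, since the text only gives the ratio.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating "total number of keywords / number of threads".
final class KeywordBatcher {

    /** Splits items into at most `threads` consecutive batches of near-equal size. */
    static <T> List<List<T>> split(List<T> items, int threads) {
        if (threads <= 0) {
            throw new IllegalArgumentException("threads must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        if (items.isEmpty()) {
            return batches;
        }
        // Ceiling division: batch size = ceil(total / threads).
        int size = (items.size() + threads - 1) / threads;
        for (int from = 0; from < items.size(); from += size) {
            int to = Math.min(from + size, items.size());
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }
}
```

With 10 keywords and 4 threads this yields batches of 3, 3, 3 and 1 keywords.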

step S7, for the information in the matching result of step S6 that was not matched successfully, matching again according to a specific strategy rule, and terminating the matching if it still fails;

In step S6, all keywords that were not fully matched but have partial matching records are reviewed, and if they are determined to be related to the current file, a new matching rule and strategy are formulated.
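The two-stage flow of steps S6 and S7 can be sketched as follows: a first pass applies the 100% rule, and only if that fails is a fallback strategy consulted. The fallback is passed in as a predicate because the concrete strategy rule is configurable; the names `TwoPhaseMatcher`, `Result` and `fallbackRule`, and the use of substring containment as the match test, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch of "100% rule first, strategy rule second".
final class TwoPhaseMatcher {

    static final class Result {
        final boolean success;
        final List<String> matched; // keywords that were found, kept for later review

        Result(boolean success, List<String> matched) {
            this.success = success;
            this.matched = matched;
        }
    }

    static Result match(List<String> lines, List<String> keywords,
                        BiPredicate<List<String>, List<String>> fallbackRule) {
        List<String> matched = new ArrayList<>();
        for (String keyword : keywords) {
            for (String line : lines) {
                if (line.contains(keyword)) {
                    matched.add(keyword);
                    break;
                }
            }
        }
        // Step S6: every keyword found means the match succeeds immediately.
        if (!keywords.isEmpty() && matched.size() == keywords.size()) {
            return new Result(true, matched);
        }
        // Step S7: otherwise apply the strategy rule; if it also fails, matching ends.
        return new Result(fallbackRule.test(matched, keywords), matched);
    }
}
```

The recorded `matched` list is what a reviewer would inspect when deciding whether a new rule or strategy is needed.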

When the number of threads is set in step S2, a thread pool is created through the constructor of ThreadPoolExecutor, and the number of core threads, the maximum number of threads, and the maximum survival time of idle threads in excess of corePoolSize are set at creation;

while creating a thread pool, assigning specific parameters to the number of core threads, the maximum number of threads and the maximum survival time of idle threads exceeding the number of corePoolSize in the thread pool;
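A pool of this kind could be created roughly as shown below, using the standard `java.util.concurrent.ThreadPoolExecutor` constructor. The factory name `MatchPoolFactory`, the choice of a bounded `LinkedBlockingQueue` as the work queue and the use of seconds as the time unit are assumptions; the text only fixes the three parameters.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical factory; queue type and time unit are illustrative choices.
final class MatchPoolFactory {

    static ThreadPoolExecutor create(int corePoolSize, int maxPoolSize,
                                     long keepAliveSeconds, int queueCapacity) {
        return new ThreadPoolExecutor(
                corePoolSize,                 // threads kept alive even when idle
                maxPoolSize,                  // upper bound once the queue is full
                keepAliveSeconds,             // idle time after which extra threads exit
                TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(queueCapacity)); // bounded work queue
    }
}
```

Note that with a bounded queue, `ThreadPoolExecutor` only starts threads beyond `corePoolSize` after the queue has filled, which is why the queue capacity matters when sizing the maximum.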

the description is as follows:

Taking a 10th-generation Core i3 series processor, currently common in home PCs, as an example: it has 4 cores and 8 threads with a main frequency of 3.7 GHz. Under this configuration the number of threads can be adjusted between a minimum of 1 and a maximum of 8, and the frequency between a minimum of 1 GHz and a maximum of 3.7 GHz; this configuration is intended to increase resource utilization.

The number of core threads is determined as follows: each task takes tasktime seconds to process, so each thread can process 1/tasktime tasks per second; the system must process tasks tasks per second, so the number of threads required is tasks/(1/tasktime), i.e., tasks × tasktime threads;

assuming the system handles 100 to 1000 tasks per second and each task takes 0.1 second, 100 × 0.1 to 1000 × 0.1 threads are needed, i.e., 10 to 100 threads;

corePoolSize should therefore be set to more than 10; the specific number is best chosen by the 80/20 principle, i.e., according to the number of tasks per second that the system sees 80% of the time. If the system handles fewer than 200 tasks per second 80% of the time and at most 1000, corePoolSize can be set to 20.
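The arithmetic tasks × tasktime can be checked with a few lines of Java. To avoid floating-point rounding, this sketch takes the task time in milliseconds and rounds the result up; both choices, and the name `CorePoolSizing`, are assumptions made for illustration.

```java
// Hypothetical helper for the core-thread estimate tasks × tasktime.
final class CorePoolSizing {

    /** Threads needed to keep up with tasksPerSecond when each task takes taskTimeMillis. */
    static int threadsNeeded(int tasksPerSecond, int taskTimeMillis) {
        long busyMillisPerSecond = (long) tasksPerSecond * taskTimeMillis;
        return (int) ((busyMillisPerSecond + 999) / 1000); // round up to whole threads
    }
}
```

With a task time of 100 ms, 100 tasks per second need 10 threads, 200 need 20 and 1000 need 100, matching the figures in the description.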

When the system load reaches its maximum, the core threads can no longer process all tasks on time, and threads must be added;

since 200 tasks per second require 20 threads, when the load reaches 1000 tasks per second the number of threads required is (1000 - queueCapacity) × (20/200); this evaluates to 60 threads, which implies a queueCapacity of 400, so maxPoolSize can be set to 60.
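The maximum-thread estimate can be reproduced in the same way. The text does not state a value for queueCapacity; 400 is used here only because it is the value that makes the formula yield the stated 60 threads. The name `MaxPoolSizing` and the lower bound at the core size are assumptions.

```java
// Hypothetical helper for (peakTasks - queueCapacity) × (coreThreads / coreTasks).
final class MaxPoolSizing {

    static int maxThreads(int peakTasksPerSecond, int queueCapacity,
                          int coreThreads, int coreTasksPerSecond) {
        long overflow = (long) peakTasksPerSecond - queueCapacity; // tasks the queue cannot absorb
        long threads = overflow * coreThreads / coreTasksPerSecond;
        return (int) Math.max(coreThreads, threads);               // never below the core size
    }
}
```

For a peak of 1000 tasks per second, an assumed queue capacity of 400, and 20 core threads per 200 tasks, the result is 60.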

The maximum survival time of idle threads in excess of corePoolSize governs how the number of threads grows and shrinks;

specifically, when the load falls the number of threads can be reduced: if a thread has been idle for keepAliveTime, it exits. By default the thread pool keeps at least corePoolSize threads. The tasks run by the pool then inherit from the Thread class, and in the run method they perform exact matching against the matching rules in the label library.

After the OCR recognition in step S3, the system splits the OCR result evenly into N segments according to the number of items returned by OCR, and assigns the segments to N threads of the system so that they are matched simultaneously;

where N is the maximum number of threads supported by the client.
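A minimal sketch of this segment-per-thread matching is given below, using a fixed pool of N threads and `invokeAll` to wait for every segment. The class name `ParallelMatcher`, the substring match test, and the merging of per-segment results into one sorted set are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: split the OCR lines into N segments and match them concurrently.
final class ParallelMatcher {

    static Set<String> matchInParallel(List<String> lines, Set<String> keywords, int n) {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        try {
            int size = Math.max(1, (lines.size() + n - 1) / n); // segment length, rounded up
            List<Callable<Set<String>>> jobs = new ArrayList<>();
            for (int from = 0; from < lines.size(); from += size) {
                List<String> segment = lines.subList(from, Math.min(from + size, lines.size()));
                jobs.add(() -> {
                    Set<String> found = new HashSet<>();
                    for (String line : segment) {
                        for (String keyword : keywords) {
                            if (line.contains(keyword)) {
                                found.add(keyword);
                            }
                        }
                    }
                    return found;
                });
            }
            Set<String> all = new TreeSet<>();
            for (Future<Set<String>> result : pool.invokeAll(jobs)) {
                all.addAll(result.get()); // collect the keywords found by each segment
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("matching interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("matching failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

On a client machine, `Runtime.getRuntime().availableProcessors()` is one plausible way to obtain N, although the text does not say how the maximum is determined.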

In step S1, the file tag attribute library comprises a file name, a matching type, a matching strategy, and the feature tags associated with the file name;

the file names comprise the application form and the card-use contract;

the matching types comprise exact matching and specific-strategy matching, wherein exact matching means that the characters, the number of characters and the order of the feature tag all match completely, and specific-strategy matching means matching according to content rules with limiting conditions;

the matching strategies comprise a 100% match, and alternating occurrence in consecutive lines at least 2 times;

the feature tags associated with the file names comprise: XX Bank credit card; recommender's credit card number; bank-use-only column; credit card charter; supplementary card applicant; applicant ID type/number; bill mailing address; main card applicant signature; card collection method; card-use contract; "the specific charging items and standards related to this contract are set out in the charging standards"; card title ( ) number; Party A + Party B; card issuer (hereinafter "Party A") and applicant (hereinafter "Party B").
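An entry of the file tag attribute library and the exact-matching rule could be modeled as in the sketch below. The field names, the `MatchType` enum and the class name `FileTagLibrary` are hypothetical; exact matching is implemented as substring containment, which requires the tag's characters, their number and their order to agree with the recognized text.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical data model for one row of the file tag attribute library.
final class FileTagLibrary {

    enum MatchType { EXACT, STRATEGY }

    static final class Entry {
        final String fileName;          // e.g. "application form"
        final MatchType matchType;      // exact or specific-strategy matching
        final String matchPolicy;       // e.g. "100%"
        final List<String> featureTags; // tags associated with the file name

        Entry(String fileName, MatchType matchType, String matchPolicy,
              List<String> featureTags) {
            this.fileName = fileName;
            this.matchType = matchType;
            this.matchPolicy = matchPolicy;
            this.featureTags = featureTags;
        }
    }

    /** True only if every feature tag occurs verbatim in at least one recognized line. */
    static boolean exactMatch(Entry entry, Collection<String> recognizedLines) {
        for (String tag : entry.featureTags) {
            boolean found = false;
            for (String line : recognizedLines) {
                if (line.contains(tag)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```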

Examples

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. A multithreading collaborative file batch naming method, characterized in that the method comprises the following steps:

step S5, matching the text content distributed to each thread in step S4 against the keyword library of step S1, wherein the number of matching operations for the keywords in each batch is: total number of keywords × number of keywords in each batch; multithreading improves the matching efficiency, and the matched keywords are finally collected;

2. The multithreading collaborative file batch naming method according to claim 1, characterized in that: in step S6, all keywords that were not fully matched but have partial matching records are reviewed, and if they are determined to be related to the current file, a new matching rule and strategy are formulated.

3. The multithreading collaborative file batch naming method according to claim 1, characterized in that: when the number of threads is set in step S2, a thread pool is created through the constructor of ThreadPoolExecutor, and the number of core threads, the maximum number of threads, and the maximum survival time of idle threads in excess of corePoolSize are set at creation.

4. The multithreading collaborative file batch naming method according to claim 3, characterized in that: the number of core threads is determined as follows: each task takes tasktime seconds to process, so each thread can process 1/tasktime tasks per second; the system must process tasks tasks per second, so the number of threads required is tasks/(1/tasktime), i.e., tasks × tasktime threads.

5. The multithreading collaborative file batch naming method according to claim 4, characterized in that: when the system load reaches its maximum, the core threads can no longer process all tasks on time, and threads must be added.

6. The multithreading collaborative file batch naming method according to claim 5, characterized in that: the maximum survival time of idle threads in excess of corePoolSize governs how the number of threads grows and shrinks.

7. The multithreading collaborative file batch naming method according to claim 1, characterized in that: after the OCR recognition in step S3, the system splits the OCR result evenly into N segments according to the number of items returned by OCR, and assigns the segments to N threads of the system so that they are matched simultaneously;

where N is the maximum number of threads supported by the client.