CN110795425B - Customs data cleaning and merging method, device, equipment and medium - Google Patents
Customs data cleaning and merging method, device, equipment and medium
- Publication number: CN110795425B (CN201911057701.8A)
- Authority: CN (China)
- Prior art keywords
- data
- bill
- company name
- name information
- lading
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24564—Applying rules; Deductive queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Entrepreneurship & Innovation (AREA)
- Strategic Management (AREA)
- Quality & Reliability (AREA)
- Economics (AREA)
- Operations Research (AREA)
- General Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Educational Administration (AREA)
- Marketing (AREA)
- Development Economics (AREA)
- Tourism & Hospitality (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method, device, equipment and medium for cleaning and merging customs data. The method comprises: extracting one piece of valid bill of lading data from the original customs data; extracting the company name information from the bill of lading data; judging whether the extracted company name information is valid; matching the region information in the company name information according to a preset rule and, if the match succeeds, deleting the region information; matching the suffix information in the company name information according to a preset rule and, if the match succeeds, converting it into standard-format suffix information; and judging whether the database contains bill of lading data that can be merged with the bill of lading data processed in the preceding steps and, if so, merging the data. By extracting valid bill of lading data from the original customs data and then cleaning, processing and merging it, the invention produces bill of lading data with a uniform format and concentrated information, making it easier for users to find useful information.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a medium for cleaning and merging customs data.
Background
Customs data are the import and export statistics generated as customs authorities carry out their trade statistics function. Mining these data in depth helps enterprises grasp market trends in a timely, comprehensive and objective manner and analyze business conditions in overseas markets.
However, the original customs data has the following problems:
first, the volume of original customs data is large, which makes it difficult for users to query useful information;
second, customs data cover many trading countries, which makes the data complex;
third, customs data contain a large amount of junk information.
Users therefore have to sort and process the original customs data themselves, and finding useful information is difficult.
Disclosure of Invention
To address the above deficiencies in the prior art, the invention provides a strategy for cleaning and merging customs data: valid bill of lading data is extracted from the original customs data and then cleaned, processed and merged, producing bill of lading data with a uniform format and concentrated information so that users can find useful information easily.
In order to solve the technical problems, the first aspect of the invention discloses a customs data cleaning and merging method, which comprises the following steps:
step one, extracting one piece of valid bill of lading data from the original customs data;
step two, extracting the company name information from the bill of lading data;
step three, judging whether the extracted company name information is valid; if so, proceeding to step four, and if not, proceeding to step seven;
step four, matching the region information in the company name information according to a preset rule; if the match succeeds, deleting the region information from the company name information and then proceeding to step five; if the match fails, proceeding directly to step five;
step five, matching the suffix information in the company name information according to a preset rule; if the match succeeds, converting the suffix information into standard-format suffix information and then proceeding to step six; if the match fails, proceeding directly to step six;
step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data resulting from step five, and if so, merging the data;
step seven, extracting the next piece of valid bill of lading data from the original customs data and returning to step two.
In the above method, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data resulting from step five comprises:
step 601, storing the bill of lading data resulting from step five directly into the database;
step 602, sorting all bill of lading data in the database by company name information;
step 603, after the sorting is completed, extracting the company name information from the newly stored bill of lading data and from the bill of lading data adjacent to it;
step 604, calculating the similarity between the company name information in the newly stored bill of lading data and that in the adjacent bill of lading data; if the similarity is greater than a threshold, merging the newly stored bill of lading data with the adjacent bill of lading data.
In the above method for cleaning and merging customs data, the similarity calculation in step 604 is implemented by the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
In the above method for cleaning and merging customs data, the matching of the region information in the company name information according to a preset rule in step four is implemented by regular expression matching.
A second aspect of the invention discloses a customs data cleaning and merging device, which comprises a bill of lading data extraction module, a company name information extraction module, a first judging module, a first matching module, a second matching module and a second judging module;
the bill of lading data extraction module is used for extracting valid bill of lading data from the original customs data;
the company name information extraction module is used for extracting the company name information from the bill of lading data extracted by the bill of lading data extraction module;
the first judging module is used for judging whether the company name information extracted by the company name information extraction module is valid, and if so, triggering the first matching module and the second matching module to operate;
the first matching module is used for matching the region information in the company name information according to a preset rule, and deleting the region information from the company name information if the match succeeds;
the second matching module is used for matching the suffix information in the company name information according to a preset rule, and converting the suffix information into standard-format suffix information if the match succeeds;
the second judging module is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module and the second matching module, and if so, merging the data and storing the merged bill of lading data in the database.
The second judging module comprises a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;
the data writing unit is used for storing the bill of lading data processed by the first matching module and the second matching module into the database;
the data sorting unit is used for sorting all bill of lading data in the database by company name information;
the company name information extraction unit is used for extracting, after the sorting is completed, the company name information from the newly stored bill of lading data and from the bill of lading data adjacent to it;
the similarity calculation unit is used for calculating the similarity between the company name information in the newly stored bill of lading data and that in the adjacent bill of lading data;
the data merging unit is used for merging the newly stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit is greater than a threshold.
The similarity calculation algorithm in the similarity calculation unit is the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
In the above customs data cleaning and merging device, the first matching module matches the region information in the company name information according to the preset rule by regular expression matching.
In a third aspect the invention discloses a terminal device for customs data cleansing and merging, comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the method according to the first aspect of the invention when said computer program is executed.
A fourth aspect of the invention discloses a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as disclosed in the first aspect of the invention.
Compared with the prior art, the invention has the following advantages:
1. The invention extracts valid bill of lading data from the original customs data, checks the company name information in it, and cleans out bill of lading data whose company name information is invalid, thereby reducing the data volume.
2. The invention processes the company name information in the bill of lading data by deleting the region information and converting the suffix information into a standard format, so that company names have a uniform format and the trade name within each company name stands out, which improves the accuracy of the similarity calculation in the subsequent data merging.
3. The invention uses the similarity of company name information to judge whether bill of lading data can be merged, so that bill of lading data of companies in different regions that share the same trade name can be merged; this reduces the number of bills of lading and lets a single bill of lading reflect more customs trade information.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of a method for data cleansing and merging according to the present invention.
FIG. 2 is a flow chart of step six of the data cleansing and merging method of the present invention.
FIG. 3 is a block diagram of a data cleansing and merging device according to the present invention.
FIG. 4 is a block diagram of the second judging module in the data cleansing and merging device according to the present invention.
Detailed Description
As shown in FIG. 1, a method for cleaning and merging customs data includes the following steps:
step one, extracting one piece of valid bill of lading data from the original customs data;
Valid data are extracted from the original customs data using a preset data field comparison table and assembled into a complete piece of valid bill of lading data. For example, the buyer is obtained from the importer field and the vendor from the exporter field. As a result, regardless of which country's customs the original data come from, the extracted bill of lading data are in a standard format, which simplifies the later data merging.
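The field comparison table described above can be sketched as a per-source mapping from raw column names to standard bill of lading fields. This is only an illustration: the source names (`source_a`, `source_b`), the raw column names other than importer and exporter, and the `hs_code` field are assumptions, not taken from the patent.

```python
# Hypothetical field comparison table: for each customs source, map that
# source's raw column names to the standard bill of lading field names.
FIELD_MAP = {
    "source_a": {"importer": "buyer", "exporter": "vendor", "hs_code": "hs_code"},
    "source_b": {"consignee_name": "buyer", "shipper_name": "vendor", "hs": "hs_code"},
}


def extract_bill_of_lading(raw_record, source):
    """Return a standard-format record built from one raw customs row.

    Raw columns that are not in the mapping are dropped; mapped columns
    that are missing from the raw row become None.
    """
    mapping = FIELD_MAP[source]
    return {std: raw_record.get(raw) for raw, std in mapping.items()}
```

Because every source is projected onto the same set of standard field names, the later steps can work on `buyer` and `vendor` without knowing which customs authority produced the row.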
Step two, extracting the company name information from the bill of lading data;
step three, judging whether the extracted company name information is valid; if so, proceeding to step four, and if not, proceeding to step seven;
Whether the extracted company name information is valid is judged using a preset library of invalid company names; for example, a company name of "not available" is invalid.
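A minimal sketch of the validity check in step three, assuming the preset library is a simple set of placeholder strings. The entries other than "not available" are hypothetical, and the whitespace and case normalization is an added assumption.

```python
# Hypothetical contents of the preset library of invalid company names.
# Only "not available" comes from the text; the rest are illustrative.
INVALID_NAMES = {"", "not available", "n/a", "unknown"}


def is_valid_company_name(name):
    """Return False for missing names and for names found in the library."""
    if name is None:
        return False
    # Collapse runs of whitespace and ignore case before the lookup.
    normalized = " ".join(name.split()).lower()
    return normalized not in INVALID_NAMES
```

A record whose name fails this check would skip steps four to six and the method would move on to the next record, as step seven describes.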
Step four, applying regular expression matching to the region information in the company name information; if the match succeeds, deleting the region information from the company name information and then proceeding to step five; if the match fails, proceeding directly to step five;
The region information is matched because most company names contain it. For example, after regular expression matching, "Shang Hai" in a company name is identified as region information. The region information is deleted to make the later similarity calculation easier: a large group company may set up branches or subsidiaries in different regions, and the trade names of these branches or subsidiaries are mostly similar or identical, so deleting the region information makes it easier to merge the bill of lading data of branches or subsidiaries that belong to the same group but are located in different regions.
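The region removal in step four could look roughly like the sketch below. The region list is hypothetical and far shorter than a real rule set would be, and the clean-up of empty parentheses is an added assumption.

```python
import re

# Hypothetical region list; "Shang Hai" is the only example from the text.
REGIONS = ["Shang Hai", "Shanghai", "Bei Jing", "Beijing", "Shen Zhen", "Shenzhen"]
REGION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(r) for r in REGIONS) + r")\b",
    flags=re.IGNORECASE,
)


def strip_region(name):
    """Delete any listed region from a company name and tidy the spacing."""
    stripped = REGION_RE.sub(" ", name)
    # A region written as "(Shanghai)" leaves empty parentheses behind.
    stripped = re.sub(r"\(\s*\)", " ", stripped)
    return " ".join(stripped.split())
```

If no region matches, the name is returned unchanged apart from whitespace normalization, which corresponds to the "match fails, proceed directly to step five" branch.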
Step five, applying regular expression matching to the suffix information in the company name information; if the match succeeds, converting the suffix information into standard-format suffix information and then proceeding to step six; if the match fails, proceeding directly to step six;
Some company names write the suffix out in full, such as "xxx Company Limited". Converting "xxx Company Limited" into "xxx co., ltd." standardizes the suffix information and makes company names more uniform.
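The suffix conversion in step five can be sketched as a small table of regular expression rules. The two patterns below are assumptions chosen to cover the "Company Limited" example and common spellings of the abbreviated form; the standard form "co., ltd." follows the example in the text.

```python
import re

# Hypothetical rule table: (pattern for a suffix variant, standard form).
SUFFIX_RULES = [
    (re.compile(r"\bcompany\s+limited\.?$", re.IGNORECASE), "co., ltd."),
    (re.compile(r"\bco\.?\s*,?\s*ltd\.?$", re.IGNORECASE), "co., ltd."),
]


def normalize_suffix(name):
    """Rewrite a recognized trailing suffix into the standard form."""
    name = name.strip()
    for pattern, standard in SUFFIX_RULES:
        if pattern.search(name):
            return pattern.sub(standard, name)
    return name  # no rule matched: leave the name as it is
```

Anchoring each pattern at the end of the string keeps the rules from rewriting words in the middle of a name.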
Step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data resulting from step five, and if so, merging the data;
Merging the bill of lading data that can be merged reduces the quantity of bill of lading data and concentrates the information, so that a user who finds a company's bill of lading data can see that company's trade information at every customs without querying each customs separately.
Step seven, extracting the next piece of valid bill of lading data from the original customs data and returning to step two.
Steps two to six are repeated many times, so that the processed bill of lading data, uniform in format and concentrated in information, is stored in the database.
In this embodiment, as shown in FIG. 2, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data resulting from step five includes:
step 601, storing the bill of lading data resulting from step five directly into the database;
The bill of lading data resulting from step five is first stored in the database, and only then is it judged whether the database contains bill of lading data that can be merged with it.
Step 602, sorting all bill of lading data in the database by company name information;
Since the subsequent similarity calculation is based on company name information, all bill of lading data in the database is first sorted by company name information. Once the sorting is completed, the bill of lading data most likely to be mergeable with the newly stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.
Step 603, after the sorting is completed, extracting the company name information from the newly stored bill of lading data and from the bill of lading data adjacent to it;
step 604, calculating the similarity between the company name information in the newly stored bill of lading data and that in the adjacent bill of lading data; if the similarity is greater than a threshold, merging the newly stored bill of lading data with the adjacent bill of lading data. The threshold may be 80% to 90%.
After the sorting is completed, if the newly stored bill of lading data is in the first position, its company name information is compared for similarity with that of the bill of lading data immediately after it; if the newly stored bill of lading data is in the last position, its company name information is compared for similarity with that of the bill of lading data immediately before it;
if the newly stored bill of lading data is in neither the first nor the last position, its company name information is first compared for similarity with that of the bill of lading data immediately before it, and the data are merged if the similarity is greater than the threshold; if the similarity is not greater than the threshold, its company name information is then compared for similarity with that of the bill of lading data immediately after it.
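Steps 601 to 604 and the neighbour rules above can be sketched with an in-memory sorted list standing in for the database. Several things here are assumptions: the record layout (`company`, `bills`), the 0.85 threshold picked from the 80% to 90% range, and the use of `bisect` to keep the list sorted instead of re-sorting it after every insert. The default similarity function uses `difflib.SequenceMatcher` purely as a stand-in; it is not one of the three algorithms named in the text, and any of them can be passed in through `sim`.

```python
import bisect
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in measure (difflib ratio), NOT one of the algorithms named
    # in the text; replace it via the `sim` parameter as needed.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def store_and_merge(records, new_record, threshold=0.85, sim=similarity):
    """Insert `new_record` into `records` (sorted by 'company') and merge it
    into an adjacent record when the names are similar enough.

    Returns the index of the record that now holds the new data.
    """
    keys = [r["company"] for r in records]
    pos = bisect.bisect_left(keys, new_record["company"])
    records.insert(pos, new_record)  # steps 601-602: store and keep sorted
    # Step 603: look at the predecessor first, then the successor.
    for nb in (pos - 1, pos + 1):
        if 0 <= nb < len(records):
            if sim(records[nb]["company"], new_record["company"]) > threshold:
                records[nb]["bills"].extend(new_record["bills"])  # step 604
                del records[pos]
                return nb if nb < pos else nb - 1
    return pos
```

When the new record lands in the first or last position, the out-of-range neighbour is skipped automatically, so only the single existing neighbour is compared.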
In this embodiment, the similarity calculation in step 604 is implemented by the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
The Levenshtein distance, N-gram distance and Jaro-Winkler distance algorithms are existing algorithms, and their principles are not described in detail here.
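For reference, a compact sketch of the first of these algorithms, the Levenshtein distance, together with one common way to turn the distance into a similarity between 0 and 1 that can be compared against a percentage threshold. The normalization by the longer string's length is an assumption; the text does not say how the similarity is derived from the distance.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Similarity in [0, 1]: 1 minus the distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

With this normalization, two 22-character names that differ by a single character score about 0.95, which would exceed a threshold in the 80% to 90% range.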
As shown in FIG. 3, a customs data cleaning and merging device comprises a bill of lading data extraction module 1, a company name information extraction module 2, a first judging module 3, a first matching module 4, a second matching module 5 and a second judging module 6;
the bill of lading data extraction module 1 is used for extracting valid bill of lading data from the original customs data;
The bill of lading data extraction module 1 uses the preset data field comparison table to extract valid data from the original customs data and assemble them into a complete piece of valid bill of lading data. For example, the buyer is obtained from the importer field and the vendor from the exporter field. As a result, regardless of which country's customs the original data come from, the extracted bill of lading data are in a standard format, which simplifies the later data merging.
The company name information extraction module 2 is used for extracting the company name information from the bill of lading data extracted by the bill of lading data extraction module 1;
the first judging module 3 is configured to judge whether the company name information extracted by the company name information extraction module 2 is valid, and if so, to trigger the first matching module 4 and the second matching module 5 to operate;
The first judging module 3 judges whether the extracted company name information is valid using a preset library of invalid company names; for example, a company name of "not available" is invalid.
The first matching module 4 is configured to apply regular expression matching to the region information in the company name information, and to delete the region information from the company name information if the match succeeds;
The region information is matched because most company names contain it. For example, after regular expression matching, "Shang Hai" in a company name is identified as region information. The region information is deleted to make the later similarity calculation easier: a large group company may set up branches or subsidiaries in different regions, and the trade names of these branches or subsidiaries are mostly similar or identical, so deleting the region information makes it easier to merge the bill of lading data of branches or subsidiaries that belong to the same group but are located in different regions.
The second matching module 5 is configured to apply regular expression matching to the suffix information in the company name information, and to convert the suffix information into standard-format suffix information if the match succeeds;
Some company names write the suffix out in full, such as "xxx Company Limited". Converting "xxx Company Limited" into "xxx co., ltd." standardizes the suffix information and makes company names more uniform.
The second judging module 6 is configured to judge whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module 4 and the second matching module 5, and if so, to merge the data.
Merging the bill of lading data that can be merged reduces the quantity of bill of lading data and concentrates the information, so that a user who finds a company's bill of lading data can see that company's trade information at every customs without querying each customs separately.
As shown in FIG. 4, in this embodiment the second judging module 6 includes: a data writing unit 61, a data sorting unit 62, a company name information extraction unit 63, a similarity calculation unit 64 and a data merging unit 65.
A data writing unit 61, configured to store the bill of lading data processed by the first matching module 4 and the second matching module 5 into the database;
a data sorting unit 62, configured to sort all bill of lading data in the database by company name information;
The data sorting unit 62 performs one sort each time new bill of lading data is stored in the database.
A company name information extraction unit 63, configured to extract, after the sorting is completed, the company name information from the newly stored bill of lading data and from the bill of lading data adjacent to it;
After the data sorting unit 62 has sorted the data, the newly stored bill of lading data has its own position in the sequence. Since the subsequent similarity calculation is based on company name information, all bill of lading data in the database is sorted by company name information; once the sorting is completed, the bill of lading data most likely to be mergeable with the newly stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.
A similarity calculation unit 64, configured to calculate the similarity between the company name information in the newly stored bill of lading data and that in the adjacent bill of lading data;
a data merging unit 65, configured to merge the newly stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit 64 is greater than a threshold.
After the sorting is completed, if the newly stored bill of lading data is in the first position, its company name information is compared for similarity with that of the bill of lading data immediately after it; if the newly stored bill of lading data is in the last position, its company name information is compared for similarity with that of the bill of lading data immediately before it;
if the newly stored bill of lading data is in neither the first nor the last position, its company name information is first compared for similarity with that of the bill of lading data immediately before it, and the data are merged if the similarity is greater than the threshold; if the similarity is not greater than the threshold, its company name information is then compared for similarity with that of the bill of lading data immediately after it.
In the present embodiment, the similarity calculation algorithm in the similarity calculation unit 64 is the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
A terminal device for customs data cleaning and merging comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above customs data cleaning and merging method when executing the computer program.
A computer readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above customs data cleaning and merging method.
The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and any simple modification, variation and equivalent structural changes made to the above embodiment according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.
Claims (10)
1. A method for cleaning and merging customs data, comprising the steps of:
step one, extracting one piece of valid bill of lading data from the original customs data;
step two, extracting the company name information from the bill of lading data;
step three, judging whether the extracted company name information is valid; if so, proceeding to step four, and if not, proceeding to step seven;
step four, matching the region information in the company name information according to a preset rule; if the match succeeds, deleting the region information from the company name information and then proceeding to step five; if the match fails, proceeding directly to step five;
step five, matching the suffix information in the company name information according to a preset rule; if the match succeeds, converting the suffix information into standard-format suffix information and then proceeding to step six; if the match fails, proceeding directly to step six;
step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data resulting from step five, and if so, merging the data;
step seven, extracting the next piece of valid bill of lading data from the original customs data and returning to step two.
2. The method for cleaning and merging customs data according to claim 1, wherein judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data resulting from step five comprises:
step 601, storing the bill of lading data resulting from step five directly into the database;
step 602, sorting all bill of lading data in the database by company name information;
step 603, after the sorting is completed, extracting the company name information from the newly stored bill of lading data and from the bill of lading data adjacent to it;
step 604, calculating the similarity between the company name information in the newly stored bill of lading data and that in the adjacent bill of lading data; if the similarity is greater than a threshold, merging the newly stored bill of lading data with the adjacent bill of lading data.
3. A method for cleaning and merging customs data according to claim 2, wherein the similarity calculation in step 604 is implemented by Levenstein Distance algorithm, NGram Distance algorithm or Jaro Winkler Distance algorithm.
4. The method for cleaning and merging customs data according to claim 1, 2 or 3, wherein in step four, the matching of the regional information in the company name information according to the preset rule is implemented by regular expression matching.
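A minimal sketch of region removal by regular expression matching is shown below. The region list `REGIONS` and the function name `strip_region` are hypothetical; the patent does not disclose the actual preset rule at this point.

```python
import re

# Hypothetical region vocabulary; a real rule set would be much larger.
REGIONS = ["SHANGHAI", "SHENZHEN", "GUANGDONG", "HONG KONG", "CHINA"]

# Match a region as a whole word, optionally wrapped in parentheses.
REGION_RE = re.compile(
    r"\(?\b(?:" + "|".join(re.escape(r) for r in REGIONS) + r")\b\)?",
    re.IGNORECASE,
)


def strip_region(name: str) -> str:
    """Delete regional information if the pattern matches.

    If nothing matches, the name is returned unchanged, which corresponds
    to "proceeding directly to step five" in claim 1.
    """
    stripped, count = REGION_RE.subn(" ", name)
    if count == 0:
        return name
    # Collapse the whitespace left behind by the deletion.
    return re.sub(r"\s+", " ", stripped).strip()
```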
5. A customs data cleaning and merging device, characterized by comprising a bill of lading data extraction module, a company name information extraction module, a first judgment module, a first matching module, a second matching module and a second judgment module;
the bill of lading data extraction module is used for extracting valid bill of lading data from the original customs data;
the company name information extraction module is used for extracting company name information from the bill of lading data extracted by the bill of lading data extraction module;
the first judgment module is used for judging whether the company name information extracted by the company name information extraction module is valid company name information, and if so, triggering the first matching module and the second matching module to operate;
the first matching module is used for matching the regional information in the company name information according to a preset rule, and deleting the regional information from the company name information if the matching succeeds;
the second matching module is used for matching the suffix information in the company name information according to a preset rule, and converting the suffix information into suffix information in a standard format if the matching succeeds;
the second judgment module is used for determining whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module and the second matching module, and if so, merging the data.
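The way the six modules of claim 5 hand data to one another can be sketched as a small class whose fields are plain callables. Everything here is illustrative: the class name `CleaningDevice`, the field names and the decision to skip records with invalid names are assumptions, since this claim only says what happens when the name is valid.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class CleaningDevice:
    extract_bills: Callable[[Iterable[dict]], Iterable[dict]]  # bill of lading data extraction module
    extract_name: Callable[[dict], Optional[str]]              # company name information extraction module
    is_valid_name: Callable[[Optional[str]], bool]             # first judgment module
    strip_region: Callable[[str], str]                         # first matching module
    normalize_suffix: Callable[[str], str]                     # second matching module
    merge: Callable[[dict], None]                              # second judgment module

    def run(self, raw_rows: Iterable[dict]) -> int:
        """Process every valid bill of lading; return how many were merged."""
        processed = 0
        for bill in self.extract_bills(raw_rows):
            name = self.extract_name(bill)
            if not self.is_valid_name(name):
                continue  # assumption: invalid names are simply skipped
            cleaned = self.normalize_suffix(self.strip_region(name))
            self.merge({**bill, "company": cleaned})
            processed += 1
        return processed
```

Passing the modules in as callables keeps each one independently replaceable, for example to swap the rule tables or the merge strategy without touching the control flow.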
6. The device for cleaning and merging customs data according to claim 5, wherein the second judgment module comprises a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;
the data writing unit is used for storing the bill of lading data processed by the first matching module and the second matching module into the database;
the data sorting unit is used for sorting all bill of lading data in the database by company name information;
the company name information extraction unit is used for extracting, after the sorting is completed, the company name information from the newly stored bill of lading data and from the bill of lading data adjacent to it;
the similarity calculation unit is used for calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data;
and the data merging unit is used for merging the newly stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit is greater than a threshold.
7. The device for cleaning and merging customs data according to claim 6, wherein the similarity calculation algorithm in the similarity calculation unit is the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
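To give a sense of what an n-gram based measure looks like, the sketch below scores two names by the overlap of their character bigrams using the Dice coefficient. This is a simplified stand-in chosen for brevity, not necessarily the N-gram distance algorithm the claim has in mind; the padding character and the function names are assumptions.

```python
from collections import Counter


def ngrams(text: str, n: int = 2) -> Counter:
    """Multiset of character n-grams, padded so edge characters count too."""
    pad = "_" * (n - 1)
    padded = f"{pad}{text}{pad}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram multisets: 1.0 identical, 0.0 disjoint."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0  # both inputs produced no n-grams
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total
```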
8. The device for cleaning and merging customs data according to claim 5, 6 or 7, wherein the matching of the regional information in the company name information according to the preset rule by the first matching module is implemented by regular expression matching.
9. A customs data cleaning and merging terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1-4 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911057701.8A CN110795425B (en) | 2019-10-31 | 2019-10-31 | Customs data cleaning and merging method, device, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911057701.8A CN110795425B (en) | 2019-10-31 | 2019-10-31 | Customs data cleaning and merging method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110795425A (en) | 2020-02-14 |
CN110795425B (en) | 2023-04-28 |
Family
ID=69442388
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911057701.8A Active CN110795425B (en) | 2019-10-31 | 2019-10-31 | Customs data cleaning and merging method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110795425B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111429068B (en) * | 2020-03-31 | 2021-04-23 | 天津市商务局(天津市人民政府口岸服务办公室) | Data supervision method, device and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006098529A1 (en) * | 2005-03-16 | 2006-09-21 | Joo Seok Kim | A method of trade-related data exchanging and service providing among different kinds of systems |
CN107066599A (en) * | 2017-04-20 | 2017-08-18 | 北京文因互联科技有限公司 | A kind of similar enterprise of the listed company searching classification method and system of knowledge based storehouse reasoning |
CN109858538A (en) * | 2019-01-24 | 2019-06-07 | 科大国创软件股份有限公司 | A kind of customs's classification error-detecting method based on correlation rule |
Non-Patent Citations (1)
Title |
---|
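Liu Jian; Li Yewei. Application of the database association matching method in verifying service-trade ocean freight charges (数据库关联匹配法在服贸海运费核查中的应用). China Forex (中国外汇). 2015, (11), full text. * |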
Also Published As
Publication number | Publication date |
---|---|
CN110795425A (en) | 2020-02-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7324998B2 (en) | Document search methods and systems | |
US10705748B2 (en) | Method and device for file name identification and file cleaning | |
CN110781246A (en) | Enterprise association relationship construction method and system | |
CN109656999B (en) | Method, device, storage medium and apparatus for synchronizing large data volume data | |
CN105187242B (en) | A kind of user's anomaly detection method excavated based on variable-length pattern | |
CN110413569A (en) | Archives of paper quality electronization archiving method, device and terminal device | |
CN101751475B (en) | Method for compressing section records and device therefor | |
CN110888981A (en) | Title-based document clustering method and device, terminal equipment and medium | |
CN110795425B (en) | Customs data cleaning and merging method, device, equipment and medium | |
CN110825817B (en) | Enterprise suspected association judgment method and system | |
CN110704432A (en) | Data index establishing method and device, readable storage medium and electronic equipment | |
CN116032741A (en) | Equipment identification method and device, electronic equipment and computer storage medium | |
CN107590233A (en) | A kind of file management method and device | |
CN113821630A (en) | Data clustering method and device | |
CN110705297A (en) | Enterprise name-identifying method, system, medium and equipment | |
CN113486157B (en) | Method for decrypting encrypted mobile phone number | |
CN109408727B (en) | Intelligent user attention information recommendation method and system based on multidimensional perception data | |
JPH05257982A (en) | Character string recognizing method | |
CN112559775A (en) | Patent information management method and system and computer equipment | |
CN109034938B (en) | Information rapid screening and matching method and device, electronic equipment and storage medium | |
JP2018163405A (en) | Automated sorting device for transaction detail, automated sorting method and program for automated sorting | |
CN109783607A (en) | A method of the match cognization magnanimity keyword in any text | |
CN111325025B (en) | Shop name mining method and device | |
CN114064621B (en) | Method for judging repeated data | |
CN112882887B (en) | Dynamic establishment method for service fault model in cloud computing environment | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||