CN110795425A

CN110795425A - Method, device, equipment and medium for cleaning and merging customs data

Info

Publication number: CN110795425A
Application number: CN201911057701.8A
Authority: CN
Inventors: 李超
Original assignee: Shanghai Yiyuan Network Technology Co Ltd
Current assignee: Shanghai Yiyuan Network Technology Co Ltd
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2020-02-14
Anticipated expiration: 2039-10-31
Also published as: CN110795425B

Abstract

The invention discloses a method, a device, equipment and a medium for cleaning and merging customs data. The method comprises: extracting valid bill of lading data from original customs data; extracting company name information from the bill of lading data; judging whether the extracted company name information is valid; matching region information in the company name information according to a preset rule and, if the match succeeds, deleting the region information; matching suffix information in the company name information according to a preset rule and, if the match succeeds, converting the suffix information into a standard-format suffix; and judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the preceding steps and, if so, merging the data. The invention extracts valid bill of lading data from the original customs data, then cleans, processes and merges it to produce bill of lading data with a uniform format and consolidated information, making it easier for users to find useful information.

Description

Method, device, equipment and medium for cleaning and merging customs data

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a medium for cleaning and merging customs data.

Background

Customs data is the import and export statistical data that customs authorities generate in the course of their import and export statistics work. Mining this data in depth helps enterprises follow market trends promptly and comprehensively and analyze business conditions in overseas markets.

However, the original customs data has the following problems:

firstly, the volume of original customs data is large, which makes it difficult for a user to query useful information;

secondly, the customs data covers many trading countries, which makes the data complex;

thirdly, the customs data contains a large amount of junk information.

It is therefore very difficult for a user to find useful information by processing the original customs data directly.

Disclosure of Invention

In view of the above deficiencies in the prior art, the technical problem to be solved by the present invention is to provide a customs data cleaning and merging strategy that extracts valid bill of lading data from the original customs data and then cleans, processes and merges it, so as to generate bill of lading data with a uniform format and consolidated information, making it easier for users to find useful information.

In order to solve the technical problem, the first aspect of the invention discloses a method for cleaning and merging customs data, which comprises the following steps:

step one, extracting valid bill of lading data from original customs data;

step two, extracting company name information in the bill of lading data;

step three, judging whether the extracted company name information is valid company name information; if yes, entering step four, and if not, entering step seven;

step four, matching the region information in the company name information according to a preset rule; if the matching succeeds, deleting the region information in the company name information and then entering step five, and if the matching fails, directly entering step five;

step five, matching suffix information in the company name information according to a preset rule; if the matching succeeds, converting the suffix information into standard-format suffix information and then entering step six, and if the matching fails, directly entering step six;

step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five, and if so, merging the data;

and step seven, extracting the next valid bill of lading data from the original customs data, and entering step two.
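The control flow of steps one to seven can be sketched as a simple loop. This is only an illustrative sketch: the function name `clean_and_merge`, the `company_name` key and the four callables passed in are hypothetical placeholders for the checks and rules the method describes, not names taken from the source.

```python
from typing import Callable, Iterable


def clean_and_merge(
    records: Iterable[dict],
    is_valid_name: Callable[[str], bool],
    strip_region: Callable[[str], str],
    normalize_suffix: Callable[[str], str],
    store_and_merge: Callable[[dict], None],
) -> None:
    """Run steps two to six on every extracted record (hypothetical layout)."""
    for record in records:                      # steps one and seven: next valid record
        name = record.get("company_name", "")   # step two: extract the company name
        if not is_valid_name(name):             # step three: invalid names are skipped
            continue
        name = strip_region(name)               # step four: drop region information
        name = normalize_suffix(name)           # step five: standardize the suffix
        store_and_merge({**record, "company_name": name})  # step six


# Toy rules, for demonstration only.
stored = []
clean_and_merge(
    [{"company_name": "Shanghai Acme Company Limited"},
     {"company_name": "not available"}],
    is_valid_name=lambda n: n.lower() != "not available",
    strip_region=lambda n: n.replace("Shanghai ", "", 1),
    normalize_suffix=lambda n: n.replace("Company Limited", "Co., Ltd."),
    store_and_merge=stored.append,
)
print(stored)
```

Passing the rules in as callables keeps the loop itself independent of how each rule is implemented.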

In the above method for cleaning and merging customs data, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five includes:

step 601, directly storing the bill of lading data processed in step five into the database;

step 602, sorting all bill of lading data in the database according to company name information;

step 603, after the sorting is finished, extracting the company name information in the bill of lading data adjacent to the latest stored bill of lading data, and extracting the company name information in the latest stored bill of lading data;

step 604, calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data; and if the similarity is greater than the threshold value, merging the latest stored bill of lading data with the adjacent bill of lading data.

In the above method for cleaning and merging customs data, the similarity calculation in step 604 is implemented by a Levenshtein Distance algorithm, an NGram Distance algorithm, or a Jaro-Winkler Distance algorithm.

In the above method for cleaning and merging customs data, the matching of the region information in the company name information according to the preset rule in step four is implemented by regular expression matching.

A second aspect of the invention discloses a customs data cleaning and merging device, which comprises a bill of lading data extraction module, a company name information extraction module, a first judgment module, a first matching module, a second matching module and a second judgment module;

the bill of lading data extraction module is used for extracting valid bill of lading data from the original customs data;

the company name information extraction module is used for extracting company name information in the bill of lading data extracted by the bill of lading data extraction module;

the first judgment module is used for judging whether the company name information extracted by the company name information extraction module is valid company name information, and if so, triggering the first matching module and the second matching module to operate;

the first matching module is used for matching the region information in the company name information according to a preset rule, and if the matching succeeds, deleting the region information in the company name information;

the second matching module is used for matching suffix information in the company name information according to a preset rule, and if the matching succeeds, converting the suffix information into standard-format suffix information;

and the second judgment module is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module and the second matching module, and if so, merging the data and storing the merged bill of lading data into the database.

In the above customs data cleaning and merging device, the second judgment module comprises: a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;

the data writing unit is used for storing the bill of lading data processed by the first matching module and the second matching module into a database;

the data sorting unit is used for sorting all bill of lading data in the database according to company name information;

the company name information extraction unit is used for extracting, after the sorting is finished, the company name information in the bill of lading data adjacent to the latest stored bill of lading data and the company name information in the latest stored bill of lading data;

the similarity calculation unit is used for calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data;

and the data merging unit is used for merging the latest stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit is greater than a threshold value.

In the above customs data cleaning and merging device, the similarity calculation algorithm in the similarity calculation unit is a Levenshtein Distance algorithm, an NGram Distance algorithm or a Jaro-Winkler Distance algorithm.

In the above customs data cleaning and merging device, the first matching module matches the region information in the company name information according to the preset rule by regular expression matching.

In a third aspect of the present invention, a customs data cleaning and merging terminal device is disclosed, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to the first aspect of the present invention when executing the computer program.

A fourth aspect of the invention discloses a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method as disclosed in the first aspect of the invention.

Compared with the prior art, the invention has the following advantages:

1. The invention extracts valid bill of lading data from the original customs data and judges the company name information in the bill of lading data, thereby cleaning out bill of lading data with invalid company name information and reducing the data volume.

2. The invention processes the company name information of the bill of lading data, deleting the region information in the company name information and converting the suffix information into a standard-format suffix, so that company names share a uniform format and the trade name within the company name stands out, which improves the accuracy of the similarity calculation in the subsequent data merging.

3. The invention uses the similarity of the company name information to judge whether bill of lading data can be merged, so that bill of lading data of companies in different regions that share the same trade name can be merged, the number of bill of lading records is reduced, and a single record reflects more customs trade information.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a flow chart of a method for data cleansing and merging according to the present invention.

FIG. 2 is a flowchart of step six of the data cleansing and merging method of the present invention.

FIG. 3 is a block diagram of a data cleansing and merging apparatus according to the present invention.

FIG. 4 is a block diagram of the second judgment module of the data cleansing and merging apparatus according to the present invention.

Detailed Description

As shown in fig. 1, a method for cleaning and merging customs data includes the following steps:

Step one, extracting valid bill of lading data from original customs data;

valid data is extracted from the original customs data using a preset data field comparison table to form a complete valid bill of lading record; for example, the buyer is found from the importer field and the supplier from the exporter field. As a result, whichever country's customs the original data comes from, the extracted bill of lading data is in a standard format, which facilitates later data merging.
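A data field comparison table of this kind can be sketched as a per-source mapping from raw field names to standard field names. The table contents, the source codes `"us"` and `"in"`, and the function name `extract_bill_of_lading` below are all hypothetical; the source only states that such a table exists and that importer and exporter fields map to buyer and supplier.

```python
# Hypothetical comparison tables: raw field name -> standard field name.
FIELD_TABLES = {
    "us": {"Consignee": "buyer", "Shipper": "supplier", "Arrival Date": "date"},
    "in": {"IMPORTER": "buyer", "EXPORTER": "supplier", "DATE": "date"},
}


def extract_bill_of_lading(raw: dict, source: str) -> dict:
    """Keep only mapped, non-empty fields and rename them to the standard names."""
    table = FIELD_TABLES[source]
    return {
        standard: raw[field].strip()
        for field, standard in table.items()
        if raw.get(field)  # fields that are missing or empty are dropped
    }


record = extract_bill_of_lading(
    {"IMPORTER": " Acme Ltd ", "EXPORTER": "Foo Trading", "HS_CODE": "8481"}, "in"
)
print(record)
```

Fields that are not listed in the table, such as `HS_CODE` in this toy record, are simply ignored, so records from different customs sources end up with the same keys.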

Step two, extracting company name information in the bill of lading data;

Step three, judging whether the extracted company name information is valid company name information; if yes, entering step four, and if not, entering step seven;

whether the extracted company name information is valid is judged using a preset "invalid company library"; for example, if the company name is "not available", the company is an invalid company.
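A minimal sketch of such a check, assuming the invalid company library is a set of lowercase placeholder strings. Only "not available" comes from the text above; the other entries and the name `is_valid_company_name` are illustrative assumptions.

```python
# Hypothetical library contents; only "not available" appears in the source.
INVALID_COMPANY_LIBRARY = {"not available", "n/a", "unknown"}


def is_valid_company_name(name: str) -> bool:
    """Return False for empty names and for names listed in the invalid library."""
    normalized = " ".join(name.lower().split())  # lowercase, collapse whitespace
    return bool(normalized) and normalized not in INVALID_COMPANY_LIBRARY


print(is_valid_company_name("  Not   Available "))  # False
print(is_valid_company_name("XXX Co., Ltd."))       # True
```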

Step four, applying regular expression matching to the region information in the company name information; if the matching succeeds, deleting the region information in the company name information and then entering step five, and if the matching fails, directly entering step five;

the region information is matched because most company names contain it, for example "Shang Hai XXX Co., Ltd.". After regular expression matching, "Shang Hai" is identified as the company's region information. Deleting the region information helps the later similarity calculation: a large group company may set up branches or subsidiaries in different regions, and the trade names of those branches or subsidiaries are mostly similar or identical, so once the region information is deleted, the bill of lading data of branches or subsidiaries belonging to the same group company can be merged more easily.
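One way to sketch the region rule is a regular expression built from a list of known region names. The region list, the assumption that the region appears as a leading prefix, and the name `strip_region` are illustrative choices; the source does not specify the actual pattern.

```python
import re

# Hypothetical region list; a real rule set would be much larger.
REGIONS = ["Shang Hai", "Shanghai", "Bei Jing", "Beijing", "Shenzhen"]

# Match a region only as a whole word at the start of the name (an assumption).
REGION_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(r) for r in REGIONS) + r")\b\s*",
    re.IGNORECASE,
)


def strip_region(name: str) -> str:
    """Delete a leading region name if one matches; otherwise return the name unchanged."""
    return REGION_RE.sub("", name, count=1)


print(strip_region("Shang Hai XXX Co., Ltd."))  # XXX Co., Ltd.
print(strip_region("XXX Co., Ltd."))            # XXX Co., Ltd.
```

The word boundary `\b` keeps the rule from cutting into names that merely begin with the same letters as a region.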

Step five, applying regular expression matching to the suffix information in the company name information; if the matching succeeds, converting the suffix information into standard-format suffix information and then entering step six, and if the matching fails, directly entering step six;

some company names spell the suffix out in full, for example "xxx Company Limited", so "xxx Company Limited" is converted into "xxx co.
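Suffix standardization can be sketched as an ordered list of pattern and replacement pairs. The rules below and the choice of "Co., Ltd." as the standard form are assumptions made for illustration; the source does not give its full rule set or spell out the exact standard format.

```python
import re

# Hypothetical rules, tried in order; the first one that matches wins.
SUFFIX_RULES = [
    (re.compile(r"\bcompany\s+limited\s*$", re.IGNORECASE), "Co., Ltd."),
    (re.compile(r"\bco\.?\s*,?\s*ltd\.?\s*$", re.IGNORECASE), "Co., Ltd."),
    (re.compile(r"\blimited\s*$", re.IGNORECASE), "Ltd."),
]


def normalize_suffix(name: str) -> str:
    """Rewrite a trailing company suffix into the assumed standard form."""
    name = name.strip()
    for pattern, standard in SUFFIX_RULES:
        new_name, replaced = pattern.subn(standard, name)
        if replaced:
            return new_name
    return name  # no rule matched: keep the name as it is


print(normalize_suffix("XXX Company Limited"))  # XXX Co., Ltd.
print(normalize_suffix("XXX CO LTD"))           # XXX Co., Ltd.
```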

Step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five, and if so, merging the data;

mergeable bill of lading data is merged in order to reduce the number of bill of lading records and consolidate information, so that a user who finds the bill of lading data of a company can see that company's trade information at every customs authority without querying each one separately.
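The source does not describe the layout of a merged record. As one possible sketch, assuming each record holds a company name and a list of shipments (the keys `company_name`, `aliases` and `shipments` and the function `merge_records` are all hypothetical), merging could fold the shipments of both records into a single entry:

```python
def merge_records(existing: dict, incoming: dict) -> dict:
    """Fold `incoming` into `existing`, keeping one entry with all shipments."""
    aliases = set(existing.get("aliases", []))
    aliases.update(incoming.get("aliases", []))
    aliases.add(incoming["company_name"])       # remember the alternative spelling
    aliases.discard(existing["company_name"])   # the canonical name is not an alias
    return {
        "company_name": existing["company_name"],
        "aliases": sorted(aliases),
        "shipments": existing.get("shipments", []) + incoming.get("shipments", []),
    }


a = {"company_name": "XXX Co., Ltd.", "shipments": [{"customs": "US", "goods": "valves"}]}
b = {"company_name": "XXX Co. Ltd.", "shipments": [{"customs": "IN", "goods": "pumps"}]}
merged = merge_records(a, b)
print(merged["aliases"], len(merged["shipments"]))
```

With this layout a single lookup of the merged record returns the company's shipments across all customs sources.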

Steps two to six are repeated for each extracted bill of lading record, so that the processed bill of lading data stored in the database has a uniform format and consolidated information.

As shown in fig. 2, in this embodiment, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five includes:

Step 601, directly storing the bill of lading data processed in step five into the database;

whether or not the database contains mergeable bill of lading data, the bill of lading data processed in step five must be stored in the database. It is therefore stored first, and only then is it judged whether a merge operation is needed.

Step 602, sorting all bill of lading data in the database according to company name information;

Step 603, after the sorting is finished, extracting the company name information in the bill of lading data adjacent to the latest stored bill of lading data, and extracting the company name information in the latest stored bill of lading data;

because the subsequent similarity calculation is based mainly on the company name information, all bill of lading data in the database is first sorted by company name information. Once the sorting is finished, the bill of lading data most likely to be mergeable with the latest stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.

Step 604, calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data; and if the similarity is greater than the threshold value, merging the latest stored bill of lading data with the adjacent bill of lading data. The threshold value may be 80% to 90%.

After the sorting is finished, if the latest stored bill of lading data is ranked first, the company name similarity is calculated between it and the bill of lading data immediately after it; if the latest stored bill of lading data is ranked last, the company name similarity is calculated between it and the bill of lading data immediately before it;

if the latest stored bill of lading data is ranked neither first nor last, the company name similarity is first calculated between it and the bill of lading data immediately before it, and if the similarity is greater than the threshold value, the data is merged; if the similarity is not greater than the threshold value, the company name similarity is then calculated between the latest stored bill of lading data and the bill of lading data immediately after it.
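The neighbor selection described above can be sketched as follows. The function name `find_merge_neighbor` and the default threshold of 0.85 (inside the 80% to 90% range given above) are assumptions, and `difflib.SequenceMatcher.ratio` from the Python standard library is used here only as a stand-in similarity measure; it is not one of the three algorithms named in this embodiment.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional


def ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not one of the algorithms named in the source."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_merge_neighbor(
    sorted_names: List[str],
    index: int,
    threshold: float = 0.85,
    similarity: Callable[[str, str], float] = ratio,
) -> Optional[int]:
    """Return the index of the adjacent name to merge with, or None if neither qualifies."""
    candidates = []
    if index > 0:
        candidates.append(index - 1)   # the predecessor is compared first
    if index < len(sorted_names) - 1:
        candidates.append(index + 1)   # then the successor
    for j in candidates:
        if similarity(sorted_names[index], sorted_names[j]) > threshold:
            return j
    return None


names = sorted(["Zenith Co., Ltd.", "Acme Tradings Co., Ltd.", "Acme Trading Co., Ltd."])
print(find_merge_neighbor(names, names.index("Acme Tradings Co., Ltd.")))
print(find_merge_neighbor(names, names.index("Zenith Co., Ltd.")))
```

A record ranked first has only a successor and a record ranked last has only a predecessor, so the same loop covers all three cases described above.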

In this embodiment, the similarity calculation in step 604 is implemented by a Levenshtein Distance algorithm, an NGram Distance algorithm, or a Jaro-Winkler Distance algorithm.

The Levenshtein Distance algorithm, the NGram Distance algorithm and the Jaro-Winkler Distance algorithm are conventional algorithms and are not described in detail herein.
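For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below computes it with the standard dynamic programming recurrence; dividing by the longer string's length to obtain a similarity between 0 and 1 is an assumption, since the source does not state how the distance is turned into a percentage.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus the distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein_similarity("XXX Co., Ltd.", "XXX Co., Ltd."))  # 1.0
```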

As shown in fig. 3, a customs data cleaning and merging device includes a bill of lading data extraction module 1, a company name information extraction module 2, a first judgment module 3, a first matching module 4, a second matching module 5, and a second judgment module 6;

the bill of lading data extraction module 1 is used for extracting valid bill of lading data from the original customs data;

the bill of lading data extraction module 1 extracts valid data from the original customs data using a preset data field comparison table to form a complete valid bill of lading record; for example, the buyer is found from the importer field and the supplier from the exporter field. As a result, whichever country's customs the original data comes from, the extracted bill of lading data is in a standard format, which facilitates later data merging.

The company name information extraction module 2 is used for extracting company name information in the bill of lading data extracted by the bill of lading data extraction module 1;

the first judgment module 3 is used for judging whether the company name information extracted by the company name information extraction module 2 is valid company name information, and if so, triggering the first matching module 4 and the second matching module 5 to operate;

the first judgment module 3 judges whether the extracted company name information is valid using a preset "invalid company library"; for example, if the company name is "not available", the company is an invalid company.

The first matching module 4 is configured to apply regular expression matching to the region information in the company name information, and to delete the region information in the company name information if the matching succeeds;

The second matching module 5 is configured to apply regular expression matching to the suffix information in the company name information, and to convert the suffix information into standard-format suffix information if the matching succeeds;

The second judgment module 6 is configured to judge whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module 4 and the second matching module 5, and if so, to merge the data.

As shown in fig. 4, in this embodiment, the second judgment module 6 includes: a data writing unit 61, a data sorting unit 62, a company name information extraction unit 63, a similarity calculation unit 64, and a data merging unit 65.

The data writing unit 61 is used for storing the bill of lading data processed by the first matching module 4 and the second matching module 5 into a database;

the data sorting unit 62 is used for sorting all bill of lading data in the database according to company name information;

the data sorting unit 62 runs the sort once each time a new bill of lading record is stored in the database.

The company name information extraction unit 63 is configured to extract, after the sorting is finished, the company name information in the bill of lading data adjacent to the latest stored bill of lading data and the company name information in the latest stored bill of lading data;

after the data sorting unit 62 has run the sort, the latest stored bill of lading data has its own position in the sequence. Because the subsequent similarity calculation is based on the company name information, all bill of lading data in the database is sorted by company name information, and once the sorting is finished, the bill of lading data most likely to be mergeable with the latest stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.
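The embodiment re-sorts the database each time a record is stored. Purely as an illustrative alternative, and not what the source describes, an in-memory list that is already sorted can be kept sorted by inserting each new name at its sorted position with the standard library `bisect` module, which also yields the new record's position directly. The function name `insert_sorted` is hypothetical.

```python
import bisect
from typing import List


def insert_sorted(names: List[str], new_name: str) -> int:
    """Insert new_name into an already sorted list and return its index."""
    index = bisect.bisect_left(names, new_name)
    names.insert(index, new_name)
    return index


names = ["Acme Co., Ltd.", "Zenith Co., Ltd."]
position = insert_sorted(names, "Bolt Co., Ltd.")
# The adjacent records are the candidates for the similarity check.
print(position, names[position - 1], names[position + 1])
```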

The similarity calculation unit 64 is used for calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data;

the data merging unit 65 is used for merging the latest stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit 64 is greater than the threshold value.

In this embodiment, the similarity calculation algorithm in the similarity calculation unit 64 is a Levenshtein Distance algorithm, an NGram Distance algorithm, or a Jaro-Winkler Distance algorithm.

A customs data cleaning and merging terminal device comprises a memory, a processor and a computer program that is stored in the memory and can run on the processor, wherein the processor implements the steps of the above customs data cleaning and merging method when executing the computer program.

A computer-readable storage medium stores a computer program which, when executed by a processor, carries out the steps of the above customs data cleaning and merging method.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and all simple modifications, changes and equivalent structural changes made to the above embodiment according to the technical spirit of the present invention still fall within the protection scope of the technical solution of the present invention.

Claims

1. A method for cleaning and merging customs data, characterized by comprising the following steps:

step one, extracting valid bill of lading data from original customs data;

step two, extracting company name information in the bill of lading data;

2. The method for cleaning and merging customs data according to claim 1, wherein judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five comprises:

3. The method for cleaning and merging customs data according to claim 2, wherein the similarity calculation in step 604 is performed by a Levenshtein Distance algorithm, an NGram Distance algorithm, or a Jaro-Winkler Distance algorithm.

4. The method for cleaning and merging customs data according to claim 1, 2 or 3, wherein the matching of the region information in the company name information according to a preset rule in step four is implemented by regular expression matching.

5. A customs data cleaning and merging device is characterized by comprising a bill of lading data extraction module, a company name information extraction module, a first judgment module, a first matching module, a second matching module and a second judgment module;

and the second judgment module is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module and the second matching module, and if so, merging the data.

6. The customs data cleaning and merging device according to claim 5, wherein the second judgment module comprises: a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;

7. The customs data cleaning and merging device according to claim 6, wherein the similarity calculation algorithm in the similarity calculation unit is a Levenshtein Distance algorithm, an NGram Distance algorithm, or a Jaro-Winkler Distance algorithm.

8. The customs data cleaning and merging device according to claim 5, 6 or 7, wherein the first matching module matches the region information in the company name information according to a preset rule by regular expression matching.

9. A customs data cleansing and merging terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, wherein said processor when executing said computer program implements the steps of the method according to any of claims 1-4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.