CN112434691A - HS code matching and displaying method and system based on intelligent analysis and identification and storage medium - Google Patents

HS code matching and displaying method and system based on intelligent analysis and identification and storage medium

Info

Publication number
CN112434691A
CN112434691A (application CN202011404276.8A)
Authority
CN
China
Prior art keywords
text
judged
hierarchy
recognition
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011404276.8A
Other languages
Chinese (zh)
Inventor
张东峰
冯玉静
陆欢旺
万晓磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sandao Intelligent Technology Co ltd
Original Assignee
Shanghai Sandao Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sandao Intelligent Technology Co ltd
Priority to CN202011404276.8A
Publication of CN112434691A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/903 Querying
    • G06F 16/90335 Query processing
    • G06F 16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of form generation and discloses an HS code matching and display method, system, and storage medium based on intelligent parsing and recognition. The method comprises the following steps: acquiring an object to be judged; correcting imaging problems; detecting the text in the object to be judged; recognizing the text content; extracting the required fields and/or elements from the text recognition result to generate description information of the object to be judged; and judging the category of the object to be judged according to the acquired description information and pre-trained atlas data, and entity-linking it with the atlas data. The pre-trained atlas data are generated by training a model on the provided HS code document data in combination with a semantic library, and the AI algorithm continuously learns and is optimized through external data feedback. The method and system can serve as an intelligent search knowledge engine that meets the need for rapid knowledge acquisition in the customs declaration and pre-classification business field and, combined with character recognition, can accurately map information to the corresponding columns.

Description

HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
Technical Field
The application relates to the technical field of form generation, in particular to an HS code matching and displaying method, system and storage medium based on intelligent analysis and recognition.
Background
The HS code (Harmonized System code, the internationally unified commodity classification code) is established under the coding coordination system of the World Customs Organization and serves as the unified standard for managing import and export records and the duty and tax-rebate rates of various products.
HS codes cover a wide variety of categories, comprising 22 sections, 96 chapters, and tens of thousands of sub-classes in total. The information available for HS code classification at customs clearance is mainly the product name and the product specification (i.e., the declaration elements), and the total number of codes is on the order of ten thousand. Therefore, when a user has a knowledge requirement, it is natural to search a domain-specific knowledge base for the answer to the corresponding question. At present, knowledge in the field of customs declaration and classification is relatively mature, but comparatively outdated electronic documents are still used for knowledge representation, organization, and management; the relations among pieces of knowledge are not well established, and information islands exist between them. Meanwhile, in the pre-classification process many classification systems are quite similar, the same question may have similar answers in several chapters, and a traditional database search has difficulty matching the answer accurately. In addition, existing systems require commodity information to be entered manually: the system classifies the commodity according to the entered information and fills the available information into the columns corresponding to the declaration elements, yet manually entered commodity information is often incomplete, so the accuracy of the results given by the system cannot be kept at a high level.
Disclosure of Invention
In order to build an intelligent search knowledge engine that meets the need for rapid knowledge acquisition in the field of customs declaration and classification services, to provide knowledge management and knowledge-base maintenance that satisfy the requirement of intelligent knowledge updating, and to improve the accuracy of the output results, the present application provides an HS code matching and display method, system, and storage medium based on intelligent parsing and recognition.
In a first aspect, the present application provides a method for matching and displaying HS codes based on intelligent parsing and recognition, including:
acquiring the object to be judged, which includes a picture class and a non-picture class, converting the non-picture class into a picture format, and storing it together with the picture-class files;
file parsing, namely analyzing the type and format of the object to be judged;
image preprocessing, namely correcting the image imaging problem of the object to be judged;
character detection, detecting the position, range and layout of a text in an object to be judged;
character recognition, namely recognizing the text content on the basis of text detection;
text extraction, namely extracting required fields and/or elements from a text recognition result to generate object description information to be judged;
judging the category of the object to be judged according to the acquired description information of the object to be judged and the pre-trained atlas data, and performing entity link with the atlas data;
the pre-trained atlas data is combined with semantic library training to generate a model according to provided HS coding document data, and an AI algorithm is continuously learned and optimized through external data feedback.
By adopting this technical scheme, model training based on deep learning is assisted by natural language processing rather than simply querying a database, so new commodities (commodities that have not appeared in the database) can also be classified. Meanwhile, through external data feedback, the AI algorithm can keep learning on its own and the atlas data grow over time: as users use the system, it continuously learns, optimizes itself, and becomes stronger. Ultimately, an intelligent search knowledge engine is built that meets the need for rapid knowledge acquisition in the customs declaration and pre-classification business field, while also satisfying knowledge management and knowledge-base maintenance with intelligent knowledge updating. In addition, by combining character recognition, the method can accurately map information to the corresponding columns, avoiding situations where a client judges the information inaccurately and provides wrong information.
In some embodiments, the image pre-processing comprises:
inputting an image of a file to be processed into a pre-trained image correction network for geometric change and/or distortion correction to obtain a corrected first target image;
performing small-angle correction on the first target image through a CV algorithm and an affine transformation matrix to obtain a second target image;
removing the blur of the second target image through a denoising algorithm to obtain a third target image;
and carrying out binarization processing on the third target image to obtain a binarized image.
In some embodiments, the text detection comprises:
inputting the binary image into a pre-trained feature extraction network;
extracting output information of at least two convolution layers in the feature extraction network, and fusing the output information;
inputting the fused information into a full connection layer in the feature extraction network, and outputting 2k vertical direction coordinates and coordinate scores of k anchors corresponding to the text region of the binary image and k boundary regression results to realize text positioning and obtain a rectangular text box.
In some embodiments,
the character recognition comprises the following steps: performing character recognition on text contents in the rectangular text box through a pre-trained character recognition network to acquire text content information;
the text extraction comprises:
generating a basic semantic analysis engine based on a preset semantic database, wherein the semantic database comprises a field basic corpus, a field dictionary and a field knowledge map;
performing field analysis processing on the text content information based on a basic semantic analysis engine;
extracting the required fields and/or elements in the text content based on an extraction data set corresponding to the extraction requirements.
In some embodiments, judging the category of the object to be judged according to the acquired description information of the object to be judged and the pre-trained atlas data comprises:
dividing the categories of the object to be judged into hierarchies;
judging, from top to bottom, the category at each hierarchy according to the acquired object description information and a pre-trained hierarchical classification model corresponding to that hierarchy;
and linking to a unique entity in the atlas data.
In some embodiments, the hierarchical classification model corresponding to each hierarchy is trained by:
selecting a training sample, and extracting characteristic contents of description information of the sample as a query statement;
and matching the extracted query sentences and the corresponding hierarchy categories to train and obtain a hierarchy classification model corresponding to each hierarchy.
In some embodiments, the method further comprises training a hierarchical classification model corresponding to each hierarchy in the following manner:
extracting, based on description information of objects that have already been judged, the characteristic content of that description information as a query statement;
and matching the extracted query sentences and the corresponding hierarchy categories to train and obtain a hierarchy classification model corresponding to each hierarchy.
In some embodiments, determining the hierarchy type corresponding to the hierarchy includes calculating a degree of matching based on ranking learning and semantic features, and performing search ranking.
In a second aspect, the present application provides an HS code matching and displaying system based on intelligent parsing and recognition, including:
the acquisition unit is used for acquiring a file to be processed;
the file analysis unit is used for receiving the file to be processed and analyzing the type and the format of the file to be processed;
the image preprocessing unit is used for correcting the image imaging problem of the analyzed file to be processed;
the character detection unit is used for detecting the position, the range and the layout of the text in the file to be processed on the basis of correcting the image imaging problem;
the character recognition unit is used for recognizing the text content on the basis of text detection;
the text extraction unit extracts required fields and/or elements from the text recognition result and generates object description information to be judged;
the judging unit is used for judging the category of the object to be judged according to the acquired object description information to be judged and the pre-trained atlas data;
the display unit is used for displaying the judgment result of the judging unit; and
a memory and a processor, wherein the memory stores a computer program that can be loaded by the processor to execute the above HS code matching and display method based on intelligent parsing and recognition.
In a third aspect, the present application provides a computer-readable storage medium, which stores a computer program that can be loaded by a processor and execute the above HS code matching and displaying method based on intelligent parsing and recognition.
In summary, the HS code matching and display method, system, and storage medium based on intelligent parsing and recognition provided by the present application include at least one of the following beneficial technical effects:
1. after the knowledge graph is constructed, query paths are automatically extracted to generate corresponding templates, and the matching degree with the question is calculated based on features and techniques such as a learning-to-rank algorithm, semantic features, and knowledge-graph features (popularity), so that database query statements are generated and manually written rules are reduced;
2. the method is based on deep-learning model training and performs semantic recognition and result search with natural language processing, entity disambiguation, and typo-correction capabilities, rather than simply querying a database, so new commodities (commodities that have not appeared in the database) can be classified;
3. according to the provided document data, atlas data are generated by combining a Chinese semantic library with model training; meanwhile, through external data feedback, the AI algorithm keeps learning and growing on its own, and the atlas data are continuously learned and optimized as usage accumulates over time;
4. through character recognition, data can simply be uploaded without the user having to consider what specific content to enter, which reduces difficulty;
5. with the information acquired through character recognition, the system has clear labels for classification, can accurately map information to the corresponding columns, and avoids the client judging the information inaccurately and providing wrong information;
6. the range of information acquisition is broadened, improving the likelihood of obtaining complete information.
Drawings
Fig. 1 is a block diagram of a structure of an HS code matching and displaying system based on intelligent parsing and recognition provided in the present application.
In the figure: 1. acquisition unit; 2. file parsing unit; 3. image preprocessing unit; 4. character detection unit; 5. character recognition unit; 6. text extraction unit; 7. judging unit; 8. display unit; 9. memory; 10. processor.
Detailed Description
The present application is described in further detail below with reference to the attached drawings.
The embodiment of the application provides an HS code matching and displaying method, system and storage medium based on intelligent analysis and identification.
The application discloses an HS code matching and displaying method based on intelligent analysis and identification, which comprises the following steps:
the method comprises the steps of obtaining an object to be judged, wherein a file to be processed comprises a photo class and a non-photo class, the non-photo class comprises a photocopy and a PDF file, meanwhile, the non-photo class is converted into a picture format and is stored together with the photo class file, the input file to be processed is stored in a file library at the same time, and model training is carried out based on manual marking so as to obtain an image correction network, a feature extraction network, a character recognition network and a deep learning extraction data set.
In the embodiment of the application, the file analysis supports the processing of files with JPG, PNG, TIF and PDF formats.
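As an illustrative sketch of this unified picture-format storage (not the disclosed implementation), the conversion could be done with the Pillow and pdf2image libraries; the directory layout and function names below are assumptions:

```python
# Illustrative sketch: normalize JPG/PNG/TIF/PDF inputs into a single image store.
# Assumes pdf2image (which requires poppler) and Pillow are installed.
from pathlib import Path
from PIL import Image
from pdf2image import convert_from_path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def normalize_to_images(src: Path, out_dir: Path) -> list:
    """Convert a to-be-processed file into one or more PNG pages in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    if src.suffix.lower() == ".pdf":
        # Render each PDF page to an image so downstream OCR sees pictures only.
        for i, page in enumerate(convert_from_path(str(src), dpi=300)):
            target = out_dir / f"{src.stem}_page{i}.png"
            page.save(target)
            pages.append(target)
    elif src.suffix.lower() in IMAGE_SUFFIXES:
        target = out_dir / f"{src.stem}.png"
        Image.open(src).convert("RGB").save(target)
        pages.append(target)
    else:
        raise ValueError(f"Unsupported file type: {src.suffix}")
    return pages
```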
Image preprocessing, namely correcting the image imaging problem of the file to be processed; the method specifically comprises the following steps:
inputting the image of the file to be processed into a pre-trained image correction network for geometric transformation and/or distortion correction to obtain a corrected first target image, namely:
regressing the network parameters of the space transformation corresponding to the first target image by utilizing a positioning network in the image correction network;
calculating the position of a pixel point in the corrected first target image in the first target image by using a grid generator in the image correction network and the network parameters;
outputting the corrected first target image by using a sampler in the image correction network and the calculated position;
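The localization-network, grid-generator, and sampler steps described above follow the spatial-transformer pattern; the following PyTorch sketch illustrates that pattern under assumed layer sizes (the disclosed image correction network is not specified in this detail):

```python
# Illustrative spatial-transformer-style correction: a localization network regresses
# affine parameters, a grid generator computes sampling positions, and a sampler
# produces the corrected image. All sizes here are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrectionSTN(nn.Module):
    def __init__(self):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((3, 3)), nn.Flatten(),
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(),
            nn.Linear(32, 6),            # six affine transformation parameters
        )
        # initialize to the identity transform so training starts from "no change"
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.localization(x).view(-1, 2, 3)                   # localization network
        grid = F.affine_grid(theta, x.size(), align_corners=False)    # grid generator
        return F.grid_sample(x, grid, align_corners=False)            # sampler
```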
Then:
performing small-angle correction on the first target image through a CV algorithm and an affine transformation matrix to obtain a second target image;
removing the blur of the second target image through a denoising algorithm to obtain a third target image;
and carrying out binarization processing on the third target image to obtain a binarized image.
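The small-angle correction with an affine transformation matrix, the denoising, and the binarization can be sketched with OpenCV as follows; the skew-estimation heuristic and the parameter values are assumptions for illustration:

```python
# Illustrative OpenCV sketch of small-angle (affine) correction, denoising, and
# binarization. Threshold and parameter choices are assumptions.
import cv2
import numpy as np

def small_angle_deskew(gray: np.ndarray) -> np.ndarray:
    """Estimate a small skew angle from the dark (foreground) pixels and rotate it out."""
    coords = np.column_stack(np.where(gray < 200)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = angle - 90 if angle > 45 else angle            # map to a small angle
    h, w = gray.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)  # affine transformation matrix
    return cv2.warpAffine(gray, m, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

def preprocess(image_path: str) -> np.ndarray:
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)    # first target image as input
    deskewed = small_angle_deskew(gray)                    # second target image
    denoised = cv2.fastNlMeansDenoising(deskewed, h=10)    # third target image
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarized image
    return binary
```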
After image preprocessing, the following steps are carried out:
the method comprises the following steps of character detection, wherein the position, the range and the layout of a text in a file to be processed are detected, the layout analysis, the character line detection and the like are generally included, and the character detection mainly solves the problems of where characters exist and how large the range of the characters exists. The method comprises the following specific steps:
inputting the binary image into a pre-trained feature extraction network;
extracting output information of at least two convolution layers in the feature extraction network, and fusing the output information;
inputting the fused information into a full-connection layer in the feature extraction network, and outputting 2k vertical direction coordinates and coordinate scores of k anchors corresponding to the text region of the binarized image and k boundary regression results to realize text positioning and obtain a rectangular text box;
the processing algorithm adopted by the character detection comprises the following steps: fast-RCNN, Mask-RCNN, FPN, PANET, Unet, IoUNet, YOLO, SSD.
Then the character recognition step is entered.
Character recognition identifies the text content on the basis of character detection; the problem it mainly solves is what each character is. In this embodiment of the application, character recognition is performed on the text content in the rectangular text boxes through a pre-trained character recognition network to obtain the text content information; adoptable processing algorithms include: CRNN, AttentionOCR, RNNLM, and BERT.
Then the required fields and/or elements are extracted from the text recognition result through text extraction, which comprises:
generating a basic semantic analysis engine based on a preset semantic database, wherein the semantic database comprises a field basic corpus, a field dictionary and a field knowledge map;
performing field analysis processing on the text content information based on a basic semantic analysis engine;
extracting the required fields and/or elements in the text content based on an extraction data set corresponding to the extraction requirements, wherein the extraction approaches include: sequence-labeling extraction, deep-learning extraction, and table extraction.
Processing algorithms adopted for text extraction include: CRF, HMM, HAN, DPCNN, BiLSTM+CRF, BERT+CRF, and Regex.
Finally, the result is output and the description information of the object to be judged is generated.
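As a minimal illustration of the Regex end of this extraction stage, the sketch below scans recognized text lines against a small domain dictionary of field patterns; the field names and patterns are assumptions, not the disclosed extraction data set:

```python
# Illustrative regex/dictionary extraction of declaration fields from recognized text.
# Field names and patterns are assumptions for illustration only.
import re

FIELD_PATTERNS = {
    "product_name": re.compile(r"(?:品名|Product\s*Name)[:：]\s*(.+)", re.IGNORECASE),
    "material":     re.compile(r"(?:材质|Material)[:：]\s*(.+)", re.IGNORECASE),
    "brand":        re.compile(r"(?:品牌|Brand)[:：]\s*(.+)", re.IGNORECASE),
    "model":        re.compile(r"(?:型号|Model)[:：]\s*(\S+)", re.IGNORECASE),
}

def extract_fields(recognized_lines: list) -> dict:
    """Scan OCR output line by line and keep the first match for each required field."""
    description = {}
    for line in recognized_lines:
        for field, pattern in FIELD_PATTERNS.items():
            if field not in description:
                m = pattern.search(line)
                if m:
                    description[field] = m.group(1).strip()
    return description

# Usage sketch with hypothetical OCR output
lines = ["品名: 不锈钢保温杯", "材质: 304不锈钢", "Brand: ACME"]
print(extract_fields(lines))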
The category of the object to be judged is then judged according to the acquired description information of the object to be judged and the pre-trained atlas data, and the object is entity-linked with the atlas data. The pre-trained atlas data are obtained by training a model, in combination with a semantic library, on the provided HS code document data, and the AI algorithm continuously learns and is optimized through external data feedback.
Judging the category of the object to be judged according to the acquired description information of the object to be judged and the pre-trained atlas data comprises the following steps:
dividing the categories of the object to be judged into hierarchies;
judging, from top to bottom along the divided hierarchies, the category at each hierarchy according to the acquired object description information and the pre-trained hierarchical classification model corresponding to that hierarchy. Judging the category at a hierarchy includes calculating the matching degree based on learning to rank and semantic features and performing search ranking, specifically using the following features:
the character- and word-based static embedding vectors of the description information of the object to be judged and of the query path, and their cosine similarity;
the context-dependent embedding vectors of the description information of the object to be judged and of the query path, and their cosine similarity;
the character- and word-based Jaccard similarity between the description information of the object to be judged and the query path;
the character- and word-based Levenshtein similarity between the description information of the object to be judged and the query path;
the graph embedding vector of the query subgraph;
and, based on the above judgment features, search ranking is performed with a learning-to-rank algorithm.
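The character- and word-level features listed above can be computed roughly as follows; the embedding function is left as an assumption (any character or word embedding model would do), and a trained learning-to-rank model would consume the resulting feature vectors:

```python
# Illustrative computation of matching features: character Jaccard similarity,
# Levenshtein-based similarity, and cosine similarity of embedding vectors.
import numpy as np

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def matching_features(description: str, query_path: str, embed) -> list:
    """Feature vector for one (object description, query path) pair; `embed` is any
    text-embedding function returning a numpy vector (an assumption here)."""
    lev = levenshtein(description, query_path)
    return [
        jaccard(set(description), set(query_path)),              # character Jaccard
        1.0 - lev / max(len(description), len(query_path), 1),   # Levenshtein similarity
        cosine(embed(description), embed(query_path)),           # embedding cosine similarity
    ]
```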
In this embodiment of the application, the capability of entity disambiguation and error correction is also included: for the mentions in the question, error correction, entity disambiguation, and entity linking are performed in a unified manner based on the knowledge graph and a text similarity model.
In addition, when the description information of the object to be judged is entity-linked with the atlas data, the following features are mined:
semantic similarity between the entity name and object description information to be determined;
semantic similarity between the entity two-degree subgraph and object description information to be judged;
semantic similarity between the entity type and the object description information to be determined;
the occurrence frequency and relationship types of the entity in the knowledge graph;
literal features of the mention in the question;
based on the above features, recall ranking is performed using a learning-to-rank algorithm, and the object is finally linked to a unique entity in the atlas data.
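Combining such features into a recall ranking can be sketched as a pointwise scoring model over candidate entities; the gradient-boosting model and the toy training data below are assumptions standing in for a learning-to-rank algorithm trained on real, confirmed links:

```python
# Illustrative pointwise ranking for entity linking: each candidate entity is scored
# from its features (similarity to the description, popularity in the graph, ...);
# the highest-scoring candidate becomes the linked entity.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# The ranker would normally be trained on historical, manually confirmed links;
# a tiny synthetic fit keeps this sketch runnable end to end.
ranker = GradientBoostingRegressor()
X_train = np.random.rand(100, 3)
y_train = X_train.mean(axis=1)                   # placeholder relevance labels
ranker.fit(X_train, y_train)

def link_entity(candidates: list) -> dict:
    """candidates: [{'entity': ..., 'features': [f1, f2, f3]}, ...] -> best candidate."""
    X = np.array([c["features"] for c in candidates])
    scores = ranker.predict(X)
    return candidates[int(np.argmax(scores))]

# Usage sketch with two hypothetical candidate entities
best = link_entity([
    {"entity": "stainless steel vacuum flask", "features": [0.7, 0.6, 0.8]},
    {"entity": "glass bottle", "features": [0.2, 0.3, 0.4]},
])
```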
In this embodiment of the present application, the hierarchical classification model corresponding to each hierarchical level is obtained by training in the following manner:
selecting training samples and extracting the characteristic content of the sample description information as query statements, and/or extracting the characteristic content of the description information of objects that have already been judged as query statements;
and matching the extracted query sentences and the corresponding hierarchy categories to train and obtain a hierarchy classification model corresponding to each hierarchy.
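Training one classifier per hierarchy level from (query statement, level category) pairs, and then predicting top-down, can be sketched with scikit-learn; the TF-IDF plus logistic-regression pipeline is an assumption standing in for the disclosed hierarchical classification model:

```python
# Illustrative top-down hierarchical classification: one text classifier per level,
# trained on (query statement, category-at-that-level) pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_level_models(samples: list) -> dict:
    """samples: [{'query': str, 'levels': [category at level 0, level 1, ...]}, ...]"""
    depth = len(samples[0]["levels"])
    models = {}
    for level in range(depth):
        texts = [s["query"] for s in samples]
        labels = [s["levels"][level] for s in samples]
        model = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
            LogisticRegression(max_iter=1000),
        )
        models[level] = model.fit(texts, labels)
    return models

def classify_top_down(models: dict, query: str) -> list:
    """Predict the category at each level from top to bottom."""
    return [models[level].predict([query])[0] for level in sorted(models)]
```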
When the method is applied, the commodity information to be queried is entered into a search box, and the search is triggered by clicking the search button or pressing Enter. During the search, search suggestions are provided below the search bar for quickly selecting the information the user expects to query. When the searched commodity is not an already-classified commodity in the database, the system performs natural language processing and automatically classifies the commodity according to the trained model.
The application also discloses HS code matching and display system based on intelligent analysis and recognition, which comprises:
an acquisition unit 1, used for acquiring the file to be processed;
the file analysis unit 2 is used for receiving the file to be processed and analyzing the type and the format of the file to be processed;
the image preprocessing unit 3 corrects the image imaging problem of the analyzed file to be processed;
the character detection unit 4 is used for detecting the position, the range and the layout of the text in the file to be processed on the basis of correcting the image imaging problem;
a character recognition unit 5 for recognizing the text content based on the text detection;
the text extraction unit 6 is used for extracting required fields and/or elements from the text recognition result and generating object description information to be judged;
the judging unit 7 is used for judging the category of the object to be judged according to the acquired object description information to be judged and the pre-trained atlas data;
a display unit 8 for displaying the result judged by the judging unit 7; and
a memory 9 and a processor 10, wherein the memory 9 stores a computer program that can be loaded by the processor 10 to execute the above HS code matching and display method based on intelligent parsing and recognition.
The embodiment of the present application also provides a storage medium storing an instruction set that can be loaded by the processor 10 to execute the steps of the above HS code matching and display method based on intelligent parsing and recognition.
Computer storage media include, for example, various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to describe the technical solutions of the present application in detail and to help understand its method and core idea; they should not be construed as limiting the present application. Those skilled in the art should also appreciate that various modifications and substitutions can be made without departing from the scope of the present disclosure.

Claims (10)

1. The HS code matching and displaying method based on intelligent analysis and identification is characterized by comprising the following steps:
acquiring objects to be judged, including a picture class and a non-picture class, converting the non-picture class into a picture format, and storing the non-picture class and the picture class file in a unified manner;
analyzing the file, and analyzing the type and format of the object to be determined;
image preprocessing, namely correcting the image imaging problem of the object to be judged;
character detection, detecting the position, range and layout of a text in an object to be judged;
character recognition, namely recognizing the text content on the basis of text detection;
text extraction, namely extracting required fields and/or elements from a text recognition result to generate object description information to be judged;
judging the category of the object to be judged according to the acquired description information of the object to be judged and the pre-trained atlas data, and performing entity link with the atlas data;
the pre-trained atlas data is combined with semantic library training to generate a model according to provided HS coding document data, and an AI algorithm is continuously learned and optimized through external data feedback.
2. The HS code matching and displaying method based on intelligent analytic recognition according to claim 1, wherein the image preprocessing comprises:
inputting an image of a file to be processed into a pre-trained image correction network for geometric change and/or distortion correction to obtain a corrected first target image;
performing small-angle correction on the first target image through a CV algorithm and an affine transformation matrix to obtain a second target image;
removing the blur of the second target image through a denoising algorithm to obtain a third target image;
and carrying out binarization processing on the third target image to obtain a binarized image.
3. The HS code matching and displaying method based on intelligent parsing and recognition according to claim 1, wherein said text detection comprises:
inputting the binary image into a pre-trained feature extraction network;
extracting output information of at least two convolution layers in the feature extraction network, and fusing the output information;
inputting the fused information into a full connection layer in the feature extraction network, and outputting 2k vertical direction coordinates and coordinate scores of k anchors corresponding to the text region of the binary image and k boundary regression results to realize text positioning and obtain a rectangular text box.
4. The HS code matching and displaying method based on intelligent analytic identification according to claim 1,
the character recognition comprises the following steps: performing character recognition on text contents in the rectangular text box through a pre-trained character recognition network to acquire text content information;
the text extraction comprises:
generating a basic semantic analysis engine based on a preset semantic database, wherein the semantic database comprises a field basic corpus, a field dictionary and a field knowledge map;
performing field analysis processing on the text content information based on a basic semantic analysis engine;
extracting the required fields and/or elements in the text content based on the extraction requirement extraction data set.
5. The HS code matching and displaying method based on intelligent parsing and recognition according to claim 1, wherein determining the class of the object to be determined according to the obtained object description information to be determined and pre-trained atlas data comprises:
dividing the categories of the object to be judged into hierarchies;
judging the hierarchy category corresponding to each hierarchy from top to bottom according to the obtained object description information and a pre-trained hierarchy classification model corresponding to the hierarchy;
and linking to a unique entity in the atlas data.
6. The HS code matching and displaying method based on intelligent analytic recognition according to claim 5, wherein a hierarchical classification model corresponding to each hierarchical level is obtained by training in the following way:
selecting a training sample, and extracting characteristic contents of description information of the sample as a query statement;
and matching the extracted query sentences and the corresponding hierarchy categories to train and obtain a hierarchy classification model corresponding to each hierarchy.
7. The HS code matching and displaying method based on intelligent parsing and recognition according to claim 6, further comprising training a hierarchical classification model corresponding to each hierarchical level in the following manner:
extracting characteristic content of the determined object description information as a query statement based on the object description information to be determined;
and matching the extracted query sentences and the corresponding hierarchy categories to train and obtain a hierarchy classification model corresponding to each hierarchy.
8. The HS code matching and displaying method based on intelligent parsing and recognition of claim 5, wherein said determining the hierarchy type corresponding to the hierarchy comprises calculating matching degree based on ranking learning and semantic features, and performing search ranking.
9. An HS code matching and display system based on intelligent parsing and recognition, characterized by comprising:
the device comprises an acquisition unit (1) for acquiring a file to be processed;
the file analysis unit (2) is used for receiving the file to be processed and analyzing the type and the format of the file to be processed;
the image preprocessing unit (3) is used for correcting the image imaging problem of the analyzed file to be processed;
the character detection unit (4) is used for detecting the position, the range and the layout of the text in the file to be processed on the basis of correcting the image imaging problem;
a character recognition unit (5) for recognizing the text content on the basis of the text detection;
a text extraction unit (6) which extracts required fields and/or elements from the text recognition result and generates object description information to be judged;
the judging unit (7) is used for judging the type of the object to be judged according to the acquired object description information to be judged and the pre-trained atlas data;
a display unit (8) for displaying the result determined by the determination unit (7); and
a memory (9) and a processor (10), the memory (9) having stored thereon a computer program that can be loaded by the processor (10) and that executes the HS code matching, presentation method based on intelligent analytics recognition as claimed in any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored which can be loaded by a processor (10) and which executes the HS code matching, presentation method based on intelligent analytics recognition as claimed in any one of claims 1 to 8.
CN202011404276.8A 2020-12-02 2020-12-02 HS code matching and displaying method and system based on intelligent analysis and identification and storage medium Withdrawn CN112434691A (en)

Priority Applications (1)

Application number: CN202011404276.8A; priority date: 2020-12-02; filing date: 2020-12-02; title: HS code matching and displaying method and system based on intelligent analysis and identification and storage medium

Applications Claiming Priority (1)

Application number: CN202011404276.8A; priority date: 2020-12-02; filing date: 2020-12-02; title: HS code matching and displaying method and system based on intelligent analysis and identification and storage medium

Publications (1)

Publication number: CN112434691A; publication date: 2021-03-02

Family

ID=74692663

Family Applications (1)

Application number: CN202011404276.8A; priority date: 2020-12-02; filing date: 2020-12-02; title: HS code matching and displaying method and system based on intelligent analysis and identification and storage medium

Country Status (1)

Country Link
CN (1) CN112434691A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966111A (en) * 2021-03-19 2021-06-15 北京星汉博纳医药科技有限公司 AI-based automatic classification method and system for object attribute text
CN113011144A (en) * 2021-03-30 2021-06-22 中国工商银行股份有限公司 Form information acquisition method and device and server
CN113051607A (en) * 2021-03-11 2021-06-29 天津大学 Privacy policy information extraction method
CN113343640A (en) * 2021-05-26 2021-09-03 南京大学 Customs clearance commodity HS code classification method and device
CN113486148A (en) * 2021-07-07 2021-10-08 中国建设银行股份有限公司 PDF file conversion method and device, electronic equipment and computer readable medium
CN113536771A (en) * 2021-09-17 2021-10-22 深圳前海环融联易信息科技服务有限公司 Element information extraction method, device, equipment and medium based on text recognition
CN114579712A (en) * 2022-05-05 2022-06-03 中科雨辰科技有限公司 Text attribute extraction and matching method based on dynamic model
CN114580429A (en) * 2022-01-26 2022-06-03 云捷计算机软件(江苏)有限责任公司 Artificial intelligence-based language and image understanding integrated service system
CN115171129A (en) * 2022-09-06 2022-10-11 京华信息科技股份有限公司 Character recognition error correction method and device, terminal equipment and storage medium
CN116127047A (en) * 2023-04-04 2023-05-16 北京大学深圳研究生院 Method and device for establishing enterprise information base
CN117542067A (en) * 2023-12-18 2024-02-09 北京长河数智科技有限责任公司 Region labeling form recognition method based on visual recognition

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051607A (en) * 2021-03-11 2021-06-29 天津大学 Privacy policy information extraction method
CN113051607B (en) * 2021-03-11 2022-04-19 天津大学 Privacy policy information extraction method
CN112966111A (en) * 2021-03-19 2021-06-15 北京星汉博纳医药科技有限公司 AI-based automatic classification method and system for object attribute text
CN113011144B (en) * 2021-03-30 2024-01-30 中国工商银行股份有限公司 Form information acquisition method, device and server
CN113011144A (en) * 2021-03-30 2021-06-22 中国工商银行股份有限公司 Form information acquisition method and device and server
CN113343640A (en) * 2021-05-26 2021-09-03 南京大学 Customs clearance commodity HS code classification method and device
CN113343640B (en) * 2021-05-26 2024-02-20 南京大学 Method and device for classifying customs commodity HS codes
CN113486148A (en) * 2021-07-07 2021-10-08 中国建设银行股份有限公司 PDF file conversion method and device, electronic equipment and computer readable medium
CN113536771A (en) * 2021-09-17 2021-10-22 深圳前海环融联易信息科技服务有限公司 Element information extraction method, device, equipment and medium based on text recognition
CN114580429A (en) * 2022-01-26 2022-06-03 云捷计算机软件(江苏)有限责任公司 Artificial intelligence-based language and image understanding integrated service system
CN114579712B (en) * 2022-05-05 2022-07-15 中科雨辰科技有限公司 Text attribute extraction and matching method based on dynamic model
CN114579712A (en) * 2022-05-05 2022-06-03 中科雨辰科技有限公司 Text attribute extraction and matching method based on dynamic model
CN115171129A (en) * 2022-09-06 2022-10-11 京华信息科技股份有限公司 Character recognition error correction method and device, terminal equipment and storage medium
CN116127047A (en) * 2023-04-04 2023-05-16 北京大学深圳研究生院 Method and device for establishing enterprise information base
CN116127047B (en) * 2023-04-04 2023-08-01 北京大学深圳研究生院 Method and device for establishing enterprise information base
CN117542067A (en) * 2023-12-18 2024-02-09 北京长河数智科技有限责任公司 Region labeling form recognition method based on visual recognition

Similar Documents

Publication Publication Date Title
CN112434691A (en) HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
Kang et al. Convolve, attend and spell: An attention-based sequence-to-sequence model for handwritten word recognition
US11734328B2 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
US11514698B2 (en) Intelligent extraction of information from a document
US20230206000A1 (en) Data-driven structure extraction from text documents
US10915788B2 (en) Optical character recognition using end-to-end deep learning
Mao et al. Document structure analysis algorithms: a literature survey
US8249344B2 (en) Grammatical parsing of document visual structures
CN109145260B (en) Automatic text information extraction method
JP6462970B1 (en) Classification device, classification method, generation method, classification program, and generation program
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
Tkaczyk New methods for metadata extraction from scientific literature
JP2022541199A (en) A system and method for inserting data into a structured database based on image representations of data tables.
Sharma et al. [Retracted] Optimized CNN‐Based Recognition of District Names of Punjab State in Gurmukhi Script
Al-Barhamtoshy et al. An arabic manuscript regions detection, recognition and its applications for OCRing
JP2006309347A (en) Method, system, and program for extracting keyword from object document
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
Blomqvist et al. Reading the ransom: Methodological advancements in extracting the swedish wealth tax of 1571
KR102467096B1 (en) Method and apparatus for checking dataset to learn extraction model for metadata of thesis
CN114003750A (en) Material online method, device, equipment and storage medium
CN112395429A (en) Method, system and storage medium for determining, pushing and applying HS (high speed coding) codes based on graph neural network
Khan et al. Analysis of Cursive Text Recognition Systems: A Systematic Literature Review
JP4466241B2 (en) Document processing method and document processing apparatus
Wu et al. Automatic semantic knowledge extraction from electronic forms
Kashevnik et al. An Approach to Engineering Drawing Organization: Title Block Detection and Processing

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication

Application publication date: 20210302