CN114153972A - Accessory classification method, device, equipment and medium based on optical character recognition - Google Patents


Info

Publication number
CN114153972A
CN114153972A
Authority
CN
China
Prior art keywords
attachment, accessory, text, score, category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111437898.5A
Other languages
Chinese (zh)
Inventor
董润华 (Dong Runhua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202111437898.5A
Publication of CN114153972A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to intelligent decision technology and discloses an attachment classification method based on optical character recognition, which includes the following steps: acquiring a text attachment set generated from an image set to be classified through optical character recognition, and extracting the real category of each text attachment together with its training keyword set and training label ratio; configuring an original attachment classifier with the training keyword set and the training label ratio, and predicting each text attachment with the original attachment classifier to obtain a predicted attachment category and a prediction score; comparing the predicted attachment category with the real category to obtain a prediction accuracy; training the original attachment classifier according to the prediction accuracy until the prediction accuracy exceeds a training threshold, thereby obtaining a standard attachment classifier; and classifying attachments to be classified with the standard attachment classifier. The invention also provides an attachment classification device based on optical character recognition, an electronic device and a storage medium. The invention can solve the problem of ambiguous classification of attachments scanned by optical character recognition.

Description

Accessory classification method, device, equipment and medium based on optical character recognition
Technical Field
The invention relates to the technical field of intelligent decision, in particular to an attachment classification method and device based on optical character recognition, electronic equipment and a computer readable storage medium.
Background
With the popularization of computer technology, paper attachment records have gradually been replaced by electronic attachments, which are widely used across industries because of their efficiency and convenience.
Most current electronic attachments are recorded and saved as pictures, for example contract agreements, form lists and invoice bills, without any labeling by attachment type or content; in most cases a user only learns what a picture records, and to which category the attachment belongs, after opening it. This ambiguous storage, with no clear attachment classification, makes it extremely inefficient for users to query attachments.
Disclosure of Invention
The invention provides an attachment classification method and device based on optical character recognition and a computer-readable storage medium, with the main aim of solving the problem that the classification of attachments scanned by optical character recognition is unclear.
In order to achieve the above object, the present invention provides an attachment classification method based on optical character recognition, including:
acquiring a text attachment set generated by an image set to be classified through optical character recognition;
selecting text attachments from the text attachment set one by one, and extracting the pre-labeled real type of each text attachment and all attachment keywords, paragraph labels and table labels in the text attachment;
combining all accessory keywords extracted from the text accessories to obtain a training keyword set of the text accessories, and calculating the quantitative ratio of the paragraph labels to the table labels to obtain a training label ratio;
configuring a pre-constructed original accessory classifier by using the training keyword set and the training label ratio;
classifying and predicting the text attachment by using the original attachment classifier to obtain an attachment category and a corresponding score of the text attachment;
judging whether the score is smaller than a preset prediction threshold value or not;
when the score is smaller than a preset prediction threshold value, performing gradient adjustment on the original attachment classifier by using the score, and returning to the step of performing classification prediction on the text attachment by using the original attachment classifier to obtain an attachment category and a corresponding score of the text attachment;
when the score is larger than or equal to a preset prediction threshold value, comparing the predicted accessory category with the real category of the text accessory to obtain a prediction result of correct prediction or wrong prediction;
summarizing the prediction results of all the text attachments in the text attachment set to obtain the prediction accuracy;
judging whether the prediction accuracy is greater than or equal to a preset training threshold value or not;
if the predicted accuracy is smaller than the training threshold, returning to the step of configuring the pre-constructed original accessory classifier by using the training keyword set and the training label ratio until the predicted accuracy is larger than or equal to the training threshold, and stopping the iterative training to obtain a standard accessory classifier;
and receiving the accessories to be classified, and classifying the accessories to be classified by using the standard accessory classifier to obtain a classification result of the accessories to be classified.
Optionally, the classifying and predicting the text attachment by using the original attachment classifier to obtain the attachment category and the corresponding score of the text attachment includes:
according to each training keyword in the training keyword set, scoring the text attachments under each attachment category in a pre-constructed attachment category scoring table to obtain a keyword scoring set;
according to the training label ratio, scoring the text attachments under each attachment category in the attachment category scoring table to obtain a label ratio scoring set;
according to the keyword scoring set and the label ratio scoring set, constructing a comprehensive score of the text attachment under each attachment category in the attachment category scoring table to obtain a comprehensive score set;
and inquiring the accessory category corresponding to the highest comprehensive score in the comprehensive score set, and taking the accessory category corresponding to the highest comprehensive score and the highest comprehensive score as the accessory category and the corresponding score of the text accessory.
Optionally, the constructing a comprehensive score of the text attachment under each attachment category in the attachment category score table according to the keyword score set and the tag ratio score set to obtain a comprehensive score set includes:
superposing the scores of the keyword score sets under the same accessory category to obtain the score of the training keyword set under each accessory category;
carrying out normalization processing on the scores of the training keyword set under each accessory category by using a pre-constructed first normalization formula to obtain a keyword normalization score set;
normalizing the scores in the label ratio score set by using a pre-constructed second normalization formula to obtain a label ratio normalization score set;
and correspondingly superposing the scores of the keyword normalization score set and the label ratio normalization score set under the same accessory category to obtain the comprehensive score set.
Optionally, the extracting the pre-labeled real type of the text attachment and all the attachment keywords, paragraph labels and table labels in the text attachment includes:
extracting an attachment number preset by the text attachment, and inquiring a real type labeled in advance by the text attachment in a pre-constructed training attachment type table according to the attachment number;
converting the text attachment into an html format to obtain an html attachment;
extracting all attachment keywords in the html attachment according to a pre-constructed attachment keyword set;
extracting all paragraph labels in the html attachment according to a preset attachment paragraph label set;
and extracting all form tags in the html attachment according to a preset attachment form tag set.
Optionally, the extracting all the accessory keywords in the html accessory according to the pre-constructed accessory keyword set includes:
performing word segmentation processing on the content in the html attachment to obtain a word set to be matched;
extracting the words existing in the word set to be matched and the accessory keyword set at the same time, and taking the words existing at the same time as the accessory keyword.
Optionally, the step of performing gradient adjustment on the original accessory classifier by using the score, and returning to the step of performing classification prediction on the text accessory by using the original accessory classifier to obtain the accessory category of the text accessory and the corresponding score includes:
calculating a difference value between the score and the prediction threshold value to obtain a prediction residual error;
setting an adjustment gradient for adjusting the keyword scoring set and the label ratio scoring set according to the size of the prediction residual;
and adjusting the score of each accessory category in the keyword scoring set and the label ratio scoring set according to the adjustment gradient, and predicting the accessory category and the corresponding score of the text accessory according to the adjusted keyword scoring set and the adjusted label ratio scoring set.
Optionally, the receiving the accessory to be classified, and classifying the accessory to be classified by using the standard accessory classifier to obtain a classification result of the accessory to be classified includes:
extracting all accessory keywords, paragraph labels and table labels in the accessories to be classified;
and classifying the accessories to be classified by using the standard accessory classifier according to all accessory keywords, paragraph labels and table labels in the accessories to be classified to obtain a classification result of the accessories to be classified.
In order to solve the above problems, the present invention also provides an attachment classification apparatus based on optical character recognition, the apparatus including:
the sample data extraction module is used for acquiring a text attachment set generated by an image set to be classified through optical character recognition, selecting one text attachment from the text attachment set one by one, extracting the pre-labeled real type of the text attachment and all attachment keywords, paragraph labels and table labels in the text attachment, combining all the attachment keywords extracted from the text attachment to obtain a training keyword set of the text attachment, and calculating the quantitative ratio of the paragraph labels to the table labels to obtain a training label ratio;
the classifier prediction module is used for configuring a pre-constructed original accessory classifier by utilizing the training keyword set and the training label ratio, and performing classification prediction on the text accessory by utilizing the original accessory classifier to obtain an accessory category and a corresponding score of the text accessory;
a classification result judgment module, configured to judge whether the score is smaller than a preset prediction threshold, and when the score is smaller than the preset prediction threshold, perform gradient adjustment on the original attachment classifier by using the score, and return to the above step of performing classification prediction on the text attachment by using the original attachment classifier to obtain an attachment category of the text attachment and a corresponding score;
a classifier effect judging module, configured to, when the score is greater than or equal to a preset prediction threshold, compare the predicted accessory category with a real category of the text accessory to obtain a prediction result of correct prediction or incorrect prediction, and summarize prediction results of all text accessories in the text accessory set to obtain a prediction accuracy, and judge whether the prediction accuracy is greater than or equal to a preset training threshold, and if the prediction accuracy is less than the training threshold, return to the above-mentioned process of configuring a pre-constructed original accessory classifier using the training keyword set and the training label ratio, until the prediction accuracy is greater than or equal to the training threshold, stop the iterative training to obtain a standard accessory classifier;
and the classifier identification module is used for receiving the accessories to be classified, and classifying the accessories to be classified by using the standard accessory classifier to obtain a classification result of the accessories to be classified.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method for optical character recognition based accessory classification described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one computer program is stored, the at least one computer program being executed by a processor in an electronic device to implement the optical character recognition-based accessory classification method described above.
According to the method and the device, the accessory keywords and the training label ratio of a text accessory, obtained by performing optical character recognition on a picture accessory, are extracted; together they represent the characteristic information of the picture accessory, so training the accessory classifier with the training keyword set and the training label ratio makes the classifier more accurate. In addition, the accessory classifier is trained against two judgment criteria, the score predicted for each accessory category and the overall prediction accuracy of the classifier, which further improves its classification accuracy. Therefore, the accessory classification method and device, the electronic equipment and the computer-readable storage medium based on optical character recognition can improve the classification accuracy of accessories scanned by optical character recognition.
Drawings
FIG. 1 is a flowchart illustrating a method for classifying accessories based on optical character recognition according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a step of a method for classifying accessories based on optical character recognition according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a step of a method for classifying accessories based on optical character recognition according to an embodiment of the present invention;
FIG. 4 is a functional block diagram of an attachment classification device based on optical character recognition according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device implementing the method for classifying accessories based on optical character recognition according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides an attachment classification method based on optical character recognition. The executing body of the attachment classification method based on optical character recognition includes, but is not limited to, at least one of a server, a terminal and other electronic devices that can be configured to execute the method provided by the embodiments of the present application. In other words, the attachment classification method based on optical character recognition may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Referring to fig. 1, a flowchart of an attachment classification method based on optical character recognition according to an embodiment of the present invention is shown. In this embodiment, the method for classifying accessories based on optical character recognition includes:
and S1, acquiring a text attachment set generated by the photo set to be classified through optical character recognition.
In the embodiment of the present invention, the picture set to be classified may consist of service attachments stored on a website, for example contract picture attachments, rental list picture attachments and invoice picture attachments of actual financing services registered between users on a registration network, where the attachments are required to be uploaded as scanned copies. Further, in the embodiment of the present invention, a text attachment set corresponding to the picture set to be classified is obtained by Optical Character Recognition (OCR) technology.
In detail, the picture set to be classified contains attachments of every category type in a certain field. For example, the attachment categories in the field of actual financing business include a plain text class, a plain form class, a compound class, an invoice class and the like, where the plain text class may be a contract agreement containing essentially no forms, the plain form class may be a form list containing essentially no text, the compound class consists of half forms and half plain text, and the invoice class may be a value-added tax common invoice, a special invoice, and so on. In particular, an attachment may be classified into an abnormal class when it is particularly blurred or fails to open.
S2, selecting one of the text attachments from the text attachment set one by one, and extracting the real type of the text attachment labeled in advance and all attachment keywords, paragraph labels and table labels in the text attachment.
In the embodiment of the present invention, the accessory keywords are feature words with a relatively high frequency of occurrence in each preset accessory category, for example: in the contract agreement of the plain text type, words such as 'contract', 'agreement' and 'seal' can be used as accessory keywords; in the table list of the pure table class, words such as 'list' and the like can be used as the accessory key words; in the accessories of the invoice class, terms such as 'value-added tax special invoice', 'value-added tax common invoice', 'invoice union', 'deduction union', 'taxpayer identification number' and 'invoicer' can be used as accessory keywords.
In the embodiment of the invention, a likelihood score is produced under each accessory category for each accessory keyword in a text accessory; the higher the score under a certain accessory category, the more likely the text accessory belongs to that category. Finally, all accessory keywords in the text accessory are combined to predict its accessory category.
In this embodiment of the present invention, a paragraph tag refers to a paragraph tag in an HTML-format file attachment, for example: &lt;p&gt;, &lt;span&gt;, etc. A form tag refers to a table tag in an HTML-format text attachment, for example: &lt;table&gt;, &lt;tr&gt;, &lt;td&gt;, etc.
in the embodiment of the present invention, the extracting the pre-labeled real type of the text attachment and all the attachment keywords, paragraph labels and table labels in the text attachment includes:
extracting an attachment number preset by the text attachment, and inquiring a real type labeled in advance by the text attachment in a pre-constructed training attachment type table according to the attachment number;
converting the text attachment into an html format to obtain an html attachment;
extracting all attachment keywords in the html attachment according to a pre-constructed attachment keyword set;
extracting all paragraph labels in the html attachment according to a preset attachment paragraph label set;
and extracting all form tags in the html attachment according to a preset attachment form tag set.
In the embodiment of the invention, the training accessory category table refers to a pre-constructed query table for querying the real category of each text accessory in the text accessory set, wherein the training accessory category table is established according to the corresponding relation between the accessory number preset by each text accessory and the real category of the text accessory.
In the embodiment of the present invention, the extracting all the accessory keywords in the html accessory according to the pre-constructed accessory keyword set includes:
performing word segmentation processing on the content in the html attachment to obtain a word set to be matched;
extracting the words existing in the word set to be matched and the accessory keyword set at the same time, and taking the words existing at the same time as the accessory keyword.
In the embodiment of the invention, a jieba word segmentation tool or other word segmentation tools can be used for carrying out word segmentation on the html attachment.
Specifically, in the embodiment of the present invention, after the word segmentation is completed, each accessory keyword in the accessory keyword set may be extracted to match with each word to be matched in the word set to be matched, and when the matching is successful, the word to be matched is represented as an accessory keyword.
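The matching step just described can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain substring scan stands in for the jieba word segmentation named above, and the keyword set and sample content are hypothetical examples.

```python
def extract_attachment_keywords(html_text, keyword_set):
    """Return the attachment keywords that occur in the attachment content.

    A real implementation would first segment the content with a tool such
    as jieba and intersect the resulting word set with the pre-constructed
    attachment keyword set; here a substring scan stands in for that step.
    """
    return {kw for kw in keyword_set if kw in html_text}

# Hypothetical keyword set spanning several attachment categories.
keywords = {"contract", "agreement", "seal", "list", "invoice"}
content = "<p>This agreement is a binding contract bearing the company seal.</p>"
print(sorted(extract_attachment_keywords(content, keywords)))
```

Words found in both the attachment content and the keyword set become the attachment's keywords; everything else produced by segmentation is discarded.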
S3, combining the accessory keywords extracted from the text accessories to obtain a training keyword set of the text accessories, and calculating the quantitative ratio of the paragraph labels to the table labels to obtain a training label ratio.
In the embodiment of the invention, because the frequency with which certain accessory keywords appear, and the ratio of paragraph labels to form labels, differ between accessory categories, all accessory keywords extracted from a text accessory and its ratio of paragraph labels to form labels can serve as the category features of that text accessory. The training label ratio is obtained by counting the paragraph labels and the form labels, providing data features for the subsequent classification process.
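Counting paragraph tags against table tags to form the training label ratio can be sketched with the standard-library HTML parser. The tag sets mirror the examples given earlier; the sample attachment content is hypothetical.

```python
from html.parser import HTMLParser

PARAGRAPH_TAGS = {"p", "span"}      # preset attachment paragraph tag set
TABLE_TAGS = {"table", "tr", "td"}  # preset attachment table tag set

class TagCounter(HTMLParser):
    """Count paragraph tags and table tags in an HTML attachment."""
    def __init__(self):
        super().__init__()
        self.paragraph = 0
        self.table = 0

    def handle_starttag(self, tag, attrs):
        if tag in PARAGRAPH_TAGS:
            self.paragraph += 1
        elif tag in TABLE_TAGS:
            self.table += 1

def training_label_ratio(html_text):
    """Return (paragraph_count, table_count) as the training label ratio."""
    counter = TagCounter()
    counter.feed(html_text)
    return counter.paragraph, counter.table

html_attachment = (
    "<p>clause one</p><p>clause two</p><span>note</span>"
    "<table><tr><td>amount</td></tr></table>"
)
print(training_label_ratio(html_attachment))
```

A heavily paragraph-weighted ratio suggests a plain-text attachment, a table-weighted one a form list, and a balanced ratio a compound attachment.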
And S4, configuring a pre-constructed original accessory classifier by using the training keyword set and the training label ratio.
In the embodiment of the present invention, the original accessory classifier may receive a score of the text accessory under each accessory category according to each training keyword in the training keyword set and the training label ratio, and when the score of a certain accessory category is the highest and is greater than a preset threshold, the corresponding accessory category is used as the predicted accessory category.
And S5, carrying out classification prediction on the text attachment by using the original attachment classifier to obtain the attachment category and the corresponding score of the text attachment.
In detail, as shown in fig. 2, in the embodiment of the present invention, the classifying and predicting the text attachment by using the original attachment classifier to obtain the attachment category and the corresponding score of the text attachment includes:
S51, according to each training keyword in the training keyword set, scoring the text attachment under each attachment category in a pre-constructed attachment category scoring table to obtain a keyword scoring set;
S52, according to the training label ratio, scoring the text attachment under each attachment category in the attachment category scoring table to obtain a label ratio scoring set;
S53, according to the keyword scoring set and the label ratio scoring set, constructing a comprehensive score of the text attachment under each attachment category in the attachment category scoring table to obtain a comprehensive score set;
S54, inquiring the attachment category corresponding to the highest comprehensive score in the comprehensive score set, and taking that attachment category and the highest comprehensive score as the attachment category and corresponding score of the text attachment.
Wherein, the attachment category rating table refers to a pre-constructed rating table containing all the preset attachment categories, such as: plain text type, plain table type, compound type, invoice type and the like. The attachment category may be placed in a first horizontal field of the attachment category rating table. The set of keyword scores includes scores for each of the training keywords in the set of training keywords under all attachment categories in the attachment category score table. For example: when the training keyword is "contract," the rating may be 0.70 in the attachment category of the plain text class, 0.60 in the attachment category of the plain form class, 0.65 in the attachment category of the compound class, and 0.30 in the attachment category of the invoice class.
Further, the tag ratio score set includes scores under all accessory categories of the accessory category score table according to the training tag ratio value. For example: when the training label ratio is 7:3, the score under the attachment category of the plain text class may be 0.30, the score under the attachment category of the plain form class may be 0.10, the score under the attachment category of the compound class may be 0.70, and the score under the attachment category of the invoice class may be 0.20.
Further, the comprehensive scoring set refers to a scoring set obtained by integrating scores of the keyword scoring set and the label ratio scoring set for the same accessory category to obtain a comprehensive score of the same accessory category and then obtaining the comprehensive score according to the comprehensive scores of all accessory categories.
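The two score sets can be illustrated with the example values quoted above for the keyword "contract" and the 7:3 label ratio. The score tables below are hypothetical stand-ins for the pre-constructed attachment category scoring table; only the numbers come from the description's own illustration.

```python
# Hypothetical attachment category score table, one row per keyword,
# populated with the example scores the description gives for "contract".
keyword_score_table = {
    "contract": {"plain text": 0.70, "plain table": 0.60,
                 "compound": 0.65, "invoice": 0.30},
}
# Example scores the description gives for a 7:3 paragraph-to-table ratio.
label_ratio_score_table = {
    "7:3": {"plain text": 0.30, "plain table": 0.10,
            "compound": 0.70, "invoice": 0.20},
}

def keyword_scores(training_keywords):
    """Keyword score set: one score per training keyword per category."""
    return {kw: keyword_score_table[kw] for kw in training_keywords}

def label_ratio_scores(ratio):
    """Label ratio score set: the ratio's score under each category."""
    return label_ratio_score_table[ratio]

print(keyword_scores(["contract"]))
print(label_ratio_scores("7:3"))
```

Each text attachment thus contributes one score per training keyword per category, plus one score per category for its label ratio; these feed the comprehensive score.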
In detail, as shown in fig. 3, in the embodiment of the present invention, the constructing a comprehensive score of the text attachment under each attachment category in the attachment category score table according to the keyword score set and the tag ratio score set to obtain a comprehensive score set includes:
S531, superposing the scores of the keyword score set under the same accessory category to obtain the score of the training keyword set under each accessory category;
S532, normalizing the scores of the training keyword set under each accessory category with a pre-constructed first normalization formula to obtain a keyword normalized score set;
S533, normalizing the scores in the label ratio score set with a pre-constructed second normalization formula to obtain a label ratio normalized score set;
and S534, correspondingly superposing the scores of the keyword normalized score set and the label ratio normalized score set under the same accessory category to obtain the comprehensive score set.
In the embodiment of the present invention, the first normalization formula is as follows:
G_word set = Q_word set × P_word set score / S_number of training words
wherein G_word set is the normalized score of the training keyword set under a certain attachment category, Q_word set is the preset keyword normalization weight, which is 0.5, P_word set score is the superposed score of all the training keywords in the training keyword set under that attachment category, and S_number of training words refers to the number of training keywords in the training keyword set.
Further, the second normalization formula is as follows:
G_label = Q_label × P_label score
wherein G_label is the normalized score of the label ratio under a certain attachment category, Q_label is the preset label ratio normalization weight, which is 0.5, and P_label score refers to the score of the label ratio score set under that attachment category.
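Putting steps S531–S534 and the two normalization formulas together, a minimal sketch of the composite-scoring step might look as follows. Function and variable names are illustrative, not from the patent; the weights 0.5 and the example scores come from the embodiment above:

```python
Q_WORD, Q_LABEL = 0.5, 0.5  # preset normalization weights from the embodiment

def composite_scores(keyword_scores, label_ratio_scores, categories):
    """keyword_scores: {keyword: {category: score}}; label_ratio_scores: {category: score}."""
    composite = {}
    n_words = len(keyword_scores)  # S, the number of training keywords
    for cat in categories:
        # S531: superpose keyword scores under the same attachment category
        p_word = sum(scores[cat] for scores in keyword_scores.values())
        # S532: first normalization formula  G = Q * P / S
        g_word = Q_WORD * p_word / n_words
        # S533: second normalization formula  G = Q * P
        g_label = Q_LABEL * label_ratio_scores[cat]
        # S534: superpose the two normalized scores per category
        composite[cat] = g_word + g_label
    return composite

cats = ["plain_text", "plain_table", "compound", "invoice"]
kw = {"contract": {"plain_text": 0.70, "plain_table": 0.60,
                   "compound": 0.65, "invoice": 0.30}}
ratio = {"plain_text": 0.30, "plain_table": 0.10, "compound": 0.70, "invoice": 0.20}
print(composite_scores(kw, ratio, cats))
# compound scores highest: 0.5 * 0.65 / 1 + 0.5 * 0.70 = 0.675
```

With these example values the compound category obtains the highest comprehensive score, which matches the intuition that a 7:3 paragraph-to-table ratio plus the keyword "contract" suggests a mixed text-and-table document.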
S6, judging whether the score is smaller than a preset prediction threshold value;
and when the score is smaller than the preset prediction threshold, S7, performing gradient adjustment on the original attachment classifier by using the score, and returning to step S4.
In an embodiment of the present invention, the prediction threshold may be 0.7; when the highest comprehensive score is greater than or equal to 0.7, the highest comprehensive score is used as the predicted score of the text attachment.
In detail, in the embodiment of the present invention, the step of performing gradient adjustment on the original attachment classifier by using the score, and returning to the step of performing classification prediction on the text attachment by using the original attachment classifier to obtain the attachment category and the corresponding score of the text attachment includes:
calculating a difference value between the score and the prediction threshold value to obtain a prediction residual error;
setting an adjustment gradient for adjusting the keyword score set and the label ratio score set according to the size of the prediction residual;
and adjusting the score of each attachment category in the keyword score set and the label ratio score set according to the adjustment gradient, and predicting the attachment category and the corresponding score of the text attachment again according to the adjusted keyword score set and the adjusted label ratio score set.
According to a preset adjustment strategy, when the highest comprehensive score is smaller than 0.7, an adjustment gradient for the scores of the keyword score set and the label ratio score set under each attachment category is set according to the size of the prediction residual.
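The residual-based adjustment described above can be sketched as follows. The patent does not fix a concrete update rule, so the linear step proportional to the residual, the step size, and all names are assumptions for illustration only:

```python
PRED_THRESHOLD = 0.7  # prediction threshold from the embodiment

def adjust_scores(score, keyword_scores, label_ratio_scores, step=0.1):
    """Nudge every table score by a gradient proportional to the prediction residual."""
    residual = PRED_THRESHOLD - score      # difference between threshold and score
    gradient = step * residual             # larger residual -> larger adjustment
    for scores in keyword_scores.values():
        for cat in scores:
            scores[cat] = min(1.0, scores[cat] + gradient)
    for cat in label_ratio_scores:
        label_ratio_scores[cat] = min(1.0, label_ratio_scores[cat] + gradient)
    return gradient

kw = {"contract": {"plain_text": 0.70, "compound": 0.65}}
ratio = {"plain_text": 0.30, "compound": 0.70}
g = adjust_scores(0.675, kw, ratio)
print(round(g, 4))  # 0.0025
```

After the adjustment the classifier re-predicts with the updated score sets (return to step S4), so a score that barely missed the threshold receives only a small nudge.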
And when the score is greater than or equal to a preset prediction threshold value, S8, comparing the predicted accessory category with the real category of the text accessory to obtain a prediction result with correct prediction or wrong prediction.
In the embodiment of the invention, the attachment category obtained by predicting the text attachment is compared with the real category extracted in advance: when the attachment category is the same as the real category, the prediction is correct, and when the attachment category is different from the real category, the prediction is wrong.
And S9, summarizing the prediction results of all the text attachments in the text attachment set to obtain the prediction accuracy.
In the embodiment of the invention, the prediction accuracy is obtained from the proportion of correct predictions among the prediction results of all the text attachments in the text attachment set, and the prediction effect can be evaluated from the prediction accuracy.
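The summarizing step S9 amounts to a simple proportion; a small sketch (names are illustrative):

```python
def prediction_accuracy(results):
    """results: list of booleans, True for a correct prediction, False for a wrong one."""
    return sum(results) / len(results)

# Four text attachments, three predicted correctly:
print(prediction_accuracy([True, True, True, False]))  # 0.75
```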
And S10, judging whether the prediction accuracy is greater than or equal to a preset training threshold value.
And if the prediction accuracy is smaller than the training threshold, returning to the step S4 until the prediction accuracy is larger than or equal to the training threshold, executing a step S11, and stopping the iterative training to obtain the standard accessory classifier.
In the embodiment of the invention, the training threshold is set to 0.85. When the prediction accuracy is smaller than the training threshold, the scores in the keyword score set and the label ratio score set are considered inaccurate, so scoring is performed again and the prediction accuracy is obtained anew; when the prediction accuracy is greater than or equal to the preset training threshold, the scores are considered sufficiently accurate.
S12, receiving the accessories to be classified, and classifying the accessories to be classified by using the standard accessory classifier to obtain the classification result of the accessories to be classified.
In detail, in the embodiment of the present invention, the receiving the accessory to be classified, and classifying the accessory to be classified by using the standard accessory classifier to obtain the classification result of the accessory to be classified includes:
Extracting all accessory keywords, paragraph labels and table labels in the accessories to be classified;
and classifying the accessories to be classified by using the standard accessory classifier according to all accessory keywords, paragraph labels and table labels in the accessories to be classified to obtain a classification result of the accessories to be classified.
In the embodiment of the invention, after the standard accessory classifier is obtained, the accessory keywords and the quantity ratio of paragraph labels to table labels of each picture accessory in the picture set to be classified are extracted one by one; the accessory keywords, the number of paragraph labels and the number of table labels are used as the accessory features of the accessory to be classified, and the standard accessory classifier classifies according to these accessory features to obtain the classification result.
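The feature extraction for an accessory to be classified — keywords plus paragraph/table tag counts from the html form described in claims 4 and 5 — could be sketched as follows. Splitting on word characters stands in for the word segmentation step, and all names are simplifying assumptions:

```python
import re

def extract_features(html_text, keyword_set):
    """Extract accessory keywords and paragraph/table tag counts from an html attachment."""
    # word segmentation (simplified here to word-character splitting)
    words = set(re.findall(r"\w+", html_text.lower()))
    keywords = words & keyword_set               # words present in both sets
    n_paragraphs = len(re.findall(r"<p\b", html_text))
    n_tables = len(re.findall(r"<table\b", html_text))
    return keywords, (n_paragraphs, n_tables)    # label ratio is n_paragraphs : n_tables

html = "<p>This contract ...</p><p>terms</p><table><tr><td>1</td></tr></table>"
kw, counts = extract_features(html, {"contract", "invoice"})
print(kw, counts)  # {'contract'} (2, 1)
```

The resulting keyword set and tag-count ratio are then scored against the adjusted score tables exactly as during training.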
Fig. 4 is a functional block diagram of an attachment classification apparatus based on optical character recognition according to an embodiment of the present invention.
The optical character recognition-based accessory classification apparatus 100 according to the present invention may be installed in an electronic device. According to the implemented functions, the optical character recognition-based accessory classification apparatus 100 may include a sample data extraction module 101, a classifier prediction module 102, a classification result judgment module 103, a classifier effect judgment module 104, and a classifier identification module 105. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the sample data extraction module 101 is configured to obtain a text attachment set generated by an image set to be classified through optical character recognition, select one of the text attachments from the text attachment set one by one, extract a pre-labeled real category of the text attachment and all attachment keywords, paragraph labels and form labels in the text attachment, combine the attachment keywords extracted from the text attachment to obtain a training keyword set of the text attachment, and calculate a quantitative ratio of the paragraph labels to the form labels to obtain a training label ratio;
the classifier prediction module 102 is configured to configure a pre-constructed original accessory classifier by using the training keyword set and the training label ratio, and perform classification prediction on the text accessory by using the original accessory classifier to obtain an accessory category and a corresponding score of the text accessory;
the classification result judgment module 103 is configured to judge whether the score is smaller than a preset prediction threshold, perform gradient adjustment on the original attachment classifier by using the score when the score is smaller than the preset prediction threshold, and return to the step of performing classification prediction on the text attachment by using the original attachment classifier to obtain an attachment category of the text attachment and a corresponding score;
the classifier effect judging module 104 is configured to, when the score is greater than or equal to a preset prediction threshold, compare the accessory category obtained through prediction with the real category of the text accessory to obtain a prediction result of correct prediction or incorrect prediction, summarize the prediction results of all the text accessories in the text accessory set to obtain a prediction accuracy, judge whether the prediction accuracy is greater than or equal to a preset training threshold, and if the prediction accuracy is smaller than the training threshold, return to the above process of configuring a pre-constructed original accessory classifier using the training keyword set and the training label ratio until the prediction accuracy is greater than or equal to the training threshold, stop the iterative training to obtain a standard accessory classifier;
the classifier identification module 105 is configured to receive an accessory to be classified, and classify the accessory to be classified by using the standard accessory classifier to obtain a classification result of the accessory to be classified.
In detail, when the modules in the accessory classification device 100 based on optical character recognition according to the embodiment of the present invention are used, the same technical means as the accessory classification method based on optical character recognition described in fig. 1 to 3 is adopted, and the same technical effects can be produced, which is not described herein again.
Fig. 5 is a schematic structural diagram of an electronic device implementing an attachment classification method based on optical character recognition according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as an attachment classification program based on optical character recognition, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), a microprocessor, a digital Processing chip, a graphics processor, a combination of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (e.g., executing an attachment classification program based on optical character recognition, etc.) stored in the memory 11 and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of an attachment classification program based on optical character recognition, etc., but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The optical character recognition based accessory classification program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
acquiring a text attachment set generated by an image set to be classified through optical character recognition;
selecting one of the text attachments from the text attachment set one by one, and extracting the real type of the text attachments which are labeled in advance and all attachment keywords, paragraph labels and table labels in the text attachments;
combining all accessory keywords extracted from the text accessories to obtain a training keyword set of the text accessories, and calculating the quantitative ratio of the paragraph labels to the table labels to obtain a training label ratio;
configuring a pre-constructed original accessory classifier by using the training keyword set and the training label ratio;
classifying and predicting the text attachment by using the original attachment classifier to obtain an attachment category and a corresponding score of the text attachment;
judging whether the score is smaller than a preset prediction threshold value or not;
when the score is smaller than a preset prediction threshold value, performing gradient adjustment on the original attachment classifier by using the score, and returning to the step of performing classification prediction on the text attachment by using the original attachment classifier to obtain an attachment category and a corresponding score of the text attachment;
when the score is larger than or equal to a preset prediction threshold value, comparing the predicted accessory category with the real category of the text accessory to obtain a prediction result of correct prediction or wrong prediction;
summarizing the prediction results of all the text attachments in the text attachment set to obtain the prediction accuracy;
judging whether the prediction accuracy is greater than or equal to a preset training threshold value or not;
if the predicted accuracy is smaller than the training threshold, returning to the process of configuring the pre-constructed original accessory classifier by using the training keyword set and the training label ratio until the predicted accuracy is larger than or equal to the training threshold, and stopping the iterative training to obtain a standard accessory classifier;
and receiving the accessories to be classified, and classifying the accessories to be classified by using the standard accessory classifier to obtain a classification result of the accessories to be classified.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring a text attachment set generated by an image set to be classified through optical character recognition;
selecting one of the text attachments from the text attachment set one by one, and extracting the real type of the text attachments which are labeled in advance and all attachment keywords, paragraph labels and table labels in the text attachments;
combining all accessory keywords extracted from the text accessories to obtain a training keyword set of the text accessories, and calculating the quantitative ratio of the paragraph labels to the table labels to obtain a training label ratio;
configuring a pre-constructed original accessory classifier by using the training keyword set and the training label ratio;
classifying and predicting the text attachment by using the original attachment classifier to obtain an attachment category and a corresponding score of the text attachment;
judging whether the score is smaller than a preset prediction threshold value or not;
when the score is smaller than a preset prediction threshold value, performing gradient adjustment on the original attachment classifier by using the score, and returning to the step of performing classification prediction on the text attachment by using the original attachment classifier to obtain an attachment category and a corresponding score of the text attachment;
when the score is larger than or equal to a preset prediction threshold value, comparing the predicted accessory category with the real category of the text accessory to obtain a prediction result of correct prediction or wrong prediction;
summarizing the prediction results of all the text attachments in the text attachment set to obtain the prediction accuracy;
judging whether the prediction accuracy is greater than or equal to a preset training threshold value or not;
if the predicted accuracy is smaller than the training threshold, returning to the process of configuring the pre-constructed original accessory classifier by using the training keyword set and the training label ratio until the predicted accuracy is larger than or equal to the training threshold, and stopping the iterative training to obtain a standard accessory classifier;
and receiving the accessories to be classified, and classifying the accessories to be classified by using the standard accessory classifier to obtain a classification result of the accessories to be classified.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. An attachment classification method based on optical character recognition, the method comprising:
acquiring a text attachment set generated by an image set to be classified through optical character recognition;
selecting one of the text attachments from the text attachment set one by one, and extracting the real type of the text attachments which are labeled in advance and all attachment keywords, paragraph labels and table labels in the text attachments;
combining all accessory keywords extracted from the text accessories to obtain a training keyword set of the text accessories, and calculating the quantitative ratio of the paragraph labels to the table labels to obtain a training label ratio;
configuring a pre-constructed original accessory classifier by using the training keyword set and the training label ratio;
classifying and predicting the text attachment by using the original attachment classifier to obtain an attachment category and a corresponding score of the text attachment;
judging whether the score is smaller than a preset prediction threshold value or not;
when the score is smaller than a preset prediction threshold value, performing gradient adjustment on the original attachment classifier by using the score, and returning to the step of performing classification prediction on the text attachment by using the original attachment classifier to obtain an attachment category and a corresponding score of the text attachment;
when the score is larger than or equal to a preset prediction threshold value, comparing the predicted accessory category with the real category of the text accessory to obtain a prediction result of correct prediction or wrong prediction;
summarizing the prediction results of all the text attachments in the text attachment set to obtain the prediction accuracy;
judging whether the prediction accuracy is greater than or equal to a preset training threshold value or not;
if the predicted accuracy is smaller than the training threshold, returning to the step of configuring the pre-constructed original accessory classifier by using the training keyword set and the training label ratio until the predicted accuracy is larger than or equal to the training threshold, and stopping the iterative training to obtain a standard accessory classifier;
and receiving the accessories to be classified, and classifying the accessories to be classified by using the standard accessory classifier to obtain a classification result of the accessories to be classified.
2. The method for classifying attachments based on optical character recognition according to claim 1, wherein the classifying and predicting the text attachments by using the original attachment classifier to obtain attachment categories and corresponding scores of the text attachments comprises:
according to each training keyword in the training keyword set, scoring the text attachments under each attachment category in a pre-constructed attachment category scoring table to obtain a keyword scoring set;
according to the training label ratio, scoring the text attachments under each attachment category in the attachment category scoring table to obtain a label ratio scoring set;
according to the keyword score set and the label ratio score set, constructing a comprehensive score of the text attachment under each attachment category in the attachment category score table to obtain a comprehensive score set;
and inquiring the accessory category corresponding to the highest comprehensive score in the comprehensive score set, and taking the accessory category corresponding to the highest comprehensive score and the highest comprehensive score as the accessory category and the corresponding score of the text accessory.
3. The method for classifying attachments based on optical character recognition according to claim 2, wherein the step of constructing a composite score of the text attachments under each attachment category in the attachment category score table according to the keyword score set and the tag ratio score set to obtain a composite score set comprises the steps of:
superposing the scores of the keyword score set under the same attachment category to obtain the score of the training keyword set under each attachment category;
normalizing the scores of the training keyword set under each attachment category by using a pre-constructed first normalization formula to obtain a keyword normalized score set;
normalizing the scores in the label ratio score set by using a pre-constructed second normalization formula to obtain a label ratio normalized score set;
and correspondingly superposing the keyword normalized score set and the label ratio normalized score set under the same attachment category to obtain the comprehensive score set.
4. The method for classifying attachments according to claim 1, wherein the extracting of the pre-labeled real category of the text attachments and all the attachment keywords, paragraph labels and table labels in the text attachments comprises:
extracting an attachment number preset by the text attachment, and inquiring a real type labeled in advance by the text attachment in a pre-constructed training attachment type table according to the attachment number;
converting the text attachment into an html format to obtain an html attachment;
extracting all attachment keywords in the html attachment according to a pre-constructed attachment keyword set;
extracting all paragraph labels in the html attachment according to a preset attachment paragraph label set;
and extracting all form tags in the html attachment according to a preset attachment form tag set.
5. The method for classifying accessories based on optical character recognition according to claim 4, wherein the extracting all accessory keywords in the html accessories according to the pre-constructed accessory keyword set comprises:
performing word segmentation processing on the content in the html attachment to obtain a word set to be matched;
extracting the words existing in the word set to be matched and the accessory keyword set at the same time, and taking the words existing at the same time as the accessory keyword.
6. The attachment classification method based on optical character recognition according to claim 1, wherein the performing gradient adjustment on the original attachment classifier by using the score and returning to the step of performing classification prediction on the text attachment by using the original attachment classifier to obtain the attachment category and corresponding score of the text attachment comprises:
calculating the difference between the score and the prediction threshold to obtain a prediction residual;
setting an adjustment gradient for adjusting the keyword evaluation set and the label ratio evaluation set according to the magnitude of the prediction residual;
and adjusting the score of each attachment category in the keyword evaluation set and the label ratio evaluation set according to the adjustment gradient, and predicting the attachment category and the corresponding score of the text attachment according to the adjusted keyword evaluation set and the adjusted label ratio evaluation set.
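One possible reading of claim 6's adjustment rule is sketched below. The specific rule (a step proportional to the prediction residual) and the step size are assumptions; the claim only states that the adjustment gradient follows the magnitude of the residual.

```python
# Hypothetical sketch of claim 6: compute the prediction residual against
# the threshold, derive an adjustment gradient from its magnitude, and
# shift the per-category scores of an evaluation set accordingly.
def adjust_evaluation_set(evaluation_set, score, prediction_threshold, step=0.1):
    residual = prediction_threshold - score      # prediction residual
    gradient = step * residual                   # adjustment gradient
    return {category: s + gradient for category, s in evaluation_set.items()}

adjusted = adjust_evaluation_set({"invoice": 0.4, "contract": 0.2},
                                 score=0.5, prediction_threshold=0.8)
# adjusted["invoice"] == 0.4 + 0.1 * 0.3
```

After the adjustment, prediction is re-run with the adjusted evaluation sets, as the claim's "return to the step of performing classification prediction" describes.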
7. The attachment classification method based on optical character recognition according to any one of claims 1 to 6, wherein the receiving attachments to be classified, and classifying the attachments to be classified by using the standard attachment classifier to obtain the classification result of the attachments to be classified comprises:
extracting all attachment keywords, paragraph labels and table labels in the attachments to be classified;
and classifying the attachments to be classified by using the standard attachment classifier according to all the attachment keywords, paragraph labels and table labels in the attachments to be classified, to obtain the classification result of the attachments to be classified.
8. An attachment classification device based on optical character recognition, the device comprising:
a sample data extraction module, configured to acquire a text attachment set generated from an image set to be classified through optical character recognition, select text attachments from the text attachment set one by one, extract the pre-labeled real category of each text attachment and all attachment keywords, paragraph labels and table labels in the text attachment, combine all the attachment keywords extracted from the text attachment to obtain a training keyword set of the text attachment, and calculate the quantity ratio of the paragraph labels to the table labels to obtain a training label ratio;
a classifier prediction module, configured to configure a pre-constructed original attachment classifier by using the training keyword set and the training label ratio, and perform classification prediction on the text attachment by using the original attachment classifier to obtain the attachment category and corresponding score of the text attachment;
a classification result judgment module, configured to judge whether the score is smaller than a preset prediction threshold, and when the score is smaller than the preset prediction threshold, perform gradient adjustment on the original attachment classifier by using the score and return to the step of performing classification prediction on the text attachment by using the original attachment classifier to obtain the attachment category and corresponding score of the text attachment;
a classifier effect judgment module, configured to, when the score is greater than or equal to the preset prediction threshold, compare the predicted attachment category with the real category of the text attachment to obtain a prediction result of correct prediction or incorrect prediction, summarize the prediction results of all text attachments in the text attachment set to obtain a prediction accuracy, judge whether the prediction accuracy is greater than or equal to a preset training threshold, and if the prediction accuracy is smaller than the training threshold, return to the step of configuring the pre-constructed original attachment classifier by using the training keyword set and the training label ratio, until the prediction accuracy is greater than or equal to the training threshold, and stop the iterative training to obtain a standard attachment classifier;
and a classifier identification module, configured to receive attachments to be classified, and classify the attachments to be classified by using the standard attachment classifier to obtain a classification result of the attachments to be classified.
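The control flow coordinated by the modules of claim 8 (inner score-threshold loop, outer accuracy-threshold loop) can be sketched as below. All functions, thresholds, and the expectation that `adjust` mutates the classifier state that `predict` reads are illustrative assumptions.

```python
# Hypothetical sketch of the iterative training loop in claim 8: re-predict
# while the score is below the prediction threshold (classification result
# judgment), then repeat full passes until the prediction accuracy reaches
# the training threshold (classifier effect judgment).
def train_classifier(text_attachments, predict, adjust,
                     prediction_threshold=0.8, training_threshold=0.9):
    while True:
        correct = 0
        for attachment in text_attachments:
            category, score = predict(attachment)
            while score < prediction_threshold:
                adjust(score)                    # gradient adjustment step
                category, score = predict(attachment)
            if category == attachment["real_category"]:
                correct += 1
        accuracy = correct / len(text_attachments)
        if accuracy >= training_threshold:
            return accuracy                      # standard classifier reached

demo = [{"real_category": "invoice"}, {"real_category": "contract"}]
final_accuracy = train_classifier(demo,
                                  lambda a: (a["real_category"], 0.9),
                                  lambda s: None)
# final_accuracy == 1.0 for this trivially correct predictor
```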
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, to enable the at least one processor to perform the attachment classification method based on optical character recognition according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, implements the attachment classification method based on optical character recognition according to any one of claims 1 to 7.
CN202111437898.5A 2021-11-30 2021-11-30 Accessory classification method, device, equipment and medium based on optical character recognition Pending CN114153972A (en)


Publications (1)

Publication Number: CN114153972A
Publication Date: 2022-03-08

Family ID: 80454831



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination