CN112990177B

CN112990177B - Classified cataloguing method, device and equipment based on electronic file files

Info

Publication number: CN112990177B
Application number: CN202110391414.1A
Authority: CN
Inventors: 万玉晴; 王霄
Original assignee: Taiji Computer Corp Ltd
Current assignee: Taiji Computer Corp Ltd
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2021-09-21
Anticipated expiration: 2041-04-13
Also published as: CN112990177A

Abstract

The invention relates to a classification cataloguing method, a device and equipment based on electronic file files, belonging to the technical field of image processing.

Description

Classified cataloguing method, device and equipment based on electronic file files

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a classification cataloguing method, a classification cataloguing device and classification cataloguing equipment based on electronic file files.

Background

An electronic file records the documents produced over the whole course of an event and plays an important role in tracing and querying that event; the court electronic file is one example. A court electronic file is composed of the documents generated during the whole litigation process, including electronic documents, images and other electronic files produced by the court, the procuratorate and the parties while the case is handled. With the deepening of informatization in the courts, courts at all levels now store a large number of case electronic files. At present, classification cataloguing of court electronic file materials is mainly used for archiving, and its classification granularity is coarse. When cataloguing is used for file marking, there is no unified standard: cataloguing standards differ between courts, and the materials consulted differ between cases. Most business systems rely on manual classification cataloguing, so the classification cataloguing of court electronic file materials is inefficient and costs time and labor.

In the prior art there are some tool systems for automatic classification and cataloguing. However, because court electronic files contain many types of documents, no single technique can process all of them, and the existing tools suffer from insufficient precision and require a large amount of manual verification. For example, convolutional neural networks perform very well on image classification, but for pictures of text the class features of the image are not distinctive. High-precision classification is easy to achieve from the semantic features of text content, but pictures whose text cannot be obtained through OCR can only be handled by an image classifier. In addition, OCR recognition is slow, so over-reliance on text classification leads to inefficiency.

Therefore, how to improve the efficiency of classifying and cataloging electronic files according to the file type characteristics of the files becomes a technical problem to be solved urgently in the prior art.

Disclosure of Invention

The invention provides a classification cataloguing method, a classification cataloguing device and classification cataloguing equipment based on electronic file files, which effectively solve the technical problems of insufficient precision of a classification cataloguing tool system, large amount of manual verification and the like, and improve the efficiency of the classification cataloguing of the electronic file files.

The technical scheme provided by the invention is as follows:

in one aspect, a classification cataloguing method based on an electronic file is provided, the electronic file comprising a plurality of subfiles; the method comprises the following steps:

acquiring a picture set of an electronic file, and respectively performing quality detection and pretreatment on pictures in the picture set to acquire a clear picture set;

identifying an image class subset and a text class subset in the clear picture set based on a preset image classifier;

determining an image file category of each picture in the image class subset; identifying the full-text information of each text picture in the text class subset with an image-text recognizer, and identifying, based on text semantics, the text file type of the text picture corresponding to each piece of full-text information from the full-text information and a text classifier;

extracting file titles from the full-text information based on a dictionary and a regular expression, and judging the integrity of each subfile; in the same subfile, determining an arrangement position of a text picture corresponding to each full-text message in the subfile, wherein the arrangement position in the subfile comprises: a home page and a content page of the subfile;

acquiring a synthetic file based on the integrity of the subfiles and the arrangement position of the text picture corresponding to each full text message in the subfiles;

and calculating the semantic similarity between each image file type and each text file type and all types in a preset cataloguing standard based on the synthetic file, and generating a directory structure of the electronic file files according to the semantic similarity.

Optionally, the quality detection and the preprocessing are respectively performed on the pictures in the picture set to obtain a clear picture set, including:

carrying out graying processing on the pictures in the picture set to obtain grayed pictures;

performing definition detection on the grayed picture based on a Laplace operator to obtain a first clear picture and a picture to be processed;

based on image sharpening, sharpening the to-be-processed picture to obtain a second clear picture;

and acquiring the clear picture set according to the first clear picture and the second clear picture.

Optionally, the preset image classifier includes an image classification model trained with a ResNeXt network; the identifying an image class subset and a text class subset in the clear picture set based on a preset image classifier comprises:

and identifying an image class subset and a text class subset of the pictures in the clear picture set based on an image classification model trained by a ResNeXt network.

Optionally, the image-text recognizer includes an OCR recognizer, and the text classifier includes an SLFNs network model; the identifying of the full-text information of each text picture in the text class subset with the image-text recognizer, and the identifying, based on text semantics, of the text file type of the text picture corresponding to each piece of full-text information from the full-text information and the text classifier, comprises the following steps:

identifying full-text information of each text picture based on an OCR (optical character recognition) device;

based on the full text information of each text picture, text vector representation is obtained according to a multi-dimensional semantic representation method;

and inputting the text vector representation into an SLFNs network model obtained by training through a KELM algorithm in advance, and acquiring the corresponding text file type.

Optionally, the determining of the arrangement position of the text picture corresponding to each piece of full-text information in the subfile, the arrangement position comprising a home page and a content page of the subfile, includes:

if the title is successfully extracted, marking the corresponding text picture as the home page of the subfile;

and if the title extraction fails, inputting the tail sentence in the previous page of full-text information and the first sentence of the current full-text information into a pre-trained BERT model, acquiring the semantic association degree, and determining that the current text picture is the first page or the content page of the subfile according to the semantic association degree.

Optionally, the calculating semantic similarity between each image file category and each text file category and all categories in a preset cataloguing standard, and generating a catalog structure of the electronic file according to the semantic similarity includes:

calculating cosine distances between the semantic expression vector of each image file category and all category semantic vectors in a preset cataloguing standard, and calculating cosine distances between the semantic expression vector of each text file category and all category semantic vectors in the preset cataloguing standard;

and selecting the category in the preset cataloguing standard corresponding to the minimum cosine distance as the marking catalogue where the sub-file is located, and generating the catalogue structure of the electronic file.
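
The nearest-category selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the category names, the two-dimensional vectors and the helper names `cosine_distance` and `nearest_category` are hypothetical, and real semantic vectors would come from whatever embedding model is used.

```python
import math
from typing import Dict, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def nearest_category(vec: Sequence[float],
                     standard: Dict[str, Sequence[float]]) -> str:
    """Pick the cataloguing-standard category with the smallest cosine distance."""
    return min(standard, key=lambda name: cosine_distance(vec, standard[name]))


# Hypothetical cataloguing standard with toy 2-D semantic vectors.
standard = {"Complaint": [1.0, 0.0], "Evidence": [0.0, 1.0]}
print(nearest_category([0.9, 0.1], standard))
```

The subfile is then filed under the returned category when the directory structure is generated.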

Optionally, the preset cataloguing standard includes: a file category table and a classification reference table; the file category table is provided with fixed file categories; the method further comprises the following steps:

receiving a modification instruction for the classification reference table;

and modifying the classification reference table according to the modification instruction.

In yet another aspect, a classification cataloguing device based on an electronic file is provided, the electronic file comprising a plurality of subfiles; the device comprises: a quality detection and preprocessing module, a classification module, a file integrity judgment module and a directory generation module;

the quality detection and pretreatment module is used for acquiring a picture set of the electronic file, and respectively performing quality detection and pretreatment on pictures in the picture set to acquire a clear picture set;

the classification module is used for identifying an image class subset and a text class subset in the clear picture set based on a preset image classifier; determining an image file category of each picture in the image class subset; identifying the full-text information of each text picture in the text class subset with an image-text recognizer, and identifying, based on text semantics, the text file type of the text picture corresponding to each piece of full-text information from the full-text information and a text classifier;

the file integrity judging module is used for extracting file titles from the full-text information based on a dictionary and a regular expression and judging the integrity of each subfile; in the same subfile, determining an arrangement position of a text picture corresponding to each full-text message in the subfile, wherein the arrangement position in the subfile comprises: a home page and a content page of the subfile;

the directory generation module is used for acquiring a synthesized file based on the integrity of the subfile and the arrangement position of the text picture corresponding to each full text message in the subfile; and calculating the semantic similarity between each image file type and each text file type and all types in a preset cataloguing standard based on the synthetic file, and generating a directory structure of the electronic file files according to the semantic similarity.

Optionally, the quality detection and preprocessing module is configured to perform graying processing on the pictures in the picture set to obtain grayed pictures; performing definition detection on the grayed picture based on a Laplace operator to obtain a first clear picture and a picture to be processed; based on image sharpening, sharpening the to-be-processed picture to obtain a second clear picture; and acquiring the clear picture set according to the first clear picture and the second clear picture.

In yet another aspect, classification cataloguing equipment based on an electronic file is provided, comprising: a processor, and a memory coupled to the processor;

the memory is configured to store a computer program for performing at least the classification cataloguing method based on an electronic file according to any of the above;

the processor is used for calling and executing the computer program in the memory.

The invention has the beneficial effects that:

according to the classification cataloguing method, device and equipment based on the electronic file, provided by the embodiment of the invention, the clear picture set is obtained by obtaining the picture set of the electronic file and respectively carrying out quality detection and pretreatment on the pictures in the picture set; identifying an image class subset and a text class subset in the clear picture set based on a preset image classifier; determining the image file category of the pictures in the image category subset; respectively identifying full-text information of each text picture in the text type subset according to the image-text identifier, and identifying the text file type of each text picture corresponding to the full-text information based on text semantics according to the full-text information and the text type identifier; extracting a file title from the full-text information based on the dictionary and the regular expression, and judging the integrity of each sub-file; in the same sub-file, determining the arrangement position of the text picture corresponding to each full-text message in the sub-file, wherein the arrangement position in the sub-file comprises: the first page and the content page of the subfile; acquiring a synthesized file based on the integrity of the subfiles and the arrangement position of the text picture corresponding to each full text message in the subfiles; and calculating semantic similarity between each image file type and each text file type and all types in a preset cataloguing standard based on the synthetic file, and generating a directory structure of the electronic file according to the semantic similarity. 
The method and the device comprehensively utilize technologies such as digital image processing, machine vision and natural language processing, combine specific customer field requirements, efficiently classify and automatically catalog the electronic file, improve the automation degree of electronic file use in court business, further improve the working efficiency and save manpower.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart illustrating a method for cataloguing documents based on electronic files according to an embodiment of the present invention;

FIG. 2 is a partial flow chart illustrating a method for cataloguing documents based on electronic file classification according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a sorting and cataloguing apparatus based on electronic file files according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a classification and cataloguing apparatus based on electronic file files according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.

A court case file is composed of the documents generated during the whole litigation process. A large number of paper materials signed and sealed by the parties arise during litigation. By content, these paper materials are mainly divided into text materials (such as complaints and powers of attorney) and image materials (such as copies of identity cards, copies of lawyer practice certificates and other evidence materials), and they need to be scanned into pictures and stored in the electronic file. Before the electronic file enters the system, materials such as pictures and single-layer PDFs usually have to be named, classified and catalogued manually so that a file-reading directory can be formed. This process consumes a great deal of manpower and material resources and is inefficient.

Based on the above, the embodiment of the invention provides a classification and cataloguing method based on electronic file files, so as to realize intelligent processing of the electronic file files, automatically generate the marking catalog, save manpower and material resources and improve the working efficiency.

Fig. 1 is a schematic flowchart of a classification and cataloguing method based on electronic file files according to an embodiment of the present invention.

Referring to FIG. 1, an electronic file may include a plurality of subfiles, and the classification cataloguing method may include the following steps:

S1: acquiring the picture set of the electronic file, and performing quality detection and preprocessing on each picture in the picture set to obtain a clear picture set.

In a specific implementation, any electronic file that needs to be classified and catalogued can be defined as a target electronic file, and the classification cataloguing method provided by the application is applied to it. For example, the electronic file may be a court electronic file, a corporate electronic file, or the like. This embodiment describes the method using a court electronic file as an example.

For example, a picture set of the electronic file may be obtained through a scanning system. The picture set may include scans of a complaint, a delivery certificate, an identity card, a business license and so on; every scanned page is one picture, and all of the scanned pages together form the picture set. Each scanned document is a subfile of the electronic file and may consist of one page or several pages; for example, a scanned complaint may have 2 or 3 pages.

In some embodiments, optionally, step S1 comprises: converting the pictures in the picture set to grayscale to obtain grayscale pictures; performing sharpness detection on the grayscale pictures based on the Laplacian operator to obtain first clear pictures and pictures to be processed; sharpening the pictures to be processed to obtain second clear pictures; and obtaining the clear picture set from the first clear pictures and the second clear pictures.

For example, a complete court electronic file may be arranged and scanned according to a preset sequence, and a scanned picture set arranged according to the preset sequence is obtained, that is, the picture set.

The picture is first converted to grayscale to obtain a grayscale picture. At present most scanning systems work in the RGB color space, where each pixel is a three-dimensional vector. To reduce the amount of computation, the color image is converted to a grayscale image, and sharpness detection is carried out on the grayscale image.

Sharpness detection is then performed on the grayscale picture. The grayscale picture is convolved with the Laplacian operator to obtain the gradient response of the image, and the variance of that response is computed, giving a floating-point number that represents how blurred the image is. In a sharp picture the edge information is strong, so the variance is large. The Laplacian operator is an image edge detection method that measures the gradient change of an image by computing its second-order differential. Assume the grayscale image is $f(x, y)$.

The Laplacian is defined as:

$$\nabla^2 f(x, y) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4 f(x, y) \quad (1)$$

Expressed as a matrix, the above formula is

$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$

Using this matrix as a mask in a convolution with the original grayscale image gives the gradient change of each pixel in the up, down, left and right directions. In order to also take the gradient change in the diagonal directions into account, the Laplacian matrix used in the present embodiment is:

$$\begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}.$$

The pictures are then triaged based on thresholds. The blur thresholds are set according to the particular image data set. If the image variance is above the predefined maximum threshold thresh2, the image is considered sharp and the return code is 1. If the image variance is below the predefined minimum threshold thresh1, the image is considered blurred and the return code is 2; a blurred image is treated as an acquisition problem, and the user is prompted to reacquire it. If the image variance lies between thresh1 and thresh2, the image is considered able to meet the sharpness requirement after preprocessing and the return code is 3. In this way the first clear pictures (code 1) and the pictures to be processed (code 3) are obtained.
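
The variance-of-Laplacian check and the three return codes can be sketched in plain Python as below. This is an illustrative sketch only: it assumes the common 8-neighbour Laplacian mask with a negative centre, uses ITU-R BT.601 luminance weights for the grayscale conversion (the text does not name a conversion), skips border pixels, and the function names are hypothetical. The thresholds `thresh1` and `thresh2` must be tuned on the actual data set, as the text states.

```python
# Assumed 8-neighbour Laplacian mask (negative centre).
LAPLACIAN_8 = [[1, 1, 1],
               [1, -8, 1],
               [1, 1, 1]]


def to_gray(rgb):
    """Convert rows of (R, G, B) pixels to luminance (BT.601 weights, an assumption)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def laplacian_response(gray, mask=LAPLACIAN_8):
    """Apply the 3x3 mask to every interior pixel; border pixels are skipped."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += mask[dy + 1][dx + 1] * gray[y + dy][x + dx]
            row.append(acc)
        out.append(row)
    return out


def laplacian_variance(gray):
    """Variance of the Laplacian response: small for blurred images, large for sharp ones."""
    values = [v for row in laplacian_response(gray) for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def sharpness_code(gray, thresh1, thresh2):
    """Return 1 (sharp), 2 (blurred, reacquire) or 3 (to be preprocessed)."""
    variance = laplacian_variance(gray)
    if variance > thresh2:
        return 1
    if variance < thresh1:
        return 2
    return 3
```

Pictures with code 1 go straight into the clear picture set, pictures with code 3 are passed on to sharpening, and pictures with code 2 are sent back for rescanning.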

Image sharpening is performed on the picture to be processed to obtain a second clear picture. The preprocessing method used to improve image sharpness is mainly image sharpening, that is, the contrast of the original grayscale image is enhanced according to its gradient values so that a blurred image becomes clear. Suppose the original image is $f(x, y)$ and the gradient change value obtained by convolving the original image with the Laplacian operator is $\nabla^2 f(x, y)$. The sharpened image $g(x, y)$ is formulated as:

$$g(x, y) = f(x, y) + c \, \nabla^2 f(x, y) \quad (2)$$

When the central coefficient of the Laplacian operator is negative, $c = -1$.
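
Laplacian sharpening of this kind can be sketched as follows. The sketch assumes the 8-neighbour mask with a negative centre, so the Laplacian response is subtracted from the original pixel (c = -1), and it clips the result to the 0 to 255 range; the clipping and the untouched border are choices of this sketch, and the function name is hypothetical.

```python
# Assumed 8-neighbour Laplacian mask (negative centre).
LAPLACIAN_8 = [[1, 1, 1],
               [1, -8, 1],
               [1, 1, 1]]


def sharpen(gray, mask=LAPLACIAN_8, c=-1.0):
    """Return g = f + c * laplacian(f) for interior pixels, clipped to [0, 255]."""
    height, width = len(gray), len(gray[0])
    out = [[float(v) for v in row] for row in gray]  # border pixels stay unchanged
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = sum(mask[dy + 1][dx + 1] * gray[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = min(255.0, max(0.0, gray[y][x] + c * lap))
    return out


# A soft vertical edge: the dark side gets darker and the bright side brighter.
soft_edge = [[100, 100, 110, 110] for _ in range(3)]
print(sharpen(soft_edge)[1])
```

On the soft edge the two interior pixels move apart, which is the contrast enhancement the text describes.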

The first clear pictures and the second clear pictures are then collected as the clear picture set.

S2: identifying the image class subset and the text class subset in the clear picture set based on a preset image classifier.

In some embodiments, optionally, the preset image classifier includes an image classification model trained with a ResNeXt network.

For example, the preset image classifier is trained using the ResNeXt network model. ResNeXt combines ideas from ResNet and Inception; its essence is grouped convolution (Group Convolution), in which the number of groups is controlled by a variable called cardinality. Increasing cardinality is more effective than increasing depth or width, and can improve model accuracy without noticeably increasing the number of parameters. In addition, because every branch has the same topology, there are fewer hyperparameters, which makes the model easier to port.

In the embodiment of the invention, 22 categories common in court electronic files (such as lawyer practice certificates, identity cards, marriage certificates and business licenses) can be selected according to their image features, one of which is the text category; with an additional "other" category there are 23 categories in total. The corpus is labelled accordingly, and an image classification model is trained with the ResNeXt50_32x4d network structure. The model parameters are shown in the following table:

TABLE 1 ResNeXt model training parameters

Model parameter	Value
Optimizer	SGD
momentum	0.9
batch size	5
weight decay	0.0001
epoch	50

After the image classification model is trained, the clear pictures are input into the trained model, which identifies the image file categories labelled during training as well as the text-type pictures. The pictures identified as text are then further distinguished in order to determine their text file type.

S3, determining the image file type of the picture in the image type subset; and respectively identifying the full text information of each text picture in the text type subset according to the image-text identifier, and identifying the text file type of each text picture corresponding to the full text information based on text semantics according to the full text information and the text type identifier.

In some embodiments, optionally, the image-text recognizer comprises an OCR recognizer, and the text classifier comprises an SLFNs network model. Identifying the full-text information of each text picture in the text class subset with the image-text recognizer, and identifying the text file type of each text picture from the full-text information and the text classifier based on text semantics, comprises: recognizing the full-text information of each text picture with the OCR recognizer; obtaining a text vector representation from the full-text information of each text picture using a multi-dimensional semantic representation method; and inputting the text vector representation into an SLFNs network model trained in advance with the KELM algorithm to obtain the corresponding text file type.

For example, OCR is performed on the pictures identified as text to obtain the full-text information of each picture, and a classification based on text semantics then determines the text file type of each text picture. In the embodiment of the invention, in order to balance classification precision and efficiency and to reduce the model's dependence on manually labelled corpora, a simple single-hidden-layer feedforward neural network (SLFNs) is adopted as the classification model, and the Kernel Extreme Learning Machine (KELM) is adopted as the learning algorithm of the text classification model. To make up for the limited expressive power of the shallow model, a multi-dimensional semantic representation method is proposed, which captures the differences between text categories more concisely and accurately and serves as the text feature vector input to the model.

When performing multi-dimensional semantic representation of a file, a file corpus is first manually labelled with file categories, and a category feature dictionary is constructed using the chi-square test. After preprocessing such as word segmentation, stop-word removal, personal-name removal and low-frequency word filtering, the chi-square value between each word and each category is calculated over the category texts:

$$\chi^2(w, c) = \frac{N \, (AD - BC)^2}{(A + C)(A + B)(B + D)(C + D)} \quad (3)$$

where c is a category label; w is a word appearing in the text corpus of category c; N is the total number of files in the file corpus; A is the number of files that contain w and belong to category c; B is the number of files that contain w but do not belong to category c; C is the number of files that do not contain w but belong to category c; D is the number of files that neither contain w nor belong to category c.
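
As a worked illustration of the counts A, B, C and D, the following sketch computes the chi-square value of a word for a category using the standard 2x2 contingency form of the statistic. The corpus layout (a list of `(set_of_words, label)` pairs), the sample words and the function name are hypothetical.

```python
from typing import List, Set, Tuple


def chi_square(docs: List[Tuple[Set[str], str]], word: str, category: str) -> float:
    """Chi-square statistic of `word` with respect to `category`."""
    n = len(docs)
    a = sum(1 for words, label in docs if word in words and label == category)
    b = sum(1 for words, label in docs if word in words and label != category)
    c = sum(1 for words, label in docs if word not in words and label == category)
    d = n - a - b - c  # files that neither contain the word nor belong to the category
    denominator = (a + c) * (a + b) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # the word or the category is constant over the corpus
    return n * (a * d - b * c) ** 2 / denominator


# Toy corpus: two complaints and two delivery certificates.
corpus = [
    ({"plaintiff", "court"}, "complaint"),
    ({"plaintiff", "court"}, "complaint"),
    ({"served", "court"}, "delivery certificate"),
    ({"served", "court"}, "delivery certificate"),
]
print(chi_square(corpus, "plaintiff", "complaint"))  # 4.0
print(chi_square(corpus, "court", "complaint"))      # 0.0
```

A word that occurs only in one category scores high, while a word that occurs in every file scores zero and would fall below any threshold used to build the feature dictionary.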

In the embodiment of the invention, the chi-square value of each word in each category of files can be calculated and sorted in descending order; a threshold is set manually according to the distribution of the chi-square values of the feature words of each category; the feature words of each category that exceed the threshold are merged and, after duplicates are removed, form the category feature dictionary Dc of the file corpus. Based on the above definition, the representation vector of each file in the file corpus is calculated. The file corpus is denoted as

Each file corresponds to a category label of

The corresponding feature vector of the Doc2vec file is

Multidimensional semantic representation of documents

The following are obtained:

（4）

（5）

（6）

（7）

（8）

（9）

where $w_{i,j}$ denotes the Word2vec vector of the i-th word in document $d_j$, and $n_{i,j}$ denotes the number of times $w_{i,j}$ occurs in $d_j$. In the embodiment of the invention, the words of $d_j$ that appear in the feature dictionary Dc are selected, and their tf-idf values are calculated and used as weights for combining the feature word vectors, giving a word-granularity semantic representation of the text. At the same time, the LDA-based topic information of the words is added to the feature word vectors, and the result is finally combined with the document feature vector.
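
The idea of weighting dictionary words by tf-idf and appending a document-level vector can be sketched as below. This is not a reconstruction of equations (4) to (9): the smoothed idf, the normalisation by the sum of weights, the omission of the LDA topic features and all function names are assumptions of this sketch, and the word and document vectors are toy stand-ins for Word2vec and Doc2vec outputs.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def tfidf_weighted_vector(doc: Sequence[str],
                          corpus: Sequence[Sequence[str]],
                          feature_dict: Set[str],
                          word_vectors: Dict[str, List[float]]) -> List[float]:
    """tf-idf weighted average of the vectors of words found in the feature dictionary."""
    dim = len(next(iter(word_vectors.values())))
    counts = Counter(t for t in doc if t in feature_dict and t in word_vectors)
    total = [0.0] * dim
    weight_sum = 0.0
    for word, n_ij in counts.items():
        tf = n_ij / len(doc)
        df = sum(1 for other in corpus if word in other)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0  # smoothed idf (assumption)
        weight = tf * idf
        weight_sum += weight
        for k in range(dim):
            total[k] += weight * word_vectors[word][k]
    if weight_sum == 0.0:
        return total  # no dictionary word present: zero vector
    return [v / weight_sum for v in total]


def multi_dim_representation(doc, corpus, feature_dict, word_vectors, doc_vector):
    """Concatenate the word-level representation with a document-level vector."""
    return tfidf_weighted_vector(doc, corpus, feature_dict, word_vectors) + list(doc_vector)
```

In use, `doc_vector` would be the Doc2vec vector of the file and the concatenated result would be the input to the text classifier.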

When classifying text files, the case texts may be classified based on KELM. According to kernel function theory, a kernel function implicitly maps data into a high-dimensional feature space, where the samples become linearly separable, and at the same time it removes the problem of random initialization in ELM. Applying Mercer's condition, the kernel matrix of the ELM is defined as follows:

the prediction output function of the KELM may be expressed as:

（10）

as can be seen from the above equation, after the kernel function is determined, there is no need to know the feature mapping

Nor is it necessary to give the dimension L (the number of hidden neurons) of the feature space. In the invention, a Gaussian kernel function is selected as the kernel function of the ELM, and it maps a sample from the original input space to an infinite-dimensional space.

（11）

In order to be a parameter of the kernel function,

the output weight matrix for adjusting the KELM according to equation (10) is:

（12）

namely:

（13）

The KELM algorithm obtains the globally optimal solution for the output weights in a single computation; compared with back-propagation training based on gradient descent, it computes quickly and generalizes well.
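
A minimal pure-Python sketch of KELM training and prediction is given below. It follows the commonly published form of the algorithm, in which the output weights solve (I/C + Ω)β = T with Ω the kernel matrix over the training samples, and a new sample is scored by its kernel values against the training samples. The Gaussian kernel parameterisation `exp(-gamma * ||x - y||^2)`, the one-hot targets, the default values of `c` and `gamma`, and all function names are assumptions of this sketch rather than details taken from the source.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel; `gamma` is an assumed parameterisation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + list(map(float, b[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def train_kelm(xs, labels, n_classes, c=100.0, gamma=1.0):
    """Return output weights beta = (I/C + Omega)^-1 T, computed in one pass."""
    n = len(xs)
    regularised = [[gaussian_kernel(xs[i], xs[j], gamma) + (1.0 / c if i == j else 0.0)
                    for j in range(n)] for i in range(n)]
    targets = [[1.0 if k == y else 0.0 for k in range(n_classes)] for y in labels]
    return solve(regularised, targets)


def predict_kelm(x, xs, beta, gamma=1.0):
    """Score x against every training sample and return the best class index."""
    k = [gaussian_kernel(x, xi, gamma) for xi in xs]
    scores = [sum(k[i] * beta[i][j] for i in range(len(xs)))
              for j in range(len(beta[0]))]
    return max(range(len(scores)), key=scores.__getitem__)
```

Training involves only one linear solve and no iterative weight updates, which is the property the paragraph above contrasts with gradient-descent back-propagation.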

S4, extracting file titles from the full-text information based on the dictionary and the regular expression, and judging the integrity of each sub-file; in the same sub-file, determining the arrangement position of the text picture corresponding to each full-text message in the sub-file, wherein the arrangement position in the sub-file comprises: the first page and the content page of the subfile.

In some embodiments, optionally, determining an arrangement position of the text picture corresponding to each full-text message in the subfile, where the arrangement position in the subfile includes: a home page and a content page of the subfile, comprising: if the title is successfully extracted, marking the corresponding text picture as the home page of the subfile; and if the title extraction fails, inputting the tail sentence in the previous page of full-text information and the first sentence of the current full-text information into a pre-trained BERT model, acquiring the semantic association degree, and determining that the current text picture is the first page or the content page of the subfile according to the semantic association degree.

For example, in the embodiment of the present invention, each sub-file may include a plurality of single pictures, and therefore, it is necessary to determine the integrity of each sub-file and determine to which sub-file each picture belongs.

For a text-type picture, the file type is first predicted by the trained text classifier; on that basis, an attempt is made to extract a file title from the text based on the dictionary and the regular expression. If the title is obtained successfully, then in addition to confirming the file type, it also indicates that the picture is the first page of a file. For the case where no title can be obtained from the text, the embodiment of the invention relies on the fact that adjacent pages are semantically related in a natural-language context: a BERT pre-trained language model is used to calculate the semantic relevance between the text content of the preceding page and that of the following page, and to judge whether the two are adjacent pages of the same file, thereby solving the problem of file integrity judgment.
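The decision logic described above might be sketched as follows. The title dictionary, the regular expression, the 0.5 threshold, and the `association` callable (a stand-in for the pre-trained BERT model, returning a score between 0 and 1) are all hypothetical and only illustrate the control flow. Treating the very first page of the set as a first page is likewise an assumption.

```python
import re

# Hypothetical title dictionary and pattern; real ones would be built from court file titles.
TITLE_DICTIONARY = ["Civil Complaint", "Civil Judgment", "Power of Attorney"]
TITLE_PATTERN = re.compile(
    r"^((?:\w+\s+){0,3}(?:Complaint|Judgment|Ruling))$", re.IGNORECASE
)
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def extract_title(full_text, max_lines=3):
    """Return a title found in the first few recognized lines, or None."""
    for line in full_text.splitlines()[:max_lines]:
        candidate = line.strip()
        for title in TITLE_DICTIONARY:  # dictionary lookup first
            if candidate.lower() == title.lower():
                return title
        match = TITLE_PATTERN.match(candidate)  # then the regular expression
        if match:
            return match.group(1)
    return None


def label_pages(pages, association, threshold=0.5):
    """Label each page text as 'first' or 'content'.

    association(tail_sentence, head_sentence) stands in for the BERT-based
    semantic association degree; threshold is an assumed cut-off.
    """
    labels = []
    for index, text in enumerate(pages):
        if index == 0 or extract_title(text) is not None:
            labels.append("first")
            continue
        tail = SENTENCE_SPLIT.split(pages[index - 1].strip())[-1]
        head = SENTENCE_SPLIT.split(text.strip())[0]
        is_continuation = association(tail, head) >= threshold
        labels.append("content" if is_continuation else "first")
    return labels
```

A page without a title is only treated as a content page when the association score says it continues the previous page; otherwise it opens a new sub-file.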

Two strategies are used in the training process of BERT: Masked LM (MLM) and Next Sentence Prediction (NSP), wherein the NSP strategy gives the trained model the ability to judge adjacent pages. The model receives pairs of sentences as input and predicts whether the second sentence is the subsequent sentence in the original document. During training, in 50% of the input pairs the second sentence actually follows the first in the original document, and in the other 50% the second sentence is chosen at random from the corpus and is unrelated in content to the first sentence. To help the model distinguish between the two sentences during training, the input is processed as follows before entering the model:

(1) A [CLS] tag is inserted at the beginning of the first sentence and a [SEP] tag is inserted at the end of each sentence.

(2) A sentence embedding indicating sentence A or sentence B is added to each token.

(3) A position embedding is added to each token to indicate its position in the sequence.

(4) To predict whether the second sentence is a continuation of the first sentence, the following steps are used:

A. inputting the whole input sequence into a Transformer model;

B. transforming the output of the [CLS] token into a 2×1 vector with a simple classification layer;

C. calculating the probability of IsNextSequence by using softmax.

When training the BERT model, Masked LM and Next Sentence Prediction are trained together, and the goal is to minimize the combined loss function of the two strategies.
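The input processing and the classification step listed above can be illustrated with a small sketch. It does not run a Transformer: the `cls_vector`, `weights` and `bias` arguments are hypothetical stand-ins for the [CLS] output and the classification layer, and placing IsNextSequence at index 0 is an assumed convention.

```python
import math


def build_nsp_input(tokens_a, tokens_b):
    """Build the token sequence, sentence (segment) ids and position ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Sentence A (with [CLS] and its [SEP]) gets id 0; sentence B (with its [SEP]) gets id 1.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_next_probability(cls_vector, weights, bias):
    """Map the [CLS] output to two logits and return the softmax probability at index 0.

    weights has two rows (one per class); index 0 is assumed to mean IsNextSequence.
    """
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)[0]
```

For page adjacency, sentence A would be the last sentence of the previous page and sentence B the first sentence of the current page.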

In the embodiment of the invention, the BERT framework is used to judge the semantic association between adjacent file pages and to analyze whether each page is the first page or a content page of a sub-file, thereby realizing the integrity judgment of the sub-files.

S5, acquiring a synthesized file based on the integrity of the sub-files and the arrangement position in the sub-file of the text picture corresponding to each piece of full-text information.

For example, after the complete sub-files have been determined, the text pictures are combined according to the arrangement position of each text picture in its sub-file.
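As a minimal sketch of this combination step, assuming each page already carries a hypothetical "first" or "content" label from the preceding step, the pages can be grouped in order so that every first page opens a new sub-file:

```python
def compose_subfiles(page_ids, labels):
    """Group ordered page ids into sub-files; each 'first' label starts a new sub-file."""
    subfiles = []
    for page_id, label in zip(page_ids, labels):
        if label == "first" or not subfiles:
            subfiles.append([page_id])  # start a new sub-file
        else:
            subfiles[-1].append(page_id)  # content page joins the current sub-file
    return subfiles
```

The label names and the list-of-lists result are illustrative choices, not part of the described method.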

S6, calculating, based on the synthesized files, the semantic similarity between each image file category and each text file category and all categories in a preset cataloguing standard, and generating a directory structure of the electronic file according to the semantic similarity.

In some embodiments, optionally, calculating the semantic similarity between each image file category and each text file category and all categories in the preset cataloguing standard, and generating the directory structure of the electronic file according to the semantic similarity, includes: calculating the cosine distances between the semantic representation vector of each image file category and the semantic vectors of all categories in the preset cataloguing standard, and calculating the cosine distances between the semantic representation vector of each text file category and the semantic vectors of all categories in the preset cataloguing standard; and selecting the category in the preset cataloguing standard corresponding to the minimum cosine distance as the marking catalogue where the sub-file is located, and generating the directory structure of the electronic file.

For example, in the present application, phrase-level semantic similarity calculation may be employed, which has better performance and generalization capability than a keyword matching method. With the help of the BERT model used in the file integrity judgment of the present invention, a word vector model is obtained by training on a large number of domain text files; a phrase vector is obtained as the average of its word vectors, and the semantic similarity is measured by calculating the cosine distance between two phrase vectors. The calculation of the cosine distance is prior art and is not described here. Finally, each image file category and each text file category is matched against the court marking catalogue to generate the catalogue of the electronic file, wherein each catalogue item in the catalogue is associated with the corresponding pictures.
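The matching step might be sketched as below, assuming a hypothetical `word_vectors` lookup table in place of the trained word vector model and whitespace-separated phrases (Chinese text would first need word segmentation). The category with the minimum cosine distance is selected.

```python
import math


def phrase_vector(phrase, word_vectors):
    """Average the vectors of the known words in a phrase."""
    vectors = [word_vectors[w] for w in phrase.split() if w in word_vectors]
    if not vectors:
        raise ValueError("no known words in phrase: " + phrase)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


def match_category(category, standard_categories, word_vectors):
    """Pick the cataloguing-standard category closest to the predicted category."""
    target = phrase_vector(category, word_vectors)
    return min(
        standard_categories,
        key=lambda c: cosine_distance(target, phrase_vector(c, word_vectors)),
    )
```

Averaging word vectors is the simple phrase representation named in the text; the toy vectors used to try this sketch carry no real semantics.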

According to the classification cataloguing method based on the electronic file provided by the embodiment of the invention, a clear picture set is obtained by acquiring the picture set of the electronic file and performing quality detection and preprocessing on each picture in the picture set; an image class subset and a text class subset in the clear picture set are identified based on a preset image classifier; the image file category of the pictures in the image class subset is determined; the full-text information of each text picture in the text class subset is identified according to the image-text identifier, and the text file category of the text picture corresponding to each piece of full-text information is identified based on text semantics according to the full-text information and the text type identifier; file titles are extracted from the full-text information based on the dictionary and the regular expression, and the integrity of each sub-file is judged; in the same sub-file, the arrangement position in the sub-file of the text picture corresponding to each piece of full-text information is determined, the arrangement position in the sub-file comprising: the first page and the content page of the sub-file; a synthesized file is acquired based on the integrity of the sub-files and the arrangement position in the sub-file of the text picture corresponding to each piece of full-text information; and the semantic similarity between each image file category and each text file category and all categories in a preset cataloguing standard is calculated based on the synthesized file, and a directory structure of the electronic file is generated according to the semantic similarity.
The method and the device comprehensively use technologies such as digital image processing, machine vision and natural language processing, combined with the specific requirements of the customer's field, to classify and automatically catalogue electronic files efficiently; this raises the degree of automation in the use of electronic files in court business, improves working efficiency and saves manpower.

Based on a general inventive concept, the embodiment of the present invention further provides another classification cataloguing method based on electronic file files.

Fig. 2 is a partial schematic flow chart of another classification and cataloguing method based on electronic file files according to an embodiment of the present invention.

Referring to Fig. 2, on the basis of the above embodiment, the preset cataloguing standard includes: a file category table and a classification reference table; the file category table is provided with fixed file categories. The method of the embodiment of the application may further include the following steps:

S21, receiving a classification reference table modification instruction;

S22, modifying the classification reference table according to the classification reference table modification instruction.

For example, in the automatic generation of the file reading catalogue, the file types of a court case depend on the specific type of case, and cases of different types contain different file types. In the present application, the file category table and the classification reference table may both be set in the preset cataloguing standard. The file category table holds fixed file categories; for example, a static file list is set according to the court's requirements for the files of a case, and records the file types that each type of case must contain. The classification reference table may be set to be dynamic and records the file types that may appear in each type of case; these file types are added dynamically by the business, and operation and maintenance personnel regularly check, modify and confirm them.
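One possible way to hold the two tables and to apply steps S21 and S22 is sketched below. The class names, the "add" and "remove" actions and the per-case-type keys are hypothetical; the text does not specify the format of a modification instruction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ModifyInstruction:
    """Hypothetical classification reference table modification instruction (S21)."""
    case_type: str
    action: str  # "add" or "remove"
    category: str


@dataclass
class CataloguingStandard:
    # Static table: fixed file categories required for each case type.
    file_category_table: Dict[str, Tuple[str, ...]]
    # Dynamic table: file categories that may additionally appear for each case type.
    classification_reference_table: Dict[str, Set[str]] = field(default_factory=dict)

    def apply(self, instruction: ModifyInstruction) -> None:
        """Modify only the dynamic reference table according to the instruction (S22)."""
        categories = self.classification_reference_table.setdefault(
            instruction.case_type, set()
        )
        if instruction.action == "add":
            categories.add(instruction.category)
        elif instruction.action == "remove":
            categories.discard(instruction.category)
        else:
            raise ValueError("unknown action: " + instruction.action)

    def categories_for(self, case_type: str) -> List[str]:
        """Return the fixed categories followed by the sorted dynamic ones."""
        fixed = list(self.file_category_table.get(case_type, ()))
        extra = self.classification_reference_table.get(case_type, set()) - set(fixed)
        return fixed + sorted(extra)
```

Keeping the static table out of reach of `apply` mirrors the split between fixed categories and categories that operation and maintenance personnel adjust over time.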

The classification cataloguing method based on the electronic file provided by the embodiment of the invention further meets the requirements of users by setting the static file classification table and the dynamic classification reference table.

Based on a general inventive concept, the embodiment of the present invention further provides a sorting and cataloguing apparatus based on electronic file files.

Fig. 3 is a schematic structural diagram of a classification and cataloguing apparatus based on electronic file files according to an embodiment of the present invention. Referring to Fig. 3, the electronic file includes a plurality of sub-files, and the apparatus according to the embodiment of the present invention may include: a quality detection and preprocessing module 31, a classification module 32, a file integrity judgment module 33, and a directory generation module 34.

The quality detection and preprocessing module 31 is configured to obtain a picture set of the electronic file, and perform quality detection and preprocessing on pictures in the picture set respectively to obtain a clear picture set;

a classification module 32, configured to identify an image class subset and a text class subset in the clear picture set based on a preset image classifier; determining the image file category of the pictures in the image category subset; respectively identifying full-text information of each text picture in the text type subset according to the image-text identifier, and identifying the text file type of each text picture corresponding to the full-text information based on text semantics according to the full-text information and the text type identifier;

the file integrity judging module 33 is used for extracting file titles from the full-text information based on the dictionary and the regular expression and judging the integrity of each sub-file; in the same sub-file, determining the arrangement position of the text picture corresponding to each full-text message in the sub-file, wherein the arrangement position in the sub-file comprises: the first page and the content page of the subfile;

the directory generation module 34 is configured to obtain a synthesized file based on the integrity of the subfile and the arrangement position of the text picture corresponding to each full-text message in the subfile; and calculating semantic similarity between each image file type and each text file type and all types in a preset cataloguing standard based on the synthetic file, and generating a directory structure of the electronic file according to the semantic similarity.

Optionally, the quality detection and preprocessing module 31 is configured to perform graying processing on the pictures in the picture set to obtain grayed pictures; performing definition detection on the gray picture based on a Laplace operator to obtain a first clear picture and a picture to be processed; based on image sharpening, sharpening the picture to be processed to obtain a second clear picture; and acquiring a clear picture set according to the first clear picture and the second clear picture.

Optionally, the classification module 32 is configured to identify full-text information of each text image based on an OCR identifier; acquiring text vector representation according to a multi-dimensional semantic representation method based on full text information of each text picture; and inputting the text vector representation into an SLFNs network model obtained by training through a KELM algorithm in advance, and acquiring the corresponding text file type.

Optionally, the file integrity judgment module 33 is configured to mark the corresponding text picture as the first page of the sub-file if the title is successfully extracted; and if the title extraction fails, to input the last sentence of the full-text information of the previous page and the first sentence of the current full-text information into a pre-trained BERT model, acquire the semantic association degree, and determine, according to the semantic association degree, whether the current text picture is the first page or a content page of the sub-file.

Optionally, the directory generation module 34 is configured to calculate the cosine distances between the semantic representation vector of each image file category and the semantic vectors of all categories in the preset cataloguing standard, and to calculate the cosine distances between the semantic representation vector of each text file category and the semantic vectors of all categories in the preset cataloguing standard; and to select the category in the preset cataloguing standard corresponding to the minimum cosine distance as the marking catalogue where the sub-file is located, and generate the directory structure of the electronic file.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

According to the classification cataloguing device based on the electronic file provided by the embodiment of the invention, a clear picture set is obtained by acquiring the picture set of the electronic file and performing quality detection and preprocessing on each picture in the picture set; an image class subset and a text class subset in the clear picture set are identified based on a preset image classifier; the image file category of the pictures in the image class subset is determined; the full-text information of each text picture in the text class subset is identified according to the image-text identifier, and the text file category of the text picture corresponding to each piece of full-text information is identified based on text semantics according to the full-text information and the text type identifier; file titles are extracted from the full-text information based on the dictionary and the regular expression, and the integrity of each sub-file is judged; in the same sub-file, the arrangement position in the sub-file of the text picture corresponding to each piece of full-text information is determined, the arrangement position in the sub-file comprising: the first page and the content page of the sub-file; a synthesized file is acquired based on the integrity of the sub-files and the arrangement position in the sub-file of the text picture corresponding to each piece of full-text information; and the semantic similarity between each image file category and each text file category and all categories in a preset cataloguing standard is calculated based on the synthesized file, and a directory structure of the electronic file is generated according to the semantic similarity.
The method and the device comprehensively use technologies such as digital image processing, machine vision and natural language processing, combined with the specific requirements of the customer's field, to classify and automatically catalogue electronic files efficiently; this raises the degree of automation in the use of electronic files in court business, improves working efficiency and saves manpower.

Fig. 4 is a schematic structural diagram of classification and cataloguing equipment based on electronic file files according to an embodiment of the present invention. Referring to Fig. 4, the classification and cataloguing equipment based on electronic file files according to the embodiment of the present invention includes: a processor 41, and a memory 42 coupled to the processor.

The memory 42 is used for storing a computer program, and the computer program is at least used for executing the classification cataloguing method based on the electronic file files in any one of the above embodiments;

the processor 41 is used to invoke and execute computer programs in memory.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A classification cataloguing method based on an electronic file, wherein the electronic file comprises: a plurality of subfiles; the method comprises the following steps:

acquiring a picture set of an electronic file, and respectively performing quality detection and preprocessing on pictures in the picture set to acquire a clear picture set, including: carrying out graying processing on the pictures in the picture set to obtain grayed pictures; performing definition detection on the grayed pictures based on a Laplace operator to obtain a first clear picture and a picture to be processed; based on image sharpening, sharpening the picture to be processed to obtain a second clear picture; acquiring the clear picture set according to the first clear picture and the second clear picture;

extracting file titles from the full-text information based on a dictionary and a regular expression, and judging the integrity of each subfile; in the same subfile, determining the arrangement position of the text picture corresponding to each full text message in the subfile, including: if the title is successfully extracted, marking the corresponding text picture as the home page of the subfile; if the title extraction fails, inputting a tail sentence in the previous page of full-text information and a head sentence of the current full-text information into a pre-trained BERT model, acquiring semantic association degree, and determining that the current text picture is a head page or a content page of the subfile according to the semantic association degree; the arrangement positions in the subfiles include: a home page and a content page of the subfile;

2. The method of claim 1, wherein the pre-set image classifier comprises a ResNeXt network trained image classification model; the identifying an image class subset and a text class subset in the clear picture set based on a preset image classifier comprises:

3. The method of claim 1, wherein the image-text identifier comprises: an OCR recognizer; the text classifier comprises an SLFNs network model; the identifying the full-text information of each text picture in the text type subset respectively according to the image-text identifier, and identifying the text file type of each text picture corresponding to the full-text information based on text semantics according to the full-text information and the text type identifier comprises the following steps:

4. The method according to claim 1, wherein the calculating semantic similarity between each image file category and each text file category and all categories in a preset cataloguing standard respectively, and generating a directory structure of the electronic file according to the semantic similarity comprises:

5. The method of claim 1, wherein the preset cataloguing standard comprises: a file category table and a classification reference table; the file category table is provided with fixed file categories; the method further comprises the following steps:

6. A classification cataloguing apparatus based on an electronic file, wherein the electronic file comprises: a plurality of subfiles; the apparatus comprises: a quality detection and preprocessing module, a classification module, a file integrity judgment module and a directory generation module;

the quality detection and preprocessing module is used for acquiring a picture set of the electronic file, and respectively performing quality detection and preprocessing on pictures in the picture set to acquire a clear picture set, including: performing graying processing on the pictures in the picture set to acquire grayed pictures; performing definition detection on the grayed pictures based on a Laplace operator to obtain a first clear picture and a picture to be processed; based on image sharpening, sharpening the picture to be processed to obtain a second clear picture; acquiring the clear picture set according to the first clear picture and the second clear picture;

the file integrity judging module is used for extracting file titles from the full-text information based on a dictionary and a regular expression and judging the integrity of each subfile; in the same subfile, determining the arrangement position of the text picture corresponding to each full text message in the subfile, including: if the title is successfully extracted, marking the corresponding text picture as the home page of the subfile; if the title extraction fails, inputting a tail sentence in the previous page of full-text information and a head sentence of the current full-text information into a pre-trained BERT model, acquiring semantic association degree, and determining that the current text picture is a head page or a content page of the subfile according to the semantic association degree; the arrangement positions in the subfiles include: a home page and a content page of the subfile;

7. Classification cataloguing equipment based on an electronic file, comprising: a processor, and a memory coupled to the processor;

the memory is used for storing a computer program, and the computer program is at least used for executing the classification cataloguing method based on the electronic file according to any one of claims 1-5;