CN115328871A

CN115328871A - Evaluation method for format data stream file conversion based on machine learning model

Info

Publication number: CN115328871A
Application number: CN202211244772.0A
Authority: CN
Inventors: 胡夕国; 胡玥
Original assignee: Nantong Zhonghong Network Technology Co ltd
Current assignee: Nantong Zhonghong Network Technology Co ltd
Priority date: 2022-10-12
Filing date: 2022-10-12
Publication date: 2022-11-11
Anticipated expiration: 2042-10-12
Also published as: CN115328871B

Abstract

The invention relates to the field of data identification, and in particular to an evaluation method for format data stream file conversion based on a machine learning model, which comprises the following steps: step one, acquiring a training sample set; step two, training the constructed neural network model with the training sample set to obtain a trained neural network model; and step three, evaluating the conversion quality of the format data stream file to be tested with the trained neural network model. By acquiring the training sample set and training the neural network model on it, the scheme of the invention enables rapid and efficient evaluation of the conversion quality of format data stream files.

Description

Evaluation method for format data stream file conversion based on machine learning model

Technical Field

The invention relates to the field of data identification, in particular to an evaluation method for format data stream file conversion based on a machine learning model.

Background

With the development of global digitalization and informatization, the mass production of electronic documents has brought earth-shaking changes to people's lives, and in many fields and application scenarios electronic documents have gradually replaced paper documents as the main object of reading and processing.

Most electronic documents are format data stream files stored in PDF, PNG and other formats; however, a format data stream file is not well suited to reading on terminals or media with different screen or window sizes. For example, to read a format data stream file with an A4 layout on a small-screen terminal, the page must be shrunk to the screen size to show a complete line/column, because text lines/columns cannot be reflowed. When a large-layout book is shrunk to the size of the screen, the characters easily become too small to read clearly; otherwise, the page must be scrolled continuously according to the reading position in order to read each line/column completely.

To support reading of format data stream files on terminals or media of different sizes, the prior art performs reflowable conversion on format data stream files to obtain reflowable files in various formats, such as TXT, HTML and WORD files. However, reflowable files in different formats differ in typesetting and layout, so their reading effect differs; the problem to be solved is therefore how to evaluate the result of reflowable conversion effectively, so as to provide users with a better reading experience.

Disclosure of Invention

In order to solve the above technical problems, an object of the present invention is to provide an evaluation method for format data stream file conversion based on a machine learning model, wherein the adopted technical scheme is as follows:

the invention provides an evaluation method for format data stream file conversion based on a machine learning model, which comprises the following steps:

step one, acquiring a training sample set;

step two, training the constructed neural network model by using the training sample set to obtain a trained neural network model;

step three, evaluating the conversion quality of the format data stream file to be tested by using the trained neural network model;

the acquisition process of the training sample set comprises the following steps:

extracting regions of interest from the format data stream file before and after conversion respectively to obtain a plurality of region-of-interest pairs, wherein each region-of-interest pair comprises a pre-conversion region of interest and a post-conversion region of interest; calculating the absolute value of the conversion error of each region-of-interest pair, and thereby obtaining the sum of the conversion error values of the format data stream file;

respectively carrying out convex hull detection on the region of interest before conversion and the region of interest after conversion in each region of interest pair to obtain two corresponding convex hulls; performing Fourier transform on each convex hull to obtain frequency domain information, obtaining two corresponding frequency domain signals, and respectively using the two frequency domain signals as a pre-conversion form vector and a post-conversion form vector in the region of interest pair; obtaining conversion error distribution characteristics according to the form vector before conversion and the form vector after conversion, obtaining conversion error distribution characteristic sequences of all interested region pairs, and obtaining conversion heterogeneity based on the conversion error distribution characteristic sequences;

classifying the different format data stream files based on the conversion heterogeneity and the sum of the conversion error values of each format data stream file to obtain different category clusters; performing statistical analysis on each category cluster to obtain a type descriptor; and calculating the membership degree of the type descriptor, wherein when the membership degree is greater than or equal to a threshold value, the conversion of the format data stream file is considered normal and the file is used as a training sample, until a training sample set is obtained.

Preferably, the input of the neural network model is the conversion heterogeneity and the sum of the conversion error values of each format data stream file, and the output is the membership degree.

Preferably, the conversion error distribution characteristic is obtained by calculating the cosine similarity of the pre-conversion form vector and the post-conversion form vector in each region-of-interest pair.

Preferably, the obtaining process of the conversion heterogeneity degree is as follows:

respectively calculating the similarity between the conversion error distribution characteristic sequence of the conversion record of the current format data stream file and the conversion error distribution characteristic sequences of the conversion records of the other format data stream files, sorting the similarities from large to small, selecting the conversion error distribution characteristic sequence of the K-th most similar format data stream file as the K-th similar distribution characteristic sequence, and taking the conversion error distribution characteristic sequence of the format data stream file with the maximum similarity as the most similar distribution characteristic sequence;

calculating the conversion heterogeneity from the conversion error distribution characteristic sequence of the conversion record of the current file, the most similar distribution characteristic sequence and the K-th similar distribution characteristic sequence:

[formula not reproduced in the source text]

wherein the terms of the formula are, in order: the conversion error distribution characteristic sequence, the most similar distribution characteristic sequence, the K-th similar distribution characteristic sequence, and the loss function.

Preferably, the conversion error value is the absolute value of the difference between the aspect ratio of the pre-conversion region of interest and the aspect ratio of the post-conversion region of interest in any region-of-interest pair, wherein the aspect ratio is computed from min(w, L) and max(w, L), where w is the width, L is the length, min() takes the minimum value and max() takes the maximum value.

Preferably, the specific process of classifying the data stream files of different formats to obtain different category clusters is as follows:

calculating the difference distance between any two format data stream files X and Y according to the conversion heterogeneity and the sum of the conversion error values of each format data stream file:

[formula not reproduced in the source text]

wherein the terms of the formula are: the sum of the conversion error values of the format data stream file X, the sum of the conversion error values of the format data stream file Y, the conversion heterogeneity of the format data stream file X, and the conversion heterogeneity of the format data stream file Y;

and clustering the data stream files of each format according to the difference distance to obtain different category clusters.

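Preferably, the membership degree is:

$$\mathrm{lrd}_k(M) = \left( \frac{\sum_{S \in N_k(M)} \mathrm{reach\text{-}dist}_k(M, S)}{|N_k(M)|} \right)^{-1}$$

wherein $|N_k(M)|$ is the total number of samples in the neighborhood set of sample M, and $\mathrm{reach\text{-}dist}_k(M, S)$ is the reachable distance between sample M and sample S in the neighborhood.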

The invention has the beneficial effects that:

According to the scheme, the converted format data stream files are analyzed to obtain the conversion state of the conversion process, namely the sum of the conversion errors and the conversion heterogeneity. The converted files are then analyzed according to the conversion state of each format data stream file, so that a high-quality training sample set is determined and used to train the neural network model. This yields an accurate and stable trained neural network model, which facilitates subsequent rapid and efficient evaluation of the conversion quality of format data stream files to be tested.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of a method of evaluating layout data stream file transformations based on a machine learning model according to the present invention.

Detailed Description

To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the embodiments, structures, features and effects thereof according to the present invention will be described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, the different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

Specifically, taking a PDF format data stream file as an example, the method for evaluating format data stream file conversion based on a machine learning model provided by the present invention is described below. Referring to FIG. 1, the method comprises the following steps:

step one, a training sample set is obtained.

First, regions of interest are extracted from the format data stream file before and after conversion respectively to obtain a plurality of region-of-interest pairs, wherein each region-of-interest pair comprises a pre-conversion region of interest and a post-conversion region of interest; the absolute value of the conversion error between the pre-conversion region of interest and the post-conversion region of interest is calculated for each region-of-interest pair, and the sum of the conversion error values of the format data stream file is thereby obtained.

The region of interest pair in this embodiment refers to a region to which the same information in the format data stream file belongs before and after conversion.

In this embodiment, the absolute value of the conversion error is obtained as follows: the aspect ratios of the pre-conversion region of interest (ROI 1) and the post-conversion region of interest (ROI 2) are calculated respectively, and the absolute value of their difference is taken as the conversion error. Specifically, the aspect ratio of each pre-conversion region of interest is computed as the ratio between its long side max(w, L) and its narrow side min(w, L); the same processing is applied to each post-conversion region of interest. The difference between the aspect ratios before and after conversion is then taken, and its absolute value is the conversion error value of the region-of-interest pair.

It should be noted that in this embodiment the number of region-of-interest pairs is 5, so there are also 5 conversion error values. An implementer may of course choose the number of region-of-interest pairs according to the actual content of the format data stream file and how strictly it is to be checked: the stricter the requirement, the larger the number, and vice versa. The sum Q of the conversion error values is thus obtained and characterizes the conversion error of the whole format data stream file.
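The computation of the per-pair conversion error and of the sum Q can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, each region of interest is assumed to be given as a (width, length) tuple, and the ratio is assumed to be narrow side over long side, since the text names only min() and max() without reproducing the formula.

```python
def aspect_ratio(w, l):
    """Ratio of the narrow side to the long side (assumed direction); lies in (0, 1]."""
    return min(w, l) / max(w, l)


def conversion_error(roi_before, roi_after):
    """Absolute aspect-ratio difference of one ROI pair; each ROI is (width, length)."""
    return abs(aspect_ratio(*roi_before) - aspect_ratio(*roi_after))


def total_conversion_error(roi_pairs):
    """Sum Q of the conversion error values over all ROI pairs of one file."""
    return sum(conversion_error(before, after) for before, after in roi_pairs)


# Two illustrative ROI pairs: the first is unchanged, the second changes shape.
pairs = [((200, 100), (200, 100)), ((100, 50), (100, 100))]
Q = total_conversion_error(pairs)
```

With five region-of-interest pairs, as in the embodiment, `roi_pairs` would simply contain five entries.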

As another embodiment, the conversion error values before and after conversion may also be characterized by calculating the similarity of the region of interest before conversion and the region of interest after conversion with the same information.

Secondly, respectively carrying out convex hull detection on the region of interest before conversion and the region of interest after conversion in each region of interest pair to obtain two corresponding convex hulls; performing Fourier transform on each convex hull to obtain frequency domain information, obtaining two corresponding frequency domain signals, and respectively using the two frequency domain signals as a pre-conversion form vector and a post-conversion form vector in the region of interest pair; and obtaining conversion error distribution characteristics according to the form vector before conversion and the form vector after conversion, obtaining conversion error distribution characteristic sequences of all interested region pairs, and obtaining conversion diversity degree based on the conversion error distribution characteristic sequences.

The convex hull detection in the above is specifically:

The pre-conversion or post-conversion region of interest is treated as a rectangular image; pixels of non-background color are marked as 1 and pixels of background color as 0, and the marked pixels are merged into a connected domain. The contour of the connected domain is obtained by tracing its outer boundary, and the convex hull of the region of interest is then obtained from this contour with a convex hull algorithm.

Here the connected domain is the whole formed by the non-background-color objects, such as characters and images, in each pre-conversion or post-conversion region of interest.

It should be noted that the convex hull obtained in this embodiment is used to determine the arrangement of the elements, such as the text of the region of interest before conversion or the text of the region of interest after conversion; the reason is that certain positioning errors, such as abnormal line feed and dislocation, occur in the format data stream file conversion process, so that the convex hull is used for representing the form of the region of interest before conversion or the region of interest after conversion, thereby comparing the arrangement characteristics before and after conversion, and effectively representing the characteristics of the errors. Meanwhile, once the convex hull is displaced, the form of the convex hull is changed, and in order to more effectively represent the detail features of the convex hull and compare the features of the convex hull before and after conversion, a Fourier descriptor is used for representing the convex hull.
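As a rough sketch of the convex hull step, the following code marks foreground pixels in a small binary image and computes their convex hull with Andrew's monotone chain algorithm. It is a simplification under stated assumptions: the text does not name a specific convex hull algorithm, the contour tracing step is skipped because the hull of all foreground pixels equals the hull of their outer contour, and the function names are hypothetical.

```python
def foreground_points(image, background=0):
    """Return (x, y) for every pixel whose value differs from the background."""
    return [(x, y)
            for y, row in enumerate(image)
            for x, value in enumerate(row)
            if value != background]


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in boundary order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a non-left turn (collinear points included).
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


# A 5x5 ROI with a ring of foreground pixels (1) on a background of 0.
roi = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
hull = convex_hull(foreground_points(roi))
```

A shifted or re-wrapped line of text changes the set of foreground pixels and therefore the hull vertices, which is the property the embodiment relies on.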

The process of obtaining the form vector of the fourier descriptor in the above is:

The coordinates of the convex hull are transformed into a frequency domain signal based on the Fourier descriptor, and the frequency and energy of that signal are extracted to obtain the form vector of the Fourier descriptor, whose n-th element is the n-th order frequency domain signal and whose elements are sorted by frequency. The form vector of the Fourier descriptor comprises a pre-conversion form vector and a post-conversion form vector.

It should be noted that the convex hull has many sharp corners, so its high-frequency components are of little use to this scheme. The contour feature of the current ROI is therefore reduced to its low-frequency components, which represent it with fewer vector elements and reduce the subsequent amount of calculation. Specifically, low-frequency components are selected from the frequency domain signal of the convex hull corresponding to each contour and used as the low-frequency shape descriptor of that contour. Because the order of a Fourier descriptor is not fixed, and because the morphological features of the convex hull lie mainly in the secondary low frequencies, this embodiment selects the three components of order 2, 3 and 4 as the low-frequency form vector, so that the current contour is represented by a vector of fewer dimensions. The low-frequency form vector of the current contour comprises the pre-conversion low-frequency form vector and the post-conversion low-frequency form vector.
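One possible way to obtain a low-frequency form vector from a convex hull is sketched below. Several choices here are assumptions that the text does not specify: the hull is resampled to a fixed number of boundary points so that hulls with different vertex counts are comparable, each point is treated as a complex number x + iy, and the magnitude of the discrete Fourier coefficient is used as the "energy" of each order. The function names are hypothetical, and the hull vertices are assumed to be distinct.

```python
import cmath
import math


def resample_closed(points, n):
    """Place n points at equal arc-length steps along the closed polygon."""
    edges = list(zip(points, points[1:] + points[:1]))
    lengths = [math.dist(a, b) for a, b in edges]
    total = sum(lengths)
    samples, i, walked = [], 0, 0.0
    for k in range(n):
        target = total * k / n
        # Advance to the edge that contains the target arc length.
        while walked + lengths[i] < target:
            walked += lengths[i]
            i += 1
        t = (target - walked) / lengths[i]
        (ax, ay), (bx, by) = edges[i]
        samples.append(complex(ax + t * (bx - ax), ay + t * (by - ay)))
    return samples


def low_frequency_vector(hull, orders=(2, 3, 4), n=64):
    """Magnitudes of the Fourier coefficients of the given orders (2, 3, 4 by default)."""
    z = resample_closed(hull, n)
    vector = []
    for u in orders:
        coeff = sum(z[m] * cmath.exp(-2j * math.pi * u * m / n) for m in range(n)) / n
        vector.append(abs(coeff))
    return vector


# A triangular hull and a translated copy of it.
hull = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
shifted = [(x + 10.0, y - 5.0) for x, y in hull]
v = low_frequency_vector(hull)
v_shifted = low_frequency_vector(shifted)
```

Because a pure translation only affects the order-0 coefficient, `v` and `v_shifted` agree up to rounding, so under these assumptions the vector reflects the shape of the hull rather than its position on the page.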

The obtaining process of the conversion error distribution characteristics comprises the following steps:

The cosine similarity between the pre-conversion and post-conversion form vectors of each contour is calculated to obtain the degree of difference T of each region-of-interest pair before and after conversion; the two vectors compared are the pre-conversion low-frequency form vector and the post-conversion low-frequency form vector.

In this embodiment, the degrees of difference T of all region-of-interest pairs are obtained and arranged from large to small to give the conversion error distribution characteristic sequence, in which each element is the degree of difference of one region-of-interest pair.
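The construction of the conversion error distribution characteristic sequence can be illustrated as follows. The sketch assumes that T is the cosine similarity itself, since the exact formula is not reproduced in the text, and that each region-of-interest pair is given as a pair of low-frequency form vectors; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def error_distribution_sequence(vector_pairs):
    """One value T per ROI pair, arranged from large to small."""
    return sorted((cosine_similarity(f1, f2) for f1, f2 in vector_pairs), reverse=True)


# Three illustrative (pre-conversion, post-conversion) form vector pairs.
vector_pairs = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),
]
sequence = error_distribution_sequence(vector_pairs)
```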

It should be noted that the conversion error distribution characteristic sequence represents the distribution, from large errors to small errors, of the errors that the human eye can perceive, and thus reflects how the conversion result appears to a human reader.

The conversion heterogeneity P can now be calculated from the conversion error distribution characteristic sequence of the format data stream file:

[formula not reproduced in the source text]

wherein the terms of the formula are, in order: the conversion error distribution characteristic sequence, the most similar distribution characteristic sequence, the K-th similar distribution characteristic sequence, and the loss function.

Therefore, a larger P means that the current format data stream file is a case that is difficult to match with similar files; it is regarded as a rare sample, and such cases call for hard example mining. However, it must first be judged whether the large P is caused by an abnormal file, since otherwise the overall machine learning performance would be affected.

The most similar distribution characteristic sequence and the K-th similar distribution characteristic sequence above are obtained as follows:

In this embodiment, the similarity between the conversion error distribution characteristic sequence of the conversion record of the current format data stream file and the conversion error distribution characteristic sequences of the conversion records of the other format data stream files is calculated, and the similarities are sorted from large to small. The conversion error distribution characteristic sequence of the K-th most similar format data stream file is selected as the K-th similar distribution characteristic sequence, and the conversion error distribution characteristic sequence of the format data stream file with the maximum similarity is taken as the most similar distribution characteristic sequence.

The distribution characteristic sequence of the current file conversion record is a 5-dimensional vector whose components represent the proportion of each margin. The distribution characteristic sequences of format data stream files with similar distributions are obtained by calculating the cosine distance between this vector and the vectors of all format data stream file conversion records of the same day.

It should be noted that the K-th similar distribution characteristic sequence is selected in this embodiment to guard against the case where the distribution characteristic is not typical; an implementer can therefore adjust K to determine whether the conversion condition of the current format data stream file conversion record is typical.
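The selection of the most similar and the K-th similar sequences can be sketched as below. The sketch ranks the other files' 5-dimensional sequences by cosine similarity to the current one; the function names and the sample values are hypothetical, and the computation of P itself is left out because its formula is not reproduced in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_and_kth_similar(current, others, k):
    """Return the most similar sequence and the k-th most similar one (k counts from 1)."""
    ranked = sorted(others, key=lambda seq: cosine_similarity(current, seq), reverse=True)
    return ranked[0], ranked[k - 1]


# Sequences of the current file and of three other conversion records.
current = [1.0, 0.0, 0.0, 0.0, 0.0]
others = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
]
most_similar, kth_similar = most_and_kth_similar(current, others, k=2)
```

Raising `k` moves the second reference sequence further from the current file, which is how an implementer could probe whether the current conversion record is typical.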

Then, the different format data stream files are classified based on the conversion heterogeneity and the sum of the conversion error values of each format data stream file to obtain different category clusters; statistical analysis is performed on each category cluster to obtain a category cluster descriptor; and the membership degree of the category cluster descriptor is calculated. When the membership degree is greater than or equal to a threshold value, the conversion of the format data stream file is considered normal and the file is used as a training sample, until a training sample set is obtained.

In this embodiment, the sum of the conversion error values before and after conversion and the conversion heterogeneity are obtained for each of the different format data stream files, and the files are classified to obtain different category clusters. Specifically, the different format data stream files are clustered with the kmedoids algorithm, where the search radius eps defaults to 0.15 and the minimum number of samples in a cluster, minpts, is set to 4. The cluster difference distance is:

[formula not reproduced in the source text]

wherein one term of the formula imposes the constraint that the similarity between the current sample and another sample is not large.

In this embodiment, kmedoids clusters over two independent feature dimensions, namely the sum Q of the conversion error values and the conversion heterogeneity P. The number of clusters is set to 10; an implementer may increase this value, taking care to avoid over-segmentation during clustering.
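A minimal k-medoids sketch over (Q, P) pairs is given below for orientation. It is not the embodiment's procedure: the difference distance formula is not reproduced in the text, so plain Euclidean distance is assumed; the initialisation simply takes the first k samples; the eps and minpts settings mentioned above are not modeled; and the samples are assumed to be distinct. The function name is hypothetical.

```python
import math


def k_medoids(points, k, dist=math.dist, max_iter=100):
    """Alternate between assigning samples to the nearest medoid and re-picking medoids."""
    medoids = list(range(k))  # naive initialisation: indices of the first k samples
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sample joins the cluster of its nearest medoid.
        labels = [min(range(k), key=lambda c: dist(p, points[medoids[c]])) for p in points]
        # Update step: the new medoid minimises the summed distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            best = min(members, key=lambda i: sum(dist(points[i], points[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


# Six illustrative files described by (Q, P); two obvious groups.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, medoids = k_medoids(samples, k=2)
```

The embodiment uses 10 clusters; `k=2` is chosen here only to keep the example small. A custom `dist` callable could be passed in to stand in for the patent's difference distance.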

It should be noted that the file conversion configuration of each format data stream file is the two-dimensional vector composed of the sum of the conversion error values and the conversion heterogeneity. The factors that make a file conversion configuration unique are complex and depend on the file conversion configurations in the file conversion records, so the same type of abnormality almost never recurs within the records; the sum of the conversion error values and the conversion heterogeneity are therefore determined over the conversion processes of multiple files.

Accordingly, in this embodiment the different format data stream files are classified by the degree of uniqueness of their conversion, which is determined from the difference distance between the conversion states of any two format data stream files; the degree of uniqueness is in fact the calculated difference distance.

Under this constraint, the conversion state records of the format data stream files are divided into different category clusters, each corresponding to a different cluster type, and the cluster type to which the conversion record of another, similar format data stream file belongs can then be obtained by relative statistics.

The category clusters of the different format data stream files are thus obtained, and the cluster types are numbered using the cluster labels produced by the kmedoids algorithm.

In this embodiment, the conversion configuration of each current format data stream file is determined from the existing information, so as to determine the degree of uniqueness of each file conversion record. When the file conversion configuration of a file is unique, the histogram features change noticeably, so the context conversion state of the current conversion process is determined through P and Q.

Each current format data stream file is thereby assigned to one of several categories. Isolated points that appear during clustering are rare typical files and can be used as reactive samples, so each isolated point is clustered on its own and analyzed as a new, unique conversion state.

Based on the obtained category clusters, obtaining a type descriptor Z of each category cluster:

Based on the classification result, the distribution of the cluster types to which the conversions belong is counted to obtain the descriptor Z, specifically:

First, within each category cluster, the histogram of the clusters to which the samples in the set of one conversion process belong is calculated.

Unlike conventional histogram statistics, when a sample increases the count of its own cluster by 1, the count of the closest other cluster is also increased by 1. This smooths the sample distribution and prevents samples that lie in a critical state between clusters from making the distribution more extreme as a result of clustering.

In this embodiment, a relative distance threshold is set to prevent a sample from being counted in a cluster erroneously when it is too far from the other samples.

In this embodiment, the closest cluster is determined by finding the K nearest neighbor samples and taking the cluster of the nearest sample that does not belong to the sample's own cluster. Whether the distance to that sample exceeds the relative distance threshold is then checked: if so, nothing is counted; otherwise the count of that category cluster is increased by one. Each conversion state thus has one type descriptor Z.
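The smoothed histogram behind the type descriptor Z can be sketched as follows. The sketch makes simplifying assumptions: the nearest sample from another cluster is searched among all samples rather than only among the K nearest neighbors, the relative distance threshold is treated as a plain Euclidean distance limit, and the function and parameter names are hypothetical.

```python
import math


def type_descriptor(samples, labels, n_clusters, distance_threshold):
    """Histogram of cluster counts in which each sample also votes for its closest other cluster."""
    hist = [0] * n_clusters
    for point, own in zip(samples, labels):
        hist[own] += 1  # conventional count for the sample's own cluster
        others = [j for j, lab in enumerate(labels) if lab != own]
        if not others:
            continue
        nearest = min(others, key=lambda j: math.dist(point, samples[j]))
        # Extra count for the closest other cluster, unless it is too far away.
        if math.dist(point, samples[nearest]) <= distance_threshold:
            hist[labels[nearest]] += 1
    return hist


# Three samples in two clusters; the third sample is far from cluster 0.
samples = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
labels = [0, 1, 1]
Z = type_descriptor(samples, labels, n_clusters=2, distance_threshold=2.0)
```

In this example the first two samples sit close to the boundary between the clusters and therefore vote for both, while the distant third sample votes only for its own cluster.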

Calculating the membership degree of the type descriptor Z:

For the K nearest neighbor samples, each with its k-th reachable distance, the k-th local reachable density of each format data stream file is calculated:

$$\mathrm{lrd}_k(M) = \left( \frac{\sum_{S \in N_k(M)} \mathrm{reach\text{-}dist}_k(M, S)}{|N_k(M)|} \right)^{-1}$$

wherein $\mathrm{lrd}_k(M)$ is the k-th local reachable density of sample M, S is a nearest neighbor sample in the k-th distance neighborhood $N_k(M)$ of sample M, and the density is the inverse of the average k-th reachable distance between sample M and all samples in that neighborhood.

The value $\mathrm{lrd}_k(M)$ characterizes the density around sample M. The more densely sample M is packed with its surrounding samples, the more likely each reachable distance is the smaller k-th distance, and the larger the lrd value; the more sparsely sample M is placed relative to its surrounding samples, the more likely each reachable distance is the actual distance between the two samples, and the smaller the lrd value. The membership degree lrd of each type descriptor Z therefore serves as a basis for judging whether a conversion is typical: when lrd is too low, the file content of the whole conversion process is considered abnormal and the sample is not added to the training set; otherwise the sample is added, and the training sample set is thus obtained.

In this embodiment, a threshold is set for the membership degree. When the membership degree is smaller than the set threshold, the sample is considered too abnormal and is excluded from training; otherwise it is included in training.
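The local reachable density used as the membership degree can be sketched as below, following the usual definition from the local outlier factor method: the reachable distance from M to a neighbor S is taken as the larger of the k-th distance of S and the actual distance between M and S. The function names are hypothetical, Euclidean distance is assumed, and the samples are assumed to be distinct so that the sum of reachable distances is non-zero.

```python
import math


def k_distance_and_neighbors(points, i, k):
    """k-th nearest distance of sample i and the indices of all samples within it."""
    dists = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    kdist = dists[k - 1][0]
    return kdist, [j for d, j in dists if d <= kdist]


def local_reachability_density(points, i, k):
    """Inverse of the mean reachable distance from sample i to its k-distance neighbors."""
    _, neighbors = k_distance_and_neighbors(points, i, k)
    reach = [max(k_distance_and_neighbors(points, s, k)[0], math.dist(points[i], points[s]))
             for s in neighbors]
    return len(neighbors) / sum(reach)


# Three samples close together and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
lrd_dense = local_reachability_density(points, 1, k=2)
lrd_sparse = local_reachability_density(points, 3, k=2)
```

A sample would then be kept for training only if its value is at least the membership threshold set by the implementer.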

And step two, training the constructed neural network model by using the training sample set to obtain the trained neural network model.

In this embodiment, the conversion heterogeneity and the sum of the conversion error values in the training sample set are used as the input of the neural network model, and the membership degree is used as its output, so as to train the constructed neural network model. The loss function of the network model is:

[formula not reproduced in the source text]

During training, the membership degrees of all samples are normalized based on the set threshold, and the normalized value is used as the weight of the corresponding sample, giving the following improved loss function:

[formula not reproduced in the source text]

wherein one term of the formula is the mean square error function.
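A sample-weighted mean square error of the kind described can be sketched as follows. Since the formulas are not reproduced in the text, two assumptions are made: the membership degrees are normalized by dividing by their maximum, and the improved loss is the mean of the per-sample squared errors multiplied by these weights. The function names are hypothetical.

```python
def normalized_weights(memberships):
    """Scale membership degrees into (0, 1] by dividing by the largest one (assumed scheme)."""
    top = max(memberships)
    return [m / top for m in memberships]


def weighted_mse(predictions, targets, weights):
    """Mean of the squared errors, each multiplied by the weight of its sample."""
    total = sum(w * (p - t) ** 2 for p, t, w in zip(predictions, targets, weights))
    return total / len(targets)


# Two samples: the second has the higher membership degree and therefore the larger weight.
weights = normalized_weights([0.5, 1.0])
loss = weighted_mse([1.0, 2.0], [0.0, 0.0], weights)
```

Under this weighting, errors on samples with a low membership degree contribute less to the loss than errors on typical samples.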

Therefore, for other format data stream files encountered later, the difficulty of a case that arises from conversion errors and perceived defects can be analyzed effectively, and if an implementer adds such a format data stream file to the data set, the method automatically performs hard example mining.

The neural network model in this embodiment adopts an FCN network model.

In this embodiment, the abstract file conversion configuration of each format data stream file sample is obtained, and the probability of the conversion state likely to be experienced under that configuration is learned by the FCN from the conversion records of the samples; high-quality training samples are thereby obtained for training the neural network model, providing a basis for the subsequent evaluation of the conversion quality of format data stream files.

And step three, evaluating the conversion quality of the format data stream file to be tested by using the trained neural network model.

In this embodiment, the conversion heterogeneity and the sum of the conversion error values of the format data stream file to be tested are obtained using the same calculation method as in step one.

The obtained conversion heterogeneity and sum of the conversion error values are input into the trained neural network model, which outputs the membership degree; whether the conversion of the format data stream file is normal is judged according to the membership degree.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims

1. A method for evaluating format data stream file conversion based on a machine learning model is characterized by comprising the following steps:

step one, acquiring a training sample set;

extracting regions of interest from the format data stream file before and after conversion, respectively, to obtain a plurality of region-of-interest pairs, wherein each region-of-interest pair comprises a region of interest before conversion and a region of interest after conversion; calculating the absolute value of the conversion error in each region-of-interest pair, so as to obtain the sum of the conversion error values of the format data stream file;

performing convex hull detection on the region of interest before conversion and the region of interest after conversion in each region-of-interest pair, respectively, to obtain two corresponding convex hulls; performing a Fourier transform on each convex hull to obtain two corresponding frequency-domain signals, which are used as the pre-conversion morphological vector and the post-conversion morphological vector of the region-of-interest pair, respectively; obtaining a conversion error distribution characteristic from the pre-conversion and post-conversion morphological vectors, thereby obtaining the sequence of conversion error distribution characteristics of all region-of-interest pairs, and obtaining the conversion heterogeneity based on that sequence;

2. The method for evaluating format data stream file conversion based on a machine learning model according to claim 1, wherein the input of the neural network model is the sum of the conversion heterogeneity and the conversion error value of each format data stream file, and the output is the membership degree.

3. The method for evaluating format data stream file conversion based on a machine learning model according to claim 1, wherein the conversion error distribution characteristic is the cosine similarity between the pre-conversion morphological vector and the post-conversion morphological vector in each region-of-interest pair.
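For illustration only, and not as part of the claims, the comparison of morphological vectors recited in claims 1 and 3 can be sketched in Python as follows. The sketch assumes that each region of interest is given as a list of 2-D points, that the convex hull is found with Andrew's monotone chain algorithm, and that the frequency-domain signal is the magnitude of the first `n_coeffs` discrete Fourier coefficients of the hull vertices read as complex numbers. The claims do not specify the hull algorithm, the sampling of the hull, or the number of coefficients, so all of these choices and all function names are assumptions.

```python
import cmath
import math


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def morphological_vector(hull, n_coeffs=8):
    """DFT magnitudes of the hull vertices treated as x + iy (assumed encoding).

    A fixed n_coeffs keeps both vectors of a pair the same length; if the hull
    has fewer vertices than n_coeffs, the DFT simply repeats periodically.
    """
    z = [complex(x, y) for x, y in hull]
    n = len(z)
    if n == 0:
        return [0.0] * n_coeffs
    mags = []
    for k in range(n_coeffs):
        coeff = sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        mags.append(abs(coeff) / n)
    return mags


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def distribution_feature(pre_points, post_points):
    """Cosine similarity of the pre- and post-conversion morphological vectors."""
    pre_vec = morphological_vector(convex_hull(pre_points))
    post_vec = morphological_vector(convex_hull(post_points))
    return cosine_similarity(pre_vec, post_vec)


if __name__ == "__main__":
    square = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    scaled = [(2 * x, 2 * y) for x, y in square]   # same shape, larger
    rect = [(0, 0), (4, 0), (4, 1), (0, 1)]        # different shape
    print(distribution_feature(square, scaled))
    print(distribution_feature(square, rect))
```

Under these assumptions a region that is only rescaled about the origin keeps a similarity of 1, while a region whose outline changes shape scores lower; collecting one such value per region-of-interest pair gives the conversion error distribution characteristic sequence.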

4. The evaluation method for format data stream file conversion based on a machine learning model according to claim 1, wherein the conversion heterogeneity is obtained as follows:

wherein the terms denote, in order, the conversion error distribution characteristic sequence, the most similar distribution characteristic sequence, the k-th similar distribution characteristic sequence, and a loss function.

5. The method for evaluating a conversion of a layout data stream file based on a machine learning model according to claim 1, wherein the conversion error value is an absolute value of a difference between an aspect ratio of a pre-conversion region of interest and an aspect ratio of a post-conversion region of interest in any one of the pairs of regions of interest, wherein the aspect ratio is calculated

6. The evaluation method for format data stream file conversion based on a machine learning model according to claim 1, wherein the specific process of classifying different format data stream files to obtain different category clusters is as follows:

wherein the terms denote, in order, the sum of the conversion error values of format data stream file X, the sum of the conversion error values of format data stream file Y, the conversion heterogeneity of format data stream file X, and the conversion heterogeneity of format data stream file Y;

7. The method for evaluating a layout data stream file transformation based on a machine learning model according to claim 1, wherein the degree of membership is:

wherein

is the total number of neighborhood cluster sets for sample M,