CN115328871A - Evaluation method for format data stream file conversion based on machine learning model - Google Patents


Info

Publication number
CN115328871A
Authority
CN
China
Prior art keywords
conversion
data stream
stream file
format data
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211244772.0A
Other languages
Chinese (zh)
Other versions
CN115328871B (en)
Inventor
胡夕国
胡玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Zhonghong Network Technology Co ltd
Original Assignee
Nantong Zhonghong Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Zhonghong Network Technology Co ltd filed Critical Nantong Zhonghong Network Technology Co ltd
Priority to CN202211244772.0A priority Critical patent/CN115328871B/en
Publication of CN115328871A publication Critical patent/CN115328871A/en
Application granted granted Critical
Publication of CN115328871B publication Critical patent/CN115328871B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G06F16/1794 Details of file format conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of data identification, and in particular to an evaluation method for format data stream file conversion based on a machine learning model, comprising the following steps: step one, acquiring a training sample set; step two, training the constructed neural network model with the training sample set to obtain a trained neural network model; and step three, evaluating the conversion quality of the format data stream file to be tested with the trained neural network model. By acquiring the training sample set and training the neural network model on it, the scheme enables fast and efficient evaluation of the conversion quality of format data stream files.

Description

Evaluation method for format data stream file conversion based on machine learning model
Technical Field
The invention relates to the field of data identification, in particular to an evaluation method for format data stream file conversion based on a machine learning model.
Background
With the development of global digitalization and informatization, the mass production of electronic documents has profoundly changed people's lives, and electronic documents are gradually replacing paper documents as the main object of reading and processing in many fields and application scenarios.
Most electronic documents are format data stream files stored in formats such as PDF and PNG; however, format data stream files are not well suited to reading on terminals or media with different screen or window sizes. For example, to read an A4-layout format data stream file on a small-screen terminal, the page must be shrunk to the screen size to show a complete line or column, because text lines and columns cannot be reflowed. When a large-layout book is shrunk to the screen size, the characters tend to become too small to read clearly; otherwise, the reader has to keep scrolling the page according to the reading position in order to read each line or column in full.
To support reading format data stream files on terminals or media of different sizes, the prior art performs reflowable conversion on format data stream files to obtain reflowable files in various formats, such as TXT, HTML and WORD text files. However, reflowable files in different text formats differ in typesetting and layout, and therefore in reading effect. How to effectively evaluate reflowable conversion, so as to provide users with a better reading experience, is thus a problem to be solved.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide an evaluation method for format data stream file conversion based on a machine learning model, wherein the adopted technical scheme is as follows:
the invention provides an evaluation method for format data stream file conversion based on a machine learning model, which comprises the following steps:
step one, acquiring a training sample set;
step two, training the constructed neural network model with the training sample set to obtain a trained neural network model;
step three, evaluating the conversion quality of the format data stream file to be tested with the trained neural network model;
the acquisition process of the training sample set comprises the following steps:
extracting regions of interest from the format data stream file before and after conversion, respectively, to obtain a plurality of region-of-interest pairs, each pair comprising a pre-conversion region of interest and a post-conversion region of interest; calculating the absolute value of the conversion error for each region-of-interest pair, and thereby obtaining the sum of the conversion error values of the format data stream file;
performing convex hull detection on the pre-conversion region of interest and the post-conversion region of interest of each pair to obtain two corresponding convex hulls; performing a Fourier transform on each convex hull to obtain two corresponding frequency domain signals, which serve as the pre-conversion form vector and the post-conversion form vector of the pair; obtaining a conversion error distribution feature from the pre-conversion and post-conversion form vectors, collecting the features of all region-of-interest pairs into a conversion error distribution feature sequence, and obtaining the conversion heterogeneity from that sequence;
classifying the different format data stream files based on the conversion heterogeneity and the sum of conversion error values of each file to obtain different category clusters; performing statistical analysis on each category cluster to obtain a type descriptor; and calculating the membership degree of the type descriptor, wherein when the membership degree is greater than or equal to a threshold value, the conversion of the format data stream file is considered normal and the file is used as a training sample, until the training sample set is obtained.
Preferably, the inputs of the neural network model are the conversion heterogeneity and the sum of conversion error values of each format data stream file, and the output is the membership degree.
Preferably, the conversion error distribution feature is obtained by calculating the cosine similarity between the pre-conversion form vector and the post-conversion form vector of each region-of-interest pair.
Preferably, the conversion heterogeneity is obtained as follows:
the similarity between the conversion error distribution feature sequence of the conversion record of the current format data stream file and the conversion error distribution feature sequences of the conversion records of the other format data stream files is calculated, and the similarities are sorted from large to small; the conversion error distribution feature sequence of the K-th most similar format data stream file is selected as the K-th similar distribution feature sequence, and the sequence of the file with the maximum similarity is taken as the most similar distribution feature sequence;
the conversion heterogeneity $P$ is calculated by applying a loss function to the conversion error distribution feature sequence $A$ of the conversion record of the current file, the most similar distribution feature sequence $A_1$ and the K-th similar distribution feature sequence $A_K$ [formula image not reproduced].
Preferably, the conversion error value is the absolute value of the difference between the aspect ratio of the pre-conversion region of interest and the aspect ratio of the post-conversion region of interest in any region-of-interest pair, where the aspect ratio is the proportion between $\max(w, L)$ and $\min(w, L)$, $w$ is the width, $L$ is the length, $\min(\cdot)$ takes the minimum value and $\max(\cdot)$ takes the maximum value.
Preferably, the different format data stream files are classified into category clusters as follows:
the difference distance $D(X, Y)$ between any two format data stream files $X$ and $Y$ is calculated from the conversion heterogeneity and the sum of conversion error values of each file [formula image not reproduced], where $Q_X$ and $Q_Y$ are the sums of the conversion error values of files $X$ and $Y$, and $P_X$ and $P_Y$ are the conversion heterogeneities of files $X$ and $Y$;
the format data stream files are clustered according to the difference distance to obtain the different category clusters.
Preferably, the membership degree is:
$$\mathrm{lrd}_k(M) = \left( \frac{1}{|N_k(M)|} \sum_{S \in N_k(M)} \mathrm{rd}_k(M, S) \right)^{-1}$$
where $|N_k(M)|$ is the total number of samples in the neighborhood $N_k(M)$ of sample $M$, and $\mathrm{rd}_k(M, S)$ is the reachable distance between sample $M$ and a sample $S$ in that neighborhood.
The invention has the following beneficial effects:
the scheme analyzes the converted format data stream files to obtain the conversion state of each conversion process, namely the sum of conversion error values and the conversion heterogeneity, and analyzes the converted files according to the conversion state of each format data stream file to determine a high-quality training sample set. Training the neural network model on this set yields an accurate and stable trained model, which in turn allows the conversion quality of format data stream files to be tested to be evaluated quickly and efficiently.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a method of evaluating layout data stream file transformations based on a machine learning model according to the present invention.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the embodiments, structures, features and effects thereof according to the present invention will be described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, the different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Specifically, taking a PDF format data stream file as an example, the evaluation method for format data stream file conversion based on a machine learning model provided by the present invention is described below with reference to FIG. 1, and comprises the following steps:
step one, a training sample set is obtained.
First, regions of interest are extracted from the format data stream file before and after conversion to obtain a plurality of region-of-interest pairs, each comprising a pre-conversion region of interest and a post-conversion region of interest; for each pair, the absolute value of the conversion error between the pre-conversion and post-conversion regions of interest is calculated, and the sum of the conversion error values of the format data stream file is then obtained.
The region of interest pair in this embodiment refers to a region to which the same information in the format data stream file belongs before and after conversion.
In this embodiment, the absolute value of the conversion error is obtained as follows: the aspect ratios of the pre-conversion region of interest (ROI 1) and the post-conversion region of interest (ROI 2) are calculated, and the absolute value of their difference is taken. Specifically, the aspect ratio of each pre-conversion region of interest is computed as the proportion between the long side $\max(w, L)$ and the narrow side $\min(w, L)$, where $w$ is the width and $L$ is the length of the region [formula image not reproduced]. The aspect ratio of each post-conversion region of interest is computed in the same way. The difference between the aspect ratios before and after conversion is taken, and its absolute value is the conversion error value of the region-of-interest pair.
It should be noted that, in this embodiment, the number of region-of-interest pairs is 5, so there are also 5 conversion error values. An implementer may choose the number of pairs according to the actual content of the format data stream file and how strictly it is to be evaluated: the stricter the requirement, the larger the number, and the looser the requirement, the smaller the number. The sum $Q$ of the conversion error values is thus obtained and characterizes the conversion error of the whole format data stream file.
As another embodiment, the conversion error values before and after conversion may also be characterized by calculating the similarity of the region of interest before conversion and the region of interest after conversion with the same information.
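The aspect-ratio error described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the ratio is assumed to be the long side divided by the narrow side, since the original formula image is not available and only its ingredients, max(w, L) and min(w, L), are known.

```python
def aspect_ratio(w: float, l: float) -> float:
    """Aspect ratio of a region from its width w and length l.

    Assumption: long side divided by narrow side; the source only states
    that the ratio is built from max(w, l) and min(w, l).
    """
    if w <= 0 or l <= 0:
        raise ValueError("width and length must be positive")
    return max(w, l) / min(w, l)


def conversion_error(roi_before, roi_after) -> float:
    """Absolute aspect-ratio difference of one ROI pair; each ROI is (w, l)."""
    return abs(aspect_ratio(*roi_before) - aspect_ratio(*roi_after))


def conversion_error_sum(roi_pairs) -> float:
    """Sum Q of the conversion error values over all ROI pairs of one file."""
    return sum(conversion_error(before, after) for before, after in roi_pairs)


# Two illustrative ROI pairs: the first keeps its shape, the second does not.
pairs = [((200, 100), (200, 100)), ((300, 100), (150, 100))]
Q = conversion_error_sum(pairs)
```

With this choice of ratio, an ROI that is merely rotated by 90 degrees yields an error of zero, which may or may not be desirable for a given evaluation.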
Second, convex hull detection is performed on the pre-conversion region of interest and the post-conversion region of interest of each pair to obtain two corresponding convex hulls; a Fourier transform is performed on each convex hull to obtain two corresponding frequency domain signals, which serve as the pre-conversion form vector and the post-conversion form vector of the pair; a conversion error distribution feature is obtained from the pre-conversion and post-conversion form vectors, the features of all region-of-interest pairs are collected into a conversion error distribution feature sequence, and the conversion heterogeneity is obtained from that sequence.
The convex hull detection is performed as follows:
the pre-conversion or post-conversion region of interest is treated as a rectangular image, pixels of non-background color are marked as 1 and pixels of background color as 0, and the marked pixels are merged into one connected domain; the contour of the connected domain is obtained by tracing its outer boundary, and the convex hull of the region of interest is then obtained from this contour with a convex hull algorithm.
The connected domain here is the whole formed by the non-background objects, such as characters and images, in each pre-conversion or post-conversion region of interest.
It should be noted that the convex hull is used to capture the arrangement of elements, such as text, in the pre-conversion or post-conversion region of interest. The conversion of a format data stream file can introduce positioning errors such as abnormal line breaks and misalignment, so the convex hull is used to represent the form of each region of interest; comparing the arrangement before and after conversion then effectively characterizes these errors. Moreover, once such a displacement occurs, the form of the convex hull changes; to represent the detailed features of the convex hull more effectively and to compare them before and after conversion, the convex hull is represented by a Fourier descriptor.
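The binarization and convex hull step can be illustrated with a small, dependency-free sketch. It rests on two assumptions that are not in the source: the hull is computed with Andrew's monotone chain algorithm (the source only says "a convex hull algorithm"), and it is taken over all foreground pixels rather than over a traced outer contour, which gives the same hull once the pixels are merged into one connected domain. All function names are hypothetical.

```python
def foreground_points(image, background=0):
    """Return the (x, y) coordinates of every non-background pixel.

    `image` is a list of rows; pixels equal to `background` count as 0 and
    all other pixels as 1, mirroring the marking described above.
    """
    return [(x, y)
            for y, row in enumerate(image)
            for x, value in enumerate(row)
            if value != background]


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in traversal order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a non-left turn (collinear included).
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


# A 5x5 ROI whose foreground is a hollow 3x3 square.
roi = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
hull = convex_hull(foreground_points(roi))
```

For real page images an implementer would more likely rely on an image-processing library for contour tracing and hull computation; the pure-Python version is only meant to make the step concrete.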
The form vector of the Fourier descriptor is obtained as follows:
the coordinates of the convex hull are transformed into a frequency domain signal by means of the Fourier descriptor, and the frequency and energy of this signal are extracted to obtain the form vector of the Fourier descriptor, $F = (f_1, f_2, \ldots, f_n)$, where $f_n$ is the $n$-th order frequency domain signal and the elements of the vector are sorted by frequency. The form vectors obtained in this way comprise the pre-conversion form vector and the post-conversion form vector.
It should be noted that the convex hull has many sharp inflection points, whose high-frequency components are of no use to this scheme. The contour features of the current ROI are therefore represented by their low-frequency components, which need fewer vector elements and reduce the subsequent amount of calculation. Specifically, low-frequency components are selected from the frequency domain signal of the convex hull of each contour to serve as the low-frequency shape descriptor of the contour. Because the number of orders of the Fourier descriptor is not fixed, and because the morphological features of the convex hull lie mainly in the lower-order frequencies, this embodiment selects the 3 components of orders 2, 3 and 4, so that the contour is represented by a vector of fewer dimensions, the low-frequency form vector $V = (f_2, f_3, f_4)$. The low-frequency form vectors of a contour comprise the pre-conversion low-frequency form vector $V_1$ and the post-conversion low-frequency form vector $V_2$.
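A possible way to turn a convex hull into a low-frequency form vector is sketched below. Several choices here are assumptions rather than statements of the source: the hull boundary is resampled at 64 equally spaced points, each point is encoded as the complex number x + iy, a plain discrete Fourier transform is applied, and the magnitudes of the coefficients of orders 2, 3 and 4 are used as the vector elements. The function names are hypothetical.

```python
import cmath
import math


def resample_closed_polygon(vertices, n_samples=64):
    """Sample n_samples points at equal arc-length steps along a closed polygon."""
    edges = list(zip(vertices, vertices[1:] + vertices[:1]))
    lengths = [math.dist(a, b) for a, b in edges]
    perimeter = sum(lengths)
    samples = []
    edge_idx, edge_start = 0, 0.0
    for i in range(n_samples):
        target = perimeter * i / n_samples
        # Advance to the edge that contains the target arc length.
        while edge_idx < len(edges) - 1 and edge_start + lengths[edge_idx] < target:
            edge_start += lengths[edge_idx]
            edge_idx += 1
        (x0, y0), (x1, y1) = edges[edge_idx]
        t = (target - edge_start) / lengths[edge_idx] if lengths[edge_idx] else 0.0
        samples.append(complex(x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples


def fourier_descriptor(points):
    """Discrete Fourier transform of boundary points given as complex numbers."""
    n = len(points)
    return [sum(p * cmath.exp(-2j * math.pi * k * m / n)
                for m, p in enumerate(points)) / n
            for k in range(n)]


def low_frequency_vector(hull, orders=(2, 3, 4)):
    """Magnitudes of the selected low-order coefficients of the hull boundary."""
    coeffs = fourier_descriptor(resample_closed_polygon(hull))
    return [abs(coeffs[k]) for k in orders]


# Hull of an illustrative rectangular ROI, vertices in traversal order.
rect = [(0, 0), (4, 0), (4, 2), (0, 2)]
vector = low_frequency_vector(rect)
```

Because the order-0 coefficient is skipped, the resulting vector does not change when the hull is translated, so it describes the shape of the region rather than its position on the page.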
The conversion error distribution feature is obtained as follows:
the cosine similarity between the pre-conversion and post-conversion form vectors of each contour is calculated to obtain the difference degree $T$ of each region-of-interest pair, where the cosine similarity is
$$\cos(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert},$$
$V_1$ is the pre-conversion low-frequency form vector and $V_2$ is the post-conversion low-frequency form vector.
In this embodiment, the difference degrees $T$ of all region-of-interest pairs are arranged from large to small to obtain the conversion error distribution feature sequence $A = (T_1, T_2, \ldots, T_n)$, where $T_i$ is the difference degree of the $i$-th region-of-interest pair.
It should be noted that the conversion error distribution feature sequence describes how the errors perceptible to the human eye are distributed, from large to small, and thus reflects how the conversion result appears to a reader.
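The construction of the feature sequence can be sketched as follows. The sketch assumes that the difference degree T of a pair is the cosine similarity itself; the original formula image is unavailable, so T could equally be some other function of that similarity. The function names and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def error_distribution_sequence(vector_pairs):
    """Per-pair values T, sorted from large to small.

    Assumption: T is taken to be the cosine similarity of the pair.
    """
    return sorted((cosine_similarity(before, after)
                   for before, after in vector_pairs), reverse=True)


# Three illustrative (pre-conversion, post-conversion) form vector pairs.
vector_pairs = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),
]
sequence = error_distribution_sequence(vector_pairs)
```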
The conversion heterogeneity $P$ can now be calculated from the conversion error distribution feature sequence of the format data stream file by applying a loss function to $A$, $A_1$ and $A_K$ [formula image not reproduced], where $A$ is the conversion error distribution feature sequence of the current file, $A_1$ is the most similar distribution feature sequence and $A_K$ is the K-th similar distribution feature sequence.
Therefore, a larger $P$ means that the current format data stream file is a case that is hard to match with similar files. Such a file is regarded as a rare sample that calls for hard example mining; however, it must first be determined whether the large $P$ is caused by an abnormal file, since abnormal samples would degrade the overall machine learning performance.
The most similar distribution feature sequence $A_1$ and the K-th similar distribution feature sequence $A_K$ are obtained as follows:
in this embodiment, the similarity between the conversion error distribution feature sequence of the conversion record of the current format data stream file and those of the conversion records of the other format data stream files is calculated, and the similarities are sorted from large to small; the conversion error distribution feature sequence of the K-th most similar format data stream file is selected as $A_K$, and the sequence of the file with the maximum similarity is taken as the most similar distribution feature sequence $A_1$.
The distribution feature sequence $A$ of the conversion record of the current file is a 5-dimensional vector whose components represent the proportion of each margin; the distribution feature sequences of format data stream files with similar distributions can be obtained by calculating the cosine distance between the vectors of all format data stream file conversion records of the same day.
It should be noted that the K-th similar distribution feature sequence is selected in this embodiment to guard against the case where the distribution feature is not typical; an implementer can therefore adjust K to determine whether the conversion recorded for the current format data stream file is typical.
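The selection of the most similar and the K-th similar sequences can be sketched as below. The sketch uses cosine similarity as the similarity measure, in line with the cosine distance mentioned above, and deliberately stops short of computing the conversion heterogeneity itself, because the loss function that combines the three sequences is not recoverable from the text. The function names and the sample data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def select_reference_sequences(current, others, k):
    """Return (most similar sequence, K-th most similar sequence).

    `others` holds the feature sequences of the other files' conversion
    records; k is 1-based, so k=1 returns the most similar sequence twice.
    """
    if not 1 <= k <= len(others):
        raise ValueError("k must be between 1 and the number of other records")
    ranked = sorted(others,
                    key=lambda seq: cosine_similarity(current, seq),
                    reverse=True)
    return ranked[0], ranked[k - 1]


# Illustrative two-dimensional sequences (the embodiment uses five dimensions).
current = [1.0, 0.0]
others = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
most_similar, kth_similar = select_reference_sequences(current, others, k=3)
```

Raising k moves the second reference further away from the current record, which is how an implementer can probe whether the current conversion is typical.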
Then, the different format data stream files are classified based on the conversion heterogeneity and the sum of conversion error values of each file to obtain different category clusters; statistical analysis is performed on each category cluster to obtain a type descriptor; and the membership degree of the type descriptor is calculated. When the membership degree is greater than or equal to a threshold value, the conversion of the format data stream file is considered normal and the file is used as a training sample, until the training sample set is obtained.
In this embodiment, the sum of conversion error values and the conversion heterogeneity are obtained for each of the different format data stream files, and the files are classified to obtain different category clusters. Specifically, the files are clustered with the kmedoids algorithm, where the search radius eps defaults to 0.15 and the minimum number of samples in a cluster, minpts, is set to 4. The clustering uses the difference distance $D(X, Y)$ between two files $X$ and $Y$ [formula image not reproduced], one term of which imposes the constraint that the current sample is not highly similar to the other sample.
In this embodiment, kmedoids clusters over two independent feature dimensions, namely the sum $Q$ of conversion error values and the conversion heterogeneity $P$. The number of clusters is set to 10; an implementer may increase this number, provided that over-segmentation is avoided during clustering.
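To make the clustering step concrete, the following sketch runs a simple alternating k-medoids over (Q, P) feature pairs. It is an illustration under clearly stated assumptions: the Euclidean distance stands in for the difference distance, whose formula is not available; the first k points are used as initial medoids; the eps and minpts parameters mentioned above are not modelled; and the example uses 2 clusters instead of 10 to stay small. The function name and the data are hypothetical.

```python
import math


def k_medoids(points, k, distance=math.dist, max_iter=100):
    """Alternating (Voronoi-iteration) k-medoids.

    Returns (medoid indices, cluster label of every point).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    def assign(medoids):
        # Attach every point to its nearest medoid.
        return [min(range(k), key=lambda c: distance(p, points[medoids[c]]))
                for p in points]

    medoids = list(range(k))  # deterministic start: the first k points
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimises the total distance to its cluster.
            new_medoids.append(min(
                members,
                key=lambda i: sum(distance(points[i], points[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Illustrative (Q, P) pairs for six converted files.
features = [(0.10, 0.20), (0.15, 0.25), (0.12, 0.18),
            (2.0, 1.9), (2.1, 2.0), (1.9, 2.0)]
medoids, labels = k_medoids(features, k=2)
```

A different distance can be supplied through the `distance` parameter once the actual difference distance is known.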
It should be noted that the factors that make the file conversion configuration of each format data stream file unique are complex (the file conversion configuration is the two-dimensional vector formed by the sum of conversion error values and the conversion heterogeneity) and depend on the configuration stored in the file conversion record; the probability that a particular type of abnormality recurs in the records is almost zero, so the sum of conversion error values and the conversion heterogeneity can be determined for the conversion of multiple files.
Therefore, in this embodiment, the different format data stream files are classified according to the uniqueness degree calculated for the conversion of each file, which determines the difference distance between the conversion states of any two format data stream files; note that the uniqueness degree is in fact the calculated difference distance.
Under this constraint, the conversion state records of the format data stream files are divided into different category clusters, each of a different cluster type; relative statistics then show which cluster type a conversion record belongs to when other, similar format data stream files are converted.
The category clusters of the different format data stream files are thus obtained and the cluster types are numbered, using the cluster labels produced by the kmedoids algorithm as the numbers.
In this embodiment, the conversion configuration of each format data stream file currently being converted is determined from the existing information, which in turn determines the uniqueness degree of each file conversion record. When the file conversion configuration of a file is unique, the histogram features change noticeably, so the conversion state of the current conversion process is determined through $P$ and $Q$.
Each format data stream file is thus assigned to one of several categories. Isolated points that appear during clustering are rare but typical files and can be used as reactive samples; they are therefore clustered separately and analyzed as new, unique conversion states.
Based on the obtained category clusters, the type descriptor $Z$ of each category cluster is obtained:
based on the classification result, the distribution of the cluster types to which the conversions belong is counted to obtain the descriptor $Z$, specifically:
first, within each category cluster, the histogram of the clusters to which the samples of one conversion process belong is calculated.
Unlike conventional histogram statistics, after a sample increases the count of its own cluster by 1, the count of the closest other cluster is also increased by 1. This smooths the sample distribution and prevents the distribution from becoming extreme when a conversion process lies in a critical state between clusters.
In this embodiment, a relative distance threshold is set so that a sample that is too far from the other samples is not wrongly counted in a cluster.
In this embodiment, the closest other cluster is determined by finding the K nearest-neighbor samples of a sample and taking the cluster of the nearest one that does not belong to the sample's own cluster. If the distance to that sample exceeds the relative distance threshold, no additional count is made; otherwise the count of that category cluster is increased by 1. In this way, each conversion state has one type descriptor $Z$.
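The smoothed histogram can be sketched as follows. The sketch makes several assumptions: samples are points in a Euclidean feature space, the "relative distance threshold" is treated as a plain distance cut-off, and the descriptor is returned as a list of counts per cluster. The function name, parameters and sample data are hypothetical.

```python
import math


def smoothed_cluster_histogram(points, labels, n_clusters, k, distance_threshold):
    """Histogram of cluster membership with one extra vote per sample.

    Each sample votes for its own cluster and, if close enough, also for the
    cluster of its nearest neighbour (among the k nearest) from another cluster.
    """
    hist = [0] * n_clusters
    for i, (p, own) in enumerate(zip(points, labels)):
        hist[own] += 1
        # K nearest neighbours of sample i (excluding itself), nearest first.
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: math.dist(p, points[j]))[:k]
        # Nearest neighbour that belongs to a different cluster, if any.
        other = next((j for j in neighbours if labels[j] != own), None)
        if other is not None and math.dist(p, points[other]) <= distance_threshold:
            hist[labels[other]] += 1
    return hist


# Four illustrative samples on a line, two per cluster.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
labels = [0, 0, 1, 1]
Z = smoothed_cluster_histogram(points, labels, n_clusters=2, k=2,
                               distance_threshold=1.5)
```

In this example only the two samples near the cluster boundary cast an extra vote, while the far-away sample at (10, 0) is blocked by the threshold.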
The membership degree of the type descriptor Z is calculated as follows:
for the K nearest neighbor samples, using the k-th reachable distance, the k-th local reachable density of each format data stream file is calculated:
(formula shown as an image in the original)
where the left-hand side is the k-th local reachable density of sample M, S is a nearest neighbor sample of M, and the density is the inverse of the average k-th reachable distance from all samples in the k-th distance neighborhood of sample M to sample M.
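Since the formula survives only as an image, the verbal definition can be written out as below. This is a hedged reconstruction that follows the standard local reachability density of the Local Outlier Factor method; the symbols $N_k(M)$ (the k-distance neighborhood of M), $d_k(S)$ (the k-th distance of S), $d(M,S)$ and $\operatorname{reach\_dist}_k$ are assumed names and may differ from those in the original figure.

```latex
\[
\mathrm{lrd}_k(M) \;=\;
\left( \frac{1}{\lvert N_k(M) \rvert} \sum_{S \in N_k(M)} \operatorname{reach\_dist}_k(M, S) \right)^{-1},
\qquad
\operatorname{reach\_dist}_k(M, S) \;=\; \max\bigl\{\, d_k(S),\; d(M, S) \,\bigr\}
\]
```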
The local reachable density lrd defined above characterizes the density around sample M. The denser sample M and its surrounding samples are, the more likely each reachable distance is to equal the smaller k-th distance of the respective neighbor, and the larger the lrd value is; the sparser sample M is relative to its surroundings, the more likely each reachable distance is to be the actual distance between the two samples, and the smaller the lrd value is. The membership degree lrd of each type descriptor Z serves as the basis for deciding whether the conversion condition is typical: when lrd is too low, the file content of the whole conversion process is considered abnormal and the sample should not be added to the training set; otherwise the sample is added. The training sample set is obtained in this way.
In this embodiment, a threshold is also set for the membership degree: a sample whose membership degree is below the threshold is considered too abnormal and is excluded from training; otherwise it is included.
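A minimal sketch of the membership computation and the threshold filter is given below. It assumes the standard Local Outlier Factor definition of reachable distance (the larger of the neighbor's k-th distance and the actual distance) and Euclidean distance between feature vectors; the function names and the threshold value used in any call are hypothetical.

```python
import numpy as np

def local_reachability_density(points, k):
    """k-th local reachable density of every sample (standard LOF definition, assumed)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours
    k_distance = dists[np.arange(n), neighbours[:, -1]]  # distance to the k-th neighbour
    lrd = np.empty(n)
    for m in range(n):
        s = neighbours[m]
        # Reachable distance from M to each neighbour S.
        reach = np.maximum(k_distance[s], dists[m, s])
        # Inverse of the average reachable distance; duplicate points would
        # give a zero mean and are not handled in this sketch.
        lrd[m] = 1.0 / reach.mean()
    return lrd

def select_training_samples(points, k, threshold):
    """Keep samples whose membership degree (lrd) reaches the threshold."""
    lrd = local_reachability_density(points, k)
    return lrd >= threshold, lrd
```

For example, `select_training_samples(features, k=2, threshold=0.3)` returns a boolean mask of retained samples together with their membership degrees, where `features` would hold one row per file with its conversion heterogeneity and sum of conversion error values.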
Step two: the constructed neural network model is trained with the training sample set to obtain the trained neural network model.
In this embodiment, the conversion heterogeneity and the sum of conversion error values of each sample in the training sample set are used as the input of the neural network model, and the membership degree is used as the output, so as to train the constructed neural network model. The loss function of the network model is constructed as follows:
during training, the membership degrees of all samples are normalized based on the set threshold, and the normalized value is used as the weight of the corresponding sample, which gives the improved loss function:
(formula shown as an image in the original)
where the function appearing in the formula (its symbol is also shown only as an image) is the mean square error function.
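The weighting idea can be illustrated with a small sketch. The exact normalization and the exact form of the improved loss are not recoverable from the text, so two assumptions are made here and should be read as such: the weight of a sample is its membership degree divided by the threshold, and the loss is the mean of the weighted squared errors. Both function names are hypothetical.

```python
import numpy as np

def membership_weights(membership, threshold):
    """Assumed normalization: membership degree divided by the set threshold."""
    return np.asarray(membership, dtype=float) / float(threshold)

def weighted_mse(pred, target, weights):
    """Mean of per-sample squared errors scaled by the sample weights (assumed form)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(weights * (pred - target) ** 2))
```

Under this choice, retained samples all have a weight of at least 1, and more typical samples (higher membership degree) contribute more to the loss.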
As a result, for other format data stream files processed later, the difficulty caused by errors and impression defects can be analyzed effectively, and if an implementer adds such a format data stream file to the data set, the method automatically acts as a form of difficult-sample mining.
The neural network model in this embodiment is an FCN network model.
In this embodiment, the abstract file conversion configuration of each format data stream file sample is obtained, and the probability of the conversion states that are likely to follow a given file conversion configuration is derived from the conversion records of the samples learned by the FCN. This yields high-quality training samples for training the neural network model and provides a basis for the subsequent evaluation of the conversion quality of format data stream files.
Step three: the conversion quality of the format data stream file to be tested is evaluated with the trained neural network model.
In this embodiment, the conversion heterogeneity and the sum of conversion error values of the format data stream file to be tested are obtained with the same calculation method as described in step two.
The obtained conversion heterogeneity and sum of conversion error values are input into the trained neural network model, which outputs the membership degree; whether the conversion of the format data stream file is normal is then judged from the membership degree.
The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (7)

1. A method for evaluating format data stream file conversion based on a machine learning model is characterized by comprising the following steps:
step one, acquiring a training sample set;
step two, training the constructed neural network model by using the training sample set to obtain a trained neural network model;
step three, evaluating the conversion quality of the format data stream file to be tested by using the trained neural network model;
the acquisition process of the training sample set comprises the following steps:
extracting regions of interest from the format data stream file before and after conversion, respectively, to obtain a plurality of region of interest pairs, wherein each region of interest pair comprises a pre-conversion region of interest and a post-conversion region of interest; calculating the absolute value of the conversion error in each region of interest pair, so as to obtain the sum of conversion error values of the format data stream file;
performing convex hull detection on the pre-conversion region of interest and the post-conversion region of interest in each region of interest pair, respectively, to obtain two corresponding convex hulls; performing a Fourier transform on each convex hull to obtain frequency domain information, yielding two corresponding frequency domain signals, which are used as the pre-conversion form vector and the post-conversion form vector of the region of interest pair, respectively; obtaining a conversion error distribution characteristic from the pre-conversion and post-conversion form vectors, obtaining the conversion error distribution characteristic sequence over all region of interest pairs, and obtaining the conversion heterogeneity based on the conversion error distribution characteristic sequence;
classifying the different format data stream files based on the conversion heterogeneity and the sum of conversion error values of each format data stream file to obtain different category clusters; performing statistical analysis on each category cluster to obtain a type descriptor; and calculating the membership degree of the type descriptor, wherein when the membership degree is greater than or equal to a threshold, the conversion of the format data stream file is normal and the file is used as a training sample, until the training sample set is obtained.
2. The method for evaluating format data stream file conversion based on a machine learning model according to claim 1, wherein the input of the neural network model is the conversion heterogeneity and the sum of conversion error values of each format data stream file, and the output is the membership degree.
3. The method for evaluating format data stream file conversion based on a machine learning model according to claim 1, wherein the conversion error distribution characteristic is obtained by calculating the cosine similarity between the pre-conversion form vector and the post-conversion form vector in each region of interest pair.
4. The method for evaluating format data stream file conversion based on a machine learning model according to claim 1, wherein the conversion heterogeneity is obtained as follows:
calculating the similarity between the conversion error distribution characteristic sequence of the conversion record of the current format data stream file and the conversion error distribution characteristic sequence of the conversion record of each other format data stream file, sorting the similarities in descending order, and selecting the conversion error distribution characteristic sequence of the K-th most similar format data stream file together with the conversion error distribution characteristic sequence of the format data stream file with the maximum similarity, the latter serving as the most similar distribution characteristic sequence;
calculating the conversion heterogeneity from the conversion error distribution characteristic sequence of the conversion record of the current file, the most similar distribution characteristic sequence, and the distribution characteristic sequence of the K-th file:
(formula shown as an image in the original)
wherein the symbols, which are shown only as images in the original, denote in order: the conversion error distribution characteristic sequence; the most similar distribution characteristic sequence; the K-th similar distribution characteristic sequence; and a loss function, whose name is likewise shown only as an image.
5. The method for evaluating format data stream file conversion based on a machine learning model according to claim 1, wherein the conversion error value is the absolute value of the difference between the aspect ratio of the pre-conversion region of interest and the aspect ratio of the post-conversion region of interest in any one of the region of interest pairs, wherein the aspect ratio is calculated as:
(formula shown as an image in the original)
where w is the width, L is the length, min() takes the minimum value, and max() takes the maximum value.
6. The method for evaluating format data stream file conversion based on a machine learning model according to claim 1, wherein the different format data stream files are classified into different category clusters as follows:
calculating the difference distance between any two format data stream files from the conversion heterogeneity and the sum of conversion error values of each format data stream file:
(formula shown as an image in the original)
wherein the symbols, which are shown only as images in the original, denote in order: the sum of conversion error values of format data stream file X; the sum of conversion error values of format data stream file Y; the conversion heterogeneity of format data stream file X; and the conversion heterogeneity of format data stream file Y;
and clustering the format data stream files according to the difference distance to obtain different category clusters.
7. The method for evaluating format data stream file conversion based on a machine learning model according to claim 1, wherein the membership degree is:
(formula shown as an image in the original)
wherein the first symbol (shown as an image in the original) is the total number of samples in the neighborhood cluster set of sample M, and the second symbol (shown as an image in the original) is the reachable distance between sample M and sample S in the neighborhood cluster.
CN202211244772.0A 2022-10-12 2022-10-12 Evaluation method for format data stream file conversion based on machine learning model Active CN115328871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211244772.0A CN115328871B (en) 2022-10-12 2022-10-12 Evaluation method for format data stream file conversion based on machine learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211244772.0A CN115328871B (en) 2022-10-12 2022-10-12 Evaluation method for format data stream file conversion based on machine learning model

Publications (2)

Publication Number Publication Date
CN115328871A true CN115328871A (en) 2022-11-11
CN115328871B CN115328871B (en) 2023-01-03

Family

ID=83914173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211244772.0A Active CN115328871B (en) 2022-10-12 2022-10-12 Evaluation method for format data stream file conversion based on machine learning model

Country Status (1)

Country Link
CN (1) CN115328871B (en)

Also Published As

Publication number Publication date
CN115328871B (en) 2023-01-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant