CN111738224B - Intelligent analysis method, system and storage medium for medicine document content

Info

Publication number: CN111738224B
Application number: CN202010737944.2A
Authority: CN
Inventors: 葛亚飞; 王立君; 林加旗; 魏巍; 包卿
Original assignee: Zhejiang Mingdu Intelligent Control Technology Co ltd
Current assignee: Mingdu Zhiyun Zhejiang Technology Co Ltd
Priority date: 2020-07-28
Filing date: 2020-07-28
Publication date: 2020-12-08
Anticipated expiration: 2040-07-28
Also published as: CN111738224A

Abstract

The invention discloses an intelligent analysis method for drug document contents, which comprises the steps of respectively obtaining cell coordinates and contents in a first table and a second table to be analyzed, and identifying cells with consistent contents; respectively acquiring minimum table matrixes of a first table and a second table; acquiring an abnormal cell group according to the difference between the minimum table matrixes of the two tables and the position of the minimum table matrixes in the corresponding table; and comparing the contents of the cells with inconsistent contents in the abnormal cell group, and finding out and marking inconsistent character sets in the contents of the cells. Finally, the number of inconsistent cell results presented to the user is reduced, and the user can conveniently and quickly check and find the wrong and abnormal tables.

Description

Intelligent analysis method, system and storage medium for medicine document content

Technical Field

The invention relates to the technical field of data processing and analysis, in particular to an intelligent analysis method, system and storage medium for drug document content.

Background

A spreadsheet, also known as an electronic form, is a grid made up of a series of rows and columns in which values, calculations, text, and the like may be stored. A common spreadsheet, for example an Excel sheet, is typically submitted to a version management server for version management. In day-to-day document processing in some fields, a large number of documents must often be processed. These documents contain a large number of tables that are highly similar to one another and that reference and nest within one another, and because of the heavy workload, multiple users often have to edit the documents collaboratively. For example, when the research and development organization of a pharmaceutical enterprise prepares and organizes drug declaration data, there are a large number of documents and a large number of tables in those documents. The tables are highly similar, reference one another, and are nested. Because the associated tables are compiled manually by several people, errors often arise, such as inconsistent contents in corresponding cells, lost table rows and columns, and disordered row and column order. Because the tables are numerous and spread over many different documents, later manual inspection is extremely laborious and such errors are hard to find. The result is data errors, difficulty in meeting compliance requirements, and serious delay to the drug declaration.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides an intelligent analysis method for medicine document contents, which is used for analyzing the table content difference in a document and comprises the following steps:

S1, respectively obtaining the coordinates and contents of the cells in the first table and the second table to be analyzed, and identifying the cells with consistent contents;

S2, respectively obtaining the minimum table matrixes of the first table and the second table, wherein a minimum table matrix is the minimum rectangular table area containing all the cells with consistent contents in the table;

S3, acquiring an abnormal cell group according to the difference between the minimum table matrixes of the two tables and their positions in the corresponding tables, wherein the abnormal cell group includes but is not limited to the coordinates and contents of the cells with inconsistent contents;

S4, comparing the contents of the cells with inconsistent contents in the abnormal cell group, and finding and marking the inconsistent character sets in the cell contents.
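The text does not say how the inconsistent character sets of step S4 are computed. As a minimal sketch only, the snippet below uses Python's standard `difflib.SequenceMatcher` as a stand-in technique; the function name `inconsistent_characters` is hypothetical and not taken from the patent.

```python
from difflib import SequenceMatcher


def inconsistent_characters(text_a, text_b):
    """Return (fragment_in_a, fragment_in_b) pairs where two cell contents differ.

    Assumption: a character-level diff is an acceptable reading of
    "inconsistent character sets"; the patent does not name an algorithm.
    """
    spans = []
    matcher = SequenceMatcher(None, text_a, text_b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, inserted, or deleted runs
            spans.append((text_a[i1:i2], text_b[j1:j2]))
    return spans
```

For two cells containing "10 mg" and "10 mL", only the trailing unit character would be reported, which is the kind of fragment a user interface could then highlight.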

Preferably, the step S3 specifically includes:

S31, respectively acquiring the number of rows and the number of columns of the first table and of the second table;

S32, when the difference between the numbers of rows or the numbers of columns of the first table and the second table is smaller than a preset value, acquiring the abnormal cell group according to the difference between the minimum table matrixes of the two tables and their positions in the corresponding tables; otherwise, not carrying out the subsequent difference analysis.

Preferably, when the difference between the number of rows or the number of columns of the first table and the second table is smaller than a preset value, the step S32 includes:

S101, traversing each cell of the minimum table matrixes when the numbers of rows and columns of the minimum table matrixes of the two tables are consistent;

S102, comparing whether the contents at corresponding positions of the two minimum table matrixes are equal; if so, not recording the abnormal cell group; otherwise, recording the abnormal cell group.

Preferably, when the difference between the number of rows or the number of columns of the first table and the second table is smaller than a preset value, the step S32 further includes:

S103, if the numbers of rows and columns of the minimum table matrixes are the same, transposing the first table to form a first transposed table;

S104, comparing the minimum table matrix of the second table with the minimum table matrix of the first transposed table and identifying the cells with inconsistent contents;

S105, if no cell with inconsistent contents exists, not performing the subsequent difference analysis; otherwise, generating a second abnormal cell group, wherein the second abnormal cell group includes but is not limited to the coordinates and contents of the cells with inconsistent contents between the minimum table matrix of the first transposed table and the minimum table matrix of the second table.

Preferably, the step S32 further includes:

S106, comparing the number of inconsistent cells between the minimum table matrixes of the first table and the second table with the number of inconsistent cells between the minimum table matrixes of the first transposed table and the second table;

S107, if the number of inconsistent cells between the minimum table matrixes of the first transposed table and the second table is smaller, acquiring the coordinates and contents of those inconsistent cells and using them to update the coordinates and contents of the inconsistent cells recorded for the first table and the second table in the abnormal cell group;

S108, if the number of inconsistent cells between the minimum table matrixes of the first transposed table and the second table is not smaller than the number of inconsistent cells between the minimum table matrixes of the first table and the second table, not updating the abnormal cell group.

Preferably, the step S32 further includes:

S201, when the numbers of rows and columns of the minimum table matrix of the second table after transposition are equal to those of the minimum table matrix of the first table, transposing the second table to form a second transposed table;

S202, comparing whether the contents at corresponding positions of the minimum table matrixes of the second transposed table and the first table are equal; if so, not recording the abnormal cell group; otherwise, recording the abnormal cell group.

Preferably, the step S202 specifically includes: if the minimum table matrixes of the second transposed table and the first table contain cells with inconsistent contents, recording the coordinates and contents of those cells in the abnormal cell group corresponding to the first table and the second table.

Preferably, the first table and the second table are located within different electronic documents.

The invention also discloses an intelligent analysis system, which comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor realizes the steps of any one of the intelligent analysis methods for the medicine document content when executing the computer program.

The invention also discloses a computer readable storage medium, which stores a computer program, and the computer program is executed by a processor to realize the steps of the intelligent analysis method for the drug document content.

The intelligent analysis method for drug document content disclosed by the invention obtains the coordinates and contents of the cells in the first table and the second table to be analyzed, identifies the cells with consistent contents, and carries out the subsequent comparative analysis only on associated table pairs, thereby greatly reducing the amount of computation required for comparative analysis. In comparing the two tables, various situations are fully considered, such as a small table embedded in a large table, transposition of table rows and columns, disordered row and column order, and missing rows and columns inside and around the tables; each situation is analyzed separately to find the coordinates and positions of the inconsistent cells between the two tables. Finally, in the abnormal-result processing step, inconsistent cells caused merely by disordered or missing rows and columns are eliminated, so that the genuinely abnormal table pairs and the inconsistent character sets within their inconsistent cells are screened out. This reduces the number of results presented to the user and lets the user find erroneous and abnormal tables conveniently and quickly. The method thus enables comparative analysis of two tables and finds and locates inconsistent cells and cell content differences in associated tables. It is suitable for scenarios in which a large number of similar tables must be processed and tables are nested and reused in multiple places; it can eliminate a large amount of repetitive manual consistency checking and helps avoid errors in which corresponding cells of associated tables have inconsistent contents.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

Fig. 1 is a schematic flow chart of an intelligent analysis method for drug document contents disclosed in this embodiment.

Fig. 2 is a schematic flowchart of step S21 disclosed in this embodiment.

Fig. 3 is a schematic diagram illustrating the transformation of the table to be analyzed according to the present embodiment.

Fig. 4 is a schematic flowchart of the step S212 disclosed in this embodiment.

Fig. 5 is a schematic flowchart of step S3 disclosed in this embodiment.

Fig. 6 is a schematic flow chart of the step S32 in the state one disclosed in this embodiment.

Fig. 7 is a schematic flowchart of the specific process of step S32 in state two disclosed in this embodiment.

Fig. 8 is a schematic flow chart illustrating the step S32 in the first state of the present embodiment.

Fig. 9 is a schematic specific flowchart of the step S32 in another state of the third embodiment.

Fig. 10 is a schematic flow chart illustrating the step S32 in the fourth situation according to this embodiment.

Fig. 11 is a schematic specific flowchart of the step S32 in another state of the fourth embodiment.

Fig. 12 is a schematic structural diagram of the intelligent analysis system disclosed in this embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

In the present invention, unless otherwise expressly stated or limited, "above" or "below" a first feature means that the first and second features are in direct contact, or that the first and second features are not in direct contact but are in contact with each other via another feature therebetween. Also, the first feature being "on," "above" and "over" the second feature includes the first feature being directly on and obliquely above the second feature, or merely indicating that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature includes the first feature being directly under and obliquely below the second feature, or simply meaning that the first feature is at a lesser elevation than the second feature.

Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. The use of "first," "second," and similar terms in the description and claims of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. Also, the use of the terms "a" or "an" and the like do not denote a limitation of quantity, but rather denote the presence of at least one.

At present, various enterprises such as medicine enterprise research and development institutions and the like have a large number of documents and a large number of tables in the documents when preparing to arrange medicine declaration data. The tables have high similarity, and are mutually referenced and nested. Because the associated tables are usually manually sorted by multiple persons, various error conditions such as inconsistent content of corresponding cells, missing rows and columns of the tables, disordered row and column sequences and the like are inevitably generated in the tables. In addition, due to the fact that the forms are numerous and distributed in a large number of different documents, workload of later-stage manual inspection is extremely large, the error condition is difficult to find completely and quickly, data errors are caused, and requirements for material compliance are difficult to meet. To solve the technical problems, as shown in fig. 1, the embodiment discloses an intelligent analysis method for drug document contents, which specifically includes:

Step S1, respectively obtaining the coordinates and the content of the cells in the first table and the second table to be analyzed, and identifying the cells with consistent content.

Specifically, coordinates and contents of each cell of two tables to be analyzed are obtained, and table structured data corresponding to the coordinates and the contents of each table cell are generated respectively. Wherein the table structured data may include information such as document information to which the table belongs, the position of the table in the document, the coordinates of the cells in the table, and/or the contents of the cells. The first table and the second table may be located in different electronic documents or may be located in the same electronic document.

By obtaining the coordinates and contents of the cells of a table, each table is processed into a data structure in which cell coordinates and cell contents correspond one to one. Specifically, tables in documents of various formats, such as Word, Excel, and PDF, can be read with existing tools and methods. Each cell of the table is traversed; the coordinates of all cells are joined into one character string with spaces as separators, and the contents of all cells are likewise joined into one character string with spaces as separators, while ensuring that coordinates and contents remain in one-to-one correspondence. In some embodiments, because merged cells exist in the table, reading the coordinates of each cell must ensure that cells in the same row have equal row coordinates and cells in the same column have equal column coordinates. Cell coordinates that are missing as a result of merging can be filled in with empty content, that is, the cell content corresponding to those coordinates is empty.

Specifically, each table may be stored in a data structure of the following form:

{
    String fileId       // the id of the document in which the table is located
    Integer location    // the location of the table in the document
    String coordinate   // coordinate information of all cells in the table
    String content      // content information of all cells in the table
}

The fileId and location fields indicate the document to which the table belongs and the position of the table in that document, and are used to locate the table when the analysis and comparison results are presented to the user. The coordinate and content fields hold the cell coordinates and contents used for table association analysis and comparative analysis. The structured table data can be used directly for subsequent analysis, but when the number of tables is large it occupies a large amount of memory; to avoid repeating the structuring operation for each analysis, the method recommends persisting the structured table data. Any relational or non-relational database, such as MySQL, SQL Server, Oracle, MongoDB, or Elasticsearch, may be selected for persistence. Extracting the coordinates and contents of the table cells into structured table data, while ensuring that cells in the same row or column have consistent row or column coordinates, prepares the tables for subsequent analysis.
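The structure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the names `TableRecord`, `build_record`, and `parse_record` are hypothetical, coordinates are encoded as "row,col", the table is non-empty, and cell text is assumed to contain no spaces, because the text uses a space as the separator.

```python
from dataclasses import dataclass


@dataclass
class TableRecord:
    file_id: str     # id of the document in which the table is located
    location: int    # position of the table in the document
    coordinate: str  # space-separated "row,col" coordinates of all cells
    content: str     # space-separated cell contents, in the same order


def build_record(file_id, location, grid):
    """Flatten a rectangular grid (a list of rows) into a TableRecord.

    Cells left empty by merging are stored as empty strings, so every
    coordinate still has exactly one content entry.
    """
    coords, contents = [], []
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            coords.append(f"{r},{c}")
            contents.append(text)
    return TableRecord(file_id, location, " ".join(coords), " ".join(contents))


def parse_record(record):
    """Recover parallel coordinate and content lists from a TableRecord."""
    coords = [tuple(int(x) for x in item.split(","))
              for item in record.coordinate.split(" ")]
    contents = record.content.split(" ")  # split(" ") keeps empty cells
    if len(coords) != len(contents):
        raise ValueError("coordinates and contents are not one-to-one")
    return coords, contents
```

Real cell text often contains spaces, so a production version would need an escaping scheme or a different separator; that choice is outside what the text specifies.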

Step S2, respectively obtaining minimum table matrices of the first table and the second table, where the minimum table matrices are minimum rectangular table areas containing all cells with consistent contents in the tables.

In some embodiments, the step further comprises:

Step S21, comparing the structured data of the tables pair by pair, obtaining the coordinates and contents of the cells with consistent contents in the two tables, and judging whether the two tables are associated. This step screens out tables with related content for the subsequent comparative analysis; non-associated tables are regarded as two completely different tables and are not compared. That is, when two tables are not associated, they can be regarded as normal, different tables, and the differences between them need not be presented to the user, because the disclosed method is intended to point out to the user cells that should have identical contents in different documents, or in different tables of the same document, but that may have become inconsistent through human error.

As shown in Fig. 2, the step S21 specifically includes:

Step S211, comparing the structured data of the two tables pair by pair, and obtaining the coordinates and the content of the cells with the same content in the two tables.

Step S212, judging the association state of the two tables according to the number and/or distribution positions of the cells with consistent contents. As shown in Fig. 4, the step S212 further includes:

Step S2121, acquiring the number of content-consistent cells of each table and the distribution positions of the cells in the table.

Step S2122, obtaining a minimum table matrix of each table, wherein the minimum table matrix is a minimum rectangular table area containing all cells with consistent contents in the table.

Specifically, taking two tables A and B as an example, the structured data of table A is taken, and its coordinate and content fields are parsed into two linked lists, coordinateListA and contentListA, respectively. The coordinates and contents at the same index positions of the two linked lists correspond one to one. Similarly, the structured data of table B is taken and parsed to obtain coordinateListB and contentListB. The linked lists contentListA and contentListB are traversed to find equal elements, and the corresponding coordinates are then looked up in coordinateListA and coordinateListB based on those content elements. The coordinates of the content-consistent cells of tables A and B form a dictionary sameCell, in which each key is the coordinate of a consistent cell in table A and the value is the coordinate of the corresponding consistent cell in table B.
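A minimal sketch of building the sameCell dictionary from the four parsed lists is given below. The function name `find_same_cells` is hypothetical, and two behaviours are assumptions that the text does not settle: empty cells are never counted as matches, and each cell of table B is paired with at most one cell of table A.

```python
def find_same_cells(coords_a, contents_a, coords_b, contents_b):
    """Map each table-A coordinate to a table-B coordinate with equal content."""
    # Index table B by content so each lookup avoids rescanning the list.
    positions_b = {}
    for coord, text in zip(coords_b, contents_b):
        if text:  # assumption: empty cells do not count as consistent
            positions_b.setdefault(text, []).append(coord)

    same_cell = {}
    used_b = set()
    for coord, text in zip(coords_a, contents_a):
        for candidate in positions_b.get(text, []):
            if candidate not in used_b:  # assumption: one-to-one pairing
                same_cell[coord] = candidate
                used_b.add(candidate)
                break
    return same_cell
```

Indexing table B first keeps the pairing roughly linear in the number of cells, instead of comparing every cell of A with every cell of B.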

Tables A and B are then converted into matrixes. Taking table A as an example, as shown in Fig. 3, the table matrix is initialized according to the cell coordinates as a matrix of the same size as the original table, with every element initially 0. According to the dictionary sameCell, the elements at the coordinates of the consistent cells are rewritten to 1. The region extending from the origin to the coordinates of the lower-right-most 1 forms a large matrix. Removing the all-zero rows and columns above and to the left of the large matrix yields the minimum table matrix and the coordinates of its first element. In Fig. 3, the cells enclosed by the dotted frame form the large matrix, and the cells filled in gray form the minimum table matrix. The minimum table matrix can be thought of as a small table embedded in a large table. Matrix coordinates start from 0; in Fig. 3, the first element [0, 0] of the minimum table matrix lies at [2, 2] in the large matrix, and [2, 2] is therefore the first-element coordinate of the minimum table matrix.
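The construction just described amounts to taking the bounding box of the cells marked 1. A minimal sketch, assuming the matched coordinates of one table are already known (for example the keys of sameCell for table A); the function name `minimum_table_matrix` is hypothetical.

```python
def minimum_table_matrix(n_rows, n_cols, matched_coords):
    """Return (matrix, first_element) for the bounding box of matched cells.

    matrix is a list of rows of 0/1 flags; first_element is the (row, col)
    coordinate of the matrix's [0, 0] element inside the original table.
    """
    if not matched_coords:
        return [], None

    # Start from an all-zero matrix the size of the table, then mark matches.
    grid = [[0] * n_cols for _ in range(n_rows)]
    for r, c in matched_coords:
        grid[r][c] = 1

    rows = [r for r, _ in matched_coords]
    cols = [c for _, c in matched_coords]
    top, bottom = min(rows), max(rows)
    left, right = min(cols), max(cols)

    # Cropping to the bounding box drops the all-zero border rows and columns.
    matrix = [row[left:right + 1] for row in grid[top:bottom + 1]]
    return matrix, (top, left)
```

With matches at (2, 2), (2, 3), (3, 2), and (4, 4) in a 5 x 5 table, the first-element coordinate is (2, 2), mirroring the [2, 2] example in the text.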

Step S2123, when the number of the cells with consistent content is greater than a preset value, and/or two times of the number of the cells with consistent content is greater than a preset proportion of the sum of the number of the cells contained in the two tables, and/or the number of the cells with consistent content is greater than a preset proportion of the total number of the cells in the minimum table matrix, judging that the two tables are associated table pairs.

Specifically, after the matrixes of tables A and B are obtained, it is judged whether the two tables are associated tables. The judgment rules can be formulated according to experience and actual conditions. The preset rules in this embodiment may be:

At least n cells of the two tables have the same content, that is, the number of 1s in the minimum table matrix is larger than n.

Twice the number of consistent cells is greater than m percent of the sum of the numbers of cells in the two tables.

The numbers of rows and columns of the minimum table matrixes of tables A and B are greater than 1, and the number of 1s is greater than L percent of the total number of elements of the minimum table matrix.

The two tables are regarded as associated tables only if one or more of the above rules are satisfied; otherwise they are regarded as two unrelated tables. In this embodiment, the recommended values are n = 3, m = 50, and L = 50, and these may be adjusted according to the specific documents.
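The three rules can be sketched as a single predicate. This is an illustration only: the function name `is_associated` and its parameters are hypothetical, the rules are combined with a logical OR following "one or more of the above rules", the first rule is read as strictly greater than n, and a single minimum table matrix size is passed in for simplicity even though each table has its own.

```python
def is_associated(matched, cells_a, cells_b, min_rows, min_cols,
                  n=3, m=50, L=50):
    """Decide whether two tables form an associated table pair.

    matched:  number of content-consistent cells (the 1s in the matrix)
    cells_a:  total number of cells in table A
    cells_b:  total number of cells in table B
    min_rows, min_cols: size of the minimum table matrix
    n, m, L:  thresholds; the defaults are the values recommended in the text
    """
    rule_count = matched > n
    rule_share = 2 * matched > (m / 100) * (cells_a + cells_b)
    rule_density = (min_rows > 1 and min_cols > 1
                    and matched > (L / 100) * (min_rows * min_cols))
    return rule_count or rule_share or rule_density
```

Stricter screening, for example requiring all three rules, is a one-line change and may suit document sets with many boilerplate header cells.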

Analyzing the minimum table matrixes of the two tables makes it possible to account for nested table content, that is, the case in which a small table is embedded in a larger table and only the embedded content is associated with the other table. In that case, the area occupied by the embedded small table is first delimited by obtaining the minimum table matrix, and that area is then compared with the other table, or with the nested area in the other table, to determine whether the two are associated. In addition, human error usually produces only a small number of mistakes, that is, only a small number of inconsistent cells in associated tables. Therefore, when the number of inconsistent cells between the two tables to be analyzed is too large, that is, when the two tables differ greatly, they can be regarded as normal, unrelated tables and are not presented to the user.

Through the above steps, all structured tables are traversed, and the associated table pairs, together with their consistent-cell coordinate dictionary sameCell and their table matrixes, are found and used as input parameters for the subsequent comparative analysis. Table association analysis is thus performed by comparing the cell contents of the tables, establishing the minimum table matrix formed by the cells with consistent contents, and screening with preset custom criteria whether two tables form an associated table pair. Only associated table pairs undergo the subsequent comparative analysis, which greatly reduces its amount of computation.

Step S3, obtaining an abnormal cell group according to the difference between the minimum table matrixes of the two tables and the position in the corresponding table, where the abnormal cell group includes, but is not limited to, the content inconsistent cell coordinates and content.

As shown in fig. 5, the step S3 specifically includes:

S31, respectively acquiring the number of rows and the number of columns of the first table and of the second table.

In this embodiment, the step S3 can be divided into the following states, which are processed separately:

State one: the numbers of rows and columns of the minimum table matrixes of the two tables in the associated table pair are determined to be consistent.

State two: the minimum table matrix of one of the two tables in the associated table pair, after being transposed, is determined to have the same numbers of rows and columns as the minimum table matrix of the other table.

State three: the numbers of rows and columns of the minimum table matrixes of the two tables in the associated table pair are determined to differ by n, where n is smaller than a preset value.

State four: the numbers of rows and columns of the two tables in the associated table pair are determined to differ by n, where n is smaller than a preset value.

For state one, the step of obtaining an abnormal cell group according to the difference between the minimum table matrixes of the two tables and their positions in the corresponding tables, as shown in Fig. 6, specifically includes:

S101, traversing each cell of the minimum table matrixes when the numbers of rows and columns of the minimum table matrixes of the two tables are consistent.

S102, comparing whether the contents at corresponding positions of the two minimum table matrixes are equal; if so, not recording the abnormal cell group; otherwise, recording the abnormal cell group. That is, if the cell contents at all corresponding positions of the two minimum table matrixes are the same, nothing is recorded, because the associated table pair contains no cells with content errors.

Specifically, taking the associated table pair A and B in this embodiment as an example, the minimum table matrixes of the two tables have the same numbers of rows and columns. Each element of the minimum table matrixes is traversed, and the contents at corresponding coordinates of the minimum table matrixes of A and B are compared for equality. The coordinates and contents of all unequal cells of A and B are recorded, together with the associated table pair, to generate an abnormal cell group. The abnormal cell group may use a dictionary data structure DifTablecells whose key is the associated table pair, for example A B. The value is a linked list of arrays, each element of which holds the coordinates and contents of one pair of inconsistent cells in the two tables, for example [A cell coordinate, A cell content, B cell coordinate, B cell content]. When the contents at all corresponding coordinates of the minimum table matrixes of A and B are equal, the contents of tables A and B match and nothing is recorded.
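The state-one comparison and the DifTablecells record can be sketched as follows. The names `compare_minimum_matrices` and `record_pair` are hypothetical, each table is assumed to be available as a dictionary from (row, col) to cell content, and the dictionary key is taken to be a tuple of table names, since the text does not fix the key format.

```python
def compare_minimum_matrices(cells_a, origin_a, cells_b, origin_b,
                             n_rows, n_cols):
    """Compare two equally sized minimum table matrixes cell by cell.

    cells_a, cells_b:   dicts mapping (row, col) -> content for each table
    origin_a, origin_b: first-element coordinates of each minimum matrix
    Returns a list of [A coordinate, A content, B coordinate, B content].
    """
    diffs = []
    for r in range(n_rows):
        for c in range(n_cols):
            coord_a = (origin_a[0] + r, origin_a[1] + c)
            coord_b = (origin_b[0] + r, origin_b[1] + c)
            text_a = cells_a.get(coord_a, "")
            text_b = cells_b.get(coord_b, "")
            if text_a != text_b:
                diffs.append([coord_a, text_a, coord_b, text_b])
    return diffs


def record_pair(dif_table_cells, pair_key, diffs):
    """Store diffs under the table-pair key; matching pairs are not recorded."""
    if diffs:
        dif_table_cells[pair_key] = diffs
    return dif_table_cells
```

Because the offsets are applied per table, the same routine covers the case where the matching region starts at different positions in the two tables, such as a small table embedded in a larger one.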

For state one, the step of screening the abnormal cell group of the associated table pair according to the distribution positions of the cells with inconsistent content may further include the following steps:

S103: if the number of rows and the number of columns of the minimum table matrix are the same, transpose the first table to form a first transposed table.

S104: compare the minimum table matrix of the second table with the minimum table matrix of the first transposed table and identify the cells with inconsistent content.

S106: compare the number of inconsistent cells between the minimum table matrices of the first table and the second table with the number of inconsistent cells between the minimum table matrices of the first transposed table and the second table.

S107: if the number of inconsistent cells between the minimum table matrices of the first transposed table and the second table is smaller, obtain the coordinates and contents of those inconsistent cells and use them to update the coordinates and contents of the inconsistent cells recorded for the first table and the second table in the abnormal cell group.

For example, in this embodiment, if the minimum table matrices of A and B have the same number of rows and columns, the two tables may correspond to each other only after one of them is transposed. Therefore, after one table is transposed, the elements at corresponding coordinates are compared again, and the resulting number of cells with inconsistent content is compared with the number obtained earlier between tables A and B; the comparison with the smaller number is taken as the correct result. For example, table A is transposed to form the transposed table C. Each cell in the minimum table matrix areas of table C and table B is traversed, the elements or contents at corresponding coordinates of the minimum table matrices of C and B are compared, and the coordinates and contents of all corresponding cells with inconsistent content are recorded, together with the associated table pair. If the number of cells with inconsistent content obtained for the minimum table matrices of C and B is smaller than the number obtained for the minimum table matrices of A and B in the previous step, this indicates that the row content of table A probably corresponds to the column content of table B, and the column content of table A to the row content of table B, which is why the transposed table has fewer inconsistent cells against the other table than the original table does.
By obtaining and screening the cells with inconsistent content before and after transposition for two tables whose minimum table matrices have the same number of rows and columns, tables that merely exchange row content and column content can be effectively distinguished. Such tables can be regarded as normal associated tables with the same content and need not be presented to the user.
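The screening by transposition in steps S103 to S107 can be illustrated with a short sketch. The helper names (count_mismatches, transpose, best_alignment) and the plain list-of-lists representation of a square minimum table matrix are assumptions for illustration only.

```python
from typing import List, Tuple

Grid = List[List[str]]


def count_mismatches(m1: Grid, m2: Grid) -> int:
    """Number of positions at which two equally sized grids differ."""
    return sum(a != b for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def best_alignment(grid_a: Grid, grid_b: Grid) -> Tuple[str, int]:
    """Compare A with B directly and after transposing A; keep the smaller count.

    Assumes both grids are square with the same size, as in steps S103 to S107.
    """
    direct = count_mismatches(grid_a, grid_b)
    swapped = count_mismatches(transpose(grid_a), grid_b)
    if swapped < direct:
        return "transposed", swapped
    return "direct", direct


if __name__ == "__main__":
    a = [["n", "1", "2"], ["x", "3", "4"], ["y", "5", "6"]]
    b = transpose(a)
    b[2][2] = "7"  # one genuine content difference
    print(best_alignment(a, b))
```

In this example the direct comparison reports seven differing cells while the transposed comparison reports one, so only the single genuine difference would be kept in the abnormal cell group.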

For state two, as shown in Fig. 7, the step of obtaining the abnormal cell group according to the difference between the minimum table matrices of the two tables and their positions in the corresponding tables specifically includes:

S201: when the minimum table matrix of the second table, after being transposed, has the same numbers of rows and columns as the minimum table matrix of the first table, transpose the second table to form a second transposed table.

S202: compare whether the contents at corresponding positions of the minimum table matrices of the second transposed table and the first table are equal; if they are equal, nothing is recorded in the abnormal cell group; otherwise the cell is recorded in the abnormal cell group. Specifically, if the minimum table matrices of the second transposed table and the first table contain cells with inconsistent content, the coordinates and contents of those cells are recorded in the abnormal cell group corresponding to the first table and the second table.

Specifically, this embodiment covers the case where the numbers of rows and columns of the minimum table matrices of A and B become consistent once one of them is transposed. In this case, after one table is transposed, each cell in the minimum table matrix is traversed and the cell contents or elements at corresponding coordinates of the minimum table matrices of A and B are compared. The coordinates and contents of all unequal cells in A and B are recorded, together with the associated table pair, to generate or extend the abnormal cell group, whose data format may follow that described in the steps above. If all compared cells are equal, the contents of tables A and B match and nothing is recorded. That the row and column counts agree after one minimum table matrix is transposed indicates that the nested small table in table A and the nested small table in table B probably differ only in that row content and column content have been exchanged. Such an exchange is only a difference in presentation, and the tables can be regarded as normal associated tables with the same content that need not be presented to the user. It is therefore only necessary to transpose one of the small tables and compare it with the other in order to find the cells whose content is truly inconsistent and present them to the user.
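A minimal sketch of steps S201 and S202 follows. The function name compare_after_transposing_b is hypothetical, and mapping each mismatch back to its position in the untransposed table B is an assumption; the description only states that the coordinates and contents of the unequal cells are recorded.

```python
from typing import List, Optional

Grid = List[List[str]]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def compare_after_transposing_b(grid_a: Grid, grid_b: Grid) -> Optional[list]:
    """Transpose B when that makes its shape equal to A's, then compare.

    Returns None when the shapes still differ after transposition.
    Coordinates are relative to each minimum table matrix (assumed convention).
    """
    b_t = transpose(grid_b)
    if (len(grid_a), len(grid_a[0])) != (len(b_t), len(b_t[0])):
        return None
    records = []
    for i, row in enumerate(grid_a):
        for j, a in enumerate(row):
            b = b_t[i][j]
            if a != b:
                # Position (i, j) of the transposed table is (j, i) in the original B.
                records.append([(i, j), a, (j, i), b])
    return records


if __name__ == "__main__":
    a = [["n", "x", "y"], ["1", "2", "3"]]
    b = [["n", "1"], ["x", "2"], ["y", "9"]]
    print(compare_after_transposing_b(a, b))
```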

For the case in state three where the numbers of rows of the minimum table matrices of the two tables differ by n, as shown in Fig. 8, the step of obtaining the abnormal cell group according to the difference between the minimum table matrices of the two tables and their positions in the corresponding tables specifically includes:

Step S301: when the minimum table matrix of the first table in the associated table pair has N more rows than the minimum table matrix of the second table, and N is less than a preset value, obtain the N rows with the most inconsistent cells in the minimum table matrix of the first table, and record the coordinates and contents of the cells in those N rows.

Step S302: remove the N rows from the minimum table matrix of the first table to form a transition table matrix.

Step S303: compare in turn whether the cell contents at corresponding positions of the transition table matrix and the minimum table matrix of the second table are the same; if differing cells exist, generate or extend an abnormal cell group, which includes, but is not limited to, the associated table pair, the coordinates and contents of the cells with inconsistent content in the transition table matrix and the second table, and the coordinates and contents of the cells in the N rows.

In this embodiment, the value of n may be preset according to the actual usage environment; n = 2 is taken as an example. Specifically, when the numbers of rows of the minimum table matrices of A and B differ by more than 2, tables A and B are not considered to be associated tables and no comparative analysis is performed. The case where the minimum table matrix of A has one more row than the minimum table matrix of B is described in detail. The row containing the most '0' elements in the minimum table matrix of A is found; this row is taken to be the extra row, and the coordinates and content of each cell of the row are recorded. The row is removed from the minimum table matrix of A and the elements below it are shifted up to form table A'. The cell contents or elements at corresponding coordinates of the minimum table matrix of A' and the minimum table matrix of B are then compared. If they are not all equal, the coordinates and contents of all unequal cells are recorded, together with the associated table pair, and an abnormal cell group is generated or extended. The abnormal cell group includes, but is not limited to, the associated table pair, the coordinates and contents of the cells with inconsistent content in the transition table matrix and the second table, and the coordinates and contents of the cells of the N extra rows. Other row-count differences can be compared and analyzed with reference to the above method.
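Steps S301 to S303 can be sketched as below. The sketch assumes that the minimum table matrix is available both as a 0/1 matrix (mask_a, where 0 marks a cell whose content was not matched in the other table) and as a grid of cell contents; the function name and these two parallel structures are illustrative assumptions, not part of the described method.

```python
from typing import List, Tuple

Grid = List[List[str]]


def remove_extra_rows(mask_a: List[List[int]], grid_a: Grid, grid_b: Grid,
                      n: int = 1) -> Tuple[list, list]:
    """Drop the n rows of A with the most 0 entries, then compare with B.

    Returns (cells of the extra rows, mismatches between A' and B).
    """
    # Rank rows by their number of 0 entries, most first (ties keep row order).
    ranked = sorted(range(len(mask_a)), key=lambda r: mask_a[r].count(0), reverse=True)
    extra = sorted(ranked[:n])
    extra_cells = [((r, c), grid_a[r][c]) for r in extra for c in range(len(grid_a[r]))]

    # Rows kept in A' keep their original row index r; i is their index after shifting up.
    kept = [r for r in range(len(grid_a)) if r not in extra]
    mismatches = []
    for i, r in enumerate(kept):
        for c, a in enumerate(grid_a[r]):
            if a != grid_b[i][c]:
                mismatches.append([(r, c), a, (i, c), grid_b[i][c]])
    return extra_cells, mismatches


if __name__ == "__main__":
    mask_a = [[1, 1], [0, 0], [1, 0]]
    grid_a = [["a", "b"], ["x", "y"], ["c", "d"]]
    grid_b = [["a", "b"], ["c", "e"]]
    print(remove_extra_rows(mask_a, grid_a, grid_b))
```

Removing the extra row first means the remaining rows line up with table B again, so a single inserted row no longer makes every row below it look inconsistent.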

For the case in state three where the numbers of columns of the minimum table matrices of the two tables differ by n, as shown in Fig. 9, the step of obtaining the abnormal cell group according to the difference between the minimum table matrices of the two tables and their positions in the corresponding tables specifically includes:

Step S401: when the minimum table matrix of the first table in the associated table pair has N more columns than the minimum table matrix of the second table, and N is less than a preset value, obtain the N columns with the most inconsistent cells in the minimum table matrix of the first table, and record the coordinates and contents of the cells in those N columns.

Step S402: remove the N columns from the minimum table matrix of the first table to form a transition table matrix.

Step S403: compare in turn whether the cell contents at corresponding positions of the transition table matrix and the minimum table matrix of the second table are the same; if differing cells exist, generate or extend an abnormal cell group, which includes, but is not limited to, the associated table pair, the coordinates and contents of the cells with inconsistent content in the transition table matrix and the second table, and the coordinates and contents of the cells in the N columns.

In this embodiment, the value of n may be preset according to the actual usage environment; n = 2 is taken as an example. Specifically, when the numbers of columns of the minimum table matrices of A and B differ by more than 2, tables A and B are not considered to be associated tables and no column comparison analysis is performed. The case where the minimum table matrix of A has one more column than the minimum table matrix of B is described in detail. The column containing the most '0' elements in the minimum table matrix of A is found; this column is taken to be the extra column, and the coordinates and content of each cell of the column are recorded. The column is removed from the minimum table matrix of A and the elements to its right are shifted left to form table A'. The cell contents or elements at corresponding coordinates of the minimum table matrix of A' and the minimum table matrix of B are then compared. If they are not all equal, the coordinates and contents of all unequal cells are recorded, together with the associated table pair, and an abnormal cell group is generated or extended. The abnormal cell group includes, but is not limited to, the associated table pair, the coordinates and contents of the cells with inconsistent content in the transition table matrix and the second table, and the coordinates and contents of the cells of the N extra columns. Other column-count differences can be compared and analyzed with reference to the above method.
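For the column case, the selection of the extra columns and the formation of the transition table matrix A' (steps S401 and S402) might look as follows. The function names and the 0/1 mask representation are assumptions made for illustration.

```python
from typing import List


def find_extra_columns(mask_a: List[List[int]], n: int = 1) -> List[int]:
    """Indices of the n columns with the most 0 entries in the 0/1 matrix of A."""
    cols = list(zip(*mask_a))  # column-wise view of the matrix
    ranked = sorted(range(len(cols)), key=lambda c: cols[c].count(0), reverse=True)
    return sorted(ranked[:n])


def drop_columns(grid: List[List[str]], extra: List[int]) -> List[List[str]]:
    """Remove the given columns; cells to the right shift left, forming A'."""
    return [[v for c, v in enumerate(row) if c not in extra] for row in grid]


if __name__ == "__main__":
    mask_a = [[1, 0, 1], [1, 0, 1]]
    grid_a = [["a", "x", "b"], ["c", "y", "d"]]
    extra = find_extra_columns(mask_a)
    print(extra, drop_columns(grid_a, extra))
```

The resulting A' has the same shape as the minimum table matrix of B and can then be compared cell by cell as in step S403.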

In this embodiment, when the numbers of rows or columns of the minimum table matrices of tables A and B differ by n rows or n columns, the whole rows or columns of inconsistent cells between the nested small table in table A and the nested small table in table B were probably caused by rows and columns being put out of order or lost through human negligence while the tables were being prepared. It is therefore necessary first to exclude the whole rows or columns of inconsistent cells caused by such errors and only then to compare the contents of the corresponding cells. This prevents an inconsistent whole row or column from affecting the comparison of other cells, so that inconsistent cells caused by disordered or missing rows and columns, as well as individual inconsistent cells caused by input errors, are found more accurately and quickly.

For one case of state four, the numbers of columns of the two tables of the associated table pair differ by n, where n is smaller than the preset value. As shown in Fig. 10, the step of obtaining the abnormal cell group according to the difference between the minimum table matrices of the two tables and their positions in the corresponding tables specifically includes:

Step S501: when the numbers of columns of the two tables differ by more than R, where R is a preset value, the two tables are not considered to be associated tables and no column comparison analysis is performed.

Step S502: otherwise, when the numbers of columns of the two tables differ by n columns, where n is not greater than R, the positions of all columns of inconsistent cells in the table with more columns are obtained; if some or all of these columns lie within the minimum table matrix of that table, processing follows the steps of state three.

Step S503: if a column lies outside the minimum table matrix of the table, the coordinates and contents of all cells in the column are recorded, and the abnormal cell group of the associated table pair is generated or extended.

Specifically, in this embodiment, the minimum table matrix of a table is formed by removing the inconsistent cells on its four sides, so the two tables may differ by rows or columns in their outermost layers. The value of R can be specified according to actual conditions; in this embodiment R is set to 2, that is, when the numbers of columns of tables A and B differ by more than 2, the two tables are not considered to be associated tables and no comparative analysis is performed. The case where table A has one more column than table B is described in detail. It is determined whether the extra column of A relative to B is the first column or the last column; if the extra column is not on either side, the case belongs to state three and is processed accordingly. To do so, it is determined whether the first column of table A is all 0; if so, the first column is the extra column. Otherwise, it is determined whether the last column is all 0; if so, the last column is the extra column. Otherwise, the extra column is not on either side of the table, the case belongs to state three, and no comparative analysis is performed here. After the extra column is found, if the associated table pair A, B already exists in DifTablecells, the coordinates and contents of the cells of that column are appended to its value; if the associated table pair A, B does not exist in DifTablecells, the associated table pair A B, together with the coordinates and contents of the cells of that column, is added to the dictionary DifTablecells, with the coordinates and contents of the corresponding cells in table B left empty.
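The edge-column check and the recording rule of state four can be sketched as follows. The function names, the 0/1 matrix mask_a for the whole of table A, and the use of None for the empty B-side coordinate and content are assumptions made for this example.

```python
from typing import Dict, List, Optional


def find_extra_edge_column(mask_a: List[List[int]]) -> Optional[int]:
    """Return the index of an all-0 first or last column, or None.

    None means the extra column is not on either side (handled as state three).
    """
    if all(row[0] == 0 for row in mask_a):
        return 0
    last = len(mask_a[0]) - 1
    if all(row[last] == 0 for row in mask_a):
        return last
    return None


def record_extra_column(dif_table_cells: Dict[str, list], pair_key: str,
                        grid_a: List[List[str]], col: int) -> None:
    """Add the cells of the extra column; the B side of each record stays empty."""
    records = [[(r, col), row[col], None, None] for r, row in enumerate(grid_a)]
    # setdefault covers both cases: the pair already exists, or it is added now.
    dif_table_cells.setdefault(pair_key, []).extend(records)


if __name__ == "__main__":
    mask_a = [[1, 1, 0], [1, 1, 0]]
    grid_a = [["a", "b", "p"], ["c", "d", "q"]]
    dif = {}
    col = find_extra_edge_column(mask_a)
    if col is not None:
        record_extra_column(dif, "A B", grid_a, col)
    print(dif)
```

The same check applies to an extra first or last row if the matrix is examined row-wise instead of column-wise.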

For the other case of state four, the numbers of rows of the two tables of the associated table pair differ by n, where n is smaller than the preset value. As shown in Fig. 11, the step of obtaining the abnormal cell group according to the difference between the minimum table matrices of the two tables and their positions in the corresponding tables specifically includes:

Step S601: when the numbers of rows of the two tables differ by more than R, where R is a preset value, the two tables are not considered to be associated tables and no comparative analysis is performed.

Step S602: otherwise, when the numbers of rows of the two tables differ by n rows, where n is not greater than R, the positions of all rows of inconsistent cells in the table with more rows are obtained; if some or all of these rows lie within the minimum table matrix of that table, processing follows the steps of state three.

Step S603: if a row lies outside the minimum table matrix of the table, the coordinates and contents of all cells in the row are recorded, and the abnormal cell group of the associated table pair is generated or extended.

Specifically, in this embodiment, the minimum table matrix of a table is formed by removing the inconsistent cells on its four sides, so the two tables may differ by rows or columns in their outermost layers. The value of R can be specified according to actual conditions; in this embodiment R is set to 2, that is, when the numbers of rows of tables A and B differ by more than 2, the two tables are not considered to be associated tables and no comparative analysis is performed. The case where table A has one more row than table B is described in detail. It is determined whether the extra row of A relative to B is the first row or the last row; if the extra row is not on either side, the case belongs to state three and is processed accordingly. To do so, it is determined whether the first row of table A is all 0; if so, the first row is the extra row. Otherwise, it is determined whether the last row is all 0; if so, the last row is the extra row. Otherwise, the extra row is not on either side of the table, the case belongs to state three, and no comparative analysis is performed here. After the extra row is found, if the associated table pair A, B already exists in DifTablecells, the coordinates and contents of the cells of that row are appended to its value; if the associated table pair A, B does not exist in DifTablecells, the associated table pair A B, together with the coordinates and contents of the cells of that row, is added to the dictionary DifTablecells, with the coordinates and contents of the corresponding cells in table B left empty. The dictionary DifTablecells obtained in the above steps can be used as the input parameter for the difference content processing in the following step S4.

The comparative analysis algorithm takes into account the cases where a small table is embedded in a large table, the rows and columns of a table are transposed, the row and column order is disordered, and rows and columns inside or around the table are missing, and it finds the coordinates and positions of the inconsistent cells of the associated table pair.

Because the table comparative analysis in step S3 fully considers these cases, it has strong applicability and generality across many kinds of tables and can help users perform comparative analysis on them.

Step S4: compare the contents of the cells with inconsistent content in the abnormal cell group, and find and mark the inconsistent character sets in the cell contents.

Specifically, step S4 filters the associated table pairs according to the number of inconsistent cells, compares the contents of the inconsistent cells, and finds and marks the inconsistent character sets in the cell contents. Human error usually causes only a few mistakes and therefore only a few inconsistent cells. When an associated table pair has too many inconsistent cells, the two tables differ substantially and can be regarded as normally different tables that are not presented to the user. Otherwise, the pair is regarded as abnormal, whatever the cause, and needs to be presented to the user for inspection and handling. A human error may produce one or more relatively scattered inconsistent cells, or it may disorder or lose rows and columns and thereby produce whole rows or columns of inconsistent cells. The number of truly inconsistent cells is therefore defined as the number of all cells with inconsistent content minus the number of cells that result from disordered row and column order and from missing rows and columns. When the number of truly inconsistent cells is less than or equal to M, the associated table pair is considered an abnormal table pair and needs to be presented to the user for handling; otherwise the tables are considered normally different and are not presented to the user. In this embodiment, a preferred value of M is 3.

The abnormal cell group is traversed, that is, the dictionary DifTablecells is traversed. First, the inconsistent cells produced by missing rows and columns are removed, namely the array elements in the value linked list of the dictionary in which the A table cell coordinate or the B table cell coordinate is empty. Then, the inconsistent cells produced by disordered row and column order are removed. Let difCell be the linked list of inconsistent cells that remains in the difference result of tables A and B after the cells produced by missing rows and columns are removed; its data structure is List<String[]>, where each array String[] is [A cell coordinate, A cell content, B cell coordinate, B cell content]. All A cell coordinates in difCell are taken out and converted into a minimum table matrix difMA of the difference result, using the same method by which a table is converted into a minimum table matrix; in this matrix, a coordinate whose element value is 1 is the coordinate of an inconsistent cell. The necessary and sufficient condition for two columns of cells being out of order is that the values of two columns in difMA are all 1 and that, after the column coordinates of the two columns are exchanged, their contents equal the contents of the corresponding cells in table B. The necessary and sufficient condition for two rows of cells being out of order is that the values of two rows in difMA are all 1 and that, after the row coordinates of the two rows are exchanged, their contents equal the contents of the corresponding cells in table B. Disorder involving more than 2 columns or 2 rows can be handled by extending the above conditions. It is recommended to consider disorder of at most 3 rows or 3 columns; other cases are regarded as normally different tables and are not presented to the user.
According to the above necessary and sufficient conditions, the cells with disordered row or column order are found and removed from difCell, and the number of remaining cells is the number of truly inconsistent cells. The associated table pairs whose number of truly inconsistent cells is greater than M are removed from the dictionary DifTablecells, and what remains is the set of difference table pairs that need to be presented to the user.
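The filtering described above can be sketched as follows. This is a simplified illustration under stated assumptions: only pairwise row or column swaps are detected (not the 3-row or 3-column case), both grids are assumed to have the same shape with coordinates relative to the same origin, an empty coordinate is represented by None, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def transpose(grid: list) -> list:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def swapped_lines(dif_ma: List[List[int]], lines_a: list, lines_b: list) -> Set[int]:
    """Indices of lines that are entirely 1 in difMA and match B once exchanged."""
    full = [k for k, line in enumerate(dif_ma) if all(line)]
    swapped = set()
    for p, q in combinations(full, 2):
        if lines_a[p] == lines_b[q] and lines_a[q] == lines_b[p]:
            swapped.update((p, q))
    return swapped


def true_inconsistent_cells(records: list, grid_a: list, grid_b: list) -> list:
    """Remove records caused by missing or swapped rows and columns."""
    # 1. Drop records whose A or B coordinate is empty (missing rows or columns).
    dif_cell = [r for r in records if r[0] is not None and r[2] is not None]
    # 2. Build the 0/1 matrix difMA from the A coordinates of the remaining records.
    dif_ma = [[0] * len(grid_a[0]) for _ in grid_a]
    for rec in dif_cell:
        i, j = rec[0]
        dif_ma[i][j] = 1
    # 3. Detect swapped rows, then swapped columns via transposition.
    bad_rows = swapped_lines(dif_ma, grid_a, grid_b)
    bad_cols = swapped_lines(transpose(dif_ma), transpose(grid_a), transpose(grid_b))
    return [r for r in dif_cell if r[0][0] not in bad_rows and r[0][1] not in bad_cols]


if __name__ == "__main__":
    grid_a = [["h1", "h2"], ["a", "b"], ["c", "d"], ["e", "f"]]
    grid_b = [["h1", "h2"], ["c", "d"], ["a", "b"], ["e", "X"]]
    records = [[(i, j), a, (i, j), grid_b[i][j]]
               for i, row in enumerate(grid_a) for j, a in enumerate(row)
               if a != grid_b[i][j]]
    print(true_inconsistent_cells(records, grid_a, grid_b))
```

A caller would then compare the length of the returned list with the threshold M to decide whether the associated table pair is presented to the user.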

Because the contents of two corresponding inconsistent cells are often very similar, it is difficult for the user to quickly spot where the two contents actually differ. The contents of the corresponding inconsistent cells of the difference table pairs therefore need to be compared and analyzed to identify the inconsistent character sets. The contents of the two cells can be regarded as two character strings a and b. The longest common subsequence l of a and b is found, and the characters of a and b that do not belong to l form the inconsistent character sets, which can be highlighted by adding tags before and after the characters.
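The longest-common-subsequence marking can be sketched as follows. The function names and the square-bracket tags are assumptions for illustration; the description does not specify a particular tag format, and this sketch tags each differing character individually.

```python
from typing import List, Tuple


def lcs_table(a: str, b: str) -> List[List[int]]:
    """dp[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    return dp


def mark_differences(a: str, b: str, open_tag: str = "[",
                     close_tag: str = "]") -> Tuple[str, str]:
    """Wrap every character outside the longest common subsequence in tags."""
    dp = lcs_table(a, b)
    out_a, out_b = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:  # character belongs to the common subsequence
            out_a.append(a[i])
            out_b.append(b[j])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:  # skipping a[i] loses nothing
            out_a.append(open_tag + a[i] + close_tag)
            i += 1
        else:
            out_b.append(open_tag + b[j] + close_tag)
            j += 1
    # Whatever is left in either string lies outside the common subsequence.
    out_a.extend(open_tag + ch + close_tag for ch in a[i:])
    out_b.extend(open_tag + ch + close_tag for ch in b[j:])
    return "".join(out_a), "".join(out_b)


if __name__ == "__main__":
    print(mark_differences("10.5 mg", "10.8 mg"))
```

For the two cell contents "10.5 mg" and "10.8 mg", the common subsequence is "10. mg", so only the characters 5 and 8 are tagged.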

In step S4, the analysis results are filtered again according to the number of truly inconsistent cells, so that the inconsistent cells produced by disordered row and column order and by missing rows and columns are excluded, and the inconsistent character sets are then found from the contents of the truly inconsistent cells. Screening out the truly abnormal table pairs and the inconsistent character sets within their inconsistent cells reduces the number of results presented to the user and helps the user quickly find erroneous and abnormal tables. In addition, comparing the contents of the inconsistent cells allows the inconsistent character sets to be found and highlighted, which makes it easier for the user to spot the abnormal content of a table.

According to the intelligent analysis method for drug document content disclosed by the invention, the cell coordinates and contents of the first table and the second table to be analyzed are obtained, the cells with consistent content are identified, and the subsequent comparative analysis of the two tables is performed on that basis, which greatly reduces the amount of computation required. During the comparative analysis of the two tables, cases such as a small table embedded in a large table, transposed rows and columns, disordered row and column order, and missing rows and columns inside or around a table are fully considered, and the analysis is carried out according to the case that applies in order to find the coordinates and positions of the inconsistent cells between the two tables. Finally, in the abnormal result processing step, the inconsistent cells produced by disordered row and column order and by missing rows and columns are excluded, and the truly abnormal table pairs and the inconsistent character sets within their inconsistent cells are screened out from the truly inconsistent cells and their contents. This reduces the number of results presented to the user and helps the user quickly find erroneous and abnormal tables. The method thus enables comparative analysis of two tables and finds and locates the inconsistent cells of associated tables and the differences in their cell contents.
The method is suitable for scenarios in which a large number of similar tables must be processed and the tables are nested and reused in many places. It can eliminate a large amount of repetitive manual work in checking table consistency and helps avoid errors in which the contents of corresponding cells of associated tables are inconsistent.

As shown in fig. 12, the present invention further provides an intelligent analysis system 1 for table content differences, which includes a memory 11, a processor 12, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the intelligent analysis method for drug document contents as described in the above embodiments.

The intelligent analysis system for table content differences may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the schematic diagram is merely an example of an intelligent analysis system for table content differences and does not constitute a limitation on the intelligent analysis system device for table content differences, which may include more or fewer components than those shown, may combine some components, or may use different components. For example, the intelligent analysis system device for table content differences may also include input and output devices, network access devices, buses, and the like.

The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and the like. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the intelligent analysis system device for table content differences and uses various interfaces and lines to connect the various parts of the device.

The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the intelligent analysis system device for table content differences by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like, and the memory may include a high speed random access memory, and may further include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

The intelligent analysis system for table content differences, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, it implements the steps of the above embodiments of the intelligent analysis method for drug document content. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

In summary, the above-mentioned embodiments are only preferred embodiments of the present invention, and all equivalent changes and modifications made in the claims of the present invention should be covered by the claims of the present invention.

Claims

1. An intelligent analysis method for drug document content, which is used for analyzing table content differences in a document, and is characterized by comprising the following steps:

S3, obtaining an abnormal cell group according to the difference between the minimum table matrixes of the two tables and the positions of the minimum table matrixes in their corresponding tables, wherein the abnormal cell group includes, but is not limited to, the coordinates and the content of cells with inconsistent content, which specifically comprises:

when the numbers of rows and columns of the minimum table matrixes of the two tables are consistent, traversing each cell of the minimum table matrixes and comparing whether the contents at corresponding positions of the two minimum table matrixes are equal; if so, not recording the cell in the abnormal cell group; otherwise, recording the cell in the abnormal cell group;

when the numbers of rows and columns of the minimum table matrix of the second table, after being transposed, are equal to those of the minimum table matrix of the first table, transposing the second table to form a second transposed table; comparing whether the contents at corresponding positions of the two minimum table matrixes of the second transposed table and the first table are equal; if so, not recording the cell in the abnormal cell group; otherwise, recording the cell in the abnormal cell group;
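By way of illustration only, and not as part of the claim, the two comparison branches above can be sketched in Python. The sketch assumes that each minimum table matrix is already available as a rectangular list of rows of cell strings; the names `AbnormalCell`, `diff_cells`, and `compare_min_matrices` are hypothetical, and trying the direct comparison before the transposed one is an assumption made for this example.

```python
from typing import List, NamedTuple, Optional, Tuple

Matrix = List[List[str]]  # assumed representation: rectangular rows of cell contents


class AbnormalCell(NamedTuple):
    """One entry of the abnormal cell group: coordinates plus both contents."""
    row: int
    col: int
    first_content: str
    second_content: str


def shape(m: Matrix) -> Tuple[int, int]:
    """Return (number of rows, number of columns) of a rectangular matrix."""
    return (len(m), len(m[0]) if m else 0)


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def diff_cells(first: Matrix, second: Matrix) -> List[AbnormalCell]:
    """Traverse two same-shaped matrices and collect cells whose contents differ."""
    return [
        AbnormalCell(r, c, a, b)
        for r, (row_a, row_b) in enumerate(zip(first, second))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]


def compare_min_matrices(first: Matrix, second: Matrix) -> Optional[List[AbnormalCell]]:
    """Compare directly when shapes match, else against the transposed second matrix.

    Returns None when neither branch applies (hypothetical convention).
    Coordinates are expressed in the first matrix's orientation.
    """
    if shape(first) == shape(second):
        return diff_cells(first, second)
    second_t = transpose(second)
    if shape(second_t) == shape(first):
        return diff_cells(first, second_t)
    return None


first = [["a", "b", "c"], ["1", "2", "3"]]
second = [["a", "1"], ["b", "2"], ["c", "9"]]  # transposed layout, one cell differs
abnormal = compare_min_matrices(first, second)
```

Here `second` has 3 rows and 2 columns, so only the transposed branch applies, and the single differing cell is reported at row 1, column 2 of the first matrix.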

2. The intelligent analysis method for drug document content according to claim 1, wherein the step S3 specifically comprises:

3. The intelligent analysis method for drug document content according to claim 2, wherein when the difference between the number of rows or the number of columns of the first table and the second table is smaller than a preset value, the step S32 further comprises:

4. The intelligent analysis method for drug document content according to claim 3, wherein the step S32 further comprises:

5. The intelligent analysis method for drug document content according to claim 4, wherein comparing whether the contents at corresponding positions of the two minimum table matrixes of the second transposed table and the first table are equal, and recording nothing in the abnormal cell group if they are equal and recording the cell in the abnormal cell group otherwise, specifically comprises:

and if the two minimum table matrixes of the second transposed table and the first table have cells with inconsistent contents, recording the coordinates and the contents of the cells with inconsistent contents into abnormal cell groups corresponding to the first table and the second table.

6. The intelligent analysis method for drug document content according to any one of claims 1 to 5, wherein: the first table and the second table are located within different electronic documents.

7. An intelligent analysis system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, realizes the steps of the method according to any of claims 1-6.

8. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, realizes the steps of the method according to any of claims 1-6.