CN116090432A - Document matching method and electronic device - Google Patents

Info

Publication number
CN116090432A
Authority
CN
China
Prior art keywords
query
character
candidate character
cells
cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310182575.9A
Other languages
Chinese (zh)
Inventor
姚贡之
高煜光
程文渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hongji Information Technology Co Ltd
Original Assignee
Shanghai Hongji Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hongji Information Technology Co Ltd
Priority to CN202310182575.9A
Publication of CN116090432A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a document matching method and an electronic device. The method comprises: recognizing a document to be queried to extract a plurality of candidate character cells contained in the document to be queried; matching a query keyword with each candidate character cell to determine valid character cells; and determining, according to the position information of each valid character cell, the target logic row in which the valid character cells are located.

Description

Document matching method and electronic device
Technical Field
The application relates to the technical field of document processing, and in particular to a document matching method and an electronic device.
Background
In the field of document retrieval, it is common to recognize the text contained in an unstructured document and then match keywords against the recognized text. To improve recall during text matching, recognition errors must also be tolerated; however, the resulting matching lacks interpretability, and its effect varies greatly across data distributions.
Disclosure of Invention
The invention aims to provide a document matching method and an electronic device, so as to solve the problem that the effect of document recognition is insufficient.
In a first aspect, the present invention provides a document matching method, including: recognizing a document to be queried to extract a plurality of candidate character cells contained in the document to be queried; matching a query keyword with each candidate character cell to determine valid character cells; and determining, according to the position information of each valid character cell, the target logic row in which the valid character cells are located.
In this embodiment, the content of the document to be queried is determined in the form of character cells. Compared with independent single characters, characters grouped into a cell can better tolerate local errors, which improves the accuracy and effectiveness of screening valid character cells. Further, after the valid character cells are determined, the target logic row can be determined according to their position information, and the keywords to be screened can be presented accurately and effectively based on the target logic row.
In an alternative embodiment, matching the query keyword with each candidate character cell to determine a valid character cell includes: dividing the plurality of candidate character cells into a plurality of logic columns, each logic column comprising one or more character cells; for each logic column, determining a column semantic tag of the logic column according to the character cells it contains; screening the column semantic tags according to the query tag in the query keyword to determine a semantic matching group that matches the query tag; and screening the candidate character cells in the semantic matching group according to the query content in the query keyword to determine the valid character cells that match the query content.
In the above embodiment, the method includes analyzing the document to be queried to obtain a plurality of candidate character cells including position information and character content, dividing all candidate character cells into a plurality of logic columns according to the position information of each candidate character cell, extracting column semantic tags of the logic columns, screening out the logic columns with the same column semantic tags as the query tags, and retrieving valid character cells matched with the query content from the screened logic columns.
In an alternative embodiment, the candidate character cell includes character content and position information, the position information including the coordinates of the upper left corner and the lower right corner of the circumscribed rectangle of the character content; dividing the plurality of candidate character cells into a plurality of logic columns includes: for any first candidate character cell and second candidate character cell, if the abscissa of the upper left corner of the circumscribed rectangle of the second candidate character cell is larger than the abscissa of the lower right corner of the circumscribed rectangle of the first candidate character cell, determining that the column information of the second candidate character cell is larger than the column information of the first candidate character cell; determining the column information of all candidate character cells; and assigning candidate character cells with the same column information to the same logic column, so as to divide the plurality of candidate character cells into a plurality of logic columns.
In an optional implementation, screening the candidate character cells in the semantic matching group according to the query content in the query keyword to determine the valid character cells that match the query content includes: for each semantic matching group, calculating a matching value between the query content in the query keyword and each candidate character cell in the semantic matching group; and taking each candidate character cell whose matching value is greater than a first threshold as a valid character cell.
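The column rule above can be sketched in Python. This is a minimal sketch under assumptions that are not in the original: the `Cell` class, the `assign_columns` helper, and the left-to-right sweep that opens a new logic column whenever a cell's left edge lies beyond the right edge of everything seen so far are hypothetical choices, not the procedure stated by the application.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical container: text plus bounding-box corners (origin at top left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def assign_columns(cells):
    """Group cells into logic columns with a left-to-right sweep (assumed strategy)."""
    columns = []
    right_edge = None  # largest x1 seen in the column currently being built
    for cell in sorted(cells, key=lambda c: c.x0):
        if right_edge is None or cell.x0 > right_edge:
            # The cell starts to the right of the current column: larger column index.
            columns.append([cell])
            right_edge = cell.x1
        else:
            # Horizontal overlap with the current column: same column index.
            columns[-1].append(cell)
            right_edge = max(right_edge, cell.x1)
    return columns


cells = [
    Cell("Model", 100, 0, 150, 10),
    Cell("Name", 0, 0, 80, 10),
    Cell("A", 5, 20, 70, 30),
    Cell("B", 105, 20, 160, 30),
]
columns = assign_columns(cells)
```

With these sample boxes, "Name" and "A" overlap horizontally and share a column, while "Model" and "B" start to the right of both and form the next column.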
In the above embodiment, the valid character cell can be determined by the matching value with the query content, and the required character cell can be more accurately screened out under the condition of having a certain fault tolerance.
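The thresholding step can be illustrated with a short Python sketch. The similarity measure is an assumption: the text does not fix one at this point, so `difflib.SequenceMatcher` from the standard library stands in for it, and the function names and the threshold value 0.7 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(query: str, text: str) -> float:
    """Stand-in matching value in [0, 1]; the metric itself is an assumption."""
    return SequenceMatcher(None, query, text).ratio()


def select_valid_cells(query, cell_texts, first_threshold):
    """Keep the cells whose matching value is strictly greater than the first threshold."""
    return [t for t in cell_texts if similarity(query, t) > first_threshold]


valid = select_valid_cells("PN10", ["PNT0", "PN10", "360,000.00"], 0.7)
```

Here the misrecognized string "PNT0" still passes because three of its four characters align with the query, which is the kind of fault tolerance the paragraph above describes.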
In an alternative embodiment, calculating the matching value between the query content in the query keyword and the candidate character cells in the semantic matching group includes: for each candidate character cell in the semantic matching group, transforming the candidate character cell to obtain a transformed character cell; and determining the matching value between the query content in the query keyword and the candidate character cell according to the transformed character cell and the query content in the query keyword.
In this embodiment, transforming the candidate character cells mitigates errors introduced by character recognition and improves the accuracy of the calculated matching value, which in turn improves the accuracy of the screened valid character cells.
In an alternative embodiment, transforming the candidate character cell to obtain a transformed character cell includes: calculating the confidence of each character in the candidate character cell; and masking the characters whose confidence is smaller than a second threshold to obtain the transformed character cell.
In this embodiment, characters with low confidence are masked, so that the transformed character cell represents the content actually presented in the document to be queried more accurately, which further improves the accuracy of the calculated matching value.
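A minimal Python sketch of the masking transformation follows. The mask symbol `*`, the function name, and the sample confidences are assumptions made for illustration; the text only states that characters whose confidence is below the second threshold are masked.

```python
def mask_low_confidence(text, confidences, second_threshold, mask="*"):
    """Replace every character whose confidence is below the threshold with a mask symbol."""
    if len(text) != len(confidences):
        raise ValueError("one confidence value per character is required")
    return "".join(
        ch if conf >= second_threshold else mask
        for ch, conf in zip(text, confidences)
    )


# "PN10" misread as "PNT0": the doubtful third character is masked out.
masked = mask_low_confidence("PNT0", [0.99, 0.98, 0.40, 0.95], 0.6)
```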
In an optional implementation, determining the matching value between the query content in the query keyword and the candidate character cell according to the transformed character cell and the query content in the query keyword includes: calculating the similarity between the transformed character cell and the query content in the query keyword, and taking the similarity as the matching value between the query content in the query keyword and the candidate character cell; or determining the number of masked characters in the transformed character cell and, if the number of masked characters is larger than a third threshold, taking a set value as the matching value between the query content in the query keyword and the candidate character cell.
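The two alternatives above can be combined into one illustrative Python function. This is a sketch under assumptions: the text presents the similarity and the set value as alternatives without stating how they are combined, so the branch order here is a guess, `difflib.SequenceMatcher` is a stand-in similarity, and the mask symbol `*` and all parameter names are hypothetical.

```python
from difflib import SequenceMatcher


def matching_value(query, transformed, third_threshold, set_value, mask="*"):
    """Return the set value when too many characters are masked, else a similarity."""
    if transformed.count(mask) > third_threshold:
        # Too little reliable text remains, so the similarity is not trusted.
        return set_value
    return SequenceMatcher(None, query, transformed).ratio()


partly_masked = matching_value("PN10", "PN*0", third_threshold=1, set_value=0.0)
mostly_masked = matching_value("PN10", "P***", third_threshold=1, set_value=0.0)
```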
In an alternative embodiment, the third threshold is determined by: $MT(k) = \log(\mathrm{len}(Q)/k)$; wherein $MT(k)$ represents the third threshold; $Q$ represents the query content in the query keyword; $\mathrm{len}(Q)$ represents the length of $Q$; $\log()$ represents a logarithmic function; and $k$ represents a set integer.
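As a worked example of the threshold formula, the Python sketch below evaluates MT(k) for a six-character query. The base of the logarithm is not specified in the text; the natural logarithm is assumed here, and the function name is hypothetical.

```python
import math


def mask_threshold(query_content: str, k: int) -> float:
    """MT(k) = log(len(Q) / k), with the natural logarithm assumed."""
    if k <= 0 or not query_content:
        raise ValueError("k must be positive and the query content non-empty")
    return math.log(len(query_content) / k)


mt = mask_threshold("360000", 2)  # log(6 / 2) = log(3), roughly 1.0986
```

Under this assumption, a longer query or a smaller k raises the threshold, so more masked characters are tolerated before the set value is used.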
In an alternative embodiment, the set value is determined by: $\exp(MT - Ev(O))$; wherein $MT$ represents the third threshold; $\exp()$ represents an exponential function with the natural constant as its base; $O$ represents any one candidate character cell; and $Ev(O)$ represents the difference between the number of masked characters in the transformed character cell corresponding to the candidate character cell $O$ and the third threshold.
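Read literally, the definition above gives a value that shrinks as the number of masked characters grows. The Python sketch below follows that literal reading, taking Ev(O) as the masked count minus the third threshold; the function names are hypothetical and the reading itself is an assumption.

```python
import math


def excess_masked(masked_count: int, mt: float) -> float:
    """Ev(O): number of masked characters minus the third threshold (literal reading)."""
    return masked_count - mt


def set_matching_value(masked_count: int, mt: float) -> float:
    """exp(MT - Ev(O)), applied when the masked count exceeds the third threshold."""
    return math.exp(mt - excess_masked(masked_count, mt))


value = set_matching_value(3, 1.0)  # exp(1.0 - (3 - 1.0)) = exp(-1.0)
```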
In an alternative embodiment, transforming the candidate character cell to obtain a transformed character cell includes: calculating the confidence of each character in the candidate character cell; and replacing the characters whose confidence is smaller than the second threshold with visually similar characters to obtain the transformed character cell.
In this embodiment, characters with low confidence are replaced with visually similar characters, so that the transformed character cell represents the content actually presented in the document to be queried more accurately, which further improves the accuracy of the calculated matching value.
In an optional embodiment, replacing the characters whose confidence is smaller than the second threshold with visually similar characters to obtain a transformed character cell includes: replacing the characters whose confidence is smaller than the second threshold with visually similar characters to obtain a plurality of groups of transformed character cells; and determining the matching value between the query content in the query keyword and the candidate character cell according to the transformed character cell and the query content in the query keyword includes: calculating the similarity between each group of transformed character cells and the query content in the query keyword; and selecting the maximum of these similarities as the matching value between the query content in the query keyword and the candidate character cell.
In the above embodiment, a plurality of matching values are calculated and the largest is selected, so that the resulting matching value better expresses how well the query content matches the candidate character cell.
In an alternative embodiment, calculating the matching value between the query content in the query keyword and the candidate character cells in the semantic matching group includes: calculating the confidence of each character in the candidate character cell; masking the characters whose confidence is smaller than the second threshold to obtain a first transformed character cell; replacing the characters whose confidence is smaller than the second threshold with visually similar characters to obtain a second transformed character cell; determining a first matching value between the query content in the query keyword and the candidate character cell according to the first transformed character cell and the query content in the query keyword; determining a second matching value between the query content in the query keyword and the candidate character cell according to the second transformed character cell and the query content in the query keyword; and taking the larger of the first matching value and the second matching value as the matching value between the query content in the query keyword and the candidate character cell in the semantic matching group.
In the above embodiment, the matching value can be calculated in two ways and the larger of the two results selected, which improves the accuracy of the calculated matching value.
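The replace-then-take-the-maximum scheme can be sketched in Python as follows. Everything named here is an assumption for illustration: the `LOOKALIKES` table of look-alike characters is invented, `difflib.SequenceMatcher` stands in for the unspecified similarity, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical table of look-alike characters; not taken from the source.
LOOKALIKES = {"T": ["1", "7"], "O": ["0"], "l": ["1", "I"]}


def candidate_variants(text, confidences, second_threshold, table=LOOKALIKES):
    """Expand each low-confidence character into its look-alike replacements."""
    if len(text) != len(confidences):
        raise ValueError("one confidence value per character is required")
    options = []
    for ch, conf in zip(text, confidences):
        if conf < second_threshold:
            options.append(table.get(ch, [ch]))  # keep the character if no entry exists
        else:
            options.append([ch])
    return ["".join(combo) for combo in product(*options)]


def best_matching_value(query, text, confidences, second_threshold, table=LOOKALIKES):
    """Take the maximum similarity over all groups of transformed text."""
    variants = candidate_variants(text, confidences, second_threshold, table)
    return max(SequenceMatcher(None, query, v).ratio() for v in variants)


variants = candidate_variants("PNT0", [0.99, 0.98, 0.40, 0.95], 0.6)
best = best_matching_value("PN10", "PNT0", [0.99, 0.98, 0.40, 0.95], 0.6)
```

The number of variants grows multiplicatively with the number of low-confidence characters, so a practical implementation would probably cap it; that cap is not part of this sketch.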
In an alternative embodiment, calculating the matching value between the query content in the query keyword and the candidate character cells in the semantic matching group includes: calculating the matching value by using a confidence algorithm, wherein the confidence algorithm is defined as: $\mathrm{confidence}(Q, O) = \mathrm{sim}(Q, \mathrm{transform}(O))$; wherein $Q$ represents the query content in the query keyword; $O$ represents any one candidate character cell; $\mathrm{confidence}(Q, O)$ represents the matching value between the query content in the query keyword and that candidate character cell; $\mathrm{sim}(Q, \mathrm{transform}(O))$ represents a function for calculating the similarity between $Q$ and $\mathrm{transform}(O)$; and $\mathrm{transform}(O)$ represents a transformation function for transforming $O$.
In an optional implementation, determining the target logic row in which the valid character cells are located according to the position information of each valid character cell includes: determining a plurality of initial logic rows according to the position information of each valid character cell; determining the number of valid character cells contained in each initial logic row; and determining the target logic row according to the number of valid character cells in each initial logic row.
In an optional implementation, determining the target logic row according to the number of valid character cells in each initial logic row includes: determining, as the target logic row, the initial logic row that contains the largest number of valid character cells.
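The row selection can be sketched in Python. The text does not say here how initial logic rows are derived from the position information, so the vertical-overlap sweep in `group_rows` is an assumption, as are the tuple layout `(text, x0, y0, x1, y1)` and the function names.

```python
def group_rows(cells):
    """Group (text, x0, y0, x1, y1) tuples into rows by vertical overlap (assumed rule)."""
    rows = []
    bottom = None  # largest y1 seen in the row currently being built
    for cell in sorted(cells, key=lambda c: c[2]):
        if bottom is None or cell[2] > bottom:
            rows.append([cell])
            bottom = cell[4]
        else:
            rows[-1].append(cell)
            bottom = max(bottom, cell[4])
    return rows


def target_row(valid_cells):
    """Pick the initial logic row holding the most valid cells; first row wins ties."""
    rows = group_rows(valid_cells)
    if not rows:
        return []
    return max(rows, key=len)


valid_cells = [
    ("injection molding machine", 0, 20, 80, 30),
    ("360,000.00", 200, 21, 260, 29),
    ("360,000.00", 200, 40, 260, 50),
]
row = target_row(valid_cells)
```

In this sample, two valid cells share the row around y = 20 to 30 while only one lies in the next row, so the first row is selected.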
In a second aspect, the present invention provides a document matching apparatus comprising: the identification module is used for identifying the document to be queried so as to extract a plurality of candidate character cells contained in the document to be queried; the matching module is used for matching the query keyword with each candidate character cell so as to determine an effective character cell; and the determining module is used for determining the target logic row where the effective character cell is located according to the position information of each effective character cell.
In a third aspect, the present invention provides an electronic device comprising: a processor, a memory storing machine-readable instructions executable by the processor, which when executed by the processor perform the steps of the method of any of the preceding embodiments, when the electronic device is running.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic block diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a flowchart of a document matching method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a document to be queried in one example provided by embodiments of the present application;
FIG. 4 is an alternative flowchart of step 220 of the document matching method provided by embodiments of the present application;
FIG. 5 is another schematic diagram of a document to be queried in one example provided by embodiments of the present application;
FIG. 6 is yet another schematic diagram of a document to be queried in one example provided by embodiments of the present application;
FIG. 7 is a schematic diagram of the functional modules of a document matching device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Research by the inventors of the present application shows that, in the field of document retrieval, the conventional approach is to recognize the text of an unstructured document with a general character recognition algorithm and then perform keyword matching. This approach generally has the following problems: 1) the recognition result carries no row or column ranges, so subsequent semantic enhancement is difficult; 2) the text recognized by the character recognition algorithm is logically fragmented: for example, a single line within a cell may contain several text segments separated by blanks or spaces, which are often recognized as multiple independent pieces of text. When the text is then matched, logically adjacent text can only be matched by means of word segmentation and aggregation, with a threshold mechanism added to control the matching effect. For example, a 50% threshold means that a keyword is considered successfully matched when its similarity to the text to be matched reaches 50%. However, this approach does not match the keyword against the full text of a cell as a whole, so when the recognition result is severely fragmented (for example, when the content of a cell is split into several lines of text because the column is narrow), the matching result is completely unusable.
The inventors of the present application have also found that, when characters are matched in the above manner, errors in the recognition result of the character recognition algorithm must also be tolerated in order to improve recall (for example, the character string PN10 may be recognized as the character string PNT0). The above threshold therefore has to account for fault tolerance as well and is determined by empirical values, so it has poor interpretability and its effect varies greatly across data distributions.
Based on the above research, the present application provides a document matching method that alleviates two problems, namely that the matching threshold is difficult to tune and that, without logic row and column information, semantic information cannot be added to enhance matching, thereby improving keyword-based document matching.
Illustratively, the document matching method of the present application may be used in robotic process automation (Robotic Process Automation, RPA) to match characters in documents during automated processes. RPA can simulate the operations that staff perform on a computer with a keyboard and mouse in daily work, and can replace humans in operations such as logging in to systems, operating software, reading and writing data, downloading files, and reading mail. Serving as a virtual workforce of an enterprise, automation robots free staff from repetitive, low-value work so that they can devote their energy to high-value-added work, thereby helping the enterprise achieve digital and intelligent transformation, reduce costs, and increase benefits.
RPA uses software robots to replace manual tasks in business processes and interacts with the front-end systems of a computer as a person would. RPA can therefore be regarded as a software program robot running on a personal PC or server that automates work in place of a human by imitating the operations a user performs on a computer, such as retrieving mail, downloading attachments, logging in to systems, and processing and analyzing data, and that does so quickly, accurately, and reliably. Like a traditional physical robot, it addresses the speed and accuracy problems of human work by following specifically configured rules. However, a traditional physical robot combines software and hardware and can only perform work with software running on specific supporting hardware, whereas an RPA robot exists purely at the software level and can be deployed to any PC or server to complete specified work as long as the corresponding software is installed.
That is, RPA is a way of using "digital staff" in place of humans to perform business operations, together with its related technology. Essentially, RPA uses software automation technology to simulate a human operating objects on a computer, such as systems, software, web pages, and documents, without human intervention; it obtains business information, performs business actions, and ultimately automates processes, saves labor costs, and improves processing efficiency. As this description shows, implementing RPA requires finding the target content to be operated on in a document or on a screen so that the content can be operated on automatically. Retrieving keywords in a document based on input keywords is therefore one of the key techniques for realizing RPA.
In the implementation process of the document matching method, an optical character recognition (Optical Character Recognition, abbreviated as OCR) technology can be used. OCR technology refers to the process of translating character shapes on paper into computer text using character recognition methods. Namely, the text data is scanned, and then the image file is analyzed and processed to obtain the text and layout information.
For the convenience of understanding the present embodiment, an electronic device that performs the document matching method disclosed in the embodiment of the present application will be described first.
As shown in fig. 1, a block schematic diagram of an electronic device is provided. The electronic device 100 may include a memory 111, a memory controller 112, a processor 113, a peripheral interface 114, an input output unit 115, and a display unit 116. Those of ordinary skill in the art will appreciate that the configuration shown in fig. 1 is merely illustrative and is not limiting of the configuration of the electronic device 100. For example, electronic device 100 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The above-mentioned memory 111, memory controller 112, processor 113, peripheral interface 114, input/output unit 115 and display unit 116 are electrically connected directly or indirectly to each other to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The processor 113 is used to execute executable modules stored in the memory.
The Memory 111 may be, but is not limited to, a random access Memory (Random Access Memory, RAM), a Read Only Memory (ROM), a programmable Read Only Memory (Programmable Read-Only Memory, PROM), an erasable Read Only Memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable Read Only Memory (Electric Erasable Programmable Read-Only Memory, EEPROM), etc. The memory 111 is configured to store a program, and the processor 113 executes the program after receiving an execution instruction, and a method executed by the electronic device 100 defined by the process disclosed in any embodiment of the present application may be applied to the processor 113 or implemented by the processor 113.
The processor 113 may be an integrated circuit chip having signal processing capabilities. The processor 113 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; or it may be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field programmable gate array (Field Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor.
The peripheral interface 114 couples various input/output devices to the processor 113 and the memory 111. In some embodiments, the peripheral interface 114, the processor 113, and the memory controller 112 may be implemented in a single chip. In other examples, they may be implemented by separate chips.
The input/output unit 115 described above is used by a user to provide input data. The input/output unit 115 may be, but is not limited to, a mouse, a keyboard, and the like.
The display unit 116 described above provides an interactive interface (e.g., a user operation interface) between the electronic device 100 and a user, or is used to display image data for the user's reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. A touch display may be a capacitive touch screen or a resistive touch screen that supports single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations generated simultaneously at one or more positions on the touch display and pass the sensed touch operations to the processor for calculation and processing.
The memory 111 of the electronic device 100 in this embodiment stores therein a computer program of an OCR algorithm for recognizing a document, and the processor 113 can extract character information in the document when the OCR algorithm is called.
When the electronic device 100 in this embodiment is used to execute the steps in the document matching method, the matching result for the document to be queried may be displayed in the display unit 116 of the electronic device 100.
The electronic device 100 in the present embodiment may be used to perform each step in each method provided in the embodiments of the present application. The implementation of the document matching method is described in detail below by several embodiments.
Referring to fig. 2, a flowchart of a document matching method according to an embodiment of the present application is shown. The specific flow shown in fig. 2 will be described in detail.
Step 210, identifying the document to be queried to extract a plurality of candidate character cells contained in the document to be queried.
Alternatively, the document to be queried may be recognized by a character recognition algorithm, for example an OCR algorithm, to identify the candidate character cells it contains. The document to be queried may be an unstructured document containing a table, for example a Portable Document Format (PDF) document, a picture document, and the like. A parsing tool may be used to recognize the document to be queried; the parsing tool may be, for example, a PDF parser or an OCR tool. A candidate character cell refers to the minimum circumscribed rectangular box of each character string in the table document and may be denoted as O.
Illustratively, the candidate character cell may include location information and character content.
The position information may be the upper left corner coordinates and lower right corner coordinates of the smallest circumscribed rectangular box of the character content in each candidate character cell. For example, the position information of any candidate character cell may be represented as [ o.x0, o.y0, o.x1, o.y1], where (o.x0, o.y0) represents the upper left corner coordinates of the minimum bounding rectangular box and (o.x1, o.y1) represents the lower right corner coordinates of the minimum bounding rectangular box. Illustratively, the upper left corner of the entire document to be queried may be the origin. Character content refers to specific characters contained in candidate character cells, which may include numbers, letters, symbols, and the like, for example.
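The cell representation described above maps naturally onto a small record type. The Python sketch below is illustrative only: the class name `CandidateCell`, the field `content`, the `position` property, and the sample coordinates are assumptions, while the `[x0, y0, x1, y1]` layout and the top-left origin follow the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateCell:
    """A candidate character cell: character content plus its minimum bounding box."""
    content: str
    x0: float  # upper left corner, abscissa
    y0: float  # upper left corner, ordinate
    x1: float  # lower right corner, abscissa
    y1: float  # lower right corner, ordinate

    def __post_init__(self):
        # With the origin at the top left of the document, x and y grow right and down.
        if self.x1 < self.x0 or self.y1 < self.y0:
            raise ValueError("lower right corner must not precede upper left corner")

    @property
    def position(self):
        """Position information in the [x0, y0, x1, y1] layout used in the text."""
        return [self.x0, self.y0, self.x1, self.y1]


cell = CandidateCell("360,000.00", 410.0, 96.0, 470.0, 110.0)
```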
Step 220, matching the query keyword with each candidate character cell to determine a valid character cell.
The query terms may be, for example, terms provided according to actual query requirements. The query keywords may be keywords related to the document to be queried.
The query keyword may be denoted as Q. Illustratively, the query keyword Q may include a query tag and query content. For example, the query keyword Q may be in the format "query tag=query content". For example, the query keyword may be "device name=injection molding machine", "model=MA3600/2250G", "price=360000", etc. The query tag of each query keyword Q may be denoted as Q.tag, and the query content of each query keyword Q may be denoted as Q.val.
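A minimal sketch of splitting a query keyword of the form "query tag=query content" into its tag and content. The names `QueryKeyword` and `parse_query_keyword` are hypothetical, and splitting on the first "=" is an assumption of this sketch.

```python
from typing import NamedTuple


class QueryKeyword(NamedTuple):
    tag: str  # query tag, Q.tag
    val: str  # query content, Q.val


def parse_query_keyword(text: str) -> QueryKeyword:
    # Split on the first "=" only, so the content may itself contain "=".
    tag, sep, val = text.partition("=")
    if not sep:
        raise ValueError("expected 'query tag=query content', got: " + repr(text))
    return QueryKeyword(tag.strip(), val.strip())


queries = [parse_query_keyword(s) for s in ("model=MA3600/2250G", "price=360000")]
print(queries)
```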
For any target candidate character cell in the plurality of candidate character cells, the query tag of the query keyword can be compared with the first candidate character cell in the column of the target candidate character cell, and the query content of the query keyword is matched with the target candidate character cell to determine whether the target candidate character cell can be used as a valid character cell. If the character content of the first candidate character cell in the column of the target candidate character cell is the same as the meaning expressed by the query tag of the query keyword, and the character content of the target candidate character cell is the same as the query content of the query keyword, the target candidate character cell can be determined as a valid character cell.
In the example shown in fig. 3, twenty-eight candidate character cells are determined in the document to be queried, arranged in seven columns and four rows. Take the query keyword Q "price=360000" as an example. If the cell currently being matched is the candidate character cell in the third row and second column, whose character content is "sea-sky injection molding machine", its character content differs from the query content Q.val (360000), and the character content of the first candidate character cell in its column, "model", differs in meaning from the query tag Q.tag (price), so this candidate character cell cannot be used as a valid character cell. As another example, if the cell currently being matched is the candidate character cell in the fourth row and fifth column, whose character content is "2,220,000.00", the character content of the first candidate character cell in its column, "device total price (yuan/renminbi)", has the same meaning as the query tag Q.tag (price); however, the value 2,220,000.00 differs from the query content Q.val (360000), so this candidate character cell cannot be used as a valid character cell either. As a further example, if the cell currently being matched is the candidate character cell in the third row and fifth column, whose character content is "360,000.00", the value 360,000.00 is the same as the query content Q.val (360000), and the character content of the first candidate character cell in its column has the same meaning as the query tag Q.tag (price), so this candidate character cell can be used as a valid character cell.
Step 230, determining the target logic row where the valid character cell is located according to the position information of each valid character cell.
Continuing the above example, if the selected valid character cell is the candidate character cell in the third row and fifth column, the determined target logical row may be the content of the third row shown in fig. 3. Presenting the whole logical row expresses the queried content more completely.
For tabulated documents to be queried, as shown in FIG. 4, step 220 described above may include: step 221 to step 224.
Step 221, dividing the plurality of candidate character cells into a plurality of logical columns.
Wherein each logical column includes one or more character cells. For example, each candidate character cell may be labeled with row information and column information, which may be labeled o.row and o.col, respectively.
The row information and column information of each candidate character cell may be determined based on the location information thereof.
For any first candidate character cell and second candidate character cell among all the candidate character cells determined in the document to be queried: if the abscissa of the upper left corner of the bounding rectangle of the second candidate character cell is larger than the abscissa of the lower right corner of the bounding rectangle of the first candidate character cell, the column information of the second candidate character cell is larger than the column information of the first candidate character cell; if the ordinate of the upper left corner of the bounding rectangle of the second candidate character cell is larger than the ordinate of the lower right corner of the bounding rectangle of the first candidate character cell, the row information of the second candidate character cell is larger than the row information of the first candidate character cell.
Taking fig. 3 as an example, with the upper left corner of the whole document to be queried as the origin, the row information and column information of the candidate character cell whose minimum bounding rectangle has the smallest upper left corner coordinates may each be set to the minimum value. For example, for the candidate character cell whose character content is "serial number", the abscissa and ordinate of the upper left corner of its minimum bounding rectangle are both minimum values, so its row information and column information may both be one. For the candidate character cell whose character content is "name", the abscissa of the upper left corner of its minimum bounding rectangle is larger than the abscissa of the lower right corner of the "serial number" cell, so its column information may be larger than that of the "serial number" cell; the ordinate of its upper left corner is not larger than the ordinate of the lower right corner of the "serial number" cell, so its row information may be no larger than the row information of the "serial number" cell. Thus, for the candidate character cell whose character content is "name", the row information may be one and the column information may be two. Similarly, the row information O.row and column information O.col of all candidate character cells can be determined.
Candidate character cells having the same column information may be determined to be the same logical column.
As shown in fig. 5, a column framed by a dashed box may be referred to as a logical column. In the example shown in fig. 5, seven logical columns may be determined.
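The pairwise rule for row and column information can be sketched with a greedy pass over the cells sorted by their upper left coordinate. This is one possible reading, not necessarily the procedure used in the embodiment: it assumes that cells of the same logical column overlap horizontally and that different columns do not, and the dict keys and the helper name `assign_indices` are hypothetical.

```python
def assign_indices(cells, lo_key, hi_key, out_key):
    """Number cells 1, 2, ... along one axis.

    A cell starts a new index when its lower coordinate (lo_key) lies beyond
    the far edge (hi_key) of every cell already placed in the current index.
    """
    boundary = None
    index = 0
    for cell in sorted(cells, key=lambda c: c[lo_key]):
        if boundary is None or cell[lo_key] > boundary:
            index += 1
            boundary = cell[hi_key]
        else:
            boundary = max(boundary, cell[hi_key])
        cell[out_key] = index


# Invented coordinates for a 2 x 2 excerpt of a table.
cells = [
    {"val": "serial number", "x0": 10, "y0": 10, "x1": 60, "y1": 25},
    {"val": "name", "x0": 80, "y0": 10, "x1": 160, "y1": 25},
    {"val": "1", "x0": 12, "y0": 40, "x1": 20, "y1": 55},
    {"val": "injection molding machine", "x0": 82, "y0": 40, "x1": 158, "y1": 55},
]
assign_indices(cells, "x0", "x1", "col")  # column information O.col
assign_indices(cells, "y0", "y1", "row")  # row information O.row

# Cells with the same column information form one logical column.
logical_columns = {}
for c in cells:
    logical_columns.setdefault(c["col"], []).append(c["val"])
print(logical_columns)
```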
Step 222, for each logic column, determining a column semantic tag corresponding to the logic column according to the character cells contained in the logic column.
The column semantic tag of a logic column is used for representing the semantic information of all candidate character cells in that logic column. Optionally, the column semantic tags of table documents (e.g., device name, manufacturer, etc.) may be collected first, and relevant rules (e.g., the header rules and column data rules described below) may be configured for the different column semantic tags. In actual application, for any logic column, rule matching is performed on the character content of the candidate character cells in the logic column to determine the column semantic tag of the logic column.
Rule matching includes matching of header rules and matching of column data rules. Header rule matching is more reliable, so the header rule can be matched first; if that match is unsuccessful, the column data rule can be matched. If the header rule matching is successful, the column data rule need not be matched. The header rule refers to a keyword or an expression to which the header needs to conform. The column data rule refers to an expression to which the column data (except for the header portion) needs to conform. For example, the header rule configured for the column semantic tag "price" is "^(price|total price|purchase original value|amount|evaluation net value)$"; the column data rule is "((\d{1,3},)+\d{3}|\d+)(.\d{1,2})".
For any logic column, if the character content of any candidate character cell in the logic column accords with the header rule correspondingly configured by the designated column semantic tag, determining the column semantic tag of the logic column as the designated column semantic tag.
For example, if the character content of a candidate character cell in a certain logical column is "total price", then, since the header rule configured for the column semantic tag "price" is "^(price|total price|purchase original value|amount|evaluation net value)$", the character content is considered to conform to the header rule configured for the specified column semantic tag "price", and the column semantic tag of the logical column is "price".
If the candidate character cells in some logic columns of the table document cannot be matched with the header rules, the column data rules can be matched. For example, for any logic column, if the character content of the candidate character cells exceeding the preset proportion in the logic column accords with the column data rule correspondingly configured by the designated column semantic tag, determining the column semantic tag of the logic column as the designated column semantic tag.
The preset proportion may be 2/3, 4/5, etc. For example, if the character content of more than the preset proportion of candidate character cells in a certain logic column conforms to the column data rule configured for the designated column semantic tag "price", the column semantic tag of the logic column is "price".
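The header-first, column-data-second matching order can be sketched as follows. The two patterns are a cleaned-up reading of the rules for the tag "price"; the anchors, the escaped decimal point and the optional decimal group are assumptions of this sketch, as are the names `RULES` and `column_semantic_tag`. For brevity the column data rule is tested against every cell, including the header cell.

```python
import re

# Hypothetical rule table: one header rule and one column data rule per tag.
RULES = {
    "price": {
        "header": re.compile(
            r"^(price|total price|purchase original value|amount|evaluation net value)$"
        ),
        "data": re.compile(r"^((\d{1,3},)+\d{3}|\d+)(\.\d{1,2})?$"),
    },
}


def column_semantic_tag(column, preset_proportion=2 / 3):
    """Return the column semantic tag of a logical column, or None."""
    # 1) Header rule: one conforming cell is enough.
    for tag, rule in RULES.items():
        if any(rule["header"].fullmatch(text) for text in column):
            return tag
    # 2) Column data rule: more than the preset proportion must conform.
    for tag, rule in RULES.items():
        hits = sum(1 for text in column if rule["data"].fullmatch(text))
        if column and hits / len(column) > preset_proportion:
            return tag
    return None


print(column_semantic_tag(["total price", "360,000.00", "2,220,000.00"]))
print(column_semantic_tag(["equipment cost", "360,000.00", "2,220,000.00", "1500"]))
```

A column that matches neither rule yields None, which is the case where a trained column semantic tag extraction model could take over.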
Optionally, for any logic column, taking the character content of the candidate character cell in the logic column as the input of the trained column semantic label extraction model to obtain the column semantic label of the logic column output by the column semantic label extraction model.
The column semantic tag extraction model can be obtained in advance by training on a labeled corpus. As shown in fig. 3, the first column may be labeled with the column semantic tag "serial number", the second column with "name", the third column with "model", the fourth column with "quantity", the fifth column with "price", the sixth column with "invoice", and the seventh column with "vendor". The character content of each column is used as input and the labeled column semantic tag as output for machine learning, and the column semantic tag extraction model is obtained through training. The column semantic tag extraction model may be a classification model based on a pre-trained model (e.g., BERT). The loss function of the classification model may be the cross-entropy loss typically used for classification tasks.
In practical application, for a certain logic column, after the character contents of all candidate character cells in the logic column are sequentially spliced, a column semantic label extraction model is input, and the output of the column semantic label extraction model is the column semantic label of the logic column.
In an embodiment, for any logic column, the column semantic tag of the logic column may be determined by the rule matching method, and if the column semantic tag of the logic column cannot be determined by the rule matching method, the column semantic tag of the logic column is determined by the column semantic tag extraction model.
Step 223, filtering the semantic tags in each column according to the query tags in the query keywords to determine a semantic matching group matched with the query tags.
The query label of each query keyword Q may be denoted as q.tag, and the value range thereof may be a column semantic label. The query content of each query keyword Q may be denoted q.val.
A semantic matching group comprises the logic columns whose column semantic tag is the same as the query tag of the query keyword Q. Specifically, the column semantic tag of each logical column OC may be denoted as OC.tag. The logical columns OC whose OC.tag is the same as Q.tag are placed in one group (i.e., a semantic matching group).
For example, if a query includes a plurality of query keywords and the query tags Q.tag of these query keywords are not repeated, a plurality of semantic matching groups are obtained, where each semantic matching group includes one query keyword Q and one or more logic columns OC. A semantic matching group may be denoted as MG.
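Grouping logical columns by query tag can be sketched as below. The dict layout and the function name `build_matching_groups` are assumptions; each returned item plays the role of one semantic matching group MG.

```python
def build_matching_groups(queries, logical_columns):
    """Pair each query keyword Q with the logical columns OC where OC.tag == Q.tag."""
    groups = []
    for q in queries:
        matched = [oc for oc in logical_columns if oc["tag"] == q["tag"]]
        if matched:  # a query whose tag matches no column yields no group
            groups.append({"query": q, "columns": matched})
    return groups


# Invented sample data.
logical_columns = [
    {"tag": "model", "cells": ["model", "MA3600/2250G"]},
    {"tag": "price", "cells": ["total price", "360,000.00"]},
    {"tag": "vendor", "cells": ["vendor", "ACME"]},
]
queries = [
    {"tag": "model", "val": "MA3600/2250G"},
    {"tag": "price", "val": "360000"},
]
groups = build_matching_groups(queries, logical_columns)
print([(g["query"]["tag"], len(g["columns"])) for g in groups])
```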
Step 224, filtering the candidate character cells in the semantic matching group according to the query content in the query keyword, so as to determine the valid character cells matched with the query content.
For example, the query content in the query keyword may be compared to the character content of the candidate character cells in the semantic matching group to determine valid character cells in the semantic matching group that are the same as the query content in the query keyword.
For any group of semantic matching groups, the query content Q.val of the query keyword Q in the semantic matching groups can be compared with the character content O.val of each candidate character cell in the semantic matching groups, so that effective character cells can be screened out.
In an alternative embodiment, step 224 may include step 2241 and step 2242.
Step 2241, for each semantic matching group, calculates a matching value of the query content in the query keyword and the candidate character cell in the semantic matching group.
The query content in the query keyword in the semantic matching group can be calculated with the character content of each candidate character cell in the semantic matching group to determine the matching value of the query content in the query keyword and the candidate character cell in the semantic matching group.
For example, similarity of the query content in the query keyword to the character content of the candidate character cell may be calculated, thereby determining a matching value of the query content in the query keyword to the candidate character cell in the semantic matching group. For example, a vector of query contents in the query keyword, a vector of character contents of the candidate character cells may be extracted, a distance between the two vectors may be calculated, and the distance between the two vectors may be used as a matching value between the query contents in the query keyword and the candidate character cells in the semantic matching group.
For example, the similarity between the query content in the query keyword and the character content of the candidate character cell may be calculated using a similarity calculation algorithm such as Jaccard, Dice, and the like.
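For reference, the Jaccard and Dice coefficients can be computed over the sets of characters of the two strings, as in the sketch below. Comparing character sets (rather than tokens or n-grams) and returning 1.0 for two empty strings are choices made for this sketch only.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |A & B| / |A | B| over character sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dice(a: str, b: str) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|) over character sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


print(jaccard("abc", "abd"), dice("abc", "abd"))
```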
In step 2242, the candidate character cell whose matching value is greater than the first threshold value is taken as the valid character cell.
The first threshold may be a preset value, which may be 0.8, 0.7, 0.85 etc.
Considering that the character content in the candidate character cells identified by the character recognition algorithm may contain errors, the candidate character cells may be subjected to transformation processing to reduce these errors and improve the reliability of the character cells. Based on this, step 2241 described above may include: for each candidate character cell in the semantic matching group, performing transformation processing on the candidate character cell to obtain a transformed character cell; and determining the matching value of the query content in the query keyword and the candidate character cell according to the transformed character cell and the query content in the query keyword.
For example, in performing the transformation processing, the character possibly having an error in the character content in the candidate character cell may be transformed.
In one embodiment, transforming the candidate character cell to obtain a transformed character cell may include: calculating the confidence coefficient of each character in the candidate character cell; and masking the characters with the confidence coefficient smaller than the second threshold value to obtain transformed character cells.
For example, the confidence level of each character in the character content in the candidate character cell may be determined first.
In this embodiment, after the whole image of the document to be queried is identified by the character recognition algorithm in step 210, the position information of each candidate character cell is obtained, and the position information of each character in the candidate character cell can be supplemented on the basis of each candidate character cell so as to enrich the position information of the candidate character cell. The position information of the character may also be represented by coordinates, which may be coordinates of the upper left corner and the lower right corner of the circumscribed rectangle of the character. The circumscribed rectangle of the character may be the smallest circumscribed rectangle of the character.
In this embodiment, a confidence level may be determined for each character in each candidate character cell, and the confidence level may be denoted as o.charconf.
The confidence level of each character may be determined, for example, based on the character content of the candidate character cell in which it is located, and also based on the likelihood that the character appears in its surrounding context. For example, if the character content in the candidate character cell is the price of an item, most of its characters are numerals; if one of the characters is the letter "o", which resembles the numeral 0, the probability that this character is a recognition error is high, and its confidence may be a low value. As another example, if the character content in the candidate character cell consists mostly of Chinese characters but one character is the letter "T", which resembles a Chinese character, the probability that this character is a recognition error is high, and its confidence may be a low value. As a further example, if the character content in the candidate character cell consists of Chinese characters and the characters together form a reasonable phrase, the probability that each character was recognized correctly is high, and the confidence of each character in the character content may be a high value.
The second threshold may be a value set as desired, for example, the second threshold may be 0.95, 0.9, 0.85, etc.
For example, characters whose confidence is less than the second threshold may be determined to be unreliable and replaced with a placeholder that cannot match any character, for example an invisible character, so that the subsequent similarity calculation does not count an unreliable character as a correct one. For each candidate character cell, the positions of the masked unreliable characters may be recorded when the replacement is made; the number of masked unreliable characters may also be recorded.
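The masking step can be sketched as follows. The placeholder `MASK` (a NUL character, assumed never to occur in query content), the function name `mask_low_confidence` and the sample confidences are all hypothetical.

```python
MASK = "\u0000"  # assumed placeholder that cannot match any query character


def mask_low_confidence(text, char_conf, second_threshold=0.9):
    """Replace characters whose confidence is below the second threshold.

    Returns the transformed text and the positions of the masked characters;
    the number of masked characters is len(positions).
    """
    if len(text) != len(char_conf):
        raise ValueError("one confidence value is required per character")
    chars, positions = [], []
    for i, (ch, conf) in enumerate(zip(text, char_conf)):
        if conf < second_threshold:
            chars.append(MASK)
            positions.append(i)
        else:
            chars.append(ch)
    return "".join(chars), positions


# "36o000": the letter "o" was recognized with low confidence.
masked, positions = mask_low_confidence("36o000", [0.99, 0.98, 0.40, 0.99, 0.97, 0.99])
print(repr(masked), positions)
```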
Optionally, a similarity between the transformed character cell and the query content in the query keyword is calculated, wherein the similarity is used as a matching value between the query content in the query keyword and the candidate character cell. For example, the similarity of the character content of the transformed character cell to the query content in the query keyword may be calculated.
For example, the similarity between the character content of the transformed character cell and the query content in the query keyword may be calculated using a similarity calculation algorithm such as Jaccard, Dice, and the like.
Optionally, determining the number of masked characters in the transformed character cell; if the number of the masked characters is larger than a third threshold value, taking the set numerical value as a matching value of query contents in the query keywords and the candidate character cells; and if the number of the masked characters is not greater than a third threshold value, taking the similarity between the transformed character cell and the query content in the query keyword as a matching value between the query content in the query keyword and the candidate character cell.
In this embodiment, a similarity threshold may also be preset, and the similarity threshold may be a value set as required. The similarity threshold may be denoted ST.
If the similarity between the transformed character cell and the query content in the query keyword is calculated to be smaller than the similarity threshold, the similarity can be used as a matching value between the query content in the query keyword and the candidate character cell.
If the similarity between the transformation character cell and the query content in the query keyword is not less than the similarity threshold value, determining the number of masked characters in the transformation character cell; if the number of the masked characters is larger than a third threshold value, taking the set numerical value as a matching value of query contents in the query keywords and the candidate character cells; and if the number of the masked characters is not greater than a third threshold value, taking the similarity between the transformed character cell and the query content in the query keyword as a matching value between the query content in the query keyword and the candidate character cell.
Optionally, the third threshold is determined by: MT(k) = log(len(Q) / k);
wherein MT(k) represents the third threshold; Q represents the query content in the query keyword; len(Q) represents the length of Q; log() represents a logarithmic function; and k represents a set integer.
Alternatively, the third threshold may be a value set as desired, for example, the third threshold may be equal to 2, 3, 4, 5, etc.
Optionally, the above set value is determined by: exp(MT - Ev(O));
wherein MT represents the third threshold; exp() represents an exponential function with the natural constant as its base; O represents any one candidate character cell; and Ev(O) represents the difference between the number of masked characters in the transformed character cell corresponding to the candidate character cell O and the third threshold.
Alternatively, the set value may be a certain value, for example, the set value may be 0.
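The threshold logic above can be sketched as one small function. The natural logarithm is an assumption (the base of log() is not stated), the set value defaults to the fixed value 0, and the function and parameter names are hypothetical.

```python
import math


def third_threshold(query_content: str, k: int) -> float:
    # MT(k) = log(len(Q) / k); the natural logarithm is assumed here.
    return math.log(len(query_content) / k)


def matching_value(similarity, masked_count, mt, set_value=0.0, st=None):
    """Matching value of the query content and one candidate character cell.

    similarity   : similarity between the transformed cell and the query content
    masked_count : number of masked characters in the transformed cell
    mt           : third threshold
    set_value    : value returned when too many characters were masked
    st           : optional similarity threshold ST; below it the similarity
                   is returned without inspecting the mask count
    """
    if st is not None and similarity < st:
        return similarity
    if masked_count > mt:
        return set_value
    return similarity


mt = third_threshold("360000", k=1)  # log(6), roughly 1.79
print(matching_value(0.9, masked_count=3, mt=mt))
print(matching_value(0.9, masked_count=1, mt=mt))
```

The first call returns the set value because three masked characters exceed the threshold; the second returns the similarity unchanged.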
In another embodiment, transforming the candidate character cell to obtain a transformed character cell may include: calculating the confidence of each character in the candidate character cell; and replacing the characters whose confidence is smaller than the second threshold with visually similar characters to obtain transformed character cells.
Optionally, replacing the characters whose confidence is smaller than the second threshold with preset visually similar characters to obtain transformed character cells includes: performing visually similar character replacement on the characters whose confidence is smaller than the second threshold to obtain a plurality of groups of transformed character cells.
A visually similar character model may be constructed in advance, and this model may be denoted as SYNO. The input of the model is a character string and the positions of its unreliable characters, and the output is the top k new character strings obtained by replacing the characters at those positions with visually similar characters.
For example, the candidate character cell O and the positions of its unreliable characters, i.e., the characters whose confidence is less than the second threshold, may be input into the model SYNO to obtain k new transformed character cells.
The determining the matching value of the query content in the query keyword and the candidate character cell according to the query content in the transformed character cell and the query keyword comprises the following steps: calculating the similarity between each group of transformation character unit cells and query contents in the query keywords; and screening out the maximum similarity from the similarity between each group of the transformation character cells and the query content in the query keyword, and determining the maximum similarity as the matching value of the query content in the query keyword and the candidate character cells.
For example, the query content q.val of the query keyword and the obtained k new transformation character cells may be input into a similarity calculation function to obtain k similarity values, and the largest similarity value is taken as the final confidence.
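A toy stand-in for this replacement step is sketched below. The lookup table `CONFUSABLE` and the function `syno` are hypothetical and only imitate the interface of the SYNO model (string plus unreliable positions in, up to k strings out); a real model would rank its candidates, whereas this sketch simply takes the first k combinations.

```python
from itertools import islice, product

# Hypothetical table of visually similar characters.
CONFUSABLE = {"o": ["0"], "O": ["0"], "l": ["1"], "I": ["1"], "S": ["5"]}


def syno(text, positions, k=3):
    """Return up to k strings with the unreliable positions replaced."""
    options = [CONFUSABLE.get(text[p], [text[p]]) for p in positions]
    results = []
    for combo in islice(product(*options), k):
        chars = list(text)
        for p, ch in zip(positions, combo):
            chars[p] = ch
        results.append("".join(chars))
    return results


def best_similarity(query_content, text, positions, sim, k=3):
    # Largest similarity over the k transformed character cells.
    return max(sim(query_content, cand) for cand in syno(text, positions, k))


def exact(a, b):
    return 1.0 if a == b else 0.0


print(syno("36o000", [2]))
print(best_similarity("360000", "36o000", [2], exact))
```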
In this embodiment, step 2241 may include: calculating the confidence of each character in the candidate character cell; masking the characters whose confidence is smaller than the second threshold to obtain a first transformed character cell; performing visually similar character replacement on the characters whose confidence is smaller than the second threshold to obtain a second transformed character cell; determining a first matching value of the query content in the query keyword and the candidate character cell according to the first transformed character cell and the query content in the query keyword; determining a second matching value of the query content in the query keyword and the candidate character cell according to the second transformed character cell and the query content in the query keyword; and selecting the larger of the first matching value and the second matching value as the matching value of the query content in the query keyword and the candidate character cell in the semantic matching group.
The determining manner of the first transformed character cell and the second transformed character cell may refer to the foregoing two character transforming processing manners, which are not described herein.
The first matching value may be determined in the manner described above for calculating the matching value based on the transformed character cell obtained by mask processing. The second matching value may be determined in the manner described above for calculating the matching value based on the transformed character cell obtained by visually similar character replacement, which is not repeated here.
In this embodiment, step 2241 may include: and calculating matching values of the query contents in the query keywords and the candidate character cells in the semantic matching group by using a confidence algorithm.
The confidence algorithm is defined as: confidence(Q, O) = sim(Q, transform(O)); wherein Q represents the query content in the query keyword; O represents any one candidate character cell;
confidence(Q, O) represents the matching value between the query content in the query keyword and any one of the candidate character cells; sim(Q, transform(O)) represents a function for calculating the similarity between Q and transform(O); and transform(O) represents a transformation function for performing transformation processing on O.
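The definition confidence(Q, O) = sim(Q, transform(O)) composes two interchangeable functions, which can be sketched directly. The transform `strip_number_format` below is a purely illustrative normalization and is not one of the transformations described in the embodiments; `exact_match` and `best_confidence` are likewise hypothetical helpers.

```python
import re


def exact_match(a: str, b: str) -> float:
    # Simplest possible sim(): 1.0 for identical strings, otherwise 0.0.
    return 1.0 if a == b else 0.0


def strip_number_format(text: str) -> str:
    # Illustrative transform(): drop thousands separators and zero decimals.
    return re.sub(r"\.0+$", "", text.replace(",", ""))


def confidence(q, o, sim, transform):
    """confidence(Q, O) = sim(Q, transform(O))."""
    return sim(q, transform(o))


def best_confidence(q, o, sim, transforms):
    # With several candidate transformations, keep the larger matching value.
    return max(confidence(q, o, sim, t) for t in transforms)


print(confidence("360000", "360,000.00", exact_match, strip_number_format))
```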
In some embodiments, step 230 may include: determining a plurality of initial logic rows according to the position information of each valid character cell; determining the number of valid character cells contained in each initial logic row; and determining a target logic row according to the number of cells of each initial logic row.
Determining the target logic row according to the number of cells of each initial logic row may include: determining, as the target logic row, the initial logic row with the largest number of cells.
The resulting valid character cells may be grouped according to the row information O.row to obtain logical rows OR, each containing one or more valid character cells O.
Illustratively, the number of valid character cells in a logical row OR may be denoted as ORN. The logical row OR with the largest ORN can be selected as the target logical row, and can also be used as the matching result finally output for the document to be queried.
For example, if several logical rows OR share the same, largest ORN, all of them may be retained as target logical rows. Finally, all retained O are output, and the output content includes the information of the row in which each valid character cell O is located and the information of all O.
The information of the row where the valid character cell O is located may be the maximum circumscribed rectangle of the position information of all character cells in the same row.
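Selecting the target logical rows and computing a row's enclosing rectangle can be sketched as below. The dict layout and function names are assumptions; ties on the largest ORN are all kept, as described above.

```python
from collections import defaultdict


def target_logical_rows(valid_cells):
    """Return the row indices whose count of valid cells (ORN) is the largest."""
    rows = defaultdict(list)
    for cell in valid_cells:
        rows[cell["row"]].append(cell)
    if not rows:
        return []
    best = max(len(group) for group in rows.values())
    return sorted(r for r, group in rows.items() if len(group) == best)


def row_bounding_box(cells_in_row):
    # Rectangle enclosing every cell passed in: [x0, y0, x1, y1].
    return [
        min(c["x0"] for c in cells_in_row),
        min(c["y0"] for c in cells_in_row),
        max(c["x1"] for c in cells_in_row),
        max(c["y1"] for c in cells_in_row),
    ]


# Invented coordinates: two valid cells in row 3, one in row 4.
valid = [
    {"val": "MA3600/2250G", "row": 3, "x0": 200, "y0": 80, "x1": 290, "y1": 95},
    {"val": "360,000.00", "row": 3, "x0": 410, "y0": 80, "x1": 480, "y1": 95},
    {"val": "MA3600/2250G", "row": 4, "x0": 200, "y0": 110, "x1": 290, "y1": 125},
]
rows = target_logical_rows(valid)
print(rows, row_bounding_box([c for c in valid if c["row"] in rows]))
```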
As shown in fig. 6, the column semantic tag of the third column is "model", the column semantic tag of the fifth column is "price", and the query keywords are: model=MA3600/2250G, price=360000. The valid character cells determined as the matching result are marked by the solid-line boxes shown in fig. 6; the information of the row in which the valid character cells O are located is presented in a dashed box.
In the embodiments of the present application, a table recognition tool can be used to identify the character cells in the document to be queried, and keyword-based document matching is performed using row and column information, which improves the integrity of the text in table cells and can solve the problem that text blocks cannot be matched in narrow columns. In addition, column semantic tags can be combined so that column-based matching is more accurate. In the finest-granularity similarity matching method, the finer-grained threshold control is more interpretable, and better results are obtained when querying documents.
Based on the same application conception, the embodiment of the present application further provides a document matching device corresponding to the document matching method, and since the principle of solving the problem of the device in the embodiment of the present application is similar to that of the foregoing embodiment of the document matching method, the implementation of the device in the embodiment of the present application may refer to the description in the embodiment of the foregoing method, and the repetition is omitted.
Fig. 7 is a schematic functional block diagram of a document matching device according to an embodiment of the present application. The modules of the document matching device in this embodiment are used to execute the steps of the method embodiments described above. The document matching device includes an identification module 310, a matching module 320, and a determination module 330, as follows: the identification module 310 is configured to identify a document to be queried, so as to extract a plurality of candidate character cells contained in the document to be queried; the matching module 320 is configured to match the query keyword with each candidate character cell to determine valid character cells; and the determination module 330 is configured to determine, according to the position information of each valid character cell, the target logical row in which the valid character cell is located.
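The three-module structure can be pictured with a short sketch. The class name, the callable signatures, and the dictionary-based cell records are hypothetical; the stub lambdas only show how data flows through the modules and do not perform real recognition or matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Cell = Dict[str, object]  # hypothetical cell record, e.g. {"text": ..., "row": ...}


@dataclass
class DocumentMatchingDevice:
    identify: Callable[[object], List[Cell]]                   # identification module 310
    match: Callable[[Dict[str, str], List[Cell]], List[Cell]]  # matching module 320
    determine: Callable[[List[Cell]], List[int]]               # determination module 330

    def run(self, document: object, query: Dict[str, str]) -> List[int]:
        candidates = self.identify(document)   # extract candidate character cells
        valid = self.match(query, candidates)  # keep the valid character cells
        return self.determine(valid)           # locate the target logical rows


# Stub modules, for illustration only.
device = DocumentMatchingDevice(
    identify=lambda doc: [{"text": t, "row": i} for i, t in enumerate(doc)],
    match=lambda q, cells: [c for c in cells if c["text"] in q.values()],
    determine=lambda valid: sorted({c["row"] for c in valid}),
)
rows = device.run(["MA3600/2250G", "other"], {"model": "MA3600/2250G"})
```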
Furthermore, the embodiment of the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor performs the steps of the document matching method described in the above method embodiment.
The computer program product of the document matching method provided in the embodiments of the present application includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the steps of the document matching method described in the above method embodiments. For details, refer to the above method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. It is noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The foregoing is merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A document matching method, comprising:
identifying a document to be queried to extract a plurality of candidate character cells contained in the document to be queried;
matching the query keyword with each candidate character cell to determine a valid character cell;
and determining, according to the position information of each valid character cell, a target logical row in which the valid character cell is located.
2. The method of claim 1, wherein the matching the query keyword with each candidate character cell to determine a valid character cell comprises:
dividing the plurality of candidate character cells into a plurality of logical columns, wherein each logical column comprises one or more character cells;
for each logical column, determining a column semantic tag corresponding to the logical column according to the character cells contained in the logical column;
screening the column semantic tags according to a query tag in the query keyword to determine a semantic matching group that matches the query tag;
and screening the candidate character cells in the semantic matching group according to query content in the query keyword to determine valid character cells that match the query content.
3. The method of claim 2, wherein each candidate character cell includes character content and position information, the position information including the coordinates of the upper-left and lower-right corners of a circumscribed rectangle of the character content;
the dividing the plurality of candidate character cells into a plurality of logical columns comprises:
for any first candidate character cell and second candidate character cell, if the abscissa of the upper-left corner of the circumscribed rectangle of the second candidate character cell is greater than the abscissa of the lower-right corner of the circumscribed rectangle of the first candidate character cell, determining that the column information of the second candidate character cell is greater than the column information of the first candidate character cell;
determining the column information of all candidate character cells;
and determining candidate character cells having the same column information as belonging to the same logical column, so as to divide the plurality of candidate character cells into the plurality of logical columns.
4. The method of claim 2, wherein the screening the candidate character cells in the semantic matching group according to the query content in the query keyword to determine valid character cells that match the query content comprises:
for each semantic matching group, calculating a matching value between the query content in the query keyword and each candidate character cell in the semantic matching group;
and taking each candidate character cell whose matching value is greater than a first threshold as a valid character cell.
5. The method of claim 4, wherein the calculating the matching value between the query content in the query keyword and the candidate character cells in the semantic matching group comprises:
for each candidate character cell in the semantic matching group, transforming the candidate character cell to obtain a transformed character cell;
and determining the matching value between the query content in the query keyword and the candidate character cell according to the transformed character cell and the query content in the query keyword.
6. The method of claim 5, wherein the transforming the candidate character cell to obtain a transformed character cell comprises:
calculating a confidence for each character in the candidate character cell;
and masking the characters whose confidence is less than a second threshold to obtain the transformed character cell.
7. The method of claim 6, wherein the determining the matching value between the query content in the query keyword and the candidate character cell according to the transformed character cell and the query content in the query keyword comprises:
calculating the similarity between the transformed character cell and the query content in the query keyword, and taking the similarity as the matching value between the query content in the query keyword and the candidate character cell; or,
determining the number of masked characters in the transformed character cell; and if the number of masked characters is greater than a third threshold, taking a set value as the matching value between the query content in the query keyword and the candidate character cell.
8. The method of claim 7, wherein the third threshold is determined by:
MT(k)=log(len(Q)/k);
wherein MT(k) represents the third threshold; Q represents the query content in the query keyword; len(Q) represents the length of Q; log() represents a logarithmic function; and k represents a set integer.
9. The method of claim 7, wherein the set value is determined by:
exp(MT-Ev(O));
wherein MT represents the third threshold; exp() represents the exponential function with the natural constant e as its base; O represents any candidate character cell; and Ev(O) represents the difference between the number of masked characters in the transformed character cell corresponding to the candidate character cell O and the third threshold.
10. The method of claim 5, wherein the transforming the candidate character cell to obtain a transformed character cell comprises:
calculating a confidence for each character in the candidate character cell;
and replacing the characters whose confidence is less than a second threshold with visually similar characters (characters of similar glyph shape) to obtain the transformed character cell.
11. The method of claim 10, wherein the replacing the characters whose confidence is less than the second threshold with visually similar characters to obtain the transformed character cell comprises:
replacing the characters whose confidence is less than the second threshold with visually similar characters to obtain a plurality of groups of transformed character cells;
and the determining the matching value between the query content in the query keyword and the candidate character cell according to the transformed character cell and the query content in the query keyword comprises:
calculating the similarity between each group of transformed character cells and the query content in the query keyword;
and selecting the maximum similarity from among the similarities between the groups of transformed character cells and the query content in the query keyword, and determining the maximum similarity as the matching value between the query content in the query keyword and the candidate character cell.
12. The method of claim 5, wherein the calculating the matching value between the query content in the query keyword and the candidate character cells in the semantic matching group comprises:
calculating a confidence for each character in the candidate character cell;
masking the characters whose confidence is less than a second threshold to obtain a first transformed character cell;
replacing the characters whose confidence is less than the second threshold with visually similar characters to obtain a second transformed character cell;
determining a first matching value between the query content in the query keyword and the candidate character cell according to the first transformed character cell and the query content in the query keyword;
determining a second matching value between the query content in the query keyword and the candidate character cell according to the second transformed character cell and the query content in the query keyword;
and selecting the larger of the first matching value and the second matching value as the matching value between the query content in the query keyword and the candidate character cell in the semantic matching group.
13. The method of claim 4, wherein the calculating the matching value between the query content in the query keyword and the candidate character cells in the semantic matching group comprises:
calculating the matching value between the query content in the query keyword and the candidate character cells in the semantic matching group by using a confidence algorithm;
wherein the confidence algorithm is defined as:
confidence(Q,O)=sim(Q,transform(O));
wherein Q represents the query content in the query keyword; O represents any candidate character cell; confidence(Q, O) represents the matching value between the query content in the query keyword and the candidate character cell O; sim(Q, transform(O)) represents a function for calculating the similarity between Q and transform(O); and transform(O) represents a transformation function that performs transformation processing on O.
14. The method according to any one of claims 1-13, wherein the determining, according to the position information of each valid character cell, a target logical row in which the valid character cell is located comprises:
determining a plurality of initial logical rows according to the position information of each valid character cell;
determining the number of valid character cells contained in each initial logical row;
and determining the target logical row according to the number of valid character cells in each initial logical row.
15. The method of claim 14, wherein the determining the target logical row according to the number of valid character cells in each initial logical row comprises:
determining, as the target logical row, the initial logical row containing the largest number of valid character cells.
16. An electronic device, comprising: a processor and a memory storing machine-readable instructions executable by the processor, wherein, when the electronic device runs, the machine-readable instructions are executed by the processor to perform the steps of the method of any one of claims 1 to 15.
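For orientation, the matching-value computations recited in claims 6 to 13 can be sketched in a few functions. This is a non-authoritative illustration under several assumptions that the claims do not fix: `difflib.SequenceMatcher` stands in for sim(), the logarithm in MT(k) is taken as the natural logarithm, Ev(O) is read literally as the number of masked characters minus MT, and the mask symbol and the look-alike table `LOOKALIKES` are hypothetical. The query is assumed non-empty and k positive.

```python
import math
from difflib import SequenceMatcher
from itertools import product
from typing import Sequence

MASK = "\u25a1"  # hypothetical placeholder for a masked character
# Hypothetical look-alike table; a real system would need a curated one.
LOOKALIKES = {"O": ("0", "Q"), "0": ("O",), "l": ("1",), "1": ("l",)}


def sim(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the claims do not name a measure."""
    return SequenceMatcher(None, a, b).ratio()


def third_threshold(query: str, k: int) -> float:
    """MT(k) = log(len(Q) / k); the natural logarithm is an assumption."""
    return math.log(len(query) / k)


def set_value(masked_count: int, mt: float) -> float:
    """exp(MT - Ev(O)), reading Ev(O) literally as masked_count - MT."""
    return math.exp(mt - (masked_count - mt))


def masked_match(query: str, chars: str, confs: Sequence[float], t2: float, k: int) -> float:
    """Claims 6 to 9: mask low-confidence characters, then score."""
    low = [p < t2 for p in confs]
    mt = third_threshold(query, k)
    if sum(low) > mt:  # too many masked characters: use the set value
        return set_value(sum(low), mt)
    transformed = "".join(MASK if is_low else c for c, is_low in zip(chars, low))
    return sim(query, transformed)


def lookalike_match(query: str, chars: str, confs: Sequence[float], t2: float) -> float:
    """Claims 10 and 11: try every look-alike variant, keep the best score."""
    options = [LOOKALIKES.get(c, (c,)) if p < t2 else (c,) for c, p in zip(chars, confs)]
    return max(sim(query, "".join(variant)) for variant in product(*options))


def match_value(query: str, chars: str, confs: Sequence[float], t2: float, k: int) -> float:
    """Claim 12: the larger of the two matching values."""
    return max(masked_match(query, chars, confs, t2, k),
               lookalike_match(query, chars, confs, t2))


# A recognized "O" with low confidence is recovered as "0" by the look-alike path.
confs = [0.9, 0.9, 0.3, 0.9, 0.9, 0.9]
value = match_value("360000", "36O000", confs, t2=0.5, k=2)
```

The number of look-alike variants grows multiplicatively with the number of low-confidence characters, so a practical version would likely cap the variants it enumerates.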
CN202310182575.9A 2023-02-27 2023-02-27 Document matching method and electronic device Pending CN116090432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310182575.9A CN116090432A (en) 2023-02-27 2023-02-27 Document matching method and electronic device

Publications (1)

Publication Number Publication Date
CN116090432A true CN116090432A (en) 2023-05-09

Family

ID=86202658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310182575.9A Pending CN116090432A (en) 2023-02-27 2023-02-27 Document matching method and electronic device

Country Status (1)

Country Link
CN (1) CN116090432A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination