WO2022112857A1 - Method and apparatus for correcting order information, and device and storage medium - Google Patents
- Publication number
- WO2022112857A1 (application PCT/IB2021/055848)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- order
- corrected
- target
- text
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0633—Lists, e.g. purchase orders, compilation or processing
- G06Q30/0635—Processing of requisition or of purchase orders
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/14—Travel agencies
Definitions
- The present disclosure relates to computer vision technology, and in particular to a method, apparatus, device, and storage medium for correcting order information.
Background
- OCR: Optical Character Recognition
- An embodiment of the present disclosure provides a correction solution for order information.
- According to a first aspect, a method for correcting order information is provided, comprising: obtaining order information to be corrected according to a text recognition result of an order; determining target search information from the text recognition result; acquiring order reference information matching the target search information; and correcting the order information to be corrected by using the order reference information to obtain target order information.
- An apparatus for correcting order information is provided, including: an acquiring unit configured to acquire order information to be corrected according to a text recognition result of an order; a determining unit configured to determine target search information from the text recognition result; a matching unit configured to acquire order reference information matching the target search information; and a correcting unit configured to correct the order information to be corrected by using the order reference information to obtain target order information.
- An electronic device is provided. The device includes a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor implements the method for correcting order information described in the first aspect when executing the computer instructions.
- a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for correcting order information described in the first aspect is implemented.
- A computer program is provided, including computer-readable code; when the computer-readable code is executed in an electronic device, a processor in the electronic device implements the method for correcting order information described in the first aspect.
- The method, apparatus, device, and storage medium for correcting order information obtain order information to be corrected according to the text recognition result of the order, determine target search information from the text recognition result, acquire order reference information matching the target search information, and use the order reference information to correct the order information to be corrected to obtain target order information. In this way, accurate target order information can be obtained quickly from the text recognition result of the order.
- FIG. 1 is a flowchart of a method for correcting order information provided by at least one embodiment of the present disclosure
- FIG. 2 is a schematic structural diagram of a setting database in a method for correcting order information proposed by at least one embodiment of the present disclosure
- Figures 3A, 3B, and 3C are schematic diagrams of an information extraction method proposed by at least one embodiment of the present disclosure.
- FIG. 4 is a schematic diagram of a device for correcting order information proposed by at least one embodiment of the present disclosure
- FIG. 5 is a schematic structural diagram of an electronic device proposed by at least one embodiment of the present disclosure.
Detailed description
- FIG. 1 shows a flowchart of a method for correcting order information according to some embodiments of the present disclosure. As shown in FIG. 1 , the method includes steps 101 to 104.
- In step 101, the order information to be corrected is obtained according to the text recognition result of the order.
- The order subjected to text recognition includes at least one of the following: an order image, or an order in the form of an electronic document, such as a PDF document. Those skilled in the art should understand that the order may also be of other types suitable for text recognition.
- The text boxes contained in the order can be obtained by performing text detection on the order, and the text in each text box can then be recognized by performing text recognition on the obtained text boxes, so as to obtain a text recognition result. Alternatively, text recognition such as OCR can be performed directly on the order to be processed to obtain a text recognition result that contains the text boxes in the order.
- the embodiments of the present disclosure do not limit the specific method for obtaining the text recognition result.
- The order information to be corrected is order information extracted from the text recognition result of the order according to a set rule.
- For example, where the order information to be corrected includes address information, the address information to be corrected can be obtained from the text recognition result according to the rules for address information.
- In step 102, target search information is determined from the text recognition result.
- the target search information is information that is related to the order information to be corrected or can reflect the characteristics of the order information to be corrected.
- The target search information includes at least one of: a subject name corresponding to the order information to be corrected, and at least one content element of the order information to be corrected.
- For example, where the order information to be corrected is address information, the target search information may include the name of the subject to which the address information belongs (e.g., a personal name or a place name) and/or at least one content element included in the address information (e.g., administrative areas at various levels, or the zip code corresponding to an administrative area).
- In step 103, order reference information matching the target search information is obtained.
- order reference information matching the target search information may be acquired from the setting database.
- the setting database stores a plurality of reference subject names and corresponding reference information.
- For example, where the setting database stores a plurality of subject names and the corresponding address information, the matching "XX Hotel" can be looked up in the setting database according to the subject name corresponding to the order information to be corrected, such as "XX Hotel", together with the zip code, and the corresponding address information can be used as the order reference information.
- the order reference information matching the target search information can also be obtained through the Internet.
- a search engine can be used to search the Internet according to the subject name and zip code corresponding to the order information to be corrected, and the retrieved information corresponding to the matching subject name can be used as the order reference information.
- the order reference information matching the target search information can also be obtained from the setting database and the Internet at the same time.
- Where order reference information is obtained from both the setting database and the Internet, either one, or a designated one, can be used as the target order reference information; where order reference information is obtained only from the Internet, the Internet search results can be used to update the setting database.
- In step 104, the order information to be corrected is corrected using the order reference information to obtain target order information.
- The method, apparatus, device, and storage medium for correcting order information proposed by at least one embodiment of the present disclosure obtain order information to be corrected according to the text recognition result of the order, determine target search information from the text recognition result, obtain order reference information matching the target search information, and use the order reference information to correct the order information to be corrected to obtain target order information, so that accurate target order information can be obtained quickly from the text recognition result of the order.
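- The four steps can be sketched end to end as follows. This is only a minimal illustration, not the disclosed implementation: the `Order` fields, the `lookup` callable, the `REFERENCE` table, and the sample address are all hypothetical, and step 104 is reduced to substituting the reference address when one is found.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Order:
    """Fields assumed to have been read from the text recognition result (step 101)."""
    subject_name: str  # e.g. the hotel name
    address: str       # order information to be corrected
    zip_code: str      # a content element of the address


def correct_order(order: Order,
                  lookup: Callable[[str, str], Optional[str]]) -> str:
    # Step 102: target search information (subject name plus one content element).
    target = (order.subject_name, order.zip_code)
    # Step 103: order reference information matching the target search information.
    reference = lookup(*target)
    # Step 104: correct the order information; keep it unchanged if nothing matched.
    return reference if reference is not None else order.address


# Hypothetical in-memory stand-in for the setting database (made-up address).
REFERENCE = {("XX Hotel", "10110"): "1 Example Road, Bangkok 10110"}

fixed = correct_order(
    Order("XX Hotel", "1 Exampl Rd, Bangkok 1O110", "10110"),
    lambda name, zip_code: REFERENCE.get((name, zip_code)),
)
```

- In this sketch the misrecognized address is replaced by the stored one, and an order with no matching reference is returned unchanged.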
- The address database in the related art usually supports only queries from a subject name to an address, and tolerates errors only at the beginning and end of the input word. In contrast, the method for correcting order information proposed by the embodiments of the present disclosure obtains matching order reference information according to target search information determined from the text recognition result, and the target search information may be the subject name corresponding to the order information to be corrected and/or at least one content element in the order information to be corrected. Therefore, even if the order information contains erroneous information, even a wrong subject name, other information in the order can be used as the target search information, and the order information to be corrected can still be corrected with the order reference information obtained. The method therefore has high fault tolerance.
- the method for correcting order information proposed by at least one embodiment of the present disclosure is applicable to orders of different layouts.
- the subject name corresponding to the order information to be corrected may be obtained as the target search information.
- the subject name and the order information to be corrected are, for example, key-value pair information, wherein the subject name indicates an attribute, and the order information to be corrected indicates a value of the attribute.
- the order information to be corrected may be address information, the subject corresponding to the address information is the object to which the address information belongs, and the corresponding subject name is the name of the object to which it belongs.
- When the object to which the address information belongs is an individual, the corresponding subject name is a personal name; when the object to which the address information belongs is a place, the corresponding subject name is the place name.
- the order information to be corrected may also be identity information, and the subject name corresponding to the identity information is a name. Those skilled in the art should understand that the order information to be corrected may also be other types of information, which is not limited in the present disclosure.
- the setting database may include reference unit information of a plurality of levels, and the reference unit information of each lowest level of the plurality of levels corresponds to a plurality of reference subject names.
- The reference unit information is organized and stored by level from top to bottom; the lower the level of the reference unit information, the smaller its scope or the lower its authority.
- The lowest level is the level whose reference unit information has the smallest scope or the lowest authority.
- For example, where the reference unit information of multiple levels included in the setting database includes reference administrative area information and/or postal code information, the reference unit information of the lowest level includes the name of the administrative area with the smallest scope and/or the zip code of that area.
- The reference unit information in the setting database may be stored in a tree structure: the non-leaf nodes of different levels store the reference unit information of different levels, and the leaf nodes store the reference subject names belonging to their parent nodes.
- the setting database further stores first reference information corresponding to each reference subject name.
- The first reference information is usually the complete information corresponding to the reference subject name, including the reference unit information at each level and the specific reference information corresponding to the reference subject name. Taking address information as an example, the first reference information may be complete address information, including administrative area information at each level and the specific address corresponding to the reference subject name, such as the street and/or unit.
- The first reference information is obtained in advance and is reference information of high reliability and accuracy corresponding to the reference subject name.
- the reference unit information of multiple levels in the setting database may be administrative regions of multiple levels.
- The tree structure for storing address information can be a rooted tree whose root node has no actual meaning. The child nodes of the root can store the travel vendor of the order (for example, XX travel agency), and the remaining non-leaf nodes can store the administrative regions or postal codes of a country. Each leaf node can store an object name, and can also store the complete address information corresponding to that object name.
- All non-leaf nodes are unique, and the parent node of a non-leaf node represents the administrative region one level above it.
- FIG. 2 is a schematic structural diagram of a setting database in a method for correcting order information proposed by at least one embodiment of the present disclosure.
- The subtree of a travel vendor can be constructed from top to bottom (from shallow to deep) according to the hierarchy country-province-city-district. In some cases, the level below the district may also include subdistricts, and any administrative region level can be replaced by a postal code; for example, the subtree may be constructed as country-province-postal code-district.
- the zip code can be substituted for any administrative region, which is not limited in the present disclosure.
- The reference administrative area information of each level stored in the tree structure can be obtained from the administrative division table of each country and the correspondence table between zip codes and administrative areas published on the Internet; the reference subject names stored in the leaf nodes and the corresponding first reference information can be obtained by manual annotation.
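- The tree-structured setting database described above might be modeled as follows. This is a sketch under stated assumptions: the `Node` class, the helper names `add_subject` and `subjects_under`, and the sample entries are hypothetical and only illustrate non-leaf nodes holding reference unit information and leaf nodes holding reference subject names with their complete addresses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    full_address: Optional[str] = None  # first reference information, leaf nodes only

    def child(self, name: str) -> "Node":
        # Non-leaf nodes are unique under their parent, so an existing child is reused.
        return self.children.setdefault(name, Node(name))


def add_subject(root: Node, levels: List[str], subject: str, address: str) -> None:
    """Walk (or create) the level path, then store the subject as a leaf."""
    node = root
    for level in levels:
        node = node.child(level)
    node.child(subject).full_address = address


def subjects_under(root: Node, levels: List[str]) -> Dict[str, str]:
    """Return the reference subject names stored under a lowest-level unit."""
    node = root
    for level in levels:
        if level not in node.children:
            return {}
        node = node.children[level]
    return {c.name: c.full_address for c in node.children.values()
            if c.full_address is not None}


root = Node("root")  # the root carries no meaning of its own
path = ["XX travel agency", "Thailand", "Bangkok", "10110"]  # made-up sample path
add_subject(root, path, "XX Hotel", "1 Example Road, Bangkok 10110")
add_subject(root, path, "YY Hotel", "2 Example Road, Bangkok 10110")
```

- Here the path uses a postal code in place of the district level, which the text allows; `subjects_under(root, path)` then returns both hotels with their stored addresses.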
- the order reference information corresponding to the order information to be corrected may be obtained in the following manner.
- the unit information of the lowest level in the order information to be corrected may be acquired as target search information according to the level division in the setting database.
- Taking the case where the order information to be corrected is the address information of a hotel order as an example, the unit information at each level contained in the order information to be corrected can be obtained according to the hierarchical division of addresses in the setting database, that is, the tree structure of the database.
- For example, the order information to be corrected is split according to the tree structure "country-province-city-district" in the database, which yields the administrative area information of each level included in the address information. The administrative area information of the lowest level can then be used as the target search information.
- For example, if the lowest-level administrative area included in the address information is a subdistrict, the subdistrict information may be used as the target search information; other situations are similar and will not be repeated here.
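- One simple way to picture this splitting is a lookup of known area names per level, followed by picking the deepest level that was found. The sketch below assumes a hypothetical `gazetteer` mapping each level to known names and uses plain substring matching; the disclosure does not specify how the split is performed.

```python
from typing import Dict, List, Optional, Tuple

LEVELS = ["country", "province", "city", "district"]  # top to bottom


def split_by_levels(address: str, gazetteer: Dict[str, List[str]]) -> Dict[str, str]:
    """Find, for each level, the first known area name that occurs in the address."""
    found = {}
    lowered = address.lower()
    for level in LEVELS:
        for name in gazetteer.get(level, []):
            if name.lower() in lowered:
                found[level] = name
                break
    return found


def lowest_level_unit(found: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (level, name) of the lowest level present, or None if nothing matched."""
    for level in reversed(LEVELS):
        if level in found:
            return level, found[level]
    return None


# Hypothetical gazetteer and address used only for illustration.
gazetteer = {
    "country": ["Thailand"],
    "province": ["Bangkok"],
    "district": ["Khlong Toei"],
}
units = split_by_levels("1 Example Road, Khlong Toei, Bangkok, Thailand", gazetteer)
target = lowest_level_unit(units)
```

- In this example the district is the lowest level found, so it would serve as the target search information.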
- the reference unit information of each lowest level in the database corresponds to multiple reference subject names, so among the multiple reference subject names, the target subject name can be determined according to preset conditions.
- The subject name corresponding to the order information to be corrected may be matched against each of the plurality of reference subject names corresponding to the target unit information, and the reference subject name with the highest matching score that also exceeds the first set threshold is identified as the target subject name.
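- The selection rule (highest matching score, and above the first set threshold) can be sketched as follows. The disclosure does not name a matching score; this sketch assumes the similarity ratio from Python's `difflib.SequenceMatcher`, and the function name and the threshold value 0.8 are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def pick_target_subject(name: str,
                        reference_names: Iterable[str],
                        threshold: float = 0.8) -> Optional[str]:
    """Return the best-scoring reference subject name if its score exceeds the threshold."""
    best_name, best_score = None, 0.0
    for ref in reference_names:
        # Assumed score: character-level similarity in [0, 1], case-insensitive.
        score = SequenceMatcher(None, name.lower(), ref.lower()).ratio()
        if score > best_score:
            best_name, best_score = ref, score
    return best_name if best_score > threshold else None


match = pick_target_subject("XX Hotei", ["XX Hotel", "YY Resort"])
no_match = pick_target_subject("ZZ Inn", ["XX Hotel", "YY Resort"])
```

- A recognition error in one character still resolves to the intended reference name, while an unrelated name falls below the threshold and yields no target subject name.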
- Since the first reference information corresponding to each reference subject name is stored in the setting database, the order reference information for the order information to be corrected can be obtained from the first reference information corresponding to the determined target subject name.
- The reference information stored in the setting database has high reliability and accuracy, so more accurate target order information can be obtained by using it to correct the order information to be corrected.
- the setting database stores second reference information corresponding to each reference subject name.
- the second reference information is other reference information other than the reference unit information of each level, and is generally more specific information than the reference unit information of each level.
- the second reference information may be, for example, the street and/or unit where the hotel is located.
- The second reference information is obtained in advance and is reference information of high reliability and accuracy corresponding to the reference subject name.
- The target subject name is determined in a similar way to that described above. The difference is that, after the target subject name is determined, the order reference information corresponding to the order information to be corrected is obtained according to the reference unit information of each level corresponding to the target subject name and the second reference information corresponding to the target subject name.
- In this way, the complete information of the target subject name can be obtained, and correcting the order information to be corrected based on this complete information yields more accurate and complete target order information.
- order reference information corresponding to the order information to be corrected may also be obtained from the Internet according to the target search information.
- The Internet may be searched according to, for example, the subject name or at least one content element of the order information to be corrected, to obtain one or more pieces of candidate order reference information. Each piece of candidate order reference information is matched against the order information to be corrected, and the candidate with the highest matching score that also exceeds the second set threshold is taken as the order reference information.
- The target search information may include the zip code contained in the address information and/or one item of the administrative area information.
- a plurality of candidate address information that may be hotel addresses can be obtained from the Internet.
- The candidate address information with the highest matching score that also exceeds the second set threshold can be used as the order reference information for the order information to be corrected, and the correction then yields more accurate hotel address information.
- any one of the candidate address information can be retained and the other candidate address information can be deleted.
- The way administrative regions at all levels and their corresponding zip codes are organized and stored in the address database can be set according to the regulations of the target country, so the correction method can easily be extended to correcting itinerary information for any destination country.
- Retrieval may be performed first in the setting database according to the target search information, and then on the Internet.
- The reference information corresponding to the order information to be corrected that is obtained from the Internet, together with the subject name corresponding to the order information to be corrected, can be added to the information corresponding to the lowest-level reference unit information in the setting database; that is, the subject name is added to the reference subject names corresponding to that lowest-level reference unit information.
- the name of the subject corresponding to the order information to be corrected and the order reference information are stored in the leaf nodes of the tree structure, which become the name of the newly added reference subject and the corresponding reference information.
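- The database-first, Internet-second retrieval with write-back can be sketched as below. This is an illustrative assumption only: the setting database is reduced to a dictionary keyed by (lowest-level unit, subject name), and `search_internet` is a hypothetical callable standing in for a real search step.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]  # (lowest-level reference unit, subject name)


def find_reference(target: Key,
                   database: Dict[Key, str],
                   search_internet: Callable[[Key], Optional[str]]) -> Optional[str]:
    # Look in the setting database first.
    hit = database.get(target)
    if hit is not None:
        return hit
    # Fall back to the Internet; a result becomes a new reference entry.
    hit = search_internet(target)
    if hit is not None:
        database[target] = hit
    return hit


calls = []


def fake_search(key: Key) -> Optional[str]:
    """Stand-in for an Internet search; records each call and returns a made-up address."""
    calls.append(key)
    return "1 Example Road, Bangkok 10110"


db: Dict[Key, str] = {}
first = find_reference(("10110", "XX Hotel"), db, fake_search)
second = find_reference(("10110", "XX Hotel"), db, fake_search)
```

- After the first lookup the entry is stored, so the second lookup is answered from the database without another search.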
- The reference information corresponding to the order information to be corrected that is obtained from the Internet, together with the subject name corresponding to the order information, can also be used to update the information corresponding to the lowest-level reference unit information in the setting database. That is, the reference information of the target subject name corresponding to the lowest-level reference unit information in the setting database is replaced with the reference information corresponding to the order information to be corrected that was obtained from the Internet.
- In the tree structure, the reference information originally stored in the leaf node for the reference subject name is replaced with the reference information corresponding to the order information to be corrected, thereby updating the reference information of that reference subject name.
- The latest update time of the reference information corresponding to the order information to be corrected that is obtained from the Internet may be obtained, and whether to update the reference information of the reference subject name is determined based on that update time. For example, if the latest update time is within a set time range, such as within the last year or the last 6 months, the update can be performed; conversely, if the latest update time falls outside the set time range, a prompt message can be output and a technician determines whether to update, so as to avoid an incorrect update.
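- The recency check can be expressed in a few lines. In this sketch the function name, the returned labels, and the default window of 365 days are assumptions chosen to mirror the "within the last year" example.

```python
from datetime import datetime, timedelta


def update_decision(last_updated: datetime,
                    now: datetime,
                    max_age: timedelta = timedelta(days=365)) -> str:
    """Update automatically when the Internet result is recent; otherwise ask a technician."""
    if now - last_updated <= max_age:
        return "update"
    return "prompt_technician"


now = datetime(2021, 6, 30)
recent = update_decision(datetime(2021, 1, 15), now)
stale = update_decision(datetime(2019, 1, 15), now)
```

- Passing `max_age=timedelta(days=183)` would approximate the six-month variant mentioned in the text.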
- N-gram models are usually used to correct text recognition results.
- However, training an N-gram model relies on a thesaurus, and the thesaurus of address information, especially of overseas terms, is usually incomplete, so the N-gram model does not correct the text recognition results of hotel orders well.
- With the method of the present disclosure, the hotel address information in the text recognition result of a hotel itinerary can be corrected, for example by correcting wrong information in the hotel address or completing an incomplete hotel address. This improves the accuracy and reliability of automatic visa information filling, improves user experience, and helps speed up the approval process.
- Because the correction method of the present disclosure can use reference information obtained from the Internet for correction, or update the setting database according to reference information obtained from the Internet, the problem of an incomplete thesaurus can be solved and a better correction effect can be obtained.
- the order information to be corrected includes at least address information and hotel information.
- the order information to be corrected can be obtained from the text recognition result of the order to be processed by the following method.
- The key information may include at least one of: at least one content element of the order information to be corrected, and a keyword indicating the order information to be corrected.
- For example, where the order information to be corrected is address information, the key information may include the content element "zip code" of the address information, and if the region to which the address information belongs is known, the number of digits of the zip code can be determined. Taking an address in Thailand as an example of the order information to be corrected, since the postal code of Thailand has 5 digits, the key information can be determined to be a 5-digit number. In this step, a text box containing a 5-digit number is determined as the first text box.
- the identified content may include more than 5 digits, for example, the text box includes 8 digits, etc.
- A search may also be performed in the zip code list of the region concerned, to confirm that the zip code found is indeed a zip code of that region.
- The numbers of digits of postal codes around the world can also be taken together, and the key information determined to be a number of 4 to 9 digits. In this step, text boxes containing a 4- to 9-digit number are then determined as first text boxes. In a possible implementation, in order to reduce additional discrimination operations, only a text box containing a number of exactly 4 to 9 digits may be determined as the first text box; that is, a text box containing a number of 10 or more digits is not considered.
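- Locating first text boxes by zip code length can be sketched with a regular expression. The function name and sample texts below are hypothetical; the pattern accepts a run of digits whose length lies in the given range and rejects longer runs, which corresponds to ignoring numbers of 10 or more digits.

```python
import re
from typing import List, Tuple


def first_text_boxes(texts: List[str], digits: Tuple[int, int] = (4, 9)) -> List[int]:
    """Return indices of text boxes containing a digit run of the allowed length."""
    lo, hi = digits
    # Lookarounds ensure the run is not part of a longer number.
    pattern = re.compile(rf"(?<!\d)\d{{{lo},{hi}}}(?!\d)")
    return [i for i, text in enumerate(texts) if pattern.search(text)]


texts = ["Booking no. 1234567890", "Bangkok 10110", "Tel 02", "XX Hotel"]
generic = first_text_boxes(texts)            # any 4- to 9-digit number
thailand = first_text_boxes(texts, (5, 5))   # region known: 5-digit zip code
```

- Only the box holding the five-digit number is selected in both calls; the ten-digit booking number is skipped.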
- The key information may also include another content element of the address information, namely administrative area information such as "Thailand" (in any language); in that case, among the multiple text boxes, a text box whose text content includes "Thailand" can be determined as the first text box.
- the key information further includes keywords indicating the order information to be corrected.
- For example, where the order information to be corrected is address information, the keywords include "address" and the equivalent expressions for an address in other languages. It should be noted that the form of the keyword is not limited in this application; for example, it may include various expressions such as the full name and abbreviations.
- the text box to be merged is determined based on the first text box.
- the text boxes to be combined may be determined according to the positional relationship with the first text box, and the text boxes to be combined may be combined to obtain a combined text box.
- The order information to be corrected is extracted from the merged text box.
- A first text box containing key information is determined among the plurality of text boxes included in the text recognition result of the order to be processed, at least some of the text boxes are merged according to the first text box to obtain a merged text box, and the order information to be corrected is obtained from the merged text box. In this way, efficient information processing of the order to be processed can be achieved based on the key information of the order information to be corrected.
- the text boxes may be combined in the following manner to obtain a combined text box.
- The positional relationship includes the position of another text box (that is, any text box other than the first text box, or a specified text box) relative to the first text box.
- It also includes the distance to the first text box, for example, the pixel distance from the first text box in the vertical direction and the pixel distance in the horizontal direction.
- the distance between the text boxes is determined according to the distance between the center points of the two text boxes.
- A text box whose positional relationship with the first text box falls within a set range is determined as a second text box. For example, a text box above the first text box may be determined as a second text box, or a text box whose pixel distance from the first text box in the vertical direction is within a set threshold may be determined as a second text box, and so on.
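- The selection of second text boxes by positional relationship might look like the sketch below. The `Box` type, the function name, and the 60-pixel limit are hypothetical; distances are measured between center points, and image coordinates are assumed to grow downward.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    text: str
    x: float  # left edge, pixels
    y: float  # top edge, pixels
    w: float
    h: float

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def second_text_boxes(first: Box, others: List[Box],
                      max_dy: float = 60.0, above_only: bool = True) -> List[Box]:
    """Pick boxes whose vertical center distance to the first box is within max_dy."""
    _, first_cy = first.center
    picked = []
    for box in others:
        _, cy = box.center
        dy = first_cy - cy  # positive when the box lies above the first text box
        if above_only and dy <= 0:
            continue
        if abs(dy) <= max_dy:
            picked.append(box)
    return picked


first = Box("10110", 100, 200, 80, 20)
others = [
    Box("1 Example Road", 100, 170, 200, 20),  # just above the first text box
    Box("Total", 100, 400, 50, 20),            # below it
    Box("Itinerary", 100, 10, 50, 20),         # above, but too far away
]
selected = second_text_boxes(first, others)
```

- Only the line directly above the zip code box is selected; setting `above_only=False` would also admit nearby boxes below it.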
- the merging of the to-be-merged text boxes may be performed on a line basis. That is, according to the row to which each text box in the to-be-combined text box belongs, the to-be-combined text boxes are combined to obtain the combined text box.
- FIG. 3A shows an exemplary merge result. As shown in FIG. 3A, it includes multiple rows of merged text boxes, including merged text boxes 301 to 303, where the merged text box in each row is obtained by merging one or more text boxes included in that row.
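The row-based merging described above can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the `TextBox` type, the `merge_rows` function, and the `row_tolerance` rule for deciding that two boxes belong to the same row are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    x0: float  # left edge, in pixels
    y0: float  # top edge, in pixels
    x1: float  # right edge, in pixels
    y1: float  # bottom edge, in pixels
    text: str

    @property
    def center_y(self) -> float:
        return (self.y0 + self.y1) / 2.0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def merge_rows(boxes: List[TextBox], row_tolerance: float = 0.5) -> List[TextBox]:
    """Merge the boxes of each row into one box per row, ordered top to bottom.

    Assumption: a box belongs to the current row when its vertical center is
    within row_tolerance * height of the first box placed in that row.
    """
    rows: List[List[TextBox]] = []
    for box in sorted(boxes, key=lambda b: b.center_y):
        if rows and abs(box.center_y - rows[-1][0].center_y) < row_tolerance * rows[-1][0].height:
            rows[-1].append(box)
        else:
            rows.append([box])

    merged: List[TextBox] = []
    for row in rows:
        row.sort(key=lambda b: b.x0)  # left-to-right reading order inside a row
        merged.append(TextBox(
            x0=min(b.x0 for b in row),
            y0=min(b.y0 for b in row),
            x1=max(b.x1 for b in row),
            y1=max(b.y1 for b in row),
            text=" ".join(b.text for b in row),
        ))
    return merged
```

Each returned box spans all the boxes of its row, so later steps can work with one box per line of text.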
- the first threshold may be specifically determined according to the format feature of the order information to be corrected.
- the order information to be corrected may be acquired from the merged text box according to the format feature of the order to be processed.
- the format feature of the order to be processed includes the distance feature between each line of text, the font feature of each line of text, the positional relationship feature between texts, and so on.
- the target direction for obtaining the order information to be corrected can be determined, and the order information to be corrected is obtained according to the target direction.
- the order information to be corrected is address information and the key information is a zip code
- the zip code is usually located at the end of the address information
- the target direction for extracting the order information to be corrected can be determined, and the extraction is performed according to the target direction.
- the order information to be corrected is address information
- the key information is a keyword indicating the address information
- when the key information is the keyword "address", since this keyword is usually located at the very beginning of the address information, it can be determined that the order information to be corrected is located below the first text box; the target direction for extracting the order information to be corrected can thus be determined, and the extraction is performed according to the target direction.
- the target direction includes a first target direction and a second target direction
- the first target direction indicates the direction in which the merged text boxes are traversed in the process of locating the area where the order information to be corrected is located, and the second target direction indicates the direction in which the order information to be corrected is read from that area.
- the key information may include a keyword indicating the order information to be corrected, at least one content element of the order information to be corrected, a subject name of the order information to be corrected, and the like. Taking the order information to be corrected as address information as an example, the keywords indicating the address information include "address",
- the key information is "10110" (zip code). Starting from the first text box containing "10110", that is, from the merged text box 301, the merged text boxes are traversed upward until the merged text box 302 containing the key information "Address" is found. Then, taking the key information "Address" as the starting position, the merged text boxes are traversed downward until the merged text box 301 containing the key information "10110" is found, and the content traversed downward is obtained as the order information to be corrected.
- the uppercase and lowercase forms of some or all letters in a word are not limited, and can be adjusted according to the actual situation. That is to say, in actual identification and other processing, ADDRESS, Address, address, etc. can be handled in the same way, that is, they are all identified as "address".
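The two-pass traversal in the "10110" example can be sketched as below. This is only an illustration under simplifying assumptions: merged text boxes are represented by their text alone, ordered top to bottom, and the function name `extract_between` and its parameters are hypothetical.

```python
from typing import List, Optional


def extract_between(rows: List[str], end_key: str, start_keyword: str) -> Optional[List[str]]:
    """Return the rows from the keyword row down to the row holding end_key.

    rows are the contents of the merged text boxes, ordered top to bottom.
    """
    # Locate the box containing the key information (e.g. the zip code).
    end_idx = next((i for i, row in enumerate(rows) if end_key in row), None)
    if end_idx is None:
        return None

    # First target direction: traverse upward until the keyword is found.
    # The comparison ignores case, so ADDRESS, Address and address all match.
    start_idx = None
    for i in range(end_idx, -1, -1):
        if start_keyword.lower() in rows[i].lower():
            start_idx = i
            break
    if start_idx is None:
        return None

    # Second target direction: read downward from the keyword to the key information.
    return rows[start_idx:end_idx + 1]
```

For example, with rows `["XXXXX Hotel", "ADDRESS: 12 Example Road", "Bangkok 10110", "Tel: 000"]`, calling `extract_between(rows, "10110", "address")` returns the second and third rows.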
- the method further includes: obtaining a distance between adjacent merged text boxes.
- the adjacent merged text boxes include two merged text boxes that are adjacent in the vertical direction.
- the multiple merged text boxes obtained from the text recognition result include multiple pairs of adjacent merged text boxes.
- the merged text boxes 311-314 include adjacent merged text boxes 311-312, adjacent merged text boxes 312-313, and adjacent merged text boxes
- the traversing includes acquiring the text content in the merged text box, and also includes acquiring the distance between the merged text box and its adjacent merged text box, where the adjacent merged text boxes are those encountered while traversing the merged text boxes. Next, among the adjacent merged text boxes whose distance satisfies the first set condition, the merged text box traversed first is taken as the starting position, and the merged text boxes are traversed in the second target direction until the merged text box containing the key information is found.
- that the distance of adjacent merged text boxes satisfies the first set condition includes: the distance of the adjacent merged text boxes is greater than the first inter-frame distance threshold.
- the key information is "10400" (zip code)
- the first text box containing the zip code "10400" is taken as the starting position, that is, starting from the merged text box 311, the merged text boxes are traversed upward.
- traversing to the merged text box 312 includes acquiring the content in the merged text box 312 and obtaining the distance between the merged text box 312 and the merged text box 311.
- the distance between two text boxes may be the pixel distance between their center points in the vertical direction, or the pixel distance between corresponding positions of the two text boxes. For example, when the two text boxes are left-aligned, their upper-left or lower-left corner points can be used as the two vertices for determining the distance, and the pixel distance between the two vertices is taken as the distance between the two text boxes.
- other methods similar to the above-mentioned contents can also be used to determine the distance between the two text boxes.
- the specific implementation process is not limited in this application, and may include but not be limited to the above exemplified situations.
- the distance between the merged text box 312 and the merged text box 311 does not satisfy the first set condition, that is, the distance between the merged text box 312 and the merged text box 311 is less than or equal to the first inter-frame distance threshold.
- the distance between the merged text box 314 and the merged text box 313 satisfies the first set condition, that is, the distance between the merged text box 314 and the merged text box 313 is greater than the first inter-box distance threshold, so the upward traversal stops.
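The gap-based stopping rule in the "10400" example can be sketched as follows. The sketch assumes that each merged text box is reduced to its vertical center and its text, that the distance is the vertical center-to-center distance, and that a single fixed threshold is passed in by the caller; the names `Row` and `extract_upward` are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[float, str]  # (vertical center in pixels, text); hypothetical layout


def extract_upward(rows: List[Row], key: str, threshold: float) -> List[str]:
    """Collect the block of rows that ends at the row containing key.

    rows are ordered top to bottom. Starting from the row that contains the
    key information, the traversal moves upward and stops at the first pair
    of adjacent rows whose distance exceeds threshold (the first set
    condition). The block is then read downward, from the row where the
    traversal stopped to the row containing the key information.
    """
    start = next((i for i, (_, text) in enumerate(rows) if key in text), None)
    if start is None:
        return []

    top = start
    while top > 0 and rows[top][0] - rows[top - 1][0] <= threshold:
        top -= 1  # distance does not satisfy the first set condition: keep going up

    return [text for _, text in rows[top:start + 1]]
```

With rows at vertical centers 10, 100, 130 and 160 and a threshold of 40, the traversal from the bottom row stops below the row at 10, because the 90-pixel gap exceeds the threshold, and the three lower rows are returned in reading order.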
- the relationship between the directions indicated by the first target direction and the second target direction is not limited; that is, the two directions may form any angle. For example, the first target direction and the second target direction may be opposite (i.e., 180°) or the same (i.e., 0°).
- the first target direction may indicate a downward traversal of the merged text boxes, in which the merged text boxes are traversed downward until the last key information is found, or until adjacent merged text boxes whose distance satisfies the first set condition are found.
- when the first target direction and the second target direction are the same, the traversal is performed again over the above-mentioned traversed area according to the second target direction, and the traversed content is obtained as the order information to be corrected.
- when adjacent merged text boxes are taken as the target adjacent merged text boxes, the first inter-frame distance threshold corresponding to the target adjacent merged text boxes is determined according to at least one of the following: the height of the merged text box traversed first in the target adjacent merged text boxes; the distances between the merged text boxes contained in the already traversed adjacent merged text boxes together with the height of the merged text box traversed first.
- the target adjacent merged text boxes are two adjacent merged text boxes for which the first inter-frame distance threshold is to be determined.
- the first inter-frame distance thresholds corresponding to each pair of adjacent merged text boxes may be different.
- the first inter-frame distance threshold is determined according to the height of the merged text box traversed first in the target adjacent merged text boxes.
- the merged text boxes are traversed from bottom to top
- the adjacent merged text boxes 311 and 312 are the first adjacent merged text boxes traversed in this example, and the first inter-frame distance threshold corresponding to the two can be determined according to the height of the merged text box 311.
- the first inter-box distance threshold is set to 0.65*mean_heightl (the height of the merged text box 311).
- the first inter-frame distance threshold may be determined based on the distances between the merged text boxes included in the traversed adjacent merged text boxes and the height of the first traversed merged text box.
- the first traversed merged text box is the first traversed merged text box in the process of locating the region where the order information to be corrected is located.
- the first inter-frame distance threshold corresponding to the target adjacent merged text boxes may be determined by: obtaining the updated inter-frame distance of the target adjacent merged text boxes, where the updated inter-frame distance is a weighted sum of the distance between the merged text boxes included in the reference adjacent merged text boxes and the updated inter-frame distance of the reference adjacent merged text boxes, and the reference adjacent merged text boxes are the adjacent merged text boxes closest to the target adjacent merged text boxes; obtaining the updated disturbance value of the target adjacent merged text boxes, where the updated disturbance value is a weighted sum of the absolute value of the disturbance value of the first traversed adjacent merged text boxes and a distance difference, the distance difference is the difference between the updated inter-frame distance of the target adjacent merged text boxes and the distance between the merged text boxes included in the reference adjacent merged text boxes, and the disturbance value is determined according to the height of the merged text box traversed first; and determining the first inter-frame distance threshold of the target adjacent merged text boxes according to the updated inter-frame distance and the updated disturbance value.
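The adaptive threshold update can be sketched as a pair of running weighted sums. The text does not give the weights or the rule that combines the two values, so everything specific here is an assumption made for illustration: the weights `alpha` and `beta`, the multiplier `k`, the combination `updated distance + k * updated disturbance`, the use of the absolute value of the distance difference, and the choice to carry the disturbance forward from one adjacent pair to the next. The names `GapState` and `update_threshold` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GapState:
    smoothed_gap: float  # updated inter-box distance of the reference adjacent pair
    disturbance: float   # disturbance value carried over from the reference pair


def update_threshold(
    state: GapState,
    reference_gap: float,
    alpha: float = 0.25,  # assumed weight of the newly measured gap
    beta: float = 0.25,   # assumed weight of the new distance difference
    k: float = 2.0,       # assumed multiplier applied to the disturbance
) -> Tuple[GapState, float]:
    """Return the new state and the first inter-box distance threshold.

    reference_gap is the measured distance between the boxes of the
    reference adjacent pair (the pair closest to the target pair).
    """
    # Updated inter-box distance: weighted sum of the previous updated
    # distance and the measured distance of the reference pair.
    smoothed = (1.0 - alpha) * state.smoothed_gap + alpha * reference_gap

    # Distance difference between the updated distance and the measured one.
    difference = smoothed - reference_gap

    # Updated disturbance: weighted sum of the previous disturbance magnitude
    # and the magnitude of the distance difference.
    disturbance = (1.0 - beta) * abs(state.disturbance) + beta * abs(difference)

    threshold = smoothed + k * disturbance
    return GapState(smoothed, disturbance), threshold
```

The initial `GapState` is left to the caller; according to the text, the initial disturbance value depends on the height of the merged text box traversed first, but the exact relation is not stated.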
- the text box closest to the region where the extracted target region is located is the text box of the subject name corresponding to the order information to be corrected.
- the text box above the extracted address information is the name of the hotel, the subject of the address information. The same is true for documents such as business cards and shopping orders.
- the text box closest to the area where the address information, identity information, etc. are located is the text box where the name of the subject of the information is located.
- the subject name corresponding to the order information to be corrected may be determined by the following method.
- the content contained in the merged text boxes 321-322 is the order information to be corrected extracted according to the method for correcting order information described in any embodiment of the present disclosure.
- the area where the merged text boxes 321-322 are located is determined as the area where the order information to be corrected is located.
- the merged text box closest to the region where the order information to be corrected is located, in the first target direction (the direction of the search traversal, which is upward in this example), is the merged text box 323 (there is non-target-language text between the merged text box 322 and the merged text box 323, shown in gray, which is ignored).
- the merged text boxes are traversed upward from the merged text box 323. Since the distance between the merged text box 323 and the adjacent merged text box above it exceeds the second inter-frame distance threshold, or there is no other merged text box above the merged text box 323, the second set condition is satisfied and the traversal stops at the merged text box 323, so that the content "XXXXX Hotel" in that merged text box is determined as the name of the subject of the order information to be corrected.
- the second inter-box distance threshold can be set to 0.4*mean_height (the average height of the adjacent merged text boxes).
- the second inter-box distance threshold can be set to 0.6*mean_height (the average height of the adjacent merged text boxes).
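Locating the subject name above the extracted region can be sketched as follows. The sketch makes several assumptions that are not in the text: boxes are given as (top, bottom, text) tuples ordered top to bottom, non-target-language boxes have already been filtered out, the distance is the edge-to-edge vertical gap, and rows collected before the second set condition is met are joined into one name. The function name `find_subject_name` is hypothetical; the default factor 0.6 is one of the two example values given above.

```python
from typing import List, Tuple

Box = Tuple[float, float, str]  # (top edge, bottom edge, text); hypothetical layout


def find_subject_name(boxes: List[Box], region_top_index: int, factor: float = 0.6) -> str:
    """Return the text of the box or boxes directly above the extracted region.

    boxes are ordered top to bottom; region_top_index is the index of the
    topmost box of the region holding the order information to be corrected.
    """
    if region_top_index <= 0:
        return ""  # nothing above the region

    i = region_top_index - 1  # closest merged text box above the region
    collected = [boxes[i][2]]
    while i > 0:
        upper, lower = boxes[i - 1], boxes[i]
        mean_height = ((upper[1] - upper[0]) + (lower[1] - lower[0])) / 2.0
        gap = lower[0] - upper[1]
        if gap > factor * mean_height:
            break  # second set condition met: stop the upward traversal
        collected.append(upper[2])
        i -= 1
    return " ".join(reversed(collected))
```

When there is no box above the closest box, the loop does not run and the closest box alone is returned, which mirrors the case of the merged text box 323 in the example.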
- the information extraction method proposed by any embodiment of the present disclosure can be applied to images or electronic documents of various formats, and various formats include at least one of the following: a hotel order, an airplane itinerary, a passport, an ID card, etc.
- the electronic document may be a pdf document.
- FIG. 4 shows an apparatus for correcting order information provided by at least one embodiment of the present disclosure.
- the apparatus includes: an obtaining unit 401 for obtaining order information to be corrected according to a text recognition result of an order; a determining unit 402 for determining target search information from the text recognition result; a matching unit 403 for obtaining order reference information matching the target search information; and a correction unit 404 for correcting the order information to be corrected by using the order reference information, to obtain target order information.
- the target search information includes at least one of the following: a subject name of the order information to be corrected, and at least one content element of the order information to be corrected.
- the matching unit is specifically used for at least one of the following: obtaining order reference information matching the target search information from the setting database; obtaining, through the Internet, order reference information matching the target search information.
- the setting database includes reference unit information of a plurality of levels, and the reference unit information of the lowest level in the plurality of levels corresponds to a plurality of reference subject names.
- the setting database stores first reference information corresponding to the reference subject name; the determining unit is specifically configured to: obtain, according to the hierarchical division in the setting database, the unit information of the lowest level in the order information to be corrected; the matching unit is specifically configured to: determine, in the reference unit information of the lowest level of the setting database, the target unit information that matches the unit information of the lowest level in the order information to be corrected; determine a target subject name that meets a preset condition among the multiple reference subject names corresponding to the target unit information; and obtain order reference information matching the target search information according to the first reference information corresponding to the target subject name.
- the setting database stores second reference information corresponding to the reference subject name; the determining unit is specifically configured to: obtain, according to the hierarchical division in the setting database, the unit information of the lowest level in the order information to be corrected; the matching unit is specifically configured to: determine, in the reference unit information of the lowest level of the setting database, the target unit information that matches the unit information of the lowest level in the order information to be corrected; determine a target subject name that meets a preset condition among the multiple reference subject names corresponding to the target unit information; and obtain order reference information matching the target search information according to the reference unit information of each level corresponding to the target subject name and the second reference information corresponding to the target subject name.
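The lookup in the setting database for the second-reference-information case can be sketched with a nested dictionary. The layout below is an assumption made for illustration: the lowest-level unit is a postal code, the higher levels are stored as a list, the second reference information is the remaining street-level part of an address, and the subject name is matched exactly. The names `SETTING_DB`, `UnitEntry` and `lookup_reference` and all sample values are hypothetical placeholders.

```python
from typing import Dict, List, Optional, TypedDict


class UnitEntry(TypedDict):
    levels: List[str]         # higher-level reference unit information, highest level first
    subjects: Dict[str, str]  # reference subject name -> second reference information


# Placeholder content, keyed by the lowest-level reference unit information.
SETTING_DB: Dict[str, UnitEntry] = {
    "10400": {
        "levels": ["Example City", "Example District"],
        "subjects": {"XXXXX Hotel": "99 Example Road"},
    },
}


def lookup_reference(db: Dict[str, UnitEntry], lowest_unit: str, subject_name: str) -> Optional[str]:
    """Build order reference information from the unit levels and the second reference information."""
    entry = db.get(lowest_unit)  # target unit information
    if entry is None:
        return None
    detail = entry["subjects"].get(subject_name)  # exact match, for simplicity
    if detail is None:
        return None
    # Combine the street-level detail with the unit information of each level.
    parts = [detail] + list(reversed(entry["levels"])) + [lowest_unit]
    return ", ".join(parts)
```

In practice the exact dictionary lookup on the subject name would be replaced by a scored match against all reference subject names stored under the target unit.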
- when determining a target subject name that meets a preset condition among the multiple reference subject names corresponding to the target unit information, the matching unit is specifically configured to: match the subject name corresponding to the order information to be corrected against each of the multiple reference subject names corresponding to the target unit information; and determine the reference subject name with the highest matching score that also exceeds the first set threshold as the target subject name.
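Selecting the reference subject name with the highest matching score above a threshold can be sketched with the standard-library `difflib.SequenceMatcher`. The text does not specify how the matching score is computed, so the similarity ratio used here, the default threshold of 0.8, and the function name `best_match` are assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def best_match(query: str, candidates: Iterable[str], threshold: float = 0.8) -> Optional[str]:
    """Return the candidate most similar to query, or None if no score exceeds threshold."""
    best_name: Optional[str] = None
    best_score = 0.0
    for name in candidates:
        # Case-insensitive similarity ratio in the range [0, 1].
        score = SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > threshold else None
```

A recognized name with a single wrong character, such as "Grand Plaza Hote1", still matches the stored "Grand Plaza Hotel", while an unrelated string returns `None`. The same selection rule applies to candidate order reference information obtained from the Internet, with the second set threshold in place of the first.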
- the matching unit is specifically configured to: perform a search on the Internet according to the target search information, and obtain one or more pieces of candidate order reference information matching the target search information; match the candidate order reference information with the order information to be corrected; and obtain the candidate order reference information with the highest matching score that also exceeds the second set threshold as the order reference information.
- the device further includes an adding unit for adding the order reference information obtained from the Internet and the subject name corresponding to the order information to be corrected to the information corresponding to the reference unit information of the lowest level in the setting database.
- the device further includes an update unit, configured to update, according to the order reference information obtained from the Internet and the subject name corresponding to the order information to be corrected, the information corresponding to the reference unit information of the lowest level in the setting database.
- the order information to be corrected includes at least address information, and at least one content element included in the address information includes at least one of the following: administrative area, postal code; the reference unit information at multiple levels includes reference administrative area information or postal code information.
- the obtaining unit is specifically configured to: obtain a text recognition result of the order to be processed, where the text recognition result includes multiple text boxes; determine, among the multiple text boxes, a first text box containing key information, where the key information includes at least one of: at least one content element of the order information to be corrected, and a keyword indicating the order information to be corrected; merge at least a part of the multiple text boxes according to the first text box to obtain a merged text box; and acquire the order information to be corrected from the merged text box.
- An embodiment of the present disclosure further provides an electronic device, the device includes a memory and a processor, where the memory is used to store computer instructions that can be executed on the processor, and the processor is used to implement the method for correcting order information described in any embodiment of the present disclosure when executing the computer instructions.
- the method, apparatus, device and storage medium for correcting order information obtain order information to be corrected according to the text recognition result of the order, determine target search information from the text recognition result, acquire order reference information matching the target search information, and use the order reference information to correct the order information to be corrected to obtain target order information, so that accurate target order information can be obtained quickly from the text recognition result of the order.
- FIG. 5 shows an electronic device provided by at least one embodiment of the present disclosure, the device includes a memory and a processor, where the memory is used to store computer instructions that can be executed on the processor, and the processor is used to implement the method for correcting order information described in any embodiment of the present disclosure when executing the computer instructions.
- At least one embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for correcting order information described in any embodiment of the present disclosure is implemented.
- At least one embodiment of the present disclosure also provides a computer program, comprising computer-readable code, where, when the computer-readable code is executed in an electronic device, the processor in the electronic device implements the method for correcting order information described in the first aspect.
- the computer program product of the method for correcting order information provided by the embodiments of the present disclosure includes a computer-readable storage medium storing computer-readable code, and the instructions included in the computer-readable code can be used to execute the methods described in the foregoing method embodiments.
- one or more embodiments of this specification may be provided as a method, system or computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
- Embodiments of the subject matter and functional operations described in this specification can be implemented in: digital electronic circuits, tangible embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them.
- Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus.
- the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical or electromagnetic signal, which is generated to encode and transmit information to a suitable receiver device for execution by the data processing apparatus.
- the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit.
- the central processing unit will receive instructions and data from read only memory and/or random access memory.
- the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operably coupled to such mass storage devices to receive data from them, transmit data to them, or both.
- the computer does not have to have such a device.
- the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (eg, EPROM, EEPROM, and flash memory devices), magnetic disks (eg, internal hard disk or removable disk), magneto-optical disks, and CD-ROM and DVD-ROM disks.
- the processor and memory may be supplemented by or incorporated in special purpose logic circuitry.
Landscapes
- Engineering & Computer Science
- Theoretical Computer Science
- General Physics & Mathematics
- Databases & Information Systems
- Physics & Mathematics
- General Engineering & Computer Science
- Data Mining & Analysis
- Artificial Intelligence
- Computational Linguistics
- Remote Sensing
- Health & Medical Sciences
- Audiology, Speech & Language Pathology
- General Health & Medical Sciences
- Information Retrieval, Db Structures And Fs Structures Therefor
Abstract
A method for correcting order information. The method comprises: acquiring, according to a text recognition result of an order, order information to be corrected; determining target search information from the text recognition result; acquiring order reference information matching the target search information; and correcting the order information to be corrected by using the order reference information, so as to obtain target order information.
Description
Method and apparatus for correcting order information, and device and storage medium
Cross Reference to Related Applications
This patent application claims priority to Chinese Patent Application No. 2020113397772, filed on November 25, 2020 and entitled "Method and apparatus for correcting order information, and device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
[01] The present disclosure relates to computer vision technology, and in particular, to a method, apparatus, device, and storage medium for correcting order information.
Background
[02] At present, OCR (Optical Character Recognition) technology is widely used in many fields and industries, and most of the text in images of text documents can be recognized with it. However, because the accuracy of OCR results is limited, the information extracted from OCR results may contain errors.
Summary
[03]本公开实施例提供了一种订单信息的校正方案。 [03] An embodiment of the present disclosure provides a correction solution for order information.
[04]根据本公开实施例的第一方面, 提供一种订单信息的校正方法, 所述方法包括: 根 据订单的文本识别结果获得待校正订单信息; 从所述文本识别结果中确定目标搜索信息; 获取与所述目标搜索信息匹配的订单参考信息; 利用所述订单参考信息校正所述待校正 订单信息, 以得到目标订单信息。 [04] According to a first aspect of the embodiments of the present disclosure, a method for correcting order information is provided, the method comprising: obtaining order information to be corrected according to a text recognition result of the order; determining target search information from the text recognition result ; Acquiring order reference information matching the target search information; and correcting the order information to be corrected by using the order reference information to obtain target order information.
[05]才艮据本公开实施例的第二方面, 提供一种订单信息的校正装置, 所述装置包括: 获 取单元, 用于根据订单的文本识别结果获得待校正订单信息; 确定单元, 用于从所述文 本识别结果中确定目标搜索信息; 匹配单元, 用于获取与所述目标搜索信息匹配的订单 参考信息; 校正单元, 用于利用所述订单参考信息校正所述待校正订单信息, 以得到目 标订单信息。 [05] According to the second aspect of the embodiments of the present disclosure, an apparatus for correcting order information is provided, the apparatus includes: an acquiring unit, configured to acquire order information to be corrected according to a text recognition result of the order; a determining unit, using for determining target search information from the text recognition result; a matching unit for acquiring order reference information matching the target search information; a correcting unit for correcting the to-be-corrected order information by using the order reference information, to get the target order information.
[06]根据本公开实施例的第三方面,提供一种电子设备,所述设备包括存储器、处理器, 所述存储器用于存储可在处理器上运行的计算机指令, 所述处理器用于在执行所述计算 机指令时实现第一方面所述的订单信息的校正方法。 [06] According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, the device includes a memory and a processor, where the memory is used for storing computer instructions that can be executed on the processor, and the processor is used for The method for correcting order information described in the first aspect is implemented when the computer instructions are executed.
[07]根据本公开实施例的第四方面, 提供一种计算机可读存储介质, 其上存储有计算机 程序, 所述程序被处理器执行时实现第一方面所述的订单信息的校正方法。 [07] According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for correcting order information described in the first aspect is implemented.
[08]根据本公开实施例的第五方面, 提供一种计算机程序, 包括计算机可读代码, 在所 述计算机可读代码在电子设备中运行的情况下, 所述电子设备中的处理器执行时实现第 一方面所述的订单信息的校正方法。 [08] According to a fifth aspect of the embodiments of the present disclosure, a computer program is provided, including computer-readable codes, and when the computer-readable codes are executed in an electronic device, a processor in the electronic device executes When implementing the correction method for order information described in the first aspect.
[09]本公开一个或多个实施例的订单信息的校正方法、 装置、 设备及存储介质, 根据订 单的文本识别结果获得待校正订单信息, 并从所述文本识别结果中确定目标搜索信息,
获取与所述目标搜索信息匹配的订单参考信息, 利用所述订单参考信息校正所述待校正 订单信息以得到目标订单信息, 可以从订单的文本识别结果中, 快速地获得准确的目标 订单信息。 [09] In one or more embodiments of the present disclosure, the order information correction method, device, device, and storage medium obtain order information to be corrected according to the text recognition result of the order, and determine target search information from the text recognition result, Acquiring order reference information matching the target search information, and using the order reference information to correct the order information to be corrected to obtain target order information, can quickly obtain accurate target order information from the text recognition result of the order.
[10] It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
[11] The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with this specification and, together with the specification, serve to explain its principles.
[12] FIG. 1 is a flowchart of a method for correcting order information proposed by at least one embodiment of the present disclosure;
[13] FIG. 2 is a schematic structural diagram of a setting database in the method for correcting order information proposed by at least one embodiment of the present disclosure;
[14] FIGS. 3A, 3B, and 3C are schematic diagrams of an information extraction method proposed by at least one embodiment of the present disclosure;
[15] FIG. 4 is a schematic diagram of an apparatus for correcting order information proposed by at least one embodiment of the present disclosure;
[16] FIG. 5 is a schematic structural diagram of an electronic device proposed by at least one embodiment of the present disclosure.
Detailed Description
[17] Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as recited in the appended claims.
[18] FIG. 1 shows a flowchart of a method for correcting order information according to some embodiments of the present disclosure. As shown in FIG. 1, the method includes steps 101 to 104.
[19] In step 101, order information to be corrected is obtained according to a text recognition result of an order.
[20] In the embodiments of the present disclosure, the order on which text recognition is performed includes at least one of the following: an order image, or an order in the form of an electronic document, such as a PDF document. Those skilled in the art should understand that the order may also be of any other type suitable for text recognition.
[21] In one example, text detection may be performed on the order to obtain the text boxes contained in the order, and text recognition may then be performed on the obtained text boxes to recognize the text in them, thereby obtaining the text recognition result. Alternatively, text recognition, such as OCR, may be performed directly on the order to be processed to obtain a text recognition result that contains the text boxes in the order. The embodiments of the present disclosure do not limit the specific method for obtaining the text recognition result.
[22] The order information to be corrected is information obtained from the text recognition result of the order according to a set rule. For example, where the order information to be corrected includes address information, the address information to be corrected can be obtained from the text recognition result according to the rules that address information follows.
[23] In step 102, target search information is determined from the text recognition result. The target search information is information that is related to the order information to be corrected or that reflects the characteristics of the order information to be corrected.
[24] In one example, the target search information includes at least one of: a subject name of the order information to be corrected, and at least one content element. Taking address information as an example, the target search information may include the name of the subject to which the address information belongs (for example, a person's name or a place name) and/or at least one content element included in the address information (for example, administrative regions at each level, or the postal codes corresponding to those administrative regions).
[25] In step 103, order reference information matching the target search information is acquired.
[26] In the embodiments of the present disclosure, the order reference information matching the target search information may be acquired from a setting database. The setting database stores a plurality of reference subject names and corresponding reference information. For example, where the target search information includes address information, the setting database is a database that stores a plurality of subject names and corresponding address information. Using the subject name corresponding to the order information to be corrected, for example "XX Hotel", together with the postal code, the matching "XX Hotel" can be found in the setting database, and its address information is used as the order reference information.
[27] In the embodiments of the present disclosure, the order reference information matching the target search information may also be acquired through the Internet. Still taking address information as an example, a search engine can be used to search the Internet with the subject name and postal code corresponding to the order information to be corrected, and the information corresponding to the matching subject name that is retrieved is used as the order reference information.
[28] In the embodiments of the present disclosure, the order reference information matching the target search information may also be acquired from both the setting database and the Internet. Where order reference information is obtained from both sources, either one of them, or a designated one, may be used as the target order reference information. Where order reference information is obtained only from the Internet, the Internet search result may be used to update the setting database.
[29] In step 104, the order information to be corrected is corrected using the order reference information to obtain target order information.
[30] With the method, apparatus, device, and storage medium for correcting order information proposed by at least one embodiment of the present disclosure, order information to be corrected is obtained according to the text recognition result of the order, target search information is determined from the text recognition result, order reference information matching the target search information is acquired, and the order information to be corrected is corrected using the order reference information to obtain target order information. Accurate target order information can thus be obtained quickly from the text recognition result of the order.
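Steps 101 to 104 can be read as a four-stage pipeline. The following is a minimal sketch of that flow, not the disclosed implementation: the function `correct_order_info` and its three callable parameters are hypothetical names introduced for illustration, and "correction" is simplified here to replacing the extracted value with the reference when one is found.

```python
from typing import Callable, Dict, Optional


def correct_order_info(
    ocr_text: str,
    extract_info: Callable[[str], str],
    extract_search_keys: Callable[[str], Dict[str, str]],
    lookup_reference: Callable[[Dict[str, str]], Optional[str]],
) -> str:
    """Run steps 101-104 in order; every helper is a hypothetical stand-in."""
    to_correct = extract_info(ocr_text)          # step 101: info to be corrected
    search_keys = extract_search_keys(ocr_text)  # step 102: target search info
    reference = lookup_reference(search_keys)    # step 103: order reference info
    # Step 104 (simplified assumption): use the reference if one was found,
    # otherwise keep the uncorrected value.
    return reference if reference is not None else to_correct


# Toy usage with hard-coded helpers (all values are illustrative).
ocr = "Hotel: XX Hotel\nAddress: 1 Examp1e Street"
result = correct_order_info(
    ocr,
    extract_info=lambda t: t.split("Address: ")[1],
    extract_search_keys=lambda t: {"name": "XX Hotel"},
    lookup_reference=lambda k: {"XX Hotel": "1 Example Street"}.get(k["name"]),
)
print(result)
```

Passing the stages in as callables keeps the sketch independent of how the reference is looked up, which matches the text's point that the source may be the setting database, the Internet, or both.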
[31] Address databases in the related art usually support only queries from a subject name to an address, and tolerate errors only at the beginning and end of the input word. The method for correcting order information proposed by the embodiments of the present disclosure acquires the matching order reference information according to target search information determined from the text recognition result, and the target search information may be the subject name in the order information to be corrected and/or at least one content element in the order information to be corrected. Therefore, even if the order information contains erroneous information, even an erroneous subject name, the method can use other information in the order information as the target search information to acquire order reference information and correct the order information to be corrected. The method thus has high fault tolerance.
[32] In addition, because acquiring the target search information does not depend on how the text of the order is laid out, the method for correcting order information proposed by at least one embodiment of the present disclosure is applicable to orders with different layouts.
[33] In some embodiments, the subject name corresponding to the order information to be corrected may be obtained from the text recognition result of the order as the target search information. The subject name and the order information to be corrected form, for example, a key-value pair, in which the subject name indicates an attribute and the order information to be corrected indicates the value of that attribute.
[34] In one example, the order information to be corrected may be address information. The subject corresponding to the address information is the object to which the address information belongs, and the corresponding subject name is the name of that object. For example, where the object to which the address information belongs is an individual, the corresponding subject name is the person's name; where the object is a place, the corresponding subject name is the place name. The order information to be corrected may also be identity information, in which case the corresponding subject name is a person's name. Those skilled in the art should understand that the order information to be corrected may also be other types of information, which is not limited in the present disclosure.
[35] In some embodiments, the setting database may include reference unit information at a plurality of levels, and each piece of lowest-level reference unit information corresponds to a plurality of reference subject names. In the setting database, the reference unit information is organized and stored hierarchically from top to bottom; the lower the level of a piece of reference unit information, the smaller its scope or the lower its authority. The lowest level is the reference unit information with the smallest scope or the lowest authority. Taking a setting database that stores address information as an example, the reference unit information at the plurality of levels includes reference administrative region information and/or postal code information, and the lowest-level reference unit information includes the name of the administrative region with the smallest scope and/or the postal code corresponding to that administrative region.
[36] In one example, the reference unit information in the setting database may be stored in a tree structure, in which non-leaf nodes at different levels store reference unit information at different levels, and leaf nodes store the reference subject names that belong to their parent nodes.
[37] In some embodiments, the setting database also stores first reference information corresponding to each reference subject name. The first reference information is usually the complete information corresponding to the reference subject name, including the reference unit information at each level and the specific reference information corresponding to the reference subject name. Taking address information as an example, the first reference information may be a complete address, including the administrative region information at each level and the specific address corresponding to the reference subject name, such as the street and/or unit. The first reference information is obtained in advance and is reference information of relatively high reliability and accuracy for the reference subject name.
[38] Taking the case where the order information to be corrected is the address information in a hotel order as an example, the reference unit information at the plurality of levels in the setting database may be administrative regions at a plurality of levels. The tree structure that stores the address information may be a rooted tree whose root node has no actual meaning. The children of the root may store the travel agency of the order (for example, XX Travel Agency), and the remaining non-leaf nodes may store a country's administrative region components or postal codes. Each leaf node may store one object name, together with the complete address information corresponding to that object name. Within the subtree of a given travel agency, every non-leaf node is unique, and the parent of a non-leaf node represents the administrative region one level above it.
[39] FIG. 2 is a schematic structural diagram of the setting database in the method for correcting order information proposed by at least one embodiment of the present disclosure. As shown in FIG. 2, the subtree of a travel agency can be constructed in levels from top to bottom (from shallow to deep): country, province, city, district. In some cases, the level below the district may also include sub-districts, and any administrative region may be replaced by a postal code, for example giving the structure country, province, postal code, district. Those skilled in the art should understand that the above is only an example and that the postal code may replace any administrative region, which is not limited in the present disclosure.
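The tree-structured setting database described above can be illustrated with a small data structure. This is a minimal sketch under assumptions: the class `Node`, the helper `add_subject`, and all labels and addresses are hypothetical and chosen only to mirror the root, travel agency, administrative region, and leaf layout described for FIG. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str                     # agency, administrative unit, postal code, or subject name
    address: Optional[str] = None  # complete address; set only on leaf nodes
    children: Dict[str, "Node"] = field(default_factory=dict)

    def child(self, label: str) -> "Node":
        """Return the child with this label, creating it if it is missing."""
        if label not in self.children:
            self.children[label] = Node(label)
        return self.children[label]


def add_subject(root: Node, agency: str, units: List[str],
                name: str, address: str) -> Node:
    """Insert a leaf under root -> agency -> units (highest to lowest level)."""
    node = root.child(agency)
    for unit in units:
        node = node.child(unit)  # keyed by label, so siblings stay unique
    leaf = node.child(name)
    leaf.address = address
    return leaf


root = Node("ROOT")  # the root carries no meaning of its own
add_subject(root, "XX Travel Agency",
            ["Country A", "Province B", "City C", "District D"],
            "XX Hotel", "1 Example Street, District D, City C")
```

Keying children by label means a second hotel in the same district reuses the existing district node rather than creating a duplicate.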
[40] For a setting database that stores address information, the reference administrative region information at each level of the tree structure can be obtained from the administrative division tables of individual countries and the tables mapping postal codes to administrative regions that are published on the Internet. The reference subject names stored in the leaf nodes, and the corresponding first reference information, can be obtained by manual annotation.
[41] Where the setting database also stores the first reference information corresponding to each reference subject name, the order reference information corresponding to the order information to be corrected can be acquired in the following manner.
[42] First, the lowest-level unit information in the order information to be corrected may be obtained as the target search information, according to the level division in the setting database.
[43] Taking the case where the order information to be corrected is the address information of a hotel order as an example, the unit information at each level contained in the order information to be corrected can be obtained according to the level division of addresses in the setting database, that is, the tree structure of the database. For example, by splitting the order information to be corrected according to the tree structure "country, province, city, district" in the database, the administrative region information at each level contained in the address information can be obtained. The lowest-level administrative region information can then be used as the target search information. For example, where the smallest administrative region contained in the address information is a sub-district, the sub-district information can be used as the target search information; where the smallest administrative region contained in the address information is a district, the district information can be used as the target search information. Other cases are similar and are not repeated here.
[44] Next, among the lowest-level reference unit information of the setting database, the target unit information that matches the lowest-level unit information in the order information to be corrected is determined. In other words, the position at which the lowest-level unit information of the order information to be corrected is stored is located in the tree structure of the setting database; the reference unit information found there, which corresponds to (matches) that lowest-level unit information, is used as the target unit information.
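Locating the lowest-level unit in the tree can be sketched as a depth-first search. The sketch below assumes, for brevity, that a travel agency's subtree is held as nested dictionaries in which non-leaf nodes map a label to a dict and leaf entries map a subject name to an address string; the function name `find_unit` and the sample data are hypothetical. It relies on the property stated above that non-leaf nodes are unique within one travel agency's subtree.

```python
from typing import Optional


def find_unit(subtree: dict, unit: str) -> Optional[dict]:
    """Depth-first search for the non-leaf node labelled `unit`.

    Returns that node's children (subject name -> address), or None.
    """
    for label, child in subtree.items():
        if not isinstance(child, dict):
            continue  # leaf entry: subject name -> address string
        if label == unit:
            return child
        found = find_unit(child, unit)
        if found is not None:
            return found
    return None


# Illustrative subtree of one travel agency.
agency_subtree = {
    "Country A": {
        "Province B": {
            "City C": {
                "District D": {"XX Hotel": "1 Example Street, District D, City C"},
                "District E": {"YY Hotel": "2 Example Road, District E, City C"},
            }
        }
    }
}

candidates = find_unit(agency_subtree, "District E")
print(candidates)
```

The returned mapping is the set of reference subject names attached to the target unit, which is the input to the name-matching step that follows.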
[45] After that, among the plurality of reference subject names corresponding to the target unit information, a target subject name that meets a preset condition is determined.
[46] Each piece of lowest-level reference unit information in the setting database corresponds to a plurality of reference subject names, so the target subject name can be determined among them according to the preset condition.
[47] In one example, the subject name corresponding to the order information to be corrected may be matched against each of the reference subject names corresponding to the target unit information, and the reference subject name with the highest matching score that also exceeds a first set threshold is determined as the target subject name.
[48] Finally, the order reference information matching the target search information is obtained according to the first reference information corresponding to the target subject name.
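The name-matching step above can be sketched as follows. The source does not specify how the matching score is computed; this sketch assumes the similarity ratio from Python's standard `difflib.SequenceMatcher`, and the function name `pick_reference` and the threshold value 0.8 are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def pick_reference(ocr_name: str, candidates: Dict[str, str],
                   threshold: float = 0.8) -> Optional[str]:
    """Return the first reference information of the best-matching name.

    `candidates` maps reference subject names to their reference information.
    Returns None when no score exceeds the (assumed) first set threshold.
    """
    best_name, best_score = None, 0.0
    for name in candidates:
        # Assumed scoring: character-level similarity, case-insensitive.
        score = SequenceMatcher(None, ocr_name.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score > threshold:
        return candidates[best_name]
    return None


refs = {
    "Grand Plaza Hotel": "1 Example Street, District D, City C",
    "Garden Inn": "2 Example Road, District D, City C",
}
# The OCR result confuses the letter "l" with the digit "1".
print(pick_reference("Grand P1aza Hotel", refs))
```

Requiring both the highest score and a minimum threshold prevents an unrelated hotel in the same district from being chosen merely because it is the closest of several poor matches.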
[49] Where the setting database stores the first reference information corresponding to each reference subject name, the order reference information for the order information to be corrected can be obtained from the first reference information corresponding to the determined target subject name. The reference information stored in the setting database has relatively high reliability and accuracy, so correcting the order information to be corrected with it yields more accurate target order information.
[50] In some embodiments, the setting database stores second reference information corresponding to each reference subject name. The second reference information is reference information other than the reference unit information at each level, and is usually more specific than that reference unit information. Taking the case where the order information to be corrected is the address information contained in a hotel order as an example, the second reference information may be, for example, the street and/or unit where the hotel is located. The second reference information is obtained in advance and is reference information of relatively high reliability and accuracy for the reference subject name.
[51] Where the setting database stores the second reference information corresponding to each reference subject name, the target subject name is determined in a manner similar to that described above. The difference is that, after the target subject name is determined, the order reference information corresponding to the order information to be corrected is obtained according to the reference unit information at each level corresponding to the target subject name together with the second reference information corresponding to the target subject name.
[52] The complete information for the target subject name can be obtained from the reference unit information at each level corresponding to the target subject name together with the second reference information corresponding to the target subject name. Correcting the order information to be corrected according to this complete information yields more accurate and more complete target order information.
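Assembling the complete information from the per-level reference units and the second reference information can be sketched as a simple join. The function name `compose_address`, the "most specific part first" ordering, and the comma separator are all assumptions; real address formats vary by country.

```python
from typing import List


def compose_address(units: List[str], detail: str, separator: str = ", ") -> str:
    """Join second reference information with the unit path.

    `units` lists reference unit information from highest to lowest level;
    `detail` is the second reference information (for example street and unit).
    Assumed output order: detail first, then units from lowest to highest.
    """
    parts = [detail] + list(reversed(units))
    return separator.join(p for p in parts if p)


full = compose_address(
    ["Country A", "Province B", "City C", "District D"],
    "1 Example Street",
)
print(full)  # 1 Example Street, District D, City C, Province B, Country A
```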
[53] In some embodiments, the order reference information corresponding to the order information to be corrected may also be acquired from the Internet according to the target search information.
[54] In one example, the Internet may be searched according to, for example, the subject name or at least one content element of the order information to be corrected. For example, searching the Internet for the subject name may return one or more pieces of candidate order reference information corresponding to the subject name. Each piece of candidate order reference information is matched against the order information to be corrected, and the candidate with the highest matching score that also exceeds a second set threshold is taken as the order reference information.
[55] Taking the case where the order information to be corrected is the address information of a hotel order as an example, the target search information may include the postal code contained in the address information and/or the administrative region information at one of the levels. By searching with the subject name corresponding to the address information, that is, the hotel name, together with at least one content element included in the address information, a plurality of pieces of candidate address information that may be the hotel's address can be obtained from the Internet. Each piece of candidate address information obtained from the Internet is fuzzily matched against the order information to be corrected by address component, and the candidate address information with the highest matching score that also exceeds the second set threshold can be used as the order reference information for correcting the order information to be corrected, thereby obtaining more accurate hotel address information.
[56] Where two or more pieces of candidate address information have the same matching score, any one of them may be retained and the others deleted.
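Selecting among candidate addresses can be sketched as component-wise fuzzy matching. The sketch does not perform any Internet search; it starts from a list of candidates that is assumed to have been retrieved already. Splitting addresses on commas, scoring with `difflib.SequenceMatcher`, averaging the per-component scores, and the threshold 0.7 are all assumptions, as are the function names.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def _similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def component_score(ocr_address: str, candidate: str) -> float:
    """Average, over OCR address components, of the best candidate-component match."""
    ocr_parts = [p.strip() for p in ocr_address.split(",") if p.strip()]
    cand_parts = [p.strip() for p in candidate.split(",") if p.strip()]
    if not ocr_parts or not cand_parts:
        return 0.0
    total = sum(max(_similarity(p, c) for c in cand_parts) for p in ocr_parts)
    return total / len(ocr_parts)


def best_candidate(ocr_address: str, candidates: List[str],
                   threshold: float = 0.7) -> Optional[str]:
    """Pick the highest-scoring candidate above the (assumed) second threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = component_score(ocr_address, cand)
        if score > best_score:  # strict comparison: on a tie, the earlier one is kept
            best, best_score = cand, score
    return best if best_score > threshold else None


found = best_candidate(
    "1 Examp1e Street, District D, City C",
    ["1 Example Street, District D, City C, Country A",
     "99 Other Road, District Z, City Q"],
)
print(found)
```

Scoring per component rather than on the whole string lets a candidate that carries extra components, such as a country name missing from the order, still score highly, which is what allows an incomplete address to be completed.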
[57] In the embodiments of the present disclosure, the way the address database organizes and stores the administrative regions at each level, and the postal codes corresponding to those administrative regions, can be set according to the regulations of the target country. The correction method is therefore easy to extend to correcting itinerary information for any destination country.
[58] In some embodiments, the search according to the target search information may first be performed in the setting database and then on the Internet.
[59] Where the setting database contains no target subject name that matches the subject name corresponding to the order information to be corrected, the reference information for the order information to be corrected that is acquired from the Internet, together with the subject name corresponding to the order information to be corrected, may be added to the information corresponding to the lowest-level reference unit information in the setting database. That is, the subject name is added to the reference subject names corresponding to the relevant lowest-level reference unit information. For a tree-structured setting database, this means storing the subject name corresponding to the order information to be corrected, along with the order reference information, in a leaf node of the tree as a newly added reference subject name and its corresponding reference information.
[60] Where the reference information acquired from the setting database is inconsistent with the reference information acquired from the Internet, the information corresponding to the lowest-level reference unit information in the setting database may be updated according to the reference information for the order information to be corrected that is acquired from the Internet and the subject name corresponding to the order information to be corrected. That is, the reference information acquired from the Internet replaces the reference information of the target subject name under the lowest-level reference unit information in the setting database. For a tree-structured setting database, this means replacing the reference information originally stored in the leaf node for the reference subject name with the reference information corresponding to the order information to be corrected, thereby updating the reference information of that reference subject name.
[61] In one example, before the reference information of a reference subject name is updated, the time at which the reference information acquired from the Internet was most recently updated may be obtained, and whether to update the reference information of the reference subject name is determined based on that update time. For example, if the most recent update time falls within a set time range, such as within the last year or the last six months, the update can be performed. Conversely, if the most recent update time falls outside the set time range, a prompt can be output so that a technician decides whether to perform the update, which avoids erroneous updates.
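The add and update rules for the setting database can be sketched as a single function over the leaf entries of one lowest-level unit. The function name `sync_leaf`, the returned status strings, and the default window of 365 days are assumptions; the source only gives "within the last year" and "within the last six months" as examples of the set time range.

```python
from datetime import date, timedelta
from typing import Dict


def sync_leaf(leaves: Dict[str, str], name: str, web_address: str,
              web_updated: date, today: date, max_age_days: int = 365) -> str:
    """Add or update one leaf entry using reference information from the Internet.

    `leaves` maps reference subject names to reference information for one
    lowest-level unit. Returns 'added', 'unchanged', 'updated', or 'needs_review'.
    """
    if name not in leaves:
        leaves[name] = web_address      # no matching subject name: add a new leaf
        return "added"
    if leaves[name] == web_address:
        return "unchanged"              # both sources agree
    if today - web_updated <= timedelta(days=max_age_days):
        leaves[name] = web_address      # recent enough: replace stored reference
        return "updated"
    return "needs_review"               # stale source: leave the decision to a person


leaves = {"XX Hotel": "1 Example Street"}
status = sync_leaf(leaves, "XX Hotel", "2 Example Street",
                   web_updated=date(2024, 5, 1), today=date(2024, 6, 1))
print(status, leaves)
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the recency check deterministic and easy to test.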
[62]在本公开实施例中, 通过利用从互联网中获取的参考信息, 对设定数据库进行添力口 和更新, 可以确保从设定数据库中获取的参考信息的可信度和准确度, 从而可以从待处 理订单中获取更力口准确的待校正订单信息。 [62] In the embodiment of the present disclosure, by using the reference information obtained from the Internet to add and update the setting database, the reliability and accuracy of the reference information obtained from the setting database can be ensured. Thus, more accurate order information to be corrected can be obtained from the pending order.
[63] When applying for a visa for outbound travel, the applicant needs to fill in hotel information and provide a hotel itinerary for review. Performing text recognition and information extraction on the hotel itinerary can reduce tedious manual entry and simplify the review process. However, because the accuracy of OCR results is limited, the information extracted from the OCR results may contain errors.
[64] In the related art, an N-gram model is usually used to correct text recognition results. However, training an N-gram model depends on a lexicon, and lexicons of address information, especially of overseas place names, are usually incomplete. As a result, N-gram models perform poorly when correcting the text recognition results of orders such as hotel orders.
[65] By applying the method for correcting order information proposed in at least one embodiment of the present disclosure to automatic visa processing, the hotel address information in the text recognition result of a hotel itinerary can be corrected, for example by fixing erroneous information in the hotel address or completing an incomplete hotel address. This improves the accuracy and reliability of automatic visa information entry, improves the user experience, and helps speed up the approval process. In addition, because the correction method of the present disclosure can perform correction using reference information obtained from the Internet, or update the setting database according to reference information obtained from the Internet, it addresses the problem of an incomplete lexicon and achieves a better correction effect.
[66] In the embodiments of the present disclosure, the order information to be corrected includes at least address information and hotel information. In this case, the order information to be corrected can be obtained from the text recognition result of the order to be processed by the following method.
[67] First, the text recognition result of the order is acquired, where the text recognition result includes a plurality of text boxes.
[68] Next, a first text box containing key information is determined from the plurality of text boxes. The key information may include at least one of the following: at least one content element of the order information to be corrected, and a keyword indicating the order information to be corrected.
[69] When the order information to be corrected is address information, the key information may include the content element "postal code" of the address information. If the region to which the address information belongs is known, the number of digits of the postal code can be determined. Taking a Thai address as an example, Thai postal codes have 5 digits, so the key information can be determined to be a 5-digit number. In this step, a text box containing a 5-digit number is determined as the first text box. The recognized content may include numbers with more than 5 digits, for example a text box containing an 8-digit number. To reduce additional discrimination operations, in practice only a text box containing exactly 5 digits may be determined as the first text box.
[70] In some embodiments, a postal code that has been found may also be searched for in the postal code list of the region to which it belongs, to confirm that it is indeed a postal code of that region.
[71] When the region to which the address information belongs is unknown, the key information may be determined as a number of 4 to 9 digits, taking into account the lengths of postal codes around the world. In this step, text boxes containing 4- to 9-digit numbers are determined as first text boxes. In one possible implementation, to reduce additional discrimination operations, only text boxes containing 4 to 9 digits are determined as first text boxes; that is, text boxes containing 10 or more digits are not considered.
[72] The key information may also include another content element of the address information, namely administrative region information such as "泰国" or "Thailand". In that case, a text box among the plurality of text boxes whose text content contains "泰国" or "Thailand" may be determined as the first text box.
[73] The key information may also include a keyword indicating the order information to be corrected. Taking an address as the order information to be corrected, the keywords include "地址" (the Chinese word for "address"), "address", and keywords denoting an address in other languages. It should be noted that the form of the keyword is not limited in this application; for example, it may take various forms such as a full name or an abbreviation.
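The selection of first text boxes described in paragraphs [69] to [73] can be sketched as below. This is an illustrative sketch under stated assumptions: the text boxes are represented only by their recognized strings, "containing only N digits" is interpreted as a standalone run of exactly that many digits, and the names `find_first_boxes`, `ADDRESS_KEYWORDS`, and `known_codes` are hypothetical.

```python
import re

# Hypothetical keyword list; the disclosure does not limit keyword forms.
ADDRESS_KEYWORDS = ("address", "地址")


def postal_code_pattern(min_digits, max_digits):
    """Match a digit run of min..max digits that is not part of a longer run."""
    return re.compile(rf"(?<!\d)\d{{{min_digits},{max_digits}}}(?!\d)")


def find_first_boxes(texts, region_digits=None, region_names=(),
                     keywords=ADDRESS_KEYWORDS, known_codes=None):
    """Return indices of text boxes that contain key information.

    region_digits: postal code length when the region is known (e.g. 5);
                   when None, any 4- to 9-digit run is accepted.
    known_codes:   optional set of valid postal codes for the region,
                   used to confirm a candidate as in paragraph [70].
    """
    lo, hi = (region_digits, region_digits) if region_digits else (4, 9)
    pattern = postal_code_pattern(lo, hi)
    hits = []
    for index, text in enumerate(texts):
        lowered = text.lower()
        match = pattern.search(text)
        has_code = match is not None and (
            known_codes is None or match.group() in known_codes)
        has_region = any(name.lower() in lowered for name in region_names)
        has_keyword = any(word.lower() in lowered for word in keywords)
        if has_code or has_region or has_keyword:
            hits.append(index)
    return hits
```

Lowercasing both sides before the keyword comparison treats ADDRESS, Address, and address alike.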
[74] Then, at least some of the plurality of text boxes are merged according to the first text box to obtain merged text boxes.
[75] In the embodiments of the present disclosure, the text boxes to be merged are determined based on the first text box. For example, the text boxes to be merged may be determined according to their positional relationship with the first text box, and are then merged to obtain the merged text boxes.
[76] Finally, the order information to be corrected is obtained from the merged text boxes.
[77] The order information to be corrected can be extracted from the merged text boxes according to the content contained in the merged text boxes, the format information of the merged text boxes, or both.
[78] In the embodiments of the present disclosure, a first text box containing key information is determined among the plurality of text boxes included in the text recognition result of the order to be processed, at least some of the text boxes are merged according to the first text box to obtain merged text boxes, and the order information to be corrected is obtained from the merged text boxes. In this way, information in the order to be processed can be handled efficiently on the basis of the key information in the order information to be corrected.
[79] In some embodiments, the text boxes may be merged in the following manner to obtain the merged text boxes.
[80] First, the positional relationship between the first text box and each of the other text boxes is obtained. The positional relationship includes the orientation of another text box (that is, any text box other than the first text box, or a specified text box) relative to the first text box, for example above or below the first text box. It also includes the distance from the first text box, for example the pixel distance in the vertical direction and the pixel distance in the horizontal direction. The distance between two text boxes is determined from the distance between their center points.
[81] Next, each text box whose positional relationship with the first text box falls within a set range is determined as a second text box. For example, a text box above the first text box may be determined as a second text box, or a text box whose vertical pixel distance from the first text box is within a set threshold may be determined as a second text box, and so on.
[82] Then, the first text box and the second text boxes are taken as the text boxes to be merged, and are merged to obtain the merged text boxes.
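The selection of second text boxes in paragraphs [80] to [82] can be sketched as below. The sketch assumes image coordinates in which y grows downward and combines the two example criteria (above the first text box, and within a vertical distance limit); the `Box` type and the `max_dy` and `above_only` parameters are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned text box in pixel coordinates (y grows downward)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def select_second_boxes(boxes, first, max_dy, above_only=True):
    """Pick boxes whose position relative to `first` is within the set range.

    The vertical distance is measured between center points.
    """
    _, first_cy = first.center
    selected = []
    for box in boxes:
        if box is first:
            continue
        _, cy = box.center
        dy = first_cy - cy  # positive when `box` lies above `first`
        if above_only and dy <= 0:
            continue
        if abs(dy) <= max_dy:
            selected.append(box)
    return selected
```

The first text box together with the returned boxes forms the set of text boxes to be merged.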
[83] In the embodiments of the present disclosure, the text boxes to be merged are determined according to the positional relationship between the first text box containing the key information and the other text boxes in the text recognition result, and the text boxes to be merged are then merged. In this way, the set of text boxes to be merged is narrowed down to the range related to the order information to be corrected, which reduces the amount of information to be processed and improves processing efficiency.
[84] The text boxes to be merged may be merged row by row. That is, the text boxes to be merged are merged according to the row to which each of them belongs, to obtain the merged text boxes.
[85] When only one of the text boxes to be merged belongs to a given row, that text box is determined as one merged text box.
[86] When several of the text boxes to be merged belong to the same row, those text boxes are merged to obtain one merged text box.
[87] FIG. 3A shows an exemplary merging result. As shown in FIG. 3A, the result includes multiple rows of merged text boxes, including merged text boxes 301-303, where the merged text box of each row is obtained by merging the one or more text boxes contained in that row.
[88] In the embodiments of the present disclosure, merging the text boxes to be merged according to the row to which each text box belongs yields a merged text box for each row, which facilitates subsequent information processing.
[89] In some embodiments, for multiple text boxes belonging to the same row, two adjacent text boxes are merged when the distance between them is less than a first threshold. Merging every pair of adjacent text boxes in the row that meets this condition yields a merged text box corresponding to the row. The first threshold may be determined according to the format features of the order information to be corrected.
[90] For multiple text boxes belonging to the same row, when the distance between two adjacent text boxes is greater than or equal to the first threshold, the two text boxes probably do not contain related content and do not both belong to the order information to be corrected, so they are not merged.
[91] When merging adjacent text boxes in the same row yields more than one merged text box, the merged text box corresponding to the row is determined according to the positional relationship between the resulting merged text boxes and the first text box. For example, the merged text box closest to the first text box in the horizontal direction is taken as the final merged text box.
[92] In the embodiments of the present disclosure, restricting merging with a condition on adjacent text boxes in the same row avoids merging text boxes with unrelated content into the merged text box, which improves the accuracy of information processing.
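The row-wise merging rule in paragraphs [89] to [91] can be sketched as below. The sketch assumes each text box in a row is given as a tuple `(x_left, x_right, text)` and measures the distance between adjacent boxes as the horizontal gap between their facing edges; both choices, and the function names, are assumptions made for illustration.

```python
def merge_row(row, first_threshold):
    """Merge adjacent boxes of one row whose gap is below `first_threshold`.

    row: list of (x_left, x_right, text) tuples belonging to the same row.
    Returns a list of merged boxes in the same tuple form.
    """
    if not row:
        return []
    ordered = sorted(row)
    groups = [[ordered[0]]]
    for box in ordered[1:]:
        previous = groups[-1][-1]
        gap = box[0] - previous[1]
        if gap < first_threshold:
            groups[-1].append(box)   # related content: extend current group
        else:
            groups.append([box])     # likely unrelated: start a new group
    return [(g[0][0], g[-1][1], " ".join(b[2] for b in g)) for g in groups]


def pick_for_row(merged, first_center_x):
    """Keep the merged box horizontally closest to the first text box."""
    return min(merged, key=lambda m: abs((m[0] + m[1]) / 2 - first_center_x))
```

When a row yields several merged boxes, `pick_for_row` selects the one retained for that row.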
[93] In some embodiments, the order information to be corrected may be obtained from the merged text boxes according to the format features of the order to be processed.
[94] The format features of the order to be processed include features of the distance between lines of text, font features of each line of text, features of the positional relationship between texts, and so on.
[95] According to the format features, a target direction for obtaining the order information to be corrected can be determined, and the order information to be corrected is obtained according to the target direction.
[96] For example, when the order information to be corrected is address information and the key information is a postal code, the postal code is usually located at the end of the address information, so the order information to be corrected can be determined to lie above the first text box. The target direction for extracting the order information to be corrected can thus be determined, and extraction is performed according to that direction.
[97] As another example, when the order information to be corrected is address information and the key information is the keyword "address" indicating address information, the keyword "address" is usually located at the very beginning of the address information, so the order information to be corrected can be determined to lie below the first text box. The target direction for extracting the order information to be corrected can thus be determined, and extraction is performed according to that direction.
[98] In the embodiments of the present disclosure, determining the target direction according to the format features of the order to be processed, and obtaining the order information to be corrected from the merged text boxes according to the target direction, improves the efficiency of information processing.
[99] In some embodiments, the target direction includes a first target direction and a second target direction. The first target direction indicates the direction in which the merged text boxes are traversed while locating the region where the order information to be corrected is located. The second target direction indicates the direction in which the order information to be corrected is read from that region.
[100] In one example, starting from the merged text box containing the first key information, the merged text boxes are traversed in the first target direction until the merged text box containing the last key information is found. Then, starting from the merged text box containing the last key information, the merged text boxes are traversed in the second target direction until the merged text box containing the first key information is found, and the content traversed in the second target direction is obtained. The key information may include a keyword indicating the order information to be corrected, at least one content element of the order information to be corrected, the subject name of the order information to be corrected, and so on. Taking address information as the order information to be corrected, the keywords indicating address information include "地址" (the Chinese word for "address"), "address", and keywords denoting an address in other languages.
[101] Referring to the exemplary merged text boxes shown in FIG. 3A, the key information is "10110" (a postal code). Starting from the first text box containing "10110", that is, from merged text box 301, the merged text boxes are traversed upward until merged text box 302, which contains the key information "Address", is found. Then, starting from the key information "Address", the merged text boxes are traversed downward until merged text box 301, which contains the key information "10110", is found, and the content traversed downward is obtained as the order information to be corrected. It should be noted that, for a keyword such as the English word for "address", the capitalization of some or all of its letters is not limited and can be adjusted to the actual situation. In other words, during recognition and other processing, ADDRESS, Address, address, and the like can all be handled in the same way, that is, all are recognized as "address".
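The two-pass traversal of paragraphs [100] and [101] can be sketched as below. The sketch assumes the merged text boxes are given as a list of strings ordered from top to bottom, that the first key information is a 5-digit postal code, and that the last key information is the keyword "address"; the function name and the sample rows are hypothetical and do not reproduce the content of FIG. 3A.

```python
import re


def extract_between(rows, code_pattern=r"(?<!\d)\d{5}(?!\d)", keyword="address"):
    """Locate and read the address region from merged text rows.

    rows: merged text box contents ordered from top to bottom.
    Returns the rows from the keyword row down to the postal code row,
    or None when either piece of key information is missing.
    """
    # Starting position: the first row that contains a postal code.
    start = next((i for i, row in enumerate(rows)
                  if re.search(code_pattern, row)), None)
    if start is None:
        return None
    # First target direction (upward): look for the keyword row.
    top = None
    for i in range(start, -1, -1):
        if keyword in rows[i].lower():  # case-insensitive keyword match
            top = i
            break
    if top is None:
        return None
    # Second target direction (downward): read back to the postal code row.
    return rows[top:start + 1]
```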
[102] In one example, the method further includes obtaining the distance between adjacent merged text boxes, where adjacent merged text boxes are two merged text boxes that neighbor each other in the vertical direction. The multiple merged text boxes obtained from the text recognition result include multiple pairs of adjacent merged text boxes. As shown in FIG. 3B, the merged text boxes 311-314 form several pairs of adjacent merged text boxes, including the pair 311-312 and the pair 312-313.
[103] Starting from the first text box, the merged text boxes are traversed in the first target direction until a pair of adjacent merged text boxes whose distance satisfies a first set condition is found. Traversing a merged text box includes obtaining its text content and obtaining the distance between it and its adjacent merged text box, where the adjacent merged text box is the one traversed immediately before it. Next, of the pair of adjacent merged text boxes whose distance satisfies the first set condition, the merged text box that was traversed first is taken as the starting position, and the merged text boxes are traversed in the second target direction until the merged text box containing the key information is found; the content traversed in the second target direction is obtained. The distance between adjacent merged text boxes satisfies the first set condition when it is greater than a first inter-box distance threshold.
[104] Referring to the exemplary merged text boxes shown in FIG. 3B, the key information is "10400" (a postal code). Starting from the first text box containing the postal code "10400", that is, from merged text box 311, the merged text boxes are traversed upward. Taking the traversal of merged text box 312 as an example, this includes obtaining the content of merged text box 312 and obtaining the distance between merged text box 312 and merged text box 311. The distance between two text boxes may be the vertical pixel distance between their center points, or the pixel distance between corresponding positions of the two text boxes. For example, when the two text boxes are left-aligned, their upper-left or lower-left corner points may be used as the two vertices for determining the distance, and the pixel distance between these two vertices is taken as the distance between the two text boxes. Other similar ways of determining the distance between two text boxes may also be used; the specific implementation is not limited in this application and may include, but is not limited to, the cases listed above. When the distance between merged text box 312 and merged text box 311 does not satisfy the first set condition, that is, when it is less than or equal to the first inter-box distance threshold, the upward traversal continues. When the distance between merged text box 314 and merged text box 313 is found to satisfy the first set condition, that is, when it is greater than the first inter-box distance threshold, the upward traversal stops. Next, merged text box 313, which is the one traversed first of merged text boxes 314 and 313, is taken as the starting position, and the merged text boxes are traversed downward until merged text box 311, which contains the postal code "10400", is found. The content traversed downward is obtained as the order information to be corrected.
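The distance-based traversal of paragraphs [103] and [104] can be sketched as below. For simplicity the sketch uses one fixed threshold for every adjacent pair and measures distance as the vertical difference between center points; the disclosure allows a different threshold per pair, and the function name and sample values are hypothetical.

```python
def extract_by_gap(centers_y, texts, start, threshold):
    """Read the region above `start` that is bounded by a large vertical gap.

    centers_y: vertical center of each merged text box, ordered top to bottom.
    texts:     content of each merged text box, in the same order.
    start:     index of the merged text box containing the key information.
    threshold: first inter-box distance threshold (fixed here for simplicity).
    """
    top = start
    # First target direction (upward): stop at the first pair whose distance
    # exceeds the threshold; `top` then indexes the box traversed first of
    # that pair.
    while top > 0:
        distance = centers_y[top] - centers_y[top - 1]
        if distance > threshold:
            break
        top -= 1
    # Second target direction (downward): read from `top` back to `start`.
    return texts[top:start + 1]
```

If no pair exceeds the threshold, the sketch reads from the topmost box; a real implementation might treat that case differently.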
[105] In the embodiments of the present disclosure, the relationship between the first target direction and the second target direction is not limited; that is, the two directions may form an angle with each other. For example, the first target direction and the second target direction may be opposite (180°) or the same (0°).
[106] In one example, when the first key information is located at the beginning of the order information to be corrected, the first target direction may indicate traversing the merged text boxes downward. The merged text boxes are traversed downward until the last key information is found, or until a pair of adjacent merged text boxes whose distance satisfies the first set condition is found. In this case the first target direction and the second target direction are the same: the region already traversed is traversed again in the second target direction, and the traversed content is obtained as the order information to be corrected.
[107] In some embodiments, taking a pair of adjacent merged text boxes as the target adjacent merged text boxes, the first inter-box distance threshold corresponding to the target adjacent merged text boxes is determined according to at least one of the following: the height of the merged text box traversed first among the target adjacent merged text boxes; and the distances between the merged text boxes in the adjacent pairs already traversed together with the height of the first-traversed merged text box. The target adjacent merged text boxes are the two adjacent merged text boxes whose first inter-box distance threshold is to be determined. In the embodiments of the present disclosure, the first inter-box distance threshold may differ from one pair of adjacent merged text boxes to another.
[108] In one example, the first inter-box distance threshold is determined according to the height of the merged text box traversed first among the target adjacent merged text boxes.
[109] Take the first inter-box distance threshold corresponding to adjacent merged text boxes 311 and 312 in FIG. 3B as an example. While locating the region where the order information to be corrected is located, the merged text boxes are traversed from bottom to top, so adjacent merged text boxes 311 and 312 are the first pair traversed in this example. Their first inter-box distance threshold can be determined from the height of merged text box 311. For example, the first inter-box distance threshold is set to 0.65*mean_height1, where mean_height1 is the height of merged text box 311.
[110] In one example, the first inter-box distance threshold may be determined according to the distances between the merged text boxes in the adjacent pairs already traversed and the height of the first-traversed merged text box. The first-traversed merged text box is the merged text box traversed first while locating the region where the order information to be corrected is located.
[111] Take the first inter-box distance threshold corresponding to adjacent merged text boxes 312 and 313 in FIG. 3B as an example. It can be determined from the distance between the already-traversed adjacent merged text boxes 311 and 312 and the height of the first-traversed merged text box 311. For example, the first inter-box distance threshold is set to threshold = mean1_distance + standard1_deviation, where mean1_distance is the distance between adjacent merged text boxes 311 and 312, and standard1_deviation is the disturbance value corresponding to merged text boxes 311 and 312, with standard1_deviation = 0.25*height1, where height1 is, for example, the height of merged text box 311.
[112] When more than one pair of adjacent text boxes has already been traversed, take the first inter-box distance threshold corresponding to adjacent text boxes 313 and 314 in FIG. 3B as an example. The threshold for the target adjacent text boxes 313 and 314 can be determined from the distance between the already-traversed adjacent merged text boxes 311 and 312, the distance between adjacent merged text boxes 312 and 313, and the height of the first-traversed merged text box 311.
[113] In one example, the first inter-box distance threshold corresponding to the target adjacent merged text boxes may be determined as follows. First, an updated inter-box distance of the target adjacent merged text boxes is obtained as a weighted sum of the distance between the merged text boxes of the reference adjacent merged text boxes and the updated inter-box distance of the reference adjacent merged text boxes, where the reference adjacent merged text boxes are the adjacent pair closest to the target adjacent merged text boxes. Second, an updated disturbance value of the target adjacent merged text boxes is obtained as a weighted sum of the disturbance value of the first-traversed adjacent merged text boxes and the absolute value of a distance difference, where the distance difference is the difference between the updated inter-box distance of the target adjacent merged text boxes and the distance between the merged text boxes of the reference adjacent merged text boxes, and the disturbance value is determined from the height of the first-traversed merged text box. Finally, the first inter-box distance threshold of the target adjacent merged text boxes is determined from the updated inter-box distance and the updated disturbance value.
[114] Still taking the first inter-box distance threshold corresponding to the adjacent merged text boxes 313 and 314 in FIG. 3B as an example, first obtain the updated inter-box distance corresponding to the adjacent merged text boxes 313 and 314: new_mean = 0.6*mean_distance + 0.4*mean2_distance, where mean_distance is the updated inter-box distance of the reference adjacent merged text boxes 312 and 313. In this example, the updated inter-box distance is obtained in the same way for every pair of adjacent merged text boxes except the first-traversed pair. For the first-traversed pair, the updated inter-box distance is the distance between the two merged text boxes it contains. Next, obtain the updated disturbance value: new_deviation = 0.6*standard1_deviation + 0.4*abs(mean2_distance - new_mean), where standard1_deviation, as described above, is the disturbance value corresponding to the merged text boxes 311 and 312 (for example, the height of the merged text box 311), and mean2_distance and new_mean have the meanings described above. Finally, the first inter-box distance threshold corresponding to the target adjacent merged text boxes 313 and 314 is determined from the updated inter-box distance and the updated disturbance value obtained above.
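The update in [113] and [114] can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation. The function name `update_threshold` and the parameters `w_old`, `w_new` and `k` are hypothetical. Following [113], `mean2_distance` is read here as the raw distance between the two boxes of the reference pair. The text above does not state how the threshold is formed from the two updated values, so the final line (`new_mean + k * new_deviation`) is purely an assumption.

```python
def update_threshold(mean_distance, mean2_distance, standard1_deviation,
                     w_old=0.6, w_new=0.4, k=1.0):
    """Return (new_mean, new_deviation, threshold) for the target pair.

    mean_distance       -- updated inter-box distance of the reference pair
    mean2_distance      -- raw distance between the boxes of the reference pair
    standard1_deviation -- disturbance value of the first-traversed pair,
                           e.g. the height of the first-traversed box
    k                   -- hypothetical multiplier, not given in the source
    """
    # Weighted sum from [114]: new_mean = 0.6*mean_distance + 0.4*mean2_distance
    new_mean = w_old * mean_distance + w_new * mean2_distance
    # Weighted sum from [114]:
    # new_deviation = 0.6*standard1_deviation + 0.4*abs(mean2_distance - new_mean)
    new_deviation = (w_old * standard1_deviation
                     + w_new * abs(mean2_distance - new_mean))
    # ASSUMPTION: the source only says the threshold is determined from the
    # two values above; an additive combination is used here for illustration.
    threshold = new_mean + k * new_deviation
    return new_mean, new_deviation, threshold
```

With the example weights, a larger recent gap raises both the running distance and the disturbance term, so the threshold loosens gradually instead of reacting to a single outlier.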
[115] Those skilled in the art should understand that the numerical values of the above parameters are examples only and are not intended to be limiting; the parameter values and the weighting coefficients may be chosen according to actual needs.
[116] For the multiple merged text boxes shown in FIG. 3B, the method for determining the first inter-box distance threshold described above is applied. When traversing upward from the merged text box 311, the distance between the merged text box 314 and the merged text box 313 is found to be greater than the corresponding first inter-box distance threshold, so the traversal stops. Next, the merged text box 313, which is the one of the merged text boxes 314 and 313 that was traversed first, is taken as the starting position, and the merged text boxes are traversed downward until the merged text box 311 containing the key information is reached. The content obtained by the downward traversal is then acquired.
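The traversal in [116] can be sketched as below. This is an illustrative sketch under stated assumptions: boxes are given in image coordinates with y increasing downward, the list is sorted top to bottom, and the per-pair threshold is supplied by a caller-provided function. `Box`, `collect_upward` and `threshold_for` are hypothetical names, not part of the disclosure.

```python
from typing import Callable, List, NamedTuple


class Box(NamedTuple):
    top: float      # y coordinate of the upper edge
    bottom: float   # y coordinate of the lower edge
    text: str


def collect_upward(boxes: List[Box], key_index: int,
                   threshold_for: Callable[[int], float]) -> List[str]:
    """Traverse upward from the key box, then read the content downward.

    boxes must be sorted top to bottom; key_index is the box that holds the
    key information. threshold_for(i) returns the distance threshold for the
    pair (boxes[i - 1], boxes[i]).
    """
    i = key_index
    while i > 0:
        gap = boxes[i].top - boxes[i - 1].bottom
        if gap > threshold_for(i):
            break       # the box above is too far away: stop traversing
        i -= 1
    # boxes[i] is the first-traversed box of the pair that ended the upward
    # pass; read downward from it to the key box.
    return [b.text for b in boxes[i:key_index + 1]]
```

For the layout in FIG. 3B, a list ordered 314, 313, 312, 311 with `key_index = 3` would return the text of 313, 312 and 311, provided only the gap between 314 and 313 exceeds its threshold.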
[117] In the embodiments of the present disclosure, a disturbance value is set for the distance threshold, and the current distance threshold is updated according to the distances between the traversed adjacent merged text boxes and the first-traversed merged text box. This improves the fault tolerance of the information extraction method proposed in the embodiments of the present disclosure, so that the order information to be corrected can be extracted more effectively.
[118] In some embodiments, after the order information to be corrected is extracted, the subject name corresponding to the order information to be corrected may further be determined, following the target direction, from the merged text boxes outside the region where the order information to be corrected is located, according to their positional relationship with that region.
[119] In documents of many formats, the text box closest to the region where the extracted target information is located is the text box holding the subject name corresponding to the order information to be corrected. Taking the partial screenshot of a hotel order shown in FIG. 3B as an example, the text box above the extracted address information contains the name of the subject of the address information, namely the hotel. The same holds for documents such as business cards and shopping orders: the text box closest to the region containing the address information, identity information or the like is the text box containing the name of the subject of that information.
[120] In an example, the subject name corresponding to the order information to be corrected may be determined by the following method.
[121] First, determine the merged text box that is closest, in the first target direction, to the region where the order information to be corrected is located. Taking that merged text box as the starting position, traverse the merged text boxes in the first target direction until adjacent merged text boxes whose distance satisfies a second set condition are found. Then, taking the first-traversed merged text box of those adjacent merged text boxes as the starting position, traverse the merged text boxes outside the region where the order information to be corrected is located in the second target direction, and acquire the content traversed in the second target direction.
[122] Take the merged text boxes shown in FIG. 3C as an example. The content of the merged text boxes 321 and 322 is the order information to be corrected, extracted by the method for correcting order information according to any embodiment of the present disclosure, and the region occupied by the merged text boxes 321 and 322 may be taken as the region where the order information to be corrected is located. Among the merged text boxes determined from the text recognition result, other than the merged text boxes 321 and 322, the merged text box closest to that region in the first target direction (the direction of the search traversal, upward in this example) is the merged text box 323. Text in a non-target language lies between the merged text boxes 322 and 323, as shown by the gray portion, and is ignored. Taking the merged text box 323 as the starting position, the merged text boxes are traversed upward. Because the distance between the merged text box 323 and the adjacent merged text box above it exceeds the second inter-box distance threshold, the second set condition is satisfied (the second set condition is also considered satisfied when no other merged text box exists above the merged text box 323). The merged text box 323 is therefore taken as the starting position, and the merged text boxes outside the region where the order information to be corrected is located are traversed downward; in this example, this is only the merged text box 323. The content of that merged text box, "XXXXXX Hotel" (that is, "XXXXXX 酒店"), can thus be determined as the name of the subject of the order information to be corrected.
[123] In some embodiments, when the merged text boxes are traversed in the first target direction starting from the merged text box, merged text boxes that are not above the region where the target is located are ignored. In other words, merged text boxes that have no horizontal overlap with the merged text boxes containing the order information to be corrected are ignored.
[124] In an example, when a traversed merged text box contains ")" but no "(", the distance condition between adjacent merged text boxes may be ignored and the traversal continues in the first target direction until "(" is found; only then is the distance condition between adjacent merged text boxes used to decide whether to stop traversing. In this example, the second inter-box distance threshold may be set to 0.4*mean_height, where mean_height is the average height of the adjacent merged text boxes.
[125] In an example, when the currently traversed merged text box contains a complete pair of brackets "( )" or contains no brackets, the second inter-box distance threshold may be set to 0.6*mean_height, where mean_height is the average height of the adjacent merged text boxes. Those skilled in the art should understand that the above coefficients are examples, and the present disclosure is not limited to them.
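One possible reading of the bracket rule in [124] and [125] is sketched below. The source leaves some details open, so the following are assumptions made for illustration: input is ordered in the traversal direction (bottom to top), each item carries the gap to the next box above, both ASCII and full-width brackets count, and the 0.4 coefficient applies at the box where the missing "(" is found. The name `collect_name_lines` is hypothetical.

```python
from typing import List, Tuple


def collect_name_lines(boxes: List[Tuple[str, float]],
                       mean_height: float) -> List[str]:
    """boxes: (text, gap_to_next_box_above), ordered bottom to top.

    Returns the collected texts in reading order (top to bottom).
    """
    collected = []
    pending = False  # a ")" was seen and its "(" has not been found yet
    for text, gap_above in boxes:
        collected.append(text)
        has_open = any(ch in text for ch in "(（")
        has_close = any(ch in text for ch in ")）")
        if has_close and not has_open:
            pending = True
            continue            # ignore the distance condition for now
        if pending:
            if not has_open:
                continue        # still looking for the matching "("
            pending = False
            coeff = 0.4         # example coefficient from [124]
        else:
            coeff = 0.6         # example coefficient from [125]
        if gap_above > coeff * mean_height:
            break               # second set condition met: stop traversing
    return collected[::-1]
```

The rule keeps a name that wraps across lines, such as a hotel name whose branch is given in brackets, from being cut at the line break even when the line spacing is irregular.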
[126] The information extraction method proposed in any embodiment of the present disclosure can be applied to images or electronic documents of various layouts, including at least one of the following: hotel orders, flight itineraries, passports, identity cards, and so on. The electronic document may be a PDF document. By applying the information extraction method to images or electronic documents of these layouts, the corresponding type of order information to be corrected contained in them can be extracted, including at least one of the following: address information, itinerary information, identity information, and so on.
[127] FIG. 4 shows an apparatus for correcting order information provided by at least one embodiment of the present disclosure. The apparatus includes: an obtaining unit 401, configured to obtain order information to be corrected according to a text recognition result of an order; a determining unit 402, configured to determine target search information from the text recognition result; a matching unit 403, configured to obtain order reference information matching the target search information; and a correction unit 404, configured to correct the order information to be corrected by using the order reference information to obtain target order information.
[128] In some embodiments, the target search information includes at least one of the following: the subject name of the order information to be corrected, and at least one content element of the order information to be corrected.
[129] In some embodiments, the matching unit is specifically configured to perform at least one of the following: obtaining the order reference information matching the target search information from the setting database; obtaining the order reference information matching the target search information through the Internet.
[130] In some embodiments, the setting database includes reference unit information at multiple levels, and the reference unit information at the lowest of these levels corresponds to multiple reference subject names.
[131] In some embodiments, the setting database stores first reference information corresponding to the reference subject names. The determining unit is specifically configured to obtain the lowest-level unit information in the order information to be corrected according to the level division in the setting database. The matching unit is specifically configured to: determine, among the lowest-level reference unit information of the setting database, target unit information that matches the lowest-level unit information in the order information to be corrected; determine, among the multiple reference subject names corresponding to the target unit information, a target subject name that meets a preset condition; and obtain the order reference information matching the target search information according to the first reference information corresponding to the target subject name.
[132] In some embodiments, the setting database stores second reference information corresponding to the reference subject names. The determining unit is specifically configured to obtain the lowest-level unit information in the order information to be corrected according to the level division in the setting database. The matching unit is specifically configured to: determine, among the lowest-level reference unit information of the setting database, target unit information that matches the lowest-level unit information in the order information to be corrected; determine, among the multiple reference subject names corresponding to the target unit information, a target subject name that meets a preset condition; and obtain the order reference information matching the target search information according to the reference unit information at each level corresponding to the target subject name and the second reference information corresponding to the target subject name.
[133] In some embodiments, when determining the target subject name that meets the preset condition among the multiple reference subject names corresponding to the target unit information, the matching unit is specifically configured to: match the subject name corresponding to the order information to be corrected against each of the multiple reference subject names corresponding to the target unit information; and determine the reference subject name with the highest matching score, provided that score exceeds a first set threshold, as the target subject name.
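The lookup in [131] to [133] can be illustrated with a short Python sketch. Everything below is assumed for illustration: the database is modeled as a dictionary keyed by the lowest-level unit (here a postal code), the matching score is `difflib.SequenceMatcher.ratio()` from the Python standard library (the disclosure does not name a scoring method), and the threshold 0.8, the sample data, and the names `best_reference_name` and `lookup_reference` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

# Hypothetical setting database:
# lowest-level unit -> {reference subject name -> reference information}
SETTING_DB: Dict[str, Dict[str, str]] = {
    "100000": {
        "Grand Hotel Beijing": "No. 1 Example Road",
        "Garden Inn": "No. 2 Example Road",
    },
}


def best_reference_name(name: str, candidates: List[str],
                        threshold: float = 0.8) -> Optional[str]:
    """Return the candidate with the highest score above threshold, or None."""
    best, best_score = None, 0.0
    for candidate in candidates:
        # Similarity in [0, 1]; the scoring method is an assumption.
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score > threshold else None


def lookup_reference(unit: str, ocr_name: str,
                     db: Dict[str, Dict[str, str]] = SETTING_DB) -> Optional[str]:
    """Find reference information for an OCR-extracted subject name."""
    names = db.get(unit)
    if not names:
        return None              # no matching lowest-level unit
    target = best_reference_name(ocr_name, list(names))
    return names[target] if target is not None else None
```

Restricting the comparison to names under the matched lowest-level unit keeps the candidate list short, so a recognition error such as "Hotle" for "Hotel" can still resolve to the intended entry.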
[134] In some embodiments, the matching unit is specifically configured to: search the Internet according to the target search information to obtain one or more pieces of candidate order reference information matching the target search information; match each piece of candidate order reference information against the order information to be corrected; and take the candidate order reference information with the highest matching score, provided that score exceeds a second set threshold, as the order reference information.
[135] In some embodiments, the apparatus further includes an adding unit, configured to add the order reference information obtained from the Internet, together with the subject name corresponding to the order information to be corrected, to the information corresponding to the lowest-level reference unit information in the setting database.
[136] In some embodiments, the apparatus further includes an update unit, configured to update the information corresponding to the lowest-level reference unit information in the setting database according to the order reference information obtained from the Internet and the subject name corresponding to the order information to be corrected.
[137] In some embodiments, the order information to be corrected includes at least address information, and the at least one content element of the address information includes at least one of the following: an administrative district, a postal code. The reference unit information at multiple levels in the setting database includes reference administrative district information or postal code information.
[138] In some embodiments, the obtaining unit is specifically configured to: obtain a text recognition result of the object to be processed, the text recognition result including multiple text boxes; determine, from the multiple text boxes, a first text box containing key information, the key information including at least one of the following: at least one content element of the order information to be corrected, and a keyword indicating the order information to be corrected; merge at least some of the multiple text boxes according to the first text box to obtain merged text boxes; and obtain the order information to be corrected from the merged text boxes.
[139] An embodiment of the present disclosure further provides an electronic device. The device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement, when executing the computer instructions, the method for correcting order information according to any embodiment of the present disclosure.
[140] An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for correcting order information according to any embodiment of the present disclosure.
[141] With the method, apparatus, device and storage medium for correcting order information according to one or more embodiments of the present disclosure, order information to be corrected is obtained according to the text recognition result of an order, target search information is determined from the text recognition result, order reference information matching the target search information is obtained, and the order information to be corrected is corrected by using the order reference information to obtain target order information. In this way, accurate target order information can be obtained quickly from the text recognition result of the order.
[142] FIG. 5 shows an electronic device provided by at least one embodiment of the present disclosure. The device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement, when executing the computer instructions, the method for correcting order information according to any embodiment of the present disclosure.
[143] At least one embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for correcting order information according to any embodiment of the present disclosure.
[144] At least one embodiment of the present disclosure further provides a computer program including computer-readable code. When the computer-readable code runs in an electronic device, a processor in the electronic device executes it to implement the method for correcting order information according to the first aspect.
[145] The computer program product for order information provided by the embodiments of the present disclosure includes a computer-readable storage medium storing computer-readable code, and the instructions included in the computer-readable code may be used to execute the method for correcting order information described in the above method embodiments.
[146] Those skilled in the art will appreciate that one or more embodiments of this specification may be provided as a method, a system or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
[147] The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the data processing device embodiment is described relatively briefly because it is basically similar to the method embodiment; for relevant details, refer to the corresponding description of the method embodiment.
[148] Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the acts or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
[149] Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
[150] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform the corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can also be implemented as special purpose logic circuitry.
[151] Computers suitable for executing a computer program include, for example, general purpose and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks or optical disks, or is operatively coupled to such mass storage devices to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name just a few.
[152] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[153] Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what is claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of multiple embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
[154] Similarly, although operations are depicted in the figures in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. In addition, the separation of various system modules and components in the above embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[155] Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve the desired results. Furthermore, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.
Claims
1. A method for correcting order information, the method comprising: obtaining order information to be corrected according to a text recognition result of an order; determining target search information from the text recognition result; obtaining order reference information matching the target search information; and correcting the order information to be corrected by using the order reference information, to obtain target order information.
2. The method according to claim 1, wherein the target search information includes at least one of: a subject name of the order information to be corrected, and at least one content element.
3. The method according to claim 1 or 2, wherein obtaining the order reference information matching the target search information comprises at least one of the following: obtaining, from a setting database, order reference information matching the target search information; obtaining, through the Internet, order reference information matching the target search information.
4. The method according to claim 3, wherein the setting database includes reference unit information of a plurality of levels, and the reference unit information of the lowest level among the plurality of levels corresponds to a plurality of reference subject names.
5. The method according to claim 4, wherein the setting database stores first reference information corresponding to each reference subject name; determining the target search information from the text recognition result comprises: obtaining the lowest-level unit information in the order information to be corrected according to the level division in the setting database; and obtaining, from the setting database, the order reference information matching the target search information comprises: determining, among the lowest-level reference unit information of the setting database, target unit information that matches the lowest-level unit information in the order information to be corrected; determining, among the plurality of reference subject names corresponding to the target unit information, a target subject name that meets a preset condition; and obtaining the order reference information matching the target search information according to the first reference information corresponding to the target subject name.
6. The method according to claim 4, wherein the setting database stores second reference information corresponding to each reference subject name; determining the target search information from the text recognition result comprises: obtaining the lowest-level unit information in the order information to be corrected according to the level division in the setting database; and obtaining, from the setting database, the order reference information matching the target search information comprises: determining, among the lowest-level reference unit information of the setting database, target unit information that matches the lowest-level unit information in the order information to be corrected; determining, among the plurality of reference subject names corresponding to the target unit information, a target subject name that meets a preset condition; and obtaining the order reference information matching the target search information according to the reference unit information of each level corresponding to the target subject name and the second reference information corresponding to the target subject name.
7. The method according to claim 5 or 6, wherein determining, among the plurality of reference subject names corresponding to the target unit information, the target subject name that meets the preset condition comprises: matching the subject name corresponding to the order information to be corrected against each of the plurality of reference subject names corresponding to the target unit information; and determining, as the target subject name, the reference subject name whose matching score is the highest and exceeds a first set threshold.
8. The method according to any one of claims 3 to 7, wherein obtaining, through the Internet, the order reference information matching the target search information comprises: searching the Internet according to the target search information to obtain one or more pieces of candidate order reference information matching the target search information; matching each piece of candidate order reference information against the order information to be corrected; and taking, as the order reference information, the candidate order reference information whose matching score is the highest and exceeds a second set threshold.
9. The method according to claim 8, wherein the method further comprises: adding the order reference information obtained from the Internet, and the subject name corresponding to the order information to be corrected, to the information corresponding to the lowest-level reference unit information in the setting database.
10. The method according to claim 8, wherein the method further comprises: updating the information corresponding to the lowest-level reference unit information in the setting database according to the order reference information obtained from the Internet and the subject name corresponding to the order information to be corrected.
11. The method according to any one of claims 4 to 10, wherein the order information to be corrected includes at least address information; at least one content element included in the address information includes at least one of the following: an administrative region, a postal code; and the reference unit information of the plurality of levels included in the setting database includes reference administrative region information and/or postal code information.
12. The method according to claim 11, wherein obtaining the order information to be corrected according to the text recognition result of the order comprises: obtaining the text recognition result of the order, the text recognition result including a plurality of text boxes; determining, from the plurality of text boxes, a first text box containing key information, the key information including at least one of: at least one content element of the order information to be corrected, and a keyword indicating the order information to be corrected; merging at least some of the plurality of text boxes according to the first text box to obtain a merged text box; and obtaining the order information to be corrected from the merged text box.
13. An apparatus for correcting order information, the apparatus comprising: an obtaining unit, configured to obtain order information to be corrected according to a text recognition result of an order; a determining unit, configured to determine target search information from the text recognition result; a matching unit, configured to obtain order reference information matching the target search information; and a correcting unit, configured to correct the order information to be corrected by using the order reference information, to obtain target order information.
14. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method according to any one of claims 1 to 12 when executing the computer instructions.
15. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 12.
16. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement the method according to any one of claims 1 to 12.
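As an illustration only, and not as part of the claims, the database branch of the claimed method (claims 1, 4, 5 and 7) might be sketched as follows. The sketch assumes that the lowest-level unit is a postal code, that the "first reference information" is a stored address, and that name matching uses the character-level similarity ratio from Python's standard `difflib` module; the claims do not prescribe any of these choices. All names and data (`SETTING_DB`, `Reference`, `find_target_subject`, `correct_order`, the hotel entries and the 0.8 threshold) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Reference:
    subject_name: str  # reference subject name registered under a unit
    address: str       # assumed form of the "first reference information"


# Hypothetical setting database: each lowest-level unit (assumed here to be
# a postal code) maps to the reference subjects registered under it.
SETTING_DB: dict[str, list[Reference]] = {
    "100084": [
        Reference("Sunrise Hotel", "12 Garden Road, Haidian District, Beijing"),
        Reference("Sunset Hostel", "98 Lake Street, Haidian District, Beijing"),
    ],
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; an assumed matching score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_target_subject(
    unit: str, ocr_name: str, threshold: float = 0.8
) -> Optional[Reference]:
    """Return the reference whose name scores highest and above the threshold."""
    candidates = SETTING_DB.get(unit, [])
    if not candidates:
        return None
    best = max(candidates, key=lambda r: similarity(r.subject_name, ocr_name))
    if similarity(best.subject_name, ocr_name) > threshold:
        return best
    return None


def correct_order(
    ocr_name: str, ocr_address: str, unit: str, threshold: float = 0.8
) -> tuple[str, str]:
    """Replace OCR output with reference data when a confident match exists."""
    match = find_target_subject(unit, ocr_name, threshold)
    if match is None:
        # No confident match: leave the recognized text unchanged.
        return ocr_name, ocr_address
    return match.subject_name, match.address


if __name__ == "__main__":
    # "Sunrlse" and "l2" imitate typical OCR confusions between i/l and 1/l.
    print(correct_order("Sunrlse Hotel", "l2 Garden Rd", "100084"))
```

In this sketch an order whose name has no sufficiently similar reference is returned unchanged; under claim 3 such a case could instead fall back to the Internet search branch, which is not shown here.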
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011339777.2A CN112395874A (en) | 2020-11-25 | 2020-11-25 | Order information correction method, device, equipment and storage medium |
CN202011339777.2 | 2020-11-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022112857A1 (en) | 2022-06-02 |
Family
ID=74603919
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2021/055848 WO2022112857A1 (en) | 2020-11-25 | 2021-06-30 | Method and apparatus for correcting order information, and device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112395874A (en) |
WO (1) | WO2022112857A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114092684A (en) * | 2021-11-17 | 2022-02-25 | 中国银联股份有限公司 | Text calibration method, device, equipment and storage medium |
CN114120322B (en) * | 2022-01-26 | 2022-05-10 | 深圳爱莫科技有限公司 | Order commodity quantity identification result correction method and processing equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050137991A1 (en) * | 2003-12-18 | 2005-06-23 | Bruce Ben F. | Method and system for name and address validation and correction |
WO2009005492A1 (en) * | 2007-06-29 | 2009-01-08 | United States Postal Service | Systems and methods for validating an address |
CN107239453A (en) * | 2016-03-28 | 2017-10-10 | 平安科技(深圳)有限公司 | Information write-in method and device |
WO2020134991A1 (en) * | 2018-12-29 | 2020-07-02 | 益萃网络科技(中国)有限公司 | Automatic input method for paper form, apparatus , and computer device and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5043735B2 (en) * | 2008-03-28 | 2012-10-10 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Information classification system, information processing apparatus, information classification method, and program |
CN110442702B (en) * | 2019-08-15 | 2022-09-02 | 北京上格云技术有限公司 | Searching method and device, readable storage medium and electronic equipment |
CN110674396B (en) * | 2019-08-28 | 2021-04-27 | 北京三快在线科技有限公司 | Text information processing method and device, electronic equipment and readable storage medium |
- 2020-11-25: CN application CN202011339777.2A filed (published as CN112395874A); status: active, pending
- 2021-06-30: WO application PCT/IB2021/055848 filed (published as WO2022112857A1); status: active, application filing
Also Published As
Publication number | Publication date |
---|---|
CN112395874A (en) | 2021-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3971731A1 (en) | Fence address-based coordinate data processing method and apparatus, and computer device | |
AU2020279921B2 (en) | Representative document hierarchy generation | |
US8745065B2 (en) | Query parsing for map search | |
US20090006394A1 (en) | Systems and methods for validating an address | |
WO2022112857A1 (en) | Method and apparatus for correcting order information, and device and storage medium | |
CN110674396B (en) | Text information processing method and device, electronic equipment and readable storage medium | |
CN111652176B (en) | Information extraction method, device, equipment and storage medium | |
JP7149721B2 (en) | Information processing device, character recognition engine optimization method and program | |
US8855421B2 (en) | Methods and apparatuses for Embedded Media Marker identification | |
CN109344387B (en) | Method and device for generating shape near word dictionary and method and device for correcting shape near word error | |
CN110516011B (en) | Multi-source entity data fusion method, device and equipment | |
US10331717B2 (en) | Method and apparatus for determining similar document set to target document from a plurality of documents | |
US8996501B2 (en) | Optimally ranked nearest neighbor fuzzy full text search | |
JP2019169025A (en) | Information processing device, character recognition engine selection method, and program | |
CN115470307A (en) | Address matching method and device | |
CN114201480A (en) | Multi-source POI fusion method and device based on NLP technology and readable storage medium | |
JP2016133960A (en) | Keyword extraction system, keyword extraction method, and computer program | |
CN112287763A (en) | Image processing method, apparatus, device and medium | |
CN109241208B (en) | Address positioning method, address monitoring method, information processing method and device | |
CN113626536B (en) | News geocoding method based on deep learning | |
CN112579713B (en) | Address recognition method, address recognition device, computing equipment and computer storage medium | |
CN113704427A (en) | Text provenance determination method, device, equipment and storage medium | |
CN112396056A (en) | Method for high-accuracy line division of text image OCR result | |
CN111708891A (en) | Food material entity linking method and device among multi-source food material data | |
CN111460325A (en) | POI searching method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21897263; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21897263; Country of ref document: EP; Kind code of ref document: A1 |