WO2019242125A1

WO2019242125A1 - Method and apparatus for acquiring upstream and downstream relationships between companies, terminal device and medium

Info

Publication number: WO2019242125A1
Application number: PCT/CN2018/105543
Authority: WO
Inventors: 苏晓明; 汪伟; 王晓伟; 王鸿滨; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-06-19
Filing date: 2018-09-13
Publication date: 2019-12-26
Also published as: CN109002425A; CN109002425B

Abstract

The present solution provides a method and an apparatus for acquiring upstream and downstream relationships between companies, a terminal device and a medium, which are applicable to the technical field of data processing. The method comprises: converting the format of a text to be analyzed from pdf format to xml format; locating, according to each xml tag comprised in said text after conversion, a table existing in said text, and acquiring a midline value of each field region in the table; grouping, on the basis of the midline value, the company object identifiers existing in each table body region to obtain the table header fields matching the company object identifiers; and determining the upstream and downstream relationship between company objects according to the company object identifiers that match customer fields and supplier fields, respectively. The present solution locates tables automatically, can obtain industrial chain information between company objects according to the company object identifiers that match the customer fields and the supplier fields, and improves the efficiency of acquiring the upstream and downstream relationships of companies.

Description

Method, device, terminal equipment and medium for obtaining upstream and downstream relationships of enterprises

This application claims priority to Chinese patent application No. 201810630801.4, filed with the Chinese Patent Office on June 19, 2018 and entitled "Methods, Terminals, and Media for Acquiring Upstream and Downstream Relations of an Enterprise", the entire contents of which are incorporated herein by reference.

Technical field

The present application belongs to the technical field of data processing, and in particular, relates to a method, an apparatus, a terminal device, and a computer-readable storage medium for acquiring an upstream and downstream relationship of an enterprise.

Background technique

Enterprise industry chain information has important reference value in many aspects such as enterprise risk assessment, risk transmission, and industry correlation analysis. The existing public documents of some companies often reveal the industrial chain relationships of some of the companies they are associated with. For example, in a public document such as a prospectus, annual report, and quarterly report issued by an enterprise, users can view the source of materials and sales destinations of products sold by the enterprise, so as to identify some upstream and downstream enterprises associated with the enterprise.

However, because the layouts of public documents such as quarterly reports, annual reports, and prospectuses are relatively complicated, the industrial chain information contained in such documents can currently only be identified and extracted manually, so the efficiency of obtaining the upstream and downstream relationships of enterprises is low.

Technical problem

In view of this, the embodiments of the present application provide a method, an apparatus, a terminal device and a medium for obtaining upstream and downstream relationships of an enterprise, so as to solve the problem that obtaining the upstream and downstream relationships of enterprises from their public documents is inefficient.

Technical solutions

A first aspect of the embodiments of the present application provides a method for obtaining an upstream and downstream relationship of an enterprise, including:

Obtaining the text to be analyzed associated with the enterprise object; the initial format of the text to be analyzed is the portable document pdf format;

Converting a text format of the text to be analyzed from the pdf format to an extensible markup language xml format through a preset text conversion tool;

According to each xml tag included in the converted text to be analyzed, locating a table existing in the text to be analyzed, and obtaining the midline value of each field area in the table; the midline value represents the distance between the center position of the field area and the left border of the page, and the field area includes a header area and a body area;

Based on the midline value, grouping the enterprise object identifiers existing in each body area to obtain the header field matched by each enterprise object identifier; the header field includes a customer field and a supplier field;

Determining an upstream and downstream relationship between each of the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.

A second aspect of the embodiments of the present application provides an apparatus for acquiring an upstream and downstream relationship of an enterprise, the apparatus including units for executing the method for acquiring an upstream and downstream relationship of an enterprise according to the first aspect.

A third aspect of the embodiments of the present application provides a terminal device including a memory and a processor. The memory stores computer-readable instructions executable on the processor, and when the processor executes the computer-readable instructions, the steps of the method for obtaining the upstream and downstream relationships of an enterprise according to the first aspect are implemented.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the method for obtaining the upstream and downstream relationships of an enterprise according to the first aspect are implemented.

Beneficial effect

In the embodiment of the present application, public documents such as prospectuses, annual reports, and quarterly reports exist in pdf format as originally obtained. By converting the text format of these public documents to xml format, the machine can determine the location area of a table from the xml tags, thereby locating the table automatically. In such public documents, each field value in a table exists as text within an xml tag. Therefore, for the enterprise object identifiers in the body area of a table, determining the customer field or supplier field matched by each enterprise object identifier based on the midline value of each field area improves the accuracy of matching each field value in the body area to its header field. Because there is a clear upstream and downstream relationship between a customer and a supplier, the industry chain information between enterprise objects can be obtained from the enterprise object identifiers matched by the customer field and the supplier field, thereby improving the efficiency of obtaining the upstream and downstream relationships of enterprises.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an implementation flowchart of a method for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application;

FIG. 2 is a specific implementation flowchart of a method S103 for obtaining an upstream and downstream relationship of an enterprise according to an embodiment of the present application;

FIG. 3 is a specific implementation flowchart of a method S1031 for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application;

FIG. 4 is a specific implementation flowchart of a method S104 for obtaining an upstream and downstream relationship of an enterprise according to an embodiment of the present application;

FIG. 5 is an implementation flowchart of a method for acquiring an upstream and downstream relationship of an enterprise according to another embodiment of the present application;

FIG. 6 is a schematic diagram of an apparatus for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.

Embodiments of the invention

In order to explain the technical solution described in this application, the following description is made through specific embodiments.

FIG. 1 shows an implementation flow of a method for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application. The method flow includes steps S101 to S105. The specific implementation principle of each step is as follows:

S101: Acquire a text to be analyzed associated with an enterprise object; an initial format of the text to be analyzed is a portable document pdf format.

In the embodiment of the present application, the texts to be analyzed are public documents issued by enterprises, including quarterly reports, annual reports, and prospectuses. The texts to be analyzed are downloaded periodically from the corresponding public websites according to preset website information. Because enterprises output these public documents in the Portable Document Format (PDF) when creating them, the texts to be analyzed downloaded from the public websites are in PDF format.

S102: Convert a text format of the text to be analyzed from the pdf format to an extensible markup language xml format through a preset text conversion tool.

Each text to be analyzed in pdf format is imported into a preset text conversion tool, and after a format conversion instruction issued by the user is detected, the text to be analyzed is output in eXtensible Markup Language (xml) format. The text conversion tool may be, for example, Foxit converter, PDF all-around converter, or All Office Converter. Exemplarily, the text to be analyzed in xml format may be, for example:

<text top="538" left="157" width="214" height="22" font="10">(3) Other important matters</text>

<text top="584" left="171" width="596" height="19" font="12">as of 2005 As of December 31, 2013, the details of major unfinished engineering contracts signed by the company are as follows:</text>

S103: Locate a table existing in the text to be analyzed according to each xml tag included in the converted text to be analyzed, and obtain the midline value of each field area in the table; the midline value represents the distance between the center position of the field area and the left border of the page, and the field area includes a header area and a body area.

According to the text to be analyzed in the above example, it is known that the text to be analyzed based on the xml format includes a text tag <text>, and the <text> tag also includes attribute values such as top, width, height, and font. It is worth noting that in addition to the text tag <text>, paragraph tags or other types of tags may exist in the text to be analyzed based on the xml format, which is not shown in the above example for the time being.
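Reading these attributes can be sketched as follows. This is a minimal illustration rather than the method's actual implementation: it assumes the converted page is wrapped in a hypothetical `<page>` root element (the source shows only the `<text>` lines), abbreviates the second sample line, and uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Assumption: a <page> root element wraps the <text> lines so that the
# fragment is well-formed XML. The source does not show the wrapper.
PAGE_XML = """
<page>
  <text top="538" left="157" width="214" height="22" font="10">(3) Other important matters</text>
  <text top="584" left="171" width="596" height="19" font="12">Details of major unfinished engineering contracts</text>
</page>
"""


def read_text_tags(page_xml):
    """Return one dict per <text> tag with its integer geometry and its text."""
    root = ET.fromstring(page_xml)
    tags = []
    for node in root.iter("text"):
        tags.append({
            "top": int(node.get("top")),
            "left": int(node.get("left")),
            "width": int(node.get("width")),
            "height": int(node.get("height")),
            "text": "".join(node.itertext()).strip(),
        })
    return tags


tags = read_text_tags(PAGE_XML)
```

Each resulting dict carries the `top`, `left`, and `width` values that the later steps rely on.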

In the embodiment of the present application, the text data corresponding to each text label is an attribute value of a field area in the table. According to the top attribute value of the text label, the position of each table existing in the text to be analyzed can be located.

Specifically, FIG. 2 shows a specific implementation process of the method S103 for obtaining an upstream and downstream relationship of an enterprise according to an embodiment of the present application, which is detailed as follows:

S1031: For each page in the text to be analyzed, locate each text label contained in the page, and read the value of the top attribute in the text label.

In the embodiment of the present application, each text to be analyzed associated with the enterprise object may be a pdf text displayed on a single page, or a pdf text displayed on multiple pages. After the text format conversion process is performed, the pdf text of each page will be converted to the corresponding page of xml text.

After the table in the text to be analyzed is converted to the xml format, the text data of each field in the table will correspond to the text data in the text label <text>. For the xml text of each page, read the top attribute value of each text tag according to each text tag it contains. The value of the top attribute indicates the distance between the position of the text data corresponding to the text label in the current page and the top of the page. It can be seen that if the text data is in different rows in the text to be analyzed, the top attribute value of the text label corresponding to the text data is different. In addition, if the text data appears at a higher position in the current page, the smaller the top attribute value of the corresponding text label is.

As an embodiment of the present application, FIG. 3 shows a specific implementation process of the method S1031 for obtaining an upstream and downstream relationship of an enterprise provided by an embodiment of the present application, which is detailed as follows:

S10311: Scan each page in the text to be analyzed separately to determine the page containing a preset form name.

In the embodiment of the present application, since the texts to be analyzed are public documents such as annual reports, quarterly reports, and prospectuses, the form names of each table included in the texts to be analyzed are form names that conform to a preset format. Scan each page in the text to be analyzed according to a preset regular expression. The above regular expression is used to describe the pattern rule to which the table name conforms.

If the text data matching the regular expression is identified in the current page, it is determined that the page contains a preset table name, so the page in the text to be analyzed is selected. After identifying each page in the text to be analyzed, multiple pages containing the name of the table can be determined in turn.
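The page scan in S10311 might look like the sketch below. The source does not disclose the actual regular expression, so `TABLE_NAME_PATTERN` is a purely hypothetical stand-in, and each page is assumed to be available as a plain string.

```python
import re

# Hypothetical pattern: the source only says that table names follow a
# preset format, not what that format is.
TABLE_NAME_PATTERN = re.compile(
    r"(top\s+(five|ten)\s+)?(customers|suppliers)", re.IGNORECASE
)


def pages_with_table_name(pages):
    """Return the indices of pages whose text matches the table-name pattern."""
    return [i for i, text in enumerate(pages) if TABLE_NAME_PATTERN.search(text)]


pages = [
    "Chapter 1 Company overview",
    "Top five customers of the company during the reporting period",
    "Notes to the financial statements",
    "Major suppliers and purchase amounts",
]
selected = pages_with_table_name(pages)
```

Only the pages in `selected` would then go through the more expensive tag-level inspection.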

S10312: For the currently determined page, locate each text label contained in the page, and read the value of the top attribute in the text label.

S10313: If the currently determined page does not contain at least two text tags with the same top attribute value, determine the next page containing the preset table name, and return to the operation of locating each text tag contained in the currently determined page and reading the top attribute value in the text tag.

If a currently determined page does not contain at least two text tags with the same top attribute value, there is no table in the page, so the next page containing the preset table name is read and the process returns to step S10312.

In the embodiment of the present application, character matching based on a regular expression consumes few system resources. Determining in advance the pages of the text to be analyzed that contain a preset table name therefore improves table search efficiency compared with directly reading the top attribute value of every text tag on every page to decide whether the page contains a table. After the pages to which tables belong have been preliminarily located, the specific position of each table is further determined from the top attribute values, which excludes pages that contain only a table name but no table. The embodiment of the present application therefore also improves the accuracy of table positioning.
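The check in S10312 and S10313 can be sketched as below, under the assumption that each candidate page has already been reduced to a list of its top attribute values. The function names are illustrative and do not come from the source.

```python
from collections import Counter


def has_table_row(tops):
    """True if at least two text tags on the page share the same top value."""
    return any(count >= 2 for count in Counter(tops).values())


def first_page_with_table(candidate_pages):
    """Walk the candidate pages in order and return the first one with a table.

    candidate_pages: list of (page_number, [top values]) for the pages that
    already matched the preset table name.
    """
    for page_number, tops in candidate_pages:
        if has_table_row(tops):
            return page_number
    return None


candidates = [(3, [538, 584]), (7, [627, 627, 655])]
page = first_page_with_table(candidates)
```

Page 3 is skipped because every tag sits on its own line, which corresponds to a page that mentions a table name without containing a table.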

S1032: In the page, detect the text tags with the largest top attribute value and the smallest top attribute value, and locate the page area between the two determined text tags as the area of the table in the text to be analyzed.

Among the text tags contained in the current page, the text tags with the largest and the smallest top attribute values are selected. In the text to be analyzed, the text data corresponding to these two text tags lies in the last row and the first row of the table, respectively. Therefore, in the embodiment of the present application, the page positions of the last row and the first row of the table can be determined from the positions in the current page of the text tags with the largest and smallest top attribute values, and the page area between these two page positions is located as the area where a table exists.

In particular, before the text tags with the largest and smallest top attribute values are detected in the current page, it is detected whether multiple text tags appear consecutively in the page. If K text tags (K is an integer greater than zero) appear consecutively, the K consecutive text tags are determined to be the xml parameters corresponding to one table in the text to be analyzed. For the xml parameters corresponding to each table, the text tags with the largest and smallest top attribute values are detected, and the page area between the two determined text tags is located as the area where that table exists. In this way, every table existing in the current page can be located.
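A minimal sketch of locating the table area for one group of consecutive text tags is given below. It assumes the group has already been identified and that each tag is a dict with a `top` key; both the data layout and the function names are assumptions made for illustration.

```python
def table_region(group):
    """Return (top of first row, top of last row) for one group of table tags.

    The smallest top value is the first row of the table and the largest
    top value is the last row, since top measures distance from the page top.
    """
    tops = [tag["top"] for tag in group]
    return min(tops), max(tops)


def tags_in_region(tags, region):
    """Keep only the tags whose top value falls inside the table region."""
    first_top, last_top = region
    return [tag for tag in tags if first_top <= tag["top"] <= last_top]


group = [
    {"top": 627, "text": "Serial number"},
    {"top": 627, "text": "Project name"},
    {"top": 655, "text": "1"},
]
region = table_region(group)
inside = tags_in_region(group + [{"top": 700, "text": "Footnote"}], region)
```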

In the embodiment of the present application, the left attribute value indicates the distance between the position in the current page of the text data corresponding to the text tag and the left side of the page; the width attribute value indicates the width of the field area corresponding to the text tag in the table; and the midline value indicates the distance between the center line of the field area in the current page and the left side of the page.

The midline value Line_Mid of each field area in the table is calculated by the following formula:

Line_Mid = Value[left] + Value[width] / 2

where Value[left] is the left attribute value of the text tag corresponding to the field area, and Value[width] is the width attribute value of that text tag.
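As a worked instance of this formula, the sketch below computes the midline for two sample field areas (left 132 with width 27, and left 224 with width 51). The function name `midline` is illustrative only.

```python
def midline(left, width):
    """Distance from the left page border to the center line of a field area."""
    return left + width / 2


first = midline(132, 27)   # 132 + 13.5
second = midline(224, 51)  # 224 + 25.5
```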

S104: Based on the midline value, group the enterprise object identifiers existing in each body area to obtain the header field matched by each enterprise object identifier. The header field includes a customer field and a supplier field.

In the embodiment of the present application, for each table located in the text to be analyzed, the table body area and the header area are included. The header area includes the field area to which the first row of text data in the table belongs; the body area includes the other field areas in the table except the header area.

In the embodiment of the present application, the data column associated with the enterprise object identifier in each table is identified through a preset recognition algorithm. The enterprise object identification includes, but is not limited to, the name of the enterprise object, the abbreviation of the company name, or the industry common name of the enterprise object.

Exemplarily, the preset recognition algorithm may be as follows: acquire multiple enterprise object identifiers collected in advance and store them in an identifier list; for the text data corresponding to each text tag, judge whether the text data matches any enterprise object identifier in the identifier list; if the text data matches any enterprise object identifier in the identifier list, determine that the data column to which the text data belongs is a data column associated with enterprise object identifiers.
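The list-based recognition described above can be sketched as follows. Two points are assumptions rather than statements from the source: "matches" is interpreted as substring containment, and each cell is assumed to carry a hypothetical column index.

```python
# Illustrative identifier list; in practice it would be collected in advance.
IDENTIFIER_LIST = ["Crocodile Group", "Wangwang Co., Ltd.", "Holawang Group"]


def find_identifier(text, identifier_list):
    """Return the first known identifier contained in the text, or None."""
    for identifier in identifier_list:
        if identifier in text:
            return identifier
    return None


def identifier_columns(cells, identifier_list):
    """Return the set of column indices holding at least one identifier.

    cells: list of (column_index, text) pairs; the column index is an
    assumption made for this sketch.
    """
    return {
        column
        for column, text in cells
        if find_identifier(text, identifier_list) is not None
    }


cells = [(0, "1"), (1, "Crocodile Group"), (2, "15,000"), (1, "Holawang Group")]
columns = identifier_columns(cells, IDENTIFIER_LIST)
```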

In the tables contained in texts to be analyzed such as quarterly reports, annual reports, and prospectuses, the header field of a data column associated with enterprise object identifiers is usually the customer field or the supplier field. Since the xml-based text to be analyzed does not directly reflect the correspondence between each enterprise object identifier and its header field, in the embodiment of the present application the enterprise object identifiers are grouped based on the midline value of the field area to which each identifier belongs, so as to determine whether each enterprise object identifier is body data in the "Customer" data column or body data in the "Supplier" data column.

Specifically, as an embodiment of the present application, FIG. 4 shows a specific implementation process of the method S104 for obtaining an upstream and downstream relationship of an enterprise provided by an embodiment of the present application, which is detailed as follows:

S1041: Obtain the first midline value of each header field in the header area.

In a table located on the current page, it follows from the above analysis that the text data corresponding to each text tag with the smallest top attribute value is a header field of the table. Therefore, the midline value calculated for each text tag with the smallest top attribute value is output as the midline value of the header field corresponding to that text tag.

S1042: For each of the table body regions to which the enterprise object identifier belongs, obtain a second midline value of the table body region.

In the embodiment of the present application, if the text data corresponding to a text tag is detected to contain an enterprise object identifier, the field area corresponding to the text tag is determined to be a body area, and the midline value of that field area is output as the midline value of a body area in the current table.

It should be noted that, in this embodiment, the first midline value refers to the midline value of the header area, and the second midline value refers to the midline value of the body area. The “first” here is only for convenience of expression and reference, and does not mean that there must be a corresponding first midline value in the specific implementation of the present application. Similarly, the "second" in the second midline value is only for convenience of expression and reference, and does not mean that there will be a second midline value corresponding to it in the specific implementation of the present application.

S1043: Calculate the relative distances between the enterprise object identifier and each of the header fields according to the first midline value and the second midline value.

In the embodiment of the present application, if there are A (A is an integer greater than zero) header fields in the table, A first midline values can be obtained. For each enterprise object identifier, the absolute value of the difference between the second midline value of the body area to which it belongs and each first midline value is calculated, and each absolute difference is output as the relative distance between the enterprise object identifier and the corresponding header field.

Exemplarily, if the enterprise object identifier "Crocodile Group" exists in the table and the table has two header fields, "Customer" and "Supplier", then the relative distance D1 between the body field where "Crocodile Group" is located and the "Customer" header field is:

D1 = abs(Line_mid[Crocodile Group] - Line_mid[Customer])

The relative distance D2 between the body field where "Crocodile Group" is located and the "Supplier" header field is:

D2 = abs(Line_mid[Crocodile Group] - Line_mid[Supplier])

Here, abs() is a preset absolute value function; Line_mid[Customer] is the first midline value of the header area to which the "Customer" header field belongs; Line_mid[Supplier] is the first midline value of the header area to which the "Supplier" header field belongs; and Line_mid[Crocodile Group] is the second midline value of the body area to which "Crocodile Group" belongs.

S1044: Output the header field with the smallest relative distance as a header field that matches the enterprise object identifier.

In the embodiment of the present application, after the relative distances between the enterprise object identifier and the A header fields are calculated, A relative distances can be obtained. Among the above A relative distances, the relative distance with the smallest value is selected, and the first midline value associated with the relative distance is determined. According to the determined header field corresponding to the first midline value, the header field is output as a header field that matches the enterprise object identifier.

For example, in the above example, if the relative distance D1 between the body field of "Crocodile Group" and the "Customer" header field is 3, and the relative distance D2 between the body field of "Crocodile Group" and the "Supplier" header field is 4, then the header field with the smallest relative distance is "Customer". The "Customer" header field is therefore output as the header field that matches the enterprise object identifier; that is, the data column to which the enterprise object identifier belongs is determined to be the data column of the "Customer" field, so that each enterprise object identifier in the table is grouped accurately.
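Steps S1043 and S1044 amount to a nearest-midline lookup, sketched below. The midline values are hypothetical and were chosen only so that the distances come out as D1 = 3 and D2 = 4; the function names are illustrative.

```python
def relative_distances(body_midline, header_midlines):
    """Absolute difference between a body midline and each header midline."""
    return {
        header: abs(body_midline - value)
        for header, value in header_midlines.items()
    }


def match_header(body_midline, header_midlines):
    """Return the header field with the smallest relative distance.

    On a tie, the header listed first wins; the source does not say how
    ties should be resolved, so this is an assumption.
    """
    distances = relative_distances(body_midline, header_midlines)
    return min(distances, key=distances.get)


# Hypothetical midline values for the two header fields and one body field.
headers = {"Customer": 200.0, "Supplier": 207.0}
crocodile_midline = 203.0

distances = relative_distances(crocodile_midline, headers)
matched = match_header(crocodile_midline, headers)
```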

S105: Determine upstream and downstream relationships between the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.

In the embodiment of the present application, for the enterprise object identifiers corresponding to the respective text labels with the same top attribute value, these enterprise object identifiers are displayed in the same row of information records of the two-dimensional data table created in advance. The header field of the two-dimensional data table includes a customer field and a supplier field.

In the embodiment of the present application, according to the header field matched by each enterprise object identifier, the data column to which each enterprise object identifier belongs in the two-dimensional data table is adjusted so that each enterprise object identifier is located in the same data column as its matching header field.
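Assembling the two-dimensional data table can be sketched as grouping matched identifiers by their top attribute value, so that identifiers from the same row land in the same record. The triple layout `(top, identifier, header)` and the sample top values are assumptions for illustration.

```python
from collections import defaultdict


def build_rows(matches):
    """Group (top, identifier, header) triples into one record per table row."""
    rows = defaultdict(dict)
    for top, identifier, header in matches:
        # Identifiers with the same top value belong to the same row; the
        # matched header decides which column they are placed in.
        rows[top][header] = identifier
    # Smaller top values are higher on the page, so sort ascending.
    return [rows[top] for top in sorted(rows)]


matches = [
    (683, "Chun Xia Qiu Dong Group", "Supplier"),
    (655, "Crocodile Group", "Customer"),
    (655, "Wangwang Co., Ltd.", "Supplier"),
    (683, "Holawang Group", "Customer"),
]
table = build_rows(matches)
```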

Exemplarily, the two-dimensional data table finally output is as follows:

Customer	Supplier
Crocodile Group	Wangwang Co., Ltd.
Holawang Group	Chun Xia Qiu Dong Group

Since a customer is downstream of its supplier in the supply chain, the upstream and downstream hierarchical relationship between the enterprise objects can be determined based on the two-dimensional data table output above. For example, in the above example, Crocodile Group is downstream of Wangwang Co., Ltd., and Chun Xia Qiu Dong Group is upstream of Holawang Group.
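Deriving the relationships from such a table can be sketched as below. Following the example above, the supplier and customer in the same row are paired; the record layout and the function name are assumptions made for illustration.

```python
def supply_chain_edges(rows):
    """Return (upstream, downstream) pairs, one per complete table row.

    In each row the supplier is upstream and the customer is downstream.
    Rows missing either field are skipped.
    """
    return [
        (row["Supplier"], row["Customer"])
        for row in rows
        if "Supplier" in row and "Customer" in row
    ]


rows = [
    {"Customer": "Crocodile Group", "Supplier": "Wangwang Co., Ltd."},
    {"Customer": "Holawang Group", "Supplier": "Chun Xia Qiu Dong Group"},
    {"Customer": "Unpaired Customer"},
]
edges = supply_chain_edges(rows)
```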

In the embodiment of the present application, determining the customer field or supplier field matched by an enterprise object identifier based on the midline value of each field area improves the accuracy of matching each field value in the body area to its header field. Because there is a clear upstream and downstream relationship between a customer and a supplier, the industry chain information between enterprise objects can be obtained from the enterprise object identifiers matched by the customer field and the supplier field, thereby improving the efficiency of obtaining the upstream and downstream relationships of enterprises.

As another embodiment of the present application, as shown in FIG. 5, before step S104, the method further includes:

S106: For each page in the text to be analyzed, locate each text label contained in the page, and read the value of the top attribute in the text label.

S107: If there are at least two text tags with the same top attribute value, record each of the top attribute values in the page in a preset register.

S108: Find the smallest top attribute value in the register, and read the text data in the text label corresponding to the top attribute value.

S109: Determine the text data as one of the header fields in the table.

In the embodiment of the present application, the text to be analyzed includes multiple pages. For each page, in the page based on the xml format, locate each text tag <text> contained in it, and read the top attribute value of each text tag.

In the embodiment of the present application, it is determined whether there are at least two text tags with the same top attribute value on the current page. If not, the next page in the text to be analyzed is read and the process returns to step S106. If so, starting from the position in the current page of the at least two text tags, each top attribute value subsequently read is recorded in a preset register. Once all top attribute values have been recorded, the smallest top attribute value in the register is found.

Read the text data in each text label corresponding to the top attribute value, and output the text data as a header field in a table included in the current page.

For example, if the text to be analyzed based on the xml format is:

<text top="627" left="132" width="27" height="13" font="9">Serial number</text>

<text top="627" left="224" width="51" height="13" font="9">Project name</text>

<text top="655" left="141" width="574" height="11" font="9">1 Fudan Guoquan Science and Technology Park October 28, 2004 Shanghai Shangfeng Kesheng Investment Co., Ltd. 15,000 (unit: ten thousand yuan)</text>

The text data corresponding to the text label with the smallest top attribute value is "Serial Number" and "Project Name". Therefore, "Serial Number" and "Project Name" are output as two header fields in the current table, respectively.
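Steps S106 to S109 can be sketched as follows. The "register" is modeled as a plain list, and for simplicity every top value on the page is recorded rather than only those read after the first shared-top tags; both choices are simplifying assumptions.

```python
from collections import Counter


def header_fields(tags):
    """Return the text of the tags with the smallest top value on a table page.

    Returns an empty list when no two tags share a top value, which the
    method treats as a page without a table.
    """
    register = [tag["top"] for tag in tags]  # stand-in for the preset register
    if not any(count >= 2 for count in Counter(register).values()):
        return []
    smallest = min(register)
    return [tag["text"] for tag in tags if tag["top"] == smallest]


page_tags = [
    {"top": 627, "text": "Serial number"},
    {"top": 627, "text": "Project name"},
    {"top": 655, "text": "1 Fudan Guoquan Science and Technology Park"},
]
fields = header_fields(page_tags)
```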

In the embodiment of the present application, each page of the text to be analyzed is traversed to locate each text tag included in the page. Only when a page contains at least two text tags with the same top attribute value are the top attribute values in that page recorded in the preset register. This avoids performing read and write operations on the text tags of every page, quickly locates the pages to which tables belong, and thus improves the efficiency of searching for tables in the text to be analyzed. As a result, the efficiency of acquiring the upstream and downstream relationships of enterprises is also improved.

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

Corresponding to the method for acquiring the upstream and downstream relationships of the enterprise described in the foregoing embodiment, FIG. 6 shows a structural block diagram of the device for acquiring the upstream and downstream relationships of the enterprise provided in the embodiment of the present application. For ease of description, only the parts related to the embodiment of the present application are shown.

Referring to FIG. 6, the device includes:

The obtaining unit 61 is configured to obtain a text to be analyzed associated with an enterprise object; an initial format of the text to be analyzed is a portable document pdf format.

The conversion unit 62 is configured to convert a text format of the text to be analyzed from the pdf format to an extensible markup language xml format by using a preset text conversion tool.

A positioning unit 63, configured to locate a table existing in the text to be analyzed according to each xml tag included in the text to be analyzed after conversion, and obtain a median value of each field region in the table; the median value represents the distance between the center position of the field area and the left border of the page, and the field area includes the header area and the body area.

A grouping unit 64 is configured to perform group processing on the enterprise object identifiers existing in each of the table body regions based on the median value to obtain a header field matched by each of the enterprise object identifiers. The header fields include the customer field and the supplier field.

A determining unit 65 is configured to determine an upstream and downstream relationship between each of the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
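The work of the determining unit can be illustrated with a short sketch. It assumes, as the source does not state explicitly, that an enterprise matched to the supplier field is upstream of the enterprise the text is associated with, and that an enterprise matched to the customer field is downstream of it. The function name `build_relationships` and the `(upstream, downstream)` edge format are hypothetical.

```python
def build_relationships(subject, matches):
    """Turn header-field matches into (upstream, downstream) pairs.

    subject: identifier of the enterprise the analyzed text belongs to.
    matches: list of (enterprise_identifier, header_field) pairs, where
             header_field is "customer" or "supplier".
    """
    edges = []
    for enterprise, field in matches:
        if field == "supplier":
            # Assumption: a supplier sits upstream of the subject enterprise.
            edges.append((enterprise, subject))
        elif field == "customer":
            # Assumption: a customer sits downstream of the subject enterprise.
            edges.append((subject, enterprise))
    return edges


edges = build_relationships(
    "Company A",
    [("Company B", "supplier"), ("Company C", "customer")],
)
print(edges)  # [('Company B', 'Company A'), ('Company A', 'Company C')]
```

Keeping the result as a list of directed pairs makes it easy to merge the output of many documents into a single industrial-chain graph.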

Optionally, the apparatus for acquiring upstream and downstream relationships of the enterprise further includes:

The reading unit is configured to locate, for each page in the text to be analyzed, each text label included in the page, and read a top attribute value in the text label.

The recording unit is configured to record each of the top attribute values in the page in a preset register if there are at least two of the text tags with the same top attribute value.

The searching unit is configured to search for the smallest top attribute value in the register, and read text data in the text label corresponding to the top attribute value.

A determining unit, configured to determine the text data as one of the header fields in the table.

Optionally, the grouping unit 64 includes:

The first obtaining subunit is configured to obtain a first midline value of each header field in the header area separately.

The second obtaining subunit is configured to obtain, for each of the table body regions to which the enterprise object identifier belongs, a second midline value of the table body region.

A calculation subunit, configured to respectively calculate a relative distance between the enterprise object identifier and each of the header fields according to the first midline value and the second midline value.

An output subunit, configured to output the header field with the smallest relative distance as a header field that matches the enterprise object identifier.
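The four subunits above amount to a nearest-midline match, which might be sketched as below. The sketch assumes that the midline value of a field can be computed as `left + width / 2` from the xml attributes and that the relative distance is the absolute difference of two midline values; both formulas and the function names are illustrative assumptions, not taken from the source.

```python
def midline(left, width):
    """Distance from the left page border to the center of a field area."""
    return left + width / 2


def match_header(headers, cell):
    """Return the header field whose midline is closest to the cell's midline.

    headers: dict mapping header field name -> (left, width)
    cell:    (left, width) of the body region holding an enterprise identifier
    """
    cell_mid = midline(*cell)
    return min(headers, key=lambda name: abs(midline(*headers[name]) - cell_mid))


# Hypothetical layout: midlines are 130 for "Customer" and 330 for "Supplier".
headers = {"Customer": (100, 60), "Supplier": (300, 60)}
print(match_header(headers, (290, 90)))  # Supplier (cell midline is 335)
```

Matching on midlines rather than on left edges tolerates cells that are wider or narrower than their header, which is common when company names vary in length.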

Optionally, the positioning unit 63 includes:

A positioning subunit, configured to locate each text label contained in the page for each page in the text to be analyzed, and read the value of the top attribute in the text label.

A detection subunit, configured to detect, in the page, the text tags with the largest top attribute value and the smallest top attribute value, and to position the page area between the two detected text tags as the area where a table exists in the text to be analyzed.
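As a rough illustration of the positioning and detection subunits, the sketch below takes the text tags of one page as `(top, text)` tuples and returns the vertical span between the smallest and largest top attribute values. The tuple representation and the function name `table_region` are assumptions made for the example.

```python
def table_region(tags):
    """Return (smallest_top, largest_top) for one page, or None if it is empty.

    tags: list of (top, text) tuples read from the page's text tags.
    """
    tops = [top for top, _ in tags]
    if not tops:
        return None
    # The page area between these two values is treated as the table area.
    return min(tops), max(tops)


page = [(627, "Serial number"), (627, "Project name"), (655, "1 ...")]
print(table_region(page))  # (627, 655)
```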

Optionally, the positioning subunit is specifically configured to:

Scanning each page in the text to be analyzed separately to determine the page containing a preset form name;

Locating each text label contained in the page currently determined, and reading the value of the top attribute in the text label;

If the current page does not contain at least two of the text tags with the same top attribute value, determining the next page containing the preset form name, and returning to the operation of locating each text label contained in the currently determined page and reading the value of the top attribute in the text label.
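The page-scanning loop described by these three operations could look like the following sketch. Pages are modeled as lists of `(top, text)` tuples and the preset form name is matched by simple substring search; both choices, and the name `find_table_page`, are assumptions for illustration only.

```python
from collections import Counter


def find_table_page(pages, form_name):
    """Return the index of the first page that names the form and has a table row.

    pages:     list of pages, each a list of (top, text) tuples
    form_name: the preset form name to look for
    """
    for index, tags in enumerate(pages):
        if not any(form_name in text for _, text in tags):
            continue  # page does not contain the preset form name
        top_counts = Counter(top for top, _ in tags)
        if any(count >= 2 for count in top_counts.values()):
            return index  # at least two text tags share a top attribute value
        # Otherwise fall through to the next page containing the form name.
    return None


pages = [
    [(100, "Top five customers"), (120, "see next page")],
    [(100, "Top five customers"), (140, "Name"), (140, "Amount")],
]
print(find_table_page(pages, "Top five customers"))  # 1
```

The first page mentions the form name but has no two tags on the same line, so the loop moves on and returns the second page.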

FIG. 6 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in FIG. 6, the terminal device 6 in this embodiment includes a processor 60 and a memory 61. The memory 61 stores computer-readable instructions 62 that can be run on the processor 60, such as a program for acquiring the upstream and downstream relationships of an enterprise. When the processor 60 executes the computer-readable instructions 62, the steps in the foregoing embodiments of the method for acquiring the upstream and downstream relationships of an enterprise are implemented, for example, steps 101 to 105 shown in FIG. 1. Alternatively, when the processor 60 executes the computer-readable instructions 62, the functions of each module / unit in the foregoing device embodiments are implemented, for example, the functions of the units 61 to 65 shown in FIG. 6.

Exemplarily, the computer-readable instructions 62 may be divided into one or more modules / units, which are stored in the memory 61 and executed by the processor 60 to complete this application. The one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 62 in the terminal device 6.

The terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor 60 and a memory 61. Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6, which may include more or fewer components than shown in the figure, combine some components, or use different components. For example, the terminal device may further include an input / output device, a network access device, a bus, and the like.

The processor 60 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 6. Further, the memory 61 may include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is configured to store the computer-readable instructions and other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been output or is to be output.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions that enable a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The foregoing storage media include U disks, mobile hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, compact discs, and other media that can store program code.

The above embodiments are only used to describe the technical solutions of the present application, rather than to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or equivalently replace some of the technical features; such modifications or replacements do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A method for obtaining upstream and downstream relationships of an enterprise, characterized by comprising:

Obtaining the text to be analyzed associated with the enterprise object; the initial format of the text to be analyzed is the portable document pdf format;

Converting a text format of the text to be analyzed from the pdf format to an extensible markup language xml format through a preset text conversion tool;

Locating, according to each xml tag included in the text to be analyzed after conversion, a table existing in the text to be analyzed, and obtaining the median value of each field area in the table; the median value represents the distance between the center position of the field area and the left border of the page, and the field area includes a header area and a body area;

Grouping, based on the median value, the enterprise object identifiers existing in each of the table body regions to obtain a header field matched by each of the enterprise object identifiers, where the header field includes a customer field and a supplier field;

Determining an upstream and downstream relationship between each of the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
The method for obtaining an upstream and downstream relationship of an enterprise according to claim 1, wherein before the grouping, based on the median value, of the enterprise object identifiers existing in each of the table body regions to obtain the header field matched by each of the enterprise object identifiers, the method further includes:

For each page in the text to be analyzed, positioning each text label contained in the page, and reading the value of the top attribute in the text label;

If there are at least two text tags with the same top attribute value, each of the top attribute values in the page is recorded in a preset register;

Find the smallest value of the top attribute in the register, and read the text data in the text label corresponding to the top attribute value;

The text data is determined as one of the header fields in the table.
The method for obtaining an upstream and downstream relationship of an enterprise according to claim 1, wherein the grouping, based on the median value, of the enterprise object identifiers existing in each of the table body regions to obtain the header field matched by each of the enterprise object identifiers includes:

Obtaining the first midline value of each header field in the header area separately;

Obtaining, for each of the table body regions to which the enterprise object identifier belongs, a second midline value of the table body region;

Calculating the relative distances between the enterprise object identifier and each of the header fields according to the first midline value and the second midline value;

And outputting the header field having the smallest relative distance as a header field that matches the enterprise object identifier.
The method for obtaining an upstream and downstream relationship of an enterprise according to claim 1, wherein the locating, according to each xml tag included in the text to be analyzed after conversion, of the table existing in the text to be analyzed and the obtaining of the median value of each field area in the table include:

For each page in the text to be analyzed, positioning each text label contained in the page, and reading the value of the top attribute in the text label;

In the page, detecting the text tags with the largest top attribute value and the smallest top attribute value, and positioning the page area between the two detected text tags as the area where the table exists in the text to be analyzed.
The method for obtaining upstream and downstream relationships of an enterprise according to claim 4, wherein the positioning, for each page in the text to be analyzed, of each text label contained in the page and the reading of the top attribute value in the text label include:

Scanning each page in the text to be analyzed separately to determine the page containing a preset form name;

Locating each text label contained in the page currently determined, and reading the value of the top attribute in the text label;

If the current page does not contain at least two of the text tags with the same top attribute value, determining the next page containing the preset form name, and returning to the operation of locating each text label contained in the currently determined page and reading the value of the top attribute in the text label.
An apparatus for obtaining upstream and downstream relationships of an enterprise, which is characterized by comprising:

An obtaining unit, configured to obtain a text to be analyzed associated with an enterprise object; an initial format of the text to be analyzed is a portable document pdf format;

A conversion unit, configured to convert a text format of the text to be analyzed from the pdf format to an extensible markup language xml format through a preset text conversion tool;

A positioning unit, configured to locate a table existing in the text to be analyzed according to each xml tag included in the text to be analyzed after conversion, and obtain a median value of each field region in the table; the median value indicates the distance between the center position of the field area and the left border of the page, and the field area includes a header area and a body area;

A grouping unit, configured to group the enterprise object identifiers existing in each of the table body regions based on the median value, to obtain a header field matched by each of the enterprise object identifiers, where the header field includes a customer field and a supplier field;

A determining unit, configured to determine an upstream and downstream relationship between each of the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
The device for acquiring upstream and downstream relationships of an enterprise according to claim 6, further comprising:

A reading unit, configured to locate each text label included in the page for each page in the text to be analyzed, and read the value of the top attribute in the text label;

A recording unit, configured to record each of the top attribute values in the page in a preset register if there are at least two of the text tags with the same top attribute value;

A searching unit, configured to search for the smallest top attribute value in the register, and read text data in the text label corresponding to the top attribute value;

A determining unit, configured to determine the text data as one of the header fields in the table.
The device for acquiring an upstream and downstream relationship of an enterprise according to claim 6, wherein the grouping unit comprises:

A first obtaining subunit, configured to respectively obtain a first midline value of each header field in the header area;

A second obtaining subunit, configured to obtain, for each of the table body regions to which the enterprise object identifier belongs, a second midline value of the table body region;

A calculation subunit, configured to respectively calculate a relative distance between the enterprise object identifier and each of the header fields according to the first midline value and the second midline value;

An output subunit, configured to output the header field with the smallest relative distance as a header field that matches the enterprise object identifier.
The device for acquiring an upstream and downstream relationship of an enterprise according to claim 6, wherein the positioning unit comprises:

A positioning subunit, for each page in the text to be analyzed, positioning each text label contained in the page, and reading the value of the top attribute in the text label;

A detection subunit, configured to detect, in the page, the text tags with the largest top attribute value and the smallest top attribute value, and to position the page area between the two detected text tags as the area where a table exists in the text to be analyzed.
The device for acquiring an upstream and downstream relationship of an enterprise according to claim 9, wherein the positioning subunit is specifically configured to:

Scanning each page in the text to be analyzed separately to determine the page containing a preset form name;

Locating each text label contained in the page currently determined, and reading the value of the top attribute in the text label;

If the current page does not contain at least two of the text tags with the same top attribute value, determining the next page containing the preset form name, and returning to the operation of locating each text label contained in the currently determined page and reading the value of the top attribute in the text label.
A terminal device, including a memory and a processor, where the memory stores computer-readable instructions that can be run on the processor, and the following steps are implemented when the processor executes the computer-readable instructions:

Obtaining the text to be analyzed associated with the enterprise object; the initial format of the text to be analyzed is the portable document pdf format;

Converting a text format of the text to be analyzed from the pdf format to an extensible markup language xml format through a preset text conversion tool;

Locating, according to each xml tag included in the text to be analyzed after conversion, a table existing in the text to be analyzed, and obtaining the median value of each field area in the table; the median value represents the distance between the center position of the field area and the left border of the page, and the field area includes a header area and a body area;

Grouping, based on the median value, the enterprise object identifiers existing in each of the table body regions to obtain a header field matched by each of the enterprise object identifiers, where the header field includes a customer field and a supplier field;

Determining an upstream and downstream relationship between each of the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
The terminal device according to claim 11, wherein the processor further implements the following steps when executing the computer-readable instructions:

For each page in the text to be analyzed, positioning each text label contained in the page, and reading the value of the top attribute in the text label;

If there are at least two text tags with the same top attribute value, each of the top attribute values in the page is recorded in a preset register;

Find the smallest value of the top attribute in the register, and read the text data in the text label corresponding to the top attribute value;

The text data is determined as one of the header fields in the table.
The terminal device according to claim 11, wherein the grouping, based on the median value, of the enterprise object identifiers existing in each of the table body regions to obtain the header field matched by each of the enterprise object identifiers includes:

Obtaining the first midline value of each header field in the header area separately;

Obtaining, for each of the table body regions to which the enterprise object identifier belongs, a second midline value of the table body region;

Calculating the relative distances between the enterprise object identifier and each of the header fields according to the first midline value and the second midline value;

And outputting the header field having the smallest relative distance as a header field that matches the enterprise object identifier.
The terminal device according to claim 11, wherein the locating, according to each xml tag included in the text to be analyzed after conversion, of a table existing in the text to be analyzed and the obtaining of the median value of each field area in the table include:

For each page in the text to be analyzed, positioning each text label contained in the page, and reading the value of the top attribute in the text label;

In the page, detecting the text tags with the largest top attribute value and the smallest top attribute value, and positioning the page area between the two detected text tags as the area where the table exists in the text to be analyzed.
The terminal device according to claim 14, wherein the positioning, for each page in the text to be analyzed, of each text label contained in the page and the reading of the top attribute value in the text label include:

Scanning each page in the text to be analyzed separately to determine the page containing a preset form name;

Locating each text label contained in the page currently determined, and reading the value of the top attribute in the text label;

If the current page does not contain at least two of the text tags with the same top attribute value, determining the next page containing the preset form name, and returning to the operation of locating each text label contained in the currently determined page and reading the value of the top attribute in the text label.
A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions implement the following steps when executed by at least one processor:

Obtaining the text to be analyzed associated with the enterprise object; the initial format of the text to be analyzed is the portable document pdf format;

Converting a text format of the text to be analyzed from the pdf format to an extensible markup language xml format through a preset text conversion tool;

Locating, according to each xml tag included in the text to be analyzed after conversion, a table existing in the text to be analyzed, and obtaining the median value of each field area in the table; the median value represents the distance between the center position of the field area and the left border of the page, and the field area includes a header area and a body area;

Grouping, based on the median value, the enterprise object identifiers existing in each of the table body regions to obtain a header field matched by each of the enterprise object identifiers, where the header field includes a customer field and a supplier field;

Determining an upstream and downstream relationship between each of the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
The computer-readable storage medium according to claim 16, wherein when the computer-readable instructions are executed by at least one processor, the following steps are further implemented:

For each page in the text to be analyzed, positioning each text label contained in the page, and reading the value of the top attribute in the text label;

If there are at least two text tags with the same top attribute value, each of the top attribute values in the page is recorded in a preset register;

Find the smallest value of the top attribute in the register, and read the text data in the text label corresponding to the top attribute value;

The text data is determined as one of the header fields in the table.
The computer-readable storage medium according to claim 16, wherein the grouping, based on the median value, of the enterprise object identifiers existing in each of the table body regions to obtain the header field matched by each of the enterprise object identifiers includes:

Obtaining the first midline value of each header field in the header area separately;

Obtaining, for each of the table body regions to which the enterprise object identifier belongs, a second midline value of the table body region;

Calculating the relative distances between the enterprise object identifier and each of the header fields according to the first midline value and the second midline value;

And outputting the header field having the smallest relative distance as a header field that matches the enterprise object identifier.
The computer-readable storage medium according to claim 16, wherein the locating, according to each xml tag included in the text to be analyzed after conversion, of a table existing in the text to be analyzed and the obtaining of the median value of each field area in the table include:

For each page in the text to be analyzed, positioning each text label contained in the page, and reading the value of the top attribute in the text label;

In the page, detecting the text tags with the largest top attribute value and the smallest top attribute value, and positioning the page area between the two detected text tags as the area where the table exists in the text to be analyzed.
The computer-readable storage medium according to claim 19, wherein the positioning, for each page in the text to be analyzed, of each text label contained in the page and the reading of the top attribute value in the text label include:

Scanning each page in the text to be analyzed separately to determine the page containing a preset form name;

Locating each text label contained in the page currently determined, and reading the value of the top attribute in the text label;

If the current page does not contain at least two of the text tags with the same top attribute value, determining the next page containing the preset form name, and returning to the operation of locating each text label contained in the currently determined page and reading the value of the top attribute in the text label.