WO2019242125A1 - Method and apparatus for acquiring upstream and downstream relationships between companies, terminal device and medium - Google Patents

Info

Publication number
WO2019242125A1
WO2019242125A1 PCT/CN2018/105543 CN2018105543W WO2019242125A1 WO 2019242125 A1 WO2019242125 A1 WO 2019242125A1 CN 2018105543 W CN2018105543 W CN 2018105543W WO 2019242125 A1 WO2019242125 A1 WO 2019242125A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
page
value
analyzed
field
Prior art date
Application number
PCT/CN2018/105543
Other languages
French (fr)
Chinese (zh)
Inventor
苏晓明
汪伟
王晓伟
王鸿滨
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present application belongs to the technical field of data processing, and in particular, relates to a method, an apparatus, a terminal device, and a computer-readable storage medium for acquiring an upstream and downstream relationship of an enterprise.
  • Enterprise industry chain information has important reference value in many aspects such as enterprise risk assessment, risk transmission, and industry correlation analysis.
  • the existing public documents of some companies often reveal the industrial chain relationships of some of the companies they are associated with. For example, in a public document such as a prospectus, annual report, and quarterly report issued by an enterprise, users can view the source of materials and sales destinations of products sold by the enterprise, so as to identify some upstream and downstream enterprises associated with the enterprise.
  • the embodiments of the present application provide a method, apparatus, terminal device and medium for obtaining upstream and downstream relationships of enterprises, so as to solve the problem that upstream and downstream relationships of enterprises are obtained from their public documents with relatively low efficiency.
  • a first aspect of the embodiments of the present application provides a method for obtaining an upstream and downstream relationship of an enterprise, including:
  • the initial format of the text to be analyzed is the Portable Document Format (pdf);
  • according to each xml tag included in the text to be analyzed after conversion, a table existing in the text to be analyzed is located, and the midline value of each field area in the table is obtained; the midline value represents the distance between the center position of the field area and the left border of the page, and the field area includes a header area and a body area;
  • the enterprise object identifiers existing in each of the table body areas are grouped to obtain the header field matched by each of the enterprise object identifiers;
  • the header field includes a customer field and a supplier field.
  • a second aspect of the embodiments of the present application provides an apparatus for acquiring upstream and downstream relationships of enterprises, and the apparatus includes units for executing the method for acquiring upstream and downstream relationships of enterprises according to the first aspect.
  • a third aspect of the embodiments of the present application provides a terminal device including a memory and a processor.
  • the memory stores computer-readable instructions executable on the processor, and when the processor executes the computer-readable instructions, the steps of the method for obtaining upstream and downstream relationships of enterprises according to the first aspect are implemented.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the method for obtaining upstream and downstream relationships of enterprises according to the first aspect are implemented.
  • the machine can recognize the xml tags
  • the location area to which the table belongs is determined, which realizes automatic positioning of the table.
  • each field value included in the table exists in text form in an xml tag; therefore, for the enterprise object identifiers that exist in the table body area, determining the customer field or supplier field that each enterprise object identifier matches based on the midline value of each field area improves the accuracy of matching each field value in the table body area to its header field.
  • FIG. 1 is an implementation flowchart of a method for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application
  • FIG. 2 is a specific implementation flowchart of a method S103 for obtaining an upstream and downstream relationship of an enterprise according to an embodiment of the present application;
  • FIG. 3 is a specific implementation flowchart of a method S1031 for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application;
  • FIG. 4 is a specific implementation flowchart of a method S104 for obtaining an upstream and downstream relationship of an enterprise according to an embodiment of the present application;
  • FIG. 5 is an implementation flowchart of a method for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application
  • FIG. 6 is a schematic diagram of an apparatus for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application
  • FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.
  • FIG. 1 shows an implementation flow of a method for acquiring an upstream and downstream relationship of an enterprise according to an embodiment of the present application.
  • the method flow includes steps S101 to S105.
  • the specific implementation principle of each step is as follows:
  • S101 Acquire a text to be analyzed associated with an enterprise object; an initial format of the text to be analyzed is a portable document pdf format.
  • the texts to be analyzed are public documents issued by the enterprise, including quarterly reports, annual reports, and prospectuses. The text to be analyzed is downloaded periodically from the corresponding public website according to preset website information.
  • PDF Portable Document Format
  • S102 Convert a text format of the text to be analyzed from the pdf format to an extensible markup language xml format through a preset text conversion tool.
  • the text to be analyzed in pdf format is imported into a preset text conversion tool, and after a format conversion instruction issued by the user is detected, the text to be analyzed is output in the eXtensible Markup Language (xml) format.
  • xml eXtensible Markup Language
  • the above text conversion tool can be, for example, a Foxit converter, a PDF all-around converter, or All Office Converter.
  • the text to be analyzed based on the xml format may be, for example:
  • S103 Locate a table existing in the text to be analyzed according to each xml tag included in the text to be analyzed after conversion, and obtain a midline value of each field area in the table; the midline value represents the distance between the center position of the field area and the left border of the page.
  • the field area includes the header area and the body area.
  • the text to be analyzed based on the xml format includes text tags <text>, and each <text> tag also carries attribute values such as top, width, height, and font. It is worth noting that, in addition to the text tag <text>, paragraph tags or other types of tags may exist in the text to be analyzed based on the xml format; these are not shown in the above example.
  • the text data corresponding to each text tag is the value of a field area in the table. According to the top attribute values of the text tags, the position of each table existing in the text to be analyzed can be located.
  • FIG. 2 shows a specific implementation process of the method S103 for obtaining an upstream and downstream relationship of an enterprise according to an embodiment of the present application, which is detailed as follows:
  • each text to be analyzed associated with the enterprise object may be a pdf text displayed on a single page, or a pdf text displayed on multiple pages. After the text format conversion process is performed, the pdf text of each page will be converted to the corresponding page of xml text.
  • the text data of each field in the table corresponds to the text data in a text tag <text>.
  • the top attribute value of each text tag indicates the distance between the position, in the current page, of the text data corresponding to the text tag and the top of the page. It follows that if text data is in different rows of the text to be analyzed, the top attribute values of the corresponding text tags differ. In addition, the higher the text data appears in the current page, the smaller the top attribute value of the corresponding text tag.
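  • as a minimal sketch of reading these attributes, the following Python parses one page of converter output and collects the top, left and width values of every text tag. The sample page, its `page` root element and the helper name `read_text_tags` are hypothetical; the actual tag layout depends on the text conversion tool used.

```python
import xml.etree.ElementTree as ET

# Hypothetical converter output for one page; real output depends on the tool.
SAMPLE_PAGE = """
<page number="1">
  <text top="100" left="50" width="80" height="12" font="0">Customer</text>
  <text top="100" left="300" width="80" height="12" font="0">Supplier</text>
  <text top="120" left="40" width="100" height="12" font="0">Crocodile Group</text>
</page>
"""

def read_text_tags(page_xml):
    """Return one dict per <text> tag with its layout attributes and text data."""
    root = ET.fromstring(page_xml)
    tags = []
    for node in root.iter("text"):
        tags.append({
            "top": int(node.get("top")),      # distance from the top of the page
            "left": int(node.get("left")),    # distance from the left of the page
            "width": int(node.get("width")),  # width of the field area
            "text": (node.text or "").strip(),
        })
    return tags

tags = read_text_tags(SAMPLE_PAGE)
```

  • in this sample the two tags with top="100" lie in the same row, and the tag with top="120" lies in a lower row.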
  • FIG. 3 shows a specific implementation process of the method S1031 for obtaining an upstream and downstream relationship of an enterprise provided by an embodiment of the present application, which is detailed as follows:
  • S10311 Scan each page in the text to be analyzed separately to determine the page containing a preset form name.
  • the names of the tables included in the texts to be analyzed conform to a preset format. Each page in the text to be analyzed is scanned according to a preset regular expression, which describes the pattern rule that the table names follow.
  • each page in the text to be analyzed that matches the regular expression is selected. After every page in the text to be analyzed has been checked, the pages containing a table name are determined in turn.
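  • the page scan of S10311 could be sketched as follows. The pattern `Table\s+\d+` is purely an assumption for illustration, since the text only states that table names follow a preset format; `pages_with_table_name` is likewise a hypothetical helper.

```python
import re

# Hypothetical pattern; the source only says table names follow a preset format.
TABLE_NAME_PATTERN = re.compile(r"Table\s+\d+")

def pages_with_table_name(pages):
    """Return the 1-based numbers of the pages whose text contains a table name."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if TABLE_NAME_PATTERN.search(text)
    ]

pages = [
    "Company overview and business description.",
    "Table 1 Top five customers and suppliers",
    "Notes to the financial statements.",
]
matching = pages_with_table_name(pages)
```

  • only the pages returned by this scan would then be passed to the top-attribute analysis of S10312.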
  • S10312 For the currently determined page, locate each text label contained in the page, and read the value of the top attribute in the text label.
  • a plurality of pages including a preset table name in the text to be analyzed are determined in advance. Compared with directly reading the top attribute value of each text tag in every page to decide whether the page contains a table, this improves the search efficiency of tables. After each page to which a table in the text to be analyzed belongs has been preliminarily located, the specific position of the table is further determined according to the top attribute values, which avoids treating a page that contains only a table name but no table as a page with a table; the embodiment of the present application therefore improves the accuracy of table positioning.
  • the text tags with the highest top attribute value and the smallest top attribute value are selected.
  • the text data corresponding to these two text tags lies in the last row and the first row of the table, respectively. Therefore, in the embodiment of the present application, according to the positions in the current page of the text tags with the highest and the smallest top attribute values, the page positions of the last row and the first row of the table can be determined. The page area between these two page positions is located as the area where a table exists.
  • the K text tags that appear consecutively (K is an integer greater than zero) are determined as the xml parameters corresponding to a table in the text to be analyzed.
  • the text tags with the highest top attribute value and the smallest top attribute value are detected, and the page area between the two determined text tags is located as the area where the table exists. In this manner, the tables existing in the current page can be located.
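  • a rough sketch of this positioning step is given below. It assumes, borrowing from the register-based embodiment described later, that a table row appears as two or more text tags sharing one top value; the text does not state how the consecutive text tags are chosen, so this selection rule and the helper name `locate_table_region` are assumptions.

```python
from collections import Counter

def locate_table_region(tags):
    """Return (first_row_top, last_row_top) for the table rows on a page, or None."""
    counts = Counter(tag["top"] for tag in tags)
    # Assumption: a table row shows up as two or more tags with the same top value.
    row_tops = [top for top, count in counts.items() if count >= 2]
    if not row_tops:
        return None
    # Smallest top = first row of the table, highest top = last row.
    return min(row_tops), max(row_tops)

page_tags = [
    {"top": 50, "text": "Table 1 Top five customers and suppliers"},
    {"top": 100, "text": "Customer"},
    {"top": 100, "text": "Supplier"},
    {"top": 120, "text": "Crocodile Group"},
    {"top": 120, "text": "Supplier A"},
    {"top": 140, "text": "Customer B"},
    {"top": 140, "text": "Supplier B"},
]
region = locate_table_region(page_tags)
```

  • with these hypothetical tags, the single-tag title line at top 50 is left outside the region, and the table is taken to span the page area between top 100 and top 140.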
  • the value of the left attribute indicates the distance between the position of the text data corresponding to the text label on the current page and the left side of the page
  • the value of the width attribute indicates the width of the field area corresponding to the text tag in the table, and the midline value indicates the distance between the center line of the field area on the current page and the left side of the page.
  • Value[left] indicates the left attribute value of the text tag corresponding to the field area, and Value[width] indicates the width attribute value of the text tag corresponding to the field area.
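  • the formula that combines Value[left] and Value[width] does not survive in this text. From the definitions given here, the center line of a field area lies half a width to the right of its left edge, so a geometric reading (an assumption, not a quotation of the original formula) is the following, with `midline` as a hypothetical helper name:

```python
def midline(left, width):
    """Distance from the left side of the page to the center line of a field area."""
    return left + width / 2

# A field area starting 50 units from the left with a width of 80 units.
customer_header_mid = midline(50, 80)
```

  • two field areas in the same table column should give similar midline values even when their text has different lengths, which is what makes the value usable for grouping.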
  • S104 Based on the median value, group the enterprise object identifiers existing in each of the table body regions to obtain a header field matched by each of the enterprise object identifiers.
  • the header field includes a customer field and a supplier field.
  • the table includes a header area and a body area.
  • the header area includes the field area to which the first row of text data in the table belongs; the body area includes the other field areas in the table except the header area.
  • the data column associated with the enterprise object identifier in each table is identified through a preset recognition algorithm.
  • the enterprise object identification includes, but is not limited to, the name of the enterprise object, the abbreviation of the company name, or the industry common name of the enterprise object.
  • the preset recognition algorithm may be, for example, as follows: multiple enterprise object identifiers collected in advance are acquired and stored in an identifier list; for the text data corresponding to each text tag, it is judged whether the text data matches any enterprise object identifier in the identifier list; if the text data matches any enterprise object identifier in the identifier list, the data column to which the text data belongs is determined to be a data column associated with enterprise object identifiers.
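  • the list lookup described above might look like the sketch below. Exact string comparison is an assumption made for brevity; the text does not say how a match is decided, and `find_enterprise_cells` is a hypothetical helper.

```python
def find_enterprise_cells(tags, known_identifiers):
    """Return the text tags whose text data matches an identifier in the list."""
    known = set(known_identifiers)  # the pre-collected identifier list
    return [tag for tag in tags if tag["text"] in known]

identifier_list = ["Crocodile Group", "Wangwang Co., Ltd."]
body_tags = [
    {"top": 120, "left": 40, "width": 100, "text": "Crocodile Group"},
    {"top": 120, "left": 200, "width": 60, "text": "12.5%"},
]
enterprise_cells = find_enterprise_cells(body_tags, identifier_list)
```

  • the tags returned here are the body-area cells whose midline values are compared against the header fields in S104.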
  • the corresponding header field is usually the customer field or the supplier field. Since it is difficult to see the correspondence between each enterprise object identifier and its header field directly in the text to be analyzed based on the xml format, in the embodiment of the present application the enterprise object identifiers are grouped based on the midline value of the field area to which each enterprise object identifier belongs, to determine whether each enterprise object identifier is body data in the "Customer" field data column or body data in the "Supplier" field data column.
  • FIG. 4 shows a specific implementation process of the method S104 for obtaining an upstream and downstream relationship of an enterprise provided by an embodiment of the present application, which is detailed as follows:
  • the text data corresponding to each text tag with the smallest top attribute value is a header field of the table. Therefore, after the midline value of each text tag with the smallest top attribute value is calculated, it is output as the midline value of the header field corresponding to that text tag.
  • if the text data corresponding to a text tag is detected to contain an enterprise object identifier, the field area corresponding to the text tag is determined to be a body area, so the midline value of the field area corresponding to the text tag is output as the midline value of a body area in the current table.
  • the first midline value refers to the midline value of the header area
  • the second midline value refers to the midline value of the body area.
  • the “first” here is only for convenience of expression and reference, and does not mean that there must be a corresponding first midline value in the specific implementation of the present application.
  • the "second” in the second midline value is only for convenience of expression and reference, and does not mean that there will be a second midline value corresponding to it in the specific implementation of the present application.
  • S1043 Calculate the relative distances between the enterprise object identifier and each of the header fields according to the first midline value and the second midline value.
  • one first midline value is obtained for each header field in the header area.
  • for each enterprise object identifier, the absolute value of the difference between the second midline value of the body area to which it belongs and each first midline value is calculated separately, and each absolute value is output as the relative distance between the enterprise object identifier and the corresponding header field; for example, D1 = abs(Line_mid[customer] - Line_mid[crocodile group]) and D2 = abs(Line_mid[supplier] - Line_mid[crocodile group]).
  • abs() is a preset absolute value function
  • Line_mid[customer] is the first midline value of the header area to which the “customer” header field belongs
  • Line_mid[supplier] is the first midline value of the header area to which the “supplier” header field belongs
  • Line_mid[crocodile group] is the second midline value of the body area to which the crocodile group belongs.
  • S1044 Output the header field with the smallest relative distance as a header field that matches the enterprise object identifier.
  • one relative distance is obtained for each header field.
  • the relative distance with the smallest value is selected, the first midline value associated with that relative distance is determined, and the header field corresponding to that first midline value is output as the header field that matches the enterprise object identifier.
  • for example, if the relative distance D1 between the body field "crocodile group" and the header field "customer" is 3, and the relative distance D2 between the body field "crocodile group" and the header field "supplier" is 4, then the header field with the smallest relative distance is "customer". The header field "customer" is therefore output as the header field that matches the enterprise object identifier; that is, the data column to which the enterprise object identifier belongs is determined to be the data column in which the field "customer" is located, so that each enterprise object identifier in the table is grouped accurately.
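  • steps S1043 and S1044 amount to a nearest-midline lookup, sketched below. The midline values are hypothetical and the helper name `match_header` is not from the source.

```python
def match_header(body_midline, header_midlines):
    """Return the header field whose first midline value is nearest the body midline."""
    # Relative distance = absolute difference between the two midline values.
    distances = {
        name: abs(first_mid - body_midline)
        for name, first_mid in header_midlines.items()
    }
    return min(distances, key=distances.get)

# Hypothetical first midline values of the two header fields.
header_midlines = {"customer": 90.0, "supplier": 340.0}
# Hypothetical second midline value of the body area holding "crocodile group".
matched = match_header(93.0, header_midlines)
```

  • here the relative distance to "customer" is 3 and the relative distance to "supplier" is 247, so the identifier is grouped under the customer field.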
  • S105 Determine upstream and downstream relationships between the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
  • the enterprise object identifiers corresponding to the respective text labels with the same top attribute value are displayed in the same row of information records of the two-dimensional data table created in advance.
  • the header field of the two-dimensional data table includes a customer field and a supplier field.
  • the data column to which each enterprise object identifier belongs in the two-dimensional data table is adjusted so that each enterprise object identifier and its matching header field are located in the same data column.
  • the upstream and downstream hierarchical relationship between various enterprise objects can be determined.
  • the Crocodile Group is a downstream level relative to Wangwang Co., Ltd.
  • the Spring, Summer, Autumn and Winter Group is an upstream level relative to the Holawang Group.
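  • one way to turn the grouped identifiers into relationships is sketched below. It assumes the table lists the customers and suppliers of the enterprise that issued the document, so that suppliers are upstream of that enterprise and customers are downstream; the text does not spell out this rule, and "Supplier A" and the helper name `upstream_downstream_edges` are hypothetical.

```python
def upstream_downstream_edges(reporting_enterprise, customers, suppliers):
    """Return (upstream, downstream) pairs implied by one enterprise's table."""
    # Assumption: suppliers sit upstream of the reporting enterprise.
    edges = [(supplier, reporting_enterprise) for supplier in suppliers]
    # Assumption: customers sit downstream of the reporting enterprise.
    edges += [(reporting_enterprise, customer) for customer in customers]
    return edges

edges = upstream_downstream_edges(
    "Wangwang Co., Ltd.",
    customers=["Crocodile Group"],
    suppliers=["Supplier A"],
)
```

  • under this reading, the pair ("Wangwang Co., Ltd.", "Crocodile Group") expresses that the Crocodile Group is at a downstream level relative to Wangwang Co., Ltd.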
  • determining the customer field or supplier field matched by an enterprise object identifier based on the midline value of each field area can improve the accuracy of matching each field value in the table body area to its header field. Because there is a clear upstream and downstream relationship between customers and suppliers, the industry chain information between enterprise objects can be obtained from the enterprise object identifiers that match the customer field and the supplier field, thereby improving the efficiency of acquiring upstream and downstream relationships of enterprises.
  • the method further includes:
  • the text to be analyzed includes multiple pages. For each page based on the xml format, each text tag <text> contained in the page is located, and the top attribute value of each text tag is read.
  • each top attribute value that is subsequently read is recorded in a preset register; once all top attribute values have been recorded, the smallest top attribute value in the register is found.
  • the text data corresponding to the text label with the smallest top attribute value is "Serial Number” and "Project Name”. Therefore, "Serial Number” and "Project Name” are output as two header fields in the current table, respectively.
  • each page of the text to be analyzed is traversed to locate each text tag included in the page. Only when a page contains at least two text tags with the same top attribute value are the top attribute values in that page recorded in the preset register. This avoids reading and writing the text tags of every page and quickly locates the pages to which tables belong, which improves the search efficiency of tables in the text to be analyzed. As a result, the acquisition efficiency of the upstream and downstream relationships of enterprises is also improved.
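  • a literal reading of this embodiment can be sketched as follows. A plain Python list stands in for the "preset register", and the helper name `header_fields` is hypothetical. Note that under this reading any line above the table that has a smaller top value would be picked up instead, so the sketch assumes the page region passed in starts at the table.

```python
from collections import Counter

def header_fields(tags):
    """Return the text data of the topmost row, or [] if no two tags share a top."""
    counts = Counter(tag["top"] for tag in tags)
    if not any(count >= 2 for count in counts.values()):
        return []  # the page is skipped: no two tags have the same top value
    register = [tag["top"] for tag in tags]  # stands in for the preset register
    smallest_top = min(register)
    return [tag["text"] for tag in tags if tag["top"] == smallest_top]

table_tags = [
    {"top": 100, "text": "Serial Number"},
    {"top": 100, "text": "Project Name"},
    {"top": 120, "text": "1"},
    {"top": 120, "text": "Raw materials"},
]
fields = header_fields(table_tags)
```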
  • FIG. 6 shows a structural block diagram of the apparatus for acquiring upstream and downstream relationships of enterprises provided in the embodiment of the present application. Only the parts related to the embodiments of the present application are shown.
  • the device includes:
  • the obtaining unit 61 is configured to obtain a text to be analyzed associated with an enterprise object; an initial format of the text to be analyzed is a portable document pdf format.
  • the conversion unit 62 is configured to convert a text format of the text to be analyzed from the pdf format to an extensible markup language xml format by using a preset text conversion tool.
  • a positioning unit 63 configured to locate a table existing in the text to be analyzed according to each xml tag included in the text to be analyzed after conversion, and obtain a midline value of each field area in the table; the midline value represents the distance between the center position of the field area and the left border of the page.
  • the field area includes the header area and the body area.
  • a grouping unit 64 is configured to group the enterprise object identifiers existing in each of the table body areas based on the midline value to obtain a header field matched by each of the enterprise object identifiers.
  • the header field includes a customer field and a supplier field.
  • a determining unit 65 is configured to determine an upstream and downstream relationship between each of the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
  • the apparatus for acquiring upstream and downstream relationships of the enterprise further includes:
  • the reading unit is configured to locate, for each page in the text to be analyzed, each text label included in the page, and read a top attribute value in the text label.
  • the recording unit is configured to record each of the top attribute values in the page in a preset register if there are at least two of the text tags with the same top attribute value.
  • the searching unit is configured to search for the smallest top attribute value in the register, and read text data in the text label corresponding to the top attribute value.
  • a determining unit configured to determine the text data as one of the header fields in the table.
  • the grouping unit 64 includes:
  • the first obtaining subunit is configured to obtain a first center line value of each header field in the header area separately.
  • the second obtaining subunit is configured to obtain, for each of the table body regions to which the enterprise object identifier belongs, a second midline value of the table body region.
  • a calculation subunit configured to respectively calculate a relative distance between the enterprise object identifier and each of the header fields according to the first midline value and the second midline value.
  • An output subunit configured to output the header field with the smallest relative distance as a header field that matches the enterprise object identifier.
  • the positioning unit 63 includes:
  • a positioning subunit configured to locate each text label contained in the page for each page in the text to be analyzed, and read the value of the top attribute in the text label.
  • a detection subunit configured to detect the text tags with the highest top attribute value and the smallest top attribute value in the page, and to locate the page area between the two determined text tags as an area where a table exists in the text to be analyzed.
  • the positioning subunit is specifically configured to:
  • FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.
  • the terminal device 6 in this embodiment includes a processor 60 and a memory 61.
  • the memory 61 stores computer-readable instructions 62 that can be run on the processor 60, such as a program for acquiring upstream and downstream relationships of enterprises.
  • when the processor 60 executes the computer-readable instructions 62, the steps in the embodiments of the method for obtaining upstream and downstream relationships of enterprises are implemented, for example, steps 101 to 105 shown in FIG. 1.
  • alternatively, when the processor 60 executes the computer-readable instructions 62, the functions of each module / unit in the foregoing apparatus embodiments are implemented, for example, the functions of the units 61 to 65 shown in FIG. 6.
  • the computer-readable instructions 62 may be divided into one or more modules / units, and the one or more modules / units are stored in the memory 61 and executed by the processor 60 to carry out the present application.
  • the one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 62 in the terminal device 6.
  • the terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device may include, but is not limited to, a processor 60 and a memory 61.
  • FIG. 7 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6, which may include more or fewer components than shown in the figure, or combine some components, or use different components.
  • the terminal device may further include an input / output device, a network access device, a bus, and the like.
  • the processor 60 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • CPU Central Processing Unit
  • DSP Digital Signal Processor
  • ASIC Application-Specific Integrated Circuit
  • FPGA Field-Programmable Gate Array
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6.
  • the memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 6.
  • the memory 61 may further include both an internal storage unit of the terminal device 6 and an external storage device.
  • the memory 61 is configured to store the computer-readable instructions and other programs and data required by the terminal device.
  • the memory 61 may also be used to temporarily store data that has been output or is to be output.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, compact discs, and other media that can store program code.

Abstract

The present solution provides a method and an apparatus for acquiring upstream and downstream relationships between companies, a terminal device and a medium, which are applicable to the technical field of data processing. The method comprises: converting the format of a text to be analyzed from pdf to xml format; according to each xml tag comprised in said text after conversion, positioning a form existing in said text, and acquiring a median value of each field region in the form; grouping, on the basis of the median value, the company object identifiers existing in each form body region to obtain form header fields matching the company object identifiers; and determining the upstream and downstream relationship between company objects according to the company object identifiers that match customer fields and vendor fields, respectively. The present solution achieves the automatic positioning of a form, and can obtain industrial chain information between company objects according to the company object identifiers that match the customer fields and the supplier fields, and improves the acquisition efficiency of the upstream and downstream relationship of companies.

Description

Method, Apparatus, Terminal Device, and Medium for Acquiring Upstream and Downstream Relationships of Enterprises
This application claims priority to Chinese patent application No. 201810630801.4, filed with the Chinese Patent Office on June 19, 2018 and entitled "Method, Terminal Device, and Medium for Acquiring Upstream and Downstream Relationships of Enterprises", the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the technical field of data processing, and in particular relates to a method, an apparatus, a terminal device, and a computer-readable storage medium for acquiring upstream and downstream relationships of enterprises.
Background
Industrial chain information about enterprises is an important reference in enterprise risk assessment, risk transmission analysis, industry correlation analysis, and many other areas. The public documents of an enterprise often reveal its industrial chain relationships with other enterprises. For example, from public documents such as a prospectus, annual report, or quarterly report issued by an enterprise, a user can see where the materials for the enterprise's products come from and where the products are sold, and can thereby identify some of the upstream and downstream enterprises associated with it.
However, because the layouts of public documents such as quarterly reports, annual reports, and prospectuses are relatively complex, the industrial chain information they contain can currently only be identified and extracted manually, so the efficiency of acquiring upstream and downstream relationships of enterprises is low.
Technical Problem
In view of this, the embodiments of the present application provide a method, an apparatus, a terminal device, and a medium for acquiring upstream and downstream relationships of enterprises, so as to solve the problem that such relationships are currently acquired inefficiently from the various public documents of enterprises.
Technical Solution
A first aspect of the embodiments of the present application provides a method for acquiring upstream and downstream relationships of enterprises, including:
acquiring a text to be analyzed that is associated with an enterprise object, where the initial format of the text to be analyzed is the Portable Document Format (pdf);
converting, by means of a preset text conversion tool, the text to be analyzed from the pdf format to the Extensible Markup Language (xml) format;
locating, according to the xml tags contained in the converted text to be analyzed, a table existing in the text to be analyzed, and acquiring the midline value of each field area in the table, where the midline value represents the distance between the center of the field area and the left border of the page, and the field areas include header areas and table body areas;
grouping, on the basis of the midline values, the enterprise object identifiers existing in the table body areas to obtain the header field matched by each enterprise object identifier, where the header fields include a customer field and a supplier field; and
determining the upstream and downstream relationships between the enterprise objects according to the enterprise object identifiers matched by the customer field and by the supplier field, respectively.
A second aspect of the embodiments of the present application provides an apparatus for acquiring upstream and downstream relationships of enterprises, the apparatus including units for performing the method for acquiring upstream and downstream relationships of enterprises according to the first aspect.
A third aspect of the embodiments of the present application provides a terminal device including a memory and a processor. The memory stores computer-readable instructions executable on the processor, and the processor, when executing the computer-readable instructions, implements the steps of the method for acquiring upstream and downstream relationships of enterprises according to the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the steps of the method for acquiring upstream and downstream relationships of enterprises according to the first aspect.
Beneficial Effects
In the embodiments of the present application, public documents such as prospectuses, annual reports, and quarterly reports are originally obtained in pdf format. Converting these documents to xml format makes it possible to determine the page region occupied by a table from machine-readable xml tags, so that tables are located automatically. In these documents, every field value of a table appears as text inside an xml tag. For the enterprise object identifiers in the table body areas, determining the matching customer field or supplier field from the midline value of each field area therefore improves the accuracy with which each field value in the table body is matched to its header field. Because customers and suppliers stand in a clear upstream and downstream relationship, the enterprise object identifiers matched by the customer field and the supplier field reveal the industrial chain information between enterprise objects, which improves the efficiency of acquiring upstream and downstream relationships of enterprises.
Brief Description of the Drawings
FIG. 1 is an implementation flowchart of the method for acquiring upstream and downstream relationships of enterprises provided by an embodiment of the present application;
FIG. 2 is a specific implementation flowchart of step S103 of the method for acquiring upstream and downstream relationships of enterprises provided by an embodiment of the present application;
FIG. 3 is a specific implementation flowchart of step S1031 of the method for acquiring upstream and downstream relationships of enterprises provided by an embodiment of the present application;
FIG. 4 is a specific implementation flowchart of step S104 of the method for acquiring upstream and downstream relationships of enterprises provided by an embodiment of the present application;
FIG. 5 is an implementation flowchart of the method for acquiring upstream and downstream relationships of enterprises provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the apparatus for acquiring upstream and downstream relationships of enterprises provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the terminal device provided by an embodiment of the present application.
Embodiments of the Invention
To explain the technical solution described in this application, specific embodiments are described below.
FIG. 1 shows the implementation flow of the method for acquiring upstream and downstream relationships of enterprises provided by an embodiment of the present application. The flow includes steps S101 to S105, whose implementation principles are as follows:
S101: Acquire a text to be analyzed that is associated with an enterprise object; the initial format of the text to be analyzed is the Portable Document Format (pdf).
In this embodiment, the texts to be analyzed are public documents issued by enterprises, including quarterly reports, annual reports, and prospectuses. They are downloaded periodically from the corresponding public websites according to preset website information. Because enterprises output such public documents in Portable Document Format (PDF) when creating them, the texts to be analyzed that are downloaded from these websites are all in PDF format.
S102: Convert the text to be analyzed from the pdf format to the Extensible Markup Language (xml) format by means of a preset text conversion tool.
Each text to be analyzed in pdf format is imported into the preset text conversion tool, and once a format conversion instruction issued by the user is detected, the tool outputs the text to be analyzed in eXtensible Markup Language (xml) format. The text conversion tool may be, for example, the Foxit converter, PDF All-Round Converter (PDF全方位转换器), or All Office Converter. By way of example, the text to be analyzed in xml format may look as follows:
<text top="538" left="157" width="214" height="22" font="10">(3) Other important matters</text>
<text top="584" left="171" width="596" height="19" font="12">As of December 31, 2005, the details of the major uncompleted engineering contracts signed by the company are as follows:</text>
<text top="627" left="132" width="27" height="13" font="9">No.</text>
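As a minimal sketch of how such converted text might be read by a program, the following Python snippet parses <text> tags with the standard library. The enclosing <page> element and the function name read_text_tags are assumptions made for illustration; the source shows only the <text> lines and does not specify the surrounding document structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical page wrapper around <text> tags like those in the example above.
SAMPLE_PAGE = """<page number="1">
<text top="538" left="157" width="214" height="22" font="10">(3) Other important matters</text>
<text top="627" left="132" width="27" height="13" font="9">No.</text>
</page>"""


def read_text_tags(page_xml):
    """Return one dict per <text> tag with its layout attributes and text."""
    root = ET.fromstring(page_xml)
    tags = []
    for el in root.iter("text"):
        tags.append({
            "top": int(el.get("top")),
            "left": int(el.get("left")),
            "width": int(el.get("width")),
            "height": int(el.get("height")),
            "text": "".join(el.itertext()).strip(),
        })
    return tags


tags = read_text_tags(SAMPLE_PAGE)
```

Each dictionary keeps the top, left, and width values that the later steps rely on.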
S103: According to the xml tags contained in the converted text to be analyzed, locate the table existing in the text to be analyzed and acquire the midline value of each field area in the table. The midline value represents the distance between the center of the field area and the left border of the page, and the field areas include header areas and table body areas.
As the example above shows, the text to be analyzed in xml format contains <text> tags, and each <text> tag carries attributes such as top, left, width, height, and font. Note that besides <text> tags, the xml text may also contain paragraph tags or other types of tags, which are not shown in the example.
In this embodiment, the text data of each text tag is the value of one field area in a table. The position of each table in the text to be analyzed can be located from the top attribute values of the text tags.
Specifically, FIG. 2 shows the implementation flow of step S103 of the method for acquiring upstream and downstream relationships of enterprises provided by an embodiment of the present application, detailed as follows:
S1031: For each page of the text to be analyzed, locate the text tags contained in the page and read the top attribute value of each text tag.
In this embodiment, each text to be analyzed that is associated with an enterprise object may be a single-page or a multi-page pdf text. After format conversion, each page of pdf text becomes one corresponding page of xml text.
After a table in the text to be analyzed is converted to xml format, the text data of each field in the table corresponds to the text data of a <text> tag. For the xml text of each page, the top attribute value of every text tag it contains is read. The top attribute value is the distance between the position of the tag's text data on the current page and the top of the page. Text data on different lines of the text to be analyzed therefore has different top attribute values, and the higher the text data appears on the page, the smaller the top attribute value of its text tag.
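Since tags on the same line share a top value, grouping tags by top recovers the visual rows of a page. The sketch below assumes each tag is a dictionary with "top", "left", and "text" keys (a representation chosen here for illustration, not taken from the source) and that tops on one line are exactly equal; real converter output may need a small tolerance.

```python
from collections import defaultdict


def rows_by_top(tags):
    """Group text tags into rows keyed by their top value.

    Rows are returned from the top of the page downward, and the tags
    inside each row are ordered from left to right.
    """
    rows = defaultdict(list)
    for tag in tags:
        rows[tag["top"]].append(tag)
    return [sorted(rows[top], key=lambda t: t["left"]) for top in sorted(rows)]


example_tags = [
    {"top": 627, "left": 300, "text": "Customer"},
    {"top": 650, "left": 140, "text": "1"},
    {"top": 627, "left": 132, "text": "No."},
]
example_rows = rows_by_top(example_tags)
```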
As one embodiment of the present application, FIG. 3 shows the implementation flow of step S1031 of the method for acquiring upstream and downstream relationships of enterprises, detailed as follows:
S10311: Scan each page of the text to be analyzed to determine the pages that contain a preset table name.
In this embodiment, because the texts to be analyzed are public documents such as annual reports, quarterly reports, and prospectuses, the name of every table they contain follows a preset format. Each page of the text to be analyzed is scanned with a preset regular expression that describes the pattern the table names follow.
If text data matching the regular expression is found on the current page, the page is determined to contain a preset table name and is selected. After all pages of the text to be analyzed have been examined, the pages containing table names are determined in turn.
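The page scan can be sketched as follows. The pattern TABLE_NAME_PATTERN is purely hypothetical: the source says only that a preset regular expression describes the table names, without giving one, so the English pattern below is a placeholder for whatever expression fits the actual documents.

```python
import re

# Hypothetical pattern; the actual preset expression is not given in the source.
TABLE_NAME_PATTERN = re.compile(
    r"top (five|ten|\d+) (customers|suppliers)", re.IGNORECASE
)


def pages_with_table_name(pages):
    """Return the indices of pages whose text matches the table-name pattern."""
    return [i for i, text in enumerate(pages) if TABLE_NAME_PATTERN.search(text)]


sample_pages = [
    "Company overview",
    "Sales to the top five customers in 2005",
    "Purchases from the Top 5 suppliers",
]
candidate_pages = pages_with_table_name(sample_pages)
```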
S10312: For the currently determined page, locate the text tags it contains and read the top attribute value of each text tag.
S10313: If the current page does not contain at least two text tags with the same top attribute value, determine the next page that contains the preset table name and return to the operation of locating the text tags contained in the currently determined page and reading their top attribute values.
If a currently determined page does not contain at least two text tags with the same top attribute value, the page contains no table. The next page containing the preset table name is therefore read, and the flow returns to step S10312.
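The check in S10313 amounts to asking whether any top value occurs at least twice on a page. A minimal sketch, assuming each candidate page is given as a (page number, list of top values) pair; the function names are illustrative only:

```python
from collections import Counter


def has_shared_top(tops):
    """True if at least two text tags on the page share the same top value."""
    return any(count >= 2 for count in Counter(tops).values())


def first_page_with_table(candidate_pages):
    """Walk the candidate pages in order and return the first page number
    that passes the shared-top check, or None if no page does."""
    for number, tops in candidate_pages:
        if has_shared_top(tops):
            return number
    return None
```

For example, a page whose tags have tops 538, 584, and 627 fails the check, while a page with tops 627, 627, and 650 passes.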
In this embodiment, character matching with a regular expression consumes few system resources. Determining in advance which pages of the text to be analyzed contain a preset table name therefore makes table lookup more efficient than reading the top attribute values of all text tags on every page to decide whether the page contains a table. After the pages to which tables belong have been located in this preliminary way, the top attribute values are used to determine the exact position of each table, which rules out pages that contain a table name but no corresponding table. This embodiment thus improves the accuracy of table location.
S1032: On the page, detect the text tag with the largest top attribute value and the text tag with the smallest top attribute value, and locate the page region between these two text tags as the region occupied by a table in the text to be analyzed.
Among the text tags contained in the current page, the tag with the largest top attribute value and the tag with the smallest top attribute value are selected. In the text to be analyzed, the text data of the tag with the smallest top attribute value lies in the first row of the table, and the text data of the tag with the largest top attribute value lies in the last row. In this embodiment, the page positions of the first and last rows of the table can therefore be determined from the positions of these two tags on the current page, and the page region between the two positions is located as the region occupied by a table.
In particular, before the text tags with the largest and smallest top attribute values are detected on the current page, the page is first checked for multiple consecutively occurring text tags. If K consecutive text tags exist (K being an integer greater than zero), these K tags are taken as the xml content corresponding to one table in the text to be analyzed. For the xml content of each table, the tags with the largest and the smallest top attribute values are detected, and the page region between them is located as the region occupied by that table. In this way, every table on the current page can be located.
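One possible reading of this step is sketched below: runs of consecutive text tags are treated as tables, and each run's smallest and largest top values bound the table region. The (tag name, top) input representation, the min_run threshold, and the idea that a non-text tag ends a run are all assumptions; the source does not say how consecutive runs are delimited.

```python
from itertools import groupby


def table_regions(elements, min_run=2):
    """Locate table regions on one page.

    elements: (tag_name, top) pairs in document order.
    Each run of at least min_run consecutive "text" tags is taken as one
    table, and its region is returned as (smallest top, largest top).
    """
    regions = []
    for is_text, run in groupby(elements, key=lambda e: e[0] == "text"):
        tops = [top for _, top in run]
        if is_text and len(tops) >= min_run:
            regions.append((min(tops), max(tops)))
    return regions


page_elements = [
    ("text", 100), ("text", 100), ("text", 130),
    ("p", 200),
    ("text", 300), ("text", 340),
]
regions = table_regions(page_elements)
```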
In this embodiment, the left attribute value is the distance between the position of the tag's text data on the current page and the left side of the page; the width attribute value is the width that the tag's field area occupies in the table; and the midline value is the distance between the center line of the field area on the current page and the left side of the page.
The midline value Line_Mid of each field area in the table is calculated with the following formula:
Line_Mid = Value[left] + Value[width] / 2
where Value[left] is the left attribute value of the text tag corresponding to the field area, and Value[width] is the width attribute value of that text tag.
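The formula translates directly into code. The function name line_mid is chosen here for illustration; the two sample calls use the left and width values from the xml example given earlier in the description.

```python
def line_mid(left, width):
    """Midline value of a field area: Value[left] + Value[width] / 2."""
    return left + width / 2


mid_serial_no = line_mid(132, 27)   # the "No." tag from the xml example
mid_heading = line_mid(157, 214)    # the "(3) Other important matters" tag
```

With these inputs the midline values are 145.5 and 264.0 respectively.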
S104: On the basis of the midline values, group the enterprise object identifiers existing in the table body areas to obtain the header field matched by each enterprise object identifier; the header fields include a customer field and a supplier field.
In this embodiment, each table located in the text to be analyzed contains table body areas and header areas. The header areas are the field areas of the first row of text data in the table; the table body areas are all other field areas in the table.
In this embodiment, a preset recognition algorithm identifies the data columns in each table that are associated with enterprise object identifiers. Enterprise object identifiers include, but are not limited to, the name of an enterprise object, an abbreviation of the enterprise name, and the name by which the enterprise object is commonly known in its industry.
By way of example, the preset recognition algorithm may be as follows: obtain multiple enterprise object identifiers collected in advance and store them in an identifier list; for the text data of each text tag, judge whether the text data matches any enterprise object identifier in the identifier list; if it does, determine that the data column to which the text data belongs is a data column associated with enterprise object identifiers.
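The list-based recognition described above might be sketched as follows. Treating a "match" as substring containment is an assumption made here; the source does not define the matching rule, and exact or fuzzy matching would be equally consistent with it.

```python
def find_identifier(text, known_identifiers):
    """Return the first known enterprise identifier contained in text,
    or None if the text mentions none of them."""
    for identifier in known_identifiers:
        if identifier in text:
            return identifier
    return None


identifier_list = ["Crocodile Group", "Wangwang Co., Ltd."]
hit = find_identifier("Crocodile Group", identifier_list)
miss = find_identifier("Total", identifier_list)
```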
In the tables contained in texts to be analyzed such as quarterly reports, annual reports, and prospectuses, the header field of a data column associated with enterprise object identifiers is usually the customer field or the supplier field. Because the xml text does not directly show which header field each enterprise object identifier corresponds to, this embodiment groups the enterprise object identifiers by the midline value of the field area to which each identifier belongs, in order to determine whether each identifier is body data in the "Customer" column or in the "Supplier" column.
Specifically, as one embodiment of the present application, FIG. 4 shows the implementation flow of step S104 of the method for acquiring upstream and downstream relationships of enterprises, detailed as follows:
S1041: Acquire the first midline value of each header field in the header area.
In a table located on the current page, it follows from the analysis above that the text data of the text tags with the smallest top attribute value are the header fields of the table. After the midline value of each text tag with the smallest top attribute value is calculated, it is output as the midline value of the header field corresponding to that text tag.
S1042: For the table body area to which each enterprise object identifier belongs, acquire the second midline value of that table body area.
In this embodiment, if the text data of a text tag is found to contain an enterprise object identifier, the field area of that text tag is determined to be a table body area, and the midline value of the field area is output as the midline value of a table body area in the current table.
Note that in this embodiment the first midline value refers to the midline value of a header area and the second midline value refers to the midline value of a table body area. "First" is used here only for convenience of expression and reference and does not imply that a specific implementation of the present application must contain a correspondingly named first midline value. Likewise, "second" in "second midline value" is used only for convenience of expression and reference and does not imply that a specific implementation must contain a correspondingly named second midline value.
S1043: According to the first midline values and the second midline value, calculate the relative distance between the enterprise object identifier and each header field.
In this embodiment, if the table contains A header fields (A being an integer greater than zero), A first midline values are obtained. For each enterprise object identifier, the absolute value of the difference between the second midline value of its table body area and each first midline value is calculated, and this absolute difference is output as the relative distance between the enterprise object identifier and the corresponding header field.
By way of example, if a table contains the enterprise object identifier "Crocodile Group" and two header fields, "Customer" and "Supplier", then the relative distance D1 between the body field containing "Crocodile Group" and the header field "Customer" is:
D1 = abs(Line_Mid[Crocodile Group] - Line_Mid[Customer])
The relative distance D2 between the body field containing "Crocodile Group" and the header field "Supplier" is:
D2 = abs(Line_Mid[Crocodile Group] - Line_Mid[Supplier])
where abs() is a preset absolute value function; Line_Mid[Customer] is the first midline value of the header area to which the "Customer" header field belongs; Line_Mid[Supplier] is the first midline value of the header area to which the "Supplier" header field belongs; and Line_Mid[Crocodile Group] is the second midline value of the table body area to which Crocodile Group belongs.
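The distance calculation can be sketched as a single dictionary comprehension. The midline numbers below are invented for illustration and do not come from the source.

```python
def relative_distances(body_mid, header_mids):
    """Absolute difference between one body area's midline value and the
    midline value of every header field.

    header_mids maps a header field name to its first midline value.
    """
    return {name: abs(body_mid - mid) for name, mid in header_mids.items()}


# Illustrative midline values only.
distances = relative_distances(203.0, {"Customer": 200.0, "Supplier": 207.0})
```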
S1044: Output the header field with the smallest relative distance as the header field matched by the enterprise object identifier.
In this embodiment, calculating the relative distance between the enterprise object identifier and each of the A header fields yields A relative distances. The smallest of these A relative distances is selected, and the first midline value associated with it is determined. The header field corresponding to that first midline value is output as the header field matched by the enterprise object identifier.
For instance, in the example above, if the relative distance D1 between the body field containing "Crocodile Group" and the header field "Customer" is 3, and the relative distance D2 between that body field and the header field "Supplier" is 4, then the header field with the smallest relative distance is "Customer". "Customer" is therefore output as the header field matched by the enterprise object identifier; that is, the data column to which the identifier belongs is determined to be the "Customer" column. In this way, the enterprise object identifiers in the table are grouped accurately.
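Selecting the matched header is then a minimum over the distances. The sketch below reuses the D1 = 3 and D2 = 4 values from the example; how ties are broken is not specified in the source, and here the first header in dictionary order wins.

```python
def match_header(distances):
    """Return the header field name with the smallest relative distance."""
    return min(distances, key=distances.get)


matched = match_header({"Customer": 3, "Supplier": 4})
```

For the example values, the matched header is "Customer".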
S105: Determine the upstream and downstream relationships between the enterprise objects according to the enterprise object identifiers matched by the customer field and by the supplier field, respectively.
In this embodiment, the enterprise object identifiers corresponding to text tags with the same top attribute value are placed in the same row of a two-dimensional data table created in advance. The header fields of this two-dimensional data table include the customer field and the supplier field.
In this embodiment, the data column of each enterprise object identifier in the two-dimensional data table is adjusted according to the header field the identifier matches, so that each identifier is located in the same data column as its matched header field.
By way of example, the two-dimensional data table finally output is as follows:
Customer            Supplier
Crocodile Group     Wangwang Co., Ltd.
Holawang Group      Chun Xia Qiu Dong Group
Because a customer and its supplier stand in a downstream-to-upstream supply chain relationship, the upstream and downstream hierarchy between the enterprise objects can be determined from the two-dimensional data table output above. For example, in the above example, Crocodile Group is downstream of Wangwang Co., Ltd., and Chun Xia Qiu Dong Group is upstream of Holawang Group.
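As a minimal sketch of this last step, each row of the output table can be turned into an (upstream, downstream) pair. The function name `supply_chain_edges` and the tuple layout are illustrative assumptions, not part of the application.

```python
def supply_chain_edges(rows):
    """Convert (customer, supplier) rows into (upstream, downstream) pairs."""
    edges = []
    for customer, supplier in rows:
        # A supplier is upstream of the customer that it supplies.
        edges.append((supplier, customer))
    return edges


# Rows of the two-dimensional data table from the example above.
rows = [
    ("Crocodile Group", "Wangwang Co., Ltd."),
    ("Holawang Group", "Chun Xia Qiu Dong Group"),
]
edges = supply_chain_edges(rows)
for upstream, downstream in edges:
    print(f"{upstream} is upstream of {downstream}")
```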
In this embodiment of the present application, determining the customer field or supplier field matched by each enterprise object identifier based on the midline value of each field region improves the accuracy with which each field value in the table body region is matched to its header field. Because customers and suppliers have a well-defined upstream and downstream relationship, the industry chain information between enterprise objects can be learned from the enterprise object identifiers respectively matched by the customer field and the supplier field, which improves the efficiency of acquiring enterprise upstream and downstream relationships.
As another embodiment of the present application, as shown in FIG. 5, before step S104 the method further includes:
S106: For each page in the text to be analyzed, locate each text tag contained in the page, and read the top attribute value of the text tag.
S107: If there are at least two text tags with the same top attribute value, record each top attribute value in the page in a preset register.
S108: Find the smallest top attribute value in the register, and read the text data in the text tag corresponding to that top attribute value.
S109: Determine the text data as one of the header fields in the table.
In this embodiment of the present application, the text to be analyzed contains multiple pages. For each page, which is in xml format, each text tag <text> contained in the page is located, and the top attribute value of each text tag is read.
In this embodiment of the present application, it is determined whether the current page contains at least two text tags with the same top attribute value. If not, the next page of the text to be analyzed is read, and step S106 is executed again. If so, then within the current page, starting from the page position of those at least two text tags, each top attribute value subsequently read is recorded in the preset register; once all top attribute values have been recorded, the smallest top attribute value in the register is found.
The text data in each text tag corresponding to that top attribute value is read, and the text data is output as a header field of a table contained in the current page.
For example, if the text to be analyzed in xml format is:
<text top="627" left="132" width="27" height="13" font="9">Serial number</text>
<text top="627" left="224" width="51" height="13" font="9">Project name</text>
<text top="655" left="141" width="574" height="11" font="9">1 Fudan Guoquan Science and Technology Park October 28, 2004 Shanghai Shangfeng Kesheng Investment Co., Ltd. RMB 150 million</text>
then the text data corresponding to the text tags with the smallest top attribute value are "Serial number" and "Project name". "Serial number" and "Project name" are therefore each output as a header field of the current table.
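Steps S106 to S109 can be sketched with Python's standard `xml.etree.ElementTree` module. This is a simplified illustration under stated assumptions: the `<page>` wrapper element, the function name `find_header_fields`, and the use of a plain list as the "register" are hypothetical, and the sketch records every top value on the page rather than only those read after the first matching pair of tags.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The example page from the description, wrapped in an assumed <page> element.
PAGE_XML = """<page>
<text top="627" left="132" width="27" height="13" font="9">Serial number</text>
<text top="627" left="224" width="51" height="13" font="9">Project name</text>
<text top="655" left="141" width="574" height="11" font="9">1 Fudan Guoquan Science and Technology Park</text>
</page>"""


def find_header_fields(page_xml):
    """Return the header texts of one page, or [] if no row of cells is found."""
    tags = ET.fromstring(page_xml).findall(".//text")
    # S106: read the top attribute value of every text tag.
    register = [int(tag.get("top")) for tag in tags]
    # S107: continue only if at least two tags share the same top value.
    if not any(count >= 2 for count in Counter(register).values()):
        return []
    # S108: the smallest top value identifies the topmost row.
    smallest_top = min(register)
    # S109: the text of each tag on that row is treated as a header field.
    return [
        (tag.text or "").strip()
        for tag in tags
        if int(tag.get("top")) == smallest_top
    ]


print(find_header_fields(PAGE_XML))
```

A page that returns an empty list is simply skipped, which mirrors the "read the next page" branch described above.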
In this embodiment of the present application, every page of the text to be analyzed is traversed and the text tags contained in each page are located, but the top attribute values of a page are recorded in the preset register only when that page contains at least two text tags with the same top attribute value. This avoids performing read and write operations on the text tags of every page and allows the page containing a table to be located quickly, which improves the efficiency of finding tables in the text to be analyzed and, in turn, the efficiency of acquiring enterprise upstream and downstream relationships.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Corresponding to the method for acquiring enterprise upstream and downstream relationships described in the above embodiments, FIG. 6 shows a structural block diagram of the apparatus for acquiring enterprise upstream and downstream relationships provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 6, the apparatus includes:
An acquisition unit 61, configured to acquire a text to be analyzed that is associated with enterprise objects; the initial format of the text to be analyzed is the Portable Document Format (pdf).
A conversion unit 62, configured to convert the text format of the text to be analyzed from the pdf format to the Extensible Markup Language (xml) format by means of a preset text conversion tool.
A positioning unit 63, configured to locate, according to the xml tags contained in the converted text to be analyzed, a table present in the text to be analyzed, and to acquire the midline value of each field region in the table; the midline value represents the distance between the center position of the field region and the left boundary of the page, and the field regions include a table header region and table body regions.
A grouping unit 64, configured to group, based on the midline values, the enterprise object identifiers present in the table body regions, so as to obtain the header field matched by each enterprise object identifier; the header fields include a customer field and a supplier field.
A determining unit 65, configured to determine the upstream and downstream relationships between the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
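The midline value used by the positioning unit can be illustrated with a one-line computation. The sketch assumes that the midline of a field region is derived from the left and width attributes of its text tag (as in the xml example in this description), that is, left + width / 2; the function name `midline` is hypothetical.

```python
def midline(left, width):
    """Distance from the page's left boundary to the center of a field region."""
    # Assumption: left is the region's left edge and width its horizontal extent.
    return left + width / 2.0


# Attribute values taken from the two header tags of the xml example.
print(midline(132, 27))  # center of the first header field
print(midline(224, 51))  # center of the second header field
```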
Optionally, the apparatus for acquiring enterprise upstream and downstream relationships further includes:
A reading unit, configured to locate, for each page in the text to be analyzed, each text tag contained in the page, and to read the top attribute value of the text tag.
A recording unit, configured to record each top attribute value in the page in a preset register if there are at least two text tags with the same top attribute value.
A searching unit, configured to find the smallest top attribute value in the register, and to read the text data in the text tag corresponding to that top attribute value.
A determining unit, configured to determine the text data as one of the header fields in the table.
Optionally, the grouping unit 64 includes:
A first acquisition subunit, configured to acquire the first midline value of each header field in the table header region.
A second acquisition subunit, configured to acquire, for the table body region to which each enterprise object identifier belongs, the second midline value of that table body region.
A calculation subunit, configured to calculate, according to the first midline values and the second midline value, the relative distance between the enterprise object identifier and each header field.
An output subunit, configured to output the header field with the smallest relative distance as the header field that matches the enterprise object identifier.
Optionally, the positioning unit 63 includes:
A positioning subunit, configured to locate, for each page in the text to be analyzed, each text tag contained in the page, and to read the top attribute value of the text tag.
A detection subunit, configured to detect, in the page, the text tag with the largest top attribute value and the text tag with the smallest top attribute value, and to locate the page area between these two text tags as the area in which a table exists in the text to be analyzed.
Optionally, the positioning subunit is specifically configured to:
scan each page in the text to be analyzed, so as to determine the pages that contain a preset table name;
for the currently determined page, locate each text tag contained in it, and read the top attribute value of the text tag;
if the current page does not contain at least two text tags with the same top attribute value, determine the next page that contains the preset table name, and return to the operation of locating each text tag contained in the currently determined page and reading the top attribute value of the text tag.
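The page scan and the table-region detection described for the positioning unit can be sketched together as follows. This is an illustrative sketch only: the `<page>` wrapper element, the function name `locate_table_region`, the sample table name, and the sample pages are all hypothetical, and the table region is taken naively as the span between the smallest and largest top values on the page.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def locate_table_region(pages, table_name):
    """Return (page_index, min_top, max_top) for the first qualifying page."""
    for index, page_xml in enumerate(pages):
        tags = ET.fromstring(page_xml).findall(".//text")
        # Keep only pages whose text mentions the preset table name.
        if not any(table_name in (tag.text or "") for tag in tags):
            continue
        tops = [int(tag.get("top")) for tag in tags]
        # Move on to the next candidate page unless two tags share a top value.
        if not any(count >= 2 for count in Counter(tops).values()):
            continue
        # The table is assumed to lie between the topmost and bottommost tags.
        return index, min(tops), max(tops)
    return None


# Hypothetical pages: only the third one has the name and a row of cells.
pages = [
    '<page><text top="10" left="0" width="5">Top five customers</text></page>',
    '<page><text top="20" left="0" width="5">a</text>'
    '<text top="20" left="9" width="5">b</text></page>',
    '<page><text top="30" left="0" width="5">Top five customers</text>'
    '<text top="50" left="0" width="5">Customer</text>'
    '<text top="50" left="90" width="5">Supplier</text>'
    '<text top="70" left="0" width="5">Crocodile Group</text></page>',
]
region = locate_table_region(pages, "Top five customers")
print(region)
```

Filtering on the table name first means the top attribute values are only compared on pages that are likely to contain the table of interest.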
FIG. 6 is a schematic diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 6, the terminal device 6 of this embodiment includes a processor 60 and a memory 61. The memory 61 stores computer-readable instructions 62 that can run on the processor 60, for example a program for acquiring enterprise upstream and downstream relationships. When the processor 60 executes the computer-readable instructions 62, the steps in the above embodiments of the method for acquiring enterprise upstream and downstream relationships are implemented, for example steps S101 to S105 shown in FIG. 1. Alternatively, when the processor 60 executes the computer-readable instructions 62, the functions of the modules/units in the above apparatus embodiments are implemented, for example the functions of units 61 to 65 shown in FIG. 6.
Exemplarily, the computer-readable instructions 62 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and these instruction segments describe the execution of the computer-readable instructions 62 in the terminal device 6.
The terminal device 6 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will understand that FIG. 6 is merely an example of the terminal device 6 and does not limit it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device may further include input/output devices, a network access device, a bus, and the like.
The processor 60 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or it may be any conventional processor.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or internal memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal device 6. Further, the memory 61 may include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store the computer-readable instructions and the other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been output or is about to be output.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A method for acquiring upstream and downstream relationships of an enterprise, characterized by comprising:
    acquiring a text to be analyzed that is associated with enterprise objects, wherein the initial format of the text to be analyzed is the Portable Document Format (pdf);
    converting the text format of the text to be analyzed from the pdf format to the Extensible Markup Language (xml) format by means of a preset text conversion tool;
    locating, according to the xml tags contained in the converted text to be analyzed, a table present in the text to be analyzed, and acquiring a midline value of each field region in the table, wherein the midline value represents the distance between the center position of the field region and the left boundary of the page, and the field regions include a table header region and table body regions;
    grouping, based on the midline values, the enterprise object identifiers present in the table body regions, so as to obtain the header field matched by each enterprise object identifier, wherein the header fields include a customer field and a supplier field; and
    determining the upstream and downstream relationships between the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
  2. The method for acquiring upstream and downstream relationships of an enterprise according to claim 1, characterized in that, before the grouping, based on the midline values, of the enterprise object identifiers present in the table body regions so as to obtain the header field matched by each enterprise object identifier, the method further comprises:
    for each page in the text to be analyzed, locating each text tag contained in the page, and reading the top attribute value of the text tag;
    if there are at least two text tags with the same top attribute value, recording each top attribute value in the page in a preset register;
    finding the smallest top attribute value in the register, and reading the text data in the text tag corresponding to that top attribute value; and
    determining the text data as one of the header fields in the table.
  3. The method for acquiring upstream and downstream relationships of an enterprise according to claim 1, characterized in that the grouping, based on the midline values, of the enterprise object identifiers present in the table body regions so as to obtain the header field matched by each enterprise object identifier comprises:
    acquiring a first midline value of each header field in the table header region;
    for the table body region to which each enterprise object identifier belongs, acquiring a second midline value of that table body region;
    calculating, according to the first midline values and the second midline value, the relative distance between the enterprise object identifier and each header field; and
    outputting the header field with the smallest relative distance as the header field that matches the enterprise object identifier.
  4. The method for acquiring upstream and downstream relationships of an enterprise according to claim 1, characterized in that the locating, according to the xml tags contained in the converted text to be analyzed, of a table present in the text to be analyzed and the acquiring of a midline value of each field region in the table comprise:
    for each page in the text to be analyzed, locating each text tag contained in the page, and reading the top attribute value of the text tag; and
    detecting, in the page, the text tag with the largest top attribute value and the text tag with the smallest top attribute value, and locating the page area between these two text tags as the area in which a table exists in the text to be analyzed.
  5. The method for acquiring upstream and downstream relationships of an enterprise according to claim 4, characterized in that the locating, for each page in the text to be analyzed, of each text tag contained in the page and the reading of the top attribute value of the text tag comprise:
    scanning each page in the text to be analyzed, so as to determine the pages that contain a preset table name;
    for the currently determined page, locating each text tag contained in it, and reading the top attribute value of the text tag; and
    if the current page does not contain at least two text tags with the same top attribute value, determining the next page that contains the preset table name, and returning to the operation of locating each text tag contained in the currently determined page and reading the top attribute value of the text tag.
  6. An apparatus for acquiring upstream and downstream relationships of an enterprise, characterized by comprising:
    an acquisition unit, configured to acquire a text to be analyzed that is associated with enterprise objects, wherein the initial format of the text to be analyzed is the Portable Document Format (pdf);
    a conversion unit, configured to convert the text format of the text to be analyzed from the pdf format to the Extensible Markup Language (xml) format by means of a preset text conversion tool;
    a positioning unit, configured to locate, according to the xml tags contained in the converted text to be analyzed, a table present in the text to be analyzed, and to acquire a midline value of each field region in the table, wherein the midline value represents the distance between the center position of the field region and the left boundary of the page, and the field regions include a table header region and table body regions;
    a grouping unit, configured to group, based on the midline values, the enterprise object identifiers present in the table body regions, so as to obtain the header field matched by each enterprise object identifier, wherein the header fields include a customer field and a supplier field; and
    a determining unit, configured to determine the upstream and downstream relationships between the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
  7. The apparatus for acquiring upstream and downstream relationships of an enterprise according to claim 6, characterized by further comprising:
    a reading unit, configured to locate, for each page in the text to be analyzed, each text tag contained in the page, and to read the top attribute value of the text tag;
    a recording unit, configured to record each top attribute value in the page in a preset register if there are at least two text tags with the same top attribute value;
    a searching unit, configured to find the smallest top attribute value in the register, and to read the text data in the text tag corresponding to that top attribute value; and
    a determining unit, configured to determine the text data as one of the header fields in the table.
  8. The apparatus for acquiring upstream and downstream relationships of an enterprise according to claim 6, characterized in that the grouping unit comprises:
    a first acquisition subunit, configured to acquire a first midline value of each header field in the table header region;
    a second acquisition subunit, configured to acquire, for the table body region to which each enterprise object identifier belongs, a second midline value of that table body region;
    a calculation subunit, configured to calculate, according to the first midline values and the second midline value, the relative distance between the enterprise object identifier and each header field; and
    an output subunit, configured to output the header field with the smallest relative distance as the header field that matches the enterprise object identifier.
  9. The apparatus for acquiring upstream and downstream relationships of an enterprise according to claim 6, characterized in that the positioning unit comprises:
    a positioning subunit, configured to locate, for each page in the text to be analyzed, each text tag contained in the page, and to read the top attribute value of the text tag; and
    a detection subunit, configured to detect, in the page, the text tag with the largest top attribute value and the text tag with the smallest top attribute value, and to locate the page area between these two text tags as the area in which a table exists in the text to be analyzed.
  10. The apparatus for acquiring upstream and downstream relationships of an enterprise according to claim 9, characterized in that the positioning subunit is specifically configured to:
    scan each page in the text to be analyzed, so as to determine the pages that contain a preset table name;
    for the currently determined page, locate each text tag contained in it, and read the top attribute value of the text tag; and
    if the current page does not contain at least two text tags with the same top attribute value, determine the next page that contains the preset table name, and return to the operation of locating each text tag contained in the currently determined page and reading the top attribute value of the text tag.
  11. A terminal device, characterized by comprising a memory and a processor, wherein the memory stores computer-readable instructions that can run on the processor, and the processor implements the following steps when executing the computer-readable instructions:
    acquiring a text to be analyzed that is associated with enterprise objects, wherein the initial format of the text to be analyzed is the Portable Document Format (pdf);
    converting the text format of the text to be analyzed from the pdf format to the Extensible Markup Language (xml) format by means of a preset text conversion tool;
    locating, according to the xml tags contained in the converted text to be analyzed, a table present in the text to be analyzed, and acquiring a midline value of each field region in the table, wherein the midline value represents the distance between the center position of the field region and the left boundary of the page, and the field regions include a table header region and table body regions;
    grouping, based on the midline values, the enterprise object identifiers present in the table body regions, so as to obtain the header field matched by each enterprise object identifier, wherein the header fields include a customer field and a supplier field; and
    determining the upstream and downstream relationships between the enterprise objects according to the enterprise object identifiers respectively matched by the customer field and the supplier field.
  12. The terminal device according to claim 11, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    for each page of the text to be analyzed, locating each text tag contained in the page and reading the top attribute value of each text tag;
    if at least two text tags have the same top attribute value, recording each top attribute value of the page in a preset register;
    finding the smallest top attribute value in the register and reading the text data in the text tag corresponding to that top attribute value;
    determining the text data to be one of the header fields of the table.
  13. The terminal device according to claim 11, wherein grouping, based on the midline values, the enterprise object identifiers present in the body regions, to obtain the header field matched by each enterprise object identifier, comprises:
    acquiring a first midline value of each header field in the header region;
    for the body region to which each enterprise object identifier belongs, acquiring a second midline value of that body region;
    calculating, from the first midline values and the second midline value, the relative distance between the enterprise object identifier and each header field;
    outputting the header field with the smallest relative distance as the header field matched by the enterprise object identifier.
  14. The terminal device according to claim 11, wherein locating, according to the XML tags contained in the converted text to be analyzed, a table present in the text to be analyzed, and acquiring a midline value of each field region in the table, comprises:
    for each page of the text to be analyzed, locating each text tag contained in the page and reading the top attribute value of each text tag;
    detecting, on the page, the text tag with the largest top attribute value and the text tag with the smallest top attribute value, and locating the page area between the two detected text tags as the area in which the table in the text to be analyzed is present.
  15. The terminal device according to claim 14, wherein, for each page of the text to be analyzed, locating each text tag contained in the page and reading the top attribute value of each text tag, comprises:
    scanning each page of the text to be analyzed to identify the pages that contain a preset table name;
    for the currently identified page, locating each text tag contained in the page and reading the top attribute value of each text tag;
    if the current page does not contain at least two text tags with the same top attribute value, identifying the next page that contains the preset table name, and returning to the operation of locating each text tag contained in the currently identified page and reading the top attribute value of each text tag.
  16. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by at least one processor, implement the following steps:
    acquiring a text to be analyzed that is associated with an enterprise object, wherein the initial format of the text to be analyzed is the Portable Document Format (PDF);
    converting, by a preset text conversion tool, the text to be analyzed from the PDF format to the Extensible Markup Language (XML) format;
    locating, according to the XML tags contained in the converted text to be analyzed, a table present in the text to be analyzed, and acquiring a midline value of each field region in the table, wherein the midline value represents the distance between the center of the field region and the left edge of the page, and the field regions comprise a header region and body regions;
    grouping, based on the midline values, the enterprise object identifiers present in the body regions, to obtain the header field matched by each enterprise object identifier, wherein the header fields comprise a customer field and a supplier field;
    determining the upstream and downstream relationships among the enterprise objects according to the enterprise object identifiers matched by the customer field and by the supplier field, respectively.
  17. The computer-readable storage medium according to claim 16, wherein the computer-readable instructions, when executed by at least one processor, further implement the following steps:
    for each page of the text to be analyzed, locating each text tag contained in the page and reading the top attribute value of each text tag;
    if at least two text tags have the same top attribute value, recording each top attribute value of the page in a preset register;
    finding the smallest top attribute value in the register and reading the text data in the text tag corresponding to that top attribute value;
    determining the text data to be one of the header fields of the table.
  18. The computer-readable storage medium according to claim 16, wherein grouping, based on the midline values, the enterprise object identifiers present in the body regions, to obtain the header field matched by each enterprise object identifier, comprises:
    acquiring a first midline value of each header field in the header region;
    for the body region to which each enterprise object identifier belongs, acquiring a second midline value of that body region;
    calculating, from the first midline values and the second midline value, the relative distance between the enterprise object identifier and each header field;
    outputting the header field with the smallest relative distance as the header field matched by the enterprise object identifier.
  19. The computer-readable storage medium according to claim 16, wherein locating, according to the XML tags contained in the converted text to be analyzed, a table present in the text to be analyzed, and acquiring a midline value of each field region in the table, comprises:
    for each page of the text to be analyzed, locating each text tag contained in the page and reading the top attribute value of each text tag;
    detecting, on the page, the text tag with the largest top attribute value and the text tag with the smallest top attribute value, and locating the page area between the two detected text tags as the area in which the table in the text to be analyzed is present.
  20. The computer-readable storage medium according to claim 19, wherein, for each page of the text to be analyzed, locating each text tag contained in the page and reading the top attribute value of each text tag, comprises:
    scanning each page of the text to be analyzed to identify the pages that contain a preset table name;
    for the currently identified page, locating each text tag contained in the page and reading the top attribute value of each text tag;
    if the current page does not contain at least two text tags with the same top attribute value, identifying the next page that contains the preset table name, and returning to the operation of locating each text tag contained in the currently identified page and reading the top attribute value of each text tag.
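
As an illustration of the header-detection and midline-grouping steps recited in claims 12 and 13 (and mirrored in claims 17 and 18), the following Python sketch may help. It is not taken from the application. It assumes that the converted XML resembles the output of a pdftohtml-style converter, in which each `<text>` tag carries `top`, `left`, and `width` attributes (the claims themselves name only `top`), and that the "relative distance" is the absolute difference between two midline values. The sample page, the company names, and the function names `midline`, `find_header_tags`, and `group_by_header` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical converted page; the attribute layout is an assumption.
SAMPLE_PAGE = """
<page number="1">
  <text top="100" left="50" width="100">Customer</text>
  <text top="100" left="300" width="100">Supplier</text>
  <text top="130" left="60" width="80">Acme Ltd</text>
  <text top="130" left="290" width="120">Bolt Corp</text>
  <text top="160" left="55" width="90">Cobalt Inc</text>
</page>
"""


def midline(tag):
    """Distance from the left page edge to the horizontal center of a tag."""
    return float(tag.get("left")) + float(tag.get("width")) / 2.0


def find_header_tags(page):
    """Return the tags on the topmost row of the page.

    Returns an empty list when no two text tags share a top value,
    i.e. when the page shows no sign of a table row.
    """
    tags = page.findall("text")
    tops = [int(t.get("top")) for t in tags]
    if not any(tops.count(v) >= 2 for v in set(tops)):
        return []
    smallest = min(tops)  # the smallest top value marks the header row
    return [t for t in tags if int(t.get("top")) == smallest]


def group_by_header(page):
    """Assign each body tag to the header whose midline is closest to its own."""
    headers = find_header_tags(page)
    if not headers:
        return {}
    header_top = int(headers[0].get("top"))
    groups = {h.text: [] for h in headers}
    for tag in page.findall("text"):
        if int(tag.get("top")) == header_top:
            continue  # skip the header row itself
        # Assumed distance measure: absolute difference of midline values.
        nearest = min(headers, key=lambda h: abs(midline(h) - midline(tag)))
        groups[nearest.text].append(tag.text)
    return groups


if __name__ == "__main__":
    page = ET.fromstring(SAMPLE_PAGE)
    for header, names in group_by_header(page).items():
        print(header, names)
```

With the sample page, the two header tags have midlines of 100 and 350, so "Acme Ltd" and "Cobalt Inc" (midline 100) fall under the customer column and "Bolt Corp" (midline 350) under the supplier column. Comparing midlines rather than left edges keeps the match stable when cell text of different widths is centered within a column.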
PCT/CN2018/105543 2018-06-19 2018-09-13 Method and apparatus for acquiring upstream and downstream relationships between companies, terminal device and medium WO2019242125A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810630801.4A CN109002425B (en) 2018-06-19 2018-06-19 Method for acquiring upstream and downstream relations of enterprise, terminal device and medium
CN201810630801.4 2018-06-19

Publications (1)

Publication Number Publication Date
WO2019242125A1 2019-12-26

Family

ID=64600526

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/105543 WO2019242125A1 (en) 2018-06-19 2018-09-13 Method and apparatus for acquiring upstream and downstream relationships between companies, terminal device and medium

Country Status (2)

Country Link
CN (1) CN109002425B (en)
WO (1) WO2019242125A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382843A (en) * 2020-03-06 2020-07-07 浙江网商银行股份有限公司 Method and device for establishing upstream and downstream relation recognition model of enterprise and relation mining
CN112435051A (en) * 2020-11-13 2021-03-02 北京创业光荣信息科技有限责任公司 Obtaining method of associated enterprise, electronic equipment, computer readable storage medium and terminal

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909123B (en) * 2019-10-23 2023-08-25 深圳价值在线信息科技股份有限公司 Data extraction method and device, terminal equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1776673A (en) * 2005-12-03 2006-05-24 福州大学 Method for converting PDF file to XML file
US20100161693A1 (en) * 2008-12-18 2010-06-24 Hong Fu Jin Precision Industry (Shenzhen) Co., Ltd. System and method for signing an electronic document
CN104090920A (en) * 2014-06-17 2014-10-08 安徽教育网络出版有限公司 System for realizing digital content cross-terminal publishing
CN108132920A (en) * 2018-01-10 2018-06-08 北京仁和汇智信息技术有限公司 A kind of method and device of XML file and pdf document synchronization association

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101446938B (en) * 2008-12-04 2011-10-12 金蝶软件(中国)有限公司 Method for generating table and processing device thereof
US20150046787A1 (en) * 2013-08-06 2015-02-12 International Business Machines Corporation Url tagging based on user behavior
CN103886098B (en) * 2014-04-04 2017-05-17 浙江大学城市学院 Word document format checking method
CN105138609B (en) * 2015-08-04 2019-07-30 广东瑞德智能科技股份有限公司 A kind of household appliance based on XML language describes method
CN107818075A (en) * 2017-10-16 2018-03-20 平安科技(深圳)有限公司 Form data structuring extracting method, electronic equipment and computer-readable recording medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382843A (en) * 2020-03-06 2020-07-07 浙江网商银行股份有限公司 Method and device for establishing upstream and downstream relation recognition model of enterprise and relation mining
CN111382843B (en) * 2020-03-06 2023-10-20 浙江网商银行股份有限公司 Method and device for establishing enterprise upstream and downstream relationship identification model and mining relationship
CN112435051A (en) * 2020-11-13 2021-03-02 北京创业光荣信息科技有限责任公司 Obtaining method of associated enterprise, electronic equipment, computer readable storage medium and terminal
CN112435051B (en) * 2020-11-13 2023-11-28 海创汇科技创业发展股份有限公司 Acquisition method, electronic equipment, computer readable storage medium and terminal of associated enterprises

Also Published As

Publication number Publication date
CN109002425A (en) 2018-12-14
CN109002425B (en) 2022-03-22

Similar Documents

Publication Publication Date Title
WO2019237540A1 (en) Method and device for acquiring financial data, terminal device, and medium
US10614527B2 (en) System and method for automatic generation of reports based on electronic documents
RU2679209C2 (en) Processing of electronic documents for invoices recognition
WO2021184578A1 (en) Ocr-based target field recognition method and apparatus, electronic device, and storage medium
CN110737689B (en) Data standard compliance detection method, device, system and storage medium
CN110909123B (en) Data extraction method and device, terminal equipment and storage medium
WO2019242125A1 (en) Method and apparatus for acquiring upstream and downstream relationships between companies, terminal device and medium
CN111506608B (en) Structured text comparison method and device
CN113342976B (en) Method, device, storage medium and equipment for automatically acquiring and processing data
CN110765750B (en) Report data input method and terminal equipment
US10146881B2 (en) Scalable processing of heterogeneous user-generated content
JP5702342B2 (en) Receipt definition data creation device and program
CN114529933A (en) Contract data difference comparison method, device, equipment and medium
CN110704635B (en) Method and device for converting triplet data in knowledge graph
CN110908983A (en) Intelligent marketing system based on user portrait recognition
CN116303820A (en) Label generation method, label generation device, computer equipment and medium
TWI785724B (en) Method for creating data warehouse, electronic device, and storage medium
CN111125483B (en) Webpage data extraction template generation method and device, computer device and storage medium
CN115114297A (en) Data lightweight storage and search method and device, electronic equipment and storage medium
US20170169518A1 (en) System and method for automatically tagging electronic documents
CN110909112B (en) Data extraction method, device, terminal equipment and medium
CN110909538B (en) Question and answer content identification method and device, terminal equipment and medium
CN110413659B (en) General shopping ticket data accurate extraction method
CN113434734A (en) Method, device, equipment and storage medium for generating file and reading file
US11170164B2 (en) System and method for cell comparison between spreadsheets

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 18923462; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 18923462; Country of ref document: EP; Kind code of ref document: A1)