CN111428497A - Method, device and equipment for automatically extracting financing information - Google Patents

Publication number
CN111428497A
Authority
CN
China
Prior art keywords
text
information
company
matching
chapters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010243586.XA
Other languages
Chinese (zh)
Inventor
吴良顺
周鑫
景紫薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Original Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority to CN202010243586.XA
Publication of CN111428497A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G06F16/1794 Details of file format conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for automatically extracting funding information. The method comprises: obtaining a company chapter; reading the text of the company chapter and performing word segmentation on it to obtain a word segmentation result; matching each segmented word against the surnames in a surname database, and taking each matched word as a candidate name; and locating the position of the candidate name in the text of the company chapter. If the distance between the located position and the position of a preset keyword in the text is less than or equal to a threshold, the candidate name is determined to be sponsor information; if the distance is greater than the threshold, keyword labeling or prompting is performed. In this way, effective funding information can be extracted from the company chapter automatically and quickly, without incurring high labor and time costs. In addition, the application also provides an apparatus, equipment and a computer-readable storage medium for automatically extracting funding information, which have the same technical effects.

Description

Method, device and equipment for automatically extracting financing information
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for automatically extracting funding information.
Background
The company chapter (a company's articles of association) is the basic document drawn up by a company that specifies important matters such as the company name, domicile, business scope and management system. It is also a written document that every company must have, and it lays down the basic rules governing the company's organization and activities.
Funding information refers to the sponsors, the funding ratios and the fund in-place times recorded when a company is registered. It is basic information that financial institutions such as banks review when granting credit, that industrial and commercial registration authorities review, and that the company itself and other investors need to know. At present, the shareholder information shown on websites such as Tianyancha does not include the funding ratio or the fund in-place time; only the registered capital and the shareholders are displayed.
Company chapters are long. Reading them manually to extract and record the effective funding information takes a great deal of time, and the time and labor costs are especially high when many companies have to be reviewed. How to automatically extract funding information from a company chapter is therefore a technical problem to be solved urgently.
Disclosure of Invention
The invention aims to provide a method, a device and equipment for automatically extracting funding information and a computer readable storage medium, so as to solve the problem of high labor cost and time cost caused by manually reading and extracting effective funding information.
In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions:
The application discloses a method for automatically extracting funding information, comprising:
acquiring a company chapter;
reading the text of the company chapter, and performing word segmentation on the text to obtain a word segmentation result;
matching each segmented word against the surnames in a surname database, and taking each matched word as a candidate name;
locating the position of the candidate name in the text of the company chapter, and, if the distance between the located position and the position of a preset keyword in the text is less than or equal to a threshold, determining the candidate name to be sponsor information; and if the distance is greater than the threshold, performing keyword labeling or prompting.
Optionally, acquiring the company chapter includes:
acquiring an original file of the company chapter;
and performing format conversion on the original file to convert it into a company chapter in text format.
Optionally, reading the text of the company chapter and performing word segmentation on the text to obtain a word segmentation result includes:
reading the text of the company chapter, removing the spaces and punctuation in the text, and performing word segmentation on the text to obtain the word segmentation result.
Optionally, after reading the text of the company chapter and performing word segmentation on the text to obtain the word segmentation result, the method further includes:
acquiring a stop word database, traversing the word segmentation result to determine whether any stop word in the stop word database appears, and if so, deleting the corresponding segmented word.
Optionally, after acquiring the company chapter, the method further includes:
matching, in the text of the company chapter, a predetermined format in which payment acceptance information appears, and, when it is matched, extracting name information and amount information from the matched text, wherein the name information corresponds one-to-one to the amount information, and the value of the amount information is the funding amount of the corresponding sponsor;
matching, in the text of the company chapter, a predetermined format in which registered capital information appears, and extracting the registered capital amount from the matched text;
and calculating the ratio of the funding amount of each sponsor to the registered capital amount to generate funding ratio information for each sponsor.
Optionally, matching the predetermined format in which payment acceptance information appears and extracting the name information and the amount information includes:
matching, in the text of the company chapter, the predetermined format in which payment acceptance information appears, and, when it is matched, performing word segmentation on the matched text and extracting name information and numeric information from it;
judging whether the numeric information is in identity card number format, and if so, filtering out that numeric information;
and taking the remaining numeric information as the amount information.
Optionally, after acquiring the company chapter, the method further includes:
matching the date formats that appear in the text of the company chapter to obtain a date list;
and after determining the candidate name to be sponsor information, the method further includes:
locating a first position of each sponsor in the text of the company chapter;
traversing the date list, locating a second position of each date in the text of the company chapter, calculating the distance between each first position and the second position, and selecting the sponsor with the minimum distance, the date being determined as the fund in-place time of that sponsor.
The application also provides an apparatus for automatically extracting funding information, comprising:
an acquisition module, configured to acquire a company chapter;
a reading module, configured to read the text of the company chapter and perform word segmentation on the text to obtain a word segmentation result;
a matching module, configured to match each segmented word against the surnames in a surname database and take each matched word as a candidate name;
a determining module, configured to locate the position of the candidate name in the text of the company chapter, and, if the distance between the located position and the position of a preset keyword in the text is less than or equal to a threshold, determine the candidate name to be sponsor information; and if the distance is greater than the threshold, perform keyword labeling or prompting.
The application also provides equipment for automatically extracting funding information, comprising:
a memory for storing a computer program;
a processor for implementing the steps of any one of the above-described methods for automatically extracting funding information when executing the computer program.
The present application further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of any of the above-mentioned methods for automatically extracting funding information.
According to the method for automatically extracting funding information provided by the embodiment of the invention, after the company chapter is obtained, its text is read and segmented into words to obtain a word segmentation result; each segmented word is matched against the surnames in a surname database, and each matched word is taken as a candidate name; the position of the candidate name in the text of the company chapter is located, and if the distance between the located position and the position of a preset keyword in the text is less than or equal to a threshold, the candidate name is determined to be sponsor information; if the distance is greater than the threshold, keyword labeling or prompting is performed. In this way, effective funding information can be extracted from the company chapter automatically and quickly, without incurring high labor and time costs. In addition, the application also provides an apparatus, equipment and a computer-readable storage medium for automatically extracting funding information, which have the same technical effects.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an embodiment of a method for automatically extracting funding information according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another embodiment of a method for automatically extracting funding information according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for automatically extracting funding information according to another embodiment of the present invention;
FIG. 4 is a block diagram of an apparatus for automatically extracting funding information according to an embodiment of the present invention;
FIG. 5 is a block diagram of a system for automatically extracting funding information according to an embodiment of the present invention.
Detailed Description
In the prior art, company chapters are long. Reading them manually to extract and record the effective funding information takes a great deal of time, and the time and labor costs are especially high when many companies have to be reviewed. How to automatically extract funding information from the text of a company chapter is a technical problem to be solved urgently. In view of this, the present application addresses the high labor and time costs of extracting effective funding information by manual reading.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A flowchart of a specific implementation manner of the method for automatically extracting funding information according to an embodiment of the present invention is shown in fig. 1, where the method includes:
Step S101: acquiring a company chapter;
In this embodiment, the company chapter may be a text file obtained directly, or a text file obtained by converting the format of an original file. Specifically, an original file of the company chapter is acquired and format-converted into a company chapter in text format. The format conversion may be performed once or multiple times, which is not limited herein. Taking an original file in PDF format as an example, an open source tool may be used to convert the PDF file into image files, and an open source OCR tool may then be used to convert the image files into a text file, thereby obtaining the text of the company chapter.
Step S102: reading the text of the company chapter, and performing word segmentation on the text to obtain a word segmentation result;
In this embodiment, the text of the company chapter is read and segmented into words to obtain a word segmentation result. The text may be segmented with an open source word segmentation tool, and the result may take the form of a list of segmented words. Spaces and punctuation may also be removed from the text of the company chapter before word segmentation. The spaces and punctuation may include, but are not limited to: ' ', '\t', '\n', '.', ':', '□', '?', '!', ';', ',' and '……'.
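The cleaning step before word segmentation can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `strip_noise` and the exact character set in `STRIP_CHARS` (which adds full-width Chinese punctuation) are assumptions, and the segmentation tool itself is left out because the source does not name one.

```python
# Characters removed before segmentation. The set is an illustrative
# assumption: ASCII punctuation plus common full-width Chinese marks.
STRIP_CHARS = " \t\n.:□?!;,…。:?!;,、"


def strip_noise(text: str) -> str:
    """Delete spaces and punctuation so only content characters remain."""
    table = str.maketrans("", "", STRIP_CHARS)
    return text.translate(table)


# "Shareholder: Zhang San, subscribes 20 wan yuan."
cleaned = strip_noise("股东:张三, 认缴 20 万元。\n")
print(cleaned)
```

Note that stripping '.' also removes decimal points, so amounts with fractional parts would need separate handling.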
Further, after reading the text of the company chapter and performing word segmentation on it to obtain the word segmentation result, the method further includes: acquiring a stop word database, traversing the word segmentation result to determine whether any stop word in the stop word database appears, and if so, deleting the corresponding segmented word. Stop words are words judged not worth keeping, either because they carry no meaning of their own or because they occur very frequently; search engine crawlers, for example, skip them to save server resources. In this embodiment, the Baidu stop word list may be used: the segmented words in the list are traversed, and any word that appears in the stop word database is deleted.
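Stop word removal amounts to filtering the segmented word list against a set. The sketch below assumes the stop word list has already been loaded into memory; the function name and the two sample stop words are hypothetical.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that do not appear in the stop word collection."""
    stop = set(stop_words)  # set lookup keeps the traversal linear
    return [t for t in tokens if t not in stop]


# Hypothetical segmentation result containing two function words.
tokens = ["股东", "的", "张三", "认缴", "了", "20", "万元"]
filtered = remove_stop_words(tokens, ["的", "了"])
print(filtered)
```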
Step S103: matching each segmented word against the surnames in a surname database, and taking each matched word as a candidate name;
This step may be divided into the following sub-steps. Step S1031: combining all the segmented words in the word segmentation result into a text, and matching the segmented words against the surnames in the surname database by text matching.
The surname database may be a pre-established database containing a number of surnames, for example the common surnames. The surname database is traversed, and the segmented words are matched against its surnames by text matching, which may specifically be regular expression matching.
Step S1032: when the text is successfully matched with a surname in the surname database, taking the corresponding segmented word as a candidate name.
When the text is successfully matched with a surname in the surname database, the corresponding segmented word is identified as a possible Chinese name, that is, as a candidate name. There may be one or more candidate names; when there are several, they may be collected into a list.
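As a rough illustration of surname matching, the sketch below tests each segmented word individually against a regular expression built from a small surname list, rather than first joining the words into one text. The surname list, the function name and the rule "surname plus one to three Chinese characters" are assumptions made for the example.

```python
import re

SURNAMES = ["张", "李", "王", "欧阳"]  # hypothetical, tiny surname database


def candidate_names(tokens, surnames=SURNAMES):
    """Keep tokens that consist of a known surname plus 1-3 Chinese characters."""
    # Longest surnames first so compound surnames win over single characters.
    ordered = sorted(map(re.escape, surnames), key=len, reverse=True)
    pattern = re.compile(rf"(?:{'|'.join(ordered)})[\u4E00-\u9FA5]{{1,3}}")
    return [t for t in tokens if pattern.fullmatch(t)]


names = candidate_names(["股东", "张三", "认缴", "欧阳明", "万元", "李"])
print(names)
```

A bare surname such as "李" is rejected here because at least one given-name character is required; ordinary words that happen to start with a surname character would still pass, which is why the later keyword check is needed.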
Step S104: locating the position of the candidate name in the text of the company chapter, and, if the distance between the located position and the position of a preset keyword in the text is less than or equal to a threshold, determining the candidate name to be sponsor information; and if the distance is greater than the threshold, performing keyword labeling or prompting.
When a segmented word that may be a Chinese name has been determined to be a candidate name, the position of the name in the text of the company chapter is located, and a preset number of characters (for example, 10 characters) before and after the word are extracted. If a preset keyword appears within those characters, the candidate name is determined to be sponsor information in the funding information. The preset keywords, such as 'payment acceptance' and 'funding', may be set in advance and serve to confirm that the name is that of a sponsor. If no preset keyword appears, that is, the distance between the located position and the position of the preset keyword in the text is greater than the threshold, keyword labeling or prompting is performed. The labeling and prompting may take various forms, which are not limited herein.
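The window check can be sketched as below. The Chinese keyword spellings, the function name and the choice to inspect only the first occurrence of each name are assumptions of this example; the 10-character window is the value given as an example in the text.

```python
KEYWORDS = ("认缴", "出资")  # assumed spellings of "payment acceptance" and "funding"
WINDOW = 10                  # characters inspected before and after the name


def classify_candidates(text, candidates, keywords=KEYWORDS, window=WINDOW):
    """Split candidates into confirmed sponsors and names flagged for review."""
    sponsors, flagged = [], []
    for name in candidates:
        pos = text.find(name)  # first occurrence only, for simplicity
        if pos < 0:
            continue
        context = text[max(0, pos - window): pos + len(name) + window]
        if any(keyword in context for keyword in keywords):
            sponsors.append(name)
        else:
            flagged.append(name)  # would be labeled or prompted for review
    return sponsors, flagged


text = "股东张三认缴人民币20万元" + "本章程自公司登记之日起生效" + "经办人李四"
sponsors, flagged = classify_candidates(text, ["张三", "李四"])
print(sponsors, flagged)
```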
According to the method for automatically extracting funding information provided by the embodiment of the invention, after the company chapter is obtained, its text is read and segmented into words to obtain a word segmentation result; each segmented word is matched against the surnames in a surname database, and each matched word is taken as a candidate name; the position of the candidate name in the text of the company chapter is located, and if the distance between the located position and the position of a preset keyword in the text is less than or equal to a threshold, the candidate name is determined to be sponsor information; if the distance is greater than the threshold, keyword labeling or prompting is performed. In this way, effective funding information can be extracted from the company chapter automatically and quickly, without incurring high labor and time costs.
Furthermore, the present application can automatically acquire the funding ratio in the funding information. FIG. 2 is a flowchart of another specific embodiment of the method for automatically extracting funding information provided by the present application, which specifically includes:
Step S201: acquiring a company chapter;
In this embodiment, the company chapter may be a text file obtained directly, or a text file obtained by converting the format of an original file. Specifically, an original file of the company chapter is acquired and format-converted into a company chapter in text format. The format conversion may be performed once or multiple times, which is not limited herein. Taking an original file in PDF format as an example, an open source tool may be used to convert the PDF file into image files, and an open source OCR tool may then be used to convert the image files into a text file, thereby obtaining the text of the company chapter.
The text of the company chapter is read, and all spaces and punctuation are removed. The spaces and punctuation may include, but are not limited to: ' ', '\t', '\n', '.', ':', '□', '?', '!', ';', ',' and '……'.
Step S202: matching, in the text of the company chapter, a predetermined format in which payment acceptance information appears, and, when it is matched, extracting name information and amount information from the matched text, wherein the name information corresponds one-to-one to the amount information, and the value of the amount information is the funding amount of the corresponding sponsor;
After the text of the company chapter is obtained, the predetermined format in which payment acceptance information appears is matched by regular-expression text matching. If the predetermined format appears, payment acceptance information is considered to exist in the text, and name information and amount information are extracted from the corresponding text. The predetermined format may be a regular matching rule: payment acceptance information is considered to occur when a combination of name information, a keyword and amount information is detected in the text, that is, a name accompanied by a keyword and an amount. The keywords may be, for example, payment, RMB, payment amount, or payment in currency. In one specific format, the rule is an alternation of patterns built by string formatting with % (i, i, …, i), so that every %s placeholder is filled with the value i. Each alternative starts with %s followed by [\u4E00-\u9FA5]{1,3}, which matches one to three Chinese characters; some alternatives then include \d{17}[a-zA-Z] for an identity card number that follows the name; and each contains a keyword such as RMB or currency followed by \d+ for the amount. The specific format is only exemplary and not limiting.
After the predetermined format of the payment acceptance information has been matched in the text of the company chapter, word segmentation is performed on the matched text, and name information and numeric information are extracted from it. In general, the name information and the numeric information are in one-to-one correspondence. For example, when the matched text contains the name "Zhang San" and the amount "200,000 yuan", the name information "Zhang San" and the corresponding funding amount "200,000 yuan" are extracted; 200,000 yuan is the funding amount of the sponsor Zhang San.
Because a name is often followed directly by an identity card number, it can further be judged, after the name information and the numeric information have been extracted, whether the numeric information is in the form of an identity card number. If so, that numeric information is filtered out, and the remaining numeric information is taken as the amount information. This prevents an identity card number from being mistaken for a funding amount. Whether the numeric information is in the form of an identity card number may be judged by checking whether it is an 18-digit number.
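The identity card filter can be sketched as follows. The helper names and the token pattern are assumptions; the check accepts 18 digits as described in the text and also 17 digits followed by a letter, matching the `\d{17}[a-zA-Z]` fragment of the example rule. The identity card numbers in the sample strings are fictitious.

```python
import re


def is_id_number(token: str) -> bool:
    """True for 18 characters: 17 digits followed by a digit or a letter."""
    return re.fullmatch(r"\d{17}[\dA-Za-z]", token) is not None


def amount_tokens(matched_text: str) -> list[str]:
    """Extract numeric tokens from matched text and drop identity card numbers."""
    tokens = re.findall(r"\d+[A-Za-z]?", matched_text)
    return [t for t in tokens if not is_id_number(t)]


# Name, fictitious identity card number, keyword, amount (20 wan yuan).
amounts = amount_tokens("张三42010119900101123X认缴20万元")
print(amounts)
```

Only the digits "20" survive; the unit that follows them still has to be interpreted before the value is used as an amount.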
Step S203: matching, in the text of the company chapter, a predetermined format in which registered capital information appears, and extracting the registered capital amount from the matched text;
The predetermined format in which registered capital information appears is matched in the text of the company chapter. The predetermined format may be the text "registered capital" accompanied by currency information. When a match is found, the currency information is extracted from the matched text as the registered capital amount. For example, if "registered capital 800,000 yuan" is matched in the text, 800,000 yuan is taken as the registered capital amount.
Step S204: calculating the ratio of the funding amount of each sponsor to the registered capital amount to generate funding ratio information for each sponsor.
The correspondence between each sponsor and the funding amount is obtained in step S202, and the registered capital amount is obtained in step S203. Dividing each sponsor's funding amount by the registered capital amount gives the funding ratio information of that sponsor. For example, when the funding amount obtained above for "Zhang San" is 200,000 yuan and the registered capital amount is 800,000 yuan, the funding ratio of "Zhang San" is 200,000 / 800,000 = 25%.
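A minimal sketch of steps S203 and S204 follows. The regular expression, the unit table and the function names are assumptions for illustration; the contributions are passed in as an already extracted mapping from sponsor to amount in yuan.

```python
import re

UNIT_TO_YUAN = {"万元": 10_000, "元": 1}  # 万元 means ten thousand yuan


def registered_capital(text):
    """Return the registered capital in yuan, or None if no match is found."""
    match = re.search(r"注册资本(?:人民币)?(\d+)(万元|元)", text)
    if match is None:
        return None
    return int(match.group(1)) * UNIT_TO_YUAN[match.group(2)]


def funding_ratios(contributions, capital):
    """Map each sponsor to funding amount divided by registered capital."""
    return {name: amount / capital for name, amount in contributions.items()}


capital = registered_capital("公司注册资本80万元")  # "registered capital 80 wan yuan"
ratios = funding_ratios({"张三": 200_000, "李四": 600_000}, capital)
print(capital, ratios)
```

With the figures from the example in the text, Zhang San's ratio comes out as 0.25, that is, 25%.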
On the basis of any of the above embodiments, the fund in-place time in the funding information can also be acquired automatically. FIG. 3 is a flowchart of another embodiment of the method for automatically extracting funding information provided by the present application, which specifically includes:
Step S301: acquiring a company chapter;
In this embodiment, the company chapter may be a text file obtained directly, or a text file obtained by converting the format of an original file. Specifically, an original file of the company chapter is acquired and format-converted into a company chapter in text format. The format conversion may be performed once or multiple times, which is not limited herein. Taking an original file in PDF format as an example, an open source tool may be used to convert the PDF file into image files, and an open source OCR tool may then be used to convert the image files into a text file, thereby obtaining the text of the company chapter.
The text of the company chapter is read, and all spaces and punctuation are removed. The spaces and punctuation may include, but are not limited to: ' ', '\t', '\n', '.', ':', '□', '?', '!', ';', ',' and '……'.
Step S302: matching the date formats that appear in the text of the company chapter to obtain a date list;
All date formats that appear in the text of the company chapter are matched, resulting in a date list. For example, a regular matching rule made up of three alternatives may be used: the first alternative, 20\d{2}年\d{1,2}月\d{1,2}日, matches dates written with the year, month and day characters, and the remaining alternatives match numeric date forms that begin with 20\d{2}. The regular matching rule is only one embodiment and is not limiting.
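Date matching can be sketched as below. The first alternative of the pattern follows the year, month and day form described in the text; the hyphen and slash separators in the other two alternatives are assumptions of this example, as are the function name and the decision to return each date together with its position.

```python
import re

DATE_PATTERN = re.compile(
    r"20\d{2}年\d{1,2}月\d{1,2}日"   # year/month/day characters
    r"|20\d{2}-\d{1,2}-\d{1,2}"      # assumed separator: hyphen
    r"|20\d{2}/\d{1,2}/\d{1,2}"      # assumed separator: slash
)


def find_dates(text):
    """Return (date string, start position) for every date found in the text."""
    return [(m.group(), m.start()) for m in DATE_PATTERN.finditer(text)]


# "Zhang San pays in full before 2020-03-31; Li Si before 2021-06-30."
dates = find_dates("张三于2020年3月31日前缴足李四于2021-06-30前缴足")
print(dates)
```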
Step S303: after determining that the candidate names are sponsor information in the sponsor information, locating a first position of each sponsor in text of the company chapters;
step S304: traversing the date list, locating a second position of each date in the text of the company chapters, calculating the distance between the first position and the second position, and determining the date as the fund in-place time corresponding to the sponsor if the sponsor with the minimum distance is selected.
Locate the location of each patron in the text of the corporate chapters [ loc _ name1, loc _ name2, … ]; locate the first location loc _ date in the text of the company chapters, calculate the distance to each patron: if the sponsor with the smallest distance is selected, the fund arrival time of the sponsor is the date, | loc _ date-loc _ name1, | loc _ date-loc _ name2|, ….
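The nearest-sponsor rule can be sketched as follows, assuming the sponsor positions and the date positions have already been located. The function name and the data shapes are hypothetical; when two dates are closest to the same sponsor, this sketch keeps the later one, which is a simplification.

```python
def assign_dates(name_positions, date_positions):
    """Give each date to the sponsor whose position is closest to it.

    name_positions: dict mapping sponsor name to its position in the text.
    date_positions: list of (date string, position) pairs.
    """
    assigned = {}
    if not name_positions:
        return assigned
    for date, loc_date in date_positions:
        nearest = min(
            name_positions,
            key=lambda name: abs(loc_date - name_positions[name]),
        )
        assigned[nearest] = date  # a later date overwrites an earlier one
    return assigned


in_place = assign_dates(
    {"张三": 0, "李四": 16},
    [("2020年3月31日", 3), ("2021-06-30", 19)],
)
print(in_place)
```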
In this way, effective funding information, which may include sponsor information, funding ratio information and fund in-place time, can be extracted from the company chapter automatically and quickly, without incurring high labor and time costs.
The following describes an apparatus for automatically extracting funding information according to an embodiment of the present invention, and the apparatus for automatically extracting funding information described below and the method for automatically extracting funding information described above may be referred to in correspondence with each other.
Fig. 4 is a block diagram illustrating an apparatus for automatically extracting funding information according to an embodiment of the present invention, where the apparatus for automatically extracting funding information according to fig. 4 may include:
an acquisition module 100, configured to acquire a company charter;
a reading module 200, configured to read the text of the company charter and perform word segmentation on the text of the company charter to obtain a word segmentation result;
a matching module 300, configured to match each segmented word with the surnames in a surname database, and take each segmented word obtained by matching as a candidate name;
a determining module 400, configured to locate the position of the candidate name in the text of the company charter, and if the distance between the located position and the position of a preset keyword in the text is less than or equal to a threshold, determine that the candidate name is sponsor information; and if the distance is greater than the threshold, carry out keyword labeling or prompting.
The apparatus for automatically extracting funding information of this embodiment is used to implement the aforementioned method for automatically extracting funding information, so its specific implementation can be found in the foregoing method embodiments. For example, the acquisition module 100, the reading module 200, the matching module 300 and the determining module 400 implement steps S101, S102, S103 and S104 of the above method, respectively; their specific implementations may therefore refer to the descriptions of the corresponding embodiments and are not repeated here.
The apparatus for automatically extracting funding information provided by this embodiment of the invention acquires a company charter; reads the text of the company charter and performs word segmentation on it to obtain a word segmentation result; matches each segmented word with the surnames in a surname database and takes each segmented word obtained by matching as a candidate name; and locates the position of the candidate name in the text of the company charter. If the distance between the located position and the position of a preset keyword in the text is smaller than or equal to a threshold value, the candidate name is determined to be sponsor information; if the distance is larger than the threshold value, keyword labeling or prompting is carried out. With this scheme, effective funding information can be extracted from the company charter automatically and rapidly, without incurring substantial labor and time costs.
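The surname matching and keyword-distance check summarized above can be sketched as follows. The sketch takes an already segmented word list as input (segmentation itself is left to an external tool), and the surname set, keyword list, threshold value and the rule "a candidate is a segmented word that begins with a known surname" are all assumptions made for illustration; the description does not fix any of them.

```python
# Hypothetical stand-ins for the surname database, preset keywords and threshold.
SURNAMES = {"张", "李", "王", "欧阳"}
KEYWORDS = ["股东", "出资人"]
THRESHOLD = 10  # maximum name-to-keyword distance, in characters


def candidate_names(tokens):
    """Keep segmented words that begin with a known surname (assumed rule)."""
    return [t for t in tokens
            if any(t.startswith(s) and len(t) > len(s) for s in SURNAMES)]


def classify(text, tokens):
    """Return (sponsors, flagged): names near a keyword, and names to label or prompt."""
    # Every position at which any preset keyword occurs in the text.
    keyword_pos = [i for k in KEYWORDS
                   for i in range(len(text)) if text.startswith(k, i)]
    sponsors, flagged = [], []
    for name in dict.fromkeys(candidate_names(tokens)):  # de-duplicate, keep order
        pos = text.find(name)
        near = pos >= 0 and any(abs(pos - k) <= THRESHOLD for k in keyword_pos)
        (sponsors if near else flagged).append(name)
    return sponsors, flagged


text = "本公司股东为张三和李四。公司住所位于武汉市，经营范围包括软件开发与技术服务，法定代表人为王五。"
tokens = ["本公司", "股东", "为", "张三", "和", "李四", "公司", "住所", "位于", "武汉市",
          "经营范围", "包括", "软件开发", "与", "技术服务", "法定代表人", "为", "王五"]
sponsors, flagged = classify(text, tokens)
print(sponsors, flagged)
```

In this example the two names that follow the keyword "股东" fall within the threshold and are treated as sponsors, while the legal representative's name, which is far from any keyword, is only flagged.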
In addition, the present application further provides a system for automatically extracting the funding information, as shown in fig. 5, the system 1 for automatically extracting the funding information may specifically include:
a memory 11 for storing a computer program;
a processor 12 for implementing the following steps when executing the computer program: acquiring a company charter; reading the text of the company charter, and performing word segmentation on the text of the company charter to obtain a word segmentation result; matching each segmented word with the surnames in a surname database, and taking each segmented word obtained by matching as a candidate name; locating the position of the candidate name in the text of the company charter, and if the distance between the located position and the position of a preset keyword in the text is smaller than or equal to a threshold value, determining that the candidate name is sponsor information; and if the distance is larger than the threshold value, carrying out keyword labeling or prompting.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit, such as a hard disk. The memory 11 may also be an external storage device of the device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 11 may also include both an internal storage unit and an external storage device. The memory 11 may be used not only to store application software installed in the device and various types of data, but also to temporarily store data that has been output or will be output.
Processor 12, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip, executes program code stored in memory 11 or processes data.
Optionally, the processor 12 is configured to implement the following steps when executing the computer program: acquiring an original file of the company charter; and performing format conversion on the original file of the company charter to convert it into a company charter in text format.
Optionally, the processor 12 is configured to implement the following steps when executing the computer program: reading the text of the company charter, removing the spaces and punctuation in the text, and performing word segmentation on the text of the company charter to obtain a word segmentation result.
Optionally, the processor 12, when executing the computer program, may further implement the following steps: acquiring a stop word database, traversing the word segmentation result to determine whether any stop word in the stop word database appears, and if so, deleting the corresponding stop word from the word segmentation result.
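The two clean-up steps above (removing spaces and punctuation before segmentation, and deleting stop words after it) can be sketched as follows. The stop word set and the function names are hypothetical, and the segmented word list is supplied by hand because the description does not name a segmentation tool.

```python
import re

# Hypothetical stand-in for the stop word database.
STOP_WORDS = {"的", "和", "为", "于"}


def strip_spaces_and_punct(text):
    """Remove every run of non-word characters: whitespace and punctuation,
    including full-width CJK punctuation. Underscores are kept."""
    return re.sub(r"\W+", "", text)


def remove_stop_words(tokens):
    """Drop every segmented word that appears in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


cleaned = strip_spaces_and_punct("股东 张三， 出资 50 万元。")
kept = remove_stop_words(["股东", "为", "张三", "和", "李四"])
print(cleaned, kept)
```

Note that stripping punctuation this way also removes separators such as "-" or "/" inside dates, so in a sketch like this the date matching would need to run on the unstripped text.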
Optionally, the processor 12, when executing the computer program, may further implement the following steps: matching a preset format in which subscribed-capital information appears against the text of the company charter, and, when a match is found, extracting name information and amount information from the matched text, wherein the name information corresponds to the amount information one to one, and the value of the amount information is the contribution amount of the corresponding sponsor; matching a preset format in which registered capital information appears against the text of the company charter, and extracting the registered capital amount from the matched text; and calculating the proportion of the contribution amount of each sponsor to the registered capital amount, to generate the funding proportion information of each sponsor.
Optionally, the processor 12, when executing the computer program, may further implement the following steps: matching a preset format in which subscribed-capital information appears against the text of the company charter; when a match is found, performing word segmentation on the matched text and extracting name information and numeric information from it; judging whether the numeric information is in identity card number format, and if so, filtering out that numeric information; and taking the numeric information that remains after filtering as the amount information.
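The amount extraction, identity card filtering and proportion calculation described above can be sketched as follows. The description does not disclose the actual preset formats, so every regular expression here, the sentence layout of the sample text and the function names are assumptions chosen only to make the example runnable; amounts are assumed to be stated in the same unit (万元) as the registered capital.

```python
import re

# All patterns below are hypothetical "preset formats" chosen for this sketch.
SUBSCRIPTION_RE = re.compile(r"股东[^。]*认缴[^。]*。")      # one subscription sentence
REGISTERED_RE = re.compile(r"注册资本\D*?(\d+(?:\.\d+)?)万元")
NAME_RE = re.compile(r"股东([^（(\d]+)")                     # name follows "股东"
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?[Xx]?")
ID_CARD_RE = re.compile(r"\d{17}[\dXx]")                     # 18-character ID number


def contributions(text):
    """Return {sponsor name: contribution amount} from subscription sentences."""
    result = {}
    for sentence in SUBSCRIPTION_RE.findall(text):
        name = NAME_RE.search(sentence)
        numbers = NUMBER_RE.findall(sentence)
        # Filter out numbers that have the identity card number format.
        amounts = [n for n in numbers if not ID_CARD_RE.fullmatch(n)]
        if name and amounts:
            result[name.group(1)] = float(amounts[0])
    return result


def proportions(text):
    """Return each sponsor's share of the registered capital."""
    m = REGISTERED_RE.search(text)
    if m is None:
        return {}
    registered = float(m.group(1))
    return {n: a / registered for n, a in contributions(text).items()}


charter = ("股东张三（身份证号110101199003070011）认缴出资60万元。"
           "股东李四（身份证号11010119880512002X）认缴出资40万元。"
           "公司注册资本为100万元。")
shares = proportions(charter)
print(shares)
```

The identity card numbers in the sample are dummy values. Filtering them before picking the amount matters because an 18-digit identifier would otherwise be read as a very large contribution.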
Optionally, the processor 12, when executing the computer program, may further implement the following steps: matching the date formats that appear in the text of the company charter to obtain a date list; after determining that the candidate name is sponsor information, locating a first position of each sponsor in the text of the company charter; traversing the date list, locating a second position of each date in the text of the company charter, calculating the distance between the first position and the second position, selecting the sponsor with the minimum distance, and determining the date as the fund in-place time of that sponsor.
Furthermore, the present application provides a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the steps of any one of the above-mentioned methods for automatically extracting funding information.
The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The system and the computer-readable storage medium for automatically extracting the funding information provided by the application correspond to the method. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, it is described only briefly; for relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The method, apparatus, system and computer-readable storage medium for automatically extracting funding information provided by the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that those skilled in the art may make various improvements and modifications to the present invention without departing from its principle, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (9)

1. A method for automatically extracting funding information, comprising:
acquiring a company charter;
reading the text of the company charter, and performing word segmentation on the text of the company charter to obtain a word segmentation result;
matching each segmented word with the surnames in a surname database, and taking each segmented word obtained by matching as a candidate name;
locating the position of the candidate name in the text of the company charter, and if the distance between the located position and the position of a preset keyword in the text is smaller than or equal to a threshold value, determining that the candidate name is sponsor information; if the distance is larger than the threshold value, carrying out keyword labeling or prompting;
wherein after the reading of the text of the company charter and the performing of word segmentation on the text of the company charter to obtain a word segmentation result, the method further comprises:
acquiring a stop word database, traversing the word segmentation result to determine whether any stop word in the stop word database appears, and if so, deleting the corresponding stop word from the word segmentation result.
2. The method for automatically extracting funding information according to claim 1, wherein the acquiring a company charter comprises:
acquiring an original file of the company charter;
and performing format conversion on the original file of the company charter to convert it into a company charter in text format.
3. The method for automatically extracting funding information according to claim 2, wherein the reading the text of the company charter and performing word segmentation on the text of the company charter to obtain a word segmentation result comprises:
reading the text of the company charter, removing the spaces and punctuation in the text, and performing word segmentation on the text of the company charter to obtain a word segmentation result.
4. The method for automatically extracting funding information according to claim 3, further comprising, after the acquiring a company charter:
matching a preset format in which subscribed-capital information appears against the text of the company charter, and, when a match is found, extracting name information and amount information from the matched text, wherein the name information corresponds to the amount information one to one, and the value of the amount information is the contribution amount of the corresponding sponsor;
matching a preset format in which registered capital information appears against the text of the company charter, and extracting the registered capital amount from the matched text;
and calculating the proportion of the contribution amount of each sponsor to the registered capital amount, to generate the funding proportion information of each sponsor.
5. The method for automatically extracting funding information according to claim 4, wherein the matching a preset format in which subscribed-capital information appears against the text of the company charter, and, when a match is found, extracting name information and amount information from the matched text comprises:
matching the preset format in which subscribed-capital information appears against the text of the company charter, and, when a match is found, performing word segmentation on the matched text and extracting name information and numeric information from it;
judging whether the numeric information is in identity card number format, and if so, filtering out that numeric information;
and taking the numeric information that remains after filtering as the amount information.
6. The method for automatically extracting funding information according to claim 3, further comprising, after the acquiring a company charter:
matching the date formats that appear in the text of the company charter to obtain a date list;
wherein after the determining that the candidate name is sponsor information, the method further comprises:
locating a first position of each sponsor in the text of the company charter;
traversing the date list, locating a second position of each date in the text of the company charter, calculating the distance between the first position and the second position, selecting the sponsor with the minimum distance, and determining the date as the fund in-place time of that sponsor.
7. An apparatus for automatically extracting funding information, comprising:
an acquisition module, configured to acquire a company charter;
a reading module, configured to read the text of the company charter and perform word segmentation on the text of the company charter to obtain a word segmentation result;
a matching module, configured to match each segmented word with the surnames in a surname database, and take each segmented word obtained by matching as a candidate name;
a determining module, configured to locate the position of the candidate name in the text of the company charter, and if the distance between the located position and the position of a preset keyword in the text is smaller than or equal to a threshold value, determine that the candidate name is sponsor information; and if the distance is larger than the threshold value, carry out keyword labeling or prompting.
8. An apparatus for automatically extracting funding information, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of automatically extracting funding information as claimed in any one of claims 1 to 6 when said computer program is executed.
9. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for automatically extracting funding information according to any one of claims 1 to 6.
CN202010243586.XA 2020-03-31 2020-03-31 Method, device and equipment for automatically extracting financing information Pending CN111428497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010243586.XA CN111428497A (en) 2020-03-31 2020-03-31 Method, device and equipment for automatically extracting financing information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010243586.XA CN111428497A (en) 2020-03-31 2020-03-31 Method, device and equipment for automatically extracting financing information

Publications (1)

Publication Number Publication Date
CN111428497A true CN111428497A (en) 2020-07-17

Family

ID=71556071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010243586.XA Pending CN111428497A (en) 2020-03-31 2020-03-31 Method, device and equipment for automatically extracting financing information

Country Status (1)

Country Link
CN (1) CN111428497A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090112537A1 (en) * 2007-10-29 2009-04-30 Oki Electric Industry Co., Ltd. Location expression detection device and computer readable medium
CN107392436A (en) * 2017-06-27 2017-11-24 北京神州泰岳软件股份有限公司 A kind of method and apparatus for extracting enterprise's incidence relation information
CN107608965A (en) * 2017-09-14 2018-01-19 掌阅科技股份有限公司 Extracting method, electronic equipment and the storage medium of books the names of protagonists
WO2019056392A1 (en) * 2017-09-25 2019-03-28 深圳市云中飞网络科技有限公司 Information processing method and apparatus, mobile terminal, and computer readable storage medium
CN108874771A (en) * 2018-05-25 2018-11-23 福州大学 A kind of information extraction method towards bid text
CN109857992A (en) * 2018-12-29 2019-06-07 医渡云(北京)技术有限公司 Medical data structuring analytic method, device, readable medium and electronic equipment
CN109858033A (en) * 2019-02-21 2019-06-07 陈包容 A method of name is extracted from text
CN110502694A (en) * 2019-07-23 2019-11-26 平安科技(深圳)有限公司 Lawyer's recommended method and relevant device based on big data analysis
CN110909122A (en) * 2019-10-10 2020-03-24 重庆金融资产交易所有限责任公司 Information processing method and related equipment

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
乔磊,李存华,仲兆满等: "基于规则的人物信息抽取算法的研究", vol. 35, no. 4, pages 134 - 139 *
周昆,胡学钢: "一种基于本体论和规则匹配的中文人名识别方法", vol. 26, no. 31, pages 87 - 89 *
张悦等: "人名识别技术在中国招中标领域的应用", 《北京信息科技大学学报(自然科学版)》, vol. 32, no. 5, pages 72 - 77 *
朱全银,周培,尹永华等: "基于Web数据挖掘的多因素科技专家信息提取方法", vol. 22, no. 5, pages 23 - 27 *
马晶晶: "金融领域信息的自动抽取与分析方法", no. 4, pages 138 - 1237 *
黑马程序员编著: "《Python数据分析与应用 从数据获取到可视化》", 31 January 2019, 北京:中国铁道出版社, pages: 226 - 227 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183031A (en) * 2020-10-16 2021-01-05 卓尔智联(武汉)研究院有限公司 Text processing method and device and electronic equipment
CN112183031B (en) * 2020-10-16 2023-08-01 卓尔智联(武汉)研究院有限公司 Text processing method and device and electronic equipment
CN112613310A (en) * 2021-01-04 2021-04-06 成都颜创启新信息技术有限公司 Name matching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination