CN111914064A - Text mining method, device, equipment and medium

Info

Publication number: CN111914064A
Application number: CN202010744784.4A
Authority: CN
Inventors: 王嘉兴
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2020-11-10

Abstract

The embodiment of the invention discloses a text mining method, apparatus, device, and medium. The method comprises: determining a target resource positioning character from candidate resource positioning characters; acquiring information of a target text in a target data source according to the target resource positioning character; and checking the information of the target text, and determining the text to be mined from the target text according to the checking result. The embodiment of the invention can accurately determine, among a plurality of data sources, the target data source to be searched, thereby effectively improving the extraction efficiency of the required text.

Description

Text mining method, device, equipment and medium

Technical Field

Embodiments of the present invention relate to information processing technologies, and in particular, to a text mining method, apparatus, device, and medium.

Background

Small and medium-sized enterprises are the capillaries of the national economy and play a vital role in its development. However, because the information of small and medium-sized enterprises is opaque, financial institutions cannot effectively identify and manage the associated risks, and a large number of these enterprises therefore find it difficult to obtain sufficient financial support from financial institutions. At present, enterprise text mining mainly screens a large number of webpage data resources over a broad range to determine useful and authentic text information.

The defects of this scheme are as follows: the data source is single, consisting mostly of public opinion and social network information; information disclosure websites are not crawled effectively, and instead have to be found by querying numerous webpages one by one, which greatly reduces the query efficiency of text information.

Disclosure of Invention

The embodiment of the application provides a text mining method, a text mining device, text mining equipment and a text mining medium, which can accurately determine a target data source to be searched in a plurality of data sources, so that the extraction efficiency of a required text can be effectively improved.

In a first aspect, an embodiment of the present invention provides a text mining method, including:

determining a target resource positioning character from the candidate resource positioning characters;

acquiring information of a target text in a target data source according to the target resource positioning character;

and checking the information of the target text, and determining the text to be mined from the target text according to a checking result.

Optionally, determining the target resource locator character from the candidate resource locator characters includes:

searching candidate resource positioning characters to obtain at least two candidate data sources;

determining a target resource positioning character according to the attribute information of at least two candidate data sources; wherein the attribute information of the candidate data source comprises at least one of a business name, a business registration address and a business type.

Optionally, obtaining information of a target text in a target data source according to the target resource positioning character includes:

calling the target resource positioning character through a crawler driver, and downloading information of an initial text from a target data source;

and extracting the information of the initial text to obtain the information of the target text.

Optionally, extracting the information of the initial text to obtain the information of the target text includes:

constructing a matching dictionary according to the items to be detected; the items to be detected comprise at least one of named entities, events, numerical values and time;

and extracting information of the initial text by using the matching dictionary to obtain information of a target text containing a to-be-detected item.

Optionally, the checking the information of the target text, and determining a text to be mined from the target text according to a checking result, includes:

verifying the information of the target text according to the enterprise risk value predicted from the information of the target text and a preset risk threshold;

and if the enterprise risk value is smaller than the preset risk threshold, taking the target text as a text to be mined, and evaluating the loan risk of the enterprise.

In a second aspect, an embodiment of the present invention provides a text mining apparatus, including:

the character determining module is used for determining a target resource positioning character from the candidate resource positioning characters;

the information acquisition module is used for acquiring the information of the target text in the target data source according to the target resource positioning character;

and the information checking module is used for checking the information of the target text and determining the text to be mined from the target text according to a checking result.

Optionally, the character determining module is specifically configured to:

Optionally, the information obtaining module is specifically configured to:

Optionally, the information obtaining module is further specifically configured to:

Optionally, the information checking module is specifically configured to:

In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text mining method of any of the embodiments of the present invention.

In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the text mining method according to any one of the embodiments of the present invention.

Determining a target resource positioning character from candidate resource positioning characters; acquiring information of a target text in a target data source according to the target resource positioning character; and checking the information of the target text, and determining the text to be mined from the target text according to a checking result. The embodiment of the invention can accurately determine the target data source to be searched in a plurality of data sources, thereby effectively improving the extraction efficiency of the required text.

Drawings

FIG. 1 is a flowchart illustrating a text mining method according to a first embodiment of the present invention;

FIG. 2 is a flowchart illustrating a text mining method according to a second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a text mining device in a third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device in a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart illustrating a text mining method according to a first embodiment of the present invention. The embodiment is applicable to acquiring text information accurately and quickly. The method of the embodiment may be performed by a text mining apparatus, which may be implemented in hardware and/or software, may be configured in an electronic device, and can implement the text mining method according to any embodiment of the present application. As shown in fig. 1, the method specifically includes the following steps:

S110, determining target resource positioning characters from the candidate resource positioning characters.

In this embodiment, a candidate resource positioning character is the location address of a web page of a professional website in the financial industry, i.e. a URL (Uniform Resource Locator), which may be displayed in the location or URL box at the top of the browser. The target resource positioning character is determined by eliminating, according to the requirements of the practitioner, the data sources that are useless to the practitioner from the candidate data sources corresponding to the candidate resource positioning characters, and taking the resource positioning character of a useful data source as the target resource positioning character.

In the traditional mode, the required data sources are queried one by one across a plurality of non-professional websites, and the data sources are screened according to the query results to obtain useful data sources for subsequent risk assessment. The data source of this query mode is single, consisting mostly of public opinion and social network information; data cannot be crawled from professional websites in a targeted manner, which greatly reduces data acquisition efficiency. Therefore, the present application selects the websites to be queried in a targeted manner: the target resource positioning character is first determined according to the different requirements of practitioners, so that the required information can be obtained directly through the target resource positioning character, which reduces the query workload and improves query accuracy.

S120, acquiring information of a target text in the target data source according to the target resource positioning characters.

In this embodiment, when the target resource positioning character is entered in the locator box, the browser jumps directly to the corresponding web page containing the information of the target text, and the information of the target text can be acquired from that web page.

Specifically, the required information can be acquired automatically through the target resource positioning characters by a distributed web crawler. The distributed web crawler is based on Hadoop, an open-source Apache project, and mainly relies on the MapReduce component and the storage component, the Hadoop Distributed File System (HDFS). The MapReduce component can be used for splitting, shuffling and merging data, and allows a plurality of computers to compute a task in parallel; HDFS is a distributed file system for handling large datasets across multiple computers. Together they make the web crawler more efficient.
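The splitting, shuffling and merging performed by MapReduce can be illustrated without a Hadoop cluster. The following is a minimal single-process sketch in Python, not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `count_pages_per_host`), the example URLs, and the choice of counting pages per host as the task are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_phase(urls):
    """Map: emit a (host, 1) pair for every URL in one input split."""
    return [(urlparse(url).netloc, 1) for url in urls]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce (merge): combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def count_pages_per_host(splits):
    """Run the three phases over several input splits.

    On a real cluster each split could be mapped on a different machine;
    here the splits are simply processed one after another.
    """
    pairs = []
    for split in splits:
        pairs.extend(map_phase(split))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical input: two splits of page URLs.
splits = [
    ["https://example.com/a", "https://example.org/b"],
    ["https://example.com/c"],
]
print(count_pages_per_host(splits))
```

Because each split is mapped independently, the map step is the part that a cluster can distribute across machines; only the shuffle needs to bring values with the same key together.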

S130, checking the information of the target text, and determining the text to be mined from the target text according to the checking result.

In this embodiment, because the obtained information of the target text may not be of much help for enterprise risk assessment, the validity of the information of the target text needs to be verified, so that the text adopted in enterprise risk assessment is highly representative and can provide accurate, effective support for enterprise risk analysis. The embodiment of the invention can accurately determine, among a plurality of data sources, the target data source to be searched, thereby effectively improving the extraction efficiency of the required text.

Example two

Fig. 2 is a flowchart illustrating a text mining method according to a second embodiment of the present invention. This embodiment further expands and optimizes the above embodiment, and can be combined with any optional alternative in the above technical scheme. As shown in fig. 2, the method includes:

S210, determining target resource positioning characters from the candidate resource positioning characters.

S220, calling the target resource positioning characters through the crawler driving program, and downloading information of the initial text from the target data source.

In this embodiment, the target resource positioning characters and a configuration file on the local file system are selected, so that the target resource positioning characters can be uploaded accurately and quickly to a list database in the HDFS. The target resource positioning characters in the list database (there may be a plurality of them in this embodiment) are sent in sequence to a crawler driver written in advance. The crawler driver comprises a Hadoop-based MapReduce process, which requests resources from the internet (the web pages reached through the target resource positioning characters) and downloads the files displayed on those web pages.
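The sequential hand-off of resource positioning characters to the crawler driver might look like the sketch below. It is an offline illustration under stated assumptions: `run_crawler_driver` and `fake_fetch` are hypothetical names, a plain Python list stands in for the list database, and no Hadoop or network access is involved.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def run_crawler_driver(
    url_list: Iterable[str],
    fetch: Callable[[str], str],
) -> Dict[str, str]:
    """Send each URL in the list to the fetcher and collect the downloads.

    `url_list` stands in for the list database and `fetch` for the
    download step; both are hypothetical stand-ins for illustration.
    """
    pending = deque(url_list)
    seen = set()
    downloaded = {}
    while pending:
        url = pending.popleft()
        if url in seen:  # skip URLs that were already requested
            continue
        seen.add(url)
        try:
            downloaded[url] = fetch(url)
        except OSError:
            # A failed request is skipped so one bad URL does not stop the run.
            continue
    return downloaded


def fake_fetch(url: str) -> str:
    """Offline stand-in for an HTTP request (assumption, no network used)."""
    if url.endswith("/missing"):
        raise OSError("page not available")
    return f"<html><body>content of {url}</body></html>"


pages = run_crawler_driver(
    [
        "https://example.com/bids",
        "https://example.com/missing",
        "https://example.com/bids",
    ],
    fake_fetch,
)
print(sorted(pages))
```

Passing the download step in as a function keeps the driver loop testable offline; a real fetcher could wrap `urllib.request.urlopen`, whose `URLError` is a subclass of `OSError` and would be caught by the same handler.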

Specifically, the types of files displayed on a web page include ".html", ".css", and ".js". HTML stands for HyperText Markup Language; it is the standard markup language for creating web pages and contains all the text information of the web page. CSS stands for Cascading Style Sheets, which describe how HTML elements are displayed on a screen. JavaScript (.js) is applied to HTML documents and can provide dynamic interaction on websites. CSS and JS are generally of little use for text mining tasks, so in this embodiment only the HTML is stored, in originalbubdb.
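Keeping only the HTML files can be sketched as a filter on the file extension of each downloaded resource. The helper names `is_html_resource` and `keep_html_only` and the example URLs are hypothetical; deciding by URL suffix is an assumption, and a real crawler might inspect the response content type instead.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_html_resource(url: str) -> bool:
    """Return True if the URL path ends in .html (case-insensitive)."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix == ".html"


def keep_html_only(downloads: dict) -> dict:
    """Drop .css, .js, and any other non-HTML files from a download map."""
    return {url: body for url, body in downloads.items() if is_html_resource(url)}


# Hypothetical downloads keyed by URL.
downloads = {
    "https://example.com/notice.html": "<html>...</html>",
    "https://example.com/site.css": "body { margin: 0 }",
    "https://example.com/app.js": "console.log('hi')",
}
print(list(keep_html_only(downloads)))
```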

A parser driver extracts the target resource positioning characters and the information of the initial text according to regular expressions and XML Path Language (XPath) rules; the MapReduce process helps optimize the storage of the target resource positioning characters in order to update the Urlast DB in the HDFS; finally, the information of the initial text is stored in a text database.
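A parser step that combines XPath rules with a regular expression can be sketched with the Python standard library. This is an illustration only: the page content, the function name `extract_links_and_text`, and the rule "keep absolute http(s) links and paragraph text" are assumptions, and `xml.etree.ElementTree` supports only a limited XPath subset and requires well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; real HTML is often not valid XML and
# would need a tolerant HTML parser before XPath queries can be applied.
PAGE = """<html><body>
<h1>Tender notice</h1>
<p>Contract value: 1200000 yuan.</p>
<a href="https://example.com/notice/2.html">next</a>
<a href="/about">about</a>
</body></html>"""

# Regular expression used to keep only absolute http(s) links.
ABSOLUTE_URL = re.compile(r"^https?://\S+$")


def extract_links_and_text(page: str):
    """Return (absolute links, paragraph texts) found in the page."""
    root = ET.fromstring(page)
    # XPath-style rule: every <a> element that has an href attribute.
    hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
    links = [h for h in hrefs if ABSOLUTE_URL.match(h)]
    # XPath-style rule: every <p> element with non-empty text.
    texts = [p.text.strip() for p in root.findall(".//p") if p.text]
    return links, texts


links, texts = extract_links_and_text(PAGE)
print(links)
print(texts)
```

The extracted links correspond to newly found resource positioning characters, and the paragraph texts correspond to the information of the initial text that would be stored in the text database.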

S230, extracting the information of the initial text to obtain the information of the target text.

In this embodiment, the information of the initial text stored in the text database is unstructured data, which is still inconvenient for a practitioner to analyze. It is therefore necessary to process the unstructured data and convert it into structured data, so that the valid data contained in the information of the initial text can be identified accurately and useful target text information obtained.

S240, checking the information of the target text, and determining the text to be mined from the target text according to the checking result.

On the basis of the foregoing embodiment, optionally, S210 includes:

In this embodiment, a candidate resource positioning character may be the link address of an information disclosure page of a professional website, where the professional websites include a bidding website, a recruitment website, a court judgment document website, a national enterprise credit information website, and the like. Searching with the plurality of candidate resource positioning characters yields a plurality of candidate data sources; among these, useless resources are eliminated and useful resources are retained according to the requirements of practitioners, which effectively reduces the subsequent search workload.

If the degree of association between the data in a candidate data source, identified according to the attribute information of the candidate data source, and that attribute information is greater than the association degree threshold, the candidate data source is considered valid and is taken as a target data source; if the degree of association is less than the association degree threshold, the candidate data source is considered useless and is excluded.
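The threshold test on the association degree can be sketched as follows. The text does not define how the association degree is computed, so the scoring rule used here (the fraction of attribute values that occur in the text of the data source) is purely an assumption, as are the names `association_degree` and `select_target_urls`, the example enterprise, and the threshold of 0.5.

```python
def association_degree(source_text: str, attributes: dict) -> float:
    """Fraction of attribute values that occur in the source text.

    This scoring rule is an assumption made for illustration; the
    description does not specify how the degree is computed.
    """
    values = [v for v in attributes.values() if v]
    if not values:
        return 0.0
    lowered = source_text.lower()
    hits = sum(1 for v in values if v.lower() in lowered)
    return hits / len(values)


def select_target_urls(candidates: dict, attributes: dict, threshold: float):
    """Keep the URLs whose data source scores above the threshold."""
    return [
        url
        for url, text in candidates.items()
        if association_degree(text, attributes) > threshold
    ]


# Hypothetical attribute information: name, registration address, type.
attributes = {
    "name": "Acme Tools Ltd",
    "registration_address": "12 Harbor Road",
    "type": "limited liability company",
}
# Hypothetical candidate data sources keyed by URL.
candidates = {
    "https://example.com/registry": (
        "Acme Tools Ltd, a limited liability company, 12 Harbor Road"
    ),
    "https://example.com/forum": "Someone mentioned acme tools ltd yesterday",
    "https://example.com/news": "Weather report",
}
print(select_target_urls(candidates, attributes, threshold=0.5))
```

In this example only the registry page matches all three attributes, so only its URL would be kept as a target resource positioning character.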

On the basis of the foregoing embodiment, optionally, the extracting information of the initial text to obtain information of the target text includes:

and extracting information of the initial text by using the matching dictionary to obtain the information of the target text containing the items to be detected.

In this embodiment, the different items to be detected are stored independently in the matching dictionary. The items to be detected make it possible to accurately identify how well a piece of information matches, so that the information of the initial text is effectively narrowed down and useful, representative information of the target text is obtained.

Specifically, the named entity may be each keyword included in the information of the initial text, for example, a title keyword on a web page; the event can be the text content in the information of the initial text; the numerical value can be the content in the form of numbers or characters in the information of the initial text; the time may be a publication time or an update time of the information of the initial text.
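One way to picture the matching dictionary is as a mapping from each item type to its own pattern, applied independently to the initial text. The sketch below is an assumption-laden illustration: the function names, the keyword lists, the sample sentence, and the choice of regular expressions (including an ISO-style date format for the time item) are all hypothetical.

```python
import re


def _alternation(words):
    """Compile a pattern matching any of the given literal words."""
    if not words:
        return re.compile(r"(?!)")  # a pattern that never matches
    return re.compile("|".join(re.escape(w) for w in words))


def build_matching_dictionary(entity_names, event_keywords):
    """Build one independent pattern per item type to be detected."""
    return {
        "named_entity": _alternation(entity_names),
        "event": _alternation(event_keywords),
        # Numbers; the lookarounds exclude digits that belong to a date.
        "value": re.compile(r"(?<![\d-])\d+(?:\.\d+)?(?![\d-])"),
        # Dates written as YYYY-MM-DD (assumed format).
        "time": re.compile(r"\d{4}-\d{2}-\d{2}"),
    }


def extract_items(text, dictionary):
    """Return the matches for each item type; empty types are dropped."""
    found = {name: pattern.findall(text) for name, pattern in dictionary.items()}
    return {name: hits for name, hits in found.items() if hits}


dictionary = build_matching_dictionary(
    entity_names=["Acme Tools Ltd"],
    event_keywords=["won the bid", "lawsuit"],
)
text = "On 2020-07-29 Acme Tools Ltd won the bid for 1200000 yuan."
print(extract_items(text, dictionary))
```

The result is structured data keyed by item type, which is the kind of target text information that the checking step can then consume.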

On the basis of the above embodiment, optionally, S240 includes:

and if the enterprise risk value is smaller than the preset risk threshold, taking the target text as the text to be mined, and evaluating the loan risk of the enterprise.

In this embodiment, the enterprise risk value is predicted from the target text and the risk factors, and comparing the enterprise risk value with the preset risk threshold effectively determines whether the information of the target text is qualified for the enterprise under those risk factors.

If the enterprise risk value is smaller than the preset risk threshold, the obtained information of the target text can reduce the risk of the enterprise with respect to the selected risk factors, making loan operations for small and medium-sized enterprises smoother. If the enterprise risk value is larger than the preset risk threshold, a target resource positioning character is reselected and the subsequent operations are repeated to determine the information of the target text, until qualified information of the target text is determined and taken as the text to be mined.
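The check-and-reselect loop described above can be sketched as follows. The description does not specify the risk prediction model, so `toy_risk` is a deliberately simple hypothetical stand-in (the share of assumed risk keywords found in the text); `select_text_to_mine`, the example texts, and the threshold of 0.5 are likewise assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical keywords used by the toy risk score below.
RISK_WORDS = ("overdue", "lawsuit", "default")


def toy_risk(text: str) -> float:
    """Toy stand-in for risk prediction: share of risk words present."""
    return sum(word in text for word in RISK_WORDS) / len(RISK_WORDS)


def select_text_to_mine(
    candidate_urls: Iterable[str],
    get_target_text: Callable[[str], str],
    predict_risk: Callable[[str], float],
    risk_threshold: float,
) -> Optional[Tuple[str, str]]:
    """Return the first (url, text) whose risk is below the threshold.

    Each URL whose text fails the check is skipped and the next one is
    tried, mirroring the reselection step; None means nothing qualified.
    """
    for url in candidate_urls:
        text = get_target_text(url)
        if predict_risk(text) < risk_threshold:
            return url, text
    return None


# Hypothetical target texts keyed by resource positioning character.
texts = {
    "https://example.com/a": "overdue loan and lawsuit pending",
    "https://example.com/b": "stable revenue and new contract",
}
result = select_text_to_mine(texts, texts.__getitem__, toy_risk, 0.5)
print(result)
```

Here the first text scores 2/3 and is rejected, while the second scores 0 and is returned as the text to be mined.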

In the loan practice of small and medium-sized enterprises, there are mainly two types of risk factors: factors related to the small and medium-sized enterprises themselves and factors related to the core large enterprises. The risk factors of small and medium-sized enterprises are divided into liquidity, profitability, leverage, debt-paying ability, activity ratio, asset size, and equity structure, similar to the risk factors of traditional financial solutions. The factors related to the core large enterprise are mainly divided into endogenous risks and exogenous risks: endogenous risks comprise enterprise operating risks, such as uncertain demand, uncertain supply, and uncertain product prices, and financial risks, such as enterprise credit or liquidity; exogenous risks include the macroeconomic situation, the situation of the specific industry to which the core large enterprise belongs, and the supply chain situation.

Example three

Fig. 3 is a schematic structural diagram of a text mining device in a third embodiment of the present invention, which is applicable to acquiring text information accurately and quickly. The device is configured in the electronic equipment and can implement the text mining method in any embodiment of the application. The device specifically comprises:

a character determination module 310, configured to determine a target resource locator character from the candidate resource locator characters;

the information acquisition module 320 is configured to acquire information of a target text in a target data source according to the target resource location character;

and the information checking module 330 is configured to check information of the target text, and determine a text to be mined from the target text according to a check result.

Optionally, the character determining module 310 is specifically configured to:

Optionally, the information obtaining module 320 is specifically configured to:

Optionally, the information obtaining module 320 is further specifically configured to:

Optionally, the information checking module 330 is specifically configured to:

By the text mining device of the third embodiment of the invention, the target data source to be searched can be accurately determined from the plurality of data sources, so that the extraction efficiency of the required text can be effectively improved.

The text mining device provided by the embodiment of the invention can execute the text mining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

Example four

Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. As shown in fig. 4, the electronic device includes a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the electronic device may be one or more, and one processor 410 is taken as an example in fig. 4. The processor 410, the memory 420, the input device 430 and the output device 440 in the electronic device may be connected by a bus or other means, and the bus connection is taken as an example in fig. 4.

Memory 420 serves as a computer-readable storage medium that may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text-mining methods of embodiments of the present invention. The processor 410 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the memory 420, so as to implement the text mining method provided by the embodiment of the present invention.

The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 420 may further include memory located remotely from processor 410, which may be connected to an electronic device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus, and may include a keyboard, a mouse, and the like. The output device 440 may include a display device such as a display screen.

Example five

The present embodiments provide a storage medium containing computer-executable instructions that, when executed by a computer processor, are used to implement a text mining method provided by embodiments of the present invention.

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the text mining method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the above text mining apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of text mining, the method comprising:

2. The method of claim 1, wherein determining the target resource locator character from the candidate resource locator characters comprises:

3. The method of claim 1, wherein obtaining information of a target text in a target data source according to the target resource locator character comprises:

4. The method of claim 3, wherein extracting information from the initial text to obtain information about a target text comprises:

5. The method of claim 1, wherein checking the information of the target text and determining the text to be mined from the target text according to the checking result comprises:

6. A text mining apparatus, the apparatus comprising:

7. The apparatus of claim 6, wherein the character determination module is specifically configured to:

8. The apparatus of claim 6, wherein the information obtaining module is specifically configured to:

9. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text mining method of any of claims 1-5.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a text mining method as claimed in any one of claims 1 to 5.