CN110110295B - Large sample research and report information extraction method, device, equipment and storage medium - Google Patents

Large sample research and report information extraction method, device, equipment and storage medium

Info

Publication number
CN110110295B
CN110110295B CN201910271619.9A CN201910271619A
Authority
CN
China
Prior art keywords
text
report information
word
table data
research
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910271619.9A
Other languages
Chinese (zh)
Other versions
CN110110295A (en)
Inventor
李海疆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910271619.9A priority Critical patent/CN110110295B/en
Publication of CN110110295A publication Critical patent/CN110110295A/en
Priority to PCT/CN2019/103230 priority patent/WO2020199482A1/en
Application granted granted Critical
Publication of CN110110295B publication Critical patent/CN110110295B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a large sample research report information extraction method, a device, computer equipment and a readable storage medium, wherein the method comprises the following steps: performing text conversion on the research report information to obtain table data, the table data being stored in plain text form; performing statistics on the table data and outputting the word frequency of each word in the table data; performing statistics on the research report information to obtain a similarity index between each piece of research report information and each of the remaining pieces, and ranking the results; and drawing a relation network of the research report information, taking the number of each piece of research report information as a node and the obtained similarity indexes as branches. The beneficial effects of the invention are as follows: the method obtains a text-meaning similarity index for the research reports based on Zipf's law and the 80/20 rule and draws a relation network of the research report information; the most important research reports, whose meanings are relatively close, can be screened out according to the relation network, so that the more valuable research report information is screened out more efficiently.

Description

Large sample research and report information extraction method, device, equipment and storage medium
Technical Field
The embodiments of the invention relate to the technical field of financial data processing, and in particular to a method and a device for extracting large sample research report information, computer equipment and a readable storage medium.
Background
Research report information, referred to simply as a research report, is an analysis of the operating conditions and profitability of listed companies made from an independent and objective standpoint in the financial industry; it is of great importance to fund managers.
For most buy-side fund managers, who face a huge number of research reports every day, reading even most of them in the limited time available is clearly impossible. Even for a fund manager familiar with an industry, personal experience and industry knowledge cannot fully reflect all of the important information or central problems in such a large volume of research reports; more precisely, personal experience and industry knowledge lag behind, and unfamiliar fields remain. How to help fund managers screen research reports in as little time as possible and obtain sufficiently useful information is therefore an important and practical problem.
Disclosure of Invention
In order to overcome the problems in the related art, the invention provides a method, a device, computer equipment and a readable storage medium for extracting large sample research report information, so as to screen massive research report information through a visualized relation network combined with keywords and thereby screen out the more valuable research report information more efficiently.
In a first aspect, an embodiment of the present invention provides a method for extracting large sample report information, where the method includes:
performing text conversion on the research report information to obtain table data, wherein the table data is stored in plain text form;
performing statistics on the table data, and outputting the word frequency of each word in the table data;
performing statistics on the research report information to obtain a similarity index between each piece of research report information and each of the remaining pieces, and ranking the results;
and drawing a relation network of the research report information, taking the number of each piece of research report information as a node and the obtained similarity indexes as branches.
In combination with the above aspect, in another possible embodiment of the present invention, performing statistics on the table data includes:
performing word segmentation on the table data input in text form to obtain word segmentation results;
and outputting the word frequency of each word in the table data, including:
denoting the word segmentation result list of the text as {X_1, X_2, ..., X_N} and the corresponding word frequency list as {Y_1, Y_2, ..., Y_N}, where Y_i is the number of occurrences of word X_i in the text; denoting Y_all = Y_1 + Y_2 + ... + Y_N, the word frequency percentage list corresponding to the segmented words is {Z_1, Z_2, ..., Z_N}, where Z_i = Y_i / Y_all (unit: 0.1%); Z_i is the proportion of occurrences of word X_i in the text.
In combination with the above aspect, in another possible implementation of the present invention, the texts include at least text 1 and text 2, and the step of obtaining and ranking the similarity index between each piece of research report information and the remaining pieces includes the following statistical steps:
take the word segmentation result list of text 1 as A_1 = {X_1, X_2, ..., X_N1} and that of text 2 as A_2 = {X_1, X_2, ..., X_N2}; sort A_1 and A_2 in descending order of their corresponding word frequency percentages, and denote the sorted results as A'_1 and A'_2, together with their corresponding word frequency percentage lists.
A screening mechanism is introduced:
denote by 0.8A'_1 = {X'_1, X'_2, ..., X'_i1} the first i_1 words of A'_1, where i_1 < N_1 and i_1 is the smallest index for which the cumulative word frequency percentage Z'_1 + Z'_2 + ... + Z'_i1 reaches 80%; 0.8A'_2 is defined analogously for A'_2.
Calculate the degree of meaning similarity of text 1 and text 2:
denote M = (0.8A'_1) ∩ (0.8A'_2) and let m be the number of elements of the set M; the word frequency percentage lists corresponding to the m words of M in text 1 and text 2 are regarded as two vectors z1 = (z1_1, ..., z1_m) and z2 = (z2_1, ..., z2_m). Denote ω = cos<z1, z2> = (z1 · z2) / (|z1| |z2|). Since the components of z1 and z2 are all positive, the angle between the two vectors lies in [0, π/2), so the value range of ω is (0, 1], and the larger ω is, the closer the two texts are;
denote U = (0.8A'_1) ∪ (0.8A'_2) and let u be the number of elements of the set U; define a = m/u and ρ = aω. The index ρ is the representation value of the degree of meaning similarity of the two texts.
In combination with the above aspect, in another possible implementation of the present invention, the step of obtaining and ranking the similarity index between each piece of research report information and the remaining pieces includes:
a ranking process comprising the following step:
counting, for each piece of research report information, the sum of its meaning similarity indexes with the remaining pieces of research report information, and ranking the pieces by these sums.
In combination with the above aspect, in another possible implementation of the present invention, drawing the relation network of the research report information, with the number of each piece of research report information as a node and the obtained similarity index as a branch, includes:
obtaining the number of each piece of research report information as a node of the relation network, where the branch between two nodes is the meaning similarity index of the corresponding texts, and the length of each branch is inversely proportional to the magnitude of the meaning similarity index.
In a second aspect, the present invention also provides a large sample research report information extraction device, which includes:
the conversion module, used for performing text conversion on the research report information to obtain table data, the table data being stored in plain text form;
the word segmentation module is used for counting the table data and outputting word frequency of each word in the table data;
the statistics module, used for performing statistics on the research report information to obtain a similarity index between each piece of research report information and each of the remaining pieces, and ranking the results;
and the drawing module, used for drawing the relation network of the research report information, taking the number of each piece of research report information as a node and the obtained similarity indexes as branches.
In a third aspect, the present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method when executing the computer program.
In a fourth aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
According to the invention, a text-meaning similarity index for the research reports is obtained based on Zipf's law and the 80/20 rule, and a relation network of the research report information is drawn. The most important research reports, whose meanings are relatively close, can be screened out according to the relation network, so that the more valuable research report information is screened out more efficiently; in addition, the focus of the capital market's attention within a trading time interval can be obtained from the keywords and the node density of the relation network.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a basic flow diagram illustrating a large sample research information extraction method according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating the length of branches in a relational network, according to an example embodiment.
FIG. 3 is a schematic diagram of a relationship network shown according to an exemplary embodiment.
Fig. 4 is a schematic block diagram of a large sample research information extraction device, according to an example embodiment.
Fig. 5 is a block diagram of a computer device illustrating an implementation method according to an example embodiment.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although steps are described in a flowchart as a sequential process, many of the steps can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the steps may be rearranged, the process may be terminated when its operations are completed, but there may be other steps not included in the drawings. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The invention relates to a large sample research report information extraction method, a device, computer equipment and a readable storage medium, mainly applied to scenarios in the financial industry where research report information requires specific technical processing. The basic idea is as follows. Zipf's law states that the frequency of a word is inversely proportional to its rank in the frequency table, and by the 80/20 rule the words ranked in the top 20% of a research report's word frequencies reflect roughly 80% of the important information in the report. On this basis, an index of the meaning similarity of the research reports is obtained by statistics, and a relation network of the research report information is drawn. The most important research reports, whose text meanings are relatively close, can be screened out according to the relation network, so the more valuable research report information can be screened out more conveniently; at the same time, the focus of the capital market's attention can be obtained from the keywords in the relation network.
The present embodiment is applicable to extracting large sample research report information in an intelligent terminal with a central processing module. The method may be executed by the central processing module, which may be implemented in software and/or hardware and is generally integrated in the intelligent terminal. Fig. 1 is a basic flow diagram of the large sample research report information extraction method of the invention; the method specifically includes the following steps:
in step 110, text conversion is performed on the research report information, and the converted table data is stored in plain text form;
research report information is generally in PDF format, and text in PDF format cannot be processed directly, so the research report information must first be converted. With existing software such as Smallpdf, research report information in PDF format can be converted into Word format, and the Word file can then be saved in txt format, keeping only the text.
Word segmentation is then performed on the txt-format text; a word segmentation package can be used for this step. The output is a file in CSV format, which stores table data such as text and numbers in plain text form and consists of character sequences rather than binary data that would need to be interpreted.
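As a rough illustration of this conversion and segmentation step (not part of the patent), the sketch below uses the pdfplumber and jieba packages as stand-ins for whichever PDF converter and word segmentation package an implementation actually chooses, and writes the word counts out as plain-text CSV table data; the function name and library choices are assumptions.

```python
# Sketch only (not from the patent): pdfplumber and jieba stand in for the
# PDF converter and word segmentation package mentioned above.
import csv
from collections import Counter

import jieba        # Chinese word segmentation package (assumed available)
import pdfplumber   # PDF text extraction (assumed available)


def report_to_csv(pdf_path: str, csv_path: str) -> None:
    """Convert a PDF research report into a plain-text CSV table of word counts."""
    # Extract plain text from the PDF (replaces the PDF -> Word -> txt chain).
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    # Word segmentation; drop whitespace-only tokens.
    words = [w for w in jieba.lcut(text) if w.strip()]

    # Store the result as table data in plain text form (CSV): word, count.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        for word, count in Counter(words).most_common():
            writer.writerow([word, count])
```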
In step 120, statistics is performed on the table data, and word frequencies of words in the table data are output;
the CSV-format text is counted to obtain a statistical result, which is a list containing all of the words in the research report information and their numbers of occurrences; the word frequency result is then converted into percentage form. The statistics can be obtained through the following formula I:
formula I:
assume the word segmentation result list of the text is {X_1, X_2, ..., X_N} and the corresponding word frequency list is {Y_1, Y_2, ..., Y_N}, where Y_i is the number of occurrences of word X_i in the text. Denote Y_all = Y_1 + Y_2 + ... + Y_N. The word frequency percentage list corresponding to the segmented words is {Z_1, Z_2, ..., Z_N}, where Z_i = Y_i / Y_all (unit: 0.1%); Z_i is the proportion of occurrences of word X_i in the text, i.e. its word frequency.
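A minimal sketch of formula I in Python (the function name is illustrative, not from the patent): it computes Z_i = Y_i / Y_all for every word, expressed in units of 0.1%.

```python
from collections import Counter


def word_frequency_percentages(words: list[str]) -> dict[str, float]:
    """Return Z_i = Y_i / Y_all for every word X_i, in units of 0.1% (per mille)."""
    counts = Counter(words)            # Y_i: occurrences of each word X_i
    y_all = sum(counts.values())       # Y_all: total number of word occurrences
    return {word: 1000.0 * y / y_all for word, y in counts.items()}


# Example: "b" occurs twice among four tokens, so Z = 500.0 (i.e. 50.0%).
print(word_frequency_percentages(["a", "b", "b", "c"]))
```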
In step 130, statistics is performed on the report information to obtain a similarity index between each report information and the rest of report information, and the report information is ranked;
in this step, the meaning similarity index refers to the degree of similarity in meaning between different pieces of research report information; it can be obtained through the following formula II:
formula II:
in a possible implementation manner of the exemplary embodiment of the present invention, at least text 1 and text 2 are included, as shown in fig. 2, text 3 may also be included, and the word segmentation result list of text 1 is taken as a listThe word segmentation result list of text 2 is +.>Respectively to A 1 And A 2 Arranging the word frequency percentages from large to small according to the respective corresponding word frequency percentages, and dividing the arranged result into A' 1 And A' 2 ,/>The corresponding word frequency percentage list is +.>And->
A screening mechanism is introduced:
recording deviceWherein i is 1 <N 1 And satisfy->
Calculating the meaning similarity degree of the text 1 and the text 2:
note m= (0.8A' 1 )∩(0.8A′ 2 ) The element number of the set M is M, and the word frequency percentage list corresponding to the M words in the text 1 and the text 2 is respectivelyAnd->Will->And->Regarded as two vectors, note->Due to->And->The respective components satisfy the regularities, so +.>The value range of (2) is +.>The value range of omega is +.>And the larger ω, the closer the two texts are.
Note u= (0.8A' 1 )∪(0.8A′ 2 ) The number of elements of the set U is recorded as U, and definition is givenρ=a ω The index ρ is a representation value of the meaning similarity of the two texts.
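The sketch below follows formula II as reconstructed above: it builds the top-80% word sets 0.8A'_1 and 0.8A'_2, computes the cosine ω of the two word frequency percentage vectors restricted to their intersection M, and combines it with the overlap ratio a = m/u. The 80% cut-off handling and the final combination ρ = aω are taken from that reconstruction of the partly garbled original, so treat them as assumptions rather than a definitive implementation.

```python
import math


def top_80_percent_words(z: dict[str, float]) -> set[str]:
    """0.8A': the shortest frequency-sorted prefix whose cumulative
    word frequency percentage reaches 80% (i.e. 800 in 0.1% units)."""
    cumulative, selected = 0.0, set()
    for word, pct in sorted(z.items(), key=lambda kv: kv[1], reverse=True):
        selected.add(word)
        cumulative += pct
        if cumulative >= 800.0:        # 80% expressed in 0.1% units
            break
    return selected


def similarity_index(z1: dict[str, float], z2: dict[str, float]) -> float:
    """rho = a * omega for two word-frequency-percentage dicts (assumed combination)."""
    a1, a2 = top_80_percent_words(z1), top_80_percent_words(z2)
    m_set, u_set = a1 & a2, a1 | a2    # M and U; m = |M|, u = |U|
    if not m_set:
        return 0.0
    v1 = [z1[w] for w in m_set]
    v2 = [z2[w] for w in m_set]
    # omega: cosine of the angle between the two restricted percentage vectors.
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    omega = dot / (norm1 * norm2)
    a = len(m_set) / len(u_set)        # overlap ratio a = m / u
    return a * omega
```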
In step 140, the number of the report information is used as a node, and the obtained similarity index is used as a branch to draw a relation network of the report information.
To draw the relation network of the research reports, a number must first be assigned to each research report; these numbers correspond one to one with the research reports and are independent and unique. The number of each research report is used as a node of the relation network, the branch between two nodes is the meaning similarity index of the corresponding texts, and the length of a branch is represented by the inverse of the index: the closer the meanings of the texts, the shorter the branch, and the greater the similarity between the two research reports.
Referring to fig. 2, which includes text 1, text 2 and text 3, the branch between text 1 and text 2 is denoted branch 1 and the branch between text 1 and text 3 is denoted branch 2. Branch 2 is longer than branch 1, so in fig. 2 the meaning of text 1 is closer to that of text 2 than to that of text 3.
Fig. 3 shows the relation network after the visualization is completed. Nodes with higher branch density can be identified directly from the visual view of the relation network, and the research reports corresponding to those nodes can be studied with particular attention, which greatly improves research efficiency.
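One possible way to visualize such a relation network is sketched below, assuming the networkx and matplotlib packages and reusing the similarity_index sketch above; the layout choice is illustrative, the only property taken from the text being that branch length is the inverse of the similarity index ρ.

```python
import itertools

import matplotlib.pyplot as plt
import networkx as nx


def draw_relation_network(percentages: dict[int, dict[str, float]]) -> None:
    """Nodes are report numbers; each branch carries rho and is drawn with length ~ 1/rho."""
    g = nx.Graph()
    g.add_nodes_from(percentages)
    for i, j in itertools.combinations(percentages, 2):
        rho = similarity_index(percentages[i], percentages[j])  # sketch above
        if rho > 0:
            # Branch length is the inverse of the index: closer meaning, shorter branch.
            g.add_edge(i, j, rho=rho, distance=1.0 / rho)

    # The Kamada-Kawai layout treats the per-edge "distance" attribute as graph distance.
    pos = nx.kamada_kawai_layout(g, weight="distance")
    nx.draw(g, pos, with_labels=True, node_color="lightsteelblue")
    nx.draw_networkx_edge_labels(
        g, pos,
        edge_labels={(i, j): f"{d['rho']:.2f}" for i, j, d in g.edges(data=True)},
    )
    plt.show()
```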
According to this method, on the basis of Zipf's law and the 80/20 rule, the steps of text conversion, word segmentation, word frequency statistics, meaning similarity calculation and relation network drawing are carried out in turn, finally yielding a research report relation network that shows both word frequency importance and the meaning similarity indexes. The most important research reports with close meanings can be screened out according to this relation network, which greatly improves reading efficiency.
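Putting the pieces together, a hypothetical end-to-end driver (all helper names come from the sketches above, not from the patent) could rank the reports by the sum of their similarity indexes against all other reports before drawing the network:

```python
def rank_reports(percentages: dict[int, dict[str, float]]) -> list[tuple[int, float]]:
    """Rank report numbers by the sum of their similarity indexes with all other reports."""
    totals = {
        i: sum(similarity_index(percentages[i], percentages[j])
               for j in percentages if j != i)
        for i in percentages
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical usage, assuming the reports have already been converted and segmented
# (words_of_report is a placeholder for whatever loader an implementation provides):
# percentages = {n: word_frequency_percentages(words_of_report(n)) for n in (1, 2, 3)}
# print(rank_reports(percentages))        # highest index sums first
# draw_relation_network(percentages)      # relation network as in fig. 3
```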
Fig. 4 is a schematic structural diagram of a large sample research report information extraction device according to an embodiment of the present invention. The device may be implemented in software and/or hardware, is generally integrated in an intelligent terminal, and operates according to the large sample research report information extraction method. As shown in the figure, on the basis of the above embodiments, this embodiment provides a large sample research report information extraction device, which mainly includes a conversion module 410, a word segmentation module 420, a statistics module 430 and a drawing module 440.
The conversion module 410 is configured to perform text conversion on the report information to obtain table data, where the table data is stored in a plain text form;
the word segmentation module 420 is configured to count the table data and output word frequencies of words in the table data;
the statistics module 430 is configured to perform statistics on the report information, obtain a similarity index between each report information and the rest of report information, and order the report information;
the drawing module 440 is configured to draw the relation network of the research report information, taking the number of each piece of research report information as a node and the obtained similarity indexes as branches.
In an implementation manner of the exemplary embodiment of the present invention, the word segmentation module is further configured to:
performing word segmentation on the input text form table data to obtain word segmentation results;
outputting word frequency of each word in the table data, including:
denoting the word segmentation result list of the text as {X_1, X_2, ..., X_N} and the corresponding word frequency list as {Y_1, Y_2, ..., Y_N}, where Y_i is the number of occurrences of word X_i in the text; denoting Y_all = Y_1 + Y_2 + ... + Y_N, the word frequency percentage list corresponding to the segmented words is {Z_1, Z_2, ..., Z_N}, where Z_i = Y_i / Y_all (unit: 0.1%); Z_i is the proportion of occurrences of word X_i in the text.
In one implementation of this exemplary embodiment, the texts include at least text 1 and text 2, and the statistics module includes a first statistics sub-module configured to execute the following formula:
take the word segmentation result list of text 1 as A_1 = {X_1, X_2, ..., X_N1} and that of text 2 as A_2 = {X_1, X_2, ..., X_N2}; sort A_1 and A_2 in descending order of their corresponding word frequency percentages, and denote the sorted results as A'_1 and A'_2, together with their corresponding word frequency percentage lists.
A screening mechanism is introduced:
denote by 0.8A'_1 = {X'_1, X'_2, ..., X'_i1} the first i_1 words of A'_1, where i_1 < N_1 and i_1 is the smallest index for which the cumulative word frequency percentage Z'_1 + Z'_2 + ... + Z'_i1 reaches 80%; 0.8A'_2 is defined analogously for A'_2.
Calculate the degree of meaning similarity of text 1 and text 2:
denote M = (0.8A'_1) ∩ (0.8A'_2) and let m be the number of elements of the set M; the word frequency percentage lists corresponding to the m words of M in text 1 and text 2 are regarded as two vectors z1 = (z1_1, ..., z1_m) and z2 = (z2_1, ..., z2_m). Denote ω = cos<z1, z2> = (z1 · z2) / (|z1| |z2|). Since the components of z1 and z2 are all positive, the angle between the two vectors lies in [0, π/2), so the value range of ω is (0, 1], and the larger ω is, the closer the two texts are.
Denote U = (0.8A'_1) ∪ (0.8A'_2) and let u be the number of elements of the set U; define a = m/u and ρ = aω. The index ρ is the representation value of the degree of meaning similarity of the two texts.
The large sample research report information extraction device provided in the above embodiment can execute the large sample research report information extraction method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiment, reference may be made to the large sample research report information extraction method provided in any embodiment of the present invention.
It will be appreciated that the invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice. The program may be in the form of source code, object code, a code intermediate source and object code such as in partially compiled form, or in any other form suitable for use in the implementation of the method according to the invention. It will also be noted that such a program may have many different architecture designs. For example, program code implementing the functionality of a method or system according to the invention may be subdivided into one or more subroutines.
Many different ways to distribute functionality among these subroutines will be apparent to the skilled person. The subroutines may be stored together in one executable file to form a self-contained program. Such executable files may include computer executable instructions, such as processor instructions and/or interpreter instructions (e.g., java interpreter instructions). Alternatively, one or more or all of the subroutines may be stored in at least one external library file and linked with the main program either statically or dynamically (e.g., at run-time). The main program contains at least one call to at least one of the subroutines. Subroutines may also include function calls to each other. Embodiments related to computer program products include computer-executable instructions for each of the processing steps of at least one of the illustrated methods. The instructions may be subdivided into subroutines and/or stored in one or more files that may be statically or dynamically linked.
Another embodiment related to a computer program product includes computer-executable instructions corresponding to each of the devices of at least one of the systems and/or products set forth. The instructions may be subdivided into subroutines and/or stored in one or more files that may be statically or dynamically linked.
The carrier of the computer program may be any entity or device capable of carrying the program. For example, the carrier may comprise a storage medium, such as a ROM (e.g. a CD-ROM or a semiconductor ROM) or a magnetic recording medium (e.g. a floppy disk or hard disk). Further, the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or device. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or for use in the performance of, the relevant method.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "to comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, one or more of the functions described above may be optional or may be combined, if desired.
The steps discussed above are not limited to the order of execution in the embodiments, and different steps may be performed in different orders and/or concurrently with each other, if desired. Moreover, in other embodiments, one or more of the steps described above may be optional or may be combined.
Although various aspects of the invention are presented in the independent claims, other aspects of the invention comprise combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set forth in the claims.
It is noted herein that while the above describes example embodiments of the invention, these descriptions should not be viewed in a limiting sense. Rather, several variations and modifications may be made without departing from the scope of the invention as defined in the appended claims.
It should be understood by those skilled in the art that each module in the apparatus of the present embodiment may be implemented by a general-purpose computing device, and each module may be centralized in a single computing device or a network group formed by computing devices, where the apparatus of the present embodiment corresponds to the method in the foregoing embodiment, and may be implemented by executable program code, or may be implemented by a combination of integrated circuits, and thus, the present invention is not limited to specific hardware or software and combinations thereof.
It should be understood by those skilled in the art that each module in the apparatus of the embodiment of the present invention may be implemented by a general-purpose mobile terminal, and each module may be centralized in a single mobile terminal or a combination of devices formed by mobile terminals, where the apparatus of the embodiment of the present invention corresponds to the method in the foregoing embodiment, and may be implemented by editing executable program code, or may be implemented by a combination of integrated circuits, and thus the present invention is not limited to specific hardware or software and combinations thereof.
The present embodiment also provides a computer device, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack-mounted server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster formed by a plurality of servers) that can execute a program. The computer device 20 of the present embodiment includes at least, but is not limited to, a memory 21 and a processor 22, which may be communicatively coupled to each other via a system bus, as shown in fig. 5. It should be noted that fig. 5 only shows a computer device 20 having components 21-22, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may be implemented instead.
In the present embodiment, the memory 21 (i.e., readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or memory of the computer device 20. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a flash Card (FlashCard) or the like, which are provided on the computer device 20. Of course, the memory 21 may also include both internal storage units of the computer device 20 and external storage devices. In this embodiment, the memory 21 is generally used to store an operating system and various application software installed on the computer device 20, such as program codes of RNNs neural networks of embodiment one. Further, the memory 21 may be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 22 is generally used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is configured to execute the program code or process data stored in the memory 21, for example, to implement each layer structure of the deep learning model, so as to implement the large sample research information extraction method of the foregoing embodiment.
The present embodiment also provides a computer-readable storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor, performs the corresponding functions. The computer readable storage medium of the present embodiment is used for storing a financial applet, and when executed by a processor, implements the large sample research information extraction method of the above embodiment.
Note that the above are only exemplary embodiments of the present invention and the technical principles applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, it is not limited to them and may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (6)

1. A method for extracting large sample research report information, the method comprising:
performing text conversion on the research report information to obtain table data, wherein the table data is stored in plain text form;
performing statistics on the table data, and outputting the word frequency of each word in the table data;
performing statistics on the research report information to obtain a similarity index between each piece of research report information and each of the remaining pieces, and ranking the results;
drawing a relation network of the research report information, taking the number of each piece of research report information as a node and the obtained similarity indexes as branches;
wherein performing statistics on the table data comprises:
performing word segmentation on the table data input in text form to obtain word segmentation results;
and outputting the word frequency of each word in the table data, including:
denoting the word segmentation result list of the text as {X_1, X_2, ..., X_N} and the corresponding word frequency list as {Y_1, Y_2, ..., Y_N}, where Y_i is the number of occurrences of word X_i in the text; denoting Y_all = Y_1 + Y_2 + ... + Y_N, the word frequency percentage list corresponding to the segmented words is {Z_1, Z_2, ..., Z_N}, where Z_i = Y_i / Y_all (unit: 0.1%); Z_i is the proportion of occurrences of word X_i in the text;
the texts include at least text 1 and text 2, and the step of obtaining and ranking the similarity index between each piece of research report information and the remaining pieces includes the following statistical steps:
take the word segmentation result list of text 1 as A_1 = {X_1, X_2, ..., X_N1} and that of text 2 as A_2 = {X_1, X_2, ..., X_N2}; sort A_1 and A_2 in descending order of their corresponding word frequency percentages, and denote the sorted results as A'_1 and A'_2, together with their corresponding word frequency percentage lists;
a screening mechanism is introduced:
denote by 0.8A'_1 = {X'_1, X'_2, ..., X'_i1} the first i_1 words of A'_1, where i_1 < N_1 and i_1 is the smallest index for which the cumulative word frequency percentage Z'_1 + Z'_2 + ... + Z'_i1 reaches 80%; 0.8A'_2 is defined analogously for A'_2;
calculate the degree of meaning similarity of text 1 and text 2:
denote M = (0.8A'_1) ∩ (0.8A'_2) and let m be the number of elements of the set M; the word frequency percentage lists corresponding to the m words of M in text 1 and text 2 are regarded as two vectors z1 = (z1_1, ..., z1_m) and z2 = (z2_1, ..., z2_m); denote ω = cos<z1, z2> = (z1 · z2) / (|z1| |z2|); since the components of z1 and z2 are all positive, the angle between the two vectors lies in [0, π/2), so the value range of ω is (0, 1], and the larger ω is, the closer the two texts are;
denote U = (0.8A'_1) ∪ (0.8A'_2) and let u be the number of elements of the set U; define a = m/u and ρ = aω; the index ρ is the representation value of the degree of meaning similarity of the two texts.
2. The method of claim 1, wherein the step of obtaining and ranking the similarity index between each piece of research report information and the remaining pieces comprises:
a ranking process comprising the following step:
counting, for each piece of research report information, the sum of its meaning similarity indexes with the remaining pieces of research report information, and ranking the pieces by these sums.
3. The method according to claim 1, wherein drawing the relation network of the research report information, with the number of each piece of research report information as a node and the obtained similarity index as a branch, comprises:
obtaining the number of each piece of research report information as a node of the relation network, wherein the branch between two nodes is the meaning similarity index of the corresponding texts, and the length of each branch is inversely proportional to the magnitude of the meaning similarity index.
4. A large sample research report information extraction device, the device comprising:
the conversion module, used for performing text conversion on the research report information to obtain table data, the table data being stored in plain text form;
the word segmentation module is used for counting the table data and outputting word frequency of each word in the table data;
the statistics module is used for carrying out statistics on the research report information to obtain and sort the similarity index between each research report information and the rest of the research report information;
the drawing module, used for drawing the relation network of the research report information, taking the number of each piece of research report information as a node and the obtained similarity indexes as branches;
the word segmentation module is further configured to:
performing word segmentation on the input text form table data to obtain word segmentation results;
outputting word frequency of each word in the table data, including:
denoting the word segmentation result list of the text as {X_1, X_2, ..., X_N} and the corresponding word frequency list as {Y_1, Y_2, ..., Y_N}, where Y_i is the number of occurrences of word X_i in the text; denoting Y_all = Y_1 + Y_2 + ... + Y_N, the word frequency percentage list corresponding to the segmented words is {Z_1, Z_2, ..., Z_N}, where Z_i = Y_i / Y_all (unit: 0.1%); Z_i is the proportion of occurrences of word X_i in the text;
the text at least comprises text 1 and text 2, and the statistics module comprises a first statistics sub-module for executing the following formula:
take the word segmentation result list of text 1 as A_1 = {X_1, X_2, ..., X_N1} and that of text 2 as A_2 = {X_1, X_2, ..., X_N2}; sort A_1 and A_2 in descending order of their corresponding word frequency percentages, and denote the sorted results as A'_1 and A'_2, together with their corresponding word frequency percentage lists;
a screening mechanism is introduced:
denote by 0.8A'_1 = {X'_1, X'_2, ..., X'_i1} the first i_1 words of A'_1, where i_1 < N_1 and i_1 is the smallest index for which the cumulative word frequency percentage Z'_1 + Z'_2 + ... + Z'_i1 reaches 80%; 0.8A'_2 is defined analogously for A'_2;
calculate the degree of meaning similarity of text 1 and text 2:
denote M = (0.8A'_1) ∩ (0.8A'_2) and let m be the number of elements of the set M; the word frequency percentage lists corresponding to the m words of M in text 1 and text 2 are regarded as two vectors z1 = (z1_1, ..., z1_m) and z2 = (z2_1, ..., z2_m); denote ω = cos<z1, z2> = (z1 · z2) / (|z1| |z2|); since the components of z1 and z2 are all positive, the angle between the two vectors lies in [0, π/2), so the value range of ω is (0, 1], and the larger ω is, the closer the two texts are;
denote U = (0.8A'_1) ∪ (0.8A'_2) and let u be the number of elements of the set U; define a = m/u and ρ = aω; the index ρ is the representation value of the degree of meaning similarity of the two texts.
5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 3 when the computer program is executed by the processor.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 3.
CN201910271619.9A 2019-04-04 2019-04-04 Large sample research and report information extraction method, device, equipment and storage medium Active CN110110295B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910271619.9A CN110110295B (en) 2019-04-04 2019-04-04 Large sample research and report information extraction method, device, equipment and storage medium
PCT/CN2019/103230 WO2020199482A1 (en) 2019-04-04 2019-08-29 Large sample research report information extraction method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910271619.9A CN110110295B (en) 2019-04-04 2019-04-04 Large sample research and report information extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110110295A CN110110295A (en) 2019-08-09
CN110110295B true CN110110295B (en) 2023-10-20

Family

ID=67485207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910271619.9A Active CN110110295B (en) 2019-04-04 2019-04-04 Large sample research and report information extraction method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110110295B (en)
WO (1) WO2020199482A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110295B (en) * 2019-04-04 2023-10-20 平安科技(深圳)有限公司 Large sample research and report information extraction method, device, equipment and storage medium
CN111694928A (en) * 2020-05-28 2020-09-22 平安资产管理有限责任公司 Data index recommendation method and device, computer equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334494A (en) * 2018-01-23 2018-07-27 阿里巴巴集团控股有限公司 A kind of construction method and device of customer relationship network
CN108647822A (en) * 2018-05-10 2018-10-12 平安科技(深圳)有限公司 Electronic device, based on the prediction technique and computer storage media for grinding count off evidence
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of text similarity
CN108959453A (en) * 2018-06-14 2018-12-07 中南民族大学 Information extracting method, device and readable storage medium storing program for executing based on text cluster
CN109284504A (en) * 2018-10-22 2019-01-29 平安科技(深圳)有限公司 It grinds to call the score using the security of deep learning model and analyses method and device
CN109388804A (en) * 2018-10-22 2019-02-26 平安科技(深圳)有限公司 Report core views extracting method and device are ground using the security of deep learning model
CN109460550A (en) * 2018-10-22 2019-03-12 平安科技(深圳)有限公司 Report sentiment analysis method, apparatus and computer equipment are ground using the security of big data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2824586A1 (en) * 2013-07-09 2015-01-14 Universiteit Twente Method and computer server system for receiving and presenting information to a user in a computer network
US20170300564A1 (en) * 2016-04-19 2017-10-19 Sprinklr, Inc. Clustering for social media data
CN106446148B (en) * 2016-09-21 2019-08-09 中国运载火箭技术研究院 A kind of text duplicate checking method based on cluster
CN109325035A (en) * 2018-11-29 2019-02-12 阿里巴巴集团控股有限公司 The recognition methods of similar table and device
CN110110295B (en) * 2019-04-04 2023-10-20 平安科技(深圳)有限公司 Large sample research and report information extraction method, device, equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334494A (en) * 2018-01-23 2018-07-27 阿里巴巴集团控股有限公司 A kind of construction method and device of customer relationship network
CN108647822A (en) * 2018-05-10 2018-10-12 平安科技(深圳)有限公司 Electronic device, based on the prediction technique and computer storage media for grinding count off evidence
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of text similarity
CN108959453A (en) * 2018-06-14 2018-12-07 中南民族大学 Information extracting method, device and readable storage medium storing program for executing based on text cluster
CN109284504A (en) * 2018-10-22 2019-01-29 平安科技(深圳)有限公司 It grinds to call the score using the security of deep learning model and analyses method and device
CN109388804A (en) * 2018-10-22 2019-02-26 平安科技(深圳)有限公司 Report core views extracting method and device are ground using the security of deep learning model
CN109460550A (en) * 2018-10-22 2019-03-12 平安科技(深圳)有限公司 Report sentiment analysis method, apparatus and computer equipment are ground using the security of big data

Also Published As

Publication number Publication date
CN110110295A (en) 2019-08-09
WO2020199482A1 (en) 2020-10-08


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant