CN110110295A - Method, device, equipment and storage medium for extracting information from large samples of research reports - Google Patents

Method, device, equipment and storage medium for extracting information from large samples of research reports

Info

Publication number
CN110110295A
Authority
CN
China
Prior art keywords
text
breath
notifying
word
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910271619.9A
Other languages
Chinese (zh)
Other versions
CN110110295B (en)
Inventor
李海疆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910271619.9A priority Critical patent/CN110110295B/en
Publication of CN110110295A publication Critical patent/CN110110295A/en
Priority to PCT/CN2019/103230 priority patent/WO2020199482A1/en
Application granted granted Critical
Publication of CN110110295B publication Critical patent/CN110110295B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention relates to a method, a device, computer equipment and a readable storage medium for extracting information from large samples of research reports. The method comprises: converting the research report information into text to obtain tabular data, the tabular data being stored in plain-text form; running statistics on the tabular data and outputting the word frequency of each word in the tabular data; running statistics on the research report information to obtain, and rank, the semantic similarity index between each research report and the other research reports; and drawing a relationship network of the research report information, with the numeric identifiers of the research reports as nodes and the obtained similarity indices as branches. The beneficial effect of the invention is that, by deriving the semantic similarity index of the research reports from Zipf's law and the 80/20 rule and drawing the relationship network of the research report information, most of the more important and semantically closer research reports can be filtered out according to the network, so that the more valuable research report information can be screened more efficiently.

Description

Method, device, equipment and storage medium for extracting information from large samples of research reports
Technical field
Embodiments of the present invention relate to the field of financial data processing technology, and in particular to a method, a device and a readable storage medium for extracting information from large samples of research reports.
Background art
Research report information, referred to as a research report for short, denotes in the financial industry an analysis of the operating state and profitability of listed companies made from an independent and objective position. It is of considerable importance to fund managers.
Most buy-side fund managers face a massive volume of research reports every day, and reading most of them in the limited time available is clearly impossible. At present, even within a field that a fund manager knows, selectively reading reports on the basis of personal experience and industry understanding cannot fully reflect all the key information or focal issues contained in the mass of reports, not to mention that personal experience and industry understanding inherently lag behind and that some fields always remain unfamiliar to any one person. How to help fund managers screen research reports in as little time as possible and obtain sufficient useful information is therefore a problem of real practical significance.
Summary of the invention
To overcome the problems in the related art, the present invention provides a method, a device, computer equipment and a readable storage medium for extracting information from large samples of research reports, which screen a massive volume of research report information by combining a visual relationship network with keywords, so as to efficiently filter out the more valuable research report information.
In a first aspect, an embodiment of the invention provides a method for extracting information from large samples of research reports, the method comprising:
converting the research report information into text to obtain tabular data, the tabular data being stored in plain-text form;
running statistics on the tabular data and outputting the word frequency of each word in the tabular data;
running statistics on the research report information to obtain, and rank, the semantic similarity index between each research report and the other research reports;
drawing the relationship network of the research report information, with the numeric identifiers of the research reports as nodes and the obtained similarity indices as branches.
In combination with another aspect, in another feasible embodiment of the invention, running statistics on the tabular data comprises:
performing word segmentation on the input tabular data in text form to obtain a word segmentation result;
and outputting the word frequency of each word in the tabular data comprises:
Let the word segmentation result list of the text be $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list be $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text. Denote $Y_{all} = \sum_{i=1}^{N} Y_i$. The word frequency percentage list corresponding to the segmentation is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) is the share of the occurrences of word $X_i$ in the text.
In combination with another aspect, in another feasible embodiment of the invention, at least a text 1 and a text 2 are included, and obtaining and ranking the semantic similarity index between each research report and the other research reports comprises a statistics step:
Let the word segmentation result list of text 1 be $A_1$ = [formula missing in source] and that of text 2 be $A_2$ = [formula missing in source]. Sort $A_1$ and $A_2$ in descending order of the corresponding word frequency percentages; the sorted results are denoted $A'_1$ and $A'_2$ respectively, and the corresponding word frequency percentage lists are [formula missing in source] and [formula missing in source] respectively.
Introduce a filtering mechanism:
Denote [formula missing in source], where $i_1 < N_1$ and [condition missing in source] is satisfied.
Calculate the semantic similarity of text 1 and text 2:
Let $M = (0.8A'_1) \cap (0.8A'_2)$ and let $m$ be the number of elements of the set $M$. The word frequency percentage lists of these $m$ words in text 1 and text 2 are [formula missing in source] and [formula missing in source] respectively. Regarding these two lists as two vectors, denote [formula missing in source]. Since the respective components of the two vectors satisfy regularity, the value range of [formula missing in source] is [formula missing in source], and the value range of $\omega$ is also [formula missing in source]; the larger $\omega$ is, the closer the two texts are.
Let $U = (0.8A'_1) \cup (0.8A'_2)$ and let $u$ be the number of elements of the set $U$. Define [formula missing in source] and $\rho = a\omega$; the index $\rho$ is the value that characterizes the semantic similarity of the two texts.
In combination with another aspect, in another feasible embodiment of the invention, obtaining and ranking the semantic similarity index between each research report and the other research reports comprises a sorting step.
The sorting comprises:
for each research report, summing its semantic similarity indices with the other research reports, and sorting the reports by these sums.
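The sorting step can be sketched as follows. This is a minimal illustration, not the patented implementation: `rank_by_total_similarity` and the shape of `pair_index` (a dict mapping a pair of report identifiers to an already computed index) are hypothetical choices made for the example.

```python
def rank_by_total_similarity(pair_index):
    """Sum each report's similarity indices with all other reports and rank.

    pair_index maps (id_i, id_j) -> similarity index of that pair
    (hypothetical input shape). Returns [(report_id, total)] sorted by
    descending total, ties broken by ascending identifier.
    """
    totals = {}
    for (i, j), rho in pair_index.items():
        totals[i] = totals.get(i, 0.0) + rho
        totals[j] = totals.get(j, 0.0) + rho
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_by_total_similarity({(1, 2): 0.5, (1, 3): 0.25, (2, 3): 0.125})
```

A report near the top of this ranking is, in aggregate, close in meaning to many of the others.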
In combination with another aspect, in another feasible embodiment of the invention, drawing the relationship network of the research report information, with the numeric identifiers of the research reports as nodes and the obtained similarity indices as branches, comprises:
obtaining the numeric identifier of each research report as a node of the relationship network; the branch between two nodes represents the semantic similarity index, and the length of the branch characterizes the magnitude of that index.
In a second aspect, the invention also provides a device for extracting information from large samples of research reports, the device comprising:
a conversion module, configured to convert the research report information into text to obtain tabular data, the tabular data being stored in plain-text form;
a word segmentation module, configured to run statistics on the tabular data and output the word frequency of each word in the tabular data;
a statistics module, configured to run statistics on the research report information to obtain, and rank, the semantic similarity index between each research report and the other research reports;
a drawing module, configured to draw the relationship network of the research report information, with the numeric identifiers of the research reports as nodes and the obtained similarity indices as branches.
In a third aspect, the invention also provides computer equipment comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the above method when executing the computer program.
In a fourth aspect, the invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above method when executed by a processor.
By deriving the semantic similarity index of the research reports from Zipf's law and the 80/20 rule and drawing the relationship network of the research report information, the invention makes it possible to filter out, according to the network, most of the more important and semantically closer research reports, so that the more valuable research report information can be screened more efficiently. The keywords and the node density of the relationship network can also reveal the focal issues of capital market attention during the trading period concerned.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the invention.
Brief description of the drawings
The accompanying drawings are incorporated in and constitute a part of this specification; they illustrate embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a basic flow diagram of a method for extracting information from large samples of research reports according to an exemplary embodiment.
Fig. 2 is a schematic diagram of branch lengths in the relationship network according to an exemplary embodiment.
Fig. 3 is a schematic diagram of the relationship network according to an exemplary embodiment.
Fig. 4 is a schematic block diagram of a device for extracting information from large samples of research reports according to an exemplary embodiment.
Fig. 5 is a block diagram of computer equipment implementing the method according to an exemplary embodiment.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps can be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram and so on.
The invention relates to a method, a device, computer equipment and a readable storage medium for extracting information from large samples of research reports, mainly applied in the financial industry to scenarios in which research report information must be processed for specific needs. The basic idea is as follows. According to Zipf's law, the number of times a word occurs is inversely proportional to its rank in the frequency table. According to the 80/20 rule, the information carried by the words ranked in the top 20% by word frequency accounts for 80% of the importance of the research report text. On this basis, statistics yield the semantic similarity index of the research reports, and the relationship network of the research report information is drawn. According to the network, most of the more important and semantically closer research reports can be filtered out, the more valuable research report information can be screened more easily, and the focal issues of capital market attention can also be obtained from the keywords in the network.
This embodiment is applicable to extracting information from large samples of research reports in an intelligent terminal with a central processing module. The method can be executed by the central processing module, which can be implemented in software and/or hardware and is generally integrated in the intelligent terminal. Fig. 1 shows the basic flow of the method of the invention, which specifically comprises the following steps:
In step 110, the research report information is converted into text, and the tabular data obtained after conversion are stored in plain-text form;
The research report information is generally in PDF format, and the text in a PDF generally cannot be processed directly, so the research report information needs to be converted. With existing software such as smallpdf, a research report in PDF format can be converted into Word format, and the Word document is then saved in txt format so that only the text is retained.
Word segmentation is performed on the txt text; a word segmentation package can be used for this. The output is a file in CSV format, which stores tabular data such as text and numbers in plain-text form: the data consist of character sequences rather than binary data that would need to be decoded.
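As a rough sketch of this conversion output, assuming the plain text has already been extracted from the PDF: the snippet below splits a string into words and serializes them as CSV, one word per row. The regular-expression tokenizer is only a stand-in for the word segmentation package, which the description does not name, and `tokenize` and `tokens_to_csv` are hypothetical helper names.

```python
import csv
import io
import re


def tokenize(text):
    """Stand-in tokenizer (assumption): lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokens_to_csv(tokens):
    """Serialize tokens as plain-text CSV, one token per row under a 'word' header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word"])
    for token in tokens:
        writer.writerow([token])
    return buffer.getvalue()


csv_text = tokens_to_csv(tokenize("Revenue grew; revenue outlook stable."))
```

For Chinese-language reports a dedicated segmentation library would replace `tokenize`; the CSV layout shown here is likewise only one possible choice.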
In step 120, statistics are run on the tabular data, and the word frequency of each word in the tabular data is output;
Statistics are run on the CSV text to obtain a result in the form of a list containing all the words of the research report information and their numbers of occurrences. The word frequency results are then converted into percentages, which can be obtained by Formula 1 below:
Formula 1:
Let the word segmentation result list of the text be $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list be $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text. Denote $Y_{all} = \sum_{i=1}^{N} Y_i$. The word frequency percentage list corresponding to the segmentation is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) is the share of the occurrences of word $X_i$ in the text, i.e. the word frequency of the single word.
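Formula 1 can be illustrated with a short sketch. It assumes the words are already segmented and reads "unit: 0.1%" as meaning that the share is expressed per mille; that reading, and the name `word_frequency_shares`, are assumptions of this example.

```python
from collections import Counter


def word_frequency_shares(tokens):
    """Return (word, count, share) tuples sorted by descending count.

    count is Y_i, and share is Z_i = Y_i / Y_all expressed per mille
    (assumed reading of "unit: 0.1%"). Ties are broken alphabetically.
    """
    counts = Counter(tokens)
    total = sum(counts.values())  # Y_all, the total number of word occurrences
    rows = [(word, count, 1000.0 * count / total) for word, count in counts.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


rows = word_frequency_shares(["profit", "growth", "profit", "risk"])
```

Sorting by descending count already yields the ordering that the similarity step needs.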
In step 130, statistics are run on the research report information to obtain, and rank, the semantic similarity index between each research report and the other research reports;
This step calculates the semantic similarity index, which reflects how close in meaning different research reports are. It can be obtained by Formula 2 below:
Formula 2:
In one feasible implementation of an exemplary embodiment of the invention, at least a text 1 and a text 2 are included; as shown in Fig. 2, a text 3 may also be included. Let the word segmentation result list of text 1 be $A_1$ = [formula missing in source] and that of text 2 be $A_2$ = [formula missing in source]. Sort $A_1$ and $A_2$ in descending order of the corresponding word frequency percentages; the sorted results are denoted $A'_1$ and $A'_2$ respectively, and the corresponding word frequency percentage lists are [formula missing in source] and [formula missing in source] respectively.
Introduce a filtering mechanism:
Denote [formula missing in source], where $i_1 < N_1$ and [condition missing in source] is satisfied.
Calculate the semantic similarity of text 1 and text 2:
Let $M = (0.8A'_1) \cap (0.8A'_2)$ and let $m$ be the number of elements of the set $M$. The word frequency percentage lists of these $m$ words in text 1 and text 2 are [formula missing in source] and [formula missing in source] respectively. Regarding these two lists as two vectors, denote [formula missing in source]. Since the respective components of the two vectors satisfy regularity, the value range of [formula missing in source] is [formula missing in source], and the value range of $\omega$ is also [formula missing in source]; the larger $\omega$ is, the closer the two texts are.
Let $U = (0.8A'_1) \cup (0.8A'_2)$ and let $u$ be the number of elements of the set $U$. Define [formula missing in source] and $\rho = a\omega$; the index $\rho$ is the value that characterizes the semantic similarity of the two texts.
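Several expressions in Formula 2 are not legible in this text, so the following is only one possible reading, offered as a sketch under stated assumptions. It assumes that $0.8A'$ is the shortest prefix of the frequency-sorted word list whose cumulative share reaches 80%, that $\omega$ is the cosine of the angle between the two share vectors over the common words $M$, that $a = m/u$, and that $\rho$ is the product of $a$ and $\omega$. None of these four choices is confirmed by the source, and all function names are hypothetical.

```python
import math
from collections import Counter


def shares(tokens):
    """Word -> share of occurrences (Z_i as a fraction of 1)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def top_words(share, threshold=0.8):
    """ASSUMED filter: shortest prefix of the descending-frequency list
    whose cumulative share reaches the threshold."""
    kept, cumulative = set(), 0.0
    for word, z in sorted(share.items(), key=lambda kv: (-kv[1], kv[0])):
        if cumulative >= threshold:
            break
        kept.add(word)
        cumulative += z
    return kept


def similarity_index(tokens1, tokens2):
    """ASSUMED index: rho = (m / u) * cosine similarity over the common words."""
    s1, s2 = shares(tokens1), shares(tokens2)
    t1, t2 = top_words(s1), top_words(s2)
    common = sorted(t1 & t2)   # set M, with m = len(common)
    union = t1 | t2            # set U, with u = len(union)
    if not common:
        return 0.0
    v1 = [s1[word] for word in common]
    v2 = [s2[word] for word in common]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    omega = dot / (norm1 * norm2)      # assumed cosine form of omega
    a = len(common) / len(union)       # assumed definition a = m / u
    return a * omega


rho = similarity_index(["x", "y"], ["x", "z"])
```

Under these assumptions both factors lie in $[0, 1]$, so $\rho$ does as well: identical texts score 1 and texts with no common high-frequency words score 0. Unlike the condition $i_1 < N_1$ in the text, this filter can keep every word of a very short text.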
In step 140, the relationship network of the research report information is drawn, with the numeric identifiers of the research reports as nodes and the obtained similarity indices as branches.
This step is the drawing step, which draws the relationship network of the research reports. A numeric identifier is first assigned to each research report; the identifier corresponds to the report and is independent and unique. The identifiers serve as the nodes of the relationship network, and the branch between two nodes represents the magnitude of the semantic similarity index. The length of a branch is given by the reciprocal of the index value: the closer two reports are in meaning, the shorter the branch, which also indicates a greater similarity between the two research reports.
As shown in Fig. 2, with a text 1, a text 2 and a text 3, the branch between text 1 and text 2 is denoted branch 1 and the branch between text 1 and text 3 is denoted branch 2. Branch 2 is longer than branch 1, so in Fig. 2 text 1 and text 2 are closer in meaning than text 1 and text 3.
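The mapping from index to branch length can be sketched as below. This is an illustration only: `build_network` and the input dict are hypothetical, and skipping pairs whose index is zero is an assumption made here because such pairs have no finite reciprocal.

```python
def build_network(pair_index):
    """Return (nodes, edges) for the relationship network.

    pair_index maps (id_i, id_j) -> similarity index (hypothetical input shape).
    Each edge is (id_i, id_j, length) with length = 1 / index, so a closer
    pair of reports gets a shorter branch.
    """
    nodes = sorted({node for pair in pair_index for node in pair})
    edges = [
        (i, j, 1.0 / rho)
        for (i, j), rho in sorted(pair_index.items())
        if rho > 0  # assumption: no branch is drawn when the index is zero
    ]
    return nodes, edges


nodes, edges = build_network({(1, 2): 0.5, (1, 3): 0.25, (2, 3): 0.0})
```

With these sample values the branch between reports 1 and 3 is twice as long as the branch between reports 1 and 2, which is the situation depicted in Fig. 2. The resulting edge list could be handed to any graph-drawing tool for visualization.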
Fig. 3 shows the relationship network of the invention after modeling and visualization. Nodes of higher density can be identified in the visualized network, and the research reports corresponding to these nodes can be singled out for close study, which substantially improves the efficiency of that study.
Based on Zipf's law and the 80/20 rule, the method of the invention performs text conversion, word segmentation, word frequency statistics, semantic similarity calculation and relationship network drawing in turn, and finally obtains a research report relationship network that reflects both word frequency importance and the semantic similarity index. According to this network, most of the more important and semantically closer research reports can be filtered out, which greatly improves reading efficiency.
Fig. 4 is a schematic structural diagram of a device for extracting information from large samples of research reports provided by an embodiment of the invention. The device can be implemented in software and/or hardware, is generally integrated in an intelligent terminal, and can be realized through the method for extracting information from large samples of research reports. As shown in the figure, building on the above embodiments, this embodiment provides a device for extracting information from large samples of research reports, which mainly comprises a conversion module 410, a word segmentation module 420, a statistics module 430 and a drawing module 440.
The conversion module 410 is configured to convert the research report information into text to obtain tabular data, the tabular data being stored in plain-text form;
the word segmentation module 420 is configured to run statistics on the tabular data and output the word frequency of each word in the tabular data;
the statistics module 430 is configured to run statistics on the research report information to obtain, and rank, the semantic similarity index between each research report and the other research reports;
the drawing module 440 is configured to draw the relationship network of the research report information, with the numeric identifiers of the research reports as nodes and the obtained similarity indices as branches.
In one implementation of an exemplary embodiment of the invention, the word segmentation module is further configured to:
perform word segmentation on the input tabular data in text form to obtain a word segmentation result;
and outputting the word frequency of each word in the tabular data comprises:
Let the word segmentation result list of the text be $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list be $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text. Denote $Y_{all} = \sum_{i=1}^{N} Y_i$. The word frequency percentage list corresponding to the segmentation is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) is the share of the occurrences of word $X_i$ in the text.
In one implementation of an exemplary embodiment of the invention, at least a text 1 and a text 2 are included, and the statistics module includes a first statistics submodule configured to evaluate the following formulas:
Let the word segmentation result list of text 1 be $A_1$ = [formula missing in source] and that of text 2 be $A_2$ = [formula missing in source]. Sort $A_1$ and $A_2$ in descending order of the corresponding word frequency percentages; the sorted results are denoted $A'_1$ and $A'_2$ respectively, and the corresponding word frequency percentage lists are [formula missing in source] and [formula missing in source] respectively.
Introduce a filtering mechanism:
Denote [formula missing in source], where $i_1 < N_1$ and [condition missing in source] is satisfied.
Calculate the semantic similarity of text 1 and text 2:
Let $M = (0.8A'_1) \cap (0.8A'_2)$ and let $m$ be the number of elements of the set $M$. The word frequency percentage lists of these $m$ words in text 1 and text 2 are [formula missing in source] and [formula missing in source] respectively. Regarding these two lists as two vectors, denote [formula missing in source]. Since the respective components of the two vectors satisfy regularity, the value range of [formula missing in source] is [formula missing in source], and the value range of $\omega$ is also [formula missing in source]; the larger $\omega$ is, the closer the two texts are.
Let $U = (0.8A'_1) \cup (0.8A'_2)$ and let $u$ be the number of elements of the set $U$. Define [formula missing in source] and $\rho = a\omega$; the index $\rho$ is the value that characterizes the semantic similarity of the two texts.
The device provided in the above embodiment can execute the method for extracting information from large samples of research reports provided in any embodiment of the invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in the above embodiment, reference may be made to the method provided in any embodiment of the invention.
It will be appreciated that the invention also extends to computer programs suitable for putting the invention into practice, in particular computer programs on or in a carrier. The program may take the form of source code, object code, a code intermediate between source and object code such as a partially compiled form, or any other form suitable for use in implementing the method according to the invention. It should also be noted that such a program may have many different architectural designs. For example, program code implementing the functionality of the method or system according to the invention may be subdivided into one or more subroutines.
Many different ways of distributing the functionality among these subroutines will be apparent to the skilled person. The subroutines may be stored together in one executable file to form a self-contained program. Such an executable file may include computer-executable instructions, for example processor instructions and/or interpreter instructions (e.g. Java interpreter instructions). Alternatively, one or more or all of the subroutines may be stored in at least one external library file and linked with a main program either statically or dynamically (e.g. at run time). The main program contains at least one call to at least one of the subroutines. The subroutines may also include function calls to each other. An embodiment relating to a computer program product includes computer-executable instructions corresponding to each processing step of at least one of the methods set forth. These instructions may be subdivided into subroutines and/or stored in one or more files that may be linked statically or dynamically.
Another embodiment relating to a computer program product includes computer-executable instructions corresponding to each means of at least one of the systems and/or products set forth. These instructions may be subdivided into subroutines and/or stored in one or more files that may be linked statically or dynamically.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a storage medium, such as a ROM (e.g. a CD-ROM or a semiconductor ROM) or a magnetic recording medium (e.g. a floppy disk or a hard disk). Further, the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable, by radio or by other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or device. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or to be used in the performance of, the relevant method.
It should be noted that the embodiments mentioned above illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the functions described above may be optional or may be combined.
If desired, the steps discussed above are not limited to the order of execution in each embodiment; different steps may be performed in a different order and/or concurrently with each other. Furthermore, in other embodiments, one or more of the steps described above may be optional or may be combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It should be noted here that, although the above describes example embodiments of the invention, these descriptions should not be understood in a limiting sense. Rather, several variations and modifications may be made without departing from the scope of the invention as defined in the appended claims.
Those skilled in the art will appreciate that each module in the device of the embodiments of the invention can be implemented with a general-purpose computing device, and that the modules can be concentrated in a single computing device or in a network group composed of computing devices. The device in the embodiments of the invention corresponds to the method in the foregoing embodiments and can be realized by executable program code or by a combination of integrated circuits; the invention is therefore not limited to any specific hardware or software or combination thereof.
Those skilled in the art will appreciate that each module in the device of the embodiments of the invention can be implemented with a general-purpose mobile terminal, and that the modules can be concentrated in a single mobile terminal or in a combination of devices composed of mobile terminals. The device in the embodiments of the invention corresponds to the method in the foregoing embodiments and can be realized by editing executable program code or by a combination of integrated circuits; the invention is therefore not limited to any specific hardware or software or combination thereof.
The present embodiment also provides a kind of computer equipment, can such as execute the smart phone, tablet computer, notebook of program Computer, desktop computer, rack-mount server, blade server, tower server or Cabinet-type server are (including independent Server cluster composed by server or multiple servers) etc..The computer equipment 20 of the present embodiment includes at least but not It is limited to: memory 21, the processor 22 of connection can be in communication with each other by system bus, as shown in Figure 5.It is pointed out that Fig. 5 The computer equipment 20 with component 21-22 is illustrated only, it should be understood that being not required for implementing all groups shown Part, the implementation that can be substituted is more or less component.
In this embodiment, the memory 21 (i.e., a readable storage medium) includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as its hard disk or main memory. In other embodiments, the memory 21 may be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 20. The memory 21 may also include both the internal storage unit of the computer device 20 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the computer device 20, for example the RNNs neural network program code of embodiment one. The memory 21 may also be used to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 20. In this embodiment, the processor 22 runs the program code stored in the memory 21 or processes data, for example to implement each layer of the deep learning model, and thereby implements the large-sample research report information extraction method of the above embodiment.
This embodiment also provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, a server, or an App store, on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment stores a financial applet which, when executed by a processor, implements the large-sample research report information extraction method of the above embodiment.
Another embodiment, relating to a computer program product, includes computer-executable instructions corresponding to each of the means of at least one of the systems and/or products set forth. These instructions may be subdivided into subroutines and/or stored in one or more files that may be linked statically or dynamically.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a storage medium, such as a ROM (for example a CD-ROM or a semiconductor ROM) or a magnetic recording medium (for example a floppy disk or a hard disk). Further, the carrier may be a transmissible carrier, such as an electrical or optical signal, which may be conveyed via electrical or optical cable, by radio, or by other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or other device. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform the relevant method or to be used in its performance.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the functions described above may be optional or may be combined.
If desired, the steps are not limited to the order in which they are performed in each embodiment; the different steps discussed above may be performed in a different order and/or concurrently with each other. Furthermore, in other embodiments, one or more of the steps described above may be optional or may be combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It should be noted here that, although the foregoing describes example embodiments of the invention, these descriptions should not be understood in a limiting sense. Rather, several variations and modifications may be made without departing from the scope of the invention as defined in the appended claims.
Note that the above are merely exemplary embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the invention has been described in further detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (10)

1. A large-sample research report information extraction method, characterized in that the method comprises:
performing text conversion on research report information to obtain list data, the list data being stored as plain text;
performing statistics on the list data and outputting the word frequency of each word in the list data;
performing statistics on the research report information to obtain, and sort, the closeness indices between each piece of research report information and the remaining pieces of research report information;
drawing a relationship network of the research report information, with the numeric ID of each piece of research report information as a node and the obtained closeness indices as branches.
2. The method according to claim 1, characterized in that performing statistics on the list data comprises:
performing word segmentation on the input list data in text form to obtain a word segmentation result;
and outputting the word frequency of each word in the list data comprises:
the word segmentation result list of the text is $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list is $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text; let $Y_{all} = \sum_{i=1}^{N} Y_i$; the word frequency percentage list corresponding to the segmentation is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) and $Z_i$ is the proportion of occurrences of word $X_i$ in the text.
3. The method according to claim 2, characterized in that the text comprises at least text 1 and text 2, and obtaining and sorting the closeness indices between each piece of research report information and the remaining pieces comprises a statistics step:
the word segmentation result lists of text 1 and text 2 are $A_1$ and $A_2$, respectively; $A_1$ and $A_2$ are each sorted in descending order of the corresponding word frequency percentage, the sorted results are denoted $A'_1$ and $A'_2$, and the corresponding word frequency percentage lists are […] and […], respectively;
a filtering mechanism is introduced:
denote $0.8A'_1$ = […], where $i_1 < N_1$ and […] is satisfied;
the text-meaning closeness of text 1 and text 2 is calculated:
let $M = (0.8A'_1) \cap (0.8A'_2)$ and let $m$ be the number of elements of the set $M$; the word frequency percentage lists of these $m$ words in text 1 and text 2 are […] and […], respectively; regarding […] and […] as two vectors, denote […]; since the respective components of […] and […] satisfy regularity, the value range of […] is […], the value range of ω is also […], and the larger ω is, the closer the two texts are;
let $U = (0.8A'_1) \cup (0.8A'_2)$ and let $u$ be the number of elements of the set $U$; define […], ρ=aω; the index ρ is the characterizing value of the text-meaning closeness of the two texts.
4. The method according to claim 2, characterized in that obtaining and sorting the closeness indices between each piece of research report information and the remaining pieces comprises a sorting step, the sorting comprising:
computing, for each piece of research report information, the sum of its text-meaning closeness indices with the remaining pieces of research report information, and sorting by that sum.
5. The method according to claim 1, characterized in that drawing the relationship network of the research report information, with the numeric ID of each piece of research report information as a node and the obtained closeness indices as branches, comprises:
obtaining the numeric ID of each piece of research report information as a node of the relationship network, the branch between two nodes being the text-meaning closeness index, and the length of the branch characterizing the size of the text-meaning closeness index.
6. A large-sample research report information extraction device, characterized in that the device comprises:
a conversion module, configured to perform text conversion on research report information to obtain list data, the list data being stored as plain text;
a word segmentation module, configured to perform statistics on the list data and output the word frequency of each word in the list data;
a statistics module, configured to perform statistics on the research report information to obtain, and sort, the closeness indices between each piece of research report information and the remaining pieces of research report information;
a drawing module, configured to draw a relationship network of the research report information, with the numeric ID of each piece of research report information as a node and the obtained closeness indices as branches.
7. The device according to claim 6, characterized in that the word segmentation module is further configured to:
perform word segmentation on the input list data in text form to obtain a word segmentation result;
and output the word frequency of each word in the list data, comprising:
the word segmentation result list of the text is $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list is $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text; let $Y_{all} = \sum_{i=1}^{N} Y_i$; the word frequency percentage list corresponding to the segmentation is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) and $Z_i$ is the proportion of occurrences of word $X_i$ in the text.
8. The device according to claim 7, characterized in that the text comprises at least text 1 and text 2, and the statistics module comprises a first statistics submodule configured to perform the following computation:
the word segmentation result lists of text 1 and text 2 are $A_1$ and $A_2$, respectively; $A_1$ and $A_2$ are each sorted in descending order of the corresponding word frequency percentage, the sorted results are denoted $A'_1$ and $A'_2$, and the corresponding word frequency percentage lists are […] and […], respectively;
a filtering mechanism is introduced:
denote $0.8A'_1$ = […], where $i_1 < N_1$ and […] is satisfied;
the text-meaning closeness of text 1 and text 2 is calculated:
let $M = (0.8A'_1) \cap (0.8A'_2)$ and let $m$ be the number of elements of the set $M$; the word frequency percentage lists of these $m$ words in text 1 and text 2 are […] and […], respectively; regarding […] and […] as two vectors, denote […]; since the respective components of […] and […] satisfy regularity, the value range of […] is […], the value range of ω is also […], and the larger ω is, the closer the two texts are;
let $U = (0.8A'_1) \cup (0.8A'_2)$ and let $u$ be the number of elements of the set $U$; define […], ρ=aω; the index ρ is the characterizing value of the text-meaning closeness of the two texts.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
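The pipeline recited in claims 1 to 5 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the formulas for the 0.8 filter, for ω, and for a did not survive in the text above, so the sketch assumes that the filter keeps the most frequent words covering 80% of all occurrences, that ω is the cosine similarity of the two frequency vectors over the shared words M, and that the final index is the ratio m/u multiplied by ω. All function names are hypothetical, and the input is assumed to be already segmented into words.

```python
from collections import Counter
from itertools import combinations
import math


def word_frequency_permille(tokens):
    """Return {word: Z_i} with Z_i = Y_i / Y_all, expressed in units of 0.1%."""
    counts = Counter(tokens)          # Y_i: occurrences of each word
    total = sum(counts.values())      # Y_all: total number of word occurrences
    return {word: 1000.0 * c / total for word, c in counts.items()}


def top_share(freqs, share=0.8):
    """ASSUMED filter: keep the most frequent words until their cumulative
    share of all occurrences reaches `share` (0.8 in the claims)."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    threshold = share * sum(freqs.values())
    kept, running = {}, 0.0
    for word, z in ranked:
        if running >= threshold:
            break
        kept[word] = z
        running += z
    return kept


def closeness(freqs1, freqs2, share=0.8):
    """ASSUMED index: (m / u) * cosine similarity over the shared words."""
    top1, top2 = top_share(freqs1, share), top_share(freqs2, share)
    common = sorted(set(top1) & set(top2))   # set M, with m elements
    union = set(top1) | set(top2)            # set U, with u elements
    if not common:
        return 0.0
    v1 = [top1[w] for w in common]
    v2 = [top2[w] for w in common]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    omega = dot / norm                       # assumed form of omega
    return (len(common) / len(union)) * omega


def build_network(reports):
    """reports: {numeric_id: list of words}. Returns (edges, ranking), where
    edges maps each pair of IDs to its closeness index and ranking lists the
    IDs by the sum of their indices with all other reports, largest first."""
    freqs = {rid: word_frequency_permille(toks) for rid, toks in reports.items()}
    edges = {(a, b): closeness(freqs[a], freqs[b])
             for a, b in combinations(sorted(freqs), 2)}
    totals = {rid: 0.0 for rid in freqs}
    for (a, b), rho in edges.items():
        totals[a] += rho
        totals[b] += rho
    ranking = sorted(totals, key=totals.get, reverse=True)
    return edges, ranking
```

The `edges` dictionary corresponds to the branches of the relationship network and its keys to the node pairs; a graph-drawing library could map each index to a branch length, which is left out here.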
CN201910271619.9A 2019-04-04 2019-04-04 Large sample research and report information extraction method, device, equipment and storage medium Active CN110110295B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910271619.9A CN110110295B (en) 2019-04-04 2019-04-04 Large sample research and report information extraction method, device, equipment and storage medium
PCT/CN2019/103230 WO2020199482A1 (en) 2019-04-04 2019-08-29 Large sample research report information extraction method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910271619.9A CN110110295B (en) 2019-04-04 2019-04-04 Large sample research and report information extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110110295A (en) 2019-08-09
CN110110295B (en) 2023-10-20

Family

ID=67485207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910271619.9A Active CN110110295B (en) 2019-04-04 2019-04-04 Large sample research and report information extraction method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110110295B (en)
WO (1) WO2020199482A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111694928A (en) * 2020-05-28 2020-09-22 平安资产管理有限责任公司 Data index recommendation method and device, computer equipment and readable storage medium
WO2020199482A1 (en) * 2019-04-04 2020-10-08 平安科技(深圳)有限公司 Large sample research report information extraction method and apparatus, device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160140234A1 (en) * 2013-07-09 2016-05-19 Universiteit Twente Method and Computer Server System for Receiving and Presenting Information to a User in a Computer Network
CN108334494A (en) * 2018-01-23 2018-07-27 阿里巴巴集团控股有限公司 A kind of construction method and device of customer relationship network
CN108647822A (en) * 2018-05-10 2018-10-12 平安科技(深圳)有限公司 Electronic device, prediction method based on research report data, and computer storage medium
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Text similarity acquisition method, terminal device and medium
CN108959453A (en) * 2018-06-14 2018-12-07 中南民族大学 Information extraction method and device based on text clustering, and readable storage medium
CN109284504A (en) * 2018-10-22 2019-01-29 平安科技(深圳)有限公司 Securities research report scoring and analysis method and device using a deep learning model
CN109388804A (en) * 2018-10-22 2019-02-26 平安科技(深圳)有限公司 Method and device for extracting core views of securities research reports using a deep learning model
CN109460550A (en) * 2018-10-22 2019-03-12 平安科技(深圳)有限公司 Securities research report sentiment analysis method, apparatus and computer equipment using big data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170300564A1 (en) * 2016-04-19 2017-10-19 Sprinklr, Inc. Clustering for social media data
CN106446148B (en) * 2016-09-21 2019-08-09 中国运载火箭技术研究院 A kind of text duplicate checking method based on cluster
CN109325035A (en) * 2018-11-29 2019-02-12 阿里巴巴集团控股有限公司 The recognition methods of similar table and device
CN110110295B (en) * 2019-04-04 2023-10-20 平安科技(深圳)有限公司 Large sample research and report information extraction method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2020199482A1 (en) 2020-10-08
CN110110295B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant