CN110110295A - Large-sample research report information extraction method, device, equipment and storage medium - Google Patents
- Publication number
- CN110110295A (application number CN201910271619.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- breath
- notifying
- word
- list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention relates to a large-sample research report information extraction method, device, computer equipment and readable storage medium. The method comprises: performing text conversion on research report information to obtain list data, the list data being stored in plain-text form; performing statistics on the list data and outputting the word frequency of each word in the list data; performing statistics on the research report information to obtain, and sort, the similarity index between each piece of research report information and the remaining research report information; and drawing a relationship network of the research report information, with the numeric identifiers of the research report information as nodes and the obtained similarity indices as branches. The beneficial effect of the invention is that, by deriving a semantic-similarity index for research reports on the basis of Zipf's law and the 80/20 rule and drawing a relationship network of the research report information, the reports that are more important and semantically closer to most others can be filtered out from the network, so that the more valuable research report information can be screened more efficiently.
Description
Technical field
Embodiments of the present invention relate to the field of financial data processing technology, and in particular to a large-sample research report information extraction method, device and readable storage medium.
Background art
Research report information, referred to as research reports for short, denotes in the financial industry the analysis of the operating status and profitability of listed companies made from an independent and objective standpoint. It is of considerable importance to fund managers.
Most buy-side fund managers face a massive volume of research reports every day, and reading most of them in the limited time available is clearly impossible. At present, even in a field a fund manager knows well, selectively reading reports on the basis of personal experience and industry understanding cannot fully capture all the key information or central issues contained in the mass of reports. Moreover, personal experience and industry understanding inherently lag behind events, and every individual still has unfamiliar fields. How to help fund managers screen research reports in as little time as possible and obtain sufficiently useful information is therefore a problem of practical significance.
Summary of the invention
In order to overcome the problems in the related art, the present invention provides a large-sample research report information extraction method, device, computer equipment and readable storage medium, which screen massive research report information by means of a visualized relationship network combined with keywords, so as to efficiently filter out the more valuable research report information.
In a first aspect, an embodiment of the invention provides a large-sample research report information extraction method, the method comprising:
Performing text conversion on research report information to obtain list data, the list data being stored in plain-text form;
Performing statistics on the list data and outputting the word frequency of each word in the list data;
Performing statistics on the research report information to obtain, and sort, the similarity index between each piece of research report information and the remaining research report information;
Drawing a relationship network of the research report information, with the numeric identifiers of the research report information as nodes and the obtained similarity indices as branches.
With reference to another aspect, in another feasible embodiment of the invention, performing statistics on the list data comprises:
Performing word segmentation on the input list data in text form to obtain a word segmentation result;
Outputting the word frequency of each word in the list data comprises:
Let the word segmentation result list of the text be $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list be $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text. Denote $Y_{all} = \sum_{i=1}^{N} Y_i$. The word frequency percentage list corresponding to the segmentation is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) and $Z_i$ is the share of the occurrences of word $X_i$ in the text.
With reference to another aspect, in another feasible embodiment of the invention, at least text 1 and text 2 are included, and obtaining and sorting the similarity index between each piece of research report information and the remaining research report information comprises a statistics step:
Let the word segmentation result list of text 1 be $A_1$ and that of text 2 be $A_2$. Sort $A_1$ and $A_2$ in descending order of the corresponding word frequency percentage and denote the sorted results $A'_1$ and $A'_2$, each with its corresponding word frequency percentage list [formulas missing in source].
Introduce a filtering mechanism:
Denote the filtered list $0.8A'_1$ [formula missing in source], where $i_1 < N_1$ and $i_1$ satisfies [condition missing in source].
Calculate the semantic similarity of text 1 and text 2:
Let $M = (0.8A'_1) \cap (0.8A'_2)$ and let $m$ be the number of elements of $M$. The word frequency percentage lists of these $m$ words in text 1 and text 2 [formulas missing in source] are regarded as two vectors, from which $\omega$ is defined [formula missing in source]. Because the components of the two vectors are positive, the intermediate quantity and $\omega$ both have bounded value ranges [ranges missing in source], and the larger $\omega$ is, the closer the two texts.
Let $U = (0.8A'_1) \cup (0.8A'_2)$ and let $u$ be the number of elements of $U$. Define $a$ [formula missing in source] and $\rho = a\omega$. The index $\rho$ is the value that characterizes the semantic similarity of the two texts.
With reference to another aspect, in another feasible embodiment of the invention, obtaining and sorting the similarity index between each piece of research report information and the remaining research report information comprises a sorting step:
The sorting comprises:
Computing, for each piece of research report information, the sum of its semantic-similarity indices with the remaining research report information, and sorting by that sum.
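The sorting step can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name, the dictionary keyed by pairs of report identifiers, and the choice of descending order are all assumptions.

```python
from collections import defaultdict


def rank_reports_by_total_similarity(pair_index):
    """Sum each report's similarity indices with all other reports and sort.

    pair_index maps (report_id_a, report_id_b) -> similarity index
    (hypothetical layout, one entry per unordered pair).
    """
    totals = defaultdict(float)
    for (report_a, report_b), index_value in pair_index.items():
        # Each pairwise index counts toward both reports of the pair.
        totals[report_a] += index_value
        totals[report_b] += index_value
    # Descending order is an assumption: largest total similarity first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


pair_index = {(1, 2): 0.5, (1, 3): 0.25, (2, 3): 0.125}
ranking = rank_reports_by_total_similarity(pair_index)
```

Under these assumptions, a report near the top of `ranking` is close in meaning to many other reports.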
With reference to another aspect, in another feasible embodiment of the invention, drawing the relationship network of the research report information, with the numeric identifiers of the research report information as nodes and the obtained similarity indices as branches, comprises:
Obtaining the numeric identifier of each piece of research report information as a node of the relationship network. The branch between two nodes is the semantic-similarity index, and the length of the branch characterizes the size of that index.
In a second aspect, the invention also provides a large-sample research report information extraction device, the device comprising:
A conversion module, for performing text conversion on research report information to obtain list data, the list data being stored in plain-text form;
A word segmentation module, for performing statistics on the list data and outputting the word frequency of each word in the list data;
A statistical module, for performing statistics on the research report information to obtain, and sort, the similarity index between each piece of research report information and the remaining research report information;
A drawing module, for drawing a relationship network of the research report information, with the numeric identifiers of the research report information as nodes and the obtained similarity indices as branches.
In a third aspect, the invention also provides a computer equipment comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above method when executing the computer program.
In a fourth aspect, the invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above method when executed by a processor.
By deriving the semantic-similarity index of research reports on the basis of Zipf's law and the 80/20 rule and drawing the relationship network of the research report information, the invention makes it possible to filter out from the network the reports that are more important and semantically closer to most others, so that the more valuable research report information can be screened more efficiently. The keywords and the node density of the relationship network can also reveal the focus issues of capital-market attention during the trading period concerned.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the invention.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a basic flow diagram of a large-sample research report information extraction method according to an exemplary embodiment.
Fig. 2 is a schematic diagram of branch lengths in a relationship network according to an exemplary embodiment.
Fig. 3 is a schematic diagram of a relationship network according to an exemplary embodiment.
Fig. 4 is a schematic block diagram of a large-sample research report information extraction device according to an exemplary embodiment.
Fig. 5 is a block diagram of a computer equipment implementing the method according to an exemplary embodiment.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as sequential processing, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the drawings. A process may correspond to a method, function, procedure, subroutine, subprogram and so on.
The invention relates to a large-sample research report information extraction method, device, computer equipment and readable storage medium, mainly applied in financial-industry scenarios where research report information is processed for particular needs. The basic idea is as follows. According to Zipf's law, the number of times a word occurs is inversely proportional to its rank in the frequency table. According to the 80/20 rule, the information reflected by the words ranked in the top 20% by word frequency in a research report accounts for 80% of the important content of that report. On this basis, statistics yield the semantic-similarity index of the research reports, and the relationship network of the research report information is drawn. From this network, the reports that are more important and semantically closer to most others can be filtered out, the more valuable research report information can be screened more easily, and the focus issues of capital-market attention can also be obtained from the keywords in the network.
This embodiment is applicable where large-sample research report information extraction is carried out in an intelligent terminal having a central processing module. The method can be executed by the central processing module, which can be implemented in software and/or hardware and is generally integrated in the intelligent terminal. Fig. 1 shows the basic flow of the large-sample research report information extraction method of the invention, which specifically comprises the following steps:
In step 110, text conversion is performed on the research report information, and the list data obtained is stored in plain-text form;
Research report information is generally in PDF format, and the text in a PDF generally cannot be processed directly. The information therefore needs to be converted: existing software such as smallpdf can convert a PDF research report into Word format, and the Word document is then saved in txt format so that only the text is retained.
Word segmentation is performed on the txt text, for which a word segmentation package can be used. The output is a file in CSV format. The CSV format stores list data, such as text and numbers, in plain-text form: the file is a sequence of characters and contains no data that must be interpreted as binary.
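The segmentation-to-CSV step might look like the Python sketch below. It is only an illustration under stated assumptions: the source does not name the word segmentation package, so `segment` here is a stand-in regular-expression tokenizer that would need to be replaced by a real segmenter for Chinese text, and the single-row CSV layout is also an assumption.

```python
import csv
import io
import re


def segment(text):
    # Stand-in tokenizer (assumption): lower-case runs of word characters.
    # A dedicated word segmentation package would replace this function.
    return re.findall(r"\w+", text.lower())


def segment_to_csv(text):
    """Segment plain text and return the tokens as one row of CSV text."""
    tokens = segment(text)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(tokens)
    return buffer.getvalue()


csv_text = segment_to_csv("Profit growth and profit margin")
# Reading the plain-text CSV back recovers the token list.
tokens_read_back = next(csv.reader(io.StringIO(csv_text)))
```

Writing to an in-memory buffer keeps the sketch self-contained; a file path could be used in the same way.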
In step 120, statistics are performed on the list data, and the word frequency of each word in the list data is output;
Counting the CSV text yields a statistical result: a list containing every word of the research report information and its number of occurrences. The word frequencies are then converted into percentages by Formula 1:
Formula 1:
Let the word segmentation result list of the text be $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list be $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text. Denote $Y_{all} = \sum_{i=1}^{N} Y_i$. The word frequency percentage list corresponding to the segmentation is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) and $Z_i$ is the share of the occurrences of word $X_i$ in the text, i.e. the word frequency of a single word.
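Formula 1 can be illustrated with a short Python sketch. The function name is hypothetical, and the shares are returned as plain fractions; the source quotes them in units of 0.1%, which is read here as multiplying the fraction by 1000, an interpretation rather than something the text states.

```python
from collections import Counter


def word_frequency_shares(tokens):
    """Return Z_i = Y_i / Y_all for every distinct word X_i in tokens."""
    counts = Counter(tokens)        # Y_i: occurrences of each word
    y_all = sum(counts.values())    # Y_all: total number of occurrences
    return {word: count / y_all for word, count in counts.items()}


shares = word_frequency_shares(["profit", "growth", "profit", "margin"])
# "profit" occurs 2 times out of 4, so its share is 0.5.
```

The shares of one text always sum to 1, which makes texts of different lengths comparable.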
In step 130, statistics are performed on the research report information to obtain, and sort, the similarity index between each piece of research report information and the remaining research report information;
This step calculates the semantic-similarity index, which reflects how close in meaning different pieces of research report information are. It can be obtained by Formula 2:
Formula 2:
In a feasible implementation of an exemplary embodiment of the invention, at least text 1 and text 2 are included and, as shown in Fig. 2, text 3 may also be included. Let the word segmentation result list of text 1 be $A_1$ and that of text 2 be $A_2$. Sort $A_1$ and $A_2$ in descending order of the corresponding word frequency percentage and denote the sorted results $A'_1$ and $A'_2$, each with its corresponding word frequency percentage list [formulas missing in source].
Introduce a filtering mechanism:
Denote the filtered list $0.8A'_1$ [formula missing in source], where $i_1 < N_1$ and $i_1$ satisfies [condition missing in source].
Calculate the semantic similarity of text 1 and text 2:
Let $M = (0.8A'_1) \cap (0.8A'_2)$ and let $m$ be the number of elements of $M$. The word frequency percentage lists of these $m$ words in text 1 and text 2 [formulas missing in source] are regarded as two vectors, from which $\omega$ is defined [formula missing in source]. Because the components of the two vectors are positive, the intermediate quantity and $\omega$ both have bounded value ranges [ranges missing in source], and the larger $\omega$ is, the closer the two texts.
Let $U = (0.8A'_1) \cup (0.8A'_2)$ and let $u$ be the number of elements of $U$. Define $a$ [formula missing in source] and $\rho = a\omega$. The index $\rho$ is the value that characterizes the semantic similarity of the two texts.
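Because several formulas of this step are missing from the text, the Python sketch below fills them with explicit assumptions and should not be read as the patent's definition. Assumed here: the filter keeps the highest-share words until their cumulative share reaches 80%; ω is the cosine similarity of the two share vectors over the common words M; a is m/u; and ρ is the product of a and ω.

```python
import math


def top_words(shares, coverage=0.8):
    """Assumed filter: keep top-ranked words until cumulative share >= coverage."""
    kept, cumulative = set(), 0.0
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    for word, share in ranked:
        if cumulative >= coverage:
            break
        kept.add(word)
        cumulative += share
    return kept


def similarity_index(shares_1, shares_2, coverage=0.8):
    """Assumed rho = (m / u) * cosine similarity over the common words."""
    kept_1 = top_words(shares_1, coverage)
    kept_2 = top_words(shares_2, coverage)
    common = kept_1 & kept_2   # set M with m elements
    union = kept_1 | kept_2    # set U with u elements
    if not common:
        return 0.0             # no shared key words, no measurable similarity
    vector_1 = [shares_1[word] for word in common]
    vector_2 = [shares_2[word] for word in common]
    dot = sum(x * y for x, y in zip(vector_1, vector_2))
    norm_1 = math.sqrt(sum(x * x for x in vector_1))
    norm_2 = math.sqrt(sum(y * y for y in vector_2))
    omega = dot / (norm_1 * norm_2)   # assumed definition of omega
    a = len(common) / len(union)      # assumed definition of a
    return a * omega


text_1 = {"profit": 0.5, "growth": 0.25, "margin": 0.25}
text_2 = {"profit": 0.5, "growth": 0.25, "debt": 0.25}
rho = similarity_index(text_1, text_2)
```

With these choices ρ lies between 0 and 1: the overlap ratio rewards texts that share many key words, and the cosine term rewards similar emphasis on the shared words.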
In step 140, the relationship network of the research report information is drawn, with the numeric identifiers of the research report information as nodes and the obtained similarity indices as branches.
This step is the drawing step. To draw the relationship network of the research reports, a numeric identifier is first assigned to each piece of research report information. Each identifier corresponds to one piece of research report information and is unique. The numeric identifiers serve as the nodes of the relationship network, and the branch between two nodes represents the size of the semantic-similarity index. The length of a branch is the reciprocal of the index value: the closer the meaning, the shorter the branch, which indicates that the two pieces of research report information are more similar.
Fig. 2 includes text 1, text 2 and text 3. The branch between text 1 and text 2 is denoted branch 1, and the branch between text 1 and text 3 is denoted branch 2. Branch 2 is longer than branch 1, so text 1 and text 2 in Fig. 2 are closer in meaning than text 1 and text 3.
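The node and branch data behind such a drawing can be sketched in Python as below. The function name and the dictionary layout are hypothetical, and omitting a branch when the index is zero is an assumption (the reciprocal would be undefined); only the reciprocal branch length comes from the description.

```python
def build_relationship_network(pair_index):
    """Return (nodes, branches) for the relationship network.

    pair_index maps (report_id_a, report_id_b) -> similarity index
    (hypothetical layout). Branch length is the reciprocal of the index.
    """
    nodes = sorted({report_id for pair in pair_index for report_id in pair})
    branches = {
        pair: 1.0 / index_value
        for pair, index_value in pair_index.items()
        if index_value > 0  # assumption: no branch without measurable similarity
    }
    return nodes, branches


nodes, branches = build_relationship_network(
    {(1, 2): 0.5, (1, 3): 0.25, (2, 3): 0.0}
)
# Branch (1, 2) is shorter than branch (1, 3): text 1 is closer to text 2.
```

The resulting lengths could then be passed to any graph drawing tool that accepts target edge lengths.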
Fig. 3 shows the relationship network of the invention after modeling and visualization. The nodes with higher density can be picked out from the visualized network, and the research reports corresponding to those nodes can be given priority and studied closely, which greatly improves the efficiency of close reading.
On the basis of Zipf's law and the 80/20 rule, the method of the invention carries out text conversion, word segmentation, word frequency statistics, semantic-similarity calculation and relationship-network drawing in turn, and finally obtains a research report relationship network that reflects both word frequency importance and the semantic-similarity index. From this network, the reports that are more important and semantically closer to most others can be filtered out, which greatly improves reading efficiency.
Fig. 4 is a structural schematic diagram of a large-sample research report information extraction device provided by an embodiment of the invention. The device can be implemented in software and/or hardware, is generally integrated in an intelligent terminal, and can carry out the large-sample research report information extraction method. As shown in the figure, based on the above embodiment, this embodiment provides a large-sample research report information extraction device that mainly includes a conversion module 410, a word segmentation module 420, a statistical module 430 and a drawing module 440.
The conversion module 410 is used for performing text conversion on research report information to obtain list data, the list data being stored in plain-text form;
The word segmentation module 420 is used for performing statistics on the list data and outputting the word frequency of each word in the list data;
The statistical module 430 is used for performing statistics on the research report information to obtain, and sort, the similarity index between each piece of research report information and the remaining research report information;
The drawing module 440 is used for drawing a relationship network of the research report information, with the numeric identifiers of the research report information as nodes and the obtained similarity indices as branches.
In an implementation of an exemplary embodiment of the invention, the word segmentation module is also used for:
Performing word segmentation on the input list data in text form to obtain a word segmentation result;
Outputting the word frequency of each word in the list data, comprising:
Let the word segmentation result list of the text be $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list be $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text. Denote $Y_{all} = \sum_{i=1}^{N} Y_i$. The word frequency percentage list corresponding to the segmentation is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) and $Z_i$ is the share of the occurrences of word $X_i$ in the text.
In an implementation of an exemplary embodiment of the invention, at least text 1 and text 2 are included, and the statistical module includes a first statistics submodule for carrying out the following calculation:
Let the word segmentation result list of text 1 be $A_1$ and that of text 2 be $A_2$. Sort $A_1$ and $A_2$ in descending order of the corresponding word frequency percentage and denote the sorted results $A'_1$ and $A'_2$, each with its corresponding word frequency percentage list [formulas missing in source].
Introduce a filtering mechanism:
Denote the filtered list $0.8A'_1$ [formula missing in source], where $i_1 < N_1$ and $i_1$ satisfies [condition missing in source].
Calculate the semantic similarity of text 1 and text 2:
Let $M = (0.8A'_1) \cap (0.8A'_2)$ and let $m$ be the number of elements of $M$. The word frequency percentage lists of these $m$ words in text 1 and text 2 [formulas missing in source] are regarded as two vectors, from which $\omega$ is defined [formula missing in source]. Because the components of the two vectors are positive, the intermediate quantity and $\omega$ both have bounded value ranges [ranges missing in source], and the larger $\omega$ is, the closer the two texts.
Let $U = (0.8A'_1) \cup (0.8A'_2)$ and let $u$ be the number of elements of $U$. Define $a$ [formula missing in source] and $\rho = a\omega$. The index $\rho$ is the value that characterizes the semantic similarity of the two texts.
The large-sample research report information extraction device provided in the above embodiment can execute the large-sample research report information extraction method provided in any embodiment of the invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in the above embodiment, refer to the large-sample research report information extraction method provided in any embodiment of the invention.
It will be appreciated that the invention also extends to computer programs suitable for putting the invention into practice, in particular computer programs on or in a carrier. The program may be in the form of source code, object code, a code intermediate between source and object code such as a partially compiled form, or in any other form suitable for use in implementing the method according to the invention. It will also be noted that such a program may have many different architectural designs. For example, program code implementing the functionality of the method or system according to the invention may be subdivided into one or more subroutines.
Many different ways of distributing the functionality among these subroutines will be apparent to the skilled person. The subroutines may be stored together in one executable file to form a self-contained program. Such an executable file may include computer-executable instructions, for example processor instructions and/or interpreter instructions (for example, Java interpreter instructions). Alternatively, one or more or all of the subroutines may be stored in at least one external library file and linked with a main program either statically or dynamically (for example, at run time). The main program contains at least one call to at least one of the subroutines. The subroutines may also include function calls to each other. An embodiment relating to a computer program product includes computer-executable instructions corresponding to each processing step of at least one of the methods set forth. These instructions may be subdivided into subroutines and/or stored in one or more files that may be linked statically or dynamically.
Another embodiment relating to a computer program product includes computer-executable instructions corresponding to each device of at least one of the systems and/or products set forth. These instructions may be subdivided into subroutines and/or stored in one or more files that may be linked statically or dynamically.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a storage medium, such as a ROM (for example a CD-ROM or a semiconductor ROM) or a magnetic recording medium (for example a floppy disk or hard disk). Further, the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable, or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or device. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or to be used in the performance of, the relevant method.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
If desired, the different functions discussed here may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the functions described above may be optional or may be combined.
If desired, the steps discussed above are not limited to the order of execution in each embodiment; different steps may be performed in a different order and/or concurrently with each other. Furthermore, in other embodiments, one or more of the steps described above may be optional or may be combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted here that, while the above describes example embodiments of the invention, these descriptions should not be viewed in a limiting sense. Rather, several variations and modifications may be made without departing from the scope of the invention as defined in the appended claims.
Those skilled in the art will appreciate that each module in the device of the embodiment of the invention can be implemented with a general-purpose computing device, and that the modules can be concentrated on a single computing device or on a network of computing devices. The device in the embodiment corresponds to the method in the preceding embodiments and can be implemented by executable program code or by a combination of integrated circuits. The invention is therefore not limited to any specific hardware, software or combination thereof.
Those skilled in the art will also appreciate that each module in the device of the embodiment of the invention can be implemented with a general-purpose mobile terminal, and that the modules can be concentrated on a single mobile terminal or on a combination of devices composed of mobile terminals. The device in the embodiment corresponds to the method in the preceding embodiments and can be implemented by editing executable program code or by a combination of integrated circuits. The invention is therefore not limited to any specific hardware, software or combination thereof.
This embodiment also provides a computer equipment capable of executing programs, such as a smartphone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server or cabinet server (including an independent server or a server cluster composed of multiple servers). The computer equipment 20 of this embodiment includes at least, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus, as shown in Fig. 5. It should be pointed out that Fig. 5 shows only the computer equipment 20 with components 21-22. Not all of the components shown are required, and more or fewer components may be implemented instead.
In this embodiment, the memory 21 (i.e. a readable storage medium) includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer equipment 20, such as its hard disk or internal memory. In other embodiments, the memory 21 may be an external storage device of the computer equipment 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the computer equipment 20. The memory 21 may also include both the internal storage unit of the computer equipment 20 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the computer equipment 20, for example the RNNs neural network program code of embodiment one. In addition, the memory 21 may be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to implement each layer of the deep learning model, so as to implement the large-sample research report information extraction method of the above embodiments.
This embodiment also provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, a server, or an app store, on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used to store a financial mini-program which, when executed by a processor, implements the large-sample research report information extraction method of the above embodiments.
Another embodiment relating to a computer program product comprises computer-executable instructions corresponding to each device of at least one of the systems and/or products set forth. These instructions may be subdivided into subroutines and/or stored in one or more files that may be linked statically or dynamically.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a storage medium, such as a ROM (for example, a CD-ROM or a semiconductor ROM) or a magnetic recording medium (for example, a floppy disk or a hard disk). Furthermore, the carrier may be a transmissible carrier, such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, by radio, or by other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or device. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or to be used in the performance of, the relevant method.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with one another. Furthermore, if desired, one or more of the functions described above may be optional or may be combined.
If desired, the steps are not limited to the order in which they are performed in each embodiment; the steps discussed above may be performed in a different order and/or concurrently with one another. Furthermore, in other embodiments, one or more of the steps described above may be optional or may be combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It should also be noted here that, although the above describes example embodiments of the invention, these descriptions should not be understood in a limiting sense. Rather, several changes and modifications may be made without departing from the scope of the present invention as defined in the appended claims.
Those skilled in the art will appreciate that each module in the device of the embodiments of the present invention may be implemented with a general-purpose computing device, and the modules may be concentrated on a single computing device or on a network of computing devices. The device in the embodiments of the present invention corresponds to the method in the foregoing embodiments and may be implemented by executable program code or by a combination of integrated circuits; the invention is therefore not limited to any specific hardware, software, or combination thereof.
Those skilled in the art will appreciate that each module in the device of the embodiments of the present invention may be implemented with a general-purpose mobile terminal, and the modules may be concentrated on a single mobile terminal or on a combination of devices composed of mobile terminals. The device in the embodiments of the present invention corresponds to the method in the foregoing embodiments and may be implemented by executable program code or by a combination of integrated circuits; the invention is therefore not limited to any specific hardware, software, or combination thereof.
Note that the above are only exemplary embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the invention is determined by the scope of the appended claims.
Claims (10)
1. A large-sample research report information extraction method, characterized in that the method comprises:
performing text conversion on research report information to obtain list data, the list data being stored as plain text;
performing statistics on the list data and outputting the word frequency of each word in the list data;
performing statistics on the research report information to obtain a closeness index between each piece of research report information and the remaining research report information, and sorting;
drawing a relationship network of the research report information, using the numeric identifiers of the research report information as nodes and the obtained closeness indices as branches.
2. The method according to claim 1, characterized in that performing statistics on the list data comprises:
performing word segmentation on the input list data in text form to obtain a word segmentation result;
and outputting the word frequency of each word in the list data comprises:
the word segmentation result list of a text is $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list is $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text; let $Y_{all} = \sum_{i=1}^{N} Y_i$; the word frequency percentage list corresponding to the segmented words is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) is the proportion of occurrences of word $X_i$ in the text.
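As a rough illustration of the word frequency statistics in claim 2, the following Python sketch computes $Y_i$ and $Z_i$ for a text that has already been segmented. The function name `word_frequency_percentages` is hypothetical, the claim does not name a segmentation tool, and $Y_{all}$ is assumed here to be the total number of word occurrences in the text.

```python
from collections import Counter


def word_frequency_percentages(tokens):
    """Return {word: (Y_i, Z_i)} for an already segmented text.

    Y_i is the raw count of the word. Z_i is Y_i / Y_all expressed in
    units of 0.1%, so the Z_i values of a non-empty text sum to 1000.
    Y_all is assumed to be the total number of tokens.
    """
    counts = Counter(tokens)
    y_all = sum(counts.values())
    if y_all == 0:
        return {}  # empty text: nothing to count
    return {word: (y, 1000.0 * y / y_all) for word, y in counts.items()}
```

Calling the function on the segmented words of one research report yields both lists at once, which keeps each word paired with its count and its share.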
3. The method according to claim 2, characterized in that the text includes at least text 1 and text 2, and obtaining the closeness index between each piece of research report information and the remaining research report information and sorting comprises a statistics step:
the word segmentation result list of text 1 is $A_1$ = […] and the word segmentation result list of text 2 is $A_2$ = […]; $A_1$ and $A_2$ are each sorted in descending order of the corresponding word frequency percentage, and the sorted results are denoted $A'_1$ and $A'_2$;
the corresponding word frequency percentage lists are […] and […], respectively;
a filtering mechanism is introduced:
let $0.8A'_1$ = […], where $i_1 < N_1$ and $i_1$ satisfies […];
the semantic closeness of text 1 and text 2 is calculated:
let $M = (0.8A'_1) \cap (0.8A'_2)$, where the number of elements of the set $M$ is $m$; the word frequency percentage lists of these $m$ words in text 1 and text 2 are […] and […], respectively; regarding […] and […] as two vectors, let […]; because the components of […] and […] satisfy regularity, the value range of […] is […], the value range of $\omega$ is also […], and the larger $\omega$ is, the closer the two texts are;
let $U = (0.8A'_1) \cup (0.8A'_2)$, where the number of elements of the set $U$ is $u$; define […], $\rho = a\omega$; the index $\rho$ characterizes the semantic closeness of the two texts.
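The formulas of claim 3 are not reproduced in this text, so the following Python sketch is only one plausible reading, not the claimed computation. It assumes that $0.8A'$ is the shortest prefix of the frequency-sorted word list whose cumulative share reaches 80%, that $\omega$ is the cosine of the two frequency vectors over the shared words $M$, that $a = m/u$, and that $\rho$ is the product of $a$ and $\omega$. The function names are hypothetical.

```python
import math
from collections import Counter


def top_share_words(tokens, share=0.8):
    """Assumed 0.8A' filter: keep the most frequent words until their
    cumulative frequency share reaches `share`. Returns {word: share}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    kept, cumulative = {}, 0.0
    # Sort by count descending, then by word, for a deterministic order.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if cumulative >= share:
            break
        kept[word] = count / total
        cumulative += count / total
    return kept


def closeness_index(tokens_1, tokens_2, share=0.8):
    """Assumed index rho = a * omega for two segmented texts."""
    kept_1 = top_share_words(tokens_1, share)
    kept_2 = top_share_words(tokens_2, share)
    shared = sorted(set(kept_1) & set(kept_2))  # set M, with m elements
    union = set(kept_1) | set(kept_2)           # set U, with u elements
    if not shared:
        return 0.0  # no common high-frequency words
    v1 = [kept_1[w] for w in shared]
    v2 = [kept_2[w] for w in shared]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    omega = dot / norm                 # assumed cosine similarity
    a = len(shared) / len(union)       # assumed a = m / u
    return a * omega
```

Under these assumptions the index is 0 for texts with no shared high-frequency words and approaches 1 for texts whose filtered word lists and frequency profiles coincide.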
4. The method according to claim 2, characterized in that obtaining the closeness index between each piece of research report information and the remaining research report information and sorting comprises a sorting step, the sorting comprising:
computing, for each piece of research report information, the sum of its semantic closeness indices with the remaining research report information, and sorting by the index sums.
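The sorting step of claim 4 can be sketched as follows, assuming the pairwise indices are already available as a dictionary keyed by pairs of report identifiers. The function name `rank_by_total_closeness`, the input layout, and the descending sort order are assumptions; the claim only states that the index sums are sorted.

```python
def rank_by_total_closeness(pairwise):
    """Sum each report's closeness indices and sort the sums.

    `pairwise` maps (i, j) pairs of report identifiers to the index
    between report i and report j, with each pair listed once.
    Returns [(identifier, index sum)], largest sum first (assumed order).
    """
    totals = {}
    for (i, j), rho in pairwise.items():
        totals[i] = totals.get(i, 0.0) + rho
        totals[j] = totals.get(j, 0.0) + rho
    # Ties are broken by identifier so the result is deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

A report near the top of this ranking is, in aggregate, close in meaning to many of the other reports in the sample.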
5. The method according to claim 1, characterized in that drawing the relationship network of the research report information, using the numeric identifiers of the research report information as nodes and the obtained closeness indices as branches, comprises:
taking the numeric identifier of each piece of research report information as a node of the relationship network, the branch between two nodes being the semantic closeness index, and the length of the branch characterizing the magnitude of the semantic closeness index.
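A minimal sketch of the network data behind claim 5 is given below; it prepares nodes and branches but does not draw anything. The function name `build_relationship_network`, the dictionary layout, and the default mapping from index to branch length (`1.0 - rho`, so that closer reports sit nearer each other) are assumptions, since the claim does not state how length is derived from the index.

```python
def build_relationship_network(pairwise, length_of=lambda rho: 1.0 - rho):
    """Build nodes and branches from pairwise closeness indices.

    `pairwise` maps (i, j) pairs of report identifiers to their index.
    `length_of` converts an index to a branch length (assumed mapping).
    Returns (nodes, edges); each edge is a dict describing one branch.
    """
    nodes = sorted({n for pair in pairwise for n in pair})
    edges = [
        {"source": i, "target": j, "index": rho, "length": length_of(rho)}
        for (i, j), rho in sorted(pairwise.items())
    ]
    return nodes, edges
```

The returned lists can be handed to any graph drawing tool; passing a different `length_of` changes how the index is encoded in the branch length.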
6. A large-sample research report information extraction device, characterized in that the device comprises:
a conversion module, configured to perform text conversion on research report information to obtain list data, the list data being stored as plain text;
a word segmentation module, configured to perform statistics on the list data and output the word frequency of each word in the list data;
a statistics module, configured to perform statistics on the research report information to obtain a closeness index between each piece of research report information and the remaining research report information, and to sort;
a drawing module, configured to draw a relationship network of the research report information, using the numeric identifiers of the research report information as nodes and the obtained closeness indices as branches.
7. The device according to claim 6, characterized in that the word segmentation module is further configured to:
perform word segmentation on the input list data in text form to obtain a word segmentation result;
and output the word frequency of each word in the list data, comprising:
the word segmentation result list of a text is $\{X_1, X_2, \ldots, X_N\}$ and the corresponding word frequency list is $\{Y_1, Y_2, \ldots, Y_N\}$, where $Y_i$ is the number of times word $X_i$ occurs in the text; let $Y_{all} = \sum_{i=1}^{N} Y_i$; the word frequency percentage list corresponding to the segmented words is $\{Z_1, Z_2, \ldots, Z_N\}$, where $Z_i = Y_i / Y_{all}$ (unit: 0.1%) is the proportion of occurrences of word $X_i$ in the text.
8. The device according to claim 7, characterized in that the text includes at least text 1 and text 2, and the statistics module comprises a first statistics submodule configured to perform the following calculation:
the word segmentation result list of text 1 is $A_1$ = […] and the word segmentation result list of text 2 is $A_2$ = […]; $A_1$ and $A_2$ are each sorted in descending order of the corresponding word frequency percentage, and the sorted results are denoted $A'_1$ and $A'_2$;
the corresponding word frequency percentage lists are […] and […], respectively;
a filtering mechanism is introduced:
let $0.8A'_1$ = […], where $i_1 < N_1$ and $i_1$ satisfies […];
the semantic closeness of text 1 and text 2 is calculated:
let $M = (0.8A'_1) \cap (0.8A'_2)$, where the number of elements of the set $M$ is $m$; the word frequency percentage lists of these $m$ words in text 1 and text 2 are […] and […], respectively; regarding […] and […] as two vectors, let […]; because the components of […] and […] satisfy regularity, the value range of […] is […], the value range of $\omega$ is also […], and the larger $\omega$ is, the closer the two texts are;
let $U = (0.8A'_1) \cup (0.8A'_2)$, where the number of elements of the set $U$ is $u$; define […], $\rho = a\omega$; the index $\rho$ characterizes the semantic closeness of the two texts.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program implements the steps of the method of any one of claims 1 to 5 when executed by a processor.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271619.9A CN110110295B (en) | 2019-04-04 | 2019-04-04 | Large sample research and report information extraction method, device, equipment and storage medium |
PCT/CN2019/103230 WO2020199482A1 (en) | 2019-04-04 | 2019-08-29 | Large sample research report information extraction method and apparatus, device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271619.9A CN110110295B (en) | 2019-04-04 | 2019-04-04 | Large sample research and report information extraction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110110295A (en) | 2019-08-09 |
CN110110295B (en) | 2023-10-20 |
Family
ID=67485207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910271619.9A Active CN110110295B (en) | 2019-04-04 | 2019-04-04 | Large sample research and report information extraction method, device, equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110110295B (en) |
WO (1) | WO2020199482A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111694928A (en) * | 2020-05-28 | 2020-09-22 | 平安资产管理有限责任公司 | Data index recommendation method and device, computer equipment and readable storage medium |
WO2020199482A1 (en) * | 2019-04-04 | 2020-10-08 | 平安科技(深圳)有限公司 | Large sample research report information extraction method and apparatus, device, and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160140234A1 (en) * | 2013-07-09 | 2016-05-19 | Universiteit Twente | Method and Computer Server System for Receiving and Presenting Information to a User in a Computer Network |
CN108334494A (en) * | 2018-01-23 | 2018-07-27 | 阿里巴巴集团控股有限公司 | A kind of construction method and device of customer relationship network |
CN108647822A (en) * | 2018-05-10 | 2018-10-12 | 平安科技(深圳)有限公司 | Electronic device, prediction method based on research report data, and computer storage medium |
CN108710613A (en) * | 2018-05-22 | 2018-10-26 | 平安科技(深圳)有限公司 | Acquisition methods, terminal device and the medium of text similarity |
CN108959453A (en) * | 2018-06-14 | 2018-12-07 | 中南民族大学 | Information extracting method, device and readable storage medium storing program for executing based on text cluster |
CN109284504A (en) * | 2018-10-22 | 2019-01-29 | 平安科技(深圳)有限公司 | Securities research report scoring and analysis method and device using a deep learning model |
CN109388804A (en) * | 2018-10-22 | 2019-02-26 | 平安科技(深圳)有限公司 | Securities research report core viewpoint extraction method and device using a deep learning model |
CN109460550A (en) * | 2018-10-22 | 2019-03-12 | 平安科技(深圳)有限公司 | Securities research report sentiment analysis method, apparatus and computer equipment using big data |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170300564A1 (en) * | 2016-04-19 | 2017-10-19 | Sprinklr, Inc. | Clustering for social media data |
CN106446148B (en) * | 2016-09-21 | 2019-08-09 | 中国运载火箭技术研究院 | A kind of text duplicate checking method based on cluster |
CN109325035A (en) * | 2018-11-29 | 2019-02-12 | 阿里巴巴集团控股有限公司 | The recognition methods of similar table and device |
CN110110295B (en) * | 2019-04-04 | 2023-10-20 | 平安科技(深圳)有限公司 | Large sample research and report information extraction method, device, equipment and storage medium |
-
2019
- 2019-04-04 CN CN201910271619.9A patent/CN110110295B/en active Active
- 2019-08-29 WO PCT/CN2019/103230 patent/WO2020199482A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2020199482A1 (en) | 2020-10-08 |
CN110110295B (en) | 2023-10-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108376364B (en) | Payment system account checking method and device and terminal device | |
CN109902250A (en) | Sharing method, sharing means, computer equipment and the storage medium of questionnaire survey | |
CN109191090A (en) | Means of payment recommended method, device, equipment and computer readable storage medium | |
CN105824855B (en) | Method and device for screening and classifying data objects and electronic equipment | |
CN109918678B (en) | Method and device for identifying field meaning | |
CN110264342A (en) | A kind of business audit method and device based on machine learning | |
CN109711733A (en) | For generating method, electronic equipment and the computer-readable medium of Clustering Model | |
CN107330572A (en) | Air control method, apparatus and system | |
CN110110295A (en) | Large sample research report information extraction method, device, equipment and storage medium | |
CN112434884A (en) | Method and device for establishing supplier classified portrait | |
CN112052385A (en) | Investment and financing project recommendation method and device, electronic equipment and readable storage medium | |
CN110222286A (en) | Information acquisition method, device, terminal and computer readable storage medium | |
CN108460673A (en) | A kind of processing method and processing device of training data | |
CN105930323A (en) | File generating method and apparatus | |
CN112132690B (en) | Method and device for pushing foreign exchange product information, computer equipment and storage medium | |
CN110533406B (en) | Payment calling method, device and system | |
CN112417018A (en) | Data sharing method and device | |
CN110377269A (en) | Business approval system configuration method, apparatus and storage medium | |
CN113554448A (en) | User loss prediction method and device and electronic equipment | |
CN112579082A (en) | Interactive state data establishing method and device, storage medium and electronic equipment | |
CN110458549B (en) | Classification method and device for mobile payment | |
CN113052675B (en) | Data display method and device | |
CN114267115B (en) | Bill identification method and system | |
CN117688351B (en) | Auxiliary screening method, device and equipment based on model processing result | |
CN108287719A (en) | Call the cut-in method and application server of anti-fake system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||