CN110222249A

CN110222249A - Data storage method, device, computer equipment and storage medium

Info

Publication number: CN110222249A
Application number: CN201910370895.0A
Authority: CN
Inventors: 吴壮伟
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-05-06
Filing date: 2019-05-06
Publication date: 2019-09-10
Anticipated expiration: 2039-05-06
Also published as: CN110222249B

Abstract

The invention discloses a data storage method, a device, computer equipment and a storage medium. The method comprises: crawling a corresponding source data set according to a received enterprise name, and preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data in the source data set; deduplicating the processed data to obtain a deduplicated data set; obtaining the enterprise history information lists contained in the deduplicated data set, and adding each list to the time node of a timeline or calendar that corresponds to the enterprise event occurrence time in its text, to obtain development trajectory data corresponding to the enterprise name; and displaying the development trajectory data by means of the timeline or calendar. Using data crawling technology, the method automatically crawls enterprise-related information from the network, extracts its key information, and sorts all key information by time to obtain the development trajectory data of the enterprise.

Description

Data storage method, device, computer equipment and storage medium

Technical field

The present invention relates to the technical field of data acquisition, and in particular to a data storage method, a device, computer equipment and a storage medium.

Background art

At present, information about an enterprise's major events is scattered along the time dimension and is therefore difficult to collect and organize centrally. Enterprises generally organize data about their own development through largely manual text work and lack tools for automated acquisition. Moreover, each enterprise typically compiles only its own historical data, so no enterprise history query platform exists on which the historical events of all enterprises can be viewed in one place.

Summary of the invention

The embodiments of the present invention provide a data storage method, a device, computer equipment and a storage medium, intended to solve the problem in the prior art that enterprises organize data about their own development through largely manual text work, without automated acquisition and organization of the data, which results in low efficiency.

In a first aspect, an embodiment of the present invention provides a data storage method, comprising:

crawling a corresponding source data set according to a received enterprise name, and preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list;

deduplicating the processed data that corresponds one-to-one with the multiple pieces of text data included in the source data set, to obtain a deduplicated data set corresponding to the source data set;

obtaining the enterprise history information lists included in the deduplicated data set, and adding each enterprise history information list to the time node of a timeline or calendar that corresponds to the enterprise event occurrence time contained in its text, to obtain development trajectory data corresponding to the enterprise name, wherein the timeline or calendar is named after the enterprise name corresponding to the enterprise history information lists, and the development trajectory data is data in timeline form or calendar form to which the enterprise history information lists have been added; and

displaying the development trajectory data by means of the timeline or calendar.

In a second aspect, an embodiment of the present invention provides a data storage device, comprising:

a source data crawling unit, configured to crawl a corresponding source data set according to a received enterprise name, and to preprocess the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list;

a deduplication unit, configured to deduplicate the processed data that corresponds one-to-one with the multiple pieces of text data included in the source data set, to obtain a deduplicated data set corresponding to the source data set;

a trajectory data acquisition unit, configured to obtain the enterprise history information lists included in the deduplicated data set, and to add each enterprise history information list to the time node of a timeline or calendar that corresponds to the enterprise event occurrence time contained in its text, to obtain development trajectory data corresponding to the enterprise name, wherein the timeline or calendar is named after the enterprise name corresponding to the enterprise history information lists, and the development trajectory data is data in timeline form or calendar form to which the enterprise history information lists have been added; and

a data display unit, configured to display the development trajectory data by means of the timeline or calendar.

In a third aspect, an embodiment of the present invention further provides computer equipment comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the data storage method of the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the data storage method of the first aspect.

The embodiments of the present invention provide a data storage method, a device, computer equipment and a storage medium. The method comprises: crawling a corresponding source data set according to a received enterprise name, and preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list; deduplicating the processed data to obtain a deduplicated data set corresponding to the source data set; obtaining the enterprise history information lists included in the deduplicated data set, and adding each list to the time node of a timeline or calendar that corresponds to the enterprise event occurrence time contained in its text, to obtain development trajectory data corresponding to the enterprise name; and displaying the development trajectory data by means of the timeline or calendar. Using data crawling technology, the method automatically crawls enterprise-related information from the network, extracts its key information, and sorts all key information by time to obtain the development trajectory data of the enterprise.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an application scenario of the data storage method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of the data storage method provided by an embodiment of the present invention;

Fig. 3 is another schematic flowchart of the data storage method provided by an embodiment of the present invention;

Fig. 4 is a schematic sub-flowchart of the data storage method provided by an embodiment of the present invention;

Fig. 5 is another schematic sub-flowchart of the data storage method provided by an embodiment of the present invention;

Fig. 6 is another schematic sub-flowchart of the data storage method provided by an embodiment of the present invention;

Fig. 7 is a schematic block diagram of the data storage device provided by an embodiment of the present invention;

Fig. 8 is another schematic block diagram of the data storage device provided by an embodiment of the present invention;

Fig. 9 is a schematic block diagram of subunits of the data storage device provided by an embodiment of the present invention;

Fig. 10 is another schematic block diagram of subunits of the data storage device provided by an embodiment of the present invention;

Fig. 11 is another schematic block diagram of subunits of the data storage device provided by an embodiment of the present invention;

Fig. 12 is a schematic block diagram of the computer equipment provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

It should be understood that, as used in this specification and the appended claims, the terms "includes" and "comprises" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.

It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present invention. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

Referring to Fig. 1 and Fig. 2, Fig. 1 is a schematic diagram of an application scenario of the data storage method provided by an embodiment of the present invention, and Fig. 2 is a schematic flowchart of that method. The data storage method is applied in a server and is executed by application software installed on the server.

As shown in Fig. 2, the method comprises steps S110 to S140.

S110, crawling a corresponding source data set according to the received enterprise name, and preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list.

In this embodiment, to obtain the development trajectory of one or more enterprises, a crawler tool can crawl a large volume of web content related to the enterprise in question. For example, when web content related to enterprise A is needed, a list of web pages to be crawled that relate to enterprise A is configured in the crawler tool deployed on the server, and the crawler tool obtains the web content related to enterprise A from the web servers hosting the pages on that list, as the source data set. The source data set then needs to be preprocessed to obtain historical text data related to the development of the enterprise. Specifically, a title keyword list and a body keyword list can be extracted from each piece of text data crawled from the web pages.
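The crawling step can be sketched as follows. This is a minimal illustration, not the crawler tool of the embodiment: the function name `crawl_source_data`, the injected `fetch` callable, the example URLs and the rule of keeping only pages whose text mentions the enterprise name are all assumptions made for the sketch.

```python
def crawl_source_data(enterprise_name, page_urls, fetch):
    """Collect pages from page_urls whose text mentions enterprise_name.

    fetch is a hypothetical injected callable mapping a URL to page text;
    a real deployment would wrap an HTTP client here.
    """
    source_data_set = []
    for url in page_urls:
        text = fetch(url)
        # Assumed relevance rule: keep pages that mention the enterprise name.
        if text and enterprise_name in text:
            source_data_set.append({"url": url, "text": text})
    return source_data_set


# Stand-in for the network: a dict lookup instead of an HTTP request.
fake_pages = {
    "https://example.com/a": "Enterprise A opens a new factory",
    "https://example.com/b": "Weather report",
}
source = crawl_source_data("Enterprise A", list(fake_pages), fake_pages.get)
```

Passing the fetch function in as a parameter keeps the sketch runnable without network access and leaves the choice of HTTP client open.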

In one embodiment, as shown in Fig. 4, step S110 comprises:

S111, performing keyword extraction on the title and the body of each piece of text data included in the source data set, using a term frequency-inverse document frequency model, to obtain a title keyword list and a body keyword list corresponding to each piece of text data;

S112, identifying the publication time and the enterprise event occurrence time of each piece of text data included in the source data set, to obtain text key time information corresponding to each piece of text data;

S113, packaging and combining the title keyword list, the body keyword list and the text key time information corresponding to each piece of text data in the source data set, to obtain an enterprise history information list corresponding to each piece of text data, and thereby the processed data corresponding to each piece of source data.

In this embodiment, performing keyword extraction on the title and body of each piece of text data in the source data set with a TF-IDF model (term frequency-inverse document frequency) allows each piece of text data to be represented compactly by its corresponding keyword lists.
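A minimal sketch of TF-IDF keyword extraction is given below, assuming the texts have already been tokenized. The function name `tfidf_keywords`, the `top_k` parameter, the unsmoothed `log(N / df)` weighting and the alphabetical tie-break are choices made for the illustration; the source does not specify which TF-IDF variant is used.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring tokens for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in docs for token in set(doc))
    results = []
    for doc in docs:
        if not doc:
            results.append([])
            continue
        tf = Counter(doc)
        scores = {
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


docs = [
    ["enterprise", "a", "announces", "merger", "merger"],
    ["enterprise", "a", "opens", "factory"],
    ["enterprise", "a", "reports", "earnings"],
]
keywords = tfidf_keywords(docs, top_k=1)
```

Tokens that occur in every document, such as the enterprise name itself, receive a score of zero and therefore do not crowd out the event-specific keywords.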

Next, the publication time and the enterprise event occurrence time of each piece of text data in the source data set are identified. The publication time is obtained by scanning the text immediately below the title (in a typical online news item, the publication time appears directly below the headline). The enterprise event occurrence time is identified by scanning the body of the text data and checking whether it contains a point-in-time expression of the form "year X1, month X2, day X3, hour X4, minute X5"; if it does, that expression is parsed to obtain the enterprise event occurrence time.
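The scan for an event time in the body text can be sketched with a regular expression. The pattern below assumes the Chinese date notation (年/月/日/时/分) that the year, month, day, hour and minute placeholders suggest; the names `EVENT_TIME` and `find_event_time`, the optional hour and minute parts, and the choice to return only the first match are assumptions of this sketch.

```python
import re
from datetime import datetime

# Matches e.g. "2019年5月6日", optionally followed by "10时" or "10时30分".
EVENT_TIME = re.compile(
    r"(\d{4})年(\d{1,2})月(\d{1,2})日(?:(\d{1,2})时(?:(\d{1,2})分)?)?"
)


def find_event_time(body):
    """Return the first event time found in body, or None if there is none."""
    match = EVENT_TIME.search(body)
    if match is None:
        return None
    # Missing hour or minute groups default to 0.
    year, month, day, hour, minute = (int(g) if g else 0 for g in match.groups())
    try:
        return datetime(year, month, day, hour, minute)
    except ValueError:
        # The digits matched but do not form a valid date, e.g. month 13.
        return None
```

For instance, `find_event_time("公司于2019年5月6日10时30分宣布合并")` yields a `datetime` for 6 May 2019 at 10:30, while a body without any date expression yields `None`.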

Because each piece of text data is reduced to a combination of a title keyword list, a body keyword list and text key time information, each piece of text data in the source data set can be represented compactly by an enterprise history information list; that is, the processed data corresponding to each piece of source data can be represented by its enterprise history information list.
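One possible shape for such an enterprise history information list is sketched below. The class name `HistoryRecord`, its field names and the helper `pack_record` are hypothetical; the source only states that the two keyword lists and the time information are packaged together.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class HistoryRecord:
    """Hypothetical container for one enterprise history information list."""
    title_keywords: Tuple[str, ...]
    body_keywords: Tuple[str, ...]
    published_at: Optional[datetime]  # publication time of the text
    event_time: Optional[datetime]    # enterprise event occurrence time


def pack_record(title_keywords, body_keywords, published_at, event_time):
    """Package the extracted keyword lists and times into one record."""
    return HistoryRecord(
        tuple(title_keywords), tuple(body_keywords), published_at, event_time
    )


record = pack_record(
    ["merger"],
    ["enterprise", "merger"],
    datetime(2019, 5, 7),
    datetime(2019, 5, 6),
)
```

Making the record immutable and hashable is a design choice of the sketch: it lets identical records compare equal, which is convenient for the later duplicate check.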

S120, deduplicating the processed data that corresponds one-to-one with the multiple pieces of text data included in the source data set, to obtain a deduplicated data set corresponding to the source data set.

In this embodiment, duplicates are checked for because the crawler may have crawled news content from several web pages that all describe the same enterprise event, the same news item simply having been republished by several websites. Identical web content yields identical processed data after preprocessing, so among several identical pieces of processed data only one needs to be kept and the rest deleted. This effectively prevents duplicate enterprise event information from being obtained.

In one embodiment, as shown in Fig. 5, step S120 comprises:

S121, obtaining each piece of processed data, and determining, according to the title keyword list and the body keyword list included in each piece of processed data, whether duplicate processed data exists among the processed data that corresponds one-to-one with the multiple pieces of text data included in the source data set;

S122, if duplicate processed data exists among the processed data that corresponds one-to-one with the multiple pieces of text data included in the source data set, arbitrarily selecting one of the identical pieces of processed data to keep and deleting the unselected ones, to obtain a deduplicated data set corresponding to the source data set.

In this embodiment, if duplicates exist among the processed data, the crawler has crawled news content from several web pages describing the same enterprise event. In that case only the news content of one of those pages needs to be kept, and the other, redundant processed data is deleted, giving the deduplicated data set corresponding to the source data set. If no duplicates exist, the data set is retained unchanged as the deduplicated data set. Deduplication effectively reduces the size of the data set.
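The duplicate check can be sketched as follows, treating two pieces of processed data as duplicates when both their title keyword list and body keyword list are equal. The dictionary keys, the function name `deduplicate` and the choice to keep the first occurrence (the source allows any one to be kept) are assumptions of the sketch.

```python
def deduplicate(records):
    """Keep one record per distinct (title keywords, body keywords) pair."""
    seen = set()
    kept = []
    for rec in records:
        key = (tuple(rec["title_keywords"]), tuple(rec["body_keywords"]))
        if key in seen:
            continue  # same event republished by another site
        seen.add(key)
        kept.append(rec)
    return kept


records = [
    {"url": "site1", "title_keywords": ["merger"], "body_keywords": ["a", "merger"]},
    {"url": "site2", "title_keywords": ["merger"], "body_keywords": ["a", "merger"]},
    {"url": "site3", "title_keywords": ["factory"], "body_keywords": ["a", "factory"]},
]
deduplicated = deduplicate(records)
```

When no duplicates are present the function returns every record, which matches the case in which the data set is retained unchanged.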

S130, obtaining the enterprise history information lists included in the deduplicated data set, and adding each enterprise history information list to the time node of a timeline or calendar that corresponds to the enterprise event occurrence time contained in its text, to obtain development trajectory data corresponding to the enterprise name, wherein the timeline or calendar is named after the enterprise name corresponding to the enterprise history information lists, and the development trajectory data is data in timeline form or calendar form to which the enterprise history information lists have been added.

In this embodiment, duplicate enterprise history information lists have been removed from the deduplicated data set, and each enterprise history information list includes an enterprise event occurrence time. The lists can therefore be sorted by enterprise event occurrence time, and each list added to the time node corresponding to its enterprise event occurrence time, to obtain the development trajectory data of the enterprise.

Specifically, a timeline or calendar named after the enterprise name corresponding to the enterprise history information lists is generated first, and each enterprise history information list is then added to the corresponding time node of the timeline or calendar according to its enterprise event occurrence time. Only a highlight mark is shown on the time node; when the time node is clicked, the one or more enterprise history information lists for that node's date pop up.

In one embodiment, as shown in Fig. 6, step S130 comprises:

S131, sorting the enterprise history information lists in descending order according to the enterprise event occurrence time included in each list, to obtain a sorted set of enterprise history information lists;

S132, adding the enterprise history information lists in the sorted set, in descending order of enterprise event occurrence time, one by one to the corresponding time nodes of a preset timeline or calendar, to obtain the development trajectory data corresponding to the enterprise name.

In this embodiment, the enterprise history information lists can be sorted by enterprise event occurrence time, and each list added to the date on the timeline or calendar that corresponds to its enterprise event occurrence time. The result is development trajectory data that uses the timeline or calendar as its carrier and allows a date to be selected to view the enterprise history information lists for that date and their specific text content.
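The sorting and mapping of steps S131 and S132 can be sketched as below. The function name `build_timeline`, the dictionary layout with `name` and `nodes` keys, and the use of one node per calendar date are assumptions made for the illustration.

```python
from datetime import date, datetime


def build_timeline(enterprise_name, records):
    """Sort records by event time (newest first) and group them by date."""
    ordered = sorted(records, key=lambda r: r["event_time"], reverse=True)
    nodes = {}
    for rec in ordered:
        # One time node per calendar date; a node may hold several records.
        nodes.setdefault(rec["event_time"].date(), []).append(rec)
    return {"name": enterprise_name, "nodes": nodes}


records = [
    {"title_keywords": ["founding"], "event_time": datetime(2018, 1, 1)},
    {"title_keywords": ["merger"], "event_time": datetime(2019, 5, 6, 10, 0)},
    {"title_keywords": ["factory"], "event_time": datetime(2019, 5, 6)},
]
timeline = build_timeline("Enterprise A", records)
events_on_day = timeline["nodes"][date(2019, 5, 6)]
```

Looking up a date in `nodes` corresponds to selecting a date on the timeline or calendar and viewing the lists attached to it; dates with an entry in `nodes` are the ones a display layer would highlight.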

S140, displaying the development trajectory data by means of the timeline or calendar.

In this embodiment, once each enterprise history information list has been added to the corresponding date on the timeline or calendar according to its enterprise event occurrence time, the development trajectory data is displayed by means of the timeline or calendar. To save display space, initially only the timeline or calendar named after the enterprise is shown, with the time nodes to which enterprise history information lists have been added highlighted; when the user clicks such a time node, the corresponding enterprise history information lists are displayed.

In one embodiment, as shown in Fig. 3, the method further comprises, after step S140:

S150, if an entered search keyword is detected, comparing the search keyword with the enterprise history information lists, and obtaining the enterprise history information lists whose Pearson similarity to the search keyword exceeds a preset similarity threshold, as the retrieval push result.

In this embodiment, after each piece of source data in the source data set has been preprocessed and deduplicated, enterprise history information lists in one-to-one correspondence with the pieces of text data are obtained, and these lists, mapped in chronological order to the corresponding time nodes, form the development trajectory data of the enterprise. Because each enterprise history information list includes a title keyword list, a body keyword list and text key time information, the entered search keyword can be compared with each keyword in each enterprise history information list, and the lists closest to the search keyword are selected as the retrieval push result.

Specifically, the enterprise history information lists whose Pearson similarity to the search keyword exceeds the preset similarity threshold are obtained as the retrieval push result. To compute the Pearson similarity between each keyword in an enterprise history information list and the search keyword, both are first converted into word vectors by a Word2Vec model (a model for converting words into word vectors), and the Pearson similarity between the word vector of each keyword and the word vector of the search keyword is then computed.

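The Pearson correlation coefficient between any two vectors X and Y can be calculated by the following formula:

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} = \frac{E[XY] - E[X]\,E[Y]}{\sqrt{E[X^2] - E[X]^2}\,\sqrt{E[Y^2] - E[Y]^2}}$$

where E denotes the mathematical expectation, $\mu_X = E[X]$ and $\mu_Y = E[Y]$ are the means of X and Y, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.

The Pearson coefficient in general lies in [-1, 1]; this embodiment considers values of $\rho_{X,Y}$ in the range (0, 1). The closer $\rho_{X,Y}$ is to 1, the more similar the two vectors are; the closer $\rho_{X,Y}$ is to 0, the less similar they are.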
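The similarity computation and threshold filtering can be sketched as follows. The Pearson coefficient is computed here as the covariance of the two vectors divided by the product of their standard deviations. The short vectors stand in for Word2Vec output, which is not reproduced; the names `pearson` and `retrieve`, the `keyword_vectors` field and the threshold value 0.8 are assumptions of the sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (dev_x * dev_y)


def retrieve(query_vector, records, threshold=0.8):
    """Return records with at least one keyword vector above the threshold."""
    hits = []
    for rec in records:
        best = max(
            (pearson(query_vector, vec) for vec in rec["keyword_vectors"]),
            default=-1.0,
        )
        if best > threshold:
            hits.append(rec)
    return hits


# Toy 3-dimensional vectors standing in for Word2Vec embeddings.
records = [
    {"name": "merger", "keyword_vectors": [[1.0, 2.0, 3.0]]},
    {"name": "factory", "keyword_vectors": [[3.0, 2.0, 1.0]]},
]
hits = retrieve([2.0, 4.0, 6.5], records, threshold=0.8)
```

Scoring a record by its best-matching keyword is one possible aggregation; the source does not state how the per-keyword similarities of a list are combined.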

The method automatically crawls enterprise-related information from the network, extracts the key information from it, and sorts all key information by time to obtain the development trajectory data of the enterprise.

An embodiment of the present invention further provides a data storage device for executing any embodiment of the foregoing data storage method. Specifically, referring to Fig. 7, Fig. 7 is a schematic block diagram of the data storage device provided by an embodiment of the present invention. The data storage device 100 can be configured in a server.

As shown in Fig. 7, the data storage device 100 comprises a source data crawling unit 110, a deduplication unit 120, a trajectory data acquisition unit 130 and a data display unit 140.

The source data crawling unit 110 is configured to crawl a corresponding source data set according to the received enterprise name, and to preprocess the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list.

In this embodiment, to obtain the development trajectory of one or more enterprises, a crawler tool can crawl a large volume of web content related to the enterprise in question. For example, when web content related to enterprise A is needed, a list of web pages to be crawled that relate to enterprise A is configured in the crawler tool deployed on the server, and the crawler tool obtains the web content related to enterprise A from the web servers hosting the pages on that list, as the source data set. The source data set then needs to be preprocessed to obtain historical text data related to the development of the enterprise. Specifically, a title keyword list and a body keyword list can be extracted from each piece of text data crawled from the web pages.

In one embodiment, as shown in Fig. 9, the source data crawling unit 110 comprises:

a keyword extraction unit 111, configured to perform keyword extraction on the title and the body of each piece of text data included in the source data set, using a term frequency-inverse document frequency model, to obtain a title keyword list and a body keyword list corresponding to each piece of text data;

a time information recognition unit 112, configured to identify the publication time and the enterprise event occurrence time of each piece of text data included in the source data set, to obtain text key time information corresponding to each piece of text data;

a data combination unit 113, configured to package and combine the title keyword list, the body keyword list and the text key time information corresponding to each piece of text data in the source data set, to obtain an enterprise history information list corresponding to each piece of text data, and thereby the processed data corresponding to each piece of source data.

The deduplication unit 120 is configured to deduplicate the processed data that corresponds one-to-one with the multiple pieces of text data included in the source data set, to obtain a deduplicated data set corresponding to the source data set.

In one embodiment, as shown in Fig. 10, the deduplication unit 120 comprises:

a comparison unit 121, configured to obtain each piece of processed data, and to determine, according to the title keyword list and the body keyword list included in each piece of processed data, whether duplicate processed data exists among the processed data that corresponds one-to-one with the multiple pieces of text data included in the source data set;

a data deletion unit 122, configured to, if duplicate processed data exists among the processed data that corresponds one-to-one with the multiple pieces of text data included in the source data set, arbitrarily select one of the identical pieces of processed data to keep and delete the unselected ones, to obtain a deduplicated data set corresponding to the source data set.

In this embodiment, if duplicates exist among the processed data, the crawler has crawled news content from several web pages describing the same enterprise event. In that case only the news content of one of those pages needs to be kept, and the other, redundant processed data is deleted, giving the deduplicated data set corresponding to the source data set. Deduplication effectively reduces the size of the data set.

The trajectory data acquisition unit 130 is configured to obtain the enterprise history information lists included in the deduplicated data set, and to add each enterprise history information list to the time node of a timeline or calendar that corresponds to the enterprise event occurrence time contained in its text, to obtain development trajectory data corresponding to the enterprise name, wherein the timeline or calendar is named after the enterprise name corresponding to the enterprise history information lists, and the development trajectory data is data in timeline form or calendar form to which the enterprise history information lists have been added.

In one embodiment, as shown in Fig. 11, the trajectory data acquisition unit 130 comprises:

a descending sort unit 131, configured to sort the enterprise history information lists in descending order according to the enterprise event occurrence time included in each list, to obtain a sorted set of enterprise history information lists;

a trajectory data mapping unit 132, configured to add the enterprise history information lists in the sorted set, in descending order of enterprise event occurrence time, one by one to the corresponding time nodes of a preset timeline or calendar, to obtain the development trajectory data corresponding to the enterprise name.

The data display unit 140 is configured to display the development trajectory data by means of the timeline or calendar.

In one embodiment, as shown in Fig. 8, the data storage device 100 further comprises:

a retrieval unit 150, configured to, if an entered search keyword is detected, compare the search keyword with the enterprise history information lists, and obtain the enterprise history information lists whose Pearson similarity to the search keyword exceeds a preset similarity threshold, as the retrieval push result.

The device automatically crawls enterprise-related information from the network, extracts the key information from it, and sorts all key information by time to obtain the development trajectory data of the enterprise.

The above data storage device can be implemented in the form of a computer program that can run on computer equipment as shown in Fig. 12.

Referring to Fig. 12, Fig. 12 is a schematic block diagram of the computer equipment provided by an embodiment of the present invention. The computer equipment 500 is a server, which may be a standalone server or a server cluster composed of multiple servers.

Referring to Fig. 12, the computer equipment 500 comprises a processor 502, a memory and a network interface 505 connected by a system bus 501, wherein the memory may comprise a non-volatile storage medium 503 and an internal memory 504.

The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform the data storage method.

The processor 502 provides computing and control capabilities and supports the operation of the entire computer equipment 500.

The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform the data storage method.

The network interface 505 is used for network communication, such as the transmission of data information. Those skilled in the art will understand that the structure shown in Figure 12 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: crawling a corresponding source data set according to a received enterprise name, and preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list; deduplicating the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data, to obtain a deduplicated data set corresponding to the source data; obtaining the enterprise history information lists included in the deduplicated data set, and adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar, to obtain development track data corresponding to the enterprise name, wherein the title of the timeline or calendar is generated from the enterprise name corresponding to each enterprise information list, and the development track data is data in timeline form or calendar form to which the enterprise history information lists have been added; and displaying the development track data by means of the timeline or calendar.

In one embodiment, after performing the steps of obtaining the enterprise history information lists included in the deduplicated data set, adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar to obtain development track data corresponding to the enterprise name, and displaying the development track data by means of the timeline or calendar, the processor 502 further performs the following operation: if an entered search keyword is detected, comparing the search keyword with the enterprise history information lists, and obtaining the enterprise history information lists whose Pearson similarity with the search keyword exceeds a preset similarity threshold, as the retrieval push result.
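
As an illustration of the retrieval operation above, the following minimal sketch assumes that the search keyword and each enterprise history information list are represented as term-count vectors over a shared vocabulary, and that the Pearson similarity is the Pearson correlation coefficient of those vectors. The vector representation, the function names `pearson` and `search`, the record field `keywords`, and the threshold value 0.5 are assumptions made for illustration; the embodiment does not specify them.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined for a constant vector; treat it as no match.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def search(query_terms, records, vocabulary, threshold=0.5):
    """Return the records whose keyword vector exceeds the similarity threshold.

    `records` is a list of dicts with a hypothetical "keywords" field; results
    are ordered from most to least similar.
    """
    query_counts = Counter(query_terms)
    query_vec = [query_counts[term] for term in vocabulary]
    hits = []
    for record in records:
        record_counts = Counter(record["keywords"])
        record_vec = [record_counts[term] for term in vocabulary]
        score = pearson(query_vec, record_vec)
        if score > threshold:
            hits.append((score, record))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in hits]


# Usage with made-up data
vocabulary = ["ipo", "merger", "funding", "lawsuit"]
records = [
    {"id": 1, "keywords": ["ipo", "funding"]},
    {"id": 2, "keywords": ["lawsuit"]},
]
matches = search(["ipo", "funding"], records, vocabulary)
```

In this sketch only records that share terms with the query in a correlated pattern are pushed as results; a production system would likely build the vectors from the TF-IDF keyword lists already stored in each enterprise history information list.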

In one embodiment, when performing the step of preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, the processor 502 performs the following operations: extracting keyword information from the title and from the body text of each piece of text data among the multiple pieces of text data included in the source data set, each through a term frequency-inverse document frequency (TF-IDF) model, to obtain a title keyword list and a body keyword list corresponding to each piece of text data; identifying the publication time and the enterprise event occurrence time of each piece of text data among the multiple pieces of text data included in the source data set, to obtain text key time information corresponding to each piece of text data; and packaging and combining the title keyword list, body keyword list, and text key time information corresponding to each piece of text data in the source data set, to obtain the enterprise history information list corresponding to each piece of text data, thereby obtaining the processed data corresponding to each piece of source data.
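
The preprocessing operations above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the embodiment: text is tokenized with a simple alphanumeric regular expression (Chinese text would need a word segmenter instead), the TF-IDF weight uses a smoothed inverse document frequency, the enterprise event occurrence time is taken to be the first `YYYY-MM-DD` date found in the body, and the publication time is used as a fallback when no date is found. The function names and the record field names are hypothetical.

```python
import math
import re
from collections import Counter

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (illustrative only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(docs, top_k=3):
    """Return, for each document, its top_k terms ranked by TF-IDF weight."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    keyword_lists = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by descending weight, then alphabetically for a stable result.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keyword_lists.append(ranked[:top_k])
    return keyword_lists


def preprocess(articles, top_k=3):
    """Build one enterprise history information record per article."""
    title_kw = tfidf_keywords([a["title"] for a in articles], top_k)
    body_kw = tfidf_keywords([a["body"] for a in articles], top_k)
    records = []
    for article, t_kw, b_kw in zip(articles, title_kw, body_kw):
        match = DATE_RE.search(article["body"])
        records.append({
            "title_keywords": t_kw,
            "body_keywords": b_kw,
            "published": article["published"],
            # Assumption: fall back to the publication time if no date is found.
            "event_time": match.group(0) if match else article["published"],
        })
    return records


# Usage with made-up articles
articles = [
    {"title": "Acme completes IPO",
     "body": "Acme listed its shares on 2018-03-12 after the IPO.",
     "published": "2018-03-13"},
    {"title": "Acme opens new factory",
     "body": "The new factory started production.",
     "published": "2019-01-05"},
]
records = preprocess(articles)
```

Each resulting record packages the title keyword list, the body keyword list, and the two time fields, which corresponds to the enterprise history information list described above.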

In one embodiment, when performing the step of deduplicating the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data to obtain a deduplicated data set corresponding to the source data, the processor 502 performs the following operations: obtaining each piece of processed data, and determining, according to the title keyword list and body keyword list included in each piece of processed data, whether duplicate processed data exists among the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set; and if duplicate processed data exists, arbitrarily selecting one of the multiple identical pieces of processed data to keep and deleting the pieces not selected, to obtain the deduplicated data set corresponding to the source data.
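
A minimal sketch of the deduplication operations above, assuming that two pieces of processed data count as duplicates when their title keyword lists and body keyword lists contain the same terms regardless of order. That equality criterion, the function name `deduplicate`, and the field names are assumptions for illustration. The embodiment allows any one of the duplicates to be kept; this sketch keeps the first one encountered.

```python
def deduplicate(records):
    """Keep one record per distinct (title keywords, body keywords) pair."""
    seen = set()
    unique = []
    for record in records:
        # frozenset makes the comparison order-insensitive and hashable.
        key = (
            frozenset(record["title_keywords"]),
            frozenset(record["body_keywords"]),
        )
        if key in seen:
            continue  # duplicate of a record that was already kept
        seen.add(key)
        unique.append(record)
    return unique


# Usage with made-up records
processed = [
    {"id": "a", "title_keywords": ["ipo", "acme"], "body_keywords": ["shares"]},
    {"id": "b", "title_keywords": ["acme", "ipo"], "body_keywords": ["shares"]},
    {"id": "c", "title_keywords": ["factory"], "body_keywords": ["production"]},
]
deduplicated = deduplicate(processed)
```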

In one embodiment, when performing the step of adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar to obtain development track data corresponding to the enterprise name, the processor 502 performs the following operations: sorting the enterprise history information lists in descending order according to the enterprise event occurrence time included in each enterprise information list, to obtain a sorted enterprise history information list set; and adding the enterprise history information lists in the sorted set, one by one in descending order of enterprise event occurrence time, to the corresponding time nodes of a preset timeline or calendar, to obtain the development track data corresponding to the enterprise name.
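
The sorting and mapping operations above can be sketched as follows, assuming that each record carries its enterprise event occurrence time as an ISO `YYYY-MM-DD` string in a hypothetical `event_time` field and that a time node is identified by that date. The function name `build_timeline`, the shape of the returned dictionary, and the title format are assumptions for illustration.

```python
from datetime import date


def build_timeline(enterprise_name, records):
    """Sort records by event time (newest first) and group them by time node."""
    ordered = sorted(
        records,
        key=lambda record: date.fromisoformat(record["event_time"]),
        reverse=True,
    )
    nodes = {}
    for record in ordered:
        # dict preserves insertion order, so nodes stay in descending date order.
        nodes.setdefault(record["event_time"], []).append(record)
    return {"title": f"{enterprise_name} development timeline", "nodes": nodes}


# Usage with made-up records
timeline = build_timeline("Acme", [
    {"event_time": "2018-03-12", "title_keywords": ["ipo"]},
    {"event_time": "2019-01-05", "title_keywords": ["factory"]},
    {"event_time": "2018-03-12", "title_keywords": ["listing"]},
])
```

Records that share an event date land on the same time node, and the timeline title is derived from the enterprise name, as the embodiment describes.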

Those skilled in the art will understand that the embodiment of the computer device shown in Figure 12 does not limit the specific composition of the computer device. In other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in Figure 12 and are not repeated here.

It should be understood that, in the embodiments of the present invention, the processor 502 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

Another embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, and when executed by a processor the computer program implements the following steps: crawling a corresponding source data set according to a received enterprise name, and preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list; deduplicating the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data, to obtain a deduplicated data set corresponding to the source data; obtaining the enterprise history information lists included in the deduplicated data set, and adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar, to obtain development track data corresponding to the enterprise name, wherein the title of the timeline or calendar is generated from the enterprise name corresponding to each enterprise information list, and the development track data is data in timeline form or calendar form to which the enterprise history information lists have been added; and displaying the development track data by means of the timeline or calendar.

In one embodiment, after obtaining the enterprise history information lists included in the deduplicated data set, adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar to obtain development track data corresponding to the enterprise name, and displaying the development track data by means of the timeline or calendar, the steps further include: if an entered search keyword is detected, comparing the search keyword with the enterprise history information lists, and obtaining the enterprise history information lists whose Pearson similarity with the search keyword exceeds a preset similarity threshold, as the retrieval push result.

In one embodiment, preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set includes: extracting keyword information from the title and from the body text of each piece of text data among the multiple pieces of text data included in the source data set, each through a term frequency-inverse document frequency (TF-IDF) model, to obtain a title keyword list and a body keyword list corresponding to each piece of text data; identifying the publication time and the enterprise event occurrence time of each piece of text data among the multiple pieces of text data included in the source data set, to obtain text key time information corresponding to each piece of text data; and packaging and combining the title keyword list, body keyword list, and text key time information corresponding to each piece of text data in the source data set, to obtain the enterprise history information list corresponding to each piece of text data, thereby obtaining the processed data corresponding to each piece of source data.

In one embodiment, deduplicating the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data to obtain a deduplicated data set corresponding to the source data includes: obtaining each piece of processed data, and determining, according to the title keyword list and body keyword list included in each piece of processed data, whether duplicate processed data exists among the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set; and if duplicate processed data exists, arbitrarily selecting one of the multiple identical pieces of processed data to keep and deleting the pieces not selected, to obtain the deduplicated data set corresponding to the source data.

In one embodiment, adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar to obtain development track data corresponding to the enterprise name includes: sorting the enterprise history information lists in descending order according to the enterprise event occurrence time included in each enterprise information list, to obtain a sorted enterprise history information list set; and adding the enterprise history information lists in the sorted set, one by one in descending order of enterprise event occurrence time, to the corresponding time nodes of a preset timeline or calendar, to obtain the development track data corresponding to the enterprise name.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the equipment, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed units and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: units with the same function may be grouped into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.

The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and such modifications or replacements shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims

1. A data storage method, characterized by comprising:

crawling a corresponding source data set according to a received enterprise name, and preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list;

deduplicating the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data, to obtain a deduplicated data set corresponding to the source data;

obtaining the enterprise history information lists included in the deduplicated data set, and adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar, to obtain development track data corresponding to the enterprise name, wherein the title of the timeline or calendar is generated from the enterprise name corresponding to each enterprise information list, and the development track data is data in timeline form or calendar form to which the enterprise history information lists have been added; and

displaying the development track data by means of the timeline or calendar.

2. The data storage method according to claim 1, characterized in that, after obtaining the enterprise history information lists included in the deduplicated data set, adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar to obtain development track data corresponding to the enterprise name, and displaying the development track data by means of the timeline or calendar, the method further comprises:

if an entered search keyword is detected, comparing the search keyword with the enterprise history information lists, and obtaining the enterprise history information lists whose Pearson similarity with the search keyword exceeds a preset similarity threshold, as the retrieval push result.

3. The data storage method according to claim 1, characterized in that preprocessing the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set comprises:

extracting keyword information from the title and from the body text of each piece of text data among the multiple pieces of text data included in the source data set, each through a term frequency-inverse document frequency (TF-IDF) model, to obtain a title keyword list and a body keyword list corresponding to each piece of text data;

identifying the publication time and the enterprise event occurrence time of each piece of text data among the multiple pieces of text data included in the source data set, to obtain text key time information corresponding to each piece of text data;

packaging and combining the title keyword list, body keyword list, and text key time information corresponding to each piece of text data in the source data set, to obtain the enterprise history information list corresponding to each piece of text data, thereby obtaining the processed data corresponding to each piece of source data.

4. The data storage method according to claim 1, characterized in that deduplicating the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data to obtain a deduplicated data set corresponding to the source data comprises:

obtaining each piece of processed data, and determining, according to the title keyword list and body keyword list included in each piece of processed data, whether duplicate processed data exists among the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set;

if duplicate processed data exists among the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, arbitrarily selecting one of the multiple identical pieces of processed data to keep and deleting the pieces not selected, to obtain the deduplicated data set corresponding to the source data.

5. The data storage method according to claim 1, characterized in that adding each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar to obtain development track data corresponding to the enterprise name comprises:

sorting the enterprise history information lists in descending order according to the enterprise event occurrence time included in each enterprise information list, to obtain a sorted enterprise history information list set;

adding the enterprise history information lists in the sorted enterprise history information list set, one by one in descending order of enterprise event occurrence time, to the corresponding time nodes of a preset timeline or calendar, to obtain the development track data corresponding to the enterprise name.

6. A data storage device, characterized by comprising:

a source data crawling unit, configured to crawl a corresponding source data set according to a received enterprise name, and preprocess the source data set to obtain processed data in one-to-one correspondence with the multiple pieces of text data included in the source data set, wherein each piece of processed data corresponds to one enterprise history information list;

a deduplication unit, configured to deduplicate the processed data in one-to-one correspondence with the multiple pieces of text data included in the source data, to obtain a deduplicated data set corresponding to the source data; and

a track data acquisition unit, configured to obtain the enterprise history information lists included in the deduplicated data set, and add each enterprise information list, according to the enterprise event occurrence time in the text it includes and the corresponding enterprise history information list, to the corresponding time node of a timeline or calendar, to obtain development track data corresponding to the enterprise name, wherein the title of the timeline or calendar is generated from the enterprise name corresponding to each enterprise information list, and the development track data is data in timeline form or calendar form to which the enterprise history information lists have been added;

7. The data storage device according to claim 6, characterized by further comprising:

a retrieval unit, configured to, if an entered search keyword is detected, compare the search keyword with the enterprise history information lists and obtain the enterprise history information lists whose Pearson similarity with the search keyword exceeds a preset similarity threshold, as the retrieval push result.

8. The data storage device according to claim 6, characterized in that the track data acquisition unit comprises:

a descending sorting unit, configured to sort the enterprise history information lists in descending order according to the enterprise event occurrence time included in each enterprise information list, to obtain a sorted enterprise history information list set;

a track data mapping unit, configured to add the enterprise history information lists in the sorted enterprise history information list set, one by one in descending order of enterprise event occurrence time, to the corresponding time nodes of a preset timeline or calendar, to obtain the development track data corresponding to the enterprise name.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the data storage method according to any one of claims 1 to 5.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the data storage method according to any one of claims 1 to 5.