CN105243099A - Large data real-time storage method based on translation document

Info

Publication number: CN105243099A
Application number: CN201510592464.0A
Authority: CN
Inventors: 王榆升; 张马成; 王兴强
Original assignee: CHENGDU URELITE INFORMATION TECHNOLOGY Co Ltd
Current assignee: CHENGDU URELITE INFORMATION TECHNOLOGY Co Ltd
Priority date: 2015-09-17
Filing date: 2015-09-17
Publication date: 2016-01-13

Abstract

The present invention discloses a big data real-time storage method based on translated documents. The method comprises: acquiring aligned corpus data; establishing a database index for the aligned corpus data; and performing distributed storage of the aligned corpus data according to the database index. The disclosed method offers both fast storage and efficient retrieval.

Description

A big data real-time storage method based on translated documents

Technical field

The present invention relates to the field of storage methods, and in particular to a big data real-time storage method based on translated documents.

Background technology

With the continuous progress of science and technology, international exchange has become more frequent, the world economy more open, and globalization deeper. Translation between documents in different languages is growing as well, especially between English and Chinese. Translated documents touch every aspect of life, in fields such as trade, law, electronics, communications, computing, machinery, chemical engineering, petroleum, medicine, and food.

Translation is a service industry, and a service industry must always be customer-oriented. Now that translation volumes and document word counts keep growing, improving translation speed to meet customer demand is very important. The spread of CAT (computer-aided translation) technology has greatly increased translation speed. To increase it further, translated material is segmented into sentences and made into aligned corpus data, so that existing translations of repeated sentences can be called directly during translation. As translation work continues, the aligned corpus keeps growing, so storing it in a way that is convenient for later calls becomes very important. Existing storage methods for translated documents write the data directly to storage and, when the data is called, search the stored data directly. Writing data directly to storage is slow, and searching the hard disk sequentially until identical content is found is very inefficient.

Summary of the invention

To solve the above technical problem, the present invention provides a big data real-time storage method based on translated documents that offers both fast storage and efficient data calls.

The technical solution adopted by the present invention to solve the above problem is:

A big data real-time storage method based on translated documents, comprising:

A. acquiring aligned corpus data;

B. establishing a database index for the aligned corpus data;

C. performing distributed storage of the aligned corpus data according to the database index.

As translation work continues, the cumulative number of translated words keeps growing, and translated text is converted into aligned corpus data and stored for later calls. The present invention uses an index to store the aligned corpus data entering the system in a distributed manner. Storage is fast, and when aligned corpus data is called, the distributed stores cooperate to serve the request, which improves data call speed and addresses high-concurrency data storage for big data.
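The three steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sample sentence pairs, the use of an MD5 digest of the source sentence as the index key, and the plain dictionaries standing in for storage nodes are not taken from the invention.

```python
import hashlib

def acquire_aligned_corpus():
    """Step A: return aligned (source, target) sentence pairs (hypothetical sample data)."""
    return [
        ("Hello world", "你好，世界"),
        ("Thank you", "谢谢"),
        ("Good morning", "早上好"),
    ]

def make_key(source):
    """Derive a stable index key from a source sentence (assumed keying scheme)."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()

def build_index(pairs):
    """Step B: map each index key to its aligned sentence pair."""
    return {make_key(source): (source, target) for source, target in pairs}

def distribute(index, node_count):
    """Step C: place each indexed pair on a node selected from its index key."""
    nodes = [dict() for _ in range(node_count)]
    for key, pair in index.items():
        nodes[int(key, 16) % node_count][key] = pair
    return nodes

def lookup(nodes, source):
    """Recompute the key and go straight to the one node that can hold the pair."""
    key = make_key(source)
    pair = nodes[int(key, 16) % len(nodes)].get(key)
    return pair[1] if pair else None

pairs = acquire_aligned_corpus()
nodes = distribute(build_index(pairs), node_count=3)
print(lookup(nodes, "Thank you"))
```

Because the node is derived from the index key, a call does not need to scan every node, which is the kind of saving the index is meant to provide.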

Preferably, in step C, when the data is stored, a multi-threaded concurrent processing mechanism stores the aligned corpus data on different storage devices simultaneously. Because the aligned corpus data is written to multiple storage devices at almost the same time, storage speed is directly improved.
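A rough sketch of such concurrent storage is shown below, under stated assumptions: `StorageDevice` is a hypothetical in-memory stand-in for a real storage device, and the round-robin split of records across devices is one possible distribution policy, not one prescribed by the invention.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

class StorageDevice:
    """In-memory stand-in for one storage device (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.records = []
        self._lock = threading.Lock()

    def write(self, batch):
        # The lock keeps the device consistent if several threads write to it.
        with self._lock:
            self.records.extend(batch)
        return len(batch)

def store_concurrently(records, devices):
    """Split records round-robin and write each slice on its own thread."""
    slices = [records[i::len(devices)] for i in range(len(devices))]
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        written = list(pool.map(lambda job: job[0].write(job[1]), zip(devices, slices)))
    return sum(written)

devices = [StorageDevice(f"device-{i}") for i in range(3)]
records = [(f"source sentence {i}", f"target sentence {i}") for i in range(10)]
total = store_concurrently(records, devices)
print(total, [len(d.records) for d in devices])
```

With real devices the writes are I/O-bound, so threads can overlap the waiting time of the individual writes; the in-memory stand-in only demonstrates the control flow.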

Further, the database index uses MongoDB as its data structure. MongoDB is a database based on distributed document storage that provides a scalable, high-performance data storage solution for web applications.

Further, step B is specifically as follows: an open-source distributed database is selected as the storage layer, and the database index works in the following pattern. The client layer first submits data to the service layer. After receiving the data, the service layer writes it to the storage medium and at the same time analyzes the stored data, updating the data cache index in real time with the analyzed data. In this way, even when a new data source appears, the latest data can be retrieved promptly and called in time. Adopting a distributed scheme greatly improves data retrieval efficiency.
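The flow from client layer to service layer to storage and cache index might look like the following sketch. The source does not say what the analysis consists of; splitting the source sentence into lowercase words is an assumption made here, and `ServiceLayer`, `submit`, and `search` are hypothetical names.

```python
class ServiceLayer:
    """Hypothetical service layer: writes to storage, then refreshes a cache index."""
    def __init__(self):
        self.storage = []        # stand-in for the distributed database
        self.cache_index = {}    # analyzed word -> positions in storage

    def submit(self, source, target):
        """Called by the client layer with one aligned sentence pair."""
        position = len(self.storage)
        self.storage.append((source, target))            # write to the storage medium
        for word in set(source.lower().split()):          # analyze the stored data (assumed: word split)
            self.cache_index.setdefault(word, []).append(position)  # update the cache index at once
        return position

    def search(self, word):
        """Retrieve every stored pair whose source sentence contains the word."""
        return [self.storage[p] for p in self.cache_index.get(word.lower(), [])]

service = ServiceLayer()
service.submit("Open the file", "打开文件")
service.submit("Close the file", "关闭文件")
print(service.search("file"))
```

Since the cache index is updated inside `submit`, a pair is searchable as soon as it has been written, which mirrors the claim that newly arrived data can be retrieved promptly.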

Preferably, to further speed up data calls, in step B the indexes in the database index are classified and several index starting points are established, and the index starting points are used to store each class of index separately. With the database index classified, a data call searches only the region corresponding to the relevant class, which further shortens lookup time and improves data call speed.

In summary, the beneficial effects of the present invention are:

The present invention uses an index to store the aligned corpus data entering the system in a distributed manner. Storage is fast, and when aligned corpus data is called, the distributed stores cooperate to serve the request, which improves data call speed and addresses high-concurrency data storage for big data.
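One way to picture classified indexes with starting points is the sketch below. It assumes that each index starting point is simply an entry keyed by category that leads to that category's own sub-index; the class name, the category labels, and the sample entries are hypothetical.

```python
from collections import defaultdict

class ClassifiedIndex:
    """Index split by category; each category has its own entry point (assumed design)."""
    def __init__(self):
        # One "index starting point" per category, each leading to a separate sub-index.
        self.start_points = defaultdict(dict)

    def add(self, category, source, target):
        self.start_points[category][source] = target

    def lookup(self, category, source):
        # Only the sub-index of the requested category is searched.
        return self.start_points.get(category, {}).get(source)

index = ClassifiedIndex()
index.add("law", "force majeure", "不可抗力")
index.add("chemistry", "catalyst", "催化剂")
print(index.lookup("law", "force majeure"))
```

A lookup in the wrong category returns nothing, which shows that the search never leaves the region belonging to the chosen class.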

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment 1

A. acquiring aligned corpus data;

B. establishing a database index for the aligned corpus data;

C. performing distributed storage of the aligned corpus data according to the database index.

For example: the system receives a set of aligned corpus data, which is submitted from the front-end page to the back-end index server. After receiving the data, the index server calls multiple data-write interfaces, enables multi-threaded mode, and writes the data to different storage machines. This approach ensures data integrity. Sequential write performance averages up to 70,000 per second, and random read performance averages up to 16,000 per second. With concurrent reads and writes at a ratio of 1:10, reads can reach 5,000 per second and writes can reach 50,000 per second.

Embodiment 2

On the basis of the above big data real-time storage method based on translated documents, this embodiment adds an optimization: in step C, when the data is stored, a multi-threaded concurrent processing mechanism stores the aligned corpus data on different storage devices simultaneously.

The database index uses MongoDB as its data structure. Compared with a traditional relational database, the data storage structure of MongoDB has no complex relations. For example, an index can be set on any attribute of a MongoDB record (e.g. FirstName="Sameer", Address="8 Gandhi Road") to achieve faster sorting. If more storage space and more processing power are needed, that is, when the load increases, the data can be distributed across other nodes in the computer network, which is known as sharding. This allows MongoDB to adapt more flexibly to data management under big data.
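The idea of indexing an arbitrary attribute of a record can be shown with a small in-memory stand-in. This is not MongoDB and does not use its API; `DocumentStore` and its methods are hypothetical names, and the records other than the FirstName/Address example are invented for illustration.

```python
from collections import defaultdict

class DocumentStore:
    """Tiny in-memory stand-in for a document collection (not MongoDB itself)."""
    def __init__(self):
        self.documents = []
        self.indexes = {}    # field name -> {value -> [document positions]}

    def create_index(self, field):
        """Build an index on any attribute over the documents stored so far."""
        index = defaultdict(list)
        for position, doc in enumerate(self.documents):
            if field in doc:
                index[doc[field]].append(position)
        self.indexes[field] = index

    def insert(self, doc):
        position = len(self.documents)
        self.documents.append(doc)
        for field, index in self.indexes.items():    # keep existing indexes current
            if field in doc:
                index[doc[field]].append(position)

    def find(self, field, value):
        if field in self.indexes:                     # indexed field: direct lookup
            return [self.documents[p] for p in self.indexes[field].get(value, [])]
        return [d for d in self.documents if d.get(field) == value]  # otherwise: full scan

store = DocumentStore()
store.insert({"FirstName": "Sameer", "Address": "8 Gandhi Road"})
store.insert({"FirstName": "Anita", "Address": "12 Park Lane"})
store.create_index("FirstName")
store.insert({"FirstName": "Sameer", "Address": "3 Lake View"})
print(store.find("FirstName", "Sameer"))
```

Records need no fixed schema here, and a query on an indexed attribute avoids scanning every record, which are the two properties the paragraph above attributes to the document model.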

Step B is specifically as follows: an open-source distributed database is selected as the storage layer, and the database index works in the following pattern. The client layer first submits data to the service layer. After receiving the data, the service layer writes it to the storage medium and at the same time analyzes the stored data, updating the data cache index in real time with the analyzed data. In this way, even when a new data source appears, the latest data can be retrieved promptly and called in time. Adopting a distributed scheme greatly improves data retrieval efficiency.

In step B, the indexes in the database index are classified and several index starting points are established, and the index starting points are used to store each class of index separately. The indexes are classified into categories such as architecture, daily life, physics, chemistry, computing, and tourism. When aligned corpus data is called, the index is searched directly by category, which shortens lookup time.

As described above, the present invention can be readily implemented.

Claims

1. A big data real-time storage method based on translated documents, characterized by comprising:

A. acquiring aligned corpus data;

B. establishing a database index for the aligned corpus data;

C. performing distributed storage of the aligned corpus data according to the database index.

2. The big data real-time storage method based on translated documents according to claim 1, characterized in that: in step C, when the aligned corpus data is stored, a multi-threaded concurrent processing mechanism stores the aligned corpus data on different storage devices simultaneously.

3. The big data real-time storage method based on translated documents according to claim 2, characterized in that: the database index uses MongoDB as its data structure.

4. The big data real-time storage method based on translated documents according to claim 1, characterized in that step B is specifically: an open-source distributed database is selected as the storage layer, and the database index works in the following pattern: the client layer first submits data to the service layer; after receiving the data, the service layer writes it to the storage medium and at the same time analyzes the stored data, updating the data cache index in real time with the analyzed data.

5. The big data real-time storage method based on translated documents according to claim 1, characterized in that: in step B, the indexes in the database index are classified and several index starting points are established, and the index starting points are used to store each class of index separately.