CN105243099A - Large data real-time storage method based on translation document - Google Patents

Large data real-time storage method based on translation document

Info

Publication number
CN105243099A
Authority
CN
China
Prior art keywords
data
index
storage method
method based
time storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510592464.0A
Other languages
Chinese (zh)
Inventor
王榆升
张马成
王兴强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU URELITE INFORMATION TECHNOLOGY Co Ltd
Original Assignee
CHENGDU URELITE INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-09-17
Filing date
2015-09-17
Publication date
2016-01-13
Application filed by CHENGDU URELITE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201510592464.0A
Publication of CN105243099A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a real-time big data storage method based on translated documents. The method comprises: acquiring aligned corpus data; building a database index for the aligned corpus data; and storing the aligned corpus data in a distributed manner according to the database index. The method offers both fast storage and efficient retrieval.

Description

A real-time big data storage method based on translated documents
Technical field
The present invention relates to the field of storage methods, and in particular to a real-time big data storage method based on translated documents.
Background
With the continuous progress of science and technology, international exchange has become more frequent, the world economy more open, and globalization deeper. Translation between documents in different languages is also increasing, especially between English and Chinese. Translated documents touch every aspect of life, in fields such as trade, law, electronics, communications, computing, machinery, chemical engineering, petroleum, medicine, and food.
Translation is a service industry, and a service industry must always be customer-oriented. Today, as translation volumes and document word counts keep growing, improving translation speed to meet customer demand is very important. The spread of computer-assisted translation (CAT) technology has greatly increased translation speed. To increase it further, translated material is segmented into sentences and turned into aligned corpus data, so that existing translations of recurring sentences can be called up directly during translation. As translation continues, the aligned corpus keeps growing, so storing it in a way that makes later retrieval easy becomes very important. Existing storage approaches for translated documents write the data directly to storage and, at retrieval time, search the stored data directly. Writing data directly to storage is slow, and searching the data on disk sequentially until identical content is found is very inefficient.
Summary of the invention
To solve the above technical problem, the present invention provides a real-time big data storage method based on translated documents that is both fast to store and efficient to retrieve.
The technical solution adopted by the present invention to solve the above problem is:
A real-time big data storage method based on translated documents, comprising:
A. acquiring aligned corpus data;
B. building a database index for the aligned corpus data;
C. storing the aligned corpus data in a distributed manner according to the database index.
As translation continues, the cumulative word count of translated text keeps growing, and the translated text is converted into aligned corpus data and stored for later retrieval. The present invention uses an index to store the aligned corpus data entering the system in a distributed manner. Storage is fast, and when the aligned corpus data is retrieved, the storage nodes serve requests cooperatively, which raises retrieval speed and addresses high-concurrency storage of big data.
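Steps A to C can be illustrated with a minimal in-memory sketch. The patent text does not say how index keys are derived or how a storage node is chosen, so the SHA-1 key, the modulo placement rule, the node count, and all function names below are assumptions made only for illustration.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 3  # assumed number of storage nodes


def acquire_aligned_corpus():
    """Step A: return aligned (source, target) sentence pairs (sample data)."""
    return [
        ("Hello, world.", "你好，世界。"),
        ("Thank you.", "谢谢。"),
        ("Good morning.", "早上好。"),
    ]


def index_key(source):
    """Hypothetical index key: a stable hash of the source sentence."""
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


def build_index(pairs):
    """Step B: map each index key to the node that will hold the pair."""
    index = {}
    for src, _ in pairs:
        key = index_key(src)
        index[key] = int(key, 16) % NUM_NODES
    return index


def store_distributed(pairs, index):
    """Step C: place each pair on the node chosen by the index."""
    nodes = defaultdict(dict)
    for src, tgt in pairs:
        key = index_key(src)
        nodes[index[key]][key] = (src, tgt)
    return nodes


def lookup(source, index, nodes):
    """Go straight to the owning node instead of scanning every node."""
    key = index_key(source)
    node_id = index.get(key)
    if node_id is None:
        return None
    return nodes[node_id][key][1]


pairs = acquire_aligned_corpus()
index = build_index(pairs)
nodes = store_distributed(pairs, index)
print(lookup("Thank you.", index, nodes))
```

The point of the sketch is that a lookup consults the index first and then touches a single node, in contrast to the sequential search criticized in the background section.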
Preferably, in step C, when data is stored, a multi-threaded concurrent processing mechanism stores the aligned corpus data on different storage devices simultaneously. Because the aligned corpus data is written to several storage devices at almost the same time, storage speed improves directly.
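A minimal sketch of multi-threaded concurrent storage follows. The text does not state whether each record is copied to every device or split among them; this sketch assumes every device receives a copy. The StorageDevice class and store_concurrently function are hypothetical in-memory stand-ins, not part of the patent.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StorageDevice:
    """In-memory stand-in for one storage device (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.records = []
        self._lock = threading.Lock()

    def write(self, record):
        # The lock keeps concurrent appends to the same device safe.
        with self._lock:
            self.records.append(record)
        return self.name


def store_concurrently(record, devices):
    """Write one record to every device at the same time, one thread per device."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [pool.submit(dev.write, record) for dev in devices]
        # result() re-raises any write error, so a failed device is not silently skipped.
        return [f.result() for f in futures]


devices = [StorageDevice(f"device-{i}") for i in range(3)]
written = store_concurrently(("Hello.", "你好。"), devices)
print(written)
```

With real devices the writes are dominated by I/O latency, which is where running them in parallel threads would be expected to save time; the in-memory version only shows the control flow.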
Further, the data structure of the database index uses MongoDB. MongoDB is a database based on distributed document storage that provides a scalable, high-performance data storage solution for web applications.
Further, step B is specifically as follows: an open-source distributed database is selected as the storage layer. The database index works in this way: the client layer first submits data to the service layer; after receiving the data, the service layer calls the write interface to write the data to the storage medium; while storing, it analyzes the stored data and updates the data cache index with the analyzed data in real time. In this way, even when a new data source appears, the latest data can be retrieved promptly and is ready to be called. The distributed scheme greatly improves data retrieval efficiency.
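The client layer, service layer, storage medium, and cache index can be sketched as below. The patent does not define what the analysis step does, so this sketch assumes it is simple lowercase word splitting; the CorpusService class and its methods are hypothetical names, and the sketch is single-threaded for brevity.

```python
from collections import defaultdict


class CorpusService:
    """Hypothetical service layer sitting between clients and storage."""

    def __init__(self):
        self.storage = []                      # stand-in for the storage medium
        self.cache_index = defaultdict(list)   # word -> positions in storage

    def submit(self, source, target):
        """Write the pair, analyze it, and update the cache index in one call."""
        position = len(self.storage)
        self.storage.append((source, target))
        # "Analysis" here is just lowercased word splitting (an assumption).
        for word in set(source.lower().split()):
            self.cache_index[word].append(position)
        return position

    def search(self, word):
        """Answer from the cache index without scanning storage."""
        return [self.storage[pos] for pos in self.cache_index.get(word.lower(), [])]


service = CorpusService()
service.submit("Sign the contract", "签署合同")
service.submit("Terminate the contract", "终止合同")
print(service.search("contract"))
```

Because the index is updated inside submit, a pair becomes searchable as soon as the write returns, which is the real-time property the paragraph describes.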
Preferably, to further increase retrieval speed, in step B the indexes in the database index are classified and several index starting points are established; the starting points are used to store each class of index separately. With the database index classified, a lookup goes directly to the region for the relevant class, which further shortens lookup time and improves retrieval speed.
In summary, the beneficial effects of the present invention are:
The present invention uses an index to store the aligned corpus data entering the system in a distributed manner. Storage is fast, and when the aligned corpus data is retrieved, the storage nodes serve requests cooperatively, which raises retrieval speed and addresses high-concurrency storage of big data.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment 1
A real-time big data storage method based on translated documents, comprising:
A. acquiring aligned corpus data;
B. building a database index for the aligned corpus data;
C. storing the aligned corpus data in a distributed manner according to the database index.
For example, the system receives a set of aligned corpus data, which the front-end page submits to the back-end index server. After receiving the data, the index server calls multiple data-write interfaces and, using multiple threads, writes the data to different storage machines. This approach preserves data integrity. Sequential write performance averages 7w/s (about 70,000 per second) and random read performance averages 1.6w/s (about 16,000 per second). Under concurrent reading and writing at a ratio of 1:10, reads reach 5K/s and writes reach 5W/s (about 50,000 per second).
Embodiment 2
This embodiment refines the real-time big data storage method based on translated documents described above: in step C, when data is stored, a multi-threaded concurrent processing mechanism stores the aligned corpus data on different storage devices simultaneously.
The data structure of the database index uses MongoDB. Compared with a traditional relational database, MongoDB's data storage structure has no complex relations. For example, an index can be set on any attribute of a MongoDB record (e.g. FirstName="Sameer", Address="8 Gandhi Road") to achieve faster sorting. When more storage space and more processing power are needed, that is, when load increases, the data can be distributed across other nodes in the computer network, which is known as sharding. This allows more flexible data management under big data.
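The two ideas in this paragraph, an index on an arbitrary attribute and distribution of records across nodes, can be sketched without a database server. The code below is a standard-library stand-in and not MongoDB's API; the AttributeIndex class, the shard_for function, the CRC32 placement rule, and the extra sample records are all assumptions made for illustration.

```python
import bisect
import zlib


class AttributeIndex:
    """Sorted (value, record position) pairs for one field of schemaless records."""

    def __init__(self, records, field):
        self.field = field
        self.entries = sorted(
            (rec[field], pos) for pos, rec in enumerate(records) if field in rec
        )
        self.keys = [value for value, _ in self.entries]

    def find(self, value):
        """Binary search instead of scanning every record."""
        lo = bisect.bisect_left(self.keys, value)
        hi = bisect.bisect_right(self.keys, value)
        return [pos for _, pos in self.entries[lo:hi]]


def shard_for(key, num_shards):
    """Hash-based shard choice; crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


records = [
    {"FirstName": "Sameer", "Address": "8 Gandhi Road"},
    {"FirstName": "Anita"},  # records need not share the same fields
    {"FirstName": "Sameer", "Address": "12 Lake View"},
]
by_name = AttributeIndex(records, "FirstName")
print(by_name.find("Sameer"))
print([shard_for(r["FirstName"], 2) for r in records])
```

Keeping the index entries sorted is what makes both equality lookups and ordered traversal cheap, which matches the faster sorting that the paragraph attributes to attribute indexes.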
Step B is specifically as follows: an open-source distributed database is selected as the storage layer. The database index works in this way: the client layer first submits data to the service layer; after receiving the data, the service layer calls the write interface to write the data to the storage medium; while storing, it analyzes the stored data and updates the data cache index with the analyzed data in real time. In this way, even when a new data source appears, the latest data can be retrieved promptly and is ready to be called. The distributed scheme greatly improves data retrieval efficiency.
In step B, the indexes in the database index are classified and several index starting points are established; the starting points are used to store each class of index separately. The indexes are classified into categories such as architecture, daily life, physics, chemistry, computing, and tourism. When aligned corpus data is retrieved, the index is searched directly by category, which shortens lookup time.
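One possible reading of index starting points is sketched below: the index is a single sorted list in which each category occupies a contiguous block, and the starting point of a category is the offset where its block begins. The patent does not specify this layout, so the structure, the function names, and the sample entries are assumptions.

```python
import bisect


def build_classified_index(entries):
    """entries: iterable of (category, source, target) tuples.

    Returns the entries sorted by (category, source) plus a dict mapping each
    category to the offset where its block begins (its starting point).
    """
    index = sorted(entries)
    starts = {}
    for offset, (category, _, _) in enumerate(index):
        starts.setdefault(category, offset)
    return index, starts


def lookup(index, starts, category, source):
    """Search only the block that belongs to the given category."""
    if category not in starts:
        return None
    begin = starts[category]
    # The block ends where the next category begins, or at the end of the index.
    later = [s for s in starts.values() if s > begin]
    end = min(later) if later else len(index)
    pos = bisect.bisect_left(index, (category, source), begin, end)
    if pos < end and index[pos][:2] == (category, source):
        return index[pos][2]
    return None


entries = [
    ("chemistry", "acid", "酸"),
    ("computing", "bus", "总线"),
    ("tourism", "bus", "公共汽车"),
    ("computing", "cache", "缓存"),
]
index, starts = build_classified_index(entries)
print(lookup(index, starts, "computing", "bus"))
print(lookup(index, starts, "tourism", "bus"))
```

The sample shows why classification matters for translation memory: the same source term can map to different targets depending on the category, and restricting the search to one block both narrows the search range and selects the right sense.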
As described above, the present invention can be implemented well.

Claims (5)

1. A real-time big data storage method based on translated documents, characterized by comprising:
A. acquiring aligned corpus data;
B. building a database index for the aligned corpus data;
C. storing the aligned corpus data in a distributed manner according to the database index.
2. The real-time big data storage method based on translated documents according to claim 1, characterized in that: in step C, when the aligned corpus data is stored, a multi-threaded concurrent processing mechanism stores the aligned corpus data on different storage devices simultaneously.
3. The real-time big data storage method based on translated documents according to claim 2, characterized in that: the data structure of the database index uses MongoDB.
4. The real-time big data storage method based on translated documents according to claim 1, characterized in that step B is specifically: an open-source distributed database is selected as the storage layer, and the database index works as follows: the client layer first submits data to the service layer; after receiving the data, the service layer calls the write interface to write the data to the storage medium; while storing, it analyzes the stored data and updates the data cache index with the analyzed data in real time.
5. The real-time big data storage method based on translated documents according to claim 1, characterized in that: in step B, the indexes in the database index are classified and several index starting points are established, and the index starting points are used to store each class of index separately.
CN201510592464.0A 2015-09-17 2015-09-17 Large data real-time storage method based on translation document Pending CN105243099A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510592464.0A CN105243099A (en) 2015-09-17 2015-09-17 Large data real-time storage method based on translation document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510592464.0A CN105243099A (en) 2015-09-17 2015-09-17 Large data real-time storage method based on translation document

Publications (1)

Publication Number Publication Date
CN105243099A (en) 2016-01-13

Family

ID=55040748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510592464.0A Pending CN105243099A (en) 2015-09-17 2015-09-17 Large data real-time storage method based on translation document

Country Status (1)

Country Link
CN (1) CN105243099A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106055543A (en) * 2016-05-23 2016-10-26 南京大学 Spark-based training method of large-scale phrase translation model
CN107203637A (en) * 2017-06-08 2017-09-26 恒生电子股份有限公司 A kind of data analysing method and system
CN107590140A (en) * 2017-10-17 2018-01-16 语联网(武汉)信息技术有限公司 Entry process method is translated in a kind of document leakage

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650741A (en) * 2009-08-27 2010-02-17 中国电信股份有限公司 Method and system for updating index of distributed full-text search in real time
CN103530282A (en) * 2013-10-23 2014-01-22 北京紫冬锐意语音科技有限公司 Corpus tagging method and equipment
CN104239377A (en) * 2013-11-12 2014-12-24 新华瑞德(北京)网络科技有限公司 Platform-crossing data retrieval method and device
CN104346347A (en) * 2013-07-25 2015-02-11 深圳市腾讯计算机系统有限公司 Data storage method, device, server and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106055543A (en) * 2016-05-23 2016-10-26 南京大学 Spark-based training method of large-scale phrase translation model
CN106055543B (en) * 2016-05-23 2019-04-09 南京大学 The training method of extensive phrase translation model based on Spark
CN107203637A (en) * 2017-06-08 2017-09-26 恒生电子股份有限公司 A kind of data analysing method and system
CN107203637B (en) * 2017-06-08 2020-04-24 恒生电子股份有限公司 Data analysis method and system
CN107590140A (en) * 2017-10-17 2018-01-16 语联网(武汉)信息技术有限公司 Entry process method is translated in a kind of document leakage
CN107590140B (en) * 2017-10-17 2020-09-25 语联网(武汉)信息技术有限公司 Document missing item processing method

Similar Documents

Publication Publication Date Title
CN107247808B (en) Distributed NewSQL database system and picture data query method
CN105117417B (en) A kind of memory database Trie tree indexing means for reading optimization
JP6639420B2 (en) Method for flash-optimized data layout, apparatus for flash-optimized storage, and computer program
WO2015106711A1 (en) Method and device for constructing nosql database index for semi-structured data
CN103544261B (en) A kind of magnanimity structuring daily record data global index's management method and device
US20160080303A1 (en) Determining topic relevance of an email thread
US20140046928A1 (en) Query plans with parameter markers in place of object identifiers
CN101996067A (en) Data export method and device
CN104750681A (en) Method and device for processing mass data
US9262511B2 (en) System and method for indexing streams containing unstructured text data
CN103914483B (en) File memory method, device and file reading, device
CN103150395B (en) Directory path analysis method of solid state drive (SSD)-based file system
CN107391544B (en) Processing method, device and equipment of column type storage data and computer storage medium
CN106066895A (en) A kind of intelligent inquiry system
CN109460404A (en) A kind of efficient Hbase paging query method based on redis
CN105138649A (en) Data search method and device and terminal
CN105243099A (en) Large data real-time storage method based on translation document
CN105912696A (en) DNS (Domain Name System) index creating method and query method based on logarithm merging
CN103942301A (en) Distributed file system oriented to access and application of multiple data types
CN105068941A (en) Cache page replacing method and cache page replacing device
CN104571946A (en) Memory device supporting quick query of logical circuit and access method of memory device
CN114281989A (en) Data deduplication method and device based on text similarity, storage medium and server
CN111831691A (en) Data reading and writing method and device, electronic equipment and storage medium
CN110781101A (en) One-to-many mapping relation storage method and device, electronic equipment and medium
US20180276290A1 (en) Relevance optimized representative content associated with a data storage system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 610000 B, building 4, building 200, Tianfu five street, Chengdu hi tech Zone, Sichuan,

Applicant after: Chengdu excellent translation information technology Limited by Share Ltd

Address before: 610000, No. 1, building 107, 1 West Bauhinia Road, Chengdu hi tech Zone, Sichuan, 6

Applicant before: Chengdu Urelite Information technology Co., Ltd.

COR Change of bibliographic data
RJ01 Rejection of invention patent application after publication

Application publication date: 20160113

RJ01 Rejection of invention patent application after publication