CN105243099A - Large data real-time storage method based on translation document - Google Patents
Large data real-time storage method based on translation document
- Publication number
- CN105243099A
- Authority
- CN
- China
- Prior art keywords
- data
- index
- storage method
- method based
- time storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention discloses a large data real-time storage method based on translated documents. The method comprises: acquiring aligned corpus data; building a database index for the aligned corpus data; and storing the aligned corpus data in a distributed manner according to the database index. The disclosed method offers both fast storage and efficient retrieval.
Description
Technical field
The present invention relates to the field of storage methods, and in particular to a large data real-time storage method based on translated documents.
Background technology
With the continuous progress of science and technology, international exchange has become more frequent, the world economy more open, and globalization deeper. Translation between documents in different languages is growing accordingly, especially between English and Chinese. Translated documents touch every aspect of life, in fields such as trade, law, electronics, communications, computing, machinery, chemical engineering, petroleum, medicine, and food.
Translation is a service industry, and a service industry must always be customer-oriented. As translation volumes and document word counts keep growing, improving translation speed to meet client demand is very important. The spread of computer-assisted translation (CAT) technology has greatly improved translation speed. To improve it further, translated material is segmented into sentences and turned into aligned corpora, so that existing translations of repeated sentences can be called directly during translation. As translation proceeds, the aligned corpus keeps growing, so storing it in a way that makes later retrieval convenient becomes very important. Existing approaches store translated documents directly in storage and, on retrieval, search the stored data directly. Storing the data directly in this way is slow, and scanning the data on the hard disk sequentially until identical content is found is very inefficient.
Summary of the invention
To solve the above technical problem, the present invention provides a large data real-time storage method based on translated documents that offers both fast storage and efficient retrieval.
The technical solution adopted by the present invention to solve the above problem is:
A large data real-time storage method based on translated documents, comprising:
A. acquiring aligned corpus data;
B. building a database index for the aligned corpus data;
C. storing the aligned corpus data in a distributed manner according to the database index.
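Steps A to C can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the sample sentence pairs, the function names, and the use of a SHA-256 hash to pick a storage node are all hypothetical, and plain dictionaries stand in for the storage nodes.

```python
import hashlib
from collections import defaultdict


def acquire_aligned_corpus():
    # Step A: hypothetical sample of aligned (source, target) sentence pairs.
    return [
        ("Hello, world.", "你好，世界。"),
        ("Thank you.", "谢谢。"),
        ("Good morning.", "早上好。"),
    ]


def build_index(pairs, node_count):
    # Step B: map each source sentence to a storage node via a stable hash
    # (an assumed indexing scheme; the source does not specify one).
    index = {}
    for source, _target in pairs:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        index[source] = int(digest, 16) % node_count
    return index


def store_distributed(pairs, index):
    # Step C: place each pair on the node that the index assigns to it.
    nodes = defaultdict(dict)
    for source, target in pairs:
        nodes[index[source]][source] = target
    return nodes


def lookup(source, index, nodes):
    # Retrieval consults the index first, then reads from a single node.
    node_id = index.get(source)
    if node_id is None:
        return None
    return nodes[node_id].get(source)


pairs = acquire_aligned_corpus()
index = build_index(pairs, node_count=3)
nodes = store_distributed(pairs, index)
```

Because the index records where each sentence lives, a lookup touches one node instead of scanning all stored data, which is the inefficiency the background section attributes to existing approaches.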
As translation continues, the cumulative number of translated words keeps growing, and translated text is converted into aligned corpora and stored for later retrieval. The present invention uses the index to store aligned corpus data entering the system in a distributed manner. Storage is fast, and when aligned corpus data is called, the storage nodes serve the request cooperatively, which improves retrieval speed and solves the problem of high-concurrency data storage for large data.
Preferably, in step C, a multi-threaded concurrent processing mechanism is used to store the aligned corpus data on different storage devices at the same time. Because the aligned corpus data is written to multiple storage devices almost simultaneously, storage speed is improved directly.
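The multi-threaded write path might look like the following sketch. It assumes that records are partitioned across devices round-robin (the source does not say whether data is partitioned or replicated), and `StorageDevice` is a hypothetical in-memory stand-in for a storage machine.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StorageDevice:
    """Hypothetical in-memory stand-in for one storage machine."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.records = []

    def write(self, record):
        # The lock keeps appends safe when several threads hit one device.
        with self._lock:
            self.records.append(record)


def store_concurrently(records, devices):
    # Assign records to devices round-robin and issue the writes from a
    # thread pool so that the devices are written to at the same time.
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [
            pool.submit(devices[i % len(devices)].write, record)
            for i, record in enumerate(records)
        ]
        for future in futures:
            future.result()  # re-raises any exception from a failed write


devices = [StorageDevice(f"dev{i}") for i in range(3)]
records = [(f"source {i}", f"target {i}") for i in range(9)]
store_concurrently(records, devices)
```

With real storage machines the writes are I/O-bound, which is the case where threads overlap waiting time and raise throughput.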
Further, the database index uses MongoDB as its data structure. MongoDB is a database based on distributed document storage that provides a scalable, high-performance data storage solution for web applications.
Further, step B is specifically: an open-source distributed database is selected as the storage layer, and the database index works as follows: the client layer first submits data to the service layer; after receiving the data, the service layer writes it to the storage medium and at the same time analyzes the stored data; the analyzed data is then updated in real time to the data cache index. In this way, even when a new data source appears, the latest data can be retrieved promptly and called in time. Adopting a distributed scheme greatly improves data retrieval performance.
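The client-to-service-layer flow can be illustrated with a small sketch. The class name `CorpusService`, the list used as the storage medium, and the whitespace and case normalization used as the "analysis" step are all assumptions made for illustration; the source does not describe what the analysis consists of.

```python
class CorpusService:
    """Hypothetical service layer: writes data, then updates a cache index."""

    def __init__(self):
        self.storage = []      # stand-in for the storage medium
        self.cache_index = {}  # analyzed key -> position in storage

    @staticmethod
    def _analyze(text):
        # Assumed analysis step: lowercase and collapse whitespace so that
        # trivially different spellings of a sentence share one key.
        return " ".join(text.lower().split())

    def submit(self, source, target):
        # Called by the client layer: write first, then index immediately,
        # so the new pair is retrievable as soon as submit() returns.
        position = len(self.storage)
        self.storage.append((source, target))
        self.cache_index[self._analyze(source)] = position
        return position

    def retrieve(self, source):
        position = self.cache_index.get(self._analyze(source))
        if position is None:
            return None
        return self.storage[position][1]


service = CorpusService()
service.submit("Good  Morning.", "早上好。")
```

Updating the cache index inside `submit` is what makes newly arrived data visible to the next lookup without a separate re-indexing pass.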
Preferably, to further speed up data retrieval, in step B the indexes in the database index are classified, several index starting points are set up, and the index starting points are used to store each class of index separately. With the database index classified, a call searches only the region corresponding to the relevant class, which further shortens lookup time and improves retrieval speed.
In summary, the beneficial effects of the present invention are:
The present invention uses the index to store aligned corpus data entering the system in a distributed manner. Storage is fast, and when aligned corpus data is called, the storage nodes serve the request cooperatively, which improves retrieval speed and solves the problem of high-concurrency data storage for large data.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Embodiment 1
A large data real-time storage method based on translated documents, comprising:
A. acquiring aligned corpus data;
B. building a database index for the aligned corpus data;
C. storing the aligned corpus data in a distributed manner according to the database index.
For example: the system receives a set of aligned corpus data, which the front-end page submits to the back-end index server. After receiving the data, the index server calls multiple data-write interfaces and, using multiple threads, writes the data to different storage machines. This approach ensures data integrity. Sequential write performance averages 7w/s (70,000 per second), and random read performance averages 1.6w/s (16,000 per second). With concurrent reads and writes at a read-to-write ratio of 1:10, reads reach 5K/s (5,000 per second) and writes reach 5W/s (50,000 per second).
Embodiment 2
Building on the above large data real-time storage method based on translated documents, this embodiment is optimized as follows: in step C, a multi-threaded concurrent processing mechanism is used to store the aligned corpus data on different storage devices at the same time.
The database index uses MongoDB as its data structure. Compared with a traditional relational database, MongoDB's data storage structure has no complicated relations. For example, an index can be set on any attribute of a MongoDB record (e.g. FirstName="Sameer", Address="8 Gandhi Road") to achieve faster sorting. If more storage space and greater processing power are needed, that is, as load increases, the data can be distributed across other nodes in the computer network, which is known as sharding. This allows more flexible data management under large data.
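To show why an index on an arbitrary attribute speeds up sorting and lookup, the following sketch keeps documents ordered by one chosen attribute. It is a pure-Python illustration of the idea, not MongoDB's index implementation; the `AttributeIndex` class and the second sample record are hypothetical.

```python
import bisect


class AttributeIndex:
    """Hypothetical sorted index over one attribute of dict records."""

    def __init__(self, attribute):
        self.attribute = attribute
        self._keys = []  # attribute values, kept in sorted order
        self._docs = []  # documents, in the same order as _keys

    def insert(self, document):
        key = document[self.attribute]
        position = bisect.bisect_right(self._keys, key)
        self._keys.insert(position, key)
        self._docs.insert(position, document)

    def find(self, value):
        # Binary search locates the run of matching keys without a full scan.
        low = bisect.bisect_left(self._keys, value)
        high = bisect.bisect_right(self._keys, value)
        return self._docs[low:high]

    def sorted_documents(self):
        # Documents are already ordered by the attribute, so no sort is needed.
        return list(self._docs)


by_first_name = AttributeIndex("FirstName")
by_first_name.insert({"FirstName": "Sameer", "Address": "8 Gandhi Road"})
by_first_name.insert({"FirstName": "Anita", "Address": "hypothetical address"})
```

Because the index is maintained in sorted order at insert time, reading records sorted by `FirstName` requires no additional sorting step at query time.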
Step B is specifically: an open-source distributed database is selected as the storage layer, and the database index works as follows: the client layer first submits data to the service layer; after receiving the data, the service layer writes it to the storage medium and at the same time analyzes the stored data; the analyzed data is then updated in real time to the data cache index. In this way, even when a new data source appears, the latest data can be retrieved promptly and called in time. Adopting a distributed scheme greatly improves data retrieval performance.
In step B, the indexes in the database index are classified, several index starting points are set up, and the index starting points are used to store each class of index separately. The indexes are classified, for example, into architecture, daily life, physics, chemistry, computing, tourism, and so on. When aligned corpus data is called, the index is searched directly by class, which shortens lookup time.
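One possible reading of "index starting points" is a table that records where each class begins inside a class-ordered index, so that a lookup scans only that class's region. The sketch below follows that assumed interpretation; the function names and sample entries are hypothetical.

```python
def build_classified_index(entries):
    # entries are (category, source, target) tuples. Sorting by category
    # groups each class into one contiguous region of the index.
    ordered = sorted(entries, key=lambda entry: entry[0])
    starts = {}
    for position, (category, _source, _target) in enumerate(ordered):
        # Record only the first position seen for each category.
        starts.setdefault(category, position)
    return ordered, starts


def lookup_in_category(ordered, starts, category, source):
    # Jump to the class's starting point and scan only its region.
    position = starts.get(category)
    if position is None:
        return None
    while position < len(ordered) and ordered[position][0] == category:
        if ordered[position][1] == source:
            return ordered[position][2]
        position += 1
    return None


entries = [
    ("physics", "force", "力"),
    ("chemistry", "acid", "酸"),
    ("computing", "index", "索引"),
    ("physics", "energy", "能量"),
]
ordered, starts = build_classified_index(entries)
```

A lookup for a physics term never touches the chemistry or computing entries, which is the shortened search region the description refers to.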
As described above, the present invention can be well implemented.
Claims (5)
1. A large data real-time storage method based on translated documents, characterized by comprising:
A. acquiring aligned corpus data;
B. building a database index for the aligned corpus data;
C. storing the aligned corpus data in a distributed manner according to the database index.
2. The large data real-time storage method based on translated documents according to claim 1, characterized in that: in step C, a multi-threaded concurrent processing mechanism is used to store the aligned corpus data on different storage devices at the same time.
3. The large data real-time storage method based on translated documents according to claim 2, characterized in that: the database index uses MongoDB as its data structure.
4. The large data real-time storage method based on translated documents according to claim 1, characterized in that step B is specifically: an open-source distributed database is selected as the storage layer, and the database index works as follows: the client layer first submits data to the service layer; after receiving the data, the service layer writes it to the storage medium and at the same time analyzes the stored data; the analyzed data is then updated in real time to the data cache index.
5. The large data real-time storage method based on translated documents according to claim 1, characterized in that: in step B, the indexes in the database index are classified, several index starting points are set up, and the index starting points are used to store each class of index separately.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510592464.0A CN105243099A (en) | 2015-09-17 | 2015-09-17 | Large data real-time storage method based on translation document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510592464.0A CN105243099A (en) | 2015-09-17 | 2015-09-17 | Large data real-time storage method based on translation document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105243099A true CN105243099A (en) | 2016-01-13 |
Family
ID=55040748
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510592464.0A Pending CN105243099A (en) | 2015-09-17 | 2015-09-17 | Large data real-time storage method based on translation document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105243099A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106055543A (en) * | 2016-05-23 | 2016-10-26 | 南京大学 | Spark-based training method of large-scale phrase translation model |
CN107203637A (en) * | 2017-06-08 | 2017-09-26 | 恒生电子股份有限公司 | A kind of data analysing method and system |
CN107590140A (en) * | 2017-10-17 | 2018-01-16 | 语联网(武汉)信息技术有限公司 | Entry process method is translated in a kind of document leakage |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101650741A (en) * | 2009-08-27 | 2010-02-17 | 中国电信股份有限公司 | Method and system for updating index of distributed full-text search in real time |
CN103530282A (en) * | 2013-10-23 | 2014-01-22 | 北京紫冬锐意语音科技有限公司 | Corpus tagging method and equipment |
CN104239377A (en) * | 2013-11-12 | 2014-12-24 | 新华瑞德(北京)网络科技有限公司 | Platform-crossing data retrieval method and device |
CN104346347A (en) * | 2013-07-25 | 2015-02-11 | 深圳市腾讯计算机系统有限公司 | Data storage method, device, server and system |
-
2015
- 2015-09-17 CN CN201510592464.0A patent/CN105243099A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106055543A (en) * | 2016-05-23 | 2016-10-26 | 南京大学 | Spark-based training method of large-scale phrase translation model |
CN106055543B (en) * | 2016-05-23 | 2019-04-09 | 南京大学 | The training method of extensive phrase translation model based on Spark |
CN107203637A (en) * | 2017-06-08 | 2017-09-26 | 恒生电子股份有限公司 | A kind of data analysing method and system |
CN107203637B (en) * | 2017-06-08 | 2020-04-24 | 恒生电子股份有限公司 | Data analysis method and system |
CN107590140A (en) * | 2017-10-17 | 2018-01-16 | 语联网(武汉)信息技术有限公司 | Entry process method is translated in a kind of document leakage |
CN107590140B (en) * | 2017-10-17 | 2020-09-25 | 语联网(武汉)信息技术有限公司 | Document missing item processing method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107247808B (en) | Distributed NewSQL database system and picture data query method | |
CN105117417B (en) | A kind of memory database Trie tree indexing means for reading optimization | |
JP6639420B2 (en) | Method for flash-optimized data layout, apparatus for flash-optimized storage, and computer program | |
WO2015106711A1 (en) | Method and device for constructing nosql database index for semi-structured data | |
CN103544261B (en) | A kind of magnanimity structuring daily record data global index's management method and device | |
US20160080303A1 (en) | Determining topic relevance of an email thread | |
US20140046928A1 (en) | Query plans with parameter markers in place of object identifiers | |
CN101996067A (en) | Data export method and device | |
CN104750681A (en) | Method and device for processing mass data | |
US9262511B2 (en) | System and method for indexing streams containing unstructured text data | |
CN103914483B (en) | File memory method, device and file reading, device | |
CN103150395B (en) | Directory path analysis method of solid state drive (SSD)-based file system | |
CN107391544B (en) | Processing method, device and equipment of column type storage data and computer storage medium | |
CN106066895A (en) | A kind of intelligent inquiry system | |
CN109460404A (en) | A kind of efficient Hbase paging query method based on redis | |
CN105138649A (en) | Data search method and device and terminal | |
CN105243099A (en) | Large data real-time storage method based on translation document | |
CN105912696A (en) | DNS (Domain Name System) index creating method and query method based on logarithm merging | |
CN103942301A (en) | Distributed file system oriented to access and application of multiple data types | |
CN105068941A (en) | Cache page replacing method and cache page replacing device | |
CN104571946A (en) | Memory device supporting quick query of logical circuit and access method of memory device | |
CN114281989A (en) | Data deduplication method and device based on text similarity, storage medium and server | |
CN111831691A (en) | Data reading and writing method and device, electronic equipment and storage medium | |
CN110781101A (en) | One-to-many mapping relation storage method and device, electronic equipment and medium | |
US20180276290A1 (en) | Relevance optimized representative content associated with a data storage system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 610000 B, building 4, building 200, Tianfu five street, Chengdu hi tech Zone, Sichuan; Applicant after: Chengdu excellent translation information technology Limited by Share Ltd; Address before: 610000, No. 1, building 107, 1 West Bauhinia Road, Chengdu hi tech Zone, Sichuan, 6; Applicant before: Chengdu Urelite Information technology Co., Ltd. |
|
COR | Change of bibliographic data | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160113 |
|
RJ01 | Rejection of invention patent application after publication |