CN105373605A - Batch storage method and system for data files - Google Patents
- Publication number
- CN105373605A (application number CN201510767586.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
Abstract
The invention provides a batch storage method and system for data files. The method comprises the following steps: collecting multiple pieces of text data and the multimedia data matched with each piece of text data; performing duplicate checking on all text data to obtain multiple groups of duplicated text data and the correspondingly matched multimedia data; retaining one piece of text data from each group of duplicated text data, and identifying and storing it together with the originally unique text data; and assigning the multimedia data matched with each piece of text data the same identification code and storing it by category. The method and system store large numbers of scientific and technological achievement texts, pictures, and audio and video data: each piece of text data is identified after duplicates have been checked and deleted, and the picture, audio or video data matched with that text data are given the same identification code and stored in their respective databases, which completes effective storage of the data files.
Description
Technical field
The present invention relates to the field of computer applications, and in particular to a batch storage method and system for data files, used to manage agricultural scientific and technological achievements.
Background art
In recent years, the state has paid close attention every year to the rural economy, rural development and the rural population, and investment in agriculture has grown steadily. Research institutes and universities continue to explore agricultural research and development, and each year they produce a large volume of scientific and technological achievement data. These data take many forms and formats, including text, pictures, audio and video. How to store and manage such a large amount of unorganized achievement data effectively, and how to screen and import it, has become a key factor limiting how quickly achievement data can be stored. A more effective way of importing the data is therefore needed.
Summary of the invention
The invention provides a batch storage method and system for data files, to solve the prior-art problem of importing large batches of unorganized, redundant data.
In one aspect, the invention provides a batch storage method for data files, the method comprising the following steps:
Collecting multiple pieces of text data and the multimedia data matched with each piece of text data;
Performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
Retaining one piece of text data from each group of mutually duplicated text data, and identifying and storing it together with the originally unique text data;
Assigning the multimedia data matched with each piece of text data the same identification code as that text data, and storing it by category.
Further, performing duplicate checking on all text data comprises: checking, according to multiple entry conditions of the text data, the text content under each entry condition for duplicates.
Further, the method comprises a step of processing the groups of mutually duplicated text data:
Retaining, in each group, the text data that was entered first and the multimedia data matched with it;
Deleting the remaining text data and the multimedia data matched with the remaining text data.
Further, the multimedia data comprises picture data, audio data and/or video data.
Further, the entry conditions are the scientific and technological achievement title, author, institution, research start and end dates, and body text.
Further, the conditions under which the duplicate check judges that mutually duplicated text data exists comprise:
The ratio of the number of consecutively repeated characters to the number of characters in the shorter title is greater than a preset ratio;
And/or,
The authors are all identical, or at least one author is identical;
And/or,
The institutions are all identical, or at least one institution is identical;
And/or,
The research start and end dates are identical or overlap;
And/or,
For each paragraph of the body text, the ratio of the number of consecutively repeated characters to the total number of characters in the paragraph is greater than a preset ratio.
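As an illustration only, the title condition above can be sketched in Python. The sketch assumes that the "consecutively repeated" length is the longest run of characters shared by the two titles, and that the preset ratio is 0.5. The function names `longest_common_run` and `title_is_duplicate` and the default ratio are hypothetical and are not taken from the patent.

```python
def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous character of a
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended one character earlier in both strings.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def title_is_duplicate(title_a: str, title_b: str, preset_ratio: float = 0.5) -> bool:
    """Assumed reading of the title condition: shared run / shorter title > ratio."""
    shorter = min(len(title_a), len(title_b))
    if shorter == 0:
        return False
    return longest_common_run(title_a, title_b) / shorter > preset_ratio
```

The comparison here is case-sensitive and counts characters, which suits Chinese titles; a real system might normalize case and whitespace first.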
Further, the method also comprises a data search step, comprising:
Searching the text data in the text database using a multi-field retrieval method, and judging whether text data matching the search exists;
If it exists, determining the identification code of the matching text data, using this identification code to find the corresponding multimedia data in the multimedia database, and then displaying the search results;
Otherwise, displaying no search results.
In another aspect, the invention provides a batch storage system for data files, comprising:
An acquisition module, for collecting multiple pieces of text data and the multimedia data matched with each piece of text data;
A text data duplicate-check judgment module, for performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
A code storage module, for retaining one piece of text data from each group of mutually duplicated text data and identifying and storing it together with the originally unique text data, and at the same time assigning the multimedia data matched with each piece of text data the same identification code and storing it by category.
Further, the system also comprises an entry condition storage module, for editing or storing the entry conditions of the text data.
Further, the system also comprises a duplicated text data processing module, for retaining the first-entered text data in each group of mutually duplicated text data and deleting the remaining text data.
As can be seen from the above technical solution, the invention stores a large amount of text data and multimedia data. After duplicate checking and deletion are performed on the text data, each piece of text data is identified, the multimedia data matched with the text data is given the same identification code, and the data are then stored in separate category databases, which completes effective storage of the data files. In addition, during a search, the text data is searched first to determine the matching text data, and the multimedia data is then obtained by looking up the identification code, which completes effective search and display of the data files.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the storage method according to Embodiment 1 of the invention;
Fig. 2 is a flowchart of one specific implementation of the storage method of Embodiment 1 of the invention;
Fig. 3 is a structural block diagram of the storage system according to Embodiment 2 of the invention.
Detailed description of the embodiments
Specific embodiments of the invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the invention and are not intended to limit its scope.
Fig. 1 shows a batch storage method for data files provided by Embodiment 1 of the invention. The method comprises the following steps:
1. Collect multiple pieces of text data and the multimedia data matched with each piece of text data;
2. Perform duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data, where the multimedia data can comprise picture data, audio data and video data;
3. Retain one piece of text data from each group of mutually duplicated text data, and identify and store it together with the originally unique text data;
4. Assign the multimedia data matched with each piece of text data the same identification code as that text data, and store it by category.
Fig. 2 shows a specific implementation of the above storage method:
S1. Collect multiple pieces of text data and the picture data, audio data and/or video data matched with each piece of text data;
S2. According to the multiple entry conditions of the text data, check the text content of all text data under each entry condition for duplicates, and judge whether duplicated text data exists;
S21. If multiple groups of mutually duplicated text data exist, retain in each group the first-entered text data and the picture, audio and/or video data matched with it, and delete the remaining text data and the picture, audio and/or video data matched with the remaining text data;
S22. Otherwise, retain all text data, picture data, audio data and video data;
S3. Identify the one piece of text data retained from each group of mutually duplicated text data, together with the originally unique text data, and store them in the text database, where originally unique text data means text data that has no duplicate;
S4. Assign the picture data, audio data and/or video data matched with each identified piece of text data the same identification code, and store them in the picture database, audio database and video database respectively.
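The identification and storage steps S3 and S4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: separate tables in one in-memory SQLite database stand in for the text, picture, audio and video databases, a random UUID stands in for the identification code, and the names `create_store` and `store_record` are hypothetical.

```python
import sqlite3
import uuid

MEDIA_KINDS = ("picture", "audio", "video")


def create_store() -> sqlite3.Connection:
    """Create one table per category (stand-ins for the separate databases)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE text_data (id_code TEXT PRIMARY KEY, title TEXT)")
    for kind in MEDIA_KINDS:
        conn.execute(f"CREATE TABLE {kind}_data (id_code TEXT, file_name TEXT)")
    return conn


def store_record(conn: sqlite3.Connection, title: str,
                 media: list[tuple[str, str]]) -> str:
    """Store one text record and its media under a shared identification code.

    `media` is a list of (kind, file_name) pairs, kind being one of MEDIA_KINDS.
    """
    for kind, _ in media:
        if kind not in MEDIA_KINDS:
            raise ValueError(f"unknown media kind: {kind}")
    id_code = uuid.uuid4().hex  # assumed identification scheme
    conn.execute("INSERT INTO text_data VALUES (?, ?)", (id_code, title))
    for kind, file_name in media:
        # The table name comes from the fixed MEDIA_KINDS tuple, not user input.
        conn.execute(f"INSERT INTO {kind}_data VALUES (?, ?)", (id_code, file_name))
    return id_code
```

Because every media row carries the code of its text record, the media for a record can later be fetched from each category table with a single lookup on `id_code`.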
To explain the above method further: when collecting data files, the storage method of this embodiment can enter all agricultural scientific and technological achievement text data according to an Excel template preset by the system. During entry, each piece of text data has a unique number. After entry, the template containing all the text data is submitted to the system, which performs duplicate checking on all of it. During duplicate checking, the text content of all text data under each entry condition is checked according to the multiple entry conditions of the text data. An entry condition is a criterion governing what content is entered during the entry process. Checking the text content under each entry condition examines the achievement text data from every angle, so that no text data that may be duplicated is missed, which improves the accuracy of the duplicate check. In this embodiment, the entry conditions can be the scientific and technological achievement title, author, institution, research start and end dates, and body text. Duplicate checking compares the corresponding content in the order title, author, institution, research start and end dates, body text keywords, and during the comparison it judges whether duplicated text data exists according to preset judgment criteria. The preset judgment criteria can be: the ratio of the number of consecutively repeated characters to the number of characters in the shorter title is greater than a preset ratio; the authors are all identical, or at least one author is identical; the institutions are all identical, or at least one institution is identical; the research start and end dates are identical or overlap; for each paragraph of the body text, the ratio of the number of consecutively repeated characters to the total number of characters in the paragraph is greater than a preset ratio. Text data can be judged to be mutually duplicated when one or more of the above criteria are met. Checking from the smallest judgment point in this way means that no potentially duplicated text data is missed, which improves the accuracy of the duplicate check.
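The author, institution and research-period criteria can be sketched as below. This is an illustrative reading, not the patent's own code: it assumes authors and institutions are held as sets of names and that research periods are closed date ranges. The function names `shares_member` and `periods_overlap` are hypothetical.

```python
from datetime import date


def shares_member(a: set[str], b: set[str]) -> bool:
    """True if the two name sets are identical or share at least one name."""
    return bool(a & b)


def periods_overlap(start_a: date, end_a: date, start_b: date, end_b: date) -> bool:
    """True if two closed date ranges are identical or overlap."""
    return start_a <= end_b and start_b <= end_a
```

Identical sets and identical periods are covered by the same tests, since a non-empty set always intersects itself and a range always overlaps itself.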
If multiple groups of mutually duplicated text data are judged to exist, then in each group (which contains at least two pieces of text data) the first-entered text data and the picture, audio and/or video data matched with it are retained, and the remaining text data and their matched picture, audio and/or video data are deleted. In this way only one piece of text data remains for identical text data, which avoids data redundancy.
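One way to realize this retain-the-earliest rule is sketched below. It assumes that a lower entry number means earlier entry and that the duplicate test is supplied as a function; the `Record` class and `keep_earliest` are hypothetical names. The greedy pass compares each record only against records already kept, which is a simplification when the duplicate relation is not transitive.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Record:
    number: int                                     # entry number; lower = entered earlier
    title: str
    media: list[str] = field(default_factory=list)  # names of matched media files


def keep_earliest(records: list[Record],
                  is_duplicate: Callable[[Record, Record], bool]) -> list[Record]:
    """Keep the first-entered record of each duplicate group, with its media.

    Later duplicates are dropped together with their media lists.
    """
    kept: list[Record] = []
    for rec in sorted(records, key=lambda r: r.number):
        if any(is_duplicate(rec, earlier) for earlier in kept):
            continue  # a duplicate of an earlier entry: discard it and its media
        kept.append(rec)
    return kept
```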
The text data retained from the duplicate groups and the text data that has no duplicate are identified by a preset identification method, so that each piece of text data is unique. Text data may be accompanied by picture, audio or video data. Therefore, to ensure the integrity and correspondence of each scientific and technological achievement as a whole, the identification code of the corresponding picture, audio or video data is kept consistent with that of the text data. The identified data can then each be stored in the corresponding database.
The following table gives a specific example:
No. | Scientific and technological achievement title | Author | Institution | Research start and end dates | Body text |
---|---|---|---|---|---|
1 | Wine-growing technology | C | P university | 2014.05.06-2015.03.08 | Omitted from this table |
2 | Peach-apricot grafting technology | E, F | L company | 2013.01.15-2015.08.16 | Omitted from this table |
3 | High-yield wine-growing method | C, B | P university, G research institute | 2014.09.13-2015.01.20 | Omitted from this table |
4 | Corn variety | A | Q research institute | 2013.12.01-2014.12.10 | Omitted from this table |
5 | Corn variety | A | Q research institute | 2013.12.01-2014.12.10 | Omitted from this table |
In the table, duplicate checking shows that the text data numbered 1 and 3 meet the above preset judgment criteria on title, author, institution and research start and end dates, and that the body text also meets the preset judgment criteria. The text data numbered 1 and 3 are therefore two mutually duplicated pieces of text data; the text data numbered 1 is retained and the text data numbered 3 is deleted.
In the table, duplicate checking shows that the text data numbered 4 and 5 meet the above preset judgment criteria on title, author, institution and research start and end dates, and that the body text also meets the preset judgment criteria. The text data numbered 4 and 5 are therefore two mutually duplicated pieces of text data; the text data numbered 4 is retained and the text data numbered 5 is deleted.
In the table, duplicate checking shows that the text data numbered 2 has no duplicate, so it continues to be retained.
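Before rows like those in the table can be compared, their date ranges and name lists have to be parsed. The helpers below are a hypothetical sketch that assumes the `YYYY.MM.DD-YYYY.MM.DD` period format shown in the table and names separated by either the Chinese enumeration comma or an ordinary comma; neither helper comes from the patent.

```python
from datetime import date, datetime


def parse_period(text: str) -> tuple[date, date]:
    """Parse a period such as '2014.05.06-2015.03.08' into (start, end) dates."""
    start_text, end_text = text.split("-")
    fmt = "%Y.%m.%d"
    start = datetime.strptime(start_text.strip(), fmt).date()
    end = datetime.strptime(end_text.strip(), fmt).date()
    return start, end


def parse_names(text: str) -> set[str]:
    """Split an author or institution cell into a set of trimmed names."""
    normalized = text.replace("、", ",")
    return {name.strip() for name in normalized.split(",") if name.strip()}
```

For rows 1 and 3, for instance, the parsed author sets both contain C, and the parsed period of row 3 lies inside that of row 1.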
The invention also comprises a step of searching the text data, comprising:
Searching the text data in the text database using a multi-field retrieval method, and judging whether text data matching the search exists;
If it exists, determining the identification code of the matching text data, using this identification code to find the corresponding picture data, audio data and/or video data in the picture database, audio database and/or video database, and then displaying the search results;
Otherwise, displaying no search results.
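The search step can be illustrated with plain dictionaries standing in for the databases. This is a sketch under assumptions: the text database maps an identification code to a record of fields, each media database maps an identification code to a list of file names, and a "multi-field" search means every supplied field must contain its search term. The function name `search` and the data layout are hypothetical.

```python
def search(text_db: dict[str, dict[str, str]],
           media_dbs: dict[str, dict[str, list[str]]],
           **field_terms: str) -> list[dict]:
    """Return matching text records together with the media sharing their code.

    An empty list means nothing matched, so no results are displayed.
    """
    if not field_terms:
        return []
    results = []
    for id_code, record in text_db.items():
        # Every requested field must contain its term (case-insensitive).
        matches = all(term.lower() in record.get(name, "").lower()
                      for name, term in field_terms.items())
        if matches:
            media = {kind: db.get(id_code, []) for kind, db in media_dbs.items()}
            results.append({"id_code": id_code, "record": record, "media": media})
    return results
```

A call such as `search(text_db, media_dbs, title="corn", author="A")` first narrows the text records and only then looks up media by identification code, mirroring the two-stage lookup described above.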
According to the above storage method, the invention also provides a batch storage system for data files. As shown in Fig. 3, the system comprises:
An acquisition module, for collecting multiple pieces of text data and the picture data, audio data and/or video data matched with each piece of text data;
A text data duplicate-check judgment module, for performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
A code storage module, for retaining one piece of text data from each group of mutually duplicated text data and identifying and storing it together with the originally unique text data, and at the same time assigning the multimedia data matched with each piece of text data the same identification code and storing it by category.
Further, the system also comprises an entry condition storage module, for storing or editing the entry conditions of the text data and for providing the entry conditions to the acquisition module during entry.
Further, the system also comprises a duplicated text data processing module, for retaining the first-entered text data in each group of mutually duplicated text data and deleting the remaining text data.
As can be seen from the above technical solution, the invention stores a large amount of text data and multimedia data. After duplicate checking and deletion are performed on the text data, each piece of text data is identified, the multimedia data matched with the text data is given the same identification code, and the data are then stored in separate category databases, which completes effective storage of the data files. In addition, during a search, the text data is searched first to determine the matching text data, and the multimedia data is then obtained by looking up the identification code, which completes effective search and display of the data files.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words can be interpreted as names.
Those of ordinary skill in the art will understand that the above embodiments are intended only to illustrate the technical solutions of the invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the claims of the invention.
Claims (10)
1. A batch storage method for data files, characterized in that the method comprises the following steps:
Collecting multiple pieces of text data and the multimedia data matched with each piece of text data;
Performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
Retaining one piece of text data from each group of mutually duplicated text data, and identifying and storing it together with the originally unique text data;
Assigning the multimedia data matched with each piece of text data the same identification code as that text data, and storing it by category.
2. The storage method according to claim 1, characterized in that performing duplicate checking on all text data comprises: checking, according to multiple entry conditions of the text data, the text content under each entry condition for duplicates.
3. The storage method according to claim 2, characterized in that the entry conditions are the scientific and technological achievement title, author, institution, research start and end dates, and body text.
4. The storage method according to claim 1, characterized in that it comprises a step of processing the groups of mutually duplicated text data:
Retaining, in each group, the text data that was entered first and the multimedia data matched with it;
Deleting the remaining text data and the multimedia data matched with the remaining text data.
5. The storage method according to claim 4, characterized in that the conditions under which the duplicate check judges that mutually duplicated text data exists comprise:
The ratio of the number of consecutively repeated characters to the number of characters in the shorter title is greater than a preset ratio;
And/or,
The authors are all identical, or at least one author is identical;
And/or,
The institutions are all identical, or at least one institution is identical;
And/or,
The research start and end dates are identical or overlap;
And/or,
For each paragraph of the body text, the ratio of the number of consecutively repeated characters to the total number of characters in the paragraph is greater than a preset ratio.
6. The storage method according to claim 1, characterized in that the multimedia data comprises picture data, audio data and/or video data.
7. The storage method according to claim 1, characterized in that it also comprises a data search step, comprising:
Searching the text data in the text database using a multi-field retrieval method, and judging whether text data matching the search exists;
If it exists, determining the identification code of the matching text data, using this identification code to find the corresponding multimedia data in the multimedia database, and then displaying the search results;
Otherwise, displaying no search results.
8. A batch storage system for data files, characterized in that it comprises:
An acquisition module, for collecting multiple pieces of text data and the multimedia data matched with each piece of text data;
A text data duplicate-check judgment module, for performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
A code storage module, for retaining one piece of text data from each group of mutually duplicated text data and identifying and storing it together with the originally unique text data, and at the same time assigning the multimedia data matched with each piece of text data the same identification code and storing it by category.
9. The storage system according to claim 8, characterized in that it also comprises an entry condition storage module, for editing or storing the entry conditions of the text data.
10. The storage system according to claim 8, characterized in that it also comprises a duplicated text data processing module, for retaining the first-entered text data in each group of mutually duplicated text data and deleting the remaining text data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510767586.9A CN105373605A (en) | 2015-11-11 | 2015-11-11 | Batch storage method and system for data files |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105373605A (en) | 2016-03-02 |
Family
ID=55375804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510767586.9A Pending CN105373605A (en) | 2015-11-11 | 2015-11-11 | Batch storage method and system for data files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105373605A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1747390A (en) * | 2005-10-14 | 2006-03-15 | 北京金山软件有限公司 | Method and system for processing real-time multi-media information in instant telecommunication |
CN102280104A (en) * | 2010-06-11 | 2011-12-14 | 北大方正集团有限公司 | File phoneticization processing method and system based on intelligent indexing |
CN102937959A (en) * | 2011-06-03 | 2013-02-20 | 苹果公司 | Automatically creating a mapping between text data and audio data |
CN103678702A (en) * | 2013-12-30 | 2014-03-26 | 优视科技有限公司 | Video duplicate removal method and device |
CN103970722A (en) * | 2014-05-07 | 2014-08-06 | 江苏金智教育信息技术有限公司 | Text content duplicate removal method |
Non-Patent Citations (2)
Title |
---|
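曾理 (Zeng Li): "Research and Implementation of a Hadoop-Based Duplicate Data Cleaning Model", China Master's Theses Full-text Database, Information Science and Technology series * |
潘鑫 (Pan Xin): "Design and Implementation of a Document Copy Detection System Based on Similarity Estimation", Wanfang China Dissertations Full-text Database * |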
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106469195A (en) * | 2016-08-31 | 2017-03-01 | 国信优易数据有限公司 | Based on conforming data file Valuation Method and system |
CN106446077A (en) * | 2016-09-07 | 2017-02-22 | 乐视控股(北京)有限公司 | Object uploading method and electronic device |
CN106649641A (en) * | 2016-12-08 | 2017-05-10 | 北京五八信息技术有限公司 | Method and device for processing database object set schema information and management system |
CN106649641B (en) * | 2016-12-08 | 2020-05-26 | 北京五八信息技术有限公司 | Method, device and management system for processing schema information of database object set |
CN107764948A (en) * | 2017-11-20 | 2018-03-06 | 中国农业大学 | Ethylene gas content monitoring device and method |
CN112528114A (en) * | 2019-09-17 | 2021-03-19 | 北京国双科技有限公司 | Article duplicate removal method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160302 |