CN105373605A - Batch storage method and system for data files - Google Patents

Batch storage method and system for data files

Publication number
CN105373605A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510767586.9A
Other languages
Chinese (zh)
Inventor
高万林
赵龙
任延昭
陈雪瑞
段晶洁
李静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Agricultural University
Original Assignee
China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-11-11
Filing date
2015-11-11
Publication date
2016-03-02
Application filed by China Agricultural University
Priority to CN201510767586.9A
Publication of CN105373605A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying

Abstract

The invention provides a batch storage method and system for data files. The method comprises the following steps: collecting multiple pieces of text data and the multimedia data matched with each piece of text data; performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data; retaining one piece of text data from each group of duplicated text data, and assigning identification codes to and storing the retained text data together with the originally unique text data; and assigning the multimedia data matched with each piece of text data the same identification code as that text data and storing it by category. The method and system store large volumes of scientific and technological achievement texts, pictures, audio and video. After duplicate checking and deletion, each piece of text data is given an identification code; the pictures, audio or video matched with that text data are given the same identification code and stored in their respective databases, so that the data files are stored effectively.

Description

Batch storage method and system for data files
Technical field
The present invention relates to the field of computer applications, and in particular to a batch storage method and system for data files used to manage agricultural scientific and technological achievements.
Background art
In recent years, the state has paid close attention every year to the rural economy, rural development and the rural population, and investment in agriculture has increased steadily. Research institutes and universities continue to explore agricultural research and development, and each year they produce a large volume of scientific and technological achievement data. These data take many forms and formats, including text, pictures, audio and video. How to store and manage such large volumes of disorganized achievement data effectively, and how to screen and import them, has become the key factor limiting rapid storage of achievement data. A more effective way of importing the data is therefore needed.
Summary of the invention
The invention provides a batch storage method and system for data files, to solve the prior-art problem of importing large batches of disorganized, redundant data.
In one aspect, the invention provides a batch storage method for data files, the method comprising the steps of:
Collecting multiple pieces of text data and the multimedia data matched with each piece of text data;
Performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
Retaining one piece of text data from each group of mutually duplicated text data, and assigning identification codes to and storing the retained text data together with the originally unique text data;
Assigning the multimedia data matched with each piece of text data the same identification code as that text data, and storing the multimedia data by category.
Further, performing duplicate checking on all text data comprises: according to the multiple entry conditions of the text data, performing duplicate checking on the text content under each entry condition.
Further, the step of processing the multiple groups of mutually duplicated text data comprises:
Retaining the earliest-entered text data in each group, together with the multimedia data matched with that text data;
Deleting the remaining text data, together with the multimedia data matched with the remaining text data.
Further, the multimedia data comprises picture data, audio data and/or video data.
Further, the entry conditions are the scientific and technological achievement title, author, institution, research start and end time, and body text.
Further, the conditions under which duplicate checking judges that mutually duplicated text data exist comprise:
The ratio of the number of consecutively repeated characters to the number of characters in the shorter title is greater than a preset ratio;
And/or,
The authors are all identical, or at least one author is identical;
And/or,
The institutions are all identical, or at least one institution is identical;
And/or,
The research start and end times are identical or overlap;
And/or,
For each paragraph of the body text, the ratio of the number of consecutively repeated characters to the total number of characters in the paragraph is greater than a preset ratio.
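The first four conditions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, "consecutively repeated characters" is read as the longest run of consecutive characters shared by the two titles, and the body-text condition is left out because the source does not say which paragraphs are compared with which.

```python
from datetime import date
from difflib import SequenceMatcher


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters found in both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def title_ratio(t1: str, t2: str) -> float:
    """Longest shared run divided by the length of the shorter title."""
    shorter = min(len(t1), len(t2))
    return longest_common_run(t1, t2) / shorter if shorter else 0.0


def shares_any(x: set, y: set) -> bool:
    """True when the two author (or institution) sets have at least one member in common."""
    return bool(x & y)


def periods_overlap(start1: date, end1: date, start2: date, end2: date) -> bool:
    """True when the two research periods are identical or overlap."""
    return start1 <= end2 and start2 <= end1
```

A title pair would then count as duplicated when `title_ratio(t1, t2)` exceeds the preset ratio; the source does not give a value for that ratio, so any threshold used with this sketch is an assumption.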
Further, the method also comprises a data search step, comprising:
Searching the text data in the text database using a multi-field retrieval method, and judging whether text data matching the search exists;
If it exists, determining the identification code of the matching text data, using this identification code to find the corresponding multimedia data in the multimedia database, and then displaying the search results;
Otherwise, displaying no search results.
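The search step can be sketched as a two-stage lookup: match the text records first, then use each hit's identification code as the key into the multimedia stores. The sketch below assumes in-memory dictionaries in place of real databases and simple substring matching as the multi-field retrieval rule; both are illustrative assumptions, and `search` is a hypothetical name.

```python
def search(text_db, media_dbs, **criteria):
    """Return every text record matching all given fields, with its multimedia.

    text_db:   dict mapping identification code -> record dict
    media_dbs: dict mapping media kind -> dict of identification code -> file list
    criteria:  field name -> substring that the field must contain
    """
    hits = []
    for code, record in text_db.items():
        if all(str(value) in str(record.get(field, ""))
               for field, value in criteria.items()):
            # The shared identification code links the text record to its media.
            media = {kind: db.get(code, []) for kind, db in media_dbs.items()}
            hits.append({"id": code, "text": record, "media": media})
    return hits  # an empty list means there is nothing to display
```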
In another aspect, the invention provides a batch storage system for data files, comprising:
A collection module, configured to collect multiple pieces of text data and the multimedia data matched with each piece of text data;
A text data duplicate-check judgment module, configured to perform duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
An identification storage module, configured to retain one piece of text data from each group of mutually duplicated text data, to assign identification codes to and store the retained text data together with the originally unique text data, and at the same time to assign the multimedia data matched with each piece of text data the same identification code and store it by category.
Further, the system also comprises an entry-condition storage module, configured to edit or store the entry conditions of the text data.
Further, the system also comprises a duplicated-text-data processing module, configured to retain the earliest-entered text data in each group of mutually duplicated text data and to delete the remaining text data.
As can be seen from the above technical solution, the invention stores large volumes of text data and multimedia data. After duplicate checking and deletion are performed on the text data, each piece of text data is given an identification code, and the multimedia data matched with that text data is given the same identification code; the data are then stored in their respective category databases, which completes effective storage of the data files. In addition, during a search, the text data is searched first to determine the matching text data, and the multimedia data is then retrieved by looking up its identification code, which completes effective search and display of the data files.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the storage method according to Embodiment 1 of the invention;
Fig. 2 is a flowchart of one specific implementation of the storage method shown in Embodiment 1 of the invention;
Fig. 3 is a structural block diagram of the storage system according to Embodiment 2 of the invention.
Detailed description of the embodiments
The specific embodiments of the invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the invention but do not limit its scope.
Fig. 1 shows the batch storage method for data files provided by Embodiment 1 of the invention. The method comprises the steps of:
1. Collecting multiple pieces of text data and the multimedia data matched with each piece of text data;
2. Performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data, where the multimedia data may comprise picture data, audio data and video data;
3. Retaining one piece of text data from each group of mutually duplicated text data, and assigning identification codes to and storing the retained text data together with the originally unique text data;
4. Assigning the multimedia data matched with each piece of text data the same identification code as that text data, and storing the multimedia data by category.
Fig. 2 shows a specific embodiment of the above storage method:
S1: collect multiple pieces of text data and the picture data, audio data and/or video data matched with each piece of text data;
S2: according to the multiple entry conditions of the text data, perform duplicate checking on the text content of all text data under each entry condition, and judge whether duplicated text data exist;
S21: if multiple groups of mutually duplicated text data exist, retain the earliest-entered text data in each group together with the picture data, audio data and/or video data matched with it, and delete the remaining text data together with the picture data, audio data and/or video data matched with them;
S22: otherwise, retain all text data, picture data, audio data and video data;
S3: assign identification codes to the one piece of text data retained from each group of mutually duplicated text data and to the originally unique text data, and store them in the text database, where originally unique text data means text data that has no duplicate;
S4: assign the picture data, audio data and/or video data matched with each identified piece of text data the same identification code, and store them in the picture database, audio database and video database respectively.
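Steps S21 to S4 can be sketched as below, assuming the duplicate groups have already been found in step S2. Everything named here is hypothetical: `store_batch`, the dictionary-based "databases", and the identification code format `A000001` are illustrative choices, since the source does not specify how codes are generated. Entry order is taken to be the ascending entry number.

```python
from itertools import count


def store_batch(records, groups, media):
    """Keep the earliest-entered record of each duplicate group, then identify and store.

    records: list of dicts, each with an entry number under the key 'no'
    groups:  list of lists of entry numbers judged to be mutual duplicates
    media:   dict mapping entry number -> list of (kind, filename) pairs,
             where kind is 'picture', 'audio' or 'video'
    """
    dropped = set()
    for group in groups:
        keep = min(group)  # earliest entry number = earliest entered
        dropped.update(n for n in group if n != keep)

    text_db = {}
    media_dbs = {"picture": {}, "audio": {}, "video": {}}
    serial = count(1)
    for record in sorted(records, key=lambda r: r["no"]):
        if record["no"] in dropped:
            continue  # S21: the duplicate and its media are discarded together
        code = f"A{next(serial):06d}"  # S3: hypothetical identification code format
        text_db[code] = record
        for kind, filename in media.get(record["no"], []):
            # S4: the media inherits the code and goes to its own store
            media_dbs[kind].setdefault(code, []).append(filename)
    return text_db, media_dbs
```

Keying every store by the same code is what later allows a text hit to be joined to its pictures, audio and video without any further matching.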
To explain the above method further: when collecting data files, the storage method of this embodiment can enter all agricultural scientific and technological achievement text data according to an Excel template preset by the system. During entry, each piece of text data has a unique number. After entry, the template containing all the text data is submitted to the system, and the system performs duplicate checking on all text data. During duplicate checking, the text content of all text data under each entry condition is checked according to the multiple entry conditions of the text data. An entry condition is a criterion that governs what content is entered during the entry process. Checking the text content under every entry condition examines the achievement text data from every angle, so that potentially duplicated text data is not missed and the accuracy of duplicate checking improves. In this embodiment, the entry conditions may be the scientific and technological achievement title, author, institution, research start and end time, and body text. Duplicate checking compares the respective contents in the order achievement title, author, institution, research start and end time, body-text keywords, and during the comparison a preset judgment criterion is used to decide whether duplicated text data exist. The preset judgment criteria may be: the ratio of the number of consecutively repeated characters to the number of characters in the shorter title is greater than a preset ratio; the authors are all identical, or at least one author is identical; the institutions are all identical, or at least one institution is identical; the research start and end times are identical or overlap; for each paragraph of the body text, the ratio of the number of consecutively repeated characters to the total number of characters in the paragraph is greater than a preset ratio. Text data may be judged to be mutually duplicated when one or more of these criteria are satisfied. Checking the achievement text data from the smallest judgment point in this way ensures that not a single potentially duplicated piece of text data is missed, which improves the accuracy of duplicate checking.
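The checking order and the "one or more criteria" rule can be sketched as a small decision function. The field names, the `judge` function and the `min_hits` parameter are hypothetical: the source only states that satisfying one or more criteria may suffice, so `min_hits` is an illustrative knob for making the rule stricter.

```python
# Order in which the embodiment compares fields.
CHECK_ORDER = ["title", "author", "institution", "period", "body"]


def judge(results, min_hits=1):
    """Decide whether two records are duplicates from per-field results.

    results:  dict mapping field name -> True if that field's criterion is met
    min_hits: how many criteria must be met (1 reproduces "one or more")
    Returns (is_duplicate, list of satisfied criteria in checking order).
    """
    met = [field for field in CHECK_ORDER if results.get(field, False)]
    return len(met) >= min_hits, met
```

With `min_hits=1` a single matching field flags the pair, which favors recall; raising it toward `len(CHECK_ORDER)` trades recall for fewer false duplicates.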
If multiple groups of mutually duplicated text data are judged to exist, the earliest-entered text data in each group (a group contains at least two pieces of text data) is retained together with the picture data, audio data and/or video data matched with it, and the remaining text data are deleted together with the picture data, audio data and/or video data matched with them. In this way only one copy of identical text data remains, which avoids data redundancy.
The retained duplicated text data and the text data that has no duplicate are identified according to a preset identification method, so that each piece of text data is unique. Text data may be accompanied by picture, audio or video data. Therefore, to preserve the integrity and correspondence of the whole scientific and technological achievement, the corresponding picture data, audio data or video data is given an identification code consistent with that of the text data. All identified data can then be stored separately in the corresponding databases.
The following table gives a specific illustration:
No.  Achievement title                  Author  Institution                         Research start and end time  Body text
1    Wine-growing technology            C       P University                        2014.05.06-2015.03.08        (omitted)
2    Peach-apricot grafting technology  E, F    L Company                           2013.01.15-2015.08.16        (omitted)
3    High-yield wine-growing method     C, B    P University, G Research Institute  2014.09.13-2015.01.20        (omitted)
4    Corn variety                       A       Q Research Institute                2013.12.01-2014.12.10        (omitted)
5    Corn variety                       A       Q Research Institute                2013.12.01-2014.12.10        (omitted)
In the table, duplicate checking shows that the text data numbered 1 and 3 satisfy the preset judgment criteria for title, author, institution and research start and end time, and that their body text also satisfies the preset judgment criterion. The text data numbered 1 and 3 are therefore two mutually duplicated pieces of text data; the text data numbered 1 is retained and the text data numbered 3 is deleted.
In the table, duplicate checking shows that the text data numbered 4 and 5 satisfy the preset judgment criteria for title, author, institution and research start and end time, and that their body text also satisfies the preset judgment criterion. The text data numbered 4 and 5 are therefore two mutually duplicated pieces of text data; the text data numbered 4 is retained and the text data numbered 5 is deleted.
In the table, duplicate checking shows that the text data numbered 2 has no mutually duplicated text data, so the text data numbered 2 continues to be retained.
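The table can be worked through in code. The sketch below is an illustration under stated assumptions: it checks only author, institution and research period (the body text is omitted in the table, and the titles are translations, so character ratios on them would not be meaningful), and it requires all three checks to hold together. That conjunction is a choice made for this example; note that record 2's research period overlaps every other record's, so the period check alone would not single it out as unique. Grouping uses a small union-find, which is likewise an implementation choice and not something the source prescribes.

```python
from datetime import date

# (number, authors, institutions, start, end) taken from the table
ROWS = [
    (1, {"C"}, {"P University"}, date(2014, 5, 6), date(2015, 3, 8)),
    (2, {"E", "F"}, {"L Company"}, date(2013, 1, 15), date(2015, 8, 16)),
    (3, {"C", "B"}, {"P University", "G Research Institute"},
     date(2014, 9, 13), date(2015, 1, 20)),
    (4, {"A"}, {"Q Research Institute"}, date(2013, 12, 1), date(2014, 12, 10)),
    (5, {"A"}, {"Q Research Institute"}, date(2013, 12, 1), date(2014, 12, 10)),
]


def duplicated(r, s):
    """Shared author AND shared institution AND overlapping research period."""
    return bool(r[1] & s[1]) and bool(r[2] & s[2]) and r[3] <= s[4] and s[3] <= r[4]


def group_rows(rows):
    """Group row numbers into duplicate sets; each key is the earliest number."""
    parent = {r[0]: r[0] for r in rows}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, r in enumerate(rows):
        for s in rows[i + 1:]:
            if duplicated(r, s):
                a, b = find(r[0]), find(s[0])
                parent[max(a, b)] = min(a, b)  # earliest number becomes the root

    groups = {}
    for r in rows:
        groups.setdefault(find(r[0]), []).append(r[0])
    return groups


groups = group_rows(ROWS)
kept = sorted(groups)                                            # [1, 2, 4]
deleted = sorted(n for m in groups.values() for n in m[1:])      # [3, 5]
```

The result matches the narrative above: records 1, 2 and 4 are retained, and records 3 and 5 are deleted.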
The invention also comprises a step of searching the text data, comprising:
Searching the text data in the text database using a multi-field retrieval method, and judging whether text data matching the search exists;
If it exists, determining the identification code of the matching text data, using this identification code to find the corresponding picture data, audio data and/or video data in the picture database, audio database and/or video database, and then displaying the search results;
Otherwise, displaying no search results.
Based on the above storage method, the invention provides a batch storage system for data files. As shown in Fig. 3, the system comprises:
A collection module, configured to collect multiple pieces of text data and the picture data, audio data and/or video data matched with each piece of text data;
A text data duplicate-check judgment module, configured to perform duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
An identification storage module, configured to retain one piece of text data from each group of mutually duplicated text data, to assign identification codes to and store the retained text data together with the originally unique text data, and at the same time to assign the multimedia data matched with each piece of text data the same identification code and store it by category.
Further, the system also comprises an entry-condition storage module, configured to store or edit the entry conditions of the text data and to provide the entry conditions to the collection module during entry.
Further, the system also comprises a duplicated-text-data processing module, configured to retain the earliest-entered text data in each group of mutually duplicated text data and to delete the remaining text data.
As can be seen from the above technical solution, the invention stores large volumes of text data and multimedia data. After duplicate checking and deletion are performed on the text data, each piece of text data is given an identification code, and the multimedia data matched with that text data is given the same identification code; the data are then stored in their respective category databases, which completes effective storage of the data files. In addition, during a search, the text data is searched first to determine the matching text data, and the multimedia data is then retrieved by looking up its identification code, which completes effective search and display of the data files.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above embodiments illustrate the invention rather than limit it, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any ordering; these words may be interpreted as names.
Those of ordinary skill in the art will understand that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the claims of the invention.

Claims (10)

1. A batch storage method for data files, characterized in that the method comprises the steps of:
Collecting multiple pieces of text data and the multimedia data matched with each piece of text data;
Performing duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
Retaining one piece of text data from each group of mutually duplicated text data, and assigning identification codes to and storing the retained text data together with the originally unique text data;
Assigning the multimedia data matched with each piece of text data the same identification code as that text data, and storing the multimedia data by category.
2. The storage method according to claim 1, characterized in that performing duplicate checking on all text data comprises: according to the multiple entry conditions of the text data, performing duplicate checking on the text content under each entry condition.
3. The storage method according to claim 2, characterized in that the entry conditions are the scientific and technological achievement title, author, institution, research start and end time, and body text.
4. The storage method according to claim 1, characterized in that the step of processing the multiple groups of mutually duplicated text data comprises:
Retaining the earliest-entered text data in each group, together with the multimedia data matched with that text data;
Deleting the remaining text data, together with the multimedia data matched with the remaining text data.
5. The storage method according to claim 4, characterized in that the conditions under which duplicate checking judges that mutually duplicated text data exist comprise:
The ratio of the number of consecutively repeated characters to the number of characters in the shorter title is greater than a preset ratio;
And/or,
The authors are all identical, or at least one author is identical;
And/or,
The institutions are all identical, or at least one institution is identical;
And/or,
The research start and end times are identical or overlap;
And/or,
For each paragraph of the body text, the ratio of the number of consecutively repeated characters to the total number of characters in the paragraph is greater than a preset ratio.
6. The storage method according to claim 1, characterized in that the multimedia data comprises picture data, audio data and/or video data.
7. The storage method according to claim 1, characterized in that the method also comprises a data search step, comprising:
Searching the text data in the text database using a multi-field retrieval method, and judging whether text data matching the search exists;
If it exists, determining the identification code of the matching text data, using this identification code to find the corresponding multimedia data in the multimedia database, and then displaying the search results;
Otherwise, displaying no search results.
8. A batch storage system for data files, characterized in that the system comprises:
A collection module, configured to collect multiple pieces of text data and the multimedia data matched with each piece of text data;
A text data duplicate-check judgment module, configured to perform duplicate checking on all text data to obtain multiple groups of mutually duplicated text data and the correspondingly matched multimedia data;
An identification storage module, configured to retain one piece of text data from each group of mutually duplicated text data, to assign identification codes to and store the retained text data together with the originally unique text data, and at the same time to assign the multimedia data matched with each piece of text data the same identification code and store it by category.
9. The storage system according to claim 8, characterized in that the system also comprises an entry-condition storage module, configured to edit or store the entry conditions of the text data.
10. The storage system according to claim 8, characterized in that the system also comprises a duplicated-text-data processing module, configured to retain the earliest-entered text data in each group of mutually duplicated text data and to delete the remaining text data.
CN201510767586.9A 2015-11-11 2015-11-11 Batch storage method and system for data files Pending CN105373605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510767586.9A CN105373605A (en) 2015-11-11 2015-11-11 Batch storage method and system for data files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510767586.9A CN105373605A (en) 2015-11-11 2015-11-11 Batch storage method and system for data files

Publications (1)

Publication Number Publication Date
CN105373605A true CN105373605A (en) 2016-03-02

Family

ID=55375804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510767586.9A Pending CN105373605A (en) 2015-11-11 2015-11-11 Batch storage method and system for data files

Country Status (1)

Country Link
CN (1) CN105373605A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1747390A (en) * 2005-10-14 2006-03-15 北京金山软件有限公司 Method and system for processing real-time multi-media information in instant telecommunication
CN102280104A (en) * 2010-06-11 2011-12-14 北大方正集团有限公司 File phoneticization processing method and system based on intelligent indexing
CN102937959A (en) * 2011-06-03 2013-02-20 苹果公司 Automatically creating a mapping between text data and audio data
CN103678702A (en) * 2013-12-30 2014-03-26 优视科技有限公司 Video duplicate removal method and device
CN103970722A (en) * 2014-05-07 2014-08-06 江苏金智教育信息技术有限公司 Text content duplicate removal method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
曾理: ""Hadoop的重复数据清理模型研究与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
潘鑫: ""基于相似度估计文档复制检测系统的设计与实现"", 《万方中国学位论文全文数据库》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469195A (en) * 2016-08-31 2017-03-01 国信优易数据有限公司 Based on conforming data file Valuation Method and system
CN106446077A (en) * 2016-09-07 2017-02-22 乐视控股(北京)有限公司 Object uploading method and electronic device
CN106649641A (en) * 2016-12-08 2017-05-10 北京五八信息技术有限公司 Method and device for processing database object set schema information and management system
CN106649641B (en) * 2016-12-08 2020-05-26 北京五八信息技术有限公司 Method, device and management system for processing schema information of database object set
CN107764948A (en) * 2017-11-20 2018-03-06 中国农业大学 Ethylene gas content monitoring device and method
CN112528114A (en) * 2019-09-17 2021-03-19 北京国双科技有限公司 Article duplicate removal method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160302