CN110275978A - Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method - Google Patents


Info

Publication number
CN110275978A
Authority
CN
China
Prior art keywords
data
voice
access
voice data
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910585870.2A
Other languages
Chinese (zh)
Inventor
游萌
何云鹏
高君效
许兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Leader Technology Co Ltd
Chipintelli Technology Co Ltd
Original Assignee
Chengdu Leader Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Leader Technology Co Ltd
Priority to CN201910585870.2A
Publication of CN110275978A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61 - Indexing; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 - Querying

Abstract

A method for fast storage, access and modification of voice big data on a redundant array of inexpensive disks (RAID), comprising the following steps: voice data are screened out, and the voice data and the storage they occupy are processed as follows: 4K alignment is applied to the storage; the voice data are sorted with a bidirectional bubble sort algorithm; and the voice data are counted and managed according to their names, the management including access and modification operations. By using 4K alignment and the bidirectional bubble sort algorithm, the method optimizes the storage area and the ordering of data names respectively, improving data read/write speed and recognition speed from two directions and thereby facilitating fast calling and management. In addition, data screening identifies the frequently updated and replaced data used for voice training and applies the above data management measures to them, improving resource utilization.

Description

Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method
Technical field
The invention belongs to the field of artificial intelligence and relates to a voice data management method, in particular to a method for fast storage, access and modification of voice big data on a redundant array of inexpensive disks.
Background art
A voice file is a continuous record of a digital signal. Voice file storage is a way of storing digital signal records in file form on hard disks and other media. Files on a hard disk are usually not stored contiguously on the physical medium but are scattered across different partitions and different physical sectors of the disk. There are many data storage methods; a common one is to distribute the different parts that make up a complete file to different physical addresses. In today's mainstream high-capacity backup and big data management systems, disk redundancy array systems and methods are generally used.
Voice data recognition relies on repeated, continuous training of artificial neural networks and requires massive amounts of voice data. Voice data are generally classified by speaker, utterance time, acquisition distance and so on. Managing this voice data information requires full consideration of the spatial complexity of the storage architecture: in big data storage, the more storage space is used to handle the same data, the higher the maintenance cost, the longer the scheduling time for data access, and the higher the overall cost of the data per unit of storage medium.
Summary of the invention
To overcome the technical shortcomings of the prior art, the invention discloses a method for fast storage, access and modification of voice big data on a redundant array of inexpensive disks.
The method of the present invention comprises the following steps: voice data are screened out, and the voice data and the storage they occupy are processed by applying 4K alignment to the storage, sorting the voice data with a bidirectional bubble sort algorithm, and counting and managing the voice data according to their names, the management including access and modification operations.
By using 4K alignment and the bidirectional bubble sort algorithm, the method optimizes the storage area and the ordering of data names respectively, improving data read/write speed and recognition speed from two directions and thereby facilitating fast calling and management. In addition, data screening identifies the frequently updated and replaced data used for voice training and applies the above data management measures to them, improving resource utilization.
Specific embodiment
Specific embodiments of the present invention are described in further detail below.
The method of the present invention for fast storage, access and modification of voice big data on a redundant array of inexpensive disks comprises the following steps:
Voice data are screened out, and the voice data and the storage they occupy are processed as follows:
4K alignment is applied to the storage;
The voice data are sorted with the bidirectional bubble sort algorithm;
The voice data are counted and managed according to their names, the management including access and modification operations.
The data are screened by setting a threshold on storage duration, for example 7 days. Data whose storage time exceeds the threshold lose priority; the system preferentially screens out newly added data whose storage time does not exceed the threshold and optimizes them.
Data screening targets batches of data from a recent, short period. Newer data needed for training can be repaired as necessary and written back synchronously to the original data warehouse after repair. Data stored for a shorter time also have higher repair priority and are synchronized to the original data warehouse once repaired; data stored for a longer time can be synchronized and written back to the original data warehouse gradually after several iteration cycles.
Voice data are trained on and updated quickly, whereas system data and the data of other common software are updated very slowly. The present invention targets newly stored voice data that are frequently called and read, and screens them by storage duration.
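The screening rule can be sketched as a filter on storage age. This is only an illustrative sketch: the function name `screen_recent`, the record layout (name, stored-at timestamp) and the dates are assumptions, and the 7-day threshold is just the example value given in the description.

```python
from datetime import datetime, timedelta

# Example threshold from the description; a real system would make it configurable.
THRESHOLD = timedelta(days=7)

def screen_recent(records, now, threshold=THRESHOLD):
    """Return the names stored within the threshold, newest first.

    `records` is a list of (name, stored_at) pairs (a hypothetical layout).
    """
    recent = [(name, stored) for name, stored in records if now - stored <= threshold]
    recent.sort(key=lambda item: item[1], reverse=True)  # newest data get priority
    return [name for name, _ in recent]

now = datetime(2019, 7, 1)
records = [
    ("ZS-day-05m-ma-1018", datetime(2019, 6, 29)),  # 2 days old: kept
    ("LS-day-1m-ma-1830", datetime(2019, 6, 10)),   # 21 days old: loses priority
    ("WW-ngt-3m-fem-4050", datetime(2019, 6, 30)),  # 1 day old: kept
]
selected = screen_recent(records, now)
```

Data that fall outside the threshold are not deleted here; they are simply left out of the prioritized batch.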
For the data screened out, the data themselves and the storage they occupy are optimized as follows.
First, 4K alignment is applied to the storage. 4K alignment makes the operating system's smallest allocation unit correspond to the flash memory pages, so that writing 4 KB of data completes in a single operation.
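The arithmetic behind 4K alignment can be illustrated as follows. This is a minimal sketch that only checks and rounds byte offsets; it does not partition or format a disk, and the helper names and the 512-byte sector size are assumptions made for the example.

```python
BLOCK = 4096   # 4 KiB allocation unit
SECTOR = 512   # assumed logical sector size in bytes

def is_aligned(offset, block=BLOCK):
    """True if a byte offset falls exactly on a 4 KiB boundary."""
    return offset % block == 0

def align_up(offset, block=BLOCK):
    """Round a byte offset or size up to the next 4 KiB boundary."""
    return (offset + block - 1) // block * block

# A partition starting at sector 63 is misaligned; one starting at sector 2048 is aligned.
legacy_start = 63 * SECTOR
aligned_start = 2048 * SECTOR
```

When a write starts on a 4 KiB boundary, a 4 KB block maps onto one page instead of straddling two, which is the effect the description relies on.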
Voice training data are usually large files composed of many scattered small files. When stored, the small files are dispersed across different disk partitions; for data in this form, 4K alignment yields a marked improvement in hard disk read/write speed.
For example, for the important case of file copying, big files and scattered small files were tested on the same hard disk and across different hard disks. The big-file set was about 20 GB of archive-type data; the small-file set was 200,000 files of no more than 2 MB each. The comparison showed that a 4K-aligned hard disk performs noticeably better at copying data, especially for scattered small files, where the speed roughly doubled.
Measurements also showed that when a solid state drive is used as the temporary storage medium for fast data exchange in a temporary swap area, 4 KB random reads and writes are the most efficient in actual use.
After the storage has been processed, the data are sorted with the bidirectional bubble sort algorithm. Bidirectional bubble sort runs conventional bubble sort in both directions: a bubble pass is first made from left to right, then a bubble pass from right to left, which completes one round of sorting.
Specifically, bidirectional bubble sort traverses the sequence to be sorted in a two-level loop. In the forward direction, smaller items move quickly toward the top of the array; in the reverse direction, opposite to the forward direction, larger items move quickly toward the bottom of the array. The sort is complete when the two-level loop ends. Compared with one-directional bubble sort, the bidirectional algorithm typically needs fewer passes, although its worst-case time complexity remains O(n^2).
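A minimal sketch of bidirectional bubble sort (also known as cocktail shaker sort) is given below. It is one common formulation, not necessarily the exact variant used by the inventors: here the left-to-right pass carries the largest item to the end and the right-to-left pass carries the smallest item to the front, and the function name is an assumption.

```python
def bidirectional_bubble_sort(items):
    """Return a sorted copy of `items` using alternating forward and backward passes."""
    data = list(items)
    left, right = 0, len(data) - 1
    swapped = True
    while swapped and left < right:
        swapped = False
        # Left-to-right pass: the largest remaining item ends up at index `right`.
        for i in range(left, right):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        right -= 1
        # Right-to-left pass: the smallest remaining item ends up at index `left`.
        for i in range(right, left, -1):
            if data[i - 1] > data[i]:
                data[i - 1], data[i] = data[i], data[i - 1]
                swapped = True
        left += 1
    return data

example = bidirectional_bubble_sort([5, 1, 4, 2, 8, 0])
```

The loop stops early as soon as a full round makes no swap, which is where the saving over one-directional bubble sort usually comes from.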
The bidirectional bubble sort is preferably applied to file names. For reading, writing and calling massive amounts of data, improving how data files are named helps them be identified and called quickly.
For example, a voice data item D1 has associated parameters Ci: C1 speaker, C2 acquisition time, C3 acquisition distance, C4 speaker gender and C5 speaker age bracket.
Assume the voice data were acquired under the following conditions:
Data | Speaker name | Acquisition time | Acquisition distance | Speaker gender | Speaker age bracket
D1 | Zhang San (ZS) | Daytime (day) | 0.5 m | Male (ma) | 10-18
D2 | Zhang San (ZS) | Night (ngt) | 1 m | Male (ma) | 10-18
D3 | Li Si (LS) | Daytime (day) | 1 m | Male (ma) | 18-30
D4 | Wang Wu (WW) | Night (ngt) | 3 m | Female (fem) | 40-50
The text corresponding to voice data D1 and D2 is: turn on the air conditioner.
The text corresponding to voice data D3 and D4 is: turn on the TV.
Voice data D1 to D4 can then be named as follows:
ZS-day-05m-ma-1018;
ZS-ngt-1m-ma-1018;
LS-day-1m-ma-1830;
WW-ngt-3m-fem-4050;
The correspondence with the text data is:
Turn on the air conditioner - ZS-day-05m-ma-1018;
Turn on the air conditioner - ZS-ngt-1m-ma-1018;
Turn on the TV - LS-day-1m-ma-1830;
Turn on the TV - WW-ngt-3m-fem-4050;
If the data are stored and called directly under the above naming rule, without further processing, they take up a lot of storage space and are slow to call.
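The naming rule above can be sketched as a pair of helper functions. The function names, the field labels and the way distances and age brackets are encoded (0.5 becomes `05m`, the bracket 10-18 becomes `1018`) are inferred from the four example names and should be read as assumptions.

```python
FIELDS = ("speaker", "time", "distance", "gender", "age")  # hypothetical field labels

def make_name(speaker, time, distance_m, gender, age_range):
    """Build a name such as ZS-day-05m-ma-1018 from the five parameters."""
    distance = f"{distance_m:g}".replace(".", "") + "m"   # 0.5 -> "05m", 1 -> "1m"
    age = f"{age_range[0]}{age_range[1]}"                 # (10, 18) -> "1018"
    return "-".join((speaker, time, distance, gender, age))

def parse_name(name):
    """Split a name back into its labelled fields."""
    return dict(zip(FIELDS, name.split("-")))

# Correspondence between text data and voice data names, as in the example.
text_to_names = {
    "turn on the air conditioner": [
        make_name("ZS", "day", 0.5, "ma", (10, 18)),
        make_name("ZS", "ngt", 1, "ma", (10, 18)),
    ],
    "turn on the TV": [
        make_name("LS", "day", 1, "ma", (18, 30)),
        make_name("WW", "ngt", 3, "fem", (40, 50)),
    ],
}
```

Because every field sits at a fixed position, later steps (sorting, indexing, batch modification) can work on the name alone without opening the audio file.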
Using the bidirectional bubble sort algorithm, a rule is defined for sorting the file names, for example alphabetical order A-Z, comparing the next letter when the letters are the same. Take the following four file names:
ZS-day-05m-ma-1018
ZS-ngt-1m-ma-1018
LS-day-1m-ma-1830
WW-ngt-3m-fem-4050;
After sorting, the order is:
LS-day-1m-ma-1830;
WW-ngt-3m-fem-4050;
ZS-day-05m-ma-1018;
ZS-ngt-1m-ma-1018;
After the names have been sorted with the bidirectional bubble sort algorithm, the storage areas of related data are more concentrated, and recognition is found to be faster when the data are called.
An index database can also be built over the names of the voice data: each parameter is used as an index in turn, building a multi-layer index layer by layer.
For example, in ZS-day-05m-ma-1018 the fields correspond to: speaker name, acquisition time, acquisition distance, speaker gender and speaker age bracket.
For example, with the speaker name as the first-layer index, the index information of the voice data is:
"ZS" day-05m-ma-1018;
     ngt-1m-ma-1018;
"LS" day-1m-ma-1830;
"WW" ngt-3m-fem-4050;
Under the first-layer index "ZS", the acquisition time is used as the second-layer index:
"ZS"-"day" 05m-ma-1018;
"ZS"-"ngt" 1m-ma-1018;
And so on.
If instead the acquisition distance is used as the first-layer index, the index information of the voice data is:
"05m" ZS-day-ma-1018;
"1m" ZS-ngt-ma-1018;
     LS-day-ma-1830;
"3m" WW-ngt-fem-4050;
Second-layer indexes follow by analogy.
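The layered index described above can be sketched as a nested dictionary. This is an illustrative sketch only: `build_index`, the field labels and the choice to store full names at the leaves are assumptions, not details given in the description.

```python
FIELDS = ("speaker", "time", "distance", "gender", "age")  # hypothetical field labels

NAMES = [
    "ZS-day-05m-ma-1018",
    "ZS-ngt-1m-ma-1018",
    "LS-day-1m-ma-1830",
    "WW-ngt-3m-fem-4050",
]

def build_index(names, order):
    """Build a nested dict keyed by the fields in `order`; leaves are lists of names."""
    index = {}
    for name in names:
        fields = dict(zip(FIELDS, name.split("-")))
        node = index
        for field in order[:-1]:
            node = node.setdefault(fields[field], {})        # descend one layer
        node.setdefault(fields[order[-1]], []).append(name)  # last layer holds names
    return index

# Speaker as first layer, acquisition time as second layer.
by_speaker = build_index(NAMES, ("speaker", "time"))
# Acquisition distance as the only layer.
by_distance = build_index(NAMES, ("distance",))
```

Changing the `order` tuple is enough to rebuild the index around a different first-layer parameter.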
With the above hierarchical indexing, data can be grouped and organized entirely by one or more parameter attributes associated with each voice data item, and the relevant data content selected.
For example, in speech recognition training, if only near-field data are to be prepared before training, only data with an acquisition distance of 0.5 m or 1 m are used. Guided by the index database, the data meeting this condition are brought into a working directory for pre-training data preparation and preprocessing. Only a smaller portion of the data remains there, so the storage space is reduced relative to the original total data volume.
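The near-field selection can be sketched as a lookup in a distance index. The sketch assumes the example names used earlier in the description and a hypothetical helper `index_by_field`; copying the selected files into a working directory is left out.

```python
from collections import defaultdict

NAMES = [
    "ZS-day-05m-ma-1018",
    "ZS-ngt-1m-ma-1018",
    "LS-day-1m-ma-1830",
    "WW-ngt-3m-fem-4050",
]
DISTANCE = 2  # position of the acquisition-distance field in a name

def index_by_field(names, position):
    """Group names by the value of the field at `position`."""
    index = defaultdict(list)
    for name in names:
        index[name.split("-")[position]].append(name)
    return index

distance_index = index_by_field(NAMES, DISTANCE)
# Near-field data: acquisition distance of 0.5 m or 1 m.
near_field = [name for key in ("05m", "1m") for name in distance_index.get(key, [])]
```

Only the names under the two selected keys are touched; the 3 m data never have to be read.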
Indexing by name in this way reduces storage space. File names are simpler, and the whole sequence of operations (use, reading, organizing and scheduling) is more intuitive, easier to maintain and easier to operate. The file names are also closer to actual use and file storage is smaller, which benefits overall data maintenance.
For example, to delete the data with an acquisition distance of 1 meter from the following data:
ZS-day-05m;
ZS-ngt-1m;
LS-day-1m;
WW-ngt-3m;
all data under the index acquisition distance = 1m are deleted; the remaining data after deletion are:
ZS-day-05m;
WW-ngt-3m;
This file naming scheme effectively reduces the storage space that file names require for data management and makes searching and calling data faster; batch addition, modification and deletion of massive data also become more convenient.
For example, to modify the 1m data in the example above by changing 1m to 3m in bulk:
all names containing 1m are first retrieved and then modified. The modified database is:
ZS-day-05m;
ZS-ngt-3m;
LS-day-3m;
WW-ngt-3m;
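The batch deletion and batch modification examples above can be reproduced with two small helpers. This is a sketch operating on a list of names only; the function names are assumptions, and renaming or removing the underlying files is not shown.

```python
DISTANCE = 2  # position of the acquisition-distance field in a name

def delete_by_field(names, position, value):
    """Drop every name whose field at `position` equals `value`."""
    return [n for n in names if n.split("-")[position] != value]

def modify_field(names, position, old, new):
    """Replace `old` with `new` in the field at `position` for every matching name."""
    result = []
    for name in names:
        parts = name.split("-")
        if parts[position] == old:
            parts[position] = new
        result.append("-".join(parts))
    return result

names = ["ZS-day-05m", "ZS-ngt-1m", "LS-day-1m", "WW-ngt-3m"]
after_delete = delete_by_field(names, DISTANCE, "1m")
after_modify = modify_field(names, DISTANCE, "1m", "3m")
```

Matching on the whole field rather than on a substring avoids accidentally changing a name in which `1m` appears inside another value.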
In a preferred naming scheme, at least one parameter Ck among the parameters Ci uniquely determines other parameters associated with it; the associated other parameters are then omitted from the voice data name.
For the non-associated parameters, each is used as an index, building a multi-layer index to the text data layer by layer;
For the associated parameters, each is used as an index, building a multi-layer index to the parameter Ck layer by layer.
For example, the speaker name determines the speaker's age and gender, so the name can retain only the speaker name and omit the speaker's age and gender. Take the earlier names of voice data D1 to D4:
ZS-day-05m-ma-1018;
ZS-ngt-1m-ma-1018;
LS-day-1m-ma-1830;
WW-ngt-3m-fem-4050;
If the speaker named ZS is a 15-year-old male, LS is a 20-year-old male and WW is a 45-year-old female, the names above can be further simplified to:
ZS-day-05m;
ZS-ngt-1m;
LS-day-1m;
WW-ngt-3m;
Further, for the non-associated parameters such as acquisition time and acquisition distance, each is used as an index, building a multi-layer index layer by layer, for example:
"ZS" day-05m;
     ngt-1m;
"LS" day-1m;
"WW" ngt-3m;
Under the first-layer index "ZS", the acquisition time is used as the second-layer index:
"ZS"-"day" 05m;
"ZS"-"ngt" 1m;
Since each speaker's gender and age are uniquely determined, gender and age can serve as index layers for an index to the speaker name, making it convenient to call data by speaker name and age, for example:
"ma" 1018-ZS;
     1830-LS;
"fem" 4050-WW;
The above embodiment further reduces the storage space taken up by names.
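The simplified naming can be sketched with a small speaker table that holds the omitted parameters. The table layout and the helper names `simplify`, `expand` and `speakers_by_gender_age` are assumptions made for illustration.

```python
# Speaker -> (gender, age bracket); these values are uniquely determined by the speaker.
SPEAKERS = {"ZS": ("ma", "1018"), "LS": ("ma", "1830"), "WW": ("fem", "4050")}

def simplify(name):
    """Keep speaker, acquisition time and distance; drop gender and age bracket."""
    return "-".join(name.split("-")[:3])

def expand(short_name, speakers=SPEAKERS):
    """Restore the full name by looking the speaker up in the table."""
    gender, age = speakers[short_name.split("-")[0]]
    return f"{short_name}-{gender}-{age}"

def speakers_by_gender_age(speakers=SPEAKERS):
    """Two-layer index (gender, then age bracket) pointing at speaker names."""
    index = {}
    for speaker, (gender, age) in speakers.items():
        index.setdefault(gender, {}).setdefault(age, []).append(speaker)
    return index

short = simplify("ZS-day-05m-ma-1018")
gender_age_index = speakers_by_gender_age()
```

The gender and age information is stored once per speaker instead of once per file, which is where the saving in name length comes from.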
Hierarchical indexing not only makes retrieval fast and convenient, it also makes it easy to delete or modify a whole class of data. For example, if a speaker name is found to have been entered incorrectly, it can be corrected directly through the speaker name index rather than by modifying each file one by one. Deleting a class of files works the same way: to delete the data with an acquisition distance of 3 meters and a night-time acquisition time, all data under both acquisition distance 3m and acquisition time ngt in the index are deleted.
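Both operations can be sketched on the simplified names. The helper names, the field-position table and the corrected speaker code `WU` are hypothetical; the sketch changes index keys and name lists only, not files on disk.

```python
POSITIONS = {"speaker": 0, "time": 1, "distance": 2}  # field positions in a simplified name

def matches(name, conditions):
    """True if every named field of `name` equals the requested value."""
    parts = name.split("-")
    return all(parts[POSITIONS[field]] == value for field, value in conditions.items())

def delete_where(names, **conditions):
    """Remove the names that satisfy all conditions at once."""
    return [n for n in names if not matches(n, conditions)]

def rename_speaker(speaker_index, old, new):
    """Correct a speaker name by renaming one index key instead of every entry."""
    updated = dict(speaker_index)
    updated[new] = updated.pop(old)
    return updated

names = ["ZS-day-05m", "ZS-ngt-1m", "LS-day-1m", "WW-ngt-3m"]
remaining = delete_where(names, distance="3m", time="ngt")

speaker_index = {"ZS": ["day-05m", "ngt-1m"], "LS": ["day-1m"], "WW": ["ngt-3m"]}
fixed = rename_speaker(speaker_index, "WW", "WU")  # "WU" is a made-up corrected code
```

Because the speaker name is an index key, the correction is a single key change regardless of how many recordings belong to that speaker.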
By using 4K alignment and the bidirectional bubble sort algorithm, the method optimizes the storage area and the ordering of data names respectively, improving data read/write speed and recognition speed from two directions and thereby facilitating fast calling and management.
In addition, data screening identifies the frequently updated and replaced data used for voice training and applies the above data management measures to them, improving resource utilization.
The foregoing describes the preferred embodiments of the present invention. Unless the preferred features of the respective embodiments are obviously contradictory or one presupposes another, the embodiments may be combined in any way. The specific parameters in the embodiments and examples serve only to set out clearly the inventors' verification process and do not limit the scope of patent protection of the invention, which remains defined by the claims. All equivalent structural changes made on the basis of the description of the invention likewise fall within the scope of protection of the invention.

Claims (6)

1. A method for fast storage, access and modification of voice big data on a redundant array of inexpensive disks, characterized by comprising the following steps:
Voice data are screened out, and the voice data and the storage they occupy are processed as follows:
4K alignment is applied to the storage;
The voice data are sorted with a bidirectional bubble sort algorithm;
The voice data are counted and managed according to their names, the management including access and modification operations.
2. The method according to claim 1, characterized in that the screening is based on data storage time, and data whose storage time is shorter than a preset threshold time are identified as the voice data.
3. The method according to claim 1, characterized in that sorting the voice data with the bidirectional bubble sort algorithm specifically means sorting by the file names of the voice data files with the bidirectional bubble sort algorithm.
4. The method according to claim 3, characterized in that the naming and management rules for the voice data file names are as follows:
The name of a voice data item includes a plurality of parameters Ci, and each voice data name corresponds to unique text data, the text data being the text corresponding to the voice data;
The following indexes are established for the names:
Based on the text data, an index database of the voice data names corresponding to each item of text data is established;
An index database is established over the names of the voice data, with each parameter used as an index in turn, building a multi-layer index layer by layer.
5. The method according to claim 4, characterized in that at least one parameter Ck among the parameters Ci uniquely determines other parameters associated with it, and the associated other parameters are omitted from the voice data name;
For the non-associated parameters, each is used as an index, building a multi-layer index to the text data layer by layer;
For the associated parameters, each is used as an index, building a multi-layer index to the parameter Ck layer by layer.
6. The method according to claim 3, characterized in that the access and modification operation is an access and modification operation on all files under the corresponding index, performed according to the parameter index.
CN201910585870.2A 2019-07-01 2019-07-01 Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method Pending CN110275978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910585870.2A CN110275978A (en) 2019-07-01 2019-07-01 Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method

Publications (1)

Publication Number Publication Date
CN110275978A true CN110275978A (en) 2019-09-24

Family

ID=67963800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910585870.2A Pending CN110275978A (en) 2019-07-01 2019-07-01 Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method

Country Status (1)

Country Link
CN (1) CN110275978A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010057276A (en) * 1999-12-21 2001-07-04 윤종용 Apparatus and method for receiving voice data
CN101996195A (en) * 2009-08-28 2011-03-30 中国移动通信集团公司 Searching method and device of voice information in audio files and equipment
CN102999601A (en) * 2012-11-20 2013-03-27 广东欧珀移动通信有限公司 Method for sorting files, and multimedia terminal
CN103974143A (en) * 2014-05-20 2014-08-06 北京速能数码网络技术有限公司 Method and device for generating media data
CN106448709A (en) * 2016-09-26 2017-02-22 珠海格力电器股份有限公司 Automatic record extraction control method and device, and mobile terminal
CN106844041A (en) * 2016-12-29 2017-06-13 华为技术有限公司 The method and internal storage management system of memory management
US20180288595A1 (en) * 2016-12-09 2018-10-04 Riedel Communications International GmbH Intercom network, mobile terminal, and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
曾家智 等: "《电子科技大学出版社》" *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190924