CN110275978A - Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method - Google Patents
Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method
- Publication number
- CN110275978A (application number CN201910585870.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- voice
- access
- voice data
- name
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/61—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
Abstract
A method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks (RAID), comprising the following steps: screening out voice data, and processing the voice data and the storage it occupies as follows: performing 4K alignment on the storage; sorting the voice data with a bidirectional bubble sort algorithm; and counting and managing the voice data by name, where the management includes access and modification operations. The method uses 4K alignment to optimize the storage region and bidirectional bubble sort to order the data names, improving data read/write speed and identification speed from these two aspects, which facilitates fast retrieval and management. In addition, data screening identifies the voice training data that is frequently updated and replaced, and the improved data management measures above are applied to that data, improving resource utilization.
Description
Technical field
The invention belongs to the field of artificial intelligence and relates to a voice data management method, in particular to a method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks.
Background technique
A voice file is a continuous recording stored as a digital signal; voice file storage is the storage of that digital signal record, in a file format, on hard disks and other media. Files on a hard disk are often not stored contiguously on the physical medium but are scattered across different partitions and physical sectors of the disk. There are many data storage methods; a common one distributes the parts that make up a complete file to different physical addresses. In today's mainstream high-capacity backup and big data management systems, redundant disk array systems and methods are commonly used.
Voice data recognition depends on repeated training of artificial neural networks, which requires massive amounts of voice data. Voice data are generally classified by speaker, recording time, recording distance, and so on. Managing this information requires full consideration of the spatial complexity of the storage architecture: in big data storage, the more storage space used to process the same data, the higher the maintenance cost, the longer the scheduling time for data access, and the higher the overall cost per unit of storage medium.
Summary of the invention
To overcome the technical deficiencies of the prior art, the invention discloses a method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks.
It is of the present invention.
The method of the present invention uses 4K alignment to optimize the storage region and bidirectional bubble sort to order the data names, improving data read/write speed and identification speed from these two aspects, which facilitates fast retrieval and management. In addition, data screening identifies the voice training data that is frequently updated and replaced, and the improved data management measures above are applied to that data, improving resource utilization.
Specific embodiment
Specific embodiments of the present invention will be described in further detail below.
The method of the present invention for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks comprises the following steps:
Screen out the voice data, and process the voice data and the storage it occupies as follows:
Perform 4K alignment on the storage;
Sort the voice data with the bidirectional bubble sort algorithm;
Count and manage the voice data by name, where the management includes access and modification operations.
Data are screened by storage time: a threshold on storage duration is set, for example 7 days. Data whose storage time exceeds the threshold lose priority, and the system preferentially screens out newly added data whose storage time does not exceed the threshold for optimization.
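The screening rule can be sketched as follows. This is a minimal illustration only: the record layout (a name paired with a storage timestamp) and the function name `screen_recent` are assumptions, and only the 7-day example threshold comes from the description.

```python
from datetime import datetime, timedelta

# Example threshold from the description; any other duration could be configured.
THRESHOLD = timedelta(days=7)

def screen_recent(records, now, threshold=THRESHOLD):
    """Return the names of records whose storage time does not exceed the threshold.

    `records` is a hypothetical list of (name, stored_at) pairs.
    """
    return [name for name, stored_at in records if now - stored_at <= threshold]

now = datetime(2019, 7, 1)
records = [
    ("ZS-day-05m-ma-1018", datetime(2019, 6, 29)),  # stored 2 days ago: selected
    ("LS-day-1m-ma-1830", datetime(2019, 6, 1)),    # stored 30 days ago: loses priority
]
print(screen_recent(records, now))  # ['ZS-day-05m-ma-1018']
```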
Data screening targets batches of data from a recent, short period. Newer data needed for training can undergo any necessary repair and be synchronized back to the original data warehouse afterwards. Data stored for a shorter time also have higher repair priority and are synchronized to the original data warehouse once repaired, whereas data stored for a longer time can be synchronized back to the original data warehouse gradually after several iteration cycles.
Voice data are trained on and updated quickly, whereas system data and the data of other common software are updated slowly. Because newly stored voice data are read frequently, the present invention screens data by storage time.
The screened data and the storage they occupy are then optimized as follows.
First, 4K alignment is performed on the storage. 4K alignment matches the operating system's smallest allocation unit to the flash memory pages, so that a 4 KB write completes in a single operation.
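The arithmetic behind 4K alignment can be illustrated with a short sketch. The description does not say how alignment is carried out on the array, so the helper names below are assumptions and the code only shows how a byte offset is checked against, or rounded up to, a 4096-byte boundary.

```python
BLOCK = 4096  # 4 KiB, the alignment unit named in the description

def is_aligned(offset, block=BLOCK):
    """True if a byte offset falls exactly on a block boundary."""
    return offset % block == 0

def align_up(offset, block=BLOCK):
    """Round a byte offset up to the next block boundary."""
    return (offset + block - 1) // block * block

# A partition starting at sector 63 (512-byte sectors) is not 4K aligned,
# while one starting at sector 2048 is.
print(is_aligned(63 * 512))    # False
print(is_aligned(2048 * 512))  # True
print(align_up(63 * 512))      # 32768
```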
Voice training data usually form a large data set made up of many scattered small files, which are stored dispersed across different disk partitions. For data in this form, 4K alignment improves hard disk read/write speed more noticeably.
For example, file copy tests were run on the same hard disk and on different hard disks with a large file and with scattered small files. The large file was about 20 GB of archive-type data; the scattered small files were 200,000 files of no more than 2 MB each. The comparison shows that a 4K-aligned hard disk performs noticeably better at copying data, especially for scattered small files, where speed roughly doubled.
Measurements show that when a solid-state drive is used as temporary storage for fast data exchange in a temporary swap area, the efficiency of the critical 4 KB random reads and writes is highest in actual use.
After the storage is processed, the data are sorted with the bidirectional bubble sort algorithm. Bidirectional bubble sort runs traditional bubble sort in both directions: a bubble pass first runs from left to right, then another runs from right to left, which completes one round of sorting.
Specifically, bidirectional bubble sort traverses the sequence to be sorted with a two-level loop. The forward direction quickly moves smaller values to the top of the array; the reverse direction, opposite to the forward one, quickly moves larger values to the bottom of the array, so the sort is complete when the two-level loop ends. Compared with one-directional bubble sort, bidirectional bubble sort typically needs fewer passes, although its worst-case time complexity remains O(n^2).
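A minimal sketch of bidirectional bubble sort (often called cocktail shaker sort) is given below. The function name is an assumption, and the direction convention follows the left-to-right then right-to-left order described above: the forward pass carries the largest remaining item to the right end, and the backward pass carries the smallest remaining item to the left end.

```python
def bidirectional_bubble_sort(items):
    """Return a sorted copy of `items` using bubble passes in both directions."""
    a = list(items)
    left, right = 0, len(a) - 1
    swapped = True
    while swapped and left < right:
        swapped = False
        # Left-to-right pass: the largest remaining item ends up at index `right`.
        for i in range(left, right):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        right -= 1
        # Right-to-left pass: the smallest remaining item ends up at index `left`.
        for i in range(right, left, -1):
            if a[i - 1] > a[i]:
                a[i - 1], a[i] = a[i], a[i - 1]
                swapped = True
        left += 1
    return a

print(bidirectional_bubble_sort([5, 1, 4, 2, 8, 0]))  # [0, 1, 2, 4, 5, 8]
```

Because the comparison uses `>`, the same function orders strings lexicographically, which is how it would apply to file names.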
Preferably, the bidirectional bubble sort is applied to file names. For reading, writing, and retrieving massive amounts of data, improving the naming scheme of the data files facilitates fast identification and retrieval.
For example, a given voice data item D1 has associated parameters Ci: C1 speaker name, C2 recording time, C3 recording distance, C4 speaker gender, and C5 speaker age group.
Assume the voice data were recorded under the following conditions:
Data | Speaker name | Recording time | Recording distance | Speaker gender | Speaker age group |
---|---|---|---|---|---|
D1 | Zhang San (ZS) | Daytime (day) | 0.5 m | Male (ma) | 10-18 |
D2 | Zhang San (ZS) | Night (ngt) | 1 m | Male (ma) | 10-18 |
D3 | Li Si (LS) | Daytime (day) | 1 m | Male (ma) | 18-30 |
D4 | Wang Wu (WW) | Night (ngt) | 3 m | Female (fem) | 40-50 |
The text corresponding to voice data D1 and D2 is: turn on the air conditioner.
The text corresponding to voice data D3 and D4 is: turn on the TV.
Voice data D1 to D4 can then be named as follows:
ZS-day-05m-ma-1018;
ZS-ngt-1m-ma-1018;
LS-day-1m-ma-1830;
WW-ngt-3m-fem-4050;
The correspondence with the text data is:
Turn on the air conditioner-ZS-day-05m-ma-1018;
Turn on the air conditioner-ZS-ngt-1m-ma-1018;
Turn on the TV-LS-day-1m-ma-1830;
Turn on the TV-WW-ngt-3m-fem-4050;
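The naming rule can be sketched as a pair of helper functions that build a name from the five parameters and split it back into them. The function names, field keys, and the way distance and age group are encoded (0.5 becomes `05m`, 10-18 becomes `1018`) are inferred from the example names and should be read as assumptions.

```python
FIELDS = ("speaker", "time", "distance", "gender", "age")  # hypothetical field keys

def make_name(speaker, time_code, distance_m, gender, age_range):
    """Build a name such as 'ZS-day-05m-ma-1018' from the five parameters."""
    # Drop the decimal point so that 0.5 -> '05m', 1 -> '1m', 3 -> '3m'.
    distance = f"{distance_m:g}".replace(".", "") + "m"
    low, high = age_range
    return f"{speaker}-{time_code}-{distance}-{gender}-{low}{high}"

def parse_name(name):
    """Split a name back into its five fields."""
    return dict(zip(FIELDS, name.split("-")))

print(make_name("ZS", "day", 0.5, "ma", (10, 18)))  # ZS-day-05m-ma-1018
print(parse_name("WW-ngt-3m-fem-4050")["distance"])  # 3m
```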
If data are stored and retrieved directly under the naming rule above without further organization, storage use is large and retrieval is slow.
With the bidirectional bubble sort, a rule is defined for ordering file names, for example alphabetical order A-Z, comparing the next letter when letters are equal. The following four file names:
ZS-day-05m-ma-1018
ZS-ngt-1m-ma-1018
LS-day-1m-ma-1830
WW-ngt-3m-fem-4050;
are ordered as follows after sorting:
LS-day-1m-ma-1830;
WW-ngt-3m-fem-4050;
ZS-day-05m-ma-1018;
ZS-ngt-1m-ma-1018;
After the names are sorted with the bidirectional bubble sort, the storage regions of related data are more concentrated, and identification is found to be faster at retrieval time.
An index database can also be built over the names of the voice data, using each parameter in turn as an index key and building a multi-level index layer by layer.
For example, in ZS-day-05m-ma-1018 the fields correspond to: speaker name, recording time, recording distance, speaker gender, and speaker age group;
For example, with speaker name as the first-level index, the index information of the voice data is:
"ZS" day-05m-ma-1018;
ngt-1m-ma-1018;
"LS" day-1m-ma-1830;
"WW" ngt-3m-fem-4050;
Under the first-level index "ZS", recording time is the second-level index:
"ZS"-"day" 05m-ma-1018;
"ZS"-"ngt" 1m-ma-1018;
And so on.
If recording distance is the first-level index, the index information of the voice data is:
"05m" ZS-day-ma-1018;
"1m" ZS-ngt-ma-1018;
LS-day-ma-1830;
"3m" WW-ngt-fem-4050;
The second-level index follows by analogy.
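The layered index can be sketched as a nested dictionary keyed by the chosen parameters in order. This is only one possible in-memory representation, assumed here for illustration; the description does not specify how the index database is implemented, and `build_index` and the field keys are hypothetical names.

```python
FIELDS = ("speaker", "time", "distance", "gender", "age")  # hypothetical field keys

def build_index(names, order):
    """Build a nested dict keyed by the fields in `order`; leaves are lists of names."""
    index = {}
    for name in names:
        params = dict(zip(FIELDS, name.split("-")))
        node = index
        # Descend one level per index field, creating levels as needed.
        for field in order[:-1]:
            node = node.setdefault(params[field], {})
        node.setdefault(params[order[-1]], []).append(name)
    return index

names = [
    "ZS-day-05m-ma-1018",
    "ZS-ngt-1m-ma-1018",
    "LS-day-1m-ma-1830",
    "WW-ngt-3m-fem-4050",
]

by_speaker_time = build_index(names, ("speaker", "time"))
print(by_speaker_time["ZS"]["ngt"])  # ['ZS-ngt-1m-ma-1018']

by_distance = build_index(names, ("distance",))
print(by_distance["1m"])  # ['ZS-ngt-1m-ma-1018', 'LS-day-1m-ma-1830']
```

Selecting near-field data then reduces to reading the `05m` and `1m` entries of the distance index rather than scanning every file name.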
With the hierarchical indexing above, when organizing and arranging data, the relevant data can be selected entirely by one or more parameter attributes associated with each voice data item.
For example, in speech recognition training, suppose only near-field data are needed for pre-training preparation, that is, only data with a recording distance of 0.5 m or 1 m. Guided by the index database, the data meeting this condition are imported into a working directory for pre-training data preparation and preprocessed there. Only a smaller subset of the data remains at this point, so the storage space needed is reduced relative to the original total.
Indexing by the naming rule above reduces storage space. File names are simpler and more intuitive across operations such as use, reading, organizing, and scheduling, and are easy to maintain and operate on. The naming is also closer to actual use and the stored files are smaller, which benefits overall data maintenance.
For example, to delete the data with a recording distance of 1 meter from the following data:
ZS-day-05m;
ZS-ngt-1m;
LS-day-1m;
WW-ngt-3m;
all data under the index key recording distance = 1m are deleted, and the remaining data are:
ZS-day-05m;
WW-ngt-3m;
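Deleting by an index key can be sketched as follows, assuming a simple dictionary index over one field of the name. The helper names and the position-based field lookup are illustrative assumptions, not the description's implementation.

```python
def index_by(names, position):
    """Group names by the field at `position` (0 = speaker, 1 = time, 2 = distance)."""
    index = {}
    for name in names:
        index.setdefault(name.split("-")[position], []).append(name)
    return index

def delete_key(index, key):
    """Remove every name under `key`; return (removed names, remaining names)."""
    removed = index.pop(key, [])
    remaining = [name for group in index.values() for name in group]
    return removed, remaining

names = ["ZS-day-05m", "ZS-ngt-1m", "LS-day-1m", "WW-ngt-3m"]
distance_index = index_by(names, 2)
removed, remaining = delete_key(distance_index, "1m")
print(removed)    # ['ZS-ngt-1m', 'LS-day-1m']
print(remaining)  # ['ZS-day-05m', 'WW-ngt-3m']
```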
The file naming scheme above effectively reduces the storage space that file names require for data management and makes searching and retrieving data faster; it also makes batch addition, modification, and deletion of massive data more convenient.
For example, to modify the 1m data in the example above by changing 1m to 3m in batch:
First retrieve all names containing 1m, then modify them. The modified database is:
ZS-day-05m;
ZS-ngt-3m;
LS-day-3m;
WW-ngt-3m;
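The batch modification can be sketched as a rename over the distance field. The function name is hypothetical; matching is done on the whole field rather than on a substring, so a value such as `11m` would not be mistaken for `1m`.

```python
def rename_distance(names, old, new):
    """Replace the distance field `old` with `new` in every matching name."""
    result = []
    for name in names:
        parts = name.split("-")
        if parts[2] == old:  # field 2 is the recording distance
            parts[2] = new
        result.append("-".join(parts))
    return result

names = ["ZS-day-05m", "ZS-ngt-1m", "LS-day-1m", "WW-ngt-3m"]
print(rename_distance(names, "1m", "3m"))
# ['ZS-day-05m', 'ZS-ngt-3m', 'LS-day-3m', 'WW-ngt-3m']
```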
In a preferred naming scheme, if at least one parameter Ck among the parameters Ci is uniquely associated with other parameters, the associated other parameters are omitted from the voice data name.
For the non-associated parameters, each parameter is used in turn as an index key to build a multi-level index to the text data layer by layer;
For the associated parameters, each associated parameter is used in turn as an index key to build a multi-level index to the parameter Ck layer by layer.
For example, speaker name determines speaker age and speaker gender, so the name can retain only the speaker name and omit the speaker's age and gender. Take the earlier names of voice data D1 to D4:
ZS-day-05m-ma-1018;
ZS-ngt-1m-ma-1018;
LS-day-1m-ma-1830;
WW-ngt-3m-fem-4050;
If speaker ZS is a 15-year-old male, LS a 20-year-old male, and WW a 45-year-old female, the names above can be further simplified to:
ZS-day-05m;
ZS-ngt-1m;
LS-day-1m;
WW-ngt-3m;
Further, for the non-associated parameters such as recording time and recording distance, each is used in turn as an index key to build a multi-level index layer by layer, for example:
"ZS" day-05m;
ngt-1m;
"LS" day-1m;
"WW" ngt-3m;
Under the first-level index "ZS", recording time is the second-level index:
"ZS"-"day" 05m;
"ZS"-"ngt" 1m;
Because each speaker's gender and age are uniquely determined, an index to speaker name can be built with gender and age as the index levels, making it easy to retrieve data by speaker name and age, for example:
"ma" 1018-ZS;
1830-LS;
"fem" 4050-WW;
The embodiment above further reduces the storage space taken by names.
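The simplified naming can be sketched with a separate speaker table that holds the attributes uniquely determined by the speaker name. The table layout and the function names `shorten`, `expand`, and `speaker_index` are assumptions for illustration; the age group codes are taken from the original names.

```python
# Hypothetical speaker table: gender and age group are determined by the speaker name.
SPEAKERS = {"ZS": ("ma", "1018"), "LS": ("ma", "1830"), "WW": ("fem", "4050")}

def shorten(name):
    """Drop gender and age group, keeping speaker, time, and distance."""
    return "-".join(name.split("-")[:3])

def expand(short_name, speakers=SPEAKERS):
    """Restore the full name by looking up the speaker's gender and age group."""
    gender, age = speakers[short_name.split("-")[0]]
    return f"{short_name}-{gender}-{age}"

def speaker_index(speakers=SPEAKERS):
    """Build a gender -> age group -> speaker names index."""
    index = {}
    for speaker, (gender, age) in speakers.items():
        index.setdefault(gender, {}).setdefault(age, []).append(speaker)
    return index

print(shorten("ZS-day-05m-ma-1018"))  # ZS-day-05m
print(expand("WW-ngt-3m"))            # WW-ngt-3m-fem-4050
print(speaker_index()["ma"])          # {'1018': ['ZS'], '1830': ['LS']}
```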
The hierarchical index not only makes retrieval fast and convenient but also makes it easy to delete or modify a whole class of data. For example, if a speaker name is found to have been entered incorrectly, it can be corrected directly in the speaker name index without modifying each file one by one. Deleting a class of files works the same way: to delete the data with a recording distance of 3 meters and a night recording time, delete all data that fall under both recording distance 3m and recording time ngt in the index.
4K alignment optimizes the storage region and bidirectional bubble sort orders the data names, improving data read/write speed and identification speed from these two aspects, which facilitates fast retrieval and management.
In addition, data screening identifies the voice training data that is frequently updated and replaced, and the improved data management measures above are applied to that data, improving resource utilization.
The foregoing describes preferred embodiments of the invention. Unless they are obviously contradictory or one presupposes another, the preferred embodiments may be combined in any way. The embodiments and the specific parameters in them serve only to clearly describe the inventor's verification process and do not limit the scope of patent protection, which remains subject to the claims. Any equivalent structural change made on the basis of the description of the invention likewise falls within the scope of protection of the invention.
Claims (6)
1. A method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks, characterized by comprising the following steps:
Screening out voice data, and processing the voice data and the storage it occupies as follows:
Performing 4K alignment on the storage;
Sorting the voice data with a bidirectional bubble sort algorithm;
Counting and managing the voice data by name, the management including access and modification operations.
2. The method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks according to claim 1, characterized in that the screening is based on data storage time, and data whose storage time is shorter than a preset threshold time are identified as the voice data.
3. The method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks according to claim 1, characterized in that sorting the voice data with the bidirectional bubble sort algorithm specifically means sorting by the file names of the voice data files with the bidirectional bubble sort algorithm.
4. The method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks according to claim 3, characterized in that the naming and management rules for the voice data file names are as follows:
The name of each voice data item contains multiple parameters Ci, and each voice data name corresponds to a unique text data item, the text data being the text corresponding to the voice data;
The following indexes are built over the names:
Based on the text data, an index database of the voice data names corresponding to each text data item is built;
An index database is built over the name of each voice data item, using each parameter in turn as an index key and building a multi-level index layer by layer.
5. The method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks according to claim 4, characterized in that, if at least one parameter Ck among the parameters Ci is uniquely associated with other parameters, the associated other parameters are omitted from the voice data name,
For the non-associated parameters, each parameter is used in turn as an index key to build a multi-level index to the text data layer by layer;
For the associated parameters, each associated parameter is used in turn as an index key to build a multi-level index to the parameter Ck layer by layer.
6. The method for fast storage, access, and modification of voice big data on a redundant array of inexpensive disks according to claim 3, characterized in that the access and modification operation is an access and modification operation on all files under the corresponding index entry, performed according to a parameter index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910585870.2A CN110275978A (en) | 2019-07-01 | 2019-07-01 | Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910585870.2A CN110275978A (en) | 2019-07-01 | 2019-07-01 | Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110275978A true CN110275978A (en) | 2019-09-24 |
Family
ID=67963800
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910585870.2A Pending CN110275978A (en) | 2019-07-01 | 2019-07-01 | Quick storage of the voice big data on redundant arrays of inexpensive disks and access amending method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110275978A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20010057276A (en) * | 1999-12-21 | 2001-07-04 | 윤종용 | Apparatus and method for receiving voice data |
CN101996195A (en) * | 2009-08-28 | 2011-03-30 | 中国移动通信集团公司 | Searching method and device of voice information in audio files and equipment |
CN102999601A (en) * | 2012-11-20 | 2013-03-27 | 广东欧珀移动通信有限公司 | Method for sorting files, and multimedia terminal |
CN103974143A (en) * | 2014-05-20 | 2014-08-06 | 北京速能数码网络技术有限公司 | Method and device for generating media data |
CN106448709A (en) * | 2016-09-26 | 2017-02-22 | 珠海格力电器股份有限公司 | Automatic record extraction control method and device, and mobile terminal |
CN106844041A (en) * | 2016-12-29 | 2017-06-13 | 华为技术有限公司 | The method and internal storage management system of memory management |
US20180288595A1 (en) * | 2016-12-09 | 2018-10-04 | Riedel Communications International GmbH | Intercom network, mobile terminal, and method |
Non-Patent Citations (1)
Title |
---|
曾家智 等: "《电子科技大学出版社》" * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190924 |