CN110275978A

CN110275978A - Fast storage and access modification method for voice big data on a redundant array of inexpensive disks

Info

Publication number: CN110275978A
Application number: CN201910585870.2A
Authority: CN
Inventors: 游萌; 何云鹏; 高君效; 许兵
Original assignee: Chengdu Leader Technology Co Ltd
Current assignee: Chengdu Leader Technology Co Ltd; Chipintelli Technology Co Ltd
Priority date: 2019-07-01
Filing date: 2019-07-01
Publication date: 2019-09-24

Abstract

A fast storage and access modification method for voice big data on a redundant array of inexpensive disks, comprising the following steps: screening out voice data, and processing the voice data and the storage it occupies as follows: performing 4K alignment on the storage; sorting the voice data with a two-way bubble sort algorithm; and counting and managing the voice data by name, where the management includes access and modification operations. By using 4K alignment and a two-way bubble sort algorithm, the method optimizes the storage region and orders the data names respectively, improving data read/write speed and recognition speed from these two aspects and facilitating fast retrieval and management. In addition, data screening identifies the voice training data that is frequently updated and replaced, so that the improved data management measures above can be applied to it, raising resource utilization.

Description

Fast storage and access modification method for voice big data on a redundant array of inexpensive disks

Technical field

The invention belongs to the field of artificial intelligence and relates to a voice data management method, in particular to a fast storage and access modification method for voice big data on a redundant array of inexpensive disks.

Background art

A voice file is a recording of a continuous digital signal. Voice file storage keeps such digital signal recordings, in file form, on hard disks and other media. A file stored on a hard disk is often not stored contiguously on the physical medium but is scattered across different partitions and different physical sectors of the disk. There are many data storage methods; a common one distributes the parts that make up a complete file across different physical addresses. In today's mainstream high-capacity backup and big data management systems, redundant disk array systems and methods are commonly used.

Voice data recognition depends on repeated, continuous training of artificial neural networks, which requires massive amounts of voice data. Voice data is generally classified by speaker, utterance time, acquisition distance and so on. Managing this information requires full consideration of the spatial complexity of the storage architecture: in big data storage, the more storage space the same data occupies, the higher the maintenance cost and the longer the scheduling time for data access, which in turn raises the overall cost per unit of storage medium.

Summary of the invention

To overcome the technical deficiencies of the prior art, the invention discloses a fast storage and access modification method for voice big data on a redundant array of inexpensive disks.

It is of the present invention.

The fast storage and access modification method for voice big data on a redundant array of inexpensive disks uses 4K alignment and a two-way bubble sort algorithm to optimize the storage region and to order the data names respectively, improving data read/write speed and recognition speed from these two aspects and facilitating fast retrieval and management. In addition, data screening identifies the voice training data that is frequently updated and replaced, so that the improved data management measures above can be applied to it, raising resource utilization.

Specific embodiment

Specific embodiments of the present invention will be described in further detail below.

The fast storage and access modification method for voice big data on a redundant array of inexpensive disks comprises the following steps:

Voice data is screened out, and the voice data and the storage it occupies are processed as follows:

4K alignment is performed on the storage;

The voice data is sorted with a two-way bubble sort algorithm;

The voice data is counted and managed by name, where the management includes access and modification operations.

Data is screened by storage time: a storage-duration threshold is set, for example 7 days. Data whose storage time exceeds the threshold loses priority, and the system preferentially screens out newly added data whose storage time does not exceed the threshold and optimizes it.

Data screening targets batches of data from a relatively short period. Newer data needed for training can be repaired as necessary and synchronized back to the original data warehouse after repair. Data stored for a shorter time also has higher repair priority and is synchronized back to the original data warehouse once repaired, whereas data stored for a longer time can be synchronized back gradually after several iteration cycles.

Voice data is trained on and updated quickly, whereas system data and other common software data are updated very slowly. Because newly stored voice data is read frequently, the invention screens data by storage duration.
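The screening rule can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `screen_recent`, the mapping from file name to storage timestamp, and the sample dates are all assumptions; only the 7-day example threshold comes from the text.

```python
from datetime import datetime, timedelta

# Example threshold from the text; any other duration could be configured.
STORAGE_AGE_THRESHOLD = timedelta(days=7)


def screen_recent(records, now, threshold=STORAGE_AGE_THRESHOLD):
    """Return the names of records stored no longer than `threshold` ago.

    `records` (hypothetical structure) maps a file name to the datetime
    at which the file was stored.
    """
    return [name for name, stored_at in records.items()
            if now - stored_at <= threshold]


now = datetime(2019, 7, 8)
records = {
    "ZS-day-05m-ma-1018": datetime(2019, 7, 5),   # 3 days old: screened in
    "LS-day-1m-ma-1830": datetime(2019, 6, 1),    # 37 days old: loses priority
    "WW-ngt-3m-fem-4050": datetime(2019, 7, 1),   # exactly 7 days old: screened in
}
recent = screen_recent(records, now)
```

Only the names returned by `screen_recent` would go on to the alignment and sorting steps; older data is left for slower, later synchronization.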

The screened-out data, together with the storage it occupies, undergoes the following optimization.

First, 4K alignment is applied to the storage. 4K alignment makes the operating system's smallest allocation unit correspond to the flash pages, so that writing 4 KB of data completes in a single operation.
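The arithmetic behind 4K alignment can be illustrated with a short sketch. The helper names below are hypothetical and the 512-byte offset is only an example of a misaligned start; the point is that an aligned 4 KB write touches one 4 KB block, while a misaligned one straddles two.

```python
BLOCK_SIZE = 4096  # 4 KiB allocation unit / flash page size assumed here


def align_up(offset, block=BLOCK_SIZE):
    """Round a byte offset up to the next multiple of `block`."""
    return (offset + block - 1) // block * block


def is_aligned(offset, block=BLOCK_SIZE):
    """True when the offset falls exactly on a block boundary."""
    return offset % block == 0


def blocks_touched(offset, length, block=BLOCK_SIZE):
    """Number of blocks spanned by a write of `length` bytes at `offset`."""
    first = offset // block
    last = (offset + length - 1) // block
    return last - first + 1


aligned_cost = blocks_touched(0, 4096)       # write starting on a boundary
misaligned_cost = blocks_touched(512, 4096)  # write starting 512 bytes in
```

Here `aligned_cost` is 1 and `misaligned_cost` is 2, which is why aligning the start of the data region to `align_up(offset)` lets a 4 KB write complete in one operation.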

Voice training data usually consists of larger files made up of many scattered small files. When stored, the small files are dispersed across different disk partitions, and for data in this form 4K alignment brings a more noticeable improvement in disk read/write speed.

For example, file copying was tested with large files and with scattered small files, both on the same hard disk and across different hard disks. The large-file set was about 20 GB of archive-type data, and the small-file set was 200,000 files of no more than 2 MB each. The comparison showed that hard disks with 4K alignment performed noticeably better at copying data, especially for scattered small files, where speed roughly doubled.

Measurements also showed that when a solid-state drive is used as temporary storage for fast data exchange in a temporary swap area, 4 KB random reads and writes are handled most efficiently in actual use.

After the storage has been processed, the data is sorted with a two-way bubble sort algorithm. Two-way bubble sort runs conventional bubble sort in both directions: a pass from left to right is followed by a pass from right to left, and the two passes together complete one round of sorting.

Specifically, two-way bubble sort traverses the sequence to be sorted with two nested loops. The forward pass quickly moves the smaller items toward the top of the array, and the reverse pass, running in the opposite direction, quickly moves the larger items toward the bottom, so the sort is complete when the two-level loop ends. Compared with one-way bubble sort, two-way bubble sort takes less time.
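A minimal sketch of two-way bubble sort (also known as cocktail shaker sort) is given below. The function name is an assumption, and which pass carries the small or the large items is only a convention: in this sketch the forward pass pushes the largest remaining item to the right and the backward pass pushes the smallest to the left. The saving over one-way bubble sort comes from needing fewer passes on many inputs; the worst case remains quadratic.

```python
def bidirectional_bubble_sort(items):
    """Return a new list sorted in ascending order using two-way bubble sort."""
    data = list(items)
    left, right = 0, len(data) - 1
    while left < right:
        swapped = False
        # Forward pass: move the largest remaining item to the right end.
        for i in range(left, right):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        right -= 1
        # Backward pass: move the smallest remaining item to the left end.
        for i in range(right, left, -1):
            if data[i - 1] > data[i]:
                data[i - 1], data[i] = data[i], data[i - 1]
                swapped = True
        left += 1
        # A full round without swaps means the sequence is already sorted.
        if not swapped:
            break
    return data


sorted_values = bidirectional_bubble_sort([5, 1, 4, 2, 3])
```

Because Python compares strings character by character, the same function can order file names alphabetically without modification.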

Preferably, the two-way bubble sort is applied to file names. For reading, writing and retrieving massive amounts of data, improving how data files are named makes them faster to identify and retrieve.

For example, a voice data item D1 has associated parameters Ci, including C1 speaker, C2 acquisition time, C3 acquisition distance, C4 speaker gender, and C5 speaker age range.

Suppose the voice data was acquired under the following conditions:

Data	Speaker name	Acquisition time	Acquisition distance	Speaker gender	Speaker age range
D1	Zhang San (ZS)	Daytime (day)	0.5 m	Male (ma)	10-18
D2	Zhang San (ZS)	Night (ngt)	1 m	Male (ma)	10-18
D3	Li Si (LS)	Daytime (day)	1 m	Male (ma)	18-30
D4	Wang Wu (WW)	Night (ngt)	3 m	Female (fem)	40-50

The text corresponding to voice data D1 and D2 is: open air-conditioning.

The text corresponding to voice data D3 and D4 is: open TV.

Voice data D1 to D4 can then be named as follows:

ZS-day-05m-ma-1018；

ZS-ngt-1m-ma-1018；

LS-day-1m-ma-1830；

WW-ngt-3m-fem-4050；

Their correspondence with the text data is:

Open air-conditioning-ZS-day-05m-ma-1018；

Open air-conditioning-ZS-ngt-1m-ma-1018；

Open TV-LS-day-1m-ma-1830；

Open TV-WW-ngt-3m-fem-4050；
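The naming rule above can be expressed as a small sketch. The `Recording` type and both function names are hypothetical; the field order and abbreviations (for instance "05m" for 0.5 m) follow the example names in the text.

```python
from typing import NamedTuple


class Recording(NamedTuple):
    speaker: str    # C1, e.g. "ZS"
    time: str       # C2, "day" or "ngt"
    distance: str   # C3, e.g. "05m" for 0.5 m
    gender: str     # C4, "ma" or "fem"
    age_range: str  # C5, e.g. "1018" for 10-18
    text: str       # text spoken in the recording


def voice_name(rec):
    """Join the five parameters into the voice data name."""
    return "-".join([rec.speaker, rec.time, rec.distance, rec.gender, rec.age_range])


def text_keyed_name(rec):
    """Prefix the voice data name with its corresponding text."""
    return f"{rec.text}-{voice_name(rec)}"


d1 = Recording("ZS", "day", "05m", "ma", "1018", "Open air-conditioning")
d4 = Recording("WW", "ngt", "3m", "fem", "4050", "Open TV")
d1_name = voice_name(d1)
d4_keyed = text_keyed_name(d4)
```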

Storing and retrieving data directly under the above naming rule takes a large amount of storage space and makes retrieval slow.

With the two-way bubble sort algorithm, a rule is defined for ordering file names, for example alphabetical order A-Z, comparing the next letter when letters are identical. Take the following four file names:

ZS-day-05m-ma-1018

ZS-ngt-1m-ma-1018

LS-day-1m-ma-1830

WW-ngt-3m-fem-4050；

The order after sorting is:

LS-day-1m-ma-1830；

WW-ngt-3m-fem-4050；

ZS-day-05m-ma-1018；

ZS-ngt-1m-ma-1018；

After the names are sorted with the two-way bubble sort algorithm, the storage regions of related data are more concentrated, and recognition is found to be faster when the data is retrieved.

An index database can also be built over the names of the voice data, taking each parameter in turn as an index and building a multilayer index level by level.

For example, in ZS-day-05m-ma-1018 the fields correspond respectively to: speaker name, utterance time, utterance distance, speaker gender, and speaker age;

For example, with speaker name as the first-level index, the index information of the voice data is:

"ZS" day-05m-ma-1018；

ngt-1m-ma-1018；

"LS" day-1m-ma-1830；

"WW" ngt-3m-fem-4050；

Under the first-level index "ZS", utterance time serves as the second-level index;

"ZS"-"day" 05m-ma-1018;

"ZS"-"ngt" 1m-ma-1018;

And so on.

If utterance distance is instead the first-level index, the index information of the voice data is:

"05m"ZS-day-ma-1018；

"1m"ZS-ngt-ma-1018；

LS-day-ma-1830；

"3m"WW-ngt-fem-4050；

Second-level indexes follow in the same way.
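One possible way to build such a layered index is a nested dictionary, sketched below. The function `build_index`, the `FIELDS` labels, and the use of in-memory dictionaries instead of an index database are all assumptions made for illustration.

```python
# Assumed field order of a five-part voice data name.
FIELDS = ("speaker", "time", "distance", "gender", "age")


def build_index(names, levels):
    """Build a nested dict keyed by the fields in `levels`, in that order.

    The innermost values are lists of full names sharing those field values.
    """
    index = {}
    for name in names:
        parts = dict(zip(FIELDS, name.split("-")))
        node = index
        for field in levels[:-1]:
            node = node.setdefault(parts[field], {})
        node.setdefault(parts[levels[-1]], []).append(name)
    return index


names = [
    "ZS-day-05m-ma-1018",
    "ZS-ngt-1m-ma-1018",
    "LS-day-1m-ma-1830",
    "WW-ngt-3m-fem-4050",
]
by_speaker_then_time = build_index(names, ["speaker", "time"])
by_distance = build_index(names, ["distance"])
```

Changing the order of `levels` changes which parameter acts as the first-level index, so the speaker-first and distance-first layouts shown above come from the same data.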

With the hierarchical indexing above, when data is organized and arranged, the relevant data can be selected entirely by one or more parameter attributes associated with each voice data item.

For example, in speech recognition training, suppose only near-field data is to be prepared before training, so only data with an utterance distance of 0.5 m or 1 m is used. Guided by the index database, the data meeting this condition is brought into a working directory for pre-training data preparation and preprocessed there. Only a smaller portion of the data remains at this point, and the storage space needed is correspondingly less than that of the original full data set.
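The near-field selection can be sketched as a lookup in a distance index. The helper names are hypothetical, and copying the selected files into a working directory is left out; the sketch only shows how the index narrows the data set.

```python
from collections import defaultdict


def index_by_distance(names):
    """Group voice data names by their distance field (third field)."""
    index = defaultdict(list)
    for name in names:
        index[name.split("-")[2]].append(name)
    return index


def select_near_field(names, distances=("05m", "1m")):
    """Return the names whose distance is one of the near-field values."""
    index = index_by_distance(names)
    return [name for d in distances for name in index.get(d, [])]


all_names = [
    "ZS-day-05m-ma-1018",
    "ZS-ngt-1m-ma-1018",
    "LS-day-1m-ma-1830",
    "WW-ngt-3m-fem-4050",
]
near_field = select_near_field(all_names)
```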

With the above naming and indexing rules, storage space is reduced. File names are simpler and more intuitive across operations such as use, reading, organizing and scheduling, and are easier to maintain and operate on. The file names are also closer to actual use and the files take less storage, which benefits overall maintenance of the data.

For example, to delete the data with an utterance distance of 1 m from the following data:

ZS-day-05m；

ZS-ngt-1m；

LS-day-1m；

WW-ngt-3m；

All data indexed under utterance distance = 1m is then deleted, and the remaining data after deletion is as follows:

ZS-day-05m；

WW-ngt-3m；
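The deletion above can be sketched as a partition of the name list by its distance field. The function name is an assumption, and removing the underlying files is not shown.

```python
def delete_by_distance(names, distance):
    """Split names into (kept, removed) according to the distance field."""
    kept, removed = [], []
    for name in names:
        # The distance is the third hyphen-separated field of the name.
        if name.split("-")[2] == distance:
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


before = ["ZS-day-05m", "ZS-ngt-1m", "LS-day-1m", "WW-ngt-3m"]
kept, removed = delete_by_distance(before, "1m")
```

Returning the removed names as well makes it possible to delete the corresponding files in one batch afterwards.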

The above naming scheme effectively reduces the storage space that file names require for data management and makes data search and retrieval faster; it also makes batch addition, modification and deletion of massive data more convenient.

For example, to modify the 1m data in the example above by changing 1m to 3m in batch:

All names containing 1m are first retrieved and then modified. The modified database is:

ZS-day-05m；

ZS-ngt-3m；

LS-day-3m；

WW-ngt-3m；
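The batch modification can be sketched as a rewrite of the distance field. The function name is hypothetical. The sketch compares the whole field rather than searching for the substring "1m", an assumption made so that a value such as "11m" would not be changed by accident.

```python
def rename_distance(names, old, new):
    """Return names with the distance field changed from `old` to `new`."""
    renamed = []
    for name in names:
        parts = name.split("-")
        if parts[2] == old:  # exact match on the distance field only
            parts[2] = new
        renamed.append("-".join(parts))
    return renamed


original = ["ZS-day-05m", "ZS-ngt-1m", "LS-day-1m", "WW-ngt-3m"]
modified = rename_distance(original, "1m", "3m")
```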

In a preferred naming scheme, at least one parameter Ck among the parameters Ci is uniquely associated with other parameters, and those associated parameters are then omitted from the voice data name.

For the non-associated parameters, a multilayer index to the text data is built level by level, with each parameter as an index;

For the associated parameters, a multilayer index to parameter Ck is built level by level, with each associated parameter as an index.

For example, if a speaker's name corresponds to that speaker's age and gender, only the speaker name needs to be kept in the name, and the speaker's age and gender can be omitted. Take the earlier names of voice data D1 to D4:

ZS-day-05m-ma-1018；

ZS-ngt-1m-ma-1018；

LS-day-1m-ma-1830；

WW-ngt-3m-fem-4050；

If speaker ZS is a 15-year-old male, LS is a 20-year-old male, and WW is a 45-year-old female, the above names can be further simplified to:

ZS-day-05m；

ZS-ngt-1m；

LS-day-1m；

WW-ngt-3m；
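The simplification works because the omitted fields can be recovered from the speaker name. A minimal sketch, assuming a hypothetical `SPEAKERS` lookup table that records each speaker's gender and age range once:

```python
# Hypothetical lookup: speaker name -> (gender, age range).
SPEAKERS = {
    "ZS": ("ma", "1018"),
    "LS": ("ma", "1830"),
    "WW": ("fem", "4050"),
}


def simplify(name):
    """Drop gender and age from a five-field name after checking the lookup."""
    speaker, time, distance, gender, age = name.split("-")
    if SPEAKERS.get(speaker) != (gender, age):
        raise ValueError(f"{name}: gender/age do not match the speaker table")
    return "-".join([speaker, time, distance])


def expand(short_name):
    """Restore the full five-field name from a simplified one."""
    speaker, time, distance = short_name.split("-")
    gender, age = SPEAKERS[speaker]
    return "-".join([speaker, time, distance, gender, age])


short = simplify("ZS-day-05m-ma-1018")
full = expand("WW-ngt-3m")
```

The check in `simplify` guards the assumption that each speaker name maps to exactly one gender and age range; without that uniqueness the omitted fields could not be restored.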

Further, for the non-associated parameters such as utterance time and utterance distance, a multilayer index is built level by level with each parameter as an index;

For example:

"ZS" day-05m；

ngt-1m；

"LS" day-1m；

"WW" ngt-3m；

"ZS"-"day" 05m;

"ZS"-"ngt" 1m;

Since each speaker's gender and age are uniquely determined, gender and age can serve as index levels for an index to the speaker name, making it easy to retrieve data by speaker name and age;

For example:

"ma" 1018-ZS;

1830-LS;

"fem" 4050-WW;

The above embodiment further reduces the storage space taken by names.
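The gender and age index to speaker names can be sketched as a two-level dictionary, followed by a lookup of the simplified file names. Both function names and the `speakers` table are assumptions for illustration.

```python
def index_speakers(speakers):
    """Build gender -> age range -> [speaker names] from a speaker table."""
    index = {}
    for name, (gender, age) in speakers.items():
        index.setdefault(gender, {}).setdefault(age, []).append(name)
    return index


def files_for(index, files, gender, age):
    """Return simplified file names whose speaker matches gender and age."""
    wanted = set(index.get(gender, {}).get(age, []))
    return [f for f in files if f.split("-")[0] in wanted]


speakers = {"ZS": ("ma", "1018"), "LS": ("ma", "1830"), "WW": ("fem", "4050")}
files = ["ZS-day-05m", "ZS-ngt-1m", "LS-day-1m", "WW-ngt-3m"]
speaker_index = index_speakers(speakers)
young_male_files = files_for(speaker_index, files, "ma", "1018")
```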

The hierarchical index not only makes retrieval fast and convenient but also makes it easy to delete or modify a class of data. For example, if a speaker's name is found to have been entered incorrectly, it can be corrected directly through the speaker-name index instead of modifying each file one by one. Deleting a class of files works the same way: to delete the data with an utterance distance of 3 meters and an utterance time of night, all data under both utterance distance 3m and utterance time ngt in the index is deleted.

By using 4K alignment and a two-way bubble sort algorithm, the storage region is optimized and the data names are ordered respectively, improving data read/write speed and recognition speed from these two aspects and facilitating fast retrieval and management.

In addition, data screening identifies the voice training data that is frequently updated and replaced, so that the improved data management measures above can be applied to it, raising resource utilization.

The above describes the preferred embodiments of the invention. Provided the preferred embodiments are not clearly contradictory and one does not presuppose another, they may be combined in any way. The specific parameters in the embodiments are given only to clearly describe the inventors' verification process and do not limit the scope of patent protection of the invention, which remains subject to the claims. Any equivalent structural change made using the description of the invention likewise falls within the scope of protection of the invention.

Claims

1. A fast storage and access modification method for voice big data on a redundant array of inexpensive disks, characterized by comprising the following steps:

performing 4K alignment on the storage;

sorting the voice data with a two-way bubble sort algorithm;

2. The fast storage and access modification method for voice big data on a redundant array of inexpensive disks according to claim 1, characterized in that the screening is based on data storage time, and data whose storage time is shorter than a preset threshold time is identified as voice data.

3. The fast storage and access modification method for voice big data on a redundant array of inexpensive disks according to claim 1, characterized in that sorting the voice data with the two-way bubble sort algorithm specifically means sorting by the file names of the voice data files with the two-way bubble sort algorithm.

4. The fast storage and access modification method for voice big data on a redundant array of inexpensive disks according to claim 3, characterized in that the naming and management rules for the voice data file names are as follows:

the name of each voice data item contains multiple parameters Ci, and each voice data name corresponds to unique text data, the text data being the text corresponding to the voice data;

the following indexes are established for the names:

with the text data as the basis, an index database of the voice data names corresponding to each text data is established;

an index database is established over the name of each voice data item, taking each parameter in turn as an index and building a multilayer index level by level.

5. The fast storage and access modification method for voice big data on a redundant array of inexpensive disks according to claim 4, characterized in that at least one parameter Ck among the parameters Ci is uniquely associated with other parameters, and the associated other parameters are then omitted from the voice data name,

6. The fast storage and access modification method for voice big data on a redundant array of inexpensive disks according to claim 3, characterized in that the access modification operation is an access or modification operation on all files under the corresponding index, carried out according to the parameter index.