CN107342079A - A system for collecting real voice over the internet - Google Patents
A system for collecting real voice over the internet
- Publication number
- CN107342079A CN107342079A CN201710543472.5A CN201710543472A CN107342079A CN 107342079 A CN107342079 A CN 107342079A CN 201710543472 A CN201710543472 A CN 201710543472A CN 107342079 A CN107342079 A CN 107342079A
- Authority
- CN
- China
- Prior art keywords
- speech data
- server
- speech
- client
- statement text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/063 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/01 — Assessment or evaluation of speech recognition systems
- G10L15/07 — Adaptation to the speaker
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/30 — Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L25/30 — Speech or voice analysis techniques characterised by the use of neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
A system for collecting real voice over the internet, comprising a server and a client connected over a network. The server: splits pre-stored text material into sentence texts whose length is suitable for deep neural network training; sends a sentence text to be read aloud to any non-specific user who has obtained access to the client; and receives the speech data corresponding to the sentence text. The client: sends a read-aloud request for a sentence text to the server and receives the sentence text to be read; records the user reading the sentence text aloud and sends the collected speech data to the server. Compared with existing voice-collection methods, this collection method requires no large amount of post-processing and proofreading of the recorded audio files, and each collected audio file corresponds to the text material that was read. In addition, by evaluating the collected speech data, high-quality collection of real voice is achieved.
Description
Technical field
The present invention relates to the technical field of voice collection, and in particular to a system for collecting real voice over the internet.
Background art
Speech recognition based on deep neural networks has developed rapidly in recent years and is widely applied. This technology requires pre-labeled speech data (data with word-to-voice correspondence) to be fed into a neural network for training. The quality and quantity of the labeled speech data are crucial to the recognition result: the more labeled data there is, the better the training; and the higher the quality of the labeled speech data, i.e. the closer it is to real human speech, the better the trained deep neural network recognizes real human speech.
At present, the labeled speech data sets used in deep learning mainly come from the following sources:
a. recruiting personnel to read text material aloud and recording them, so as to collect voice samples;
b. using audio files in the public domain to obtain voice samples;
c. developing a voice input method and collecting users' voice samples through it, such as the iFLYTEK voice input method;
d. providing the voice assistant of an operating system and collecting voice samples as users interact with it through a client, such as Cortana in Microsoft's desktop Windows 10 and Apple's Siri;
e. synthesizing speech directly from text material using speech synthesis technology.
The above voice-collection techniques have the following problems:
1. With recruited readers, the collected audio files must later be split into small files of about 10 seconds, and each split file must be matched to the corresponding split of the text material; this requires a large amount of post-processing and proofreading. The collection scope is also small, and only a limited number of samples can be collected each time.
2. With public-domain audio files, the files usually lack corresponding text material and are generally too large, requiring a large amount of later transcription, segmentation and proofreading.
3. With voice-input-method collection, the collected voice samples cannot be guaranteed to correspond to accurate text. The samples also vary widely in length and are mixed with many useless samples, so sample quality cannot be guaranteed and a large amount of post-processing and proofreading is needed.
4. Voice-assistant collection has the same shortcomings as voice-input-method collection.
5. With speech synthesis, the synthesized voice differs considerably from real voice, which is unfavorable for a deep neural network learning real speech.
Summary of the invention
The present application provides a system for collecting real voice over the internet, comprising a server and a client connected over a network.
The server performs:
splitting pre-stored text material into sentence texts whose length is suitable for deep neural network training;
sending a sentence text to be read aloud to a non-specific user who has obtained access to the client;
receiving speech data corresponding to the sentence text.
The client performs:
sending a read-aloud request for a sentence text to the server and receiving the sentence text to be read aloud;
recording the user reading the sentence text aloud and sending the collected speech data to the server.
In one embodiment, the system further comprises a speech evaluation module that evaluates the speech data.
In one embodiment, the speech evaluation module evaluates the speech data by computing a score for the speech data from the noise level and from whether the user read according to the sentence text.
In one embodiment, the speech evaluation module is integrated in the server, and the server further performs: evaluating the speech data; marking speech data that passes the evaluation as valid and saving it in the memory bank together with its corresponding sentence text; otherwise, marking speech data that fails the evaluation as invalid.
In one embodiment, the speech evaluation module is integrated in the client, and the client further performs: evaluating the speech data; marking speech data that passes the evaluation as valid and sending it to the server; otherwise, marking speech data that fails the evaluation as invalid.
In one embodiment, the system further comprises a third-party detection platform connected over the network to both the client and the server.
The client performs:
sending the collected speech data to the third-party detection platform.
The third-party detection platform has the speech evaluation module built in and performs:
evaluating the speech data; marking speech data that passes the evaluation as valid and forwarding it to the server; otherwise, marking speech data that fails the evaluation as invalid.
In one embodiment, the server integrates a spot-check module, and the server further performs: randomly selecting the saved valid speech data for manual spot checks.
In one embodiment, the program executed by the client runs on at least one of: a smart device, a personal computer, and a browser web page.
According to the collection system of the above embodiments, the pre-stored text material is split into sentence texts whose length is suitable for deep neural network training, a sentence text to be read aloud is sent in response to a user's read-aloud request, and the user's voice reading the sentence text is recorded. Compared with existing voice-collection methods, this example needs no later segmentation of the collected audio files and no large amount of post-processing and proofreading; moreover, each collected audio file corresponds to the text material that was read. In addition, by evaluating the collected speech data and storing and marking as valid the data that passes the evaluation, high-quality collection of real voice is achieved.
Brief description of the drawings
Fig. 1 is a working diagram of the collection system of embodiment one;
Fig. 2 is a working diagram of the collection system of embodiment two;
Fig. 3 is a working diagram of the collection system of embodiment three.
Detailed description of the embodiments
The present invention is described in further detail below through embodiments in conjunction with the accompanying drawings.
The embodiments of the present invention address the current problems that labeled real-voice data samples for training deep neural networks are scarce and that collecting such samples is expensive. This example provides a system for collecting real voice over the internet; after simple processing, the speech data it collects can be used for training, validating and testing deep-learning neural networks.
Embodiment one:
The internet-based real-voice collection system of this example comprises a server 1 and a client 2; its working diagram is shown in Fig. 1. The server 1 and the client 2 establish a network connection, and over the internet the client 2 records a non-specific user reading a sentence text aloud and sends the recording to the server 1, thereby collecting real voice. Here a non-specific user means any user: any user can register with the server 1 and request to read sentence texts aloud, which broadens the range of voice sampling.
Specifically, to avoid having to split the collected audio files and to make them better suited as samples for deep neural network speech recognition, the server 1 splits the large amount of pre-stored text material into sentence texts whose length is suitable for deep neural network training, e.g. sentence texts that take roughly 10 seconds to read aloud.
After a non-specific user registers an account through the client 2, he or she becomes a specific user who may read sentence texts aloud. For example, once the non-specific user follows the registration prompts of the client 2 and accepts its terms of use, that user obtains access to the client 2. The user can then send a read-aloud request to the server 1 through the client 2; the client 2 receives the sentence text to be read, and as the user reads it aloud, the recording hardware attached to the client 2 is triggered and records the user's real voice. When the user finishes reading, the client 2 sends the collected speech data to the server 1, which receives and stores it; the server 1 thus collects voice information matched to its corresponding text.
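The request/record/upload exchange described above can be sketched with in-memory stand-ins; class and method names are illustrative, and real audio capture and networking are replaced by placeholders:

```python
class Server:
    """In-memory stand-in for server 1 (no real networking)."""
    def __init__(self, sentence_texts):
        self.sentence_texts = list(sentence_texts)
        self._next = 0
        self.received = []  # (sentence_text, speech_data) pairs

    def assign_sentence(self):
        # Respond to a read-aloud request with the next sentence text.
        text = self.sentence_texts[self._next % len(self.sentence_texts)]
        self._next += 1
        return text

    def receive(self, sentence_text, speech_data):
        # Store the recording together with the sentence text it corresponds to.
        self.received.append((sentence_text, speech_data))


class Client:
    """Stand-in for client 2; record() is a placeholder for microphone capture."""
    def __init__(self, server):
        self.server = server

    def record(self, sentence_text):
        return f"<audio: {sentence_text}>"  # no real recording hardware here

    def read_aloud_session(self):
        sentence = self.server.assign_sentence()  # read-aloud request
        audio = self.record(sentence)             # user reads, recording triggered
        self.server.receive(sentence, audio)      # upload the collected speech data
```

Because the server hands out the text and receives the recording in the same session, every stored audio sample arrives already paired with its text, which is the point of the scheme.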
To ensure the validity of the collected voice information and to avoid useless samples, each piece of collected speech data is automatically checked and given an evaluation score; only speech data whose score exceeds a preset threshold is stored on the server 1. This example therefore also includes a speech evaluation module 3, which evaluates the speech data and decides from the evaluation score whether the collected speech data meets the requirements. The factors evaluated are the noise level and whether the user read according to the sentence text, so the speech evaluation module 3 computes the score of the speech data from the noise level and from whether the user read according to the sentence text.
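A minimal sketch of such a scoring rule, assuming both factors have already been normalized to [0, 1]; the weights and threshold are illustrative choices, not taken from the patent:

```python
def evaluate_speech(noise_level, text_similarity,
                    noise_weight=0.4, similarity_weight=0.6, threshold=0.6):
    """Score a recording from the two factors named in the text.

    noise_level: 0.0 (clean) to 1.0 (all noise), assumed pre-normalized.
    text_similarity: 0.0 to 1.0, how closely the recording matches the
    sentence text. Returns (score, passed).
    """
    score = noise_weight * (1.0 - noise_level) + similarity_weight * text_similarity
    return score, score >= threshold
```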
To evaluate whether the user read according to the sentence text, the speech evaluation module 3 uses a reference voice corresponding to that sentence text; the reference voice can be an existing collected labeled voice or a synthesized artificial voice. The speech evaluation module 3 compares the similarity between the reference voice and the collected voice and scores according to the similarity.
The similarity between the reference voice and the collected voice is computed as follows: a dynamic time warping algorithm first finds the best alignment between the features of the collected utterance and those of the reference utterance; the Levenshtein distance algorithm then computes the distance between the two sequences; the similarity between the two utterances is obtained from this distance, and the score is given according to the similarity.
In this example, the speech evaluation module 3 is integrated in the server 1. After the server 1 receives the speech data corresponding to a sentence text, it evaluates the data through the speech evaluation module 3; speech data that passes the evaluation is marked as valid and saved in the memory bank 4 together with its corresponding sentence text, while speech data that fails is marked as invalid. The server 1 may save the invalidly marked speech data in the memory bank 4 or discard it.
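The marking-and-storage step can be sketched as follows; class and field names are illustrative, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class SpeechRecord:
    sentence_text: str
    speech_data: str   # audio payload (stand-in type)
    valid: bool        # valid/invalid mark from the evaluation

class MemoryBank:
    """Sketch of memory bank 4: each recording is kept together with its
    sentence text and its mark."""
    def __init__(self):
        self.records = []

    def save(self, sentence_text, speech_data, passed_evaluation):
        self.records.append(SpeechRecord(sentence_text, speech_data, passed_evaluation))

    def valid_records(self):
        return [r for r in self.records if r.valid]
```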
Further, to guarantee the quality of the uploaded validly marked speech data, the server 1 integrates a spot-check module, which randomly selects validly marked speech data stored in the memory bank 4 for manual spot checks. The information so obtained is used to adjust the scoring criteria for the speech data, and invalid samples that automatic checking cannot detect can be rejected, further improving the quality of the real-voice sample set.
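A sketch of the random draw for manual spot checks; the sampling fraction is an assumption, since the text only says the selection is random:

```python
import random

def spot_check_sample(valid_records, fraction=0.05, seed=None):
    """Draw a random subset of the stored valid speech data for manual
    review; fraction is an illustrative parameter."""
    if not valid_records:
        return []
    k = max(1, int(len(valid_records) * fraction))
    return random.Random(seed).sample(valid_records, k)
```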
It should be noted that, to broaden the sample range of the real-voice sample set, the program executed by the client 2 of this example can take the form of a stand-alone application on a mobile phone, tablet or personal computer, a functional module integrated in another application, a browser web-page application, or an executable program on customized dedicated hardware; that is, the program executed by the client runs on at least one of: a smart device, a personal computer, and a browser web page. Smart devices include but are not limited to smartphones, tablets, smart watches, game consoles, dedicated recorders and smart home-appliance controllers. Correspondingly, the program executed by the server 1 can be deployed on a dedicated server or in the cloud, i.e. the server 1 can be a cloud server or an ordinary server.
Embodiment two:
Based on embodiment one, this example differs from embodiment one in that the speech evaluation module 3 is integrated in the client 2; its working diagram is shown in Fig. 2. After the client 2 records the user reading a sentence text aloud, it evaluates the speech data through the speech evaluation module 3; speech data that passes the evaluation is marked as valid and sent to the server 1, which saves it in the memory bank 4 together with its corresponding sentence text; speech data that fails the evaluation is marked as invalid, and the client 2 can either discard it directly or transmit it to be saved in the memory bank 4 on the server 1.
Embodiment three:
Based on embodiment one, this example differs from embodiment one in that it also includes a third-party detection platform 5, connected over the network to both the client 2 and the server 1; its working diagram is shown in Fig. 3. The client 2 sends the collected speech data directly to the third-party detection platform 5, which has the speech evaluation module 3 built in and evaluates the speech data through it. Speech data that passes the evaluation is marked as valid and forwarded to the server 1, which saves it in the memory bank 4 together with its corresponding sentence text; speech data that fails the evaluation is marked as invalid, and the third-party detection platform 5 can either discard it directly or transmit it to be saved in the memory bank 4 on the server 1.
The specific examples above are used to illustrate the present invention; they are only intended to aid understanding and do not limit the present invention. For those skilled in the art, simple deductions, variations or substitutions can also be made according to the idea of the present invention.
Claims (8)
- 1. A system for collecting real voice over the internet, characterized in that it comprises: a server and a client; the server and the client are connected over a network; the server performs: splitting pre-stored text material into sentence texts whose length is suitable for deep neural network training; sending a sentence text to be read aloud to a non-specific user who has obtained access to the client; receiving speech data corresponding to the sentence text; the client performs: sending a read-aloud request for a sentence text to the server and receiving the sentence text to be read aloud; recording the user reading the sentence text aloud and sending the collected speech data to the server.
- 2. The collection system of claim 1, characterized in that it further comprises a speech evaluation module, and the speech evaluation module evaluates the speech data.
- 3. The collection system of claim 2, characterized in that the speech evaluation module evaluates the speech data by computing a score for the speech data from the noise level and from whether the user read according to the sentence text.
- 4. The collection system of claim 2, characterized in that the speech evaluation module is integrated in the server, and the server further performs: evaluating the speech data; marking speech data that passes the evaluation as valid and saving it in a memory bank together with its corresponding sentence text; and otherwise marking speech data that fails the evaluation as invalid.
- 5. The collection system of claim 2, characterized in that the speech evaluation module is integrated in the client, and the client further performs: evaluating the speech data; marking speech data that passes the evaluation as valid and sending it to the server; and otherwise marking speech data that fails the evaluation as invalid.
- 6. The collection system of claim 2, characterized in that it further comprises a third-party detection platform connected over the network to the client and to the server respectively; the client performs: sending the collected speech data to the third-party detection platform; the third-party detection platform has the speech evaluation module built in and performs: evaluating the speech data; marking speech data that passes the evaluation as valid and forwarding it to the server; and otherwise marking speech data that fails the evaluation as invalid.
- 7. The collection system of any one of claims 4-6, characterized in that the server integrates a spot-check module and further performs: randomly selecting the saved valid speech data for manual spot checks.
- 8. The collection system of claim 1, characterized in that the program executed by the client runs on at least one of: a smart device, a personal computer, and a browser web page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710543472.5A CN107342079A (en) | 2017-07-05 | 2017-07-05 | A kind of acquisition system of the true voice based on internet |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107342079A true CN107342079A (en) | 2017-11-10 |
Family
ID=60218438
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710543472.5A Pending CN107342079A (en) | 2017-07-05 | 2017-07-05 | A kind of acquisition system of the true voice based on internet |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107342079A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1940915A (en) * | 2005-09-29 | 2007-04-04 | 国际商业机器公司 | Corpus expansion system and method |
US20070168578A1 (en) * | 2005-10-27 | 2007-07-19 | International Business Machines Corporation | System and method for data collection interface creation and data collection administration |
CN105873050A (en) * | 2010-10-14 | 2016-08-17 | 阿里巴巴集团控股有限公司 | Wireless service identity authentication, server and system |
CN102176222A (en) * | 2011-03-18 | 2011-09-07 | 北京科技大学 | Multi-sensor information collection analyzing system and autism children monitoring auxiliary system |
CN103198828A (en) * | 2013-04-03 | 2013-07-10 | 中金数据系统有限公司 | Method and system of construction of voice corpus |
CN104754112A (en) * | 2013-12-31 | 2015-07-01 | 中兴通讯股份有限公司 | User information obtaining method and mobile terminal |
CN104732977A (en) * | 2015-03-09 | 2015-06-24 | 广东外语外贸大学 | On-line spoken language pronunciation quality evaluation method and system |
CN105825856A (en) * | 2016-05-16 | 2016-08-03 | 四川长虹电器股份有限公司 | Independent learning method for vehicle-mounted speech recognition module |
CN106023991A (en) * | 2016-05-23 | 2016-10-12 | 丽水学院 | Handheld voice interaction device and interaction method orienting to multi-task interaction |
Non-Patent Citations (1)
Title |
---|
IAN LANE等: "Tools for collecting speech corpora via Mechanical-Turk", 《PROCEEDINGS OF THE NAACL HLT 2010 WORKSHOP ON CREATING SPEECH AND LANGUAGE DATA WITH AMAZON’S MECHANICAL TURK》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108010539A (en) * | 2017-12-05 | 2018-05-08 | 广州势必可赢网络科技有限公司 | Voice quality evaluation method and device based on voice activation detection |
CN107818797A (en) * | 2017-12-07 | 2018-03-20 | 苏州科达科技股份有限公司 | Voice quality assessment method, apparatus and its system |
CN107818797B (en) * | 2017-12-07 | 2021-07-06 | 苏州科达科技股份有限公司 | Voice quality evaluation method, device and system |
CN108696622A (en) * | 2018-05-28 | 2018-10-23 | 成都昊铭科技有限公司 | Voice without interface wakes up test device, system and method |
CN110797001A (en) * | 2018-07-17 | 2020-02-14 | 广州阿里巴巴文学信息技术有限公司 | Method and device for generating voice audio of electronic book and readable storage medium |
CN110797001B (en) * | 2018-07-17 | 2022-04-12 | 阿里巴巴(中国)有限公司 | Method and device for generating voice audio of electronic book and readable storage medium |
CN111210826A (en) * | 2019-12-26 | 2020-05-29 | 深圳市优必选科技股份有限公司 | Voice information processing method and device, storage medium and intelligent terminal |
CN111210826B (en) * | 2019-12-26 | 2022-08-05 | 深圳市优必选科技股份有限公司 | Voice information processing method and device, storage medium and intelligent terminal |
CN113422825A (en) * | 2021-06-22 | 2021-09-21 | 读书郎教育科技有限公司 | System and method for assisting in culturing reading interests |
CN113422825B (en) * | 2021-06-22 | 2022-11-08 | 读书郎教育科技有限公司 | System and method for assisting in culturing reading interests |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107342079A (en) | A kind of acquisition system of the true voice based on internet | |
Massucci et al. | Measuring the academic reputation through citation networks via PageRank | |
CN110457432B (en) | Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium | |
US20200073996A1 (en) | Methods and Systems for Domain-Specific Disambiguation of Acronyms or Homonyms | |
CN109783631B (en) | Community question-answer data verification method and device, computer equipment and storage medium | |
CN105956053A (en) | Network information-based search method and apparatus | |
US11182605B2 (en) | Search device, search method, search program, and recording medium | |
CN111524578A (en) | Psychological assessment device, method and system based on electronic psychological sand table | |
CN111967261B (en) | Cancer stage information processing method, device and storage medium | |
CN111916110B (en) | Voice quality inspection method and device | |
Bianchi et al. | Exploring the potentialities of automatic extraction of university webometric information | |
Kim et al. | Speech intelligibility estimation using multi-resolution spectral features for speakers undergoing cancer treatment | |
Heeringa et al. | Computational dialectology | |
Sandhya et al. | Smart attendance system using speech recognition | |
CN110738032B (en) | Method and device for generating judge paperwork thinking section | |
CN109272262A (en) | A kind of analysis method of natural language feature | |
Yang et al. | Person authentication using finger snapping—a new biometric trait | |
CN108898439A (en) | A kind of information recommendation method based on sight spot | |
KR101838089B1 (en) | Sentimetal opinion extracting/evaluating system based on big data context for finding welfare service and method thereof | |
Lei et al. | Robust scream sound detection via sound event partitioning | |
Volkova et al. | Light CNN architecture enhancement for different types spoofing attack detection | |
Heidari et al. | Investigation of the natural frequency of the structure and earthquake frequencies in the frequency domain using a discrete wavelet | |
Chen et al. | Music Feature Extraction Method Based on Internet of Things Technology and Its Application | |
Nerbonne et al. | Some further dialectometrical steps | |
SHAFIEE et al. | Research self-efficacy and career decision-making self-efficacy of students at shiraz university of medical sciences: an explanatory model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20171110 |