CN109902199A - Near-field corpus acquisition method and device

Info

Publication number: CN109902199A
Application number: CN201910156714.4A
Authority: CN
Inventors: 丁伟; 曾敏; 谢世波
Original assignee: Shenzhen Wewins Wireless Communication Technology Co Ltd
Current assignee: Shenzhen Wewins Wireless Communication Technology Co Ltd
Priority date: 2019-03-01
Filing date: 2019-03-01
Publication date: 2019-06-18

Abstract

The present invention relates to the technical field of speech recognition and discloses a near-field corpus acquisition method and device. The method comprises: obtaining an acquisition task; collecting the speech required by the acquisition task through a voice program; and uploading the speech to a server and converting the speech into a corpus file of a preset format. With voice programs such as WeChat, a large amount of corpus can be collected quickly and converted automatically into corpus files of the preset format, which facilitates automatic audit, shortens corpus acquisition time, improves corpus quality, and improves corpus collection efficiency.

Description

Near-field corpus acquisition method and device

Technical field

The present invention relates to the technical field of speech recognition, and more particularly to a near-field corpus acquisition method and device.

Background art

At present, corpus acquisition methods for speech recognition vary widely, and so does the equipment used for acquisition. Because speech is spontaneous and volume differs from speaker to speaker, the quality of the collected corpus cannot be guaranteed.

The quality of speech recognition results depends largely on the quality of the corpus. When corpus quality cannot be guaranteed, it is difficult to improve recognition performance further, whatever improvement technique is used.

Summary of the invention

The main object of the present invention is to provide a near-field corpus acquisition method and device. With voice programs such as WeChat, a large amount of corpus can be collected quickly and converted automatically into corpus files of a preset format, which facilitates automatic audit, shortens corpus acquisition time, improves corpus quality, and improves corpus collection efficiency.

To achieve the above object, the present invention provides a near-field corpus acquisition method, comprising:

obtaining an acquisition task;

collecting the speech required by the acquisition task through a voice program;

uploading the speech to a server, and converting the speech into a corpus file of a preset format.

Optionally, the acquisition task includes: an entry, an entry category, and task control information.

Optionally, collecting the speech required by the acquisition task through a voice program includes:

getting the acquisition task through the voice program;

reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;

collecting the speech required by the acquisition task according to the collection quantity.

Optionally, after uploading the speech to a server and converting the speech into a corpus file of a preset format, the method further includes:

recognizing the corpus file through a third-party online speech recognition platform;

auditing the corpus file automatically according to the task control information; if the audit passes, storing the corpus file in a corpus; otherwise, performing manual audit.

Optionally, the voice object information includes accent, age bracket, and gender.

As another aspect of the present invention, a near-field corpus acquisition device is provided, comprising:

an obtaining module, configured to obtain an acquisition task;

an acquisition module, configured to collect the speech required by the acquisition task through a voice program;

a conversion module, configured to upload the speech to a server and convert the speech into a corpus file of a preset format.

Optionally, the acquisition module includes:

a getting unit, configured to get the acquisition task through the voice program;

a reading unit, configured to read the task control information in the acquisition task, the task control information including voice object information and a collection quantity;

an acquisition unit, configured to collect the speech required by the acquisition task according to the collection quantity.

Optionally, the device further includes:

an auditing module, configured to recognize the corpus file through a third-party online speech recognition platform, and to audit the corpus file automatically according to the task control information; if the audit passes, the corpus file is stored in a corpus; otherwise, manual audit is performed.

The present invention proposes a near-field corpus acquisition method and device. The method comprises: obtaining an acquisition task; collecting the speech required by the acquisition task through a voice program; and uploading the speech to a server and converting the speech into a corpus file of a preset format. With voice programs such as WeChat, a large amount of corpus can be collected quickly and converted automatically into corpus files of the preset format, which facilitates automatic audit, shortens corpus acquisition time, improves corpus quality, and improves corpus collection efficiency.

Brief description of the drawings

Fig. 1 is a flow chart of a near-field corpus acquisition method provided by Embodiment one of the present invention;

Fig. 2 is a flow chart of step S20 in Fig. 1;

Fig. 3 is a flow chart of another near-field corpus acquisition method provided by Embodiment one of the present invention;

Fig. 4 is an exemplary block diagram of a near-field corpus acquisition device provided by Embodiment two of the present invention;

Fig. 5 is an exemplary block diagram of the acquisition module in Fig. 4;

Fig. 6 is an exemplary block diagram of another near-field corpus acquisition device provided by Embodiment two of the present invention.

The realization of the object, the functional features, and the advantages of the present invention are further described below in connection with the embodiments and with reference to the accompanying drawings.

Detailed description of the embodiments

It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.

In the following description, suffixes such as "module", "component", or "unit" used to denote elements are adopted only to facilitate the description of the present invention and have no specific meaning in themselves. Therefore, "module" and "component" may be used interchangeably.

Embodiment one

As shown in Fig. 1, in the present embodiment, a near-field corpus acquisition method comprises:

S10, obtaining an acquisition task;

S20, collecting the speech required by the acquisition task through a voice program;

S30, uploading the speech to a server, and converting the speech into a corpus file of a preset format.

In the present embodiment, with voice programs such as WeChat, a large amount of corpus can be collected quickly and converted automatically into corpus files of the preset format, which facilitates automatic audit, shortens corpus acquisition time, improves corpus quality, and improves corpus collection efficiency.

In the present embodiment, the acquisition task includes: an entry, an entry category, and task control information.

In the present embodiment, entries correspond to an entry maintenance module, which is used to maintain entries and is divided into three submodules: entry category, short entry, and recording entry. A short entry is what annotation uses; a recording entry is an entry for which corpus needs to be recorded, and it must belong to an entry category. For example, "turn on the air conditioner" is a recording entry, and its short entries "turn on" and "air conditioner" belong to two different entry categories.
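The relationship between entry categories, short entries, and recording entries can be sketched as a small data model. This is only one illustrative reading of the paragraph above: the class names, the field names, and the category labels "action" and "device" are hypothetical, and the check that every short entry of a recording entry belongs to a known entry category is an assumed interpretation of the text rather than something the patent specifies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ShortEntry:
    text: str       # short entry used for annotation, e.g. "turn on"
    category: str   # entry category the short entry belongs to (assumed field)


@dataclass
class RecordingEntry:
    text: str                                         # phrase to be recorded
    short_entries: list = field(default_factory=list)

    def uncategorized(self, known_categories):
        """Return the short entries whose category is not a known entry category."""
        return [s for s in self.short_entries if s.category not in known_categories]


# Hypothetical category labels, chosen only for this example.
known_categories = {"action", "device"}
entry = RecordingEntry(
    text="turn on the air conditioner",
    short_entries=[
        ShortEntry("turn on", "action"),
        ShortEntry("air conditioner", "device"),
    ],
)
```

Under these assumptions, `entry.uncategorized(known_categories)` returns an empty list when every short entry falls under a maintained entry category.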

In the present embodiment, as described in step S10, the system collects corpus from voice objects (speakers) only after the entries to be collected have been published as a task. Published tasks can be displayed and gotten through voice programs such as WeChat.

In the present embodiment, voice programs such as WeChat and QQ have huge user bases, so a large amount of corpus can be collected quickly, which shortens corpus acquisition time several-fold and accelerates the project cycle.

Before an acquisition task is published, a back-end administrator can also maintain entries and acquisition tasks through the entry maintenance module or the task acquisition module.

As shown in Fig. 2, in the present embodiment, step S20 includes:

S21, getting the acquisition task through the voice program;

S22, reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;

S23, collecting the speech required by the acquisition task according to the collection quantity.

In the present embodiment, the voice object information includes accent, age bracket, and gender. The collection quantity is the number that needs to be collected. Together, these two pieces of information constitute the task control information, which assists the acquisition task.
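Steps S22 and S23 can be illustrated with a minimal sketch. The class names, field names, and the `record_once` callback are hypothetical: the callback stands in for whatever recording facility the voice program exposes, which the text does not describe.

```python
from dataclasses import dataclass


@dataclass
class VoiceObjectInfo:
    accent: str
    age_bracket: str
    gender: str


@dataclass
class TaskControlInfo:
    voice_object: VoiceObjectInfo
    collection_quantity: int


@dataclass
class AcquisitionTask:
    entry: str
    entry_category: str
    control: TaskControlInfo


def collect_speech(task, record_once):
    """Read the task control information (S22), then record until the
    collection quantity is reached (S23).

    record_once is a hypothetical callback taking the entry text and the
    voice object information and returning one recording.
    """
    control = task.control
    recordings = []
    while len(recordings) < control.collection_quantity:
        recordings.append(record_once(task.entry, control.voice_object))
    return recordings
```

A caller would build an `AcquisitionTask` from the published task and pass in a function that performs one recording; the loop stops as soon as the requested quantity has been collected.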

In the present embodiment, a file recorded by the WeChat program cannot be used directly for speech recognition. The speech needs to be converted into a WAV file with a sample rate of 16 kHz and a 16-bit sample format, and the silence before and after the speech needs to be removed, which improves corpus quality.
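The trimming and WAV output described above might look like the following sketch, which uses only the Python standard library. It assumes the recording has already been decoded into mono 16-bit PCM samples at 16 kHz; decoding the voice program's native format and resampling are outside its scope. The function names and the amplitude threshold of 500 are hypothetical choices for illustration.

```python
import struct
import wave

SAMPLE_RATE_HZ = 16000    # 16 kHz sample rate, as stated in the text
SAMPLE_WIDTH_BYTES = 2    # 16-bit sample format


def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose magnitude is below threshold.

    The threshold is an assumed value, not one given in the source.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return list(samples[loud[0]:loud[-1] + 1])


def write_corpus_wav(target, samples):
    """Write mono 16-bit PCM samples to target (a path or a binary file object)."""
    frames = struct.pack("<%dh" % len(samples), *samples)  # little-endian int16
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(frames)
```

Silence inside the utterance is kept on purpose: only the samples before the first loud sample and after the last one are removed.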

As shown in Fig. 3, in the present embodiment, after step S30 the method further includes:

S40, recognizing the corpus file through a third-party online speech recognition platform;

S50, auditing the corpus file automatically according to the task control information; if the audit passes, storing the corpus file in a corpus; otherwise, performing manual audit.

In the present embodiment, if the recognition result is consistent with the annotation, the corpus file passes the audit and is stored. If the recognition result is inconsistent, the corpus file is first marked and is stored only after it passes manual audit; if it fails manual audit, the speech is collected again. Because most of the collected corpus can be audited by online recognition, the workload of manual audit is greatly reduced; in the implementation sample, the cost of manual audit can be reduced by 90%.
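The audit routing described above can be sketched as follows. The names `Outcome`, `audit_corpus`, and `manual_audit` are hypothetical, and exact string equality is an assumed definition of "consistent"; the source does not say how the recognition result is compared with the annotation.

```python
from enum import Enum


class Outcome(Enum):
    STORE = "store"          # store the corpus file in the corpus
    RECOLLECT = "recollect"  # discard and collect the speech again


def audit_corpus(recognized_text, annotation, manual_audit):
    """Route one corpus file according to the recognition result.

    manual_audit is a hypothetical callback that returns True when a
    human auditor accepts the recording.
    """
    # Automatic audit: a recognition result that matches the annotation passes.
    if recognized_text == annotation:
        return Outcome.STORE
    # Mismatch: the file is marked and handed to manual audit.
    if manual_audit(recognized_text, annotation):
        return Outcome.STORE
    return Outcome.RECOLLECT
```

Only mismatched files reach the `manual_audit` callback, which is what keeps the manual workload small when most recordings are recognized correctly.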

Embodiment two

As shown in Fig. 4, in the present embodiment, a near-field corpus acquisition device comprises:

an obtaining module 10, configured to obtain an acquisition task;

an acquisition module 20, configured to collect the speech required by the acquisition task through a voice program;

a conversion module 30, configured to upload the speech to a server and convert the speech into a corpus file of a preset format.

In the present embodiment, the system collects corpus from voice objects (speakers) only after the entries to be collected have been published as a task. Published tasks can be displayed and gotten through voice programs such as WeChat.

As shown in Fig. 5, in the present embodiment, the acquisition module includes:

a getting unit 21, configured to get the acquisition task through the voice program;

a reading unit 22, configured to read the task control information in the acquisition task, the task control information including voice object information and a collection quantity;

an acquisition unit 23, configured to collect the speech required by the acquisition task according to the collection quantity.

As shown in Fig. 6, in the present embodiment, the near-field corpus acquisition device further includes:

an auditing module 40, configured to recognize the corpus file through a third-party online speech recognition platform, and to audit the corpus file automatically according to the task control information; if the audit passes, the corpus file is stored in a corpus; otherwise, manual audit is performed.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and accompanying drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

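1. A near-field corpus acquisition method, characterized by comprising:

obtaining an acquisition task;

collecting the speech required by the acquisition task through a voice program;

uploading the speech to a server, and converting the speech into a corpus file of a preset format.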

2. The near-field corpus acquisition method according to claim 1, characterized in that the acquisition task includes: an entry, an entry category, and task control information.

3. The near-field corpus acquisition method according to claim 2, characterized in that collecting the speech required by the acquisition task through a voice program includes:

getting the acquisition task through the voice program;

reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;

collecting the speech required by the acquisition task according to the collection quantity.

4. The near-field corpus acquisition method according to claim 3, characterized in that, after uploading the speech to a server and converting the speech into a corpus file of a preset format, the method further includes:

recognizing the corpus file through a third-party online speech recognition platform;

auditing the corpus file automatically according to the task control information; if the audit passes, storing the corpus file in a corpus; otherwise, performing manual audit.

5. The near-field corpus acquisition method according to claim 4, characterized in that the voice object information includes accent, age bracket, and gender.

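6. A near-field corpus acquisition device, characterized by comprising:

an obtaining module, configured to obtain an acquisition task;

an acquisition module, configured to collect the speech required by the acquisition task through a voice program;

a conversion module, configured to upload the speech to a server and convert the speech into a corpus file of a preset format.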

7. The near-field corpus acquisition device according to claim 6, characterized in that the acquisition task includes: an entry, an entry category, and task control information.

8. The near-field corpus acquisition device according to claim 7, characterized in that the acquisition module includes:

a getting unit, configured to get the acquisition task through the voice program;

a reading unit, configured to read the task control information in the acquisition task, the task control information including voice object information and a collection quantity;

an acquisition unit, configured to collect the speech required by the acquisition task according to the collection quantity.

9. The near-field corpus acquisition device according to claim 8, characterized by further comprising:

an auditing module, configured to recognize the corpus file through a third-party online speech recognition platform, and to audit the corpus file automatically according to the task control information; if the audit passes, the corpus file is stored in a corpus; otherwise, manual audit is performed.

10. The near-field corpus acquisition device according to claim 9, characterized in that the voice object information includes accent, age bracket, and gender.