CN109902199A - A kind of near field corpus acquisition method and device - Google Patents

A kind of near field corpus acquisition method and device

Info

Publication number
CN109902199A
Authority
CN
China
Prior art keywords
voice
corpus
acquisition
near field
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910156714.4A
Other languages
Chinese (zh)
Inventor
丁伟
曾敏
谢世波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wewins Wireless Communication Technology Co Ltd
Original Assignee
Shenzhen Wewins Wireless Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wewins Wireless Communication Technology Co Ltd
Priority to CN201910156714.4A
Publication of CN109902199A
Current legal status: Pending

Abstract

The present invention relates to the technical field of speech recognition and discloses a near-field corpus acquisition method and device. The method comprises: obtaining an acquisition task; collecting the speech required by the acquisition task through a voice program; and uploading the speech to a server and converting it into a corpus file of a preset format. By using voice programs such as WeChat, a large amount of corpus can be collected rapidly and converted automatically into corpus files of the preset format, which facilitates automatic auditing, shortens corpus acquisition time, improves corpus quality, and raises corpus collection efficiency.

Description

A near-field corpus acquisition method and device
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a near-field corpus acquisition method and device.
Background
At present, corpus acquisition methods for speech recognition vary widely, as does the equipment used for acquisition. Because speech is spontaneous and speaking volume differs from speaker to speaker, the quality of the collected corpus cannot be guaranteed.
The quality of speech recognition results depends largely on the quality of the corpus. When corpus quality cannot be guaranteed, no improvement technique can raise recognition performance much further.
Summary of the invention
The main object of the present invention is to provide a near-field corpus acquisition method and device. By using voice programs such as WeChat, a large amount of corpus can be collected rapidly and converted automatically into corpus files of a preset format, which facilitates automatic auditing, shortens corpus acquisition time, improves corpus quality, and raises corpus collection efficiency.
To achieve the above object, the present invention provides a near-field corpus acquisition method, comprising:
obtaining an acquisition task;
collecting the speech required by the acquisition task through a voice program;
uploading the speech to a server, and converting the speech into a corpus file of a preset format.
Optionally, the acquisition task includes: an entry, an entry category, and task control information.
Optionally, collecting the speech required by the acquisition task through a voice program includes:
claiming the acquisition task through the voice program;
reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;
collecting the speech required by the acquisition task according to the collection quantity.
Optionally, after uploading the speech to the server and converting the speech into a corpus file of a preset format, the method further includes:
recognizing the corpus file through a third-party online speech recognition platform;
automatically auditing the corpus file according to the task control information; if the audit passes, storing the corpus file in the corpus database; otherwise, performing a manual audit.
Optionally, the voice object information includes accent, age group, and gender.
As another aspect of the present invention, a near-field corpus acquisition device is provided, comprising:
an obtaining module, for obtaining an acquisition task;
an acquisition module, for collecting the speech required by the acquisition task through a voice program;
a conversion module, for uploading the speech to a server and converting the speech into a corpus file of a preset format.
Optionally, the acquisition task includes: an entry, an entry category, and task control information.
Optionally, the acquisition module includes:
a claiming unit, for claiming the acquisition task through the voice program;
a reading unit, for reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;
an acquisition unit, for collecting the speech required by the acquisition task according to the collection quantity.
Optionally, the device further includes:
an auditing module, for recognizing the corpus file through a third-party online speech recognition platform, and for automatically auditing the corpus file according to the task control information; if the audit passes, the corpus file is stored in the corpus database; otherwise, a manual audit is performed.
Optionally, the voice object information includes accent, age group, and gender.
The present invention proposes a near-field corpus acquisition method and device. The method comprises: obtaining an acquisition task; collecting the speech required by the acquisition task through a voice program; and uploading the speech to a server and converting it into a corpus file of a preset format. By using voice programs such as WeChat, a large amount of corpus can be collected rapidly and converted automatically into corpus files of the preset format, which facilitates automatic auditing, shortens corpus acquisition time, improves corpus quality, and raises corpus collection efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of a near-field corpus acquisition method provided by Embodiment one of the present invention;
Fig. 2 is a flow chart of step S20 in Fig. 1;
Fig. 3 is a flow chart of another near-field corpus acquisition method provided by Embodiment one of the present invention;
Fig. 4 is an exemplary structural block diagram of a near-field corpus acquisition device provided by Embodiment two of the present invention;
Fig. 5 is an exemplary structural block diagram of the acquisition module in Fig. 4;
Fig. 6 is an exemplary structural block diagram of another near-field corpus acquisition device provided by Embodiment two of the present invention.
The realization of the object, the functional features, and the advantages of the present invention are further described below with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are adopted only to facilitate the description of the invention and have no specific meaning in themselves. Therefore, "module" and "component" may be used interchangeably.
Embodiment one
As shown in Fig. 1, in this embodiment, a near-field corpus acquisition method comprises:
S10, obtaining an acquisition task;
S20, collecting the speech required by the acquisition task through a voice program;
S30, uploading the speech to a server, and converting the speech into a corpus file of a preset format.
In this embodiment, by using voice programs such as WeChat, a large amount of corpus can be collected rapidly and converted automatically into corpus files of the preset format, which facilitates automatic auditing, shortens corpus acquisition time, improves corpus quality, and raises corpus collection efficiency.
In this embodiment, the acquisition task includes: an entry, an entry category, and task control information.
In this embodiment, entries are managed by an entry maintenance module, which is divided into three submodules: entry category, short entry, and recording entry. A short entry is what annotation uses; a recording entry is the utterance to be recorded as corpus and must belong to an entry category. For example, "turn on the air conditioner" is a recording entry, and its short entries "turn on" and "air conditioner" belong to two different entry categories.
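The relationship between entry categories, short entries, and recording entries described above can be sketched as a small data model. This is only an illustrative sketch: the class names, field names, and the category labels "action" and "appliance" are assumptions made for the example and do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ShortEntry:
    """A short entry used for annotation, tied to one entry category."""
    text: str
    category: str  # hypothetical category label


@dataclass
class RecordingEntry:
    """An utterance to be recorded, composed of short entries."""
    text: str
    short_entries: list = field(default_factory=list)

    def categories(self):
        """Return the set of entry categories this recording entry touches."""
        return {s.category for s in self.short_entries}


# The patent's own example: one recording entry, two short entries,
# each in a different category (category names are assumed).
entry = RecordingEntry(
    "turn on the air conditioner",
    [ShortEntry("turn on", "action"), ShortEntry("air conditioner", "appliance")],
)
```

Modelling the category on the short entry keeps the example's "two categories for one recording entry" property easy to check.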
In this embodiment, as described in step S10, the system collects corpus from voice objects (speakers) only after the entries to be collected have been published as a task. Published tasks can be displayed, and claimed, through voice programs such as WeChat.
In this embodiment, voice programs such as WeChat and QQ have huge user bases, so a large amount of corpus can be collected rapidly, cutting corpus acquisition time severalfold and speeding up the project cycle.
Before publishing an acquisition task, a back-end administrator can also maintain entries and acquisition tasks through the entry maintenance module or the task acquisition module.
As shown in Fig. 2, in this embodiment, step S20 includes:
S21, claiming the acquisition task through the voice program;
S22, reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;
S23, collecting the speech required by the acquisition task according to the collection quantity.
In this embodiment, the voice object information includes accent, age group, and gender. The collection quantity is the number to be collected. Together, these two pieces of information constitute the task control information, which assists the acquisition task.
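As a rough sketch of how task control information might steer steps S21 to S23, the code below counts accepted recordings against the collection quantity. The exact-match rule on accent, age group, and gender is an assumption for illustration; the patent only states that these fields exist, not how they are compared. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceObjectInfo:
    """Speaker attributes named in the text: accent, age group, gender."""
    accent: str
    age_group: str
    gender: str


@dataclass
class AcquisitionTask:
    entry_text: str
    target: VoiceObjectInfo  # voice object information
    quantity: int            # collection quantity
    collected: int = 0

    def accepts(self, speaker):
        """Assumed rule: quota not yet met and speaker matches the target."""
        return self.collected < self.quantity and speaker == self.target

    def record(self, speaker):
        """Count one recording if accepted; return whether it was counted."""
        if not self.accepts(speaker):
            return False
        self.collected += 1
        return True


target = VoiceObjectInfo("Mandarin", "20-30", "female")
task = AcquisitionTask("turn on the air conditioner", target, quantity=2)
results = [task.record(target) for _ in range(3)]
```

Here the third attempt is refused because the collection quantity of 2 has already been reached.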
In this embodiment, files recorded by the WeChat program cannot be used directly for speech recognition. The speech must be converted into a WAV file with a 16 kHz sample rate and a 16-bit sample format, and the silence before and after the speech must be removed, which improves corpus quality.
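The conversion step can be illustrated with a minimal standard-library sketch. It assumes the recording has already been decoded into a list of signed 16-bit mono PCM samples (decoding the voice program's native recording format is outside this sketch). The amplitude threshold of 500 and the naive linear-interpolation resampler are illustrative assumptions, not details from the patent; a production system would likely use a dedicated audio library with proper filtering.

```python
import io
import struct
import wave

TARGET_RATE = 16_000  # 16 kHz, as stated in the text


def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples quieter than the (assumed) threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(round(samples[lo] * (1 - frac) + samples[hi] * frac)))
    return out


def to_wav_bytes(samples, rate=TARGET_RATE):
    """Encode mono 16-bit PCM samples as an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit
        w.setframerate(rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


# Synthetic 32 kHz input: silence, a constant "loud" stretch, silence.
raw = [0] * 100 + [1000] * 200 + [0] * 50
wav_bytes = to_wav_bytes(resample_linear(trim_silence(raw), 32_000))
```

Trimming before resampling keeps the silence threshold tied to the original samples.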
As shown in Fig. 3, in this embodiment, the following steps come after step S30:
S40, recognizing the corpus file through a third-party online speech recognition platform;
S50, automatically auditing the corpus file according to the task control information; if the audit passes, storing the corpus file in the corpus database; otherwise, performing a manual audit.
In this embodiment, if the recognition result is consistent with the label, the corpus file passes the audit and is stored. If the recognition result is inconsistent, the file is first flagged and is stored only after it passes a manual audit; if it fails the manual audit, the speech is re-collected. Because most of the collected corpus can be audited through online recognition, the manual audit workload drops sharply; in the implementation sample, the cost of manual auditing can be reduced by 90%.
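The audit routing described above can be sketched as two small decision functions. The text normalization (lowercasing and dropping punctuation and spaces before comparing) is an assumption added for the example; the patent only says the recognition result and the label must be "consistent". The recognized text is passed in as a plain string, so no particular recognition platform API is implied.

```python
from enum import Enum


class AuditOutcome(Enum):
    STORED = "stored"                # passes audit, goes into the corpus database
    MANUAL_REVIEW = "manual_review"  # flagged for a human auditor
    RECOLLECT = "recollect"          # failed manual audit, record again


def normalize(text):
    """Assumed comparison rule: ignore case, spaces, and punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def auto_audit(recognized, label):
    """Store when recognition matches the label, otherwise flag for review."""
    if normalize(recognized) == normalize(label):
        return AuditOutcome.STORED
    return AuditOutcome.MANUAL_REVIEW


def manual_audit(passed):
    """Resolve a flagged file according to the human auditor's decision."""
    return AuditOutcome.STORED if passed else AuditOutcome.RECOLLECT


label = "turn on the air conditioner"
first = auto_audit("Turn on the air conditioner.", label)
second = auto_audit("turn on the fan", label)
```

Only files that reach `MANUAL_REVIEW` need a human, which is where the reduction in manual audit work comes from.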
Embodiment two
As shown in Fig. 4, in this embodiment, a near-field corpus acquisition device comprises:
an obtaining module 10, for obtaining an acquisition task;
an acquisition module 20, for collecting the speech required by the acquisition task through a voice program;
a conversion module 30, for uploading the speech to a server and converting the speech into a corpus file of a preset format.
In this embodiment, by using voice programs such as WeChat, a large amount of corpus can be collected rapidly and converted automatically into corpus files of the preset format, which facilitates automatic auditing, shortens corpus acquisition time, improves corpus quality, and raises corpus collection efficiency.
In this embodiment, the acquisition task includes: an entry, an entry category, and task control information.
In this embodiment, entries are managed by an entry maintenance module, which is divided into three submodules: entry category, short entry, and recording entry. A short entry is what annotation uses; a recording entry is the utterance to be recorded as corpus and must belong to an entry category. For example, "turn on the air conditioner" is a recording entry, and its short entries "turn on" and "air conditioner" belong to two different entry categories.
In this embodiment, the system collects corpus from voice objects (speakers) only after the entries to be collected have been published as a task. Published tasks can be displayed, and claimed, through voice programs such as WeChat.
In this embodiment, voice programs such as WeChat and QQ have huge user bases, so a large amount of corpus can be collected rapidly, cutting corpus acquisition time severalfold and speeding up the project cycle.
Before publishing an acquisition task, a back-end administrator can also maintain entries and acquisition tasks through the entry maintenance module or the task acquisition module.
As shown in Fig. 5, in this embodiment, the acquisition module includes:
a claiming unit 21, for claiming the acquisition task through the voice program;
a reading unit 22, for reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;
an acquisition unit 23, for collecting the speech required by the acquisition task according to the collection quantity.
In this embodiment, the voice object information includes accent, age group, and gender. The collection quantity is the number to be collected. Together, these two pieces of information constitute the task control information, which assists the acquisition task.
In this embodiment, files recorded by the WeChat program cannot be used directly for speech recognition. The speech must be converted into a WAV file with a 16 kHz sample rate and a 16-bit sample format, and the silence before and after the speech must be removed, which improves corpus quality.
As shown in Fig. 6, in this embodiment, the near-field corpus acquisition device further includes:
an auditing module 40, for recognizing the corpus file through a third-party online speech recognition platform, and for automatically auditing the corpus file according to the task control information; if the audit passes, the corpus file is stored in the corpus database; otherwise, a manual audit is performed.
In this embodiment, if the recognition result is consistent with the label, the corpus file passes the audit and is stored. If the recognition result is inconsistent, the file is first flagged and is stored only after it passes a manual audit; if it fails the manual audit, the speech is re-collected. Because most of the collected corpus can be audited through online recognition, the manual audit workload drops sharply; in the implementation sample, the cost of manual auditing can be reduced by 90%.
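One way to picture how modules 10, 20, 30, and 40 fit together is a thin class that wires four callables in order. This is a structural sketch only: the class name, the callable signatures, and the stub behaviours below are all assumptions for illustration, and the patent does not prescribe any particular software structure.

```python
class NearFieldCorpusDevice:
    """Wires the obtaining, acquisition, conversion, and auditing modules."""

    def __init__(self, obtain, collect, convert, audit=None):
        self.obtain = obtain    # module 10: returns an acquisition task
        self.collect = collect  # module 20: task -> raw speech
        self.convert = convert  # module 30: raw speech -> corpus file
        self.audit = audit      # module 40 (optional): (file, task) -> verdict

    def run(self):
        """Run the modules in order; the verdict is None without an auditor."""
        task = self.obtain()
        speech = self.collect(task)
        corpus_file = self.convert(speech)
        verdict = self.audit(corpus_file, task) if self.audit else None
        return corpus_file, verdict


# Stub modules standing in for the real ones (purely hypothetical behaviour).
device = NearFieldCorpusDevice(
    obtain=lambda: {"entry": "turn on the air conditioner"},
    collect=lambda task: b"raw-audio",
    convert=lambda speech: ("utterance.wav", speech),
    audit=lambda corpus_file, task: "passed",
)
corpus_file, verdict = device.run()
```

Keeping the auditing module optional mirrors the text, where the device of Fig. 4 has three modules and the device of Fig. 6 adds the fourth.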
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A near-field corpus acquisition method, characterized by comprising:
obtaining an acquisition task;
collecting the speech required by the acquisition task through a voice program;
uploading the speech to a server, and converting the speech into a corpus file of a preset format.
2. The near-field corpus acquisition method according to claim 1, characterized in that the acquisition task includes: an entry, an entry category, and task control information.
3. The near-field corpus acquisition method according to claim 2, characterized in that collecting the speech required by the acquisition task through a voice program includes:
claiming the acquisition task through the voice program;
reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;
collecting the speech required by the acquisition task according to the collection quantity.
4. The near-field corpus acquisition method according to claim 3, characterized in that, after uploading the speech to the server and converting the speech into a corpus file of a preset format, the method further includes:
recognizing the corpus file through a third-party online speech recognition platform;
automatically auditing the corpus file according to the task control information; if the audit passes, storing the corpus file in the corpus database; otherwise, performing a manual audit.
5. The near-field corpus acquisition method according to claim 4, characterized in that the voice object information includes accent, age group, and gender.
6. A near-field corpus acquisition device, characterized by comprising:
an obtaining module, for obtaining an acquisition task;
an acquisition module, for collecting the speech required by the acquisition task through a voice program;
a conversion module, for uploading the speech to a server and converting the speech into a corpus file of a preset format.
7. The near-field corpus acquisition device according to claim 6, characterized in that the acquisition task includes: an entry, an entry category, and task control information.
8. The near-field corpus acquisition device according to claim 7, characterized in that the acquisition module includes:
a claiming unit, for claiming the acquisition task through the voice program;
a reading unit, for reading the task control information in the acquisition task, the task control information including voice object information and a collection quantity;
an acquisition unit, for collecting the speech required by the acquisition task according to the collection quantity.
9. The near-field corpus acquisition device according to claim 8, characterized by further comprising:
an auditing module, for recognizing the corpus file through a third-party online speech recognition platform, and for automatically auditing the corpus file according to the task control information; if the audit passes, the corpus file is stored in the corpus database; otherwise, a manual audit is performed.
10. The near-field corpus acquisition device according to claim 9, characterized in that the voice object information includes accent, age group, and gender.
CN201910156714.4A 2019-03-01 2019-03-01 A kind of near field corpus acquisition method and device Pending CN109902199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910156714.4A CN109902199A (en) 2019-03-01 2019-03-01 A kind of near field corpus acquisition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910156714.4A CN109902199A (en) 2019-03-01 2019-03-01 A kind of near field corpus acquisition method and device

Publications (1)

Publication Number Publication Date
CN109902199A true CN109902199A (en) 2019-06-18

Family

ID=66945987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910156714.4A Pending CN109902199A (en) 2019-03-01 2019-03-01 A kind of near field corpus acquisition method and device

Country Status (1)

Country Link
CN (1) CN109902199A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080294979A1 (en) * 2003-12-10 2008-11-27 International Business Machines Corporation Presenting multimodal web page content on sequential multimode devices
CN103198828A (en) * 2013-04-03 2013-07-10 中金数据系统有限公司 Method and system of construction of voice corpus
CN104933192A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Automatic Chinese and Filipino bilingual parallel text collection system and implementation method
CN105654954A (en) * 2016-04-06 2016-06-08 普强信息技术(北京)有限公司 Cloud voice recognition system and method
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN107516509A (en) * 2017-08-29 2017-12-26 苏州奇梦者网络科技有限公司 Voice base construction method and system for news report phonetic synthesis
CN108492833A (en) * 2018-03-30 2018-09-04 江西科技学院 Voice messaging acquisition method, instantaneous communication system, mobile terminal and storage medium
CN108717852A (en) * 2018-04-28 2018-10-30 湖南师范大学 A kind of intelligent robot Semantic interaction system and method based on white light communication and the cognition of class brain

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张灯银等: "《IP电话技术原理和应用》", 31 December 2000 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190618)