CN109376224A - Corpus filter method and device - Google Patents


Info

Publication number
CN109376224A
Authority
CN
China
Prior art keywords
corpus
filtered
words
phrases
corpus set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811241741.3A
Other languages
Chinese (zh)
Other versions
CN109376224B (en
Inventor
况鹏
左靖东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen One Pigeon Technology Co Ltd
Original Assignee
Shenzhen One Pigeon Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen One Pigeon Technology Co Ltd
Priority to CN201811241741.3A
Publication of CN109376224A
Application granted
Publication of CN109376224B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Abstract

The present invention provides a corpus filter method and device, and relates to the field of speech recognition. The method receives an original dialogue-turn corpus sent by a voice customer service robot, converts the original dialogue-turn corpus into a text character set, and divides the text character set into an individual character corpus set and a words and phrases corpus set. It then filters out the dialogue-turn corpora in the individual character corpus set that are not contained in a pre-established significant word table, and filters out the dialogue-turn corpora in the words and phrases corpus set that contain a negative keyword of a pre-established non-natural voice keyword table. Filtering abnormal speech in this way implements rejection of non-natural speech, improves the accuracy of speech recognition, improves the robustness of speech recognition in noisy environments, and greatly improves the user's interactive experience.

Description

Corpus filter method and device
Technical field
The present invention relates to the field of speech recognition, and in particular to a corpus filter method and device.
Background
At present, the intelligent voice customer service robot industry is developing rapidly. Especially after the AI boom of 2017, China's intelligent voice customer service market is expected to reach the trillion level by 2020. Automatic speech recognition (ASR) serves as the "ear" of an intelligent voice customer service robot. In a quiet environment its recognition accuracy is close to 97%, so it is already able to "listen". In real environments, however, recognition accuracy remains unsatisfactory, and the recognition rate may even fall below 50%. The causes include noise such as ambient noise and channel noise; the diversity of spoken dialogue speech, such as dialects, spoken filler words, and disfluencies caused by hesitation, repetition and pauses; overlapping speech from multiple speakers; and ambiguous sentence boundaries. If the ASR system has no rejection capability, then on receiving an unexpected input it still passes the maximum-likelihood recognition result as text to the downstream natural language understanding stage, which then takes an interactive action. The voice interaction process becomes uncontrollable, the robustness of speech recognition in noisy environments is poor, and the interactive experience is seriously affected.
Summary of the invention
In view of this, an object of the embodiments of the present invention is to provide a corpus filter method and device to mitigate the above problems.
In a first aspect, an embodiment of the present invention provides a corpus filter method, the corpus filter method comprising:
Receiving an original dialogue-turn corpus sent by a voice customer service robot;
Converting the original dialogue-turn corpus into a text character set, and dividing the text character set into an individual character corpus set and a words and phrases corpus set;
Filtering out the dialogue-turn corpora in the individual character corpus set that are not contained in a pre-established significant word table, and filtering out the dialogue-turn corpora in the words and phrases corpus set that contain a negative keyword of a pre-established non-natural voice keyword table.
In a second aspect, an embodiment of the present invention also provides a corpus filter device, the corpus filter device comprising:
An information receiving unit, configured to receive an original dialogue-turn corpus sent by a voice customer service robot;
A corpus division unit, configured to convert the original dialogue-turn corpus into a text character set and divide the text character set into an individual character corpus set and a words and phrases corpus set;
A corpus filter unit, configured to filter out the dialogue-turn corpora in the individual character corpus set that are not contained in a pre-established significant word table, and to filter out the dialogue-turn corpora in the words and phrases corpus set that contain a negative keyword of a pre-established non-natural voice keyword table.
Compared with the prior art, the corpus filter method and device provided by the present invention receive an original dialogue-turn corpus sent by a voice customer service robot, convert it into a text character set, and divide the text character set into an individual character corpus set and a words and phrases corpus set. Dialogue-turn corpora in the individual character corpus set that are not contained in the pre-established significant word table are then filtered out, as are dialogue-turn corpora in the words and phrases corpus set that contain a negative keyword of the pre-established non-natural voice keyword table. Filtering abnormal speech in this way implements rejection of non-natural speech, improves the accuracy of speech recognition, improves the robustness of speech recognition in noisy environments, and greatly improves the user's interactive experience.
To make the above objects, features and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.
Fig. 1 is the interaction schematic diagram of speech recognition system provided in an embodiment of the present invention;
Fig. 2 is the structural block diagram of server provided in an embodiment of the present invention;
Fig. 3 is the flow chart of corpus filter method provided in an embodiment of the present invention;
Fig. 4 is the schematic diagram of vocabulary set provided in an embodiment of the present invention;
Fig. 5 is the functional block diagram of corpus filter device provided in an embodiment of the present invention.
Reference numerals: 100 - server; 200 - corpus filter device; 300 - voice customer service robot; 101 - processor; 102 - memory; 103 - storage controller; 104 - peripheral interface; 501 - information receiving unit; 502 - corpus division unit; 503 - corpus filter unit; 504 - confidence estimation unit; 505 - corpus culling unit; 506 - alarm command generation unit; 507 - information transmitting unit; 508 - activation corpus frame extraction unit; 509 - ratio calculation unit.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.
The corpus filter method and device provided by the preferred embodiments of the present invention can be applied to a server 100, and the server 100 is applied to a speech recognition system. As shown in Fig. 1, the speech recognition system includes the server 100 and a voice customer service robot 200, and a communication connection is established between the server 100 and the voice customer service robot 200. The server 100 may be, but is not limited to, a network server, a database server, a cloud server, or the like. Fig. 2 shows a structural block diagram of a server 100 that can be used in the embodiments of the present invention. The server 100 includes a corpus filter device 200, a peripheral interface 104, a memory 102, a storage controller 103 and a processor 101.
The peripheral interface 104, the memory 102, the storage controller 103 and the processor 101 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected through one or more communication buses or signal lines. The corpus filter device 200 includes at least one software function module that can be stored in the memory 102 in the form of software or firmware, or built into the server. The processor 101 is configured to execute the executable modules stored in the memory 102, for example the software function modules or computer programs included in the corpus filter device 200.
The memory 102 may be, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read Only Memory, ROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable programmable read-only memory (Electric Erasable Programmable Read-Only Memory, EEPROM), or the like. The memory 102 is used to store a program, and the processor 101 executes the program after receiving an execution instruction. The method performed by the server, as defined by the flow disclosed in any of the foregoing embodiments of the present invention, may be applied to the processor 101 or implemented by the processor 101.
The processor 101 may be an integrated circuit chip with signal processing capability. The processor 101 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), or the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor 101 may be any conventional processor.
The peripheral interface 104 couples various input/output devices to the processor 101 and the memory 102. In some embodiments, the peripheral interface 104, the processor 101 and the storage controller 103 may be implemented in a single chip. In other embodiments, they may each be implemented by a separate chip.
Referring to Fig. 3, an embodiment of the present invention provides a corpus filter method, the corpus filter method comprising:
Step S301: receive the original dialogue-turn corpus sent by the voice customer service robot 200.
Specifically, the voice customer service robot 200 may automatically dial a user terminal and then conduct a voice interaction with the user holding the user terminal. It records the content of the voice interaction as the original dialogue-turn corpus and sends the original dialogue-turn corpus to the server for processing.
Step S302: convert the original dialogue-turn corpus into a text character set, and divide the text character set into an individual character corpus set and a words and phrases corpus set.
Specifically, automatic speech recognition (Automatic Speech Recognition) may be used to convert the original dialogue-turn corpus into the text character set.
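The division in step S302 can be pictured with a minimal sketch. The source does not state the rule used to divide the text character set, so the rule below (a recognized turn of exactly one character goes to the individual character set, anything longer goes to the words and phrases set) is an assumption, as is the function name `split_turns`.

```python
def split_turns(turn_texts):
    """Divide recognized turn texts into two lists.

    Assumption (not stated in the source): a turn whose recognized text is
    exactly one character belongs to the individual character corpus set,
    and every longer turn belongs to the words and phrases corpus set.
    """
    individual, phrases = [], []
    for text in turn_texts:
        stripped = text.strip()
        if not stripped:
            continue  # skip empty recognition results
        if len(stripped) == 1:
            individual.append(stripped)
        else:
            phrases.append(stripped)
    return individual, phrases
```

The input here is the text already produced by the ASR stage; the speech-to-text conversion itself is outside the scope of the sketch.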
Step S303: filter out the dialogue-turn corpora in the individual character corpus set that are not contained in the pre-established significant word table, and filter out the dialogue-turn corpora in the words and phrases corpus set that contain a negative keyword of the pre-established non-natural voice keyword table.
The pre-established significant word table is the set of individual characters whose word frequency in the history dialogue turns within a preset time is greater than a preset first threshold. The preset first threshold is preferably greater than 50, for example 60, 70 or 80, and is not limited here. For example, taking a preset first threshold of 60 and the vocabulary set shown in Fig. 4, the characters rendered there as "uh", "hello", "eh", "to", "oh", "I", "be", "eh", "good", "that", "row", "have" and "you" all have word frequencies greater than 60, so the set of these characters can be used as the significant word table. This processing further improves the accuracy of the corpus to be recognized.
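As a minimal sketch of how such a table could be built and applied, the following counts single-character turns in a history log and keeps those above the first threshold. The function names and the representation of the history as a flat list of characters are assumptions made for illustration; the default of 60 is taken from the example above.

```python
from collections import Counter


def build_significant_chars(history_chars, first_threshold=60):
    """Return the characters whose frequency in the history exceeds the threshold."""
    counts = Counter(history_chars)
    return {ch for ch, n in counts.items() if n > first_threshold}


def filter_individual_chars(turns, significant_chars):
    """Keep only the single-character turns found in the significant word table."""
    return [t for t in turns if t in significant_chars]
```

A character that occurs exactly 60 times is excluded under this reading, because the text requires a frequency greater than the threshold.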
In addition, the pre-established non-natural voice keyword table is obtained by taking the dialogue turns that contain non-natural speech, among the history dialogue turns within a preset time, as a negative corpus set, and extracting from the negative corpus set the set of corpora whose word frequency is greater than a preset second threshold. In this embodiment, the second threshold may be a value greater than or equal to 20, for example 20, 25 or 30, and is not limited here. Non-natural speech may be, but is not limited to, system prompt tones, ring-back tones and the like. For example, a system prompt may be "Dear customer, welcome to call 10086...", in which "10086" can serve as a non-natural voice keyword. The pre-established non-natural voice keyword table can further improve the accuracy of the corpus to be recognized.
In addition, when a dialogue-turn corpus in the words and phrases corpus set that contains a negative keyword of the pre-established non-natural voice keyword table is filtered out, a hang-up instruction is issued to the voice customer service robot 200. When a non-natural voice keyword appears in a dialogue-turn corpus, the current voice interaction is very likely worthless. A hang-up instruction can therefore be sent to the voice customer service robot 200 so that it hangs up, which releases line resources promptly and improves the working efficiency of the voice customer service robot 200.
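The keyword table and the hang-up decision described above might look like the sketch below. It assumes the negative history turns are already tokenized into lists of strings and that a keyword match is a plain substring test; neither detail is given in the source, and the function names are hypothetical.

```python
from collections import Counter


def build_negative_keywords(negative_turn_tokens, second_threshold=20):
    """Collect tokens whose frequency across negative turns exceeds the threshold.

    negative_turn_tokens: list of token lists, one per history turn that
    contained non-natural speech (tokenization is assumed to be done upstream).
    """
    counts = Counter(tok for turn in negative_turn_tokens for tok in turn)
    return {tok for tok, n in counts.items() if n > second_threshold}


def filter_phrase_turns(turns, negative_keywords):
    """Drop turns containing any negative keyword.

    Returns the kept turns and a flag that is True when at least one turn was
    dropped, which is the condition under which a hang-up instruction is issued.
    """
    kept, hang_up = [], False
    for text in turns:
        if any(kw in text for kw in negative_keywords):
            hang_up = True
        else:
            kept.append(text)
    return kept, hang_up
```

Returning the flag instead of sending the instruction directly keeps the sketch independent of whatever channel the server uses to reach the robot.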
Step S304: perform confidence estimation separately on the corpus set not filtered in the individual character corpus set and on the corpus set not filtered in the words and phrases corpus set.
Specifically, confidence estimation may proceed as follows. Count, separately, the first positive keyword character total, which is the number of characters of positive keywords from a preset positive corpus set contained in the corpora not filtered in the individual character corpus set, and the second positive keyword character total, which is the corresponding number for the corpora not filtered in the words and phrases corpus set. Then calculate, according to the formula S = C/D, the first confidence level of the corpus set not filtered in the individual character corpus set and the second confidence level of the corpus set not filtered in the words and phrases corpus set. When S is the first confidence level, C is the first positive keyword character total and D is the corpus set not filtered in the individual character corpus set. When S is the second confidence level, C is the second positive keyword character total and D is the corpus set not filtered in the words and phrases corpus set. In this embodiment, the confidence level is a value between 0 and 1.
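A small sketch of the ratio S = C/D follows. The source gives D only as "the corpus set not filtered"; interpreting D as the total character count of that set is an assumption, chosen so that S stays between 0 and 1 as the text requires. The function name and the clamping step are also illustrative.

```python
def confidence(unfiltered_turns, positive_keywords):
    """Estimate S = C / D for one unfiltered corpus set.

    C: total number of characters belonging to positive-keyword occurrences.
    D: total number of characters in the unfiltered set (assumed reading).
    """
    total_chars = sum(len(t) for t in unfiltered_turns)
    if total_chars == 0:
        return 0.0  # nothing left after filtering, so no evidence of natural speech
    matched_chars = 0
    for text in unfiltered_turns:
        for kw in positive_keywords:
            # str.count counts non-overlapping occurrences of kw in text
            matched_chars += text.count(kw) * len(kw)
    # Overlapping keywords could push the ratio above 1, so clamp it.
    return min(matched_chars / total_chars, 1.0)
```

The same function would be called twice, once for the individual character set and once for the words and phrases set, giving the first and second confidence levels.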
Step S305: when the confidence level of the corpus set not filtered in the individual character corpus set is less than a preset third threshold, reject that corpus set as a whole; when the confidence level of the corpus set not filtered in the words and phrases corpus set is less than the preset third threshold, reject that corpus set as a whole. This processing further improves the accuracy of the corpus to be recognized.
In addition, when the corpus set not filtered in the individual character corpus set or the corpus set not filtered in the words and phrases corpus set is rejected as a whole, an audio event alarm command is generated at the same time and fed back to the voice customer service robot 200. After receiving the alarm command, the voice customer service robot 200 adjusts its current working state.
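The whole-set rejection and the accompanying alarm can be sketched as a single decision. The source does not give a value for the third threshold, so it is left as a required parameter; the function name and the boolean alarm flag are assumptions.

```python
def reject_low_confidence(unfiltered_turns, confidence_value, third_threshold):
    """Reject the whole set when its confidence is below the third threshold.

    Returns the surviving turns and a flag indicating whether an audio event
    alarm command should be generated for the voice customer service robot.
    """
    if confidence_value < third_threshold:
        return [], True  # whole set rejected, alarm raised
    return list(unfiltered_turns), False
```

A confidence equal to the threshold keeps the set, since the text rejects only values less than the threshold.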
The preset positive corpus set is obtained by subtracting a preset negative corpus vocabulary from a preset positive corpus vocabulary to obtain a corpus difference set, and then extracting from the corpus difference set the set of positive keywords whose word frequency is greater than a preset fourth threshold. The preset positive corpus set can further improve the accuracy of the corpus to be recognized.
The preset fourth threshold is preferably between 20 and 100.
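The construction of the positive corpus set can be illustrated as a set difference followed by a frequency cut. The sketch assumes the positive corpus is available as a flat list of tokens and that word frequency is measured within that positive corpus; the default of 50 is simply a value inside the preferred range, and the function name is hypothetical.

```python
from collections import Counter


def build_positive_keywords(positive_tokens, negative_vocab, fourth_threshold=50):
    """Positive vocabulary minus negative vocabulary, then a frequency cut.

    positive_tokens: tokens observed in the positive corpus (assumed input form).
    negative_vocab: set of tokens in the negative corpus vocabulary.
    """
    # Counting only tokens outside the negative vocabulary yields the difference set.
    counts = Counter(tok for tok in positive_tokens if tok not in negative_vocab)
    return {tok for tok, n in counts.items() if n > fourth_threshold}
```

Removing the negative vocabulary first means a frequent token such as a service number never becomes a positive keyword, however often it appears.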
Further, the corpus filter method also comprises:
Step S306: extract, separately for the corpus set not filtered in the individual character corpus set and for the corpus set not filtered in the words and phrases corpus set, the activation corpus frames, that is, the frames whose fundamental frequency feature lies within a preset human-voice acoustic feature range.
The preset human-voice acoustic feature range may be 50 Hz to 750 Hz.
Step S307: calculate a first ratio, of the number of activation corpus frames to the total number of frames in the corpus set not filtered in the individual character corpus set, and a second ratio, of the number of activation corpus frames to the total number of frames in the corpus set not filtered in the words and phrases corpus set.
Step S308: when the first ratio is less than a preset fifth threshold, reject the corpus set not filtered in the individual character corpus set; when the second ratio is less than the preset fifth threshold, reject the corpus set not filtered in the words and phrases corpus set.
The fifth threshold may be, but is not limited to, 0.15. This processing further improves the accuracy of the corpus to be recognized.
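Steps S306 to S308 reduce to a ratio test on per-frame fundamental frequency values. The sketch below assumes those values have already been produced by some pitch tracker, with unvoiced frames reported as 0; the pitch extraction itself and the function names are not part of the source.

```python
def activation_ratio(frame_f0_hz, f0_min=50.0, f0_max=750.0):
    """Fraction of frames whose fundamental frequency lies in the voice range."""
    if not frame_f0_hz:
        return 0.0
    active = sum(1 for f0 in frame_f0_hz if f0_min <= f0 <= f0_max)
    return active / len(frame_f0_hz)


def keep_by_pitch(frame_f0_hz, fifth_threshold=0.15):
    """Keep the corpus set unless its activation ratio is below the threshold."""
    return activation_ratio(frame_f0_hz) >= fifth_threshold
```

Under these assumptions, a recording with one voiced frame in ten has a ratio of 0.1 and would be rejected at the 0.15 threshold.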
In this embodiment, the corpus remaining after the whole filtering flow is fed back to the voice customer service robot 200 for recognition. This implements rejection of non-natural speech and thereby improves the accuracy of corpus recognition.
Referring to Fig. 5, an embodiment of the present invention also provides a corpus filter device 300. It should be noted that the basic principle and technical effects of the corpus filter device 300 provided in this embodiment are the same as those of the above embodiment. For brevity, for anything not mentioned in this embodiment, refer to the corresponding content in the above embodiment. The corpus filter device 300 includes an information receiving unit 501, a corpus division unit 502, a corpus filter unit 503, a confidence estimation unit 504, a corpus culling unit 505, an alarm command generation unit 506, an information transmitting unit 507, an activation corpus frame extraction unit 508 and a ratio calculation unit 509.
The information receiving unit 501 is configured to receive the original dialogue-turn corpus sent by the voice customer service robot 200.
The corpus division unit 502 is configured to convert the original dialogue-turn corpus into a text character set and divide the text character set into an individual character corpus set and a words and phrases corpus set.
The corpus filter unit 503 is configured to filter out the dialogue-turn corpora in the individual character corpus set that are not contained in the pre-established significant word table, and to filter out the dialogue-turn corpora in the words and phrases corpus set that contain a negative keyword of the pre-established non-natural voice keyword table.
The pre-established significant word table is the set of individual characters whose word frequency in the history dialogue turns within a preset time is greater than a preset first threshold. The pre-established non-natural voice keyword table is obtained by taking the dialogue turns that contain non-natural speech, among the history dialogue turns within a preset time, as a negative corpus set, and extracting from the negative corpus set the set of corpora whose word frequency is greater than a preset second threshold.
The confidence estimation unit 504 is configured to perform confidence estimation separately on the corpus set not filtered in the individual character corpus set and on the corpus set not filtered in the words and phrases corpus set.
Specifically, the confidence estimation unit 504 is configured to count, separately, the first positive keyword character total, which is the number of characters of positive keywords from a preset positive corpus set contained in the corpora not filtered in the individual character corpus set, and the second positive keyword character total, which is the corresponding number for the corpora not filtered in the words and phrases corpus set. It then calculates, according to the formula S = C/D, the first confidence level of the corpus set not filtered in the individual character corpus set and the second confidence level of the corpus set not filtered in the words and phrases corpus set. When S is the first confidence level, C is the first positive keyword character total and D is the corpus set not filtered in the individual character corpus set. When S is the second confidence level, C is the second positive keyword character total and D is the corpus set not filtered in the words and phrases corpus set.
The preset positive corpus set is obtained by subtracting a preset negative corpus vocabulary from a preset positive corpus vocabulary to obtain a corpus difference set, and then extracting from the corpus difference set the set of positive keywords whose word frequency is greater than a preset fourth threshold.
The corpus culling unit 505 is configured to reject, as a whole, the corpus set not filtered in the individual character corpus set when its confidence level is less than a preset third threshold, and to reject, as a whole, the corpus set not filtered in the words and phrases corpus set when its confidence level is less than the preset third threshold.
The alarm command generation unit 506 is configured to generate an audio event alarm command at the same time as the corpus set not filtered in the individual character corpus set or the corpus set not filtered in the words and phrases corpus set is rejected as a whole.
The information transmitting unit 507 is configured to feed the audio event alarm command back to the voice customer service robot 200.
The activation corpus frame extraction unit 508 is configured to extract, separately for the corpus set not filtered in the individual character corpus set and for the corpus set not filtered in the words and phrases corpus set, the activation corpus frames whose fundamental frequency feature lies within a preset human-voice acoustic feature range.
The ratio calculation unit 509 is configured to calculate a first ratio, of the number of activation corpus frames to the total number of frames in the corpus set not filtered in the individual character corpus set, and a second ratio, of the number of activation corpus frames to the total number of frames in the corpus set not filtered in the words and phrases corpus set.
The corpus culling unit 505 is also configured to reject the corpus set not filtered in the individual character corpus set when the first ratio is less than a preset fifth threshold, and to reject the corpus set not filtered in the words and phrases corpus set when the second ratio is less than the preset fifth threshold.
In addition, the information transmitting unit 507 is also configured to issue a hang-up instruction to the voice customer service robot 200 when a dialogue-turn corpus in the words and phrases corpus set that contains a negative keyword of the pre-established non-natural voice keyword table is filtered out.
In conclusion corpus filter method provided by the invention and device, are sent by reception voice customer service machine human hair Original words take turns corpus;Then original words wheel corpus is converted into text character set, text character set is divided into individual character corpus Set and words and phrases corpus set;To be finally not included in individual character corpus set if pre-established significant word table take turns corpus into Row filtering, by include in words and phrases corpus set pre-established non-natural voice antistop list negative keyword if take turns corpus into Row filtering, is filtered, the rejection for realizing non-natural voice is other, improves the correct of speech recognition by abnormal voice Rate, and improve the robustness of speech recognition performance to robust in a noisy environment, greatly improve the interaction of user Experience.
In several embodiments provided herein, it should be understood that disclosed device and method can also pass through Other modes are realized.The apparatus embodiments described above are merely exemplary, for example, flow chart and block diagram in attached drawing Show the device of multiple embodiments according to the present invention, the architectural framework in the cards of method and computer program product, Function and operation.In this regard, each box in flowchart or block diagram can represent the one of a module, section or code Part, a part of the module, section or code, which includes that one or more is for implementing the specified logical function, to be held Row instruction.It should also be noted that function marked in the box can also be to be different from some implementations as replacement The sequence marked in attached drawing occurs.For example, two continuous boxes can actually be basically executed in parallel, they are sometimes It can execute in the opposite order, this depends on the function involved.It is also noted that each side in block diagram, flow chart The combination of frame and the box in block diagram, flow chart, can function or movement as defined in executing it is dedicated hardware based System is realized, or can be realized using a combination of dedicated hardware and computer instructions.
In addition, each functional module in each embodiment of the present invention can integrate one independent portion of formation together Point, it is also possible to modules individualism, an independent part can also be integrated to form with two or more modules.
It, can be with if the function is realized and when sold or used as an independent product in the form of software function module It is stored in a computer readable storage medium.Based on this understanding, technical solution of the present invention is substantially in other words The part of the part that contributes to existing technology or the technical solution can be embodied in the form of software products, the meter Calculation machine software product is stored in a storage medium, including some instructions are used so that a computer equipment (can be a People's computer, server or network equipment etc.) it performs all or part of the steps of the method described in the various embodiments of the present invention. And storage medium above-mentioned includes: that USB flash disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), arbitrary access are deposited The various media that can store program code such as reservoir (RAM, Random Access Memory), magnetic or disk.It needs Illustrate, herein, relational terms such as first and second and the like be used merely to by an entity or operation with Another entity or operation distinguish, and without necessarily requiring or implying between these entities or operation, there are any this realities The relationship or sequence on border.Moreover, the terms "include", "comprise" or its any other variant are intended to the packet of nonexcludability Contain, so that the process, method, article or equipment for including a series of elements not only includes those elements, but also including Other elements that are not explicitly listed, or further include for elements inherent to such a process, method, article, or device. In the absence of more restrictions, the element limited by sentence "including a ...", it is not excluded that including the element Process, method, article or equipment in there is also other identical elements.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention. It should also be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.

Claims (10)

1. A corpus filtering method, characterized in that the corpus filtering method comprises:
receiving an original dialogue-turn corpus sent by a voice customer-service robot;
converting the original dialogue-turn corpus into a text character set, and dividing the text character set into a single-character corpus set and a word-and-phrase corpus set;
filtering out the dialogue-turn corpus in the single-character corpus set that is not included in a pre-established significant word table, and filtering out the dialogue-turn corpus in the word-and-phrase corpus set that contains a negative keyword of a pre-established non-natural speech keyword table.
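The two filters of claim 1 can be illustrated with a minimal Python sketch. It assumes the speech-to-text conversion has already happened, that a recognized turn consisting of exactly one character belongs to the single-character set and anything longer to the word-and-phrase set, and that a negative keyword matches by substring. The claim does not state these details; the function name and sample data are hypothetical.

```python
def filter_turns(turns, significant_chars, negative_keywords):
    """Return (kept_single, kept_phrase) after the two filters of claim 1.

    turns: recognized text of each dialogue turn (already transcribed).
    significant_chars: set of characters from the significant word table.
    negative_keywords: keywords from the non-natural speech keyword table.
    """
    kept_single, kept_phrase = [], []
    for turn in turns:
        text = turn.strip()
        if not text:
            continue  # nothing was recognized, so there is nothing to keep
        if len(text) == 1:
            # Single-character turn: keep only if it is in the table.
            if text in significant_chars:
                kept_single.append(text)
        elif not any(keyword in text for keyword in negative_keywords):
            # Word/phrase turn: keep only if no negative keyword occurs.
            kept_phrase.append(text)
    return kept_single, kept_phrase


# Hypothetical data: "无法接通" stands in for a carrier voice prompt.
single, phrase = filter_turns(
    ["好", "呃", "您拨打的电话无法接通", "我想查询订单"],
    significant_chars={"好", "对"},
    negative_keywords={"无法接通"},
)
```

In this example the filler character and the carrier prompt are dropped, while the meaningful single character and the genuine customer request are kept.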
2. The corpus filtering method according to claim 1, characterized in that, after the step of filtering out the dialogue-turn corpus in the single-character corpus set that is not included in the pre-established significant word table and filtering out the dialogue-turn corpus in the word-and-phrase corpus set that contains a negative keyword of the pre-established non-natural speech keyword table, the corpus filtering method further comprises:
performing confidence estimation separately on the unfiltered corpus set in the single-character corpus set and on the unfiltered corpus set in the word-and-phrase corpus set;
when the confidence of the unfiltered corpus set in the single-character corpus set is less than a preset third threshold, rejecting the unfiltered corpus set in the single-character corpus set as a whole;
when the confidence of the unfiltered corpus set in the word-and-phrase corpus set is less than the preset third threshold, rejecting the unfiltered corpus set in the word-and-phrase corpus set as a whole.
3. The corpus filtering method according to claim 2, characterized in that the corpus filtering method further comprises: generating an audio event alarm instruction at the same time as the unfiltered corpus set in the single-character corpus set is rejected as a whole or the unfiltered corpus set in the word-and-phrase corpus set is rejected as a whole;
feeding the audio event alarm instruction back to the voice customer-service robot.
4. The corpus filtering method according to claim 2, characterized in that the step of performing confidence estimation separately on the unfiltered corpus set in the single-character corpus set and on the unfiltered corpus set in the word-and-phrase corpus set comprises:
counting, respectively, the sum of the character counts of the first positive keywords of a preset positive corpus set that are contained in the unfiltered corpus in the single-character corpus set, and the sum of the character counts of the second positive keywords of the preset positive corpus set that are contained in the unfiltered corpus in the word-and-phrase corpus set; and calculating, according to the formula S = C/D, a first confidence of the unfiltered corpus set in the single-character corpus set and a second confidence of the unfiltered corpus set in the word-and-phrase corpus set; wherein, when S is the first confidence of the unfiltered corpus set in the single-character corpus set, C is the sum of the character counts of the first positive keywords of the preset positive corpus set contained in the unfiltered corpus in the single-character corpus set, and D is the unfiltered corpus set in the single-character corpus set; and when S is the second confidence of the unfiltered corpus set in the word-and-phrase corpus set, C is the sum of the character counts of the second positive keywords of the preset positive corpus set contained in the unfiltered corpus in the word-and-phrase corpus set, and D is the unfiltered corpus set in the word-and-phrase corpus set.
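A minimal Python sketch of the confidence estimate S = C/D and of the whole-set rejection of claim 2 follows. The claim names D only as the unfiltered corpus set; the sketch assumes D is that set's total character count so that S is a ratio between 0 and 1 when keywords do not overlap. The function names, the substring-based matching, and the sample threshold are likewise assumptions made for illustration.

```python
def confidence(unfiltered_turns, positive_keywords):
    """Compute S = C / D for one unfiltered corpus set.

    C: summed character count of positive-keyword occurrences.
    D: total character count of the set (an assumption, see lead-in).
    """
    total_chars = sum(len(turn) for turn in unfiltered_turns)  # D
    if total_chars == 0:
        return 0.0  # an empty set carries no evidence of natural speech
    matched_chars = sum(
        len(keyword) * turn.count(keyword)  # non-overlapping occurrences
        for turn in unfiltered_turns
        for keyword in positive_keywords
    )  # C
    return matched_chars / total_chars


def reject_if_unreliable(unfiltered_turns, positive_keywords, third_threshold):
    """Drop the whole set when its confidence is below the third threshold."""
    score = confidence(unfiltered_turns, positive_keywords)
    return [] if score < third_threshold else list(unfiltered_turns)


# Hypothetical data: 4 matched characters out of 10, so S = 0.4.
turns = ["我想查询订单", "今天天气"]
positive = {"查询", "订单"}
score = confidence(turns, positive)
```

The same two functions would be applied once to the single-character set and once to the word-and-phrase set, giving the first and second confidence respectively.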
5. The corpus filtering method according to claim 4, characterized in that the preset positive corpus set is the set of positive keywords that are extracted from a corpus difference set, obtained by subtracting a preset negative corpus vocabulary from a preset positive corpus vocabulary, and whose word frequency is greater than a preset fourth threshold.
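The construction in claim 5 amounts to a set difference followed by a frequency cut. The Python sketch below assumes that word frequency is a raw occurrence count measured over an already tokenized historical corpus; the claim does not say where or how the frequency is measured, and all names and sample values are hypothetical.

```python
from collections import Counter


def build_positive_set(positive_vocab, negative_vocab, corpus_tokens,
                       fourth_threshold):
    """Positive keywords = (positive - negative) with frequency > threshold."""
    # Corpus difference set: positive vocabulary minus negative vocabulary.
    difference = set(positive_vocab) - set(negative_vocab)
    # Count only tokens that survive the difference.
    counts = Counter(tok for tok in corpus_tokens if tok in difference)
    return {word for word, n in counts.items() if n > fourth_threshold}


positive_set = build_positive_set(
    positive_vocab=["查询", "订单", "喂"],
    negative_vocab=["喂"],
    corpus_tokens=["查询", "查询", "订单", "喂", "喂", "喂"],
    fourth_threshold=1,
)
```

Here "喂" is frequent but is removed by the difference, and "订单" survives the difference but not the frequency cut.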
6. The corpus filtering method according to claim 1, characterized in that, after the step of filtering out the dialogue-turn corpus in the single-character corpus set that is not included in the pre-established significant word table and filtering out the dialogue-turn corpus in the word-and-phrase corpus set that contains a negative keyword of the pre-established non-natural speech keyword table, the corpus filtering method further comprises:
extracting, from the unfiltered corpus set in the single-character corpus set and from the unfiltered corpus set in the word-and-phrase corpus set respectively, the activated corpus frames whose fundamental frequency feature lies within a preset human-voice acoustic feature range;
calculating, respectively, a first ratio of the number of activated corpus frames to the total number of frames in the unfiltered corpus set in the single-character corpus set, and a second ratio of the number of activated corpus frames to the total number of frames in the unfiltered corpus set in the word-and-phrase corpus set;
when the first ratio is less than a preset fifth threshold, rejecting the unfiltered corpus set in the single-character corpus set;
when the second ratio is less than the preset fifth threshold, rejecting the unfiltered corpus set in the word-and-phrase corpus set.
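The frame-ratio test of claim 6 can be sketched in Python as follows. The sketch assumes a pitch tracker has already produced one fundamental-frequency value per frame (with `None` or `0.0` for unvoiced frames) and uses 85 to 255 Hz purely as an illustrative human-voice range; the claim does not give the range, the frame length, or the threshold, and the function names are hypothetical.

```python
def activation_ratio(f0_per_frame, f0_min=85.0, f0_max=255.0):
    """Share of frames whose F0 lies inside the assumed human-voice range."""
    if not f0_per_frame:
        return 0.0  # no frames means no evidence of a human voice
    active = sum(
        1 for f0 in f0_per_frame
        if f0 is not None and f0_min <= f0 <= f0_max
    )
    return active / len(f0_per_frame)


def keep_by_pitch(f0_per_frame, fifth_threshold, f0_min=85.0, f0_max=255.0):
    """Keep the corpus set unless its ratio is below the fifth threshold."""
    return activation_ratio(f0_per_frame, f0_min, f0_max) >= fifth_threshold


# Hypothetical frames: two of four fall inside the range, ratio 0.5.
frames = [120.0, 0.0, 200.0, 400.0]
ratio = activation_ratio(frames)
```

A ratio exactly equal to the threshold is kept here because the claim rejects only when the ratio is less than the threshold.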
7. The corpus filtering method according to claim 1, characterized in that the corpus filtering method further comprises:
when the dialogue-turn corpus in the word-and-phrase corpus set that contains a negative keyword of the pre-established non-natural speech keyword table is filtered out, issuing a hang-up instruction to the voice customer-service robot.
8. The corpus filtering method according to claim 1, characterized in that the pre-established significant word table is the set of single characters whose word frequency in the historical dialogue turns within a preset time is greater than a preset first threshold.
9. The corpus filtering method according to claim 1, characterized in that the pre-established non-natural speech keyword table is obtained by taking the dialogue turns that contain non-natural language, among the historical dialogue turns within a preset time, as a negative corpus set, and extracting from the negative corpus set the corpus set whose word frequency is greater than a preset second threshold.
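Both tables in claims 8 and 9 are frequency cuts over historical dialogue turns, which the following Python sketch illustrates. It assumes that frequency is a raw occurrence count, that every non-space character of every historical turn is counted for the significant word table, and that the caller supplies two hypothetical helpers: `is_non_natural`, which labels a turn as non-natural speech, and `tokenize`, which splits a turn into words. None of these details is specified by the claims.

```python
from collections import Counter


def build_significant_table(history_turns, first_threshold):
    """Characters whose count in historical turns exceeds the threshold."""
    counts = Counter(
        ch for turn in history_turns for ch in turn if not ch.isspace()
    )
    return {ch for ch, n in counts.items() if n > first_threshold}


def build_negative_keyword_table(history_turns, is_non_natural, tokenize,
                                 second_threshold):
    """Frequent tokens taken from turns labeled as non-natural speech."""
    # Negative corpus set: only the turns flagged as non-natural.
    negative_corpus = [t for t in history_turns if is_non_natural(t)]
    counts = Counter(tok for t in negative_corpus for tok in tokenize(t))
    return {tok for tok, n in counts.items() if n > second_threshold}


# Hypothetical data; whitespace splitting stands in for a real tokenizer.
significant = build_significant_table(["好", "好的", "嗯"], first_threshold=1)
negative = build_negative_keyword_table(
    ["请 留言", "请 留言 谢谢", "你 好"],
    is_non_natural=lambda turn: "留言" in turn,
    tokenize=str.split,
    second_threshold=1,
)
```

In practice the labeling of non-natural turns and the word segmentation would come from whatever tools the deployment already uses; the sketch only shows where the two thresholds apply.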
10. A corpus filtering device, characterized in that the corpus filtering device comprises:
an information receiving unit, configured to receive an original dialogue-turn corpus sent by a voice customer-service robot;
a corpus division unit, configured to convert the original dialogue-turn corpus into a text character set and to divide the text character set into a single-character corpus set and a word-and-phrase corpus set;
a corpus filtering unit, configured to filter out the dialogue-turn corpus in the single-character corpus set that is not included in a pre-established significant word table, and to filter out the dialogue-turn corpus in the word-and-phrase corpus set that contains a negative keyword of a pre-established non-natural speech keyword table.
CN201811241741.3A 2018-10-24 2018-10-24 Corpus filtering method and apparatus Active CN109376224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811241741.3A CN109376224B (en) 2018-10-24 2018-10-24 Corpus filtering method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811241741.3A CN109376224B (en) 2018-10-24 2018-10-24 Corpus filtering method and apparatus

Publications (2)

Publication Number Publication Date
CN109376224A true CN109376224A (en) 2019-02-22
CN109376224B CN109376224B (en) 2020-07-21

Family

ID=65401742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811241741.3A Active CN109376224B (en) 2018-10-24 2018-10-24 Corpus filtering method and apparatus

Country Status (1)

Country Link
CN (1) CN109376224B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362659A (en) * 2019-07-16 2019-10-22 北京洛必德科技有限公司 The abnormal statement filter method and system of the open corpus of robot
CN111026884A (en) * 2019-12-12 2020-04-17 南昌众荟智盈信息技术有限公司 Dialog corpus generation method for improving quality and diversity of human-computer interaction dialog corpus

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6944447B2 (en) * 2001-04-27 2005-09-13 Accenture Llp Location-based services
CN104408078A (en) * 2014-11-07 2015-03-11 北京第二外国语学院 Construction method for key word-based Chinese-English bilingual parallel corpora
WO2015145219A1 (en) * 2014-03-28 2015-10-01 Navaratnam Ratnakumar Systems for remote service of customers using virtual and physical mannequins
CN105468468A (en) * 2015-12-02 2016-04-06 北京光年无限科技有限公司 Data error correction method and apparatus facing question answering system
CN105551485A (en) * 2015-11-30 2016-05-04 讯飞智元信息科技有限公司 Audio file retrieval method and system
CN105760399A (en) * 2014-12-19 2016-07-13 华为软件技术有限公司 Data retrieval method and device
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
CN106504744A (en) * 2016-10-26 2017-03-15 科大讯飞股份有限公司 A kind of method of speech processing and device

Also Published As

Publication number Publication date
CN109376224B (en) 2020-07-21

Similar Documents

Publication Publication Date Title
CN108630193B (en) Voice recognition method and device
US10692503B2 (en) Voice data processing method, apparatus and storage medium
CN110263322A (en) Audio for speech recognition corpus screening technique, device and computer equipment
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
US20140074467A1 (en) Speaker Separation in Diarization
CN105006230A (en) Voice sensitive information detecting and filtering method based on unspecified people
CN111081279A (en) Voice emotion fluctuation analysis method and device
WO2015090215A1 (en) Voice data recognition method and device for distinguishing regional accent, and server
CN109065051B (en) Voice recognition processing method and device
CN104766608A (en) Voice control method and voice control device
CN108039181B (en) Method and device for analyzing emotion information of sound signal
CN107705791A (en) Caller identity confirmation method, device and Voiceprint Recognition System based on Application on Voiceprint Recognition
CN110265001A (en) Corpus screening technique, device and computer equipment for speech recognition training
CN102708861A (en) Poor speech recognition method based on support vector machine
CN109376224A (en) Corpus filter method and device
CN110473563A (en) Breathing detection method, system, equipment and medium based on time-frequency characteristics
CN111816216A (en) Voice activity detection method and device
CN110211609A (en) A method of promoting speech recognition accuracy
CN113782026A (en) Information processing method, device, medium and equipment
CN106887226A (en) Speech recognition algorithm based on artificial intelligence recognition
CN111640423A (en) Word boundary estimation method and device and electronic equipment
CN103474067A (en) Voice signal transmission method and system
CN111128127A (en) Voice recognition processing method and device
CN114155845A (en) Service determination method and device, electronic equipment and storage medium
CN114267342A (en) Recognition model training method, recognition method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant