CN109376224A - Corpus filter method and device - Google Patents
- Publication number
- CN109376224A (application number CN201811241741.3A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- filtered
- words
- phrases
- corpus set
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
Abstract
The present invention provides a corpus filtering method and device, relating to the field of speech recognition. The method and device receive an original dialog-turn corpus sent by a voice customer service robot; convert the original dialog-turn corpus into a text character set and divide the text character set into a single-character corpus set and a word-and-sentence corpus set; and finally filter out the dialog-turn corpus in the single-character corpus set that is not contained in a pre-established meaningful-character table, and filter out the dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from a pre-established non-natural-speech keyword table. By filtering out abnormal speech, the invention achieves rejection of non-natural speech, improves the accuracy of speech recognition, improves the robustness of speech recognition performance in noisy environments, and greatly improves the user's interactive experience.
Description
Technical field
The present invention relates to the field of speech recognition, and in particular to a corpus filtering method and device.
Background
The intelligent voice customer service robot industry is developing rapidly. Especially after the AI boom of 2017, China's intelligent voice customer service market is expected to reach the trillion level by 2020. Automatic speech recognition (ASR) serves as the "ear" of an intelligent voice customer service robot. In a quiet environment its recognition accuracy approaches 97%, which is sufficient to solve the "listening" problem. In real environments, however, accuracy has remained unsatisfactory, and the recognition rate may even fall below 50%. The causes include ambient noise and channel noise; the diversity of spoken dialog, such as dialects, spoken filler words, and disfluencies caused by hesitation, repetition and pauses; overlapping speakers; and ambiguous sentence boundaries. If the ASR system has no rejection capability, then on receiving unexpected input it still passes the maximum-likelihood recognition result as text to the downstream natural language understanding stage, which acts on it. The voice interaction then becomes uncontrollable, speech recognition performance is not robust in noisy environments, and the interactive experience is seriously degraded.
Summary of the invention
In view of this, an object of the embodiments of the present invention is to provide a corpus filtering method and device to alleviate the above problems.
In a first aspect, an embodiment of the present invention provides a corpus filtering method, the corpus filtering method comprising:
receiving an original dialog-turn corpus sent by a voice customer service robot;
converting the original dialog-turn corpus into a text character set, and dividing the text character set into a single-character corpus set and a word-and-sentence corpus set;
filtering out the dialog-turn corpus in the single-character corpus set that is not contained in a pre-established meaningful-character table, and filtering out the dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from a pre-established non-natural-speech keyword table.
In a second aspect, an embodiment of the present invention also provides a corpus filtering device, the corpus filtering device comprising:
an information receiving unit, configured to receive an original dialog-turn corpus sent by a voice customer service robot;
a corpus division unit, configured to convert the original dialog-turn corpus into a text character set and divide the text character set into a single-character corpus set and a word-and-sentence corpus set;
a corpus filtering unit, configured to filter out the dialog-turn corpus in the single-character corpus set that is not contained in a pre-established meaningful-character table, and to filter out the dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from a pre-established non-natural-speech keyword table.
Compared with the prior art, the corpus filtering method and device provided by the present invention receive an original dialog-turn corpus sent by a voice customer service robot; convert the original dialog-turn corpus into a text character set and divide the text character set into a single-character corpus set and a word-and-sentence corpus set; and finally filter out the dialog-turn corpus in the single-character corpus set that is not contained in a pre-established meaningful-character table, and filter out the dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from a pre-established non-natural-speech keyword table. By filtering out abnormal speech, the invention achieves rejection of non-natural speech, improves the accuracy of speech recognition, improves the robustness of speech recognition performance in noisy environments, and greatly improves the user's interactive experience.
To make the above objects, features and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings herein can be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of the interaction within the speech recognition system provided by an embodiment of the present invention;
Fig. 2 is a structural block diagram of the server provided by an embodiment of the present invention;
Fig. 3 is a flowchart of the corpus filtering method provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the vocabulary set provided by an embodiment of the present invention;
Fig. 5 is a functional block diagram of the corpus filtering device provided by an embodiment of the present invention.
Reference numerals: 100-server; 200-corpus filtering device; 300-voice customer service robot; 101-processor; 102-memory; 103-storage controller; 104-peripheral interface; 501-information receiving unit; 502-corpus division unit; 503-corpus filtering unit; 504-confidence estimation unit; 505-corpus rejection unit; 506-alarm instruction generation unit; 507-information sending unit; 508-activation corpus frame extraction unit; 509-ratio calculation unit.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings herein can be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.
The corpus filtering method and device provided by the preferred embodiments of the present invention can be applied to a server 100, and the server 100 is applied to a speech recognition system. As shown in Fig. 1, the speech recognition system includes the server 100 and a voice customer service robot 200, with a communication connection established between the server 100 and the voice customer service robot 200. The server 100 may be, but is not limited to, a network server, a database server, a cloud server, or the like. Fig. 2 shows a structural block diagram of a server 100 that can be used in the embodiments of the present invention. The server 100 includes a corpus filtering device 200, a peripheral interface 104, a memory 102, a storage controller 103 and a processor 101.
The peripheral interface 104, the memory 102, the storage controller 103 and the processor 101 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The corpus filtering device 200 includes at least one software function module that can be stored in the memory 102 in the form of software or firmware, or built into the server 100. The processor 101 executes the executable modules stored in the memory 102, for example the software function modules or computer programs included in the corpus filtering device 200.
The memory 102 may be, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electric Erasable Programmable Read-Only Memory, EEPROM), and the like. The memory 102 stores a program, and the processor 101 executes the program after receiving an execution instruction. The method performed by the server, as defined by the flow disclosed in any of the foregoing embodiments of the present invention, may be applied to the processor 101 or implemented by the processor 101.
The processor 101 may be an integrated circuit chip with signal processing capability. The processor 101 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor 101 may be any conventional processor.
The peripheral interface 104 couples various input/output devices to the processor 101 and the memory 102. In some embodiments, the peripheral interface 104, the processor 101 and the storage controller 103 may be implemented in a single chip. In other embodiments, they may each be implemented by a separate chip.
Referring to Fig. 3, an embodiment of the present invention provides a corpus filtering method, the corpus filtering method comprising:
Step S301: receive the original dialog-turn corpus sent by the voice customer service robot 200.
Specifically, the voice customer service robot 200 can automatically dial a user terminal and then carry out a voice interaction with the user holding the user terminal. It records the content of the voice interaction as the original dialog-turn corpus and sends the original dialog-turn corpus to the server for processing.
Step S302: convert the original dialog-turn corpus into a text character set, and divide the text character set into a single-character corpus set and a word-and-sentence corpus set.
Specifically, automatic speech recognition (Automatic Speech Recognition) can be used to convert the original dialog-turn corpus into the text character set.
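The patent does not state the rule used to divide the recognized text into the two sets. The following is a minimal Python sketch under stated assumptions: each dialog turn has already been transcribed to a string, and a turn consisting of exactly one character goes to the single-character set. The function name `split_turns` and the length-one rule are illustrative assumptions, not part of the source.

```python
def split_turns(turn_texts):
    """Divide transcribed dialog turns into two groups.

    Assumption (not stated in the patent): a turn whose stripped text is
    exactly one character belongs to the single-character corpus set;
    every other non-empty turn belongs to the word-and-sentence set.
    """
    single_char_set = []
    word_sentence_set = []
    for text in turn_texts:
        stripped = text.strip()
        if not stripped:
            continue  # skip empty transcriptions
        if len(stripped) == 1:
            single_char_set.append(stripped)
        else:
            word_sentence_set.append(stripped)
    return single_char_set, word_sentence_set
```

A different division rule, for example one based on word segmentation, could be substituted without changing the later steps.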
Step S303: filter out the dialog-turn corpus in the single-character corpus set that is not contained in the pre-established meaningful-character table, and filter out the dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from the pre-established non-natural-speech keyword table.
The pre-established meaningful-character table is the set of single characters whose frequency in the historical dialog turns within a preset time period exceeds a preset first threshold. The preset first threshold is preferably greater than 50, for example 60, 70 or 80, and is not limited here. For example, taking a preset first threshold of 60 and the vocabulary set shown in Fig. 4, the characters glossed in Fig. 4 as "uh", "hello", "eh", "to", "oh", "I", "be", "eh", "good", "that", "OK", "have" and "you" all have a frequency greater than 60, so the set of these characters can be used as the meaningful-character table. This processing further improves the accuracy of the corpus to be recognized.
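The construction of the meaningful-character table and the single-character filter can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the function names are hypothetical, the history is assumed to be a flat list of single-character turns, and the default threshold of 60 is taken from the example above.

```python
from collections import Counter


def build_meaningful_table(history_single_turns, first_threshold=60):
    """Return the characters whose frequency in the history exceeds the threshold.

    history_single_turns: list of single-character turns from past dialogs
    (assumed input format). The comparison is strictly "greater than",
    following the wording of the description.
    """
    counts = Counter(history_single_turns)
    return {char for char, n in counts.items() if n > first_threshold}


def filter_single_turns(single_turns, meaningful_table):
    """Keep only the single-character turns found in the meaningful-character table."""
    return [turn for turn in single_turns if turn in meaningful_table]
```

Turns that are dropped here are typically isolated characters produced by noise, which is why a frequency cut-off over real history is used rather than a hand-written list.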
In addition, the pre-established non-natural-speech keyword table is obtained by taking the dialog turns that contain non-natural speech, among the historical dialog turns within a preset time period, as a negative corpus set, and extracting from the negative corpus set the corpus whose word frequency exceeds a preset second threshold. In this embodiment, the second threshold may be a value greater than or equal to 20, for example 20, 25 or 30, and is not limited here. Non-natural speech may be, but is not limited to, system prompt tones, ring-back tones, and the like. For example, a system prompt may be "Dear customer, welcome to call 10086...", in which "10086" can serve as a non-natural-speech keyword. The pre-established non-natural-speech keyword table can further improve the accuracy of the corpus to be recognized.
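A sketch of how the non-natural-speech keyword table might be built and applied is given below. The patent does not specify how turns are segmented into words, so the `tokenize` parameter (whitespace splitting by default) is an assumption, as are the function names and the substring-matching rule used for filtering.

```python
from collections import Counter


def build_negative_keywords(negative_turns, second_threshold=20, tokenize=str.split):
    """Extract tokens whose frequency in the negative corpus exceeds the threshold.

    negative_turns: historical turns known to contain non-natural speech.
    tokenize: word segmentation function (assumption: whitespace splitting;
    a Chinese word segmenter would be needed for real data).
    """
    counts = Counter(tok for turn in negative_turns for tok in tokenize(turn))
    return {tok for tok, n in counts.items() if n > second_threshold}


def filter_word_sentence_turns(turns, negative_keywords):
    """Split turns into (kept, rejected); a turn is rejected if it contains any negative keyword."""
    kept, rejected = [], []
    for turn in turns:
        if any(kw in turn for kw in negative_keywords):
            rejected.append(turn)
        else:
            kept.append(turn)
    return kept, rejected
```

A non-empty `rejected` list is the condition under which the method described here would also send a hang-up instruction to the robot.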
In addition, when dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from the pre-established non-natural-speech keyword table is filtered out, a hang-up instruction is issued to the voice customer service robot 200. When a non-natural-speech keyword appears in the dialog-turn corpus, the current voice interaction is very likely worthless. A hang-up instruction can therefore be sent to the voice customer service robot 200 so that it hangs up, which releases line resources promptly and improves the working efficiency of the voice customer service robot 200.
Step S304: perform confidence estimation separately on the corpus set not filtered out of the single-character corpus set and on the corpus set not filtered out of the word-and-sentence corpus set.
Specifically, the confidence may be estimated as follows. Count the total number of characters of first positive keywords, that is, the characters of the corpus not filtered out of the single-character corpus set that are contained in a preset positive corpus set, and likewise count the total number of characters of second positive keywords for the corpus not filtered out of the word-and-sentence corpus set. Then compute, according to the formula S = C/D, a first confidence for the corpus set not filtered out of the single-character corpus set and a second confidence for the corpus set not filtered out of the word-and-sentence corpus set. When S is the first confidence, C is the total number of characters of first positive keywords and D is the size of the corpus set not filtered out of the single-character corpus set. When S is the second confidence, C is the total number of characters of second positive keywords and D is the size of the corpus set not filtered out of the word-and-sentence corpus set. In this embodiment, the confidence is a value between 0 and 1.
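The formula S = C/D can be illustrated with a short Python sketch. Two points are assumptions rather than statements from the source: D is taken to be the total character count of the unfiltered turns, so that S falls between 0 and 1, and a character covered by several overlapping keywords is counted once. The function names are hypothetical.

```python
def keyword_char_count(turn, positive_keywords):
    """Count the characters of `turn` covered by at least one positive keyword."""
    covered = [False] * len(turn)
    for kw in positive_keywords:
        if not kw:
            continue  # ignore empty keywords
        start = turn.find(kw)
        while start != -1:
            for i in range(start, start + len(kw)):
                covered[i] = True
            start = turn.find(kw, start + 1)
    return sum(covered)


def confidence(unfiltered_turns, positive_keywords):
    """Compute S = C / D for one corpus set.

    C: characters covered by positive keywords.
    D: total characters in the unfiltered turns (assumed reading of D).
    Returns 0.0 for an empty set to avoid division by zero.
    """
    total_chars = sum(len(turn) for turn in unfiltered_turns)
    if total_chars == 0:
        return 0.0
    hit_chars = sum(keyword_char_count(t, positive_keywords) for t in unfiltered_turns)
    return hit_chars / total_chars
```

The same function would be called once for the single-character set (first confidence) and once for the word-and-sentence set (second confidence).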
Step S305: when the confidence of the corpus set not filtered out of the single-character corpus set is less than a preset third threshold, reject that corpus set as a whole; when the confidence of the corpus set not filtered out of the word-and-sentence corpus set is less than the preset third threshold, reject that corpus set as a whole. This processing further improves the accuracy of the corpus to be recognized.
In addition, at the same time as the corpus set not filtered out of the single-character corpus set, or the corpus set not filtered out of the word-and-sentence corpus set, is rejected as a whole, an audio event alarm instruction is generated and fed back to the voice customer service robot 200. After receiving the alarm instruction, the voice customer service robot 200 adjusts its current working state.
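The whole-set rejection and the accompanying alarm can be sketched as a simple gate. The source gives no value for the third threshold, so it is a required argument here; the `RejectionResult` type and the function name are hypothetical and serve only to illustrate the control flow.

```python
from dataclasses import dataclass, field


@dataclass
class RejectionResult:
    kept_turns: list = field(default_factory=list)
    alarm: bool = False  # True means an audio event alarm instruction should be generated


def apply_confidence_gate(turns, confidence_value, third_threshold):
    """Reject the whole corpus set when its confidence is below the threshold.

    third_threshold has no default because the source does not give a value.
    """
    if confidence_value < third_threshold:
        return RejectionResult(kept_turns=[], alarm=True)
    return RejectionResult(kept_turns=list(turns), alarm=False)
```

In a full pipeline the gate would be applied to each of the two corpus sets separately, and an alarm from either one would be forwarded to the robot.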
The preset positive corpus set is obtained by subtracting a preset negative corpus vocabulary from a preset positive corpus vocabulary to get a corpus difference set, and then extracting from the corpus difference set the positive keywords whose word frequency exceeds a preset fourth threshold. The preset positive corpus set can further improve the accuracy of the corpus to be recognized.
The preset fourth threshold is preferably between 20 and 100.
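The difference-set construction of the positive corpus set can be sketched as below. The inputs are assumed to be already segmented into tokens, the function name is hypothetical, and the default of 20 is simply the lower end of the preferred range quoted above.

```python
from collections import Counter


def build_positive_keywords(positive_tokens, negative_vocabulary, fourth_threshold=20):
    """Build the positive keyword set.

    1. Count token frequencies in the positive corpus.
    2. Remove every token that also appears in the negative vocabulary
       (the corpus difference set).
    3. Keep tokens whose frequency exceeds the fourth threshold.
    """
    counts = Counter(positive_tokens)
    negative_vocabulary = set(negative_vocabulary)
    difference = {tok: n for tok, n in counts.items() if tok not in negative_vocabulary}
    return {tok for tok, n in difference.items() if n > fourth_threshold}
```

Subtracting the negative vocabulary first keeps frequent prompt-tone words out of the positive set even when they also occur in genuine user speech.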
Further, the corpus filtering method also comprises:
Step S306: from the corpus set not filtered out of the single-character corpus set and from the corpus set not filtered out of the word-and-sentence corpus set, respectively extract the activation corpus frames, that is, the frames whose fundamental-frequency feature lies within a preset human-voice acoustic feature range.
The preset human-voice acoustic feature range may be 50 Hz to 750 Hz.
Step S307: calculate a first ratio of the number of activation corpus frames to the total number of frames for the corpus set not filtered out of the single-character corpus set, and a second ratio of the number of activation corpus frames to the total number of frames for the corpus set not filtered out of the word-and-sentence corpus set.
Step S308: when the first ratio is less than a preset fifth threshold, reject the corpus set not filtered out of the single-character corpus set; when the second ratio is less than the preset fifth threshold, reject the corpus set not filtered out of the word-and-sentence corpus set.
The fifth threshold may be, but is not limited to, 0.15. It will be appreciated that this processing further improves the accuracy of the corpus to be recognized.
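Steps S306 to S308 can be sketched as follows, assuming that a pitch tracker (not shown, and not specified by the source) has already produced one fundamental-frequency estimate in Hz per frame, with 0.0 for frames where no pitch was found. The function names are hypothetical; the 50 Hz to 750 Hz range and the 0.15 threshold are the example values from the text.

```python
def active_frame_ratio(f0_per_frame, f0_min=50.0, f0_max=750.0):
    """Return the fraction of frames whose F0 lies within the human-voice range.

    f0_per_frame: one F0 estimate in Hz per frame (assumed input format;
    0.0 marks a frame with no detected pitch).
    """
    if not f0_per_frame:
        return 0.0
    active = sum(1 for f0 in f0_per_frame if f0_min <= f0 <= f0_max)
    return active / len(f0_per_frame)


def keep_by_voice_activity(f0_per_frame, fifth_threshold=0.15):
    """Keep the corpus set unless its active-frame ratio is below the threshold."""
    return active_frame_ratio(f0_per_frame) >= fifth_threshold
```

Audio dominated by noise or tones outside the voice range yields few active frames, so its ratio falls below the threshold and the corresponding corpus set is rejected.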
In this embodiment, once the corpus has passed through the entire filtering flow, it is fed back to the voice customer service robot 200 for recognition. This achieves rejection of non-natural speech and thereby improves the accuracy of corpus recognition.
Referring to Fig. 5, an embodiment of the present invention also provides a corpus filtering device 300. The basic principle and technical effects of the corpus filtering device 300 provided by this embodiment are the same as those of the above embodiment. For brevity, where this embodiment does not mention something, refer to the corresponding content in the above embodiment. The corpus filtering device 300 includes an information receiving unit 501, a corpus division unit 502, a corpus filtering unit 503, a confidence estimation unit 504, a corpus rejection unit 505, an alarm instruction generation unit 506, an information sending unit 507, an activation corpus frame extraction unit 508 and a ratio calculation unit 509.
The information receiving unit 501 is configured to receive the original dialog-turn corpus sent by the voice customer service robot 200.
The corpus division unit 502 is configured to convert the original dialog-turn corpus into a text character set and to divide the text character set into a single-character corpus set and a word-and-sentence corpus set.
The corpus filtering unit 503 is configured to filter out the dialog-turn corpus in the single-character corpus set that is not contained in the pre-established meaningful-character table, and to filter out the dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from the pre-established non-natural-speech keyword table.
The pre-established meaningful-character table is the set of single characters whose frequency in the historical dialog turns within a preset time period exceeds a preset first threshold. The pre-established non-natural-speech keyword table is obtained by taking the dialog turns that contain non-natural speech, among the historical dialog turns within a preset time period, as a negative corpus set, and extracting from the negative corpus set the corpus whose word frequency exceeds a preset second threshold.
The confidence estimation unit 504 is configured to perform confidence estimation separately on the corpus set not filtered out of the single-character corpus set and on the corpus set not filtered out of the word-and-sentence corpus set.
Specifically, the confidence estimation unit 504 is configured to count the total number of characters of first positive keywords, that is, the characters of the corpus not filtered out of the single-character corpus set that are contained in a preset positive corpus set, and likewise the total number of characters of second positive keywords for the corpus not filtered out of the word-and-sentence corpus set; and to compute, according to the formula S = C/D, a first confidence for the corpus set not filtered out of the single-character corpus set and a second confidence for the corpus set not filtered out of the word-and-sentence corpus set. When S is the first confidence, C is the total number of characters of first positive keywords and D is the size of the corpus set not filtered out of the single-character corpus set. When S is the second confidence, C is the total number of characters of second positive keywords and D is the size of the corpus set not filtered out of the word-and-sentence corpus set.
The preset positive corpus set is obtained by subtracting a preset negative corpus vocabulary from a preset positive corpus vocabulary to get a corpus difference set, and then extracting from the corpus difference set the positive keywords whose word frequency exceeds a preset fourth threshold.
The corpus rejection unit 505 is configured to reject as a whole the corpus set not filtered out of the single-character corpus set when its confidence is less than a preset third threshold, and to reject as a whole the corpus set not filtered out of the word-and-sentence corpus set when its confidence is less than the preset third threshold.
The alarm instruction generation unit 506 is configured to generate an audio event alarm instruction at the same time as the corpus set not filtered out of the single-character corpus set, or the corpus set not filtered out of the word-and-sentence corpus set, is rejected as a whole.
The information sending unit 507 is configured to feed the audio event alarm instruction back to the voice customer service robot 200.
The activation corpus frame extraction unit 508 is configured to extract, from the corpus set not filtered out of the single-character corpus set and from the corpus set not filtered out of the word-and-sentence corpus set respectively, the activation corpus frames whose fundamental-frequency feature lies within a preset human-voice acoustic feature range.
The ratio calculation unit 509 is configured to calculate a first ratio of the number of activation corpus frames to the total number of frames for the corpus set not filtered out of the single-character corpus set, and a second ratio of the number of activation corpus frames to the total number of frames for the corpus set not filtered out of the word-and-sentence corpus set.
The corpus rejection unit 505 is further configured to reject the corpus set not filtered out of the single-character corpus set when the first ratio is less than a preset fifth threshold, and to reject the corpus set not filtered out of the word-and-sentence corpus set when the second ratio is less than the preset fifth threshold.
In addition, the information sending unit 507 is further configured to issue a hang-up instruction to the voice customer service robot 200 when dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from the pre-established non-natural-speech keyword table is filtered out.
In conclusion, the corpus filtering method and device provided by the present invention receive an original dialog-turn corpus sent by a voice customer service robot; convert the original dialog-turn corpus into a text character set and divide the text character set into a single-character corpus set and a word-and-sentence corpus set; and finally filter out the dialog-turn corpus in the single-character corpus set that is not contained in a pre-established meaningful-character table, and filter out the dialog-turn corpus in the word-and-sentence corpus set that contains a negative keyword from a pre-established non-natural-speech keyword table. By filtering out abnormal speech, the invention achieves rejection of non-natural speech, improves the accuracy of speech recognition, improves the robustness of speech recognition performance in noisy environments, and greatly improves the user's interactive experience.
In the several embodiments provided in this application, it should be understood that the disclosed device and method can also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the possible architecture, functions and operations of devices, methods and computer program products according to multiple embodiments of the present invention. In this regard, each box in a flowchart or block diagram may represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two consecutive boxes may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and flowcharts, and each combination of boxes, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, may each exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software function modules and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc. It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. For those skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope. It should also be noted that similar reference numerals and letters denote similar items in the drawings, so once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.
Claims (10)
1. A corpus filtering method, characterized in that the corpus filtering method comprises:
receiving an original dialogue-turn corpus sent by a voice customer service robot;
converting the original dialogue-turn corpus into a text character set, and dividing the text character set into a single-character corpus set and a word-and-phrase corpus set;
filtering out the dialogue-turn corpora in the single-character corpus set that are not contained in a pre-established meaningful-character table, and filtering out the dialogue-turn corpora in the word-and-phrase corpus set that contain a negative keyword of a pre-established non-natural speech keyword table.
2. The corpus filtering method according to claim 1, characterized in that, after the step of filtering out the dialogue-turn corpora in the single-character corpus set that are not contained in the pre-established meaningful-character table and filtering out the dialogue-turn corpora in the word-and-phrase corpus set that contain a negative keyword of the pre-established non-natural speech keyword table, the corpus filtering method further comprises:
performing confidence estimation separately on the unfiltered corpus set in the single-character corpus set and on the unfiltered corpus set in the word-and-phrase corpus set;
when the confidence of the unfiltered corpus set in the single-character corpus set is less than a preset third threshold, rejecting the unfiltered corpus set in the single-character corpus set as a whole;
when the confidence of the unfiltered corpus set in the word-and-phrase corpus set is less than the preset third threshold, rejecting the unfiltered corpus set in the word-and-phrase corpus set as a whole.
3. The corpus filtering method according to claim 2, characterized in that the corpus filtering method further comprises:
generating an audio event alarm instruction at the same time as the unfiltered corpus set in the single-character corpus set is rejected as a whole or the unfiltered corpus set in the word-and-phrase corpus set is rejected as a whole;
feeding the audio event alarm instruction back to the voice customer service robot.
4. The corpus filtering method according to claim 2, characterized in that the step of performing confidence estimation separately on the unfiltered corpus set in the single-character corpus set and on the unfiltered corpus set in the word-and-phrase corpus set comprises:
counting, respectively, the sum of the character counts of the first positive keywords, being the unfiltered corpora in the single-character corpus set that are contained in a preset positive corpus set, and the sum of the character counts of the second positive keywords, being the unfiltered corpora in the word-and-phrase corpus set that are contained in the preset positive corpus set;
calculating, according to the formula S = C/D, a first confidence of the unfiltered corpus set in the single-character corpus set and a second confidence of the unfiltered corpus set in the word-and-phrase corpus set, wherein:
when S is the first confidence of the unfiltered corpus set in the single-character corpus set, C is the sum of the character counts of the first positive keywords of the unfiltered corpora in the single-character corpus set that are contained in the preset positive corpus set, and D is the unfiltered corpus set in the single-character corpus set;
when S is the second confidence of the unfiltered corpus set in the word-and-phrase corpus set, C is the sum of the character counts of the first positive keywords of the unfiltered corpora in the word-and-phrase corpus set that are contained in the preset positive corpus set, and D is the unfiltered corpus set in the word-and-phrase corpus set.
5. The corpus filtering method according to claim 4, characterized in that the preset positive corpus set is the set of positive keywords, extracted from a corpus difference set, whose word frequency is greater than a preset fourth threshold, the corpus difference set being obtained by subtracting a preset negative corpus vocabulary from a preset positive corpus vocabulary.
6. The corpus filtering method according to claim 1, characterized in that, after the step of filtering out the dialogue-turn corpora in the single-character corpus set that are not contained in the pre-established meaningful-character table and filtering out the dialogue-turn corpora in the word-and-phrase corpus set that contain a negative keyword of the pre-established non-natural speech keyword table, the corpus filtering method further comprises:
extracting, respectively from the unfiltered corpus set in the single-character corpus set and from the unfiltered corpus set in the word-and-phrase corpus set, the activation corpus frames whose fundamental-frequency feature lies within a preset human-voice acoustic feature range;
calculating, respectively, a first ratio of the number of activation corpus frames to the total number of frames in the unfiltered corpus set in the single-character corpus set, and a second ratio of the number of activation corpus frames to the total number of frames in the unfiltered corpus set in the word-and-phrase corpus set;
when the first ratio is less than a preset fifth threshold, rejecting the unfiltered corpus set in the single-character corpus set;
when the second ratio is less than the preset fifth threshold, rejecting the unfiltered corpus set in the word-and-phrase corpus set.
7. The corpus filtering method according to claim 1, characterized in that the corpus filtering method further comprises:
when filtering out the dialogue-turn corpora in the word-and-phrase corpus set that contain a negative keyword of the pre-established non-natural speech keyword table, issuing a hang-up instruction to the voice customer service robot.
8. The corpus filtering method according to claim 1, characterized in that the pre-established meaningful-character table is the set of single characters whose word frequency is greater than a preset first threshold in the historical dialogue turns within a preset time.
9. The corpus filtering method according to claim 1, characterized in that the pre-established non-natural speech keyword table is obtained by taking the dialogue turns that contain non-natural speech, among the historical dialogue turns within a preset time, as a negative corpus set, and extracting from the negative corpus set the corpus set whose word frequency is greater than a preset second threshold.
10. A corpus filtering device, characterized in that the corpus filtering device comprises:
an information receiving unit, configured to receive an original dialogue-turn corpus sent by a voice customer service robot;
a corpus division unit, configured to convert the original dialogue-turn corpus into a text character set and to divide the text character set into a single-character corpus set and a word-and-phrase corpus set;
a corpus filtering unit, configured to filter out the dialogue-turn corpora in the single-character corpus set that are not contained in a pre-established meaningful-character table, and to filter out the dialogue-turn corpora in the word-and-phrase corpus set that contain a negative keyword of a pre-established non-natural speech keyword table.
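The filtering step of claim 1, the confidence S = C/D of claim 4, and the activation-frame ratio of claim 6 can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions that the claims do not state: the turns have already been transcribed to text, a "single-character" turn is a transcript of length 1, D is taken as the total character count of the unfiltered turns, and the fundamental frequency of each frame is already available in Hz. All function names and the 80 to 400 Hz default range are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def split_turns(turns: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split transcribed turns into single-character turns and longer turns."""
    single: List[str] = []
    phrases: List[str] = []
    for text in turns:
        text = text.strip()
        if not text:
            continue  # nothing was recognized for this turn
        (single if len(text) == 1 else phrases).append(text)
    return single, phrases


def filter_turns(
    single: Sequence[str],
    phrases: Sequence[str],
    meaningful_chars: Set[str],
    negative_keywords: Set[str],
) -> Tuple[List[str], List[str]]:
    """Claim 1: drop single characters missing from the meaningful-character
    table and drop longer turns that contain any negative keyword."""
    kept_single = [t for t in single if t in meaningful_chars]
    kept_phrases = [
        t for t in phrases
        if not any(k and k in t for k in negative_keywords)
    ]
    return kept_single, kept_phrases


def confidence(kept: Sequence[str], positive_keywords: Set[str]) -> float:
    """Claim 4: S = C / D.

    C: characters covered by positive keywords found in the kept turns.
    D: total characters in the kept turns (an assumption of this sketch).
    """
    total_chars = sum(len(t) for t in kept)
    if total_chars == 0:
        return 0.0
    matched = sum(t.count(k) * len(k) for t in kept for k in positive_keywords)
    return matched / total_chars


def active_frame_ratio(
    f0_hz: Sequence[float],
    f0_range: Tuple[float, float] = (80.0, 400.0),  # illustrative range only
) -> float:
    """Claim 6: share of frames whose F0 lies inside the human-voice range."""
    if not f0_hz:
        return 0.0
    low, high = f0_range
    active = sum(1 for f in f0_hz if low <= f <= high)
    return active / len(f0_hz)
```

A caller would compare the value returned by `confidence` with the third threshold, and the value returned by `active_frame_ratio` with the fifth threshold, and reject the whole unfiltered set when the value falls below it. Because overlapping positive keywords are each counted, S can exceed 1 in this sketch; a production system would need a rule for that case.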
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811241741.3A CN109376224B (en) | 2018-10-24 | 2018-10-24 | Corpus filtering method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811241741.3A CN109376224B (en) | 2018-10-24 | 2018-10-24 | Corpus filtering method and apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109376224A (en) | 2019-02-22 |
CN109376224B CN109376224B (en) | 2020-07-21 |
Family
ID=65401742
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811241741.3A Active CN109376224B (en) | 2018-10-24 | 2018-10-24 | Corpus filtering method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109376224B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362659A (en) * | 2019-07-16 | 2019-10-22 | 北京洛必德科技有限公司 | The abnormal statement filter method and system of the open corpus of robot |
CN111026884A (en) * | 2019-12-12 | 2020-04-17 | 南昌众荟智盈信息技术有限公司 | Dialog corpus generation method for improving quality and diversity of human-computer interaction dialog corpus |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6944447B2 (en) * | 2001-04-27 | 2005-09-13 | Accenture Llp | Location-based services |
CN104408078A (en) * | 2014-11-07 | 2015-03-11 | 北京第二外国语学院 | Construction method for key word-based Chinese-English bilingual parallel corpora |
WO2015145219A1 (en) * | 2014-03-28 | 2015-10-01 | Navaratnam Ratnakumar | Systems for remote service of customers using virtual and physical mannequins |
CN105468468A (en) * | 2015-12-02 | 2016-04-06 | 北京光年无限科技有限公司 | Data error correction method and apparatus facing question answering system |
CN105551485A (en) * | 2015-11-30 | 2016-05-04 | 讯飞智元信息科技有限公司 | Audio file retrieval method and system |
CN105760399A (en) * | 2014-12-19 | 2016-07-13 | 华为软件技术有限公司 | Data retrieval method and device |
CN105845127A (en) * | 2015-01-13 | 2016-08-10 | 阿里巴巴集团控股有限公司 | Voice recognition method and system |
CN106504744A (en) * | 2016-10-26 | 2017-03-15 | 科大讯飞股份有限公司 | A kind of method of speech processing and device |
Also Published As
Publication number | Publication date |
---|---|
CN109376224B (en) | 2020-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108630193B (en) | Voice recognition method and device | |
US10692503B2 (en) | Voice data processing method, apparatus and storage medium | |
CN110263322A (en) | Audio for speech recognition corpus screening technique, device and computer equipment | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
US20140074467A1 (en) | Speaker Separation in Diarization | |
CN105006230A (en) | Voice sensitive information detecting and filtering method based on unspecified people | |
CN111081279A (en) | Voice emotion fluctuation analysis method and device | |
WO2015090215A1 (en) | Voice data recognition method and device for distinguishing regional accent, and server | |
CN109065051B (en) | Voice recognition processing method and device | |
CN104766608A (en) | Voice control method and voice control device | |
CN108039181B (en) | Method and device for analyzing emotion information of sound signal | |
CN107705791A (en) | Caller identity confirmation method, device and Voiceprint Recognition System based on Application on Voiceprint Recognition | |
CN110265001A (en) | Corpus screening technique, device and computer equipment for speech recognition training | |
CN102708861A (en) | Poor speech recognition method based on support vector machine | |
CN109376224A (en) | Corpus filter method and device | |
CN110473563A (en) | Breathing detection method, system, equipment and medium based on time-frequency characteristics | |
CN111816216A (en) | Voice activity detection method and device | |
CN110211609A (en) | A method of promoting speech recognition accuracy | |
CN113782026A (en) | Information processing method, device, medium and equipment | |
CN106887226A (en) | Speech recognition algorithm based on artificial intelligence recognition | |
CN111640423A (en) | Word boundary estimation method and device and electronic equipment | |
CN103474067A (en) | Voice signal transmission method and system | |
CN111128127A (en) | Voice recognition processing method and device | |
CN114155845A (en) | Service determination method and device, electronic equipment and storage medium | |
CN114267342A (en) | Recognition model training method, recognition method, electronic device and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |