WO2020186695A1 - Method and apparatus for batch processing of voice information, computer device and storage medium - Google Patents

Method and apparatus for batch processing of voice information, computer device and storage medium

Info

Publication number
WO2020186695A1
Authority
WO
WIPO (PCT)
Prior art keywords
preset
script
voice information
running
voice
Prior art date
Application number
PCT/CN2019/103345
Other languages
English (en)
Chinese (zh)
Inventor
王涛
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020186695A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of data processing, and in particular to a method, device, computer equipment and storage medium for batch processing of voice information.
  • the embodiments of the present application provide a batch processing method, device, computer equipment, and storage medium for voice information, which can efficiently and accurately realize the unified conversion of multiple voice information to be processed, and reduce errors in the conversion process.
  • an embodiment of the present application provides a method for batch processing of voice information, the method including:
  • the preset Bash script includes at least one preset sub-running script, and each sub-running script is used to implement batch processing of all voice information to be processed.
  • the quantity of the target voice information is less than or equal to the quantity of the voice information to be processed;
  • an embodiment of the present application also provides a batch processing device for voice information, which includes:
  • An obtaining unit configured to obtain a preset training set if an information processing instruction is received, the training set including a plurality of to-be-processed voice information;
  • the batch processing unit is used to sequentially call and run the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, all the voice information to be processed is batch-processed accordingly, and to keep running until all the sub-running scripts have finished, so as to obtain multiple target voice information;
  • the preset Bash script includes at least one preset sub-running script, and each sub-running script is used to implement batch processing of all voice information to be processed;
  • the quantity of the target voice information is less than or equal to the quantity of the voice information to be processed;
  • the noise removal unit is used to filter all target voice information through preset voice activation detection to obtain the intermediate voice information after noise removal;
  • the framing unit is configured to perform framing processing on all intermediate voice information according to preset framing rules to obtain test voice information for training the voice recognition model.
  • an embodiment of the present application also provides a computer device, which includes a memory and a processor, the memory stores a computer program, and the processor implements the above method when the computer program is executed.
  • an embodiment of the present application also provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program can implement the foregoing method when executed by a processor.
  • FIG. 1 is a schematic flowchart of a method for batch processing of voice information provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a sub-flow of a method for batch processing of voice information provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a sub-flow of a method for batch processing of voice information provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a sub-flow of a method for batch processing of voice information provided by an embodiment of the present application
  • FIG. 5 is a schematic block diagram of an apparatus for batch processing of voice information according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a batch processing unit of a voice information batch processing apparatus provided by an embodiment of the present application.
  • FIG. 7 is another schematic block diagram of a batch processing unit of a voice information batch processing apparatus provided by an embodiment of the present application.
  • FIG. 8 is another schematic block diagram of a batch processing unit of a voice information batch processing apparatus provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the structural composition of a computer device provided by an embodiment of the present application.
  • Fig. 1 is a schematic flowchart of a method for batch processing of voice information provided by an embodiment of the present application.
  • the batch processing method of voice information is applied to the management server.
  • Before the management server trains the neural network with the training set, it performs batch preprocessing on the to-be-processed voice information in the acquired training set, for example: removing damaged or overly short voice information from the training set; converting the audio format and sampling rate of the voice information in the training set to a uniform audio format and sampling rate; and renaming all the voice information according to specific rules. Through this batch processing, the unified conversion of the multiple pieces of to-be-processed voice information in the training set is realized efficiently and accurately, and errors that arise when each piece of voice information is processed and converted one after another are effectively reduced, so that the neural network can be trained accurately.
  • The method includes the following steps S101 to S104.
  • Step S101 If an information processing instruction is received, a preset training set is obtained, and the training set includes a plurality of to-be-processed voice information.
  • the training set can be preset, that is, voice information can be collected and stored from various applications capable of obtaining voice information. At this time, the voice information stored in the training set is the voice information to be processed.
  • When the management server receives the information processing instruction initiated by the user, it obtains the preset training set, that is, it obtains the multiple pieces of to-be-processed voice information in the training set to facilitate subsequent operations.
  • Step S102: According to the information processing instruction, sequentially call and run the sub-running scripts in the preset Bash script, so that each time one of the sub-running scripts is run, all the voice information to be processed is batch-processed accordingly, and keep running until all the sub-running scripts have finished, so as to obtain multiple target voice information.
  • The preset Bash script includes at least one preset sub-running script, and each sub-running script is used to realize batch processing of all the voice information to be processed; the quantity of the target voice information is less than or equal to the quantity of the voice information to be processed.
  • the preset Bash script can be integrated with multiple pre-set sub-run scripts.
  • Each sub-run script can realize batch processing of all audio files to be processed in the same processing step.
  • Running one sub-run script performs the same conversion or change on all pending audio files, and the management server can call another sub-run script only after all pending audio files have completed the corresponding processing, so that a further conversion or change is applied on top of the previous one.
  • That is, the management server can sequentially call the sub-running scripts in the preset Bash script according to the information processing instruction, each sub-running script being called once to perform the corresponding batch processing on all the voice information to be processed, and then call another sub-run script in the Bash script, until all sub-run scripts have been run and multiple fully converted or changed target voice messages are obtained.
  • Each batch conversion or change mentioned above requires the processing of all the to-be-processed voice information to be completed before the next batch conversion or change is performed, which can effectively reduce errors caused by having too many pieces of voice information and too many conversion steps in a single conversion pass, thereby greatly improving the processing efficiency of the voice information to be processed.
  • Generally, the management server can execute the Bash script through Python, that is, it can run the multiple preset sub-run scripts of the Bash script in turn through Python to carry out the batch processing operations on the voice information in the training set one after another, reducing errors in the step-by-step iterative processing and improving the efficiency and accuracy of the conversion, as sketched below.
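  • As a rough illustration of this step (not part of the patent itself), the minimal sketch below shows how a management server could call each sub-running script in turn from Python, starting the next script only after the previous one has finished processing every pending voice file. The script file names are hypothetical assumptions, not names given in the source.

```python
import subprocess

SUB_RUN_SCRIPTS = [
    "01_convert_format.sh",  # hypothetical first running script (FFmpeg-based)
    "02_filter_valid.sh",    # hypothetical second running script (SoX-based)
    "03_rename.sh",          # hypothetical third running script (renaming)
]

def run_batch(scripts=SUB_RUN_SCRIPTS):
    """Call each sub-running script in turn; the next script starts only after
    the previous one has finished processing all pending voice files."""
    for script in scripts:
        # check=True aborts the batch if a sub-running script fails, so later
        # steps never run on partially converted data.
        subprocess.run(["bash", script], check=True)

if __name__ == "__main__":
    run_batch()
```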
  • the step S102 may include steps S201 to S202.
  • the management server may call the first running script in the preset Bash script according to the received information processing instruction, so as to facilitate subsequent processing.
  • the first running script can realize the conversion of audio format and sampling rate of all voice information to be processed in the preset training set.
  • the first running script may be an FFmpeg script.
  • the FFmpeg script is a set of open source computer programs that can be used to record, convert digital audio and video, and convert them into streams.
  • The FFmpeg script can convert the audio format and sample rate of the voice information to be processed.
  • S202 Run the first running script to perform audio format conversion and sample rate conversion on all voice information to be processed, so as to obtain multiple target voice information with preset audio formats and preset sampling rates.
  • In this way, all the voice information to be processed can be converted into a unified audio format and a unified sampling rate.
  • When the management server runs the first running script, it can batch-convert all the voice information to be processed into target voice information with the preset audio format and preset sampling rate set in the first running script.
  • common audio formats can include WAV, MIDI, MP3, RA, MP4 and other format types.
  • For example, the preset audio format can be set to the WAV format, that is, any voice information whose audio format is not the preset format can be converted to WAV by running the first running script.
  • The sampling rate, also called the sampling speed or sampling frequency, defines the number of samples extracted from a continuous signal per second to form a discrete signal, and is expressed in Hertz (Hz).
  • The sampling period, or sampling time, is the time interval between samples.
  • The sampling frequency refers to how many signal samples the computer collects per second.
  • The sampling rate therefore indicates how many sample points are collected per second: 8 kHz means 8000 samples per second of audio, and 16 kHz means 16000 samples per second. For example, if the preset sampling rate is 8 kHz and the voice information to be converted has a sampling rate of 16 kHz, the first running script converts the sampling rate of that voice information from 16 kHz to 8 kHz.
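  • A minimal sketch of this kind of batch conversion is shown below. The directory names are assumptions, and the WAV format and 8 kHz rate are taken from the examples above rather than prescribed by the patent; the conversion relies on FFmpeg's standard -ar option for the output sampling rate.

```python
import pathlib
import subprocess

PENDING_DIR = pathlib.Path("pending_voice")   # hypothetical input directory
OUTPUT_DIR = pathlib.Path("converted_voice")  # hypothetical output directory
PRESET_RATE = 8000                            # preset sampling rate in Hz (from the example above)

OUTPUT_DIR.mkdir(exist_ok=True)
for src in sorted(PENDING_DIR.iterdir()):
    dst = OUTPUT_DIR / (src.stem + ".wav")
    # -ar sets the output sampling rate; the .wav extension selects the preset audio format.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-ar", str(PRESET_RATE), str(dst)],
        check=True,
    )
```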
  • the preset Bash script includes a first running script for audio format conversion and a second running script for effective audio filtering.
  • the step S102 may include Steps S301 to S304.
  • S301 Invoke a first running script in a preset Bash script according to the information processing instruction.
  • the management server may call the first running script in the preset Bash script according to the received information processing instruction, so as to facilitate subsequent processing.
  • the first running script can realize the conversion of audio format and sampling rate of all voice information to be processed in the preset training set.
  • In this way, all the voice information to be processed can be converted into a unified audio format and a unified sampling rate.
  • When the management server runs the first running script, it can batch-convert all the voice information to be processed according to the preset audio format and preset sampling rate set in the first running script.
  • the management server needs to call the second running script for effective audio filtering in the preset Bash script.
  • the preset specifications in the second running script set conditions for screening voice information, so that voice information that meets the preset specifications can be selected from a plurality of first voice messages as valid voice information.
  • the second running script may be SOX.
  • SOX can filter out effective voice information from a plurality of first voice information according to a set preset specification.
  • S304 Run the second running script to filter all the first voice information, so as to obtain a plurality of target voice information meeting preset specifications, and the number of the target voice information is less than or equal to the number of the first voice information.
  • When the management server runs the second running script, it can filter all the first voice information according to the preset specifications to obtain the target voice information that meets the conditions; after this screening, the number of the target voice information is therefore less than or equal to the number of the first voice information.
  • the preset specification may be a preset voice duration threshold. For example, if the duration of the first voice message is lower than the preset threshold, the first voice message is deleted.
  • The preset specification can also be a preset threshold for the sampling points of the voice information, a preset threshold for the scaling factor of the voice information, or a preset threshold for the maximum amplitude of the voice information.
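  • The sketch below illustrates the duration-based screening described above, using SoX's soxi -D, which prints a file's duration in seconds. The one-second threshold and directory name are illustrative assumptions, not values given in the source.

```python
import pathlib
import subprocess

CONVERTED_DIR = pathlib.Path("converted_voice")  # hypothetical directory of first voice information
MIN_DURATION_S = 1.0                             # hypothetical preset voice duration threshold

def duration_seconds(path: pathlib.Path) -> float:
    """Return the audio duration in seconds as reported by `soxi -D`."""
    out = subprocess.run(["soxi", "-D", str(path)],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

for wav in sorted(CONVERTED_DIR.glob("*.wav")):
    if duration_seconds(wav) < MIN_DURATION_S:
        wav.unlink()  # drop files that do not meet the preset specification
```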
  • The preset Bash script includes a first running script for audio format conversion, a second running script for effective audio filtering, and a third running script for renaming; the step S102 may then include steps S401 to S406.
  • S401 Call a first running script in a preset Bash script according to the information processing instruction.
  • the management server may call the first running script in the preset Bash script according to the received information processing instruction, so as to facilitate subsequent processing.
  • the first running script can realize the conversion of audio format and sampling rate of all voice information to be processed in the preset training set.
  • S402 Run the first running script to perform audio format conversion and sample rate conversion on all voice information to be processed, so as to obtain a corresponding number of first voice information with the same audio format and sampling rate.
  • In this way, all the voice information to be processed can be converted into a unified audio format and a unified sampling rate.
  • When the management server runs the first running script, it can batch-convert all the voice information to be processed according to the preset audio format and preset sampling rate set in the first running script.
  • the management server needs to call the second running script for effective audio filtering in the preset Bash script.
  • the preset specifications in the second running script set conditions for screening voice information, so that voice information that meets the preset specifications can be selected from multiple first voice messages as valid voice information.
  • the second running script may be SOX.
  • SOX can filter out effective voice information from a plurality of first voice information according to a set preset specification.
  • When the management server runs the second running script, it can filter all the first voice information according to the preset specifications to obtain the voice information that meets the conditions; after this screening, the number of retained voice information is therefore less than or equal to the number of the first voice information.
  • the preset specification may be a preset voice duration threshold. For example, if the duration of the first voice message is lower than the preset threshold, the first voice message is deleted.
  • The preset specification can also be a preset threshold for the sampling points of the voice information, a preset threshold for the scaling factor of the voice information, or a preset threshold for the maximum amplitude of the voice information.
  • After that, the management server needs to call the third running script for renaming in the preset Bash script, so that the renamed voice information can be read more accurately and quickly.
  • a preset name format is preset in the third running script, so that multiple second voice messages can be renamed according to the preset name format.
  • the third running script is a renaming function, and the renaming function may be a function rename() for renaming files.
  • S406 Run the third running script to rename all the second voice information, so as to obtain a corresponding number of target voice information with a preset name format.
  • all the voice information in the training set can be generated by the same subject, that is, each subject can correspond to multiple pieces of different voice information.
  • Therefore, the second voice information needs to be renamed according to the preset name format and the existing information of the second voice information.
  • After running the third running script, the management server can obtain the corresponding renamed target voice information.
  • the naming of the target voice information conforms to the preset name format.
  • The quantity of the target voice information is equal to the quantity of the second voice information, and there is a one-to-one correspondence between the two.
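  • The patent only states that a preset name format is applied by a rename function; the sketch below uses a hypothetical speaker-plus-index format to illustrate the idea, and the directory and speaker identifier are assumptions rather than the format defined in the source.

```python
import os
import pathlib

FILTERED_DIR = pathlib.Path("converted_voice")  # hypothetical directory of second voice information

def rename_batch(speaker_id: str = "spk001") -> None:
    """Apply a uniform (hypothetical) <speaker>_<index>.wav name format."""
    for index, wav in enumerate(sorted(FILTERED_DIR.glob("*.wav")), start=1):
        new_path = wav.with_name(f"{speaker_id}_{index:04d}.wav")
        os.rename(wav, new_path)  # one renamed target file per second voice message

rename_batch()
```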
  • Step S103 Perform filtering processing on all target voice information through preset voice activation detection to obtain intermediate voice information after noise removal.
  • Voice activation detection (Voice Activity Detection, VAD for short) can distinguish the speech signal from the background noise in a voice signal, thereby improving the accuracy of training the neural network and reducing the time required for training.
  • Voice activation detection can cut off the silence at the beginning and end of the voice information and reduce the interference it causes to subsequent steps; that is, voice activation detection can filter all target voice information in batch to obtain the multiple corresponding pieces of intermediate voice information after noise removal.
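  • The source does not specify which VAD implementation is used, so the sketch below is only a crude energy-based stand-in that illustrates the described effect of trimming leading and trailing silence. The frame size and energy threshold are arbitrary assumptions, and the input is assumed to be normalized floating-point samples.

```python
import numpy as np

def trim_silence(samples: np.ndarray, rate: int,
                 frame_ms: int = 20, threshold: float = 1e-4) -> np.ndarray:
    """Remove leading and trailing low-energy frames from a normalized signal."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)  # mean energy per frame
    voiced = np.flatnonzero(energy > threshold)
    if voiced.size == 0:
        return samples[:0]                       # nothing above the threshold
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return samples[start:end]
```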
  • Step S104 Perform framing processing on all intermediate voice information according to preset framing rules to obtain test voice information for training a voice recognition model.
  • the management server also needs to perform framing processing on all intermediate voice information according to preset framing rules, so as to obtain a corresponding number of framed test voice information.
  • the test speech information can be used to train a speech recognition model, so as to obtain a speech recognition model capable of corresponding speech recognition.
  • The preset framing rule may refer to framing the sound through a moving window function, that is, the voice information is cut into short segments, each of which is called a frame, and adjacent frames generally overlap.
  • the step S104 may specifically include: performing framing processing on the intermediate voice information through the Enframe function to obtain test voice information for training a voice recognition model.
  • the Enframe function is a specific framing function
  • the management server can perform unified framing processing on all intermediate voice information after calling the framing function, so as to obtain the final test voice information for training.
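  • A minimal framing sketch analogous to the Enframe function mentioned above is shown below: the intermediate voice signal is cut into short, overlapping frames with a moving window. The 25 ms frame length and 10 ms frame shift are common defaults assumed here, not values prescribed by the patent.

```python
import numpy as np

def enframe(signal: np.ndarray, rate: int,
            frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    frame_len = int(rate * frame_ms / 1000)
    shift = int(rate * shift_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Build an index matrix so each row selects one frame of the signal.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]
```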
  • the embodiments of the present application can efficiently and accurately realize the unified conversion of multiple to-be-processed voice information in the training set, and reduce errors in the conversion process, so as to accurately implement neural network training.
  • The program can be stored in a computer-readable storage medium, and when executed, it may include the processes of the above-mentioned method embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), etc.
  • an embodiment of the present application also proposes a device for batch processing of voice information.
  • The device 100 includes: an acquisition unit 101, a batch processing unit 102, a noise removal unit 103, and a framing unit 104.
  • the obtaining unit 101 is configured to obtain a preset training set if an information processing instruction is received, and the training set includes a plurality of to-be-processed voice information.
  • The batch processing unit 102 is configured to sequentially call and run the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, all the voice information to be processed is batch-processed accordingly, and to keep running until all the sub-running scripts are finished, so as to obtain multiple target voice information.
  • The preset Bash script includes at least one preset sub-running script, and each sub-running script is used to realize batch processing of all the voice information to be processed.
  • The quantity of the target voice information is less than or equal to the quantity of the voice information to be processed.
  • The preset Bash script includes a first running script for converting the audio format and sampling rate; in this case, the batch processing unit 102 may include a first calling unit 201 and a first running unit 202.
  • the first calling unit 201 is configured to call the first running script in the preset Bash script according to the information processing instruction.
  • The first running unit 202 is configured to run the first running script to perform audio format conversion and sample rate conversion on all voice information to be processed, thereby obtaining multiple target voice information with the preset audio format and preset sampling rate.
  • the preset Bash script includes a first running script for audio format conversion and a second running script for effective audio filtering.
  • the batch processing unit 102 It may include a first calling unit 301, a first running unit 302, a second calling unit 303, and a second running unit 304.
  • the first calling unit 301 is configured to call the first running script in the preset Bash script according to the information processing instruction.
  • The first running unit 302 is configured to run the first running script to perform audio format conversion and sample rate conversion on all the voice information to be processed, so as to obtain a corresponding number of first voice information with the preset audio format and preset sampling rate.
  • the second calling unit 303 is configured to call a second running script in the preset Bash script.
  • The second running unit 304 is configured to run the second running script to filter all the first voice information, so as to obtain a plurality of target voice information meeting the preset specifications; the number of the target voice information is less than or equal to the number of the first voice information.
  • The preset Bash script includes a first running script for audio format conversion, a second running script for effective audio filtering, and a third running script for renaming; in this case, the batch processing unit 102 may include a first calling unit 401, a first running unit 402, a second calling unit 403, a second running unit 404, a third calling unit 405, and a third running unit 406.
  • the first calling unit 401 is configured to call the first running script in the preset Bash script according to the information processing instruction.
  • the first running unit 402 is configured to run the first running script to perform audio format conversion and sample rate conversion on all voice information to be processed, so as to obtain a corresponding number of first voices with the same audio format and sampling rate information.
  • the second calling unit 403 is configured to call a second running script in the preset Bash script.
  • The second running unit 404 is configured to run the second running script to filter all the first voice messages, so as to obtain a plurality of second voice messages that meet the preset specifications; the number of the second voice messages is less than or equal to the number of the first voice messages.
  • the third calling unit 405 is configured to call the third running script in the preset Bash script.
  • the third running unit 406 is configured to run the third running script to rename all the second voice information, so as to obtain a corresponding number of target voice information with a preset name format.
  • the noise removal unit 103 is configured to perform filtering processing on all target voice information through preset voice activation detection to obtain intermediate voice information after noise removal.
  • The framing unit 104 is configured to perform framing processing on all intermediate voice information according to a preset framing rule to obtain test voice information for training a voice recognition model.
  • the framing unit 104 may be specifically configured to perform framing processing on the intermediate voice information through the Enframe function to obtain test voice information for training a voice recognition model.
  • The above acquisition unit 101, batch processing unit 102, noise removal unit 103, and framing unit 104 may be embedded in, or independent of, the voice information batch processing device in hardware form, or may be stored in software form in the memory of the voice information batch processing device, so that the processor can call and execute the operations corresponding to the above units.
  • the processor can be a central processing unit (CPU), a microprocessor, a single-chip microcomputer, etc.
  • the foregoing apparatus for batch processing of voice information can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 9.
  • FIG. 9 is a schematic diagram of the structural composition of a computer device of this application.
  • the device can be a server, where the server can be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502, a memory, an internal memory 504, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and the internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • When the computer program 5032 is executed, the processor 502 can be caused to execute a method for batch processing of voice information.
  • the processor 502 is used to provide computing and control capabilities, and support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • When the computer program 5032 is executed by the processor 502, the processor 502 can execute the method for batch processing of voice information.
  • the network interface 505 is used for network communication with other devices.
  • FIG. 9 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in a memory, so as to implement the method for batch processing of voice information in any of the foregoing embodiments.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the computer program may be stored in a storage medium, and the storage medium is a computer-readable storage medium.
  • the computer program is executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiment.
  • the storage medium may be a computer-readable storage medium.
  • the storage medium stores a computer program, and when the computer program is executed by the processor, the processor executes the voice information batch processing method in any of the above embodiments.
  • The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disc, or other media that can store program code.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of each unit is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the steps in the method of the embodiment of the present application can be adjusted, merged, and deleted in order according to actual needs.
  • the units in the device in the embodiment of the present application may be combined, divided, and deleted according to actual needs.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a terminal, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a method and apparatus for batch processing of voice information, a computer device and a storage medium. The method comprises: if an information processing instruction is received, obtaining a preset training set, the training set comprising a plurality of pieces of voice information to be processed; sequentially calling and running sub-running scripts in a preset Bash script according to the information processing instruction to perform corresponding batch processing on all the voice information to be processed, so as to obtain a plurality of pieces of target voice information; filtering all the target voice information by means of preset voice activation detection to obtain intermediate voice information after noise removal; and performing frame segmentation on all the intermediate voice information according to a preset framing rule to obtain test voice information for training a speech recognition model.
PCT/CN2019/103345 2019-03-15 2019-08-29 Method and apparatus for batch processing of voice information, computer device and storage medium WO2020186695A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910197848.0 2019-03-15
CN201910197848.0A CN110060667B (zh) 2019-03-15 Batch processing method and apparatus for voice information, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020186695A1 true WO2020186695A1 (fr) 2020-09-24

Family

ID=67317009

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103345 WO2020186695A1 (fr) 2019-03-15 2019-08-29 Method and apparatus for batch processing of voice information, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110060667B (fr)
WO (1) WO2020186695A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060667B (zh) * 2019-03-15 2023-05-30 平安科技(深圳)有限公司 Batch processing method and apparatus for voice information, computer equipment and storage medium
CN112820309A (zh) * 2020-12-31 2021-05-18 北京天润融通科技股份有限公司 RNN-based noise reduction processing method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5428707A (en) * 1992-11-13 1995-06-27 Dragon Systems, Inc. Apparatus and methods for training speech recognition systems and their users and otherwise improving speech recognition performance
CN1588538A (zh) * 2004-09-29 2005-03-02 上海交通大学 Training method for an embedded automatic speech recognition system
CN108922543A (zh) * 2018-06-11 2018-11-30 平安科技(深圳)有限公司 Model library establishment method, speech recognition method, apparatus, device and medium
CN109326305A (zh) * 2018-09-18 2019-02-12 易诚博睿(南京)科技有限公司 Method and test system for batch testing of speech recognition and text synthesis
CN110060667A (zh) * 2019-03-15 2019-07-26 平安科技(深圳)有限公司 Batch processing method and apparatus for voice information, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9286892B2 (en) * 2014-04-01 2016-03-15 Google Inc. Language modeling in speech recognition
CN107908679B (zh) * 2017-10-26 2020-11-27 平安科技(深圳)有限公司 Script statement conversion method, apparatus and computer-readable storage medium
CN108595656B (zh) * 2018-04-28 2022-02-18 宁波银行股份有限公司 Data processing method and system
CN108877775B (zh) * 2018-06-04 2023-03-31 平安科技(深圳)有限公司 Voice data processing method, apparatus, computer equipment and storage medium
CN109376166B (zh) * 2018-08-20 2023-07-04 中国平安财产保险股份有限公司 Script conversion method, apparatus, computer equipment and storage medium


Also Published As

Publication number Publication date
CN110060667B (zh) 2023-05-30
CN110060667A (zh) 2019-07-26

Similar Documents

Publication Publication Date Title
AU2016260156B2 (en) Method and device for improving audio processing performance
CN108833722B (zh) Speech recognition method, apparatus, computer equipment and storage medium
WO2015188581A1 (fr) Method and device for audio control of motor vibration
WO2016112113A1 (fr) Use of digital microphones for noise suppression and low-power keyword detection
WO2020140374A1 (fr) Voice data processing method, apparatus and device, and storage medium
WO2020186695A1 (fr) Method and apparatus for batch processing of voice information, computer device and storage medium
US11790886B2 (en) System and method for synthesizing automated test cases from natural interactions
JP2015537237A (ja) Real-time traffic detection
WO2023103253A1 (fr) Audio detection method and apparatus, and terminal device
CN112185424A (zh) Voice file cropping and restoration method, apparatus, device and storage medium
JP2011530121A (ja) Method and system for filtering and monitoring program behavior
CN109903775A (zh) Audio popping detection method and apparatus
WO2024099359A1 (fr) Voice detection method and apparatus, electronic device and storage medium
CN108428457B (zh) Audio deduplication method and apparatus
CN104240697A (zh) Feature extraction method and apparatus for audio data
CN111968620B (zh) Algorithm testing method, apparatus, electronic device and storage medium
WO2021179470A1 (fr) Method, device and system for recognizing the sampling rate of pure voice data
US9978393B1 (en) System and method for automatically removing noise defects from sound recordings
CN110189763B (zh) Sound wave configuration method, apparatus and terminal device
CN111028860B (zh) Audio data processing method, apparatus, computer equipment and storage medium
CN110930986B (zh) Voice processing method, apparatus, electronic device and storage medium
CN114155845A (zh) Service determination method, apparatus, electronic device and storage medium
CN113270118A (zh) Voice activity detection method and apparatus, storage medium and electronic device
CN113658581A (zh) Acoustic model training and voice processing method, apparatus, device and storage medium
CN110047472B (zh) Batch conversion method and apparatus for voice information, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19919963

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19919963

Country of ref document: EP

Kind code of ref document: A1