WO2020186695A1

WO2020186695A1 - Voice information batch processing method and apparatus, computer device, and storage medium

Info

Publication number: WO2020186695A1
Application number: PCT/CN2019/103345
Authority: WO
Inventors: 王涛
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-03-15
Filing date: 2019-08-29
Publication date: 2020-09-24
Also published as: CN110060667B; CN110060667A

Abstract

Disclosed are a voice information batch processing method and apparatus, a computer device, and a storage medium. The method comprises: if an information processing instruction is received, obtaining a preset training set, the training set comprising a plurality of voice information to be processed; sequentially invoking and running sub-run scripts in a preset Bash script according to the information processing instruction to perform corresponding batch processing on all the voice information to be processed, so as to obtain a plurality of target voice information; filtering all target voice information by means of preset voice activation detection to obtain intermediate voice information after noise removal; and performing frame segmentation on all the intermediate voice information according to a preset frame segmentation rule to obtain test voice information for training a voice recognition model.

Description

Method, device, computer equipment and storage medium for batch processing of voice information

This application claims priority to Chinese patent application No. 201910197848.0, filed with the Chinese Patent Office on March 15, 2019 and entitled "Batch processing methods, devices, computer equipment and storage media for voice information", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the field of data processing, and in particular to a method, device, computer equipment and storage medium for batch processing of voice information.

Background technique

In speech recognition projects, it is usually necessary to collect a large amount of speech information from various channels and to use it as training samples in a training set to train a neural network, so as to obtain the corresponding recognition model for speech recognition. To ensure that training proceeds smoothly and that the resulting recognition model is accurate, the collected speech information usually has to be preprocessed before training, and preprocessing a large amount of speech information has to be completed through step-by-step iteration. Because of the large amount of data, however, this repeated iterative processing is very prone to operational errors, which makes the processing of the voice information inaccurate.

Summary of the invention

The embodiments of the present application provide a batch processing method, device, computer equipment, and storage medium for voice information, which can efficiently and accurately realize the unified conversion of multiple voice information to be processed, and reduce errors in the conversion process.

In the first aspect, an embodiment of the present application provides a method for batch processing of voice information, the method including:

If an information processing instruction is received, obtain a preset training set, where the training set includes multiple voice messages to be processed;

According to the information processing instruction, sequentially call and run the sub-run scripts in a preset Bash script, so that each time one of the sub-run scripts is run, all of the voice information to be processed is batch processed accordingly, until all of the sub-run scripts have been run and multiple pieces of target voice information are obtained, wherein the preset Bash script includes at least one preset sub-run script, each sub-run script is used to implement batch processing of all of the voice information to be processed, and the number of pieces of target voice information is less than or equal to the number of pieces of voice information to be processed;

Perform filtering processing on all target voice information through preset voice activation detection to obtain intermediate voice information after noise removal;

Perform framing processing on all intermediate voice information by preset framing rules to obtain test voice information for training the voice recognition model.

In the second aspect, an embodiment of the present application also provides a batch processing device for voice information, which includes:

An obtaining unit, configured to obtain a preset training set if an information processing instruction is received, the training set including a plurality of to-be-processed voice information;

The batch processing unit is used to sequentially call and run the sub-run scripts in a preset Bash script according to the information processing instruction, so that each time one of the sub-run scripts is run, all of the voice information to be processed is batch processed accordingly, until all of the sub-run scripts have been run and multiple pieces of target voice information are obtained, wherein the preset Bash script includes at least one preset sub-run script, each sub-run script is used to implement batch processing of all of the voice information to be processed, and the number of pieces of target voice information is less than or equal to the number of pieces of voice information to be processed;

The noise removal unit is used to filter all target voice information through preset voice activation detection to obtain the intermediate voice information after noise removal;

The framing unit is used to perform framing processing on all intermediate voice information through preset framing rules to obtain test voice information for training the voice recognition model.

In a third aspect, an embodiment of the present application also provides a computer device, which includes a memory and a processor, the memory stores a computer program, and the processor implements the above method when the computer program is executed.

In a fourth aspect, an embodiment of the present application also provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program can implement the foregoing method when executed by a processor.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.

FIG. 1 is a schematic flowchart of a method for batch processing of voice information provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a sub-flow of a method for batch processing of voice information provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a sub-flow of a method for batch processing of voice information provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a sub-flow of a method for batch processing of voice information provided by an embodiment of the present application;

FIG. 5 is a schematic block diagram of an apparatus for batch processing of voice information according to an embodiment of the present application;

FIG. 6 is a schematic block diagram of a batch processing unit of a voice information batch processing apparatus provided by an embodiment of the present application;

FIG. 7 is another schematic block diagram of a batch processing unit of a voice information batch processing apparatus provided by an embodiment of the present application;

FIG. 8 is another schematic block diagram of a batch processing unit of a voice information batch processing apparatus provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of the structural composition of a computer device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit the application. As used in the specification of this application and the appended claims, unless the context clearly indicates other circumstances, the singular forms "a", "an" and "the" are intended to include plural forms.

Please refer to Fig. 1, which is a schematic flowchart of a method for batch processing of voice information provided by an embodiment of the present application. The method is applied to a management server. Before the management server trains a neural network with the training set, it performs batch preprocessing on the to-be-processed voice information in the acquired training set, for example: removing damaged or too-short voice information from the training set; converting the audio format and sampling rate of the voice information to a uniform audio format and sampling rate; and renaming all of the voice information according to specific rules. Batch processing of this kind efficiently and accurately realizes the unified conversion of the multiple pieces of to-be-processed voice information in the training set, and effectively reduces the errors that arise when the pieces of voice information are processed and converted one after another, so that the neural network can be trained accurately. As shown in Figure 1, the method includes steps S101 to S104.

Step S101: If an information processing instruction is received, a preset training set is obtained, and the training set includes a plurality of to-be-processed voice information.

In this embodiment, in order to train the neural network and obtain the corresponding speech recognition model, the voice information in the acquired training set needs to be preprocessed in batches, so as to meet the requirements of neural network training and improve the accuracy of the speech recognition model obtained by training. The training set can be preset; that is, voice information can be collected and stored from various applications capable of obtaining voice information. The voice information stored in the training set at this point is the voice information to be processed. When the management server receives an information processing instruction initiated by the user, it obtains the preset training set, that is, the multiple pieces of to-be-processed voice information in the training set, to facilitate subsequent operations.

Step S102: According to the information processing instruction, sequentially call and run the sub-run scripts in a preset Bash script, so that each time one of the sub-run scripts is run, all of the voice information to be processed is batch processed accordingly, until all of the sub-run scripts have been run and multiple pieces of target voice information are obtained. The preset Bash script includes at least one preset sub-run script, each sub-run script is used to implement batch processing of all of the voice information to be processed, and the number of pieces of target voice information is less than or equal to the number of pieces of voice information to be processed.

In this embodiment, the preset Bash script can integrate multiple preset sub-run scripts. Each sub-run script batch processes all of the audio files to be processed in one processing step; that is, running one sub-run script applies the same conversion or change to every pending audio file. After all pending audio files have completed that processing, the management server can call another sub-run script to perform a further conversion or change on the basis of the previous one.

Specifically, the management server can sequentially call the sub-run scripts in the preset Bash script according to the information processing instruction. Each sub-run script is called once to perform the corresponding batch processing on all of the voice information to be processed, and the next sub-run script in the Bash script is then called, until all sub-run scripts have been run and multiple pieces of fully converted or changed target voice information are obtained. Each batch conversion or change starts only after the previous one has finished for all of the voice information to be processed. This effectively reduces the errors caused by a large number of pieces of voice information and a large number of conversion steps, and thereby improves the efficiency of processing the voice information.

The management server can generally execute the Bash script through Python; that is, Python can run the multiple preset sub-run scripts in the Bash script one after another to batch process the voice information to be processed in the training set step by step. This reduces errors in the step-by-step iterative processing and improves the efficiency and accuracy of the conversion.
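As an illustration of running sub-run scripts strictly one after another from Python, the following is a minimal sketch rather than the implementation of this application. The function name run_stages and the representation of each sub-run script as an argument list are assumptions made for the example; only the standard-library call subprocess.run is relied on.

```python
import subprocess
import sys
from typing import Sequence


def run_stages(stages: Sequence[Sequence[str]]) -> int:
    """Run each stage command to completion, in order.

    Each stage is an argument list such as ["bash", "convert.sh", "training_set"]
    (hypothetical script name). A stage starts only after the previous one has
    exited successfully. Returns the number of stages that completed.
    """
    completed = 0
    for command in stages:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a later stage never runs on partially processed data.
        subprocess.run(list(command), check=True)
        completed += 1
    return completed
```

Stopping at the first failing stage is a design choice of this sketch: it keeps the later conversion steps from running on a training set that the earlier step did not finish processing.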

In an embodiment, as shown in FIG. 2, the step S102 may include steps S201 to S202.

S201: Call a first running script in a preset Bash script according to the information processing instruction.

The management server may call the first running script in the preset Bash script according to the received information processing instruction, so as to facilitate subsequent processing. The first running script can realize the conversion of audio format and sampling rate of all voice information to be processed in the preset training set.

Optionally, the first running script may be an FFmpeg script. FFmpeg is a set of open-source computer programs that can record and convert digital audio and video and turn them into streams. In this application, the FFmpeg script converts the audio format and sampling rate of the voice information to be processed.

S202: Run the first running script to perform audio format conversion and sample rate conversion on all voice information to be processed, so as to obtain multiple target voice information with preset audio formats and preset sampling rates.

In order that features can be extracted quickly from the voice information in the training set while the neural network is being trained, all of the voice information to be processed can be converted to a unified audio format and a unified sampling rate. After the management server runs the first running script, all of the voice information to be processed is batch converted, according to the preset audio format and preset sampling rate set in the first running script, into target voice information having the preset audio format and the preset sampling rate.

Specifically, common audio formats include WAV, MIDI, MP3, RA, MP4 and other format types. To unify the audio format, the preset audio format can be set to WAV; that is, any voice information whose audio format is not the preset audio format can be converted to WAV by running the first running script.

The sampling rate, also called the sampling frequency or sampling speed, defines the number of samples extracted per second from a continuous signal to form a discrete signal, and is expressed in hertz (Hz). The reciprocal of the sampling rate is the sampling period or sampling time, which is the time interval between samples. In layman's terms, the sampling rate is how many signal samples the computer collects per second: 8k means that 8000 samples are collected in 1 s, and 16k means that 16000 samples are collected in 1 s. Thus, if the preset sampling rate is 8k and the sampling rate of the voice information to be converted is 16k, the first running script converts the sampling rate of that voice information from 16k to 8k.
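The conversion described above can be sketched as follows. This is only an illustrative example under stated assumptions: the helper names sampling_period and build_ffmpeg_command are hypothetical, the command is built but not executed, and the choice of WAV output relies on ffmpeg inferring the output format from the ".wav" file extension. The options used (-y, -i, -ar) are standard ffmpeg command-line options.

```python
from pathlib import Path
from typing import List


def sampling_period(sample_rate_hz: int) -> float:
    """Return the time between consecutive samples, in seconds."""
    return 1.0 / sample_rate_hz


def build_ffmpeg_command(src: str, out_dir: str, sample_rate_hz: int = 8000) -> List[str]:
    """Build (but do not run) an ffmpeg command that writes a WAV file
    at the requested sampling rate into out_dir."""
    dst = Path(out_dir) / (Path(src).stem + ".wav")
    # -y overwrites an existing output file, -i names the input file,
    # -ar sets the audio sampling rate of the output in Hz.
    return ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate_hz), str(dst)]
```

At a preset sampling rate of 8k the sampling period is 1/8000 s, that is, 0.125 ms between samples; at 16k it is half of that.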

In an embodiment, as shown in FIG. 3, the preset Bash script includes a first running script for audio format conversion and a second running script for effective audio filtering, and the step S102 may include steps S301 to S304.

S301: Invoke a first running script in a preset Bash script according to the information processing instruction.

S302: Run the first running script to perform audio format conversion and sampling rate conversion on all voice information to be processed, so as to obtain a corresponding number of first voice information having a preset audio format and a preset sampling rate.

S303: Call a second running script in the preset Bash script.

In order to filter the first voice information whose audio format and sampling rate have been converted, the management server needs to call the second running script, used for effective audio filtering, in the preset Bash script. The preset specification in the second running script sets the conditions for screening voice information, so that voice information meeting the preset specification can be selected from the multiple pieces of first voice information as valid voice information. Optionally, the second running script may be a SoX script. As a voice processing tool, SoX can filter the valid voice information out of the multiple pieces of first voice information according to the preset specification.

S304: Run the second running script to filter all the first voice information, so as to obtain a plurality of target voice information meeting preset specifications, and the number of the target voice information is less than or equal to the number of the first voice information.

After the management server runs the second running script, all of the first voice information is filtered according to the preset specification to obtain the target voice information that meets the conditions; after screening, therefore, the number of pieces of target voice information is less than or equal to the number of pieces of first voice information. The preset specification may be a preset voice duration threshold: for example, if the duration of a piece of first voice information is below the preset threshold, that piece is deleted. In the same way, the preset specification may also be a preset threshold on the number of sampling points of the voice information, on its scaling factor, or on its maximum amplitude.
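The duration-threshold screening can be illustrated with a short sketch. Note that it does not use SoX: as a stand-in, it reads the WAV header with Python's standard wave module, and the function names wav_duration_seconds and keep_long_enough are hypothetical. It assumes the files have already been converted to WAV by the previous step.

```python
import wave
from typing import Iterable, List


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file: frame count divided by frame rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def keep_long_enough(paths: Iterable[str], min_seconds: float) -> List[str]:
    """Return, in the original order, the paths whose duration is at least
    min_seconds; shorter files are the ones the filtering step would delete."""
    return [p for p in paths if wav_duration_seconds(p) >= min_seconds]
```

Because files can only be dropped, never added, the returned list is never longer than the input, which matches the "less than or equal to" relation stated above.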

In one embodiment, as shown in FIG. 4, the preset Bash script includes a first running script for audio format conversion, a second running script for effective audio filtering, and a third running script for renaming, and the step S102 may include steps S401 to S406.

S401: Call a first running script in a preset Bash script according to the information processing instruction.

S402: Run the first running script to perform audio format conversion and sample rate conversion on all voice information to be processed, so as to obtain a corresponding number of first voice information with the same audio format and sampling rate.

S403: Call the second running script in the preset Bash script.

In order to filter the first voice information whose audio format and sampling rate have been converted, the management server needs to call the second running script, used for effective audio filtering, in the preset Bash script. The preset specification in the second running script sets the conditions for screening voice information, so that voice information meeting the preset specification can be selected from the multiple pieces of first voice information as valid voice information. Optionally, the second running script may be a SoX script. As a voice processing tool, SoX can filter the valid voice information out of the multiple pieces of first voice information according to the preset specification.

S404: Run the second running script to filter all the first voice information, so as to obtain a plurality of second voice information meeting preset specifications, wherein the number of pieces of second voice information is less than or equal to the number of pieces of first voice information.

S405: Call the third running script in the preset Bash script.

In order to rename the current second voice information, the management server needs to call the third running script, used for renaming, in the preset Bash script, so that the renamed voice information can be read more accurately and quickly. A name format is preset in the third running script, so that the multiple pieces of second voice information can be renamed according to the preset name format. Optionally, the third running script is a renaming function, which may be the function rename() for renaming files.

S406: Run the third running script to rename all the second voice information, so as to obtain a corresponding number of target voice information with a preset name format.

The voice information in the training set may be produced by the same subject; that is, each subject may correspond to multiple different pieces of voice information. To make them easy to distinguish, the second voice information needs to be renamed according to the preset name format and the existing information of the second voice information. After running the third running script, the management server obtains the renamed target voice information, whose names conform to the preset name format. Furthermore, the number of pieces of target voice information equals the number of pieces of second voice information, and the two correspond one to one.
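A renaming step of this kind might look like the sketch below. The name format "{speaker}_{index:04d}{suffix}", the speaker identifier, and the function name rename_batch are all assumptions chosen for illustration; the text above does not specify the preset name format. The sketch uses pathlib's Path.rename from the standard library.

```python
from pathlib import Path
from typing import List


def rename_batch(
    directory: str,
    speaker_id: str,
    pattern: str = "{speaker}_{index:04d}{suffix}",
) -> List[Path]:
    """Rename every file in directory according to pattern, in sorted order.

    Returns the new paths; one output path per input file, so the
    correspondence between old and new files is one to one.
    """
    sources = sorted(p for p in Path(directory).iterdir() if p.is_file())
    renamed = []
    for index, src in enumerate(sources, start=1):
        new_name = pattern.format(speaker=speaker_id, index=index, suffix=src.suffix)
        dst = src.with_name(new_name)
        if dst != src:
            # Refuse to overwrite an existing file rather than lose data.
            if dst.exists():
                raise FileExistsError(str(dst))
            src.rename(dst)
        renamed.append(dst)
    return renamed
```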

Step S103: Perform filtering processing on all target voice information through preset voice activation detection to obtain intermediate voice information after noise removal.

In this embodiment, before the neural network is trained, the target voice information also needs to be filtered through voice activation detection, that is, Voice Activity Detection (VAD for short), which distinguishes the speech signal from the background noise in a voice signal, thereby improving the accuracy of neural network training and reducing the time it requires. Voice activation detection can cut off the silence at the beginning and the end of the voice information and reduce the interference caused to subsequent steps. In other words, voice activation detection can filter all of the target voice information in a batch to obtain the corresponding multiple pieces of denoised intermediate voice information.
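The text does not say which voice activation detection algorithm is preset. Purely as an illustration of cutting off silence at the beginning and end of a signal, the sketch below uses a simple energy threshold, which is an assumption and far cruder than practical VAD methods; the function name trim_silence and its parameters are hypothetical.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[float], frame_len: int, threshold: float) -> List[float]:
    """Drop leading and trailing frames whose mean absolute amplitude is
    below threshold. Quiet frames between two active frames are kept."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = [
        i for i, frame in enumerate(frames)
        if sum(abs(x) for x in frame) / len(frame) >= threshold
    ]
    if not active:
        # No frame reaches the threshold: the whole signal counts as silence.
        return []
    start = active[0] * frame_len
    end = min((active[-1] + 1) * frame_len, len(samples))
    return list(samples[start:end])
```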

Step S104: Perform framing processing on all intermediate voice information according to preset framing rules to obtain test voice information for training a voice recognition model.

In this embodiment, the management server also needs to perform framing processing on all intermediate voice information according to preset framing rules, so as to obtain a corresponding number of framed test voice information. The test voice information can be used to train a speech recognition model, so as to obtain a speech recognition model capable of the corresponding speech recognition. Specifically, the preset framing rule may refer to framing the sound through a moving window function; that is, the voice information is cut into small segments, each of which is called a frame, and adjacent frames generally overlap.
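Overlapping framing with a moving window can be sketched as follows. This is a minimal example, not the preset framing rule itself: the function name split_into_frames is hypothetical, a plain rectangular window is assumed (no weighting is applied to the samples), and trailing samples that do not fill a whole frame are simply dropped.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut samples into frames of frame_len samples, moving the window
    forward by hop samples each time. Frames overlap when hop < frame_len."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [list(samples[s:s + frame_len]) for s in range(0, last_start + 1, hop)]
```

With frame_len = 4 and hop = 2, for instance, each frame shares its last two samples with the first two samples of the next frame. At a sampling rate of 8k, a frame length of 200 samples and a hop of 80 samples would correspond to 25 ms frames every 10 ms; these particular values are illustrative assumptions, not values given in this application.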

In another embodiment, the step S104 may specifically include: performing framing processing on the intermediate voice information through the Enframe function to obtain test voice information for training a voice recognition model.

Wherein, the Enframe function is a specific framing function, and the management server can perform unified framing processing on all intermediate voice information after calling the framing function, so as to obtain the final test voice information for training.

In summary, the embodiments of the present application can efficiently and accurately realize the unified conversion of multiple to-be-processed voice information in the training set, and reduce errors in the conversion process, so as to accurately implement neural network training.

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), etc.

Referring to FIG. 5, corresponding to the above-mentioned method for batch processing of voice information, an embodiment of the present application also proposes a device for batch processing of voice information. The device 100 includes an acquisition unit 101, a batch processing unit 102, a noise removal unit 103, and a framing unit 104.

The obtaining unit 101 is configured to obtain a preset training set if an information processing instruction is received, and the training set includes a plurality of to-be-processed voice information.

The batch processing unit 102 is configured to sequentially call and run the sub-run scripts in a preset Bash script according to the information processing instruction, so that each time one of the sub-run scripts is run, all of the voice information to be processed is batch processed accordingly, until all of the sub-run scripts have been run and multiple pieces of target voice information are obtained. The preset Bash script includes at least one preset sub-run script, each sub-run script is used to implement batch processing of all of the voice information to be processed, and the number of pieces of target voice information is less than or equal to the number of pieces of voice information to be processed.

In an embodiment, as shown in FIG. 6, the preset Bash script includes a first running script for converting audio format and sampling rate, and the batch processing unit 102 may include a first calling unit 201 and a first running unit 202.

The first calling unit 201 is configured to call the first running script in the preset Bash script according to the information processing instruction.

The first running unit 202 is configured to run the first running script to perform audio format conversion and sample rate conversion on all voice information to be processed, thereby obtaining multiple pieces of target voice information with a preset audio format and a preset sampling rate.

In an embodiment, as shown in FIG. 7, the preset Bash script includes a first running script for audio format conversion and a second running script for effective audio filtering, and the batch processing unit 102 may include a first calling unit 301, a first running unit 302, a second calling unit 303, and a second running unit 304.

The first calling unit 301 is configured to call the first running script in the preset Bash script according to the information processing instruction.

The first running unit 302 is configured to run the first running script to perform audio format conversion and sample rate conversion on all the voice information to be processed, so as to obtain a corresponding number of first voice information having a preset audio format and a preset sampling rate.

The second calling unit 303 is configured to call a second running script in the preset Bash script.

The second running unit 304 is configured to run the second running script to filter all the first voice information, so as to obtain a plurality of target voice information meeting preset specifications, wherein the number of pieces of target voice information is less than or equal to the number of pieces of first voice information.

In one embodiment, as shown in FIG. 8, the preset Bash script includes a first running script for audio format conversion, a second running script for effective audio filtering, and a third running script for renaming, and the batch processing unit 102 may include a first calling unit 401, a first running unit 402, a second calling unit 403, a second running unit 404, a third calling unit 405, and a third running unit 406.

The first calling unit 401 is configured to call the first running script in the preset Bash script according to the information processing instruction.

The first running unit 402 is configured to run the first running script to perform audio format conversion and sample rate conversion on all voice information to be processed, so as to obtain a corresponding number of first voice information with the same audio format and sampling rate.

The second calling unit 403 is configured to call a second running script in the preset Bash script.

The second running unit 404 is configured to run the second running script to filter all the first voice information, so as to obtain a plurality of second voice information meeting preset specifications, wherein the number of pieces of second voice information is less than or equal to the number of pieces of first voice information.

The third calling unit 405 is configured to call the third running script in the preset Bash script.

The third running unit 406 is configured to run the third running script to rename all the second voice information, so as to obtain a corresponding number of target voice information with a preset name format.

The noise removal unit 103 is configured to perform filtering processing on all target voice information through preset voice activation detection to obtain intermediate voice information after noise removal.

The framing unit 104 is configured to perform framing processing on all intermediate voice information according to a preset framing rule to obtain test voice information for training a voice recognition model.

In another embodiment, the framing unit 104 may be specifically configured to perform framing processing on the intermediate voice information through the Enframe function to obtain test voice information for training a voice recognition model.

It should be noted that those skilled in the art can clearly understand that, for the specific implementation process of the above-mentioned voice information batch processing apparatus 100 and of each unit, reference can be made to the corresponding description in the foregoing method embodiment. For convenience and conciseness of description, the details are not repeated here.

As can be seen from the above, in terms of hardware implementation, the above acquisition unit 101, batch processing unit 102, noise removal unit 103, and framing unit 104 can be embedded in, or be independent of, the batch processing device for voice information in the form of hardware, or can be stored in the form of software in the memory of the batch processing device for voice information, so that the processor can call and execute the operations corresponding to the above units. The processor can be a central processing unit (CPU), a microprocessor, a single-chip microcomputer, etc.

The foregoing apparatus for batch processing of voice information can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 9.

FIG. 9 is a schematic diagram of the structural composition of a computer device of this application. The device can be a server, where the server can be an independent server or a server cluster composed of multiple servers.

Referring to FIG. 9, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.

The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, it can cause the processor 502 to execute a method for batch processing of voice information.

The processor 502 is used to provide computing and control capabilities, and support the operation of the entire computer device 500.

The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute a method for batch processing of voice information.

The network interface 505 is used for network communication with other devices. Those skilled in the art can understand that the structure shown in FIG. 9 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied. The specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.

Wherein, the processor 502 is configured to run a computer program 5032 stored in a memory, so as to implement the method for batch processing of voice information in any of the foregoing embodiments.

It should be understood that, in this embodiment of the application, the processor 502 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be implemented by computer programs instructing relevant hardware. The computer program may be stored in a storage medium, and the storage medium is a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiment.

Therefore, this application also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, and when the computer program is executed by the processor, the processor executes the voice information batch processing method in any of the above embodiments.

The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other medium that can store program code.

A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of each unit is only a logical function division, and there may be other division methods in actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.

The order of the steps in the method of the embodiment of the present application can be adjusted, and steps can be merged or deleted, according to actual needs. The units in the device in the embodiment of the present application may be combined, divided, and deleted according to actual needs. In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of this application, in essence, or the part of it that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.

The above are only specific implementations of this application, but the protection scope of this application is not limited to them. Anyone familiar with the technical field can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

1. A method for batch processing of voice information, comprising:

If an information processing instruction is received, obtain a preset training set, where the training set includes multiple voice messages to be processed;

According to the information processing instruction, sequentially call and run the sub-running scripts in a preset Bash script, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, wherein the preset Bash script includes at least one preset sub-running script, each sub-running script is used to implement batch processing of all the voice information to be processed, and the number of target voice information is less than or equal to the number of voice information to be processed;

Perform filtering processing on all target voice information through preset voice activation detection to obtain intermediate voice information after noise removal;

Perform framing processing on all intermediate voice information by preset framing rules to obtain test voice information for training the voice recognition model.
2. The method of claim 1, wherein the preset Bash script includes a first running script for converting audio format and sampling rate, and the step of sequentially calling and running the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, comprises:

Calling the first running script in the preset Bash script according to the information processing instruction;

The first running script is executed to perform audio format conversion and sampling rate conversion on all voice information to be processed, thereby obtaining multiple target voice information with preset audio formats and preset sampling rates.
3. The method according to claim 2, wherein the first running script is an FFmpeg script, and the step of running the first running script to perform audio format conversion and sampling rate conversion on all voice information to be processed, thereby obtaining a plurality of target voice information with a preset audio format and a preset sampling rate, comprises:

Run the FFmpeg script and determine the preset audio format and preset sampling rate set in the FFmpeg script, so as to convert all the voice information to be processed, in batch, into target voice information with the preset audio format and preset sampling rate.
4. The method according to claim 1, wherein the preset Bash script includes a first running script for performing audio format conversion and a second running script for performing effective audio screening, and the step of sequentially calling and running the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, comprises:

Calling the first running script in the preset Bash script according to the information processing instruction;

Running the first running script to perform audio format conversion and sampling rate conversion on all voice information to be processed, so as to obtain a corresponding number of first voice information with a preset audio format and a preset sampling rate;

Call the second running script in the preset Bash script;

The second running script is executed to filter all the first voice information, so as to obtain a plurality of target voice information meeting preset specifications, and the number of the target voice information is less than or equal to the number of the first voice information.
5. The method of claim 4, wherein the second running script is a SOX voice processing tool, and the step of running the second running script to filter all the first voice information, thereby obtaining a plurality of target voice information meeting preset specifications, comprises:

Run the SOX voice processing tool to filter all the first voice information, so as to obtain multiple target voice information that meet the preset voice duration threshold.
6. The method of claim 1, wherein the preset Bash script includes a first running script for audio format conversion, a second running script for effective audio filtering, and a third running script for renaming, and the step of sequentially calling and running the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, comprises:

Calling the first running script in the preset Bash script according to the information processing instruction;

Running the first running script to perform audio format conversion and sampling rate conversion on all the voice information to be processed, so as to obtain a corresponding number of first voice information with the same audio format and sampling rate;

Call the second running script in the preset Bash script;

Running the second running script to filter all the first voice information, so as to obtain a plurality of second voice information meeting preset specifications, and the number of the second voice information is less than or equal to the number of the first voice information;

Call the third running script in the preset Bash script;

Run the third running script to rename all the second voice information, so as to obtain a corresponding number of target voice information with a preset name format.
7. The method of claim 6, wherein the third running script is a renaming function, and the renaming function is a function rename().
8. The method according to claim 1, wherein the step of performing framing processing on all intermediate voice information according to a preset framing rule to obtain test voice information for training a voice recognition model comprises:

The intermediate voice information is framed by the Enframe function to obtain test voice information for training the voice recognition model.
9. A batch processing device for voice information, comprising:

An obtaining unit, configured to obtain a preset training set if an information processing instruction is received, the training set including a plurality of to-be-processed voice information;

A batch processing unit, configured to sequentially call and run the sub-running scripts in a preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, wherein the preset Bash script includes at least one preset sub-running script, each sub-running script is used to implement batch processing of all the voice information to be processed, and the number of target voice information is less than or equal to the number of voice information to be processed;

The noise removal unit is used to filter all target voice information through preset voice activation detection to obtain the intermediate voice information after noise removal;

The framing unit performs framing processing on all intermediate voice information through preset framing rules to obtain test voice information for training the voice recognition model.
10. The device of claim 9, wherein the preset Bash script includes a first running script for converting audio format and sampling rate, and the batch processing unit includes:

The first calling unit is configured to call the first running script in the preset Bash script according to the information processing instruction;

The first running unit is configured to run the first running script to perform audio format conversion and sampling rate conversion on all voice information to be processed, so as to obtain multiple target voice information with preset audio formats and preset sampling rates.
11. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the following steps when executing the computer program:

If an information processing instruction is received, obtain a preset training set, where the training set includes multiple voice messages to be processed;

According to the information processing instruction, sequentially call and run the sub-running scripts in a preset Bash script, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, wherein the preset Bash script includes at least one preset sub-running script, each sub-running script is used to implement batch processing of all the voice information to be processed, and the number of target voice information is less than or equal to the number of voice information to be processed;

Perform filtering processing on all target voice information through preset voice activation detection to obtain intermediate voice information after noise removal;

Perform framing processing on all intermediate voice information by preset framing rules to obtain test voice information for training the voice recognition model.
12. The computer device according to claim 11, wherein the preset Bash script includes a first running script for converting audio format and sampling rate, and the step of sequentially calling and running the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, comprises:

Calling the first running script in the preset Bash script according to the information processing instruction;

The first running script is executed to perform audio format conversion and sampling rate conversion on all voice information to be processed, thereby obtaining multiple target voice information with preset audio formats and preset sampling rates.
13. The computer device according to claim 12, wherein the first running script is an FFmpeg script, and the step of running the first running script to perform audio format conversion and sampling rate conversion on all voice information to be processed, thereby obtaining a plurality of target voice information with a preset audio format and a preset sampling rate, comprises:

Run the FFmpeg script and determine the preset audio format and preset sampling rate set in the FFmpeg script, so as to convert all the voice information to be processed, in batch, into target voice information with the preset audio format and preset sampling rate.
14. The computer device according to claim 11, wherein the preset Bash script includes a first running script for performing audio format conversion and a second running script for performing effective audio screening, and the step of sequentially calling and running the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, comprises:

Calling the first running script in the preset Bash script according to the information processing instruction;

Running the first running script to perform audio format conversion and sampling rate conversion on all voice information to be processed, so as to obtain a corresponding number of first voice information with a preset audio format and a preset sampling rate;

Call the second running script in the preset Bash script;

The second running script is executed to filter all the first voice information, so as to obtain a plurality of target voice information meeting preset specifications, and the number of the target voice information is less than or equal to the number of the first voice information.
15. The computer device according to claim 14, wherein the second running script is a SOX voice processing tool, and the step of running the second running script to filter all the first voice information, thereby obtaining a plurality of target voice information meeting preset specifications, comprises:

Run the SOX voice processing tool to filter all the first voice information, so as to obtain multiple target voice information that meet the preset voice duration threshold.
16. The computer device according to claim 11, wherein the preset Bash script includes a first running script for performing audio format conversion, a second running script for performing effective audio screening, and a third running script for performing renaming, and the step of sequentially calling and running the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, comprises:

Calling the first running script in the preset Bash script according to the information processing instruction;

Running the first running script to perform audio format conversion and sampling rate conversion on all the voice information to be processed, so as to obtain a corresponding number of first voice information with the same audio format and sampling rate;

Call the second running script in the preset Bash script;

Running the second running script to filter all the first voice messages to obtain a plurality of second voice messages that meet the preset specifications, and the number of the second voice messages is less than or equal to the number of the first voice messages;

Call the third running script in the preset Bash script;

Run the third running script to rename all the second voice information, so as to obtain a corresponding number of target voice information with a preset name format.
17. The computer device according to claim 16, wherein the third running script is a renaming function, and the renaming function is a function rename().
18. The computer device according to claim 11, wherein the step of performing framing processing on all intermediate voice information according to preset framing rules to obtain test voice information for training a voice recognition model comprises:

The intermediate voice information is framed by the Enframe function to obtain test voice information for training the voice recognition model.
19. A computer-readable storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:

If an information processing instruction is received, obtain a preset training set, where the training set includes multiple voice messages to be processed;

According to the information processing instruction, sequentially call and run the sub-running scripts in a preset Bash script, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, wherein the preset Bash script includes at least one preset sub-running script, each sub-running script is used to implement batch processing of all the voice information to be processed, and the number of target voice information is less than or equal to the number of voice information to be processed;

Perform filtering processing on all target voice information through preset voice activation detection to obtain intermediate voice information after noise removal;

Perform framing processing on all intermediate voice information by preset framing rules to obtain test voice information for training the voice recognition model.
20. The computer-readable storage medium according to claim 19, wherein the preset Bash script includes a first running script for performing audio format and sampling rate conversion, and the step of sequentially calling and running the sub-running scripts in the preset Bash script according to the information processing instruction, so that each time one of the sub-running scripts is run, corresponding batch processing is performed on all the voice information to be processed, until all the sub-running scripts have been run, thereby obtaining a plurality of target voice information, comprises:

Calling the first running script in the preset Bash script according to the information processing instruction;

The first running script is executed to perform audio format conversion and sampling rate conversion on all voice information to be processed, thereby obtaining multiple target voice information with preset audio formats and preset sampling rates.