WO2023226726A1 - Voice data processing method and apparatus - Google Patents

Voice data processing method and apparatus

Info

Publication number
WO2023226726A1
Authority
WO
WIPO (PCT)
Prior art keywords
original audio
audio data
data
voice
speech recognition
Prior art date
Application number
PCT/CN2023/092438
Other languages
French (fr)
Chinese (zh)
Inventor
刘佛圣
吴广杰
Original Assignee
京东方科技集团股份有限公司
高创(苏州)电子有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司 and 高创(苏州)电子有限公司
Publication of WO2023226726A1 publication Critical patent/WO2023226726A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems

Definitions

  • the present disclosure relates to the field of voice processing technology, and in particular, to a voice data processing method and device.
  • voice-to-text can be used to quickly and easily record the entire meeting and generate text records in real time.
  • speech-to-text is affected by multiple factors such as the environment and the speaker's pronunciation, and cannot achieve 100% accuracy.
  • the generated text needs to be manually corrected after the meeting.
  • it is necessary to listen to the recording from beginning to end for correction, which takes a long time, is inefficient, and results in a poor user experience.
  • the technical problem to be solved by this disclosure is to provide a voice data processing method and device that can realize voice segmentation.
  • a voice data processing method including:
  • the first time schedule, the second time schedule and the original audio data are stored.
  • the method further includes:
  • the original audio data between the first time progress and the second time progress is taken as target audio data, and the target audio data is marked.
  • the method further includes:
  • the method further includes:
  • the method further includes:
  • the method further includes:
  • the first text data that has been marked is stored.
  • the method further includes:
  • speech recognition is performed on the original audio data between the first time progress and the second time progress, and the first text data is corrected according to the speech recognition result to obtain second text data.
  • performing speech recognition on the original audio data between the first time progress and the second time progress includes: using a speech recognition engine to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain the speech recognition result.
  • correcting the first text data according to the speech recognition result includes:
  • a first part and a second part of the first text data are intercepted, where the first part is located between the first starting position and the first ending position, and the second part is located outside the first starting position and the first ending position;
  • performing speech recognition on the original audio data includes:
  • the target speech data is input to a speech recognition engine.
  • the method further includes:
  • the search keyword is found in the first text data, and the location of the search keyword is marked.
  • the method further includes:
  • labeling the first text data includes:
  • An embodiment of the present disclosure also provides a voice data processing device, including:
  • a first receiving module configured to receive a first marking instruction for the first time progress of the original audio data
  • a second receiving module configured to receive a second marking instruction for a second time progress of the original audio data, where the second time progress is later than the first time progress;
  • a storage module configured to store the first time progress, the second time progress and the original audio data.
  • the device further includes:
  • a marking processing module configured to use the original audio data between the first time progress and the second time progress as target audio data, and perform marking processing on the target audio data.
  • the device further includes:
  • a speech recognition module configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
  • the device further includes:
  • a speech recognition module used to perform speech recognition on the original audio data to obtain the first text data
  • a marking processing module configured to start marking the first text data from a first starting position, the first starting position corresponding to the first time progress, and to stop marking the first text data at a first ending position, the first ending position corresponding to the second time progress;
  • the storage module is used to store the first text data after mark processing.
  • the device further includes:
  • a third receiving module configured to receive processing instructions for a second position of the first text data, the second position being located between the first starting position and the first ending position;
  • a second processing module used to perform speech recognition on the original audio data between the first time progress and the second time progress, correct the first text data according to the speech recognition result, and obtain the second text data.
  • the speech recognition module is specifically configured to loop-play the original audio data between the first time progress and the second time progress, and to use a speech recognition engine to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result.
  • the second processing module includes:
  • an interception submodule used to intercept the first part and the second part of the first text data, where the first part is located between the first starting position and the first ending position, and the second part is located outside the first starting position and the first ending position;
  • a comparison submodule used to compare the first part with the speech recognition result; if the first part is inconsistent with the speech recognition result, the speech recognition result replaces the first part to obtain second text data comprising the speech recognition result and the second part.
  • the speech recognition module includes:
  • a segmentation sub-module used to segment the original audio data in the target audio format to obtain target voice data
  • a processing submodule is used to input the target speech data to a speech recognition engine.
  • the device further includes:
  • a voice search module configured to receive voice search instructions input by the user; perform voice recognition on the voice search instructions to convert them into search keywords; and search for the search keywords in the first text data and mark their locations.
  • the device further includes:
  • a typo recognition module is configured to receive typo recognition instructions input by the user; identify typos in the first text data according to the vocabulary and contextual semantic recognition algorithms, and mark the typos.
  • An embodiment of the present disclosure also provides a voice data processing device, including a processor and a memory.
  • the memory stores programs or instructions that can be run on the processor; when the programs or instructions are executed by the processor, the steps of the voice data processing method described above are implemented.
  • Embodiments of the present disclosure also provide a readable storage medium on which programs or instructions are stored. When the programs or instructions are executed by a processor, the steps of the voice data processing method as described above are implemented.
  • marking instructions for different time progresses of the original audio data are received, and the time progresses are stored together with the original audio data according to the marking instructions, so that the speech can be segmented according to the recorded time progresses.
  • Figure 1 is a schematic flowchart of a voice data processing method according to an embodiment of the present disclosure
  • Figure 2 is a schematic diagram of the composition of an electronic device according to an embodiment of the present disclosure
  • Figure 3 is a structural block diagram of a voice data processing device according to an embodiment of the present disclosure.
  • FIG. 4 is a structural block diagram of the second processing module according to the embodiment of the present disclosure.
  • Figure 5 is a structural block diagram of a speech recognition module according to an embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of the composition of a voice data processing device according to an embodiment of the present disclosure.
  • Embodiments of the present disclosure provide a voice data processing method and device, which can realize voice segmentation.
  • Embodiments of the present disclosure provide a voice data processing method, as shown in Figure 1, including:
  • Step 101: Obtain original audio data.
  • the technical solution of this embodiment is applied to electronic equipment.
  • the electronic equipment can perform human-computer interaction with the user.
  • the electronic equipment includes a voice recording system, a voice-to-text system, a timer system, a text style system, a voice data operating system, a voice playback system, etc.
  • Electronic devices can interact with backend servers through the network.
  • the above-mentioned electronic device may be a terminal device configured with a target client and/or a target server.
  • the above-mentioned terminal equipment can be a microphone or a microphone array, or a terminal equipment equipped with a microphone.
  • the above-mentioned electronic equipment can include but is not limited to at least one of the following: mobile phones (such as Android phones, iOS phones, etc.), laptops, tablets, handheld computers, MID (Mobile Internet Devices), PAD, desktop computers, smart TVs, etc.
  • the target client can be a video client, an instant messaging client, a browser client, an education client, etc.
  • the target server can be a video server, an instant messaging server, a browser server, an education server, etc.
  • the above-mentioned networks may include but are not limited to: wired networks and wireless networks.
  • the wired networks include local area networks, metropolitan area networks and wide area networks.
  • the wireless networks include Bluetooth, WIFI and other networks that implement wireless communication.
  • the above-mentioned server can be a single server, a server cluster composed of multiple servers, or a cloud server. The above is only an example, and there is no limitation on this in this embodiment.
  • original audio data can be obtained through microphone or microphone array recording.
  • the original audio data can be data files in various audio formats obtained by the recording terminal, including but not limited to ACT, REC, MP3, WAV, WMA, VY1, VY2, DVF, MSC, AIFF and other formats; the original audio data can also be Pulse Code Modulation (PCM) audio stream data.
  • the electronic device can display a recording button on the operation interface.
  • the user clicks the recording button to start recording.
  • the voice recording system starts to work and continuously collects audio data in a sub-thread through AudioRecord and AudioChunk, passing the collected audio data to the speech-to-text system so that it can convert the audio data into text.
  • AudioRecord is an Android media recording tool.
  • AudioChunk is a custom data container that holds a byte array and provides a function for converting the byte array into a short array.
  • the byte array is used to receive the audio data returned by AudioRecord.
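The byte-to-short conversion that AudioChunk provides can be sketched as follows. This is a minimal Python illustration assuming 16-bit little-endian PCM (the format AudioRecord commonly returns); the class and method names are illustrative, not the patent's actual implementation:

```python
import struct

class AudioChunk:
    """Minimal stand-in for the custom data container described above:
    holds a byte array of raw PCM audio and converts it to a short array."""

    def __init__(self, data: bytes):
        self.data = data  # raw bytes as returned by the recorder

    def to_shorts(self) -> list:
        # 16-bit little-endian PCM: every 2 bytes form one signed short sample
        count = len(self.data) // 2
        return list(struct.unpack("<%dh" % count, self.data[: count * 2]))
```

Each completed chunk could then be handed to the speech-to-text system as an array of samples.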
  • Step 102: Receive a first marking instruction for the first time progress of the original audio data;
  • Step 103: Receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress;
  • the timer system is used to record the time progress during audio data recording and playback.
  • the recording time progress of the original audio data corresponds to the playback time progress.
  • the timer system can record various time points, including the total duration of the recording, the starting time point for marking processing (i.e., the first time progress) and the end time point (i.e., the second time progress), where the first time progress and the second time progress appear in pairs.
  • the number of first time progresses can be one or more; the original audio data between each pair of first and second time progresses is the original audio data that needs to be focused on.
  • the first time progress can be the starting time point of the entire original audio data, or a certain time point in the middle of the original audio data; the second time progress can be the end time point of the entire original audio data, or a certain time point in the middle of the original audio data.
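The pairing of first and second time progresses described above can be sketched as a small bookkeeping structure. A hedged Python illustration (the class and field names are assumptions):

```python
class MarkTimeline:
    """Records paired first/second time progresses (in milliseconds)
    of one original audio recording; marks always appear in pairs."""

    def __init__(self):
        self.pairs = []        # completed (first, second) time-progress pairs
        self._pending = None   # a first time progress awaiting its second

    def mark(self, time_ms: int):
        if self._pending is None:
            self._pending = time_ms            # records a first time progress
        else:
            assert time_ms > self._pending     # second must be later
            self.pairs.append((self._pending, time_ms))
            self._pending = None               # pair completed
```

Each completed pair delimits one segment of original audio data that needs attention.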
  • Step 104: Store the first time progress, the second time progress and the original audio data.
  • a segment of speech can be determined through the first time progress and the second time progress, and the speech can be segmented according to the first time progress and the second time progress, where the first time progress is the starting time point of the segmented speech, The second time progress is the end time point of the segmented speech.
  • the original audio data may be divided into multiple speech segments.
  • the timer system obtains the current time of the electronic device in milliseconds as the starting time point of the voice, and then updates the end time point and duration of the voice every millisecond through the Android scheduled task tool Timer.
  • the method further includes:
  • the original audio data between the first time progress and the second time progress is used as target audio data, and the target audio data is marked.
  • the original audio data between the first time progress and the second time progress is the original audio data that needs to be focused on.
  • the paired first time progress and second time progress can be identified in the playback progress bar corresponding to the original audio data, or a dedicated display interface can be used to display the paired first and second time progress information.
  • for example, the two time points of 38 seconds and 58 seconds can be marked in the playback progress bar corresponding to the original audio data, and the user can determine from these two marked time points the original audio data that needs to be focused on; alternatively, the two time points of 38 seconds and 58 seconds are displayed on the display interface corresponding to the original audio data, and the user can determine from the recorded time points the original audio data that needs to be focused on.
  • or, the target audio data between the first time progress and the second time progress can be directly intercepted and stored in a specific area.
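Intercepting the target audio data between the two time progresses amounts to slicing the sample buffer. A minimal sketch, assuming uncompressed samples at a known sample rate (function name and parameters are illustrative):

```python
def intercept_target_audio(samples, sample_rate, first_ms, second_ms):
    """Return the target audio data: the slice of the original samples
    lying between the first and second time progresses (milliseconds)."""
    start = first_ms * sample_rate // 1000   # time progress -> sample index
    end = second_ms * sample_rate // 1000
    return samples[start:end]
```

The returned slice could then be stored in a specific area or passed on for speech recognition.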
  • the original audio data between the first time progress and the second time progress is the original audio data that needs to be focused on
  • speech recognition can be performed only on the original audio data between the first time progress and the second time progress to obtain segmented text data, which reduces the workload of speech recognition and ensures that users get the key content that needs attention.
  • speech recognition can also be performed on all original audio data.
  • the method further includes:
  • the speech-to-text system is used to convert original audio data into text.
  • the original audio data can be converted into the first text data through a speech recognition engine using Automatic Speech Recognition (ASR) technology.
  • ASR is a technology that converts human speech into text; its goal is to enable a computer to take "dictation" of the continuous speech spoken by different people, so it is also called a "voice dictation machine", i.e., a technology for converting "sound" into "text".
  • the speech recognition engine can be Google speech recognition engine, Microsoft speech recognition engine or iFlytek's speech recognition engine, which is not limited here.
  • the speech fragments in the original audio data can be converted into text information through the speech recognition engine.
  • the original audio format of the original audio data can be converted into a target audio format using the FFMPEG tool; the original audio data in the target audio format is segmented to obtain target voice data; and the target voice data is input to the speech recognition engine to obtain the first text data.
  • for example, the original audio data is converted from PCM format to MP3 format using the FFMPEG tool, and the original audio data in MP3 format is segmented to obtain target voice data containing voice segments; that is, only the audio clips of human voice are kept. Converting the original audio data to MP3 format makes it easier for users to segment and save it.
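The PCM-to-MP3 conversion step could be driven from the FFMPEG command line. The sketch below only constructs the argument list (raw PCM input requires the sample format, rate and channel count to be stated explicitly); the parameter values and function name are illustrative assumptions, not the patent's implementation:

```python
def build_ffmpeg_command(pcm_path, mp3_path, sample_rate=16000, channels=1):
    """Build an ffmpeg command line that converts raw 16-bit PCM to MP3."""
    return [
        "ffmpeg",
        "-f", "s16le",             # raw 16-bit little-endian PCM input
        "-ar", str(sample_rate),   # input sample rate
        "-ac", str(channels),      # input channel count
        "-i", pcm_path,            # input file
        mp3_path,                  # output file; format inferred from .mp3
    ]
```

`subprocess.run(build_ffmpeg_command(...))` would then perform the conversion, assuming ffmpeg is installed.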
  • the speech-to-text system can also be a streaming speech recognition system based on the deep learning Transformer model.
  • the streaming speech recognition system supports recording and converting at the same time, that is, converting audio data into text data while recording; it also supports direct recognition of existing audio data.
  • the timer system is used to record the time progress during audio data recording and playback.
  • the recording time progress of the original audio data corresponds to the playback time progress.
  • the text style system is used to mark the content of the first text data during recording and playback, including: marking the first text data with a first color, the first color being different from black; and/or bolding the text in the first text data, so that the user can easily identify the content that needs attention from the first text data.
  • the timer system records various time nodes during the entire process of converting the original audio data into the first text data, including the total duration of the recording, the starting time point for marking processing (i.e., the first time progress) and the end time point (i.e., the second time progress). The first time progress and the second time progress appear in pairs, and the number of first time progresses can be one or more; the original audio data between each pair of first and second time progresses is the original audio data that needs to be focused on, that is, the original audio data corresponding to the text that needs to be corrected.
  • the first time progress corresponds one-to-one with the first starting position: the position in the first text data corresponding to the first time progress is the first starting position. The second time progress corresponds one-to-one with the first ending position: the position in the first text data corresponding to the second time progress is the first ending position.
  • the timer system obtains the current time of the electronic device in milliseconds as the starting time point of the voice, and then updates the end time point and duration of the voice every millisecond through the Android scheduled task tool Timer.
  • the first text data corresponding to the original audio data can be displayed to the user in real time through the operation interface.
  • when the user clicks or selects content of the first text data for the (2k-1)-th time, the position of that content in the first text data is recorded as the first starting position; when the user clicks or selects content for the 2k-th time, the position of that content in the first text data is recorded as the first ending position, and the first text data between the first starting position and the first ending position is marked, where k is a positive integer.
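The odd/even click behaviour, with character positions and time points kept in sync, can be sketched as follows (an illustrative Python sketch; the class name and fields are assumptions):

```python
class TextStyleSystem:
    """Sketch of the text style system: marks spans of the first text data
    and keeps each span's character positions and synchronized time points."""

    def __init__(self, text):
        self.text = text
        self.marks = []     # (start_pos, end_pos, first_ms, second_ms)
        self._open = None   # pending (start_pos, first_ms)

    def click(self, position, time_ms):
        if self._open is None:                 # (2k-1)-th click: start mark
            self._open = (position, time_ms)
        else:                                  # 2k-th click: end mark
            start_pos, first_ms = self._open
            self.marks.append((start_pos, position, first_ms, time_ms))
            self._open = None

    def marked_content(self, k):
        start, end, _, _ = self.marks[k]
        return self.text[start:end]            # string between the positions
```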
  • the text style system is used to mark the first text data, such as coloring and bolding, recording the position of the marked content in the voice content, synchronizing the time points of the marked content, etc.
  • the voice content is the entire voice string text returned by the speech-to-text system;
  • the marked content is the marked string text returned by the speech-to-text system during the time period from the start mark to the end mark.
  • the start position and end position of the marked content are the positions of the marked content in the voice content, usually determined by the string index.
  • the marked-processed first text data, the first starting position, the first ending position, the first time progress, the second time progress and the original audio data are stored.
  • the voice data operating system is used to store the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress and the original audio data.
  • the voice data operating system contains a database.
  • the voice data operating system saves the audio file content of each original audio data, voice content, voice duration, all mark data, the position of each text in the voice content, etc.
  • the audio file index is the storage path of the audio file;
  • the mark data is the mark content, mark position and mark time point of each mark, and the position of the text in the speech is the time progress in the speech corresponding to the text.
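One possible shape for such a stored record is sketched below; the field names are assumptions for illustration, not the patent's actual schema:

```python
def make_voice_record(audio_index, voice_content, duration_ms, marks):
    """Assemble one database record for the voice data operating system."""
    return {
        "audio_file_index": audio_index,   # storage path of the audio file
        "voice_content": voice_content,    # full recognized text
        "voice_duration_ms": duration_ms,  # total recording duration
        "marks": [                         # one entry per marked span
            {"content": c, "start": s, "end": e,
             "first_ms": t1, "second_ms": t2}
            for (c, s, e, t1, t2) in marks
        ],
    }
```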
  • when converting the original audio data into the first text data, erroneous text or key content can be marked, and the marked first text data can be stored together with the corresponding time progresses of the original audio data. When the first text data is proofread later, selecting the marked text quickly synchronizes to the corresponding original audio data according to the stored time progress, which makes it convenient for the user to correct or otherwise process the marked text. This saves users from listening to the original audio data from beginning to end, improves the efficiency of correcting the recorded text, and improves the user experience.
  • after the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress and the original audio data are stored, when the first text data needs to be corrected, the method further includes:
  • the voice playback system when the first text data needs to be corrected, the voice playback system is used to play the recorded original audio data, and at the same time, the first text data corresponding to the original audio data is displayed to the user in real time on the operation interface.
  • the first text data includes a first part and a second part, where the first part is located between each pair of first starting position and first ending position and has been marked as content that needs to be focused on; the second part is located outside the first starting position and the first ending position and is content that has not been marked.
  • the first part is where speech-to-text errors may occur, while the second part is where errors are unlikely; therefore, to improve efficiency when correcting the first text data, only the first part needs to be corrected.
  • the user can click or select any second position between the first starting position and the first ending position; this is deemed receipt of the processing instruction for the second position, indicating that the first text data between the first starting position and the first ending position needs to be corrected.
  • the corresponding position of the original audio data can be quickly located according to the pre-stored first time progress corresponding to the first starting position and the second time progress corresponding to the first ending position, and the original audio data between the first time progress and the second time progress can be replayed.
  • specifically, before the processing instruction for the next second position is received, the original audio data between the first time progress and the second time progress can be played in a loop: playback starts from the first time progress, stops at the second time progress, and then returns to the first time progress to restart.
  • alternatively, playback can stop after a preset number of times, for example after the original audio data has been played once or twice.
  • a speech recognition engine is used to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result; for example, speech recognition is performed on the original audio data between 38 seconds and 1 minute 12 seconds, yielding the speech recognition result "So, why use notepad when you have a writing pad?";
  • a first part and a second part of the first text data are intercepted, where the first part is located between the first starting position and the first ending position, and the second part is located outside the first starting position and the first ending position.
  • the first part is "So, why do you need a notepad when you have a writing pad?";
  • the original audio data can be input to the speech recognition engine for multiple speech recognitions to obtain the speech recognition results to improve the accuracy of speech recognition.
  • the first part can be replaced with the speech recognition result, or the first part can be modified so that it is consistent with the speech recognition result, thereby correcting the first text data and obtaining the corrected second text data. For example, the speech recognition result "So, why use notepad when you have a writing pad?" can replace the first part "So, why do you need a notepad when you have a writing pad?".
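The replacement step can be sketched as a simple string splice, assuming the first starting and ending positions are character indices into the first text data (function name is illustrative):

```python
def correct_first_text(first_text, start, end, recognition_result):
    """Replace the first part (between the first starting and ending
    positions) with the speech recognition result if they differ;
    the second part, outside those positions, is kept unchanged."""
    first_part = first_text[start:end]
    if first_part == recognition_result:
        return first_text                      # already consistent
    return first_text[:start] + recognition_result + first_text[end:]
```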
  • the method further includes:
  • the second text data, the first time progress, the second time progress and the original audio data are stored.
  • the voice data operating system saves the corrected second text data.
  • the efficiency of correcting the voice file can be greatly improved.
  • the above solution can be used to correct the first text data.
  • the original audio data between the first time progress and the second time progress can also be played to the user, and after listening to it the user manually corrects the first text data to obtain the second text data.
  • voice search can also be performed.
  • the method further includes:
  • the search keyword is found in the first text data, and the location of the search keyword is marked.
  • a voice search button can be displayed on the operation interface of the electronic device. If the user clicks the voice search button and then inputs voice, the voice search instruction input by the user is deemed received; the voice recording system starts recording and passes the recorded voice data to the voice-to-text system for processing, which converts the voice search instruction into a search keyword.
  • when the user clicks the voice search button, the voice input by the user is regarded as a voice search instruction.
  • after receiving the search keyword, the text style system searches for the search keyword in the first text data and marks the location of the search keyword, for example by highlighting it.
  • the corresponding original audio data can also be played, starting from the starting position of the highlighted position and stopping playing at the end position of the highlighted position.
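Finding every occurrence of the search keyword can be sketched as follows, a minimal Python sketch returning the character spans that a UI could then highlight:

```python
def find_keyword_positions(text, keyword):
    """Locate every occurrence of the search keyword in the first text
    data, returning (start, end) spans for highlighting."""
    spans, i = [], text.find(keyword)
    while i != -1:
        spans.append((i, i + len(keyword)))
        i = text.find(keyword, i + 1)   # continue past this occurrence
    return spans
```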
  • the method further includes:
  • a typo recognition button can be displayed on the operation interface of the electronic device. If the user clicks the typo recognition button, the typo recognition instruction input by the user is deemed received; typos in the first text data are then identified according to the vocabulary and contextual semantic recognition algorithms and marked, for example highlighted, to remind the user, who can modify the typos to improve the accuracy of the text data.
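A greatly simplified stand-in for the vocabulary-based part of the typo recognition (ignoring the contextual semantic algorithm) might look like this; it is only an illustration of the idea, not the patent's algorithm:

```python
def flag_typos(words, vocabulary):
    """Flag candidate typos: any word not found in the vocabulary,
    returned as (index, word) pairs for marking in the UI."""
    return [(i, w) for i, w in enumerate(words) if w not in vocabulary]
```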
  • An embodiment of the present disclosure also provides a voice data processing device, as shown in Figure 3, including:
  • Acquisition module 21: used to obtain original audio data;
  • the technical solution of this embodiment is applied to electronic equipment.
  • the electronic equipment can perform human-computer interaction with the user.
  • the electronic equipment includes a voice recording system, a voice-to-text system, a timer system, a text style system, a voice data operating system, a voice playback system, etc.
  • Electronic devices can interact with backend servers through the network.
  • the above-mentioned electronic device may be a terminal device configured with a target client and/or a target server.
  • the above-mentioned terminal device may be a microphone or a microphone array, or may be a terminal device configured with a microphone.
  • the above-mentioned electronic devices may include but are not limited to at least one of the following: mobile phones (such as Android phones, iOS phones, etc.), laptops, tablets, handheld computers, MIDs (Mobile Internet Devices), PADs, desktop computers, smart TVs, etc.
  • the target client can be a video client, an instant messaging client, a browser client, an education client, etc.
  • the target server can be a video server, an instant messaging server, a browser server, an education server, etc.
  • the above-mentioned networks may include but are not limited to: wired networks and wireless networks.
  • the wired networks include local area networks, metropolitan area networks and wide area networks.
  • the wireless networks include Bluetooth, WIFI and other networks that implement wireless communication.
  • the above-mentioned server can be a single server, a server cluster composed of multiple servers, or a cloud server. The above is just an example; this embodiment imposes no restriction on it.
  • original audio data can be obtained through microphone or microphone array recording.
  • the original audio data can be data files in various audio formats obtained by the recording terminal, including but not limited to ACT, REC, MP3, WAV, WMA, VY1, VY2, DVF, MSC, AIFF and other formats; the original audio data can also be Pulse Code Modulation (PCM) audio stream data.
  • the electronic device can display a recording button on the operation interface.
  • the user clicks the recording button to start recording.
  • the voice recording system starts to work, continuously collects audio data in a sub-thread through AudioRecord and AudioChunk, and passes the collected audio data to the speech-to-text system so that the speech-to-text system can convert the audio data into text.
  • AudioRecord is an Android media recording tool.
  • AudioChunk is a custom data container, which holds a byte array and provides a function for converting the byte array into a short array.
  • the byte array is used to receive the audio data returned by AudioRecord.
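The byte-to-short conversion described for AudioChunk can be illustrated with a minimal sketch. This is not the document's actual Android class; it is a hypothetical Python stand-in assuming 16-bit little-endian PCM samples, which is what Android's AudioRecord produces in its default 16-bit encoding.

```python
import struct

class AudioChunk:
    """Illustrative stand-in for the custom AudioChunk container described
    above: it holds a byte array of raw 16-bit PCM audio and converts it
    into a list of shorts (one short per sample)."""

    def __init__(self, byte_data: bytes):
        self.byte_data = byte_data

    def to_shorts(self) -> list:
        # 16-bit PCM stores each sample as 2 little-endian bytes.
        count = len(self.byte_data) // 2
        return list(struct.unpack("<%dh" % count, self.byte_data[: count * 2]))

chunk = AudioChunk(b"\x01\x00\xff\xff\x00\x80")
print(chunk.to_shorts())  # [1, -1, -32768]
```

In the Android implementation this conversion would typically be done with `ByteBuffer.asShortBuffer()` or bit arithmetic; the sketch only shows the sample layout.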
  • the first receiving module 22 is configured to receive a first marking instruction for the first time progress of the original audio data
  • the second receiving module 23 is configured to receive a second marking instruction for a second time schedule of the original audio data, where the second time schedule is later than the first time schedule;
  • the timer system is used to record the time progress during audio data recording and playback.
  • the recording time progress of the original audio data corresponds to the playback time progress.
  • the timer system can record various time points, including the total duration of the recording, the starting time point for marking processing (i.e., the first time progress) and the end time point (i.e., the second time progress), where the first time progress and the second time progress appear in pairs.
  • the number of first time progresses can be one or more.
  • the original audio data between each pair of the first time progress and the second time progress is the original audio data that needs to be focused on.
  • the first time progress can be the starting time point of the entire original audio data, or a certain time point in the middle of the original audio data;
  • the second time progress can be the end time point of the entire original audio data, or a certain time point in the middle of the original audio data.
  • Storage module 24, used to store the first time progress, the second time progress and the original audio data.
  • a segment of speech can be determined through the first time progress and the second time progress, and the speech can be segmented according to the first time progress and the second time progress, where the first time progress is the starting time point of the segmented speech, The second time progress is the end time point of the segmented speech.
  • the original audio data may be divided into multiple speech segments.
  • the timer system obtains the current time millisecond value of the electronic device as the starting time point of the voice, and then updates the end time point and voice duration of the voice every one millisecond through the Android scheduled task tool Timer.
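The pairing of first and second time progresses described above can be sketched as simple bookkeeping. The class and method names below are hypothetical (the document only specifies that marks come in pairs and that the second time progress is later than the first); times are in milliseconds, matching the millisecond values the timer system records.

```python
class MarkTimer:
    """Hypothetical sketch of the timer system's mark bookkeeping: first and
    second time progresses (in milliseconds) are stored in pairs, the second
    always later than the first."""

    def __init__(self):
        self.pairs = []       # completed (first, second) pairs
        self._pending = None  # an open first time progress, if any

    def mark_start(self, now_ms: int):
        self._pending = now_ms

    def mark_end(self, now_ms: int):
        if self._pending is None or now_ms <= self._pending:
            raise ValueError("second time progress must be later than the first")
        self.pairs.append((self._pending, now_ms))
        self._pending = None

timer = MarkTimer()
timer.mark_start(38000)  # first time progress: 38 s
timer.mark_end(58000)    # second time progress: 58 s
print(timer.pairs)       # [(38000, 58000)]
```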
  • the device further includes:
  • the marking processing module 26 is configured to use the original audio data between the first time progress and the second time progress as target audio data, and perform marking processing on the target audio data.
  • the original audio data between the first time progress and the second time progress is the original audio data that needs to be focused on.
  • the original audio data between the first time progress and the second time progress can be used as target audio data, and the target audio data is marked.
  • the paired first time progress and second time progress can be identified in the playback progress bar corresponding to the original audio data, or a dedicated display interface can be used to display the paired first and second time progress information.
  • the two time points of 38 seconds and 58 seconds can be marked in the playback progress bar corresponding to the original audio data.
  • the user can determine the original audio data that needs to be focused on from the two marked time points of 38 seconds and 58 seconds; alternatively, the two time points of 38 seconds and 58 seconds are displayed on the display interface corresponding to the original audio data, and the user can determine the original audio data that needs to be focused on from the recorded time points of 38 seconds and 58 seconds.
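Extracting the segment of audio between a pair of time progresses reduces to byte arithmetic on the PCM stream. The sketch below assumes mono 16 kHz, 16-bit audio; neither parameter is specified in the document, so both are passed explicitly.

```python
def slice_pcm(pcm: bytes, start_ms: int, end_ms: int,
              sample_rate: int = 16000, bytes_per_sample: int = 2) -> bytes:
    """Extract the raw PCM between a first and second time progress.
    The sample rate and sample width are assumptions, not values from
    the document; raw PCM carries no header, so they must be known."""
    bytes_per_ms = sample_rate * bytes_per_sample // 1000
    return pcm[start_ms * bytes_per_ms : end_ms * bytes_per_ms]

# One second of mono 16 kHz / 16-bit silence is 32000 bytes.
second = b"\x00" * 32000
audio = second * 3                      # 3 s of audio
segment = slice_pcm(audio, 1000, 2000)  # the middle second
print(len(segment))                     # 32000
```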
  • the device further includes:
  • the speech recognition module 25 is used to perform speech recognition on the original audio data between the first time schedule and the second time schedule to obtain segmented text data.
  • speech recognition can also be performed on all original audio data.
  • the speech recognition module 25 is used to perform speech recognition on the original audio data to obtain the first text data;
  • the marking processing module 26 is used to start marking the first text data from a first starting position, the first starting position corresponding to the first time progress, and to stop marking the first text data from a first end position, the first end position corresponding to the second time progress;
  • the storage module 24 is used to store the first text data that has been marked.
  • the speech-to-text system is used to convert original audio data into text.
  • the original audio data can be converted into the first text data through the speech recognition engine in Automatic Speech Recognition (ASR) technology.
  • ASR is a technology that converts human speech into text. Its goal is to enable a computer to take dictation of continuous speech spoken by different people; it is also called a "voice dictation machine" and is a technology for converting "sound" into "text".
  • the speech recognition engine can be Google speech recognition engine, Microsoft speech recognition engine or iFlytek's speech recognition engine, which is not limited here.
  • the speech fragments in the original audio data can be converted into text information through the speech recognition engine.
  • the original audio format of the original audio data can be converted into a target audio format based on the FFMPEG tool; the original audio data in the target audio format can be segmented to obtain target voice data; and the target voice data can be input to the speech recognition engine to obtain the first text data.
  • the original audio data is converted from PCM format to MP3 format based on the FFMPEG tool, and the original audio data in MP3 format is segmented to obtain target voice data containing voice segments. That is to say, only the audio clips in the MP3-format original audio data that contain human voices are retained. Converting the original audio data to MP3 format makes it easier for users to segment and save the original audio data.
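The PCM-to-MP3 conversion step can be sketched as an FFMPEG invocation. The function below only builds the command line rather than running it; the file names are hypothetical, and since raw PCM has no header, the sample rate and channel count (assumed here to be 16 kHz mono) must be supplied explicitly with `-f s16le`, `-ar` and `-ac`.

```python
def ffmpeg_pcm_to_mp3_cmd(pcm_path: str, mp3_path: str,
                          sample_rate: int = 16000, channels: int = 1) -> list:
    """Build (but do not execute) an FFMPEG command line converting raw
    signed 16-bit little-endian PCM into MP3, as in the format-conversion
    step above. Rate and channel count are assumptions."""
    return [
        "ffmpeg",
        "-f", "s16le",            # input format: raw signed 16-bit LE PCM
        "-ar", str(sample_rate),  # sample rate of the headerless input
        "-ac", str(channels),     # channel count of the headerless input
        "-i", pcm_path,
        mp3_path,
    ]

print(ffmpeg_pcm_to_mp3_cmd("meeting.pcm", "meeting.mp3"))
```

On a device the command would be run with `subprocess.run(cmd, check=True)` or an equivalent native binding.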
  • the speech-to-text system can also be a streaming speech recognition system based on the deep learning Transformer model.
  • the streaming speech recognition system supports recording and transcribing at the same time, that is, converting audio data to text data while recording, and also supports direct recognition of existing audio data.
  • the timer system is used to record the time progress during audio data recording and playback.
  • the recording time progress of the original audio data corresponds to the playback time progress.
  • the text style system is used to mark the content of the first text data during recording and playback, including: marking the first text data with a first color, the first color being different from black; and/or bolding the text in the first text data, so that the user can easily identify the content that needs attention in the first text data.
  • the timer system records various time nodes during the entire process of converting the original audio data into the first text data, including the total duration of the recording, the starting time point for marking processing (i.e., the first time progress) and the end time point (i.e., the second time progress), where the first time progress and the second time progress appear in pairs, and the number of first time progresses can be one or more.
  • the original audio data between each pair of the first time progress and the second time progress is the original audio data that needs to be focused on, that is, the original audio data corresponding to the text that needs to be corrected.
  • the first time progress corresponds one-to-one to the first starting position: after the original audio data corresponding to the first time progress is converted into text, its position in the first text data is the first starting position.
  • the second time progress corresponds one-to-one to the first end position: after the original audio data corresponding to the second time progress is converted into text, its position in the first text data is the first end position.
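The one-to-one mapping from a time progress to a text position can be illustrated if we assume (the document does not specify the format) that the speech-to-text system reports, for each character it emits, the time progress at which that character was recognized.

```python
def position_for_time(alignment: list, t_ms: int) -> int:
    """Map a time progress (ms) to a position in the first text data.
    `alignment` is a hypothetical list of (character, time_ms) pairs in
    recognition order; the position returned is the index of the first
    character recognized at or after t_ms."""
    for index, (_, char_ms) in enumerate(alignment):
        if char_ms >= t_ms:
            return index
    return len(alignment)

alignment = [("h", 0), ("e", 200), ("l", 400), ("l", 600), ("o", 800)]
print(position_for_time(alignment, 400))  # 2 -> the first starting position
print(position_for_time(alignment, 800))  # 4 -> the first end position
```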
  • the timer system obtains the current time millisecond value of the electronic device as the starting time point of the voice, and then updates the end time point and voice duration of the voice every one millisecond through the Android scheduled task tool Timer.
  • the electronic device can obtain the original audio data by recording through a microphone or microphone array.
  • the first text data corresponding to the original audio data is displayed to the user in real time through the operation interface, and when the first marking instruction is received, the position of the current content in the first text data is recorded as the first starting position.
  • when the second marking instruction is received, the position of the current content in the first text data is recorded as the first end position.
  • the first text data between the first starting position and the first end position is marked.
  • the text style system is used to mark the first text data, such as coloring and bolding, recording the position of the marked content in the voice content, synchronizing the time points of the marked content, etc.
  • the voice content is the entire voice string text returned by the speech-to-text system;
  • the marked content is the marked string text returned by the speech-to-text system during the time period from the start mark to the end mark.
  • the start position and end position of the marked content are the positions of the marked content in the voice content; usually the position within the character string is determined by a subscript index.
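Locating the marked string inside the full voice content, as described above, amounts to a substring search returning subscript indices. A minimal sketch (function name and return convention are illustrative, not from the document):

```python
def mark_bounds(voice_content: str, marked_content: str):
    """Return the start and end character positions of the marked content
    within the full voice content (the 'start position' and 'end position'
    described above), or None if it does not occur."""
    start = voice_content.find(marked_content)
    if start == -1:
        return None
    return start, start + len(marked_content)

voice = "the meeting starts at nine in the main hall"
print(mark_bounds(voice, "starts at nine"))  # (12, 26)
```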
  • the storage module 24 is specifically configured to store the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress and the original audio data.
  • the voice data operating system is used to store the marked first text data, the first starting position, the first end position, the first time progress, the second time progress and the original audio data.
  • the voice data operating system contains a database.
  • the voice data operating system saves, for each piece of original audio data, the audio file index, the voice content, the voice duration, all mark data, the position of each text in the voice content, etc.
  • the audio file index is the storage path of the audio file;
  • the mark data consists of, for each mark, the mark content, the mark position and the mark time point; the position of a text in the speech is the time progress in the speech corresponding to that text.
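The database record kept by the voice data operating system might look like the following. The field names and the tuple layout of a mark are hypothetical; the document only lists which pieces of information are saved.

```python
from dataclasses import dataclass, field

@dataclass
class VoiceRecord:
    """Hypothetical shape of one record kept by the voice data operating
    system, following the fields listed above."""
    audio_file_index: str   # storage path of the audio file
    voice_content: str      # full transcribed text
    voice_duration_ms: int  # total voice duration
    # each mark: (mark content, start position, end position, time point ms)
    marks: list = field(default_factory=list)
    # time progress in the speech corresponding to each text position
    char_times: list = field(default_factory=list)

record = VoiceRecord("/audio/meeting01.mp3", "hello world", 1000,
                     marks=[("world", 6, 11, 600)])
print(record.marks[0][0])  # world
```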
  • when converting the original audio data into the first text data, erroneous text or key content may be marked, the marked first text data may be stored, and the corresponding time progress in the original audio data may be stored, so that when the first text data is proofread later, selecting the marked text quickly synchronizes to the corresponding original audio data according to the corresponding time progress.
  • this makes it convenient for the user to correct or otherwise process the marked text, avoids the need for users to listen to the original audio data from beginning to end, improves the efficiency of correcting the recorded text, and improves the user experience.
  • the device further includes:
  • the third receiving module 27 is configured to receive processing instructions for the second position of the first text data, the second position being located between the first starting position and the first ending position;
  • the voice playback system when the first text data needs to be corrected, the voice playback system is used to play the recorded original audio data, and at the same time, the first text data corresponding to the original audio data is displayed to the user in real time on the operation interface.
  • the first text data includes a first part and a second part, where the first part is located between each paired first starting position and first end position and contains content that has been marked as needing attention; the second part is located outside the first starting position and the first end position and contains content that has not been marked.
  • the first part is the part where speech-to-text errors may occur.
  • the second part is the part where errors are unlikely to occur. Therefore, when correcting the first text data, in order to improve efficiency, only the first part needs to be corrected.
  • the user can click or select any second position between the first starting position and the first end position, and the processing instruction for the second position is deemed received.
  • when the user clicks or selects any position between the first starting position and the first end position, it is deemed that the first text data between the first starting position and the first end position needs to be corrected.
  • the second processing module 28 is used to perform speech recognition on the original audio data between the first time schedule and the second time schedule, and correct the first text data according to the speech recognition result to obtain second text data.
  • the original audio data between the first time progress and the second time progress can be played in a loop, that is, playback starts from the first time progress, stops at the second time progress, and then returns to the first time progress to restart; alternatively, playback can stop after a preset number of times, for example after playing once or twice.
  • a speech recognition engine can be used to perform speech recognition on the original audio data between the first time schedule and the second time schedule to obtain a speech recognition result; for example, speech recognition is performed on the original audio data between 38 seconds and 1 minute 12 seconds, and the speech recognition result "So, why use notepad when you have a writing pad?" is obtained;
  • a first part and a second part of the first text data are intercepted, where the first part is located between the first starting position and the first end position, and the second part is located outside the first starting position and the first end position.
  • the first part is "So, why do you need a notepad when you have a writing pad?";
  • the original audio data can be input to the speech recognition engine for multiple speech recognitions to obtain the speech recognition results to improve the accuracy of speech recognition.
  • the first part can be replaced with the speech recognition result, or the first part can be modified so that it is consistent with the speech recognition result, thereby correcting the first text data and obtaining the corrected second text data. For example, the speech recognition result "So, why do you need a notepad when you have a writing pad?" can be used to replace the first part "So, why do you need a notepad when you have a writing pad?"
  • the storage module 24 is also used to store the second text data, the first time progress, the second time progress and the original audio data.
  • the above solution can be used to correct the first text data.
  • the original audio data between the first time progress and the second time progress can also be played to the user, and after listening to it the user manually corrects the first text data to obtain the second text data.
  • the second processing module 28 includes:
  • Interception sub-module 281, used to intercept the first part and the second part of the first text data, where the first part is located between the first starting position and the first end position, and the second part is located outside the first starting position and the first end position;
  • Comparison sub-module 282, used to compare the first part with the speech recognition result; if the first part is inconsistent with the speech recognition result, the speech recognition result is used to replace the first part to obtain second text data including the speech recognition result and the second part.
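The comparison sub-module's splice-in correction can be sketched as string slicing: if the first part (the span between the first starting and end positions) differs from the speech recognition result, the result replaces it while the surrounding second part is kept. Function and variable names are illustrative.

```python
def correct_first_text(first_text: str, start: int, end: int,
                       recognition_result: str) -> str:
    """Sketch of the comparison sub-module: compare the first part
    (first_text[start:end]) with the speech recognition result and, if they
    differ, splice the result in between the untouched second part."""
    first_part = first_text[start:end]
    if first_part == recognition_result:
        return first_text  # already consistent; no correction needed
    return first_text[:start] + recognition_result + first_text[end:]

text = "intro MISRECOGNIZED outro"
print(correct_first_text(text, 6, 19, "corrected words"))
# intro corrected words outro
```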
  • the speech recognition module 25 includes:
  • Conversion sub-module 251 used to convert the original audio format of the original audio data into a target audio format based on the FFMPEG tool;
  • the segmentation module 252 is used to segment the original audio data in the target audio format to obtain target voice data
  • the processing sub-module 253 is used to input the target speech data to the speech recognition engine.
  • the device further includes:
  • a voice search module configured to receive voice search instructions input by the user; perform voice recognition on the voice search instructions, convert the voice search instructions into search keywords; and search for the search keywords in the first text data , and mark the location of the search keyword.
  • a voice search button can be displayed on the operating interface of the electronic device. If the user clicks the voice search button and then inputs voice, the voice search instruction input by the user is deemed received, and the voice recording system starts recording and transmits the recorded voice data to the voice-to-text system.
  • the voice-to-text system processes the recorded voice data and converts the voice search instruction into a search keyword.
  • when the user clicks the voice search button, the voice input by the user is regarded as a voice search instruction.
  • after receiving the search keyword, the text style system searches for the search keyword in the first text data and marks its location, for example by highlighting it.
  • the corresponding original audio data can also be played, starting from the starting position of the highlighted position and stopping playing at the end position of the highlighted position.
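Finding every location of the search keyword in the first text data, so each occurrence can be highlighted, is a repeated substring scan. A minimal sketch (names are illustrative; the starting offset advances by one so overlapping matches are also found):

```python
def find_keyword_positions(first_text: str, keyword: str) -> list:
    """Sketch of the text style system's search step: return the
    (start, end) positions of every occurrence of the search keyword
    in the first text data."""
    positions, start = [], 0
    while True:
        index = first_text.find(keyword, start)
        if index == -1:
            return positions
        positions.append((index, index + len(keyword)))
        start = index + 1

text = "pad or notepad, a pad will do"
print(find_keyword_positions(text, "pad"))  # [(0, 3), (11, 14), (18, 21)]
```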
  • the device further includes:
  • a typo recognition module is configured to receive typo recognition instructions input by the user; identify typos in the first text data according to the vocabulary and contextual semantic recognition algorithms, and mark the typos.
  • a typo recognition button can be displayed on the operation interface of the electronic device. If the user clicks the typo recognition button, the typo recognition instruction input by the user is deemed received; typos in the first text data can then be identified according to the lexicon and contextual semantic recognition algorithm and marked, for example highlighted, to remind the user, and the user can correct the typos to improve the accuracy of the text data.
  • the first receiving module 23 is specifically configured to mark the first text data with a first color, the first color being different from black; and/or to bold the text in the first text data.
  • An embodiment of the present disclosure also provides a voice data processing device, as shown in Figure 6, including a processor 32 and a memory 31.
  • the memory 31 stores programs or instructions that can be run on the processor 32. When the program or instruction is executed by the processor 32, the steps of the voice data processing method as described above are implemented.
  • the processor 32 is configured to obtain original audio data; receive a first marking instruction for a first time progression of the original audio data; and receive a second marking instruction for a second time progression of the original audio data. Mark instructions, the second time schedule is later than the first time schedule; store the first time schedule, the second time schedule and the original audio data.
  • the processor 32 is configured to use the original audio data between the first time schedule and the second time schedule as target audio data, and perform marking processing on the target audio data.
  • the processor 32 is configured to perform speech recognition on the original audio data between the first time schedule and the second time schedule to obtain segmented text data.
  • the processor 32 is used to perform speech recognition on the original audio data to obtain the first text data; start marking the first text data from a first starting position, the first starting position corresponding to the first time progress; stop marking the first text data from a first end position, the first end position corresponding to the second time progress; and store the marked first text data.
  • the processor 32 is configured to receive processing instructions for a second position of the first text data, the second position being located between the first starting position and the first ending position; Speech recognition is performed on the original audio data between the first time schedule and the second time schedule, and the first text data is corrected according to the speech recognition result to obtain second text data.
  • the processor 32 is configured to loop-play the original audio data between the first time progress and the second time progress, and to use a speech recognition engine to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result.
  • the processor 32 is configured to intercept the first part and the second part of the first text data, and the first part is located between the first starting position and the first ending position, so The second part is located outside the first starting position and the first ending position; compare the first part with the speech recognition result, and if the first part is inconsistent with the speech recognition result, use the The speech recognition result replaces the first part, and second text data including the speech recognition result and the second part is obtained.
  • the processor 32 is configured to convert the original audio format of the original audio data into a target audio format based on the FFMPEG tool; perform segmentation processing on the original audio data in the target audio format to obtain Target speech data; input the target speech data to the speech recognition engine.
  • the processor 32 is configured to receive a voice search instruction input by the user; perform voice recognition on the voice search instruction, and convert the voice search instruction into a search keyword; in the first text data Find the search keyword, and mark the location of the search keyword.
  • the processor 32 is configured to receive typo identification instructions input by the user, identify typos in the first text data based on the lexicon and contextual semantic recognition algorithm, and mark the typos.
  • the processor 32 is configured to mark the first text data with a first color, the first color being different from black; and/or to bold the text in the first text data.
  • Embodiments of the present disclosure also provide a readable storage medium on which programs or instructions are stored. When the programs or instructions are executed by a processor, the steps of the voice data processing method as described above are implemented.
  • Computer-readable media includes both persistent and non-volatile, removable and non-removable media that can be implemented by any method or technology for storage of information.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing terminal device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • the method of the embodiments can be implemented by means of software plus the necessary general hardware platform; of course, it can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a computer software product.
  • the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or CD) and includes several instructions to cause a terminal (which can be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in the various embodiments of this application.


Abstract

A voice data processing method and apparatus, belonging to the technical field of voice processing. The voice data processing method comprises: acquiring original audio data (101); receiving a first marking instruction for a first time progress of the original audio data (102); receiving a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress (103); and storing the first time progress, the second time progress and the original audio data (104). The method can achieve voice segmentation.

Description

语音数据处理方法及装置Voice data processing method and device
相关申请的交叉引用Cross-references to related applications
本申请主张在2022年05月25日在中国提交的中国专利申请号No.202210578264.X的优先权,其全部内容通过引用包含于此。This application claims priority to Chinese Patent Application No. 202210578264.X filed in China on May 25, 2022, the entire content of which is incorporated herein by reference.
技术领域Technical field
本公开涉及语音处理技术领域,特别是指一种语音数据处理方法及装置。The present disclosure relates to the field of voice processing technology, and in particular, to a voice data processing method and device.
背景技术Background technique
目前语音输入得到普遍使用,特别是开会时通过语音转文字可以快捷省事完成会议全程录音并实时生成文字纪录。但目前语音转文字受环境和说话人发音等多重因素的影响,并不能达到100%的准确率,会后还需要人为对生成的文字进行校正。相关技术中需要把录音从头到尾听一遍进行校正,耗费时间长,效率低,用户体验不好。At present, voice input is widely used, especially during meetings, voice-to-text can be used to quickly and easily record the entire meeting and generate text records in real time. However, currently speech-to-text is affected by multiple factors such as the environment and the speaker's pronunciation, and cannot achieve 100% accuracy. The generated text needs to be manually corrected after the meeting. In related technologies, it is necessary to listen to the recording from beginning to end for correction, which takes a long time, is inefficient, and has a poor user experience.
发明内容Contents of the invention
本公开要解决的技术问题是提供一种语音数据处理方法及装置,能够实现语音分段。The technical problem to be solved by this disclosure is to provide a voice data processing method and device that can realize voice segmentation.
为解决上述技术问题,本公开的实施例提供技术方案如下:In order to solve the above technical problems, embodiments of the present disclosure provide the following technical solutions:
一方面,提供一种语音数据处理方法,包括:On the one hand, a voice data processing method is provided, including:
获取原始音频数据;Get original audio data;
接收针对所述原始音频数据的第一时间进度的第一标记指令;receiving a first marking instruction for a first time progression of the original audio data;
接收针对所述原始音频数据的第二时间进度的第二标记指令,所述第二时间进度晚于所述第一时间进度;receiving a second marking instruction for a second time schedule of the original audio data, the second time schedule being later than the first time schedule;
存储所述第一时间进度、所述第二时间进度和所述原始音频数据。The first time schedule, the second time schedule and the original audio data are stored.
In some embodiments, after storing the first time progress, the second time progress, and the original audio data, the method further includes:
taking the original audio data between the first time progress and the second time progress as target audio data, and performing marking processing on the target audio data.
In some embodiments, after storing the first time progress, the second time progress, and the original audio data, the method further includes:
performing speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
In some embodiments, after the original audio data is acquired, the method further includes:
performing speech recognition on the original audio data to obtain first text data;
after receiving the first marking instruction for the first time progress of the original audio data, the method further includes:
starting marking processing of the first text data from a first start position, the first start position corresponding to the first time progress;
after receiving the second marking instruction for the second time progress of the original audio data, the method further includes:
stopping the marking processing of the first text data at a first end position, the first end position corresponding to the second time progress; and
storing the first text data after the marking processing.
In some embodiments, after storing the first text data after the marking processing, the method further includes:
receiving a processing instruction for a second position of the first text data, the second position being located between the first start position and the first end position; and
performing speech recognition on the original audio data between the first time progress and the second time progress, and correcting the first text data according to a speech recognition result to obtain second text data.
In some embodiments, performing speech recognition on the original audio data between the first time progress and the second time progress includes:
playing the original audio data between the first time progress and the second time progress in a loop; and
performing speech recognition on the original audio data between the first time progress and the second time progress by using a speech recognition engine to obtain a speech recognition result.
In some embodiments, correcting the first text data according to the speech recognition result includes:
extracting a first part and a second part of the first text data, the first part being located between the first start position and the first end position, and the second part being located outside the first start position and the first end position; and
comparing the first part with the speech recognition result, and if the first part is inconsistent with the speech recognition result, replacing the first part with the speech recognition result to obtain second text data including the speech recognition result and the second part.
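The replacement described here amounts to a string splice: the first part between the start and end positions is swapped for the re-recognition result, while the second part outside those positions is preserved. A minimal Python sketch of that logic (function and parameter names are illustrative, not part of the disclosed apparatus):

```python
def correct_marked_span(text, start, end, recognized):
    """Replace the marked first part text[start:end] with the
    re-recognition result when they differ; the surrounding second
    part (text before start and after end) is kept unchanged."""
    first_part = text[start:end]
    if first_part == recognized:
        return text  # consistent: no correction needed
    return text[:start] + recognized + text[end:]
```

If the first part already matches the recognition result, the original first text data is returned unchanged.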
In some embodiments, performing speech recognition on the original audio data includes:
converting an original audio format of the original audio data into a target audio format based on the FFMPEG tool;
segmenting the original audio data in the target audio format to obtain target voice data; and
inputting the target voice data into a speech recognition engine.
In some embodiments, the method further includes:
receiving a voice search instruction input by a user;
performing speech recognition on the voice search instruction, and converting the voice search instruction into a search keyword; and
searching for the search keyword in the first text data, and performing marking processing on the position where the search keyword is located.
In some embodiments, the method further includes:
receiving a typo recognition instruction input by the user; and
identifying typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and performing marking processing on the typos.
In some embodiments, performing marking processing on the first text data includes:
marking the first text data with a first color, the first color being different from black; and/or
bolding the characters in the first text data.
Embodiments of the present disclosure further provide a voice data processing apparatus, including:
an acquisition module configured to acquire original audio data;
a first receiving module configured to receive a first marking instruction for a first time progress of the original audio data;
a second receiving module configured to receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress; and
a storage module configured to store the first time progress, the second time progress, and the original audio data.
In some embodiments, the apparatus further includes:
a marking processing module configured to take the original audio data between the first time progress and the second time progress as target audio data and perform marking processing on the target audio data.
In some embodiments, the apparatus further includes:
a speech recognition module configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
In some embodiments, the apparatus further includes:
a speech recognition module configured to perform speech recognition on the original audio data to obtain first text data; and
a marking processing module configured to start marking processing of the first text data from a first start position, the first start position corresponding to the first time progress, and to stop the marking processing of the first text data at a first end position, the first end position corresponding to the second time progress;
the storage module being configured to store the first text data after the marking processing.
In some embodiments, the apparatus further includes:
a third receiving module configured to receive a processing instruction for a second position of the first text data, the second position being located between the first start position and the first end position; and
a second processing module configured to perform speech recognition on the original audio data between the first time progress and the second time progress and correct the first text data according to a speech recognition result to obtain second text data.
In some embodiments, the speech recognition module is specifically configured to play the original audio data between the first time progress and the second time progress in a loop, and to perform speech recognition on that original audio data by using a speech recognition engine to obtain a speech recognition result.
In some embodiments, the second processing module includes:
an extraction submodule configured to extract a first part and a second part of the first text data, the first part being located between the first start position and the first end position, and the second part being located outside the first start position and the first end position; and
a comparison submodule configured to compare the first part with the speech recognition result and, if the first part is inconsistent with the speech recognition result, replace the first part with the speech recognition result to obtain second text data including the speech recognition result and the second part.
In some embodiments, the speech recognition module includes:
a conversion submodule configured to convert an original audio format of the original audio data into a target audio format based on the FFMPEG tool;
a segmentation submodule configured to segment the original audio data in the target audio format to obtain target voice data; and
a processing submodule configured to input the target voice data into a speech recognition engine.
In some embodiments, the apparatus further includes:
a voice search module configured to receive a voice search instruction input by a user, perform speech recognition on the voice search instruction, convert the voice search instruction into a search keyword, search for the search keyword in the first text data, and perform marking processing on the position where the search keyword is located.
In some embodiments, the apparatus further includes:
a typo recognition module configured to receive a typo recognition instruction input by the user, identify typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and perform marking processing on the typos.
Embodiments of the present disclosure further provide a voice data processing apparatus, including a processor and a memory, wherein the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the voice data processing method described above.
Embodiments of the present disclosure further provide a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of the voice data processing method described above.
Embodiments of the present disclosure have the following beneficial effects.
In the above solution, after the original audio data is acquired, marking instructions for different time progresses of the original audio data are received, and the original audio data is stored together with those time progresses, so that the speech can be segmented according to the recorded time progresses.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of a voice data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the composition of an electronic device according to an embodiment of the present disclosure;
FIG. 3 is a structural block diagram of a voice data processing apparatus according to an embodiment of the present disclosure;
FIG. 4 is a structural block diagram of a second processing module according to an embodiment of the present disclosure;
FIG. 5 is a structural block diagram of a speech recognition module according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the composition of a voice data processing apparatus according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
To make the technical problems to be solved, the technical solutions, and the advantages of the embodiments of the present disclosure clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
Embodiments of the present disclosure provide a voice data processing method and apparatus capable of segmenting speech.
An embodiment of the present disclosure provides a voice data processing method, as shown in FIG. 1, including:
Step 101: acquiring original audio data.
The technical solution of this embodiment is applied to an electronic device capable of human-computer interaction with a user. As shown in FIG. 2, the electronic device includes a voice recording system, a speech-to-text system, a timer system, a text style system, a voice data operation system, a voice playback system, and the like. The electronic device can interact with a back-end server through a network.
Optionally, in this embodiment, the electronic device may be a terminal device configured with a target client and/or a target server. The terminal device may be a microphone or a microphone array, or a terminal device equipped with a microphone. The electronic device may include, but is not limited to, at least one of the following: a mobile phone (such as an Android phone or an iOS phone), a laptop, a tablet, a handheld computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, and the like. The target client may be a video client, an instant messaging client, a browser client, an education client, etc. The target server may be a video server, an instant messaging server, a browser server, an education server, etc. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network, and a wide area network, and the wireless network includes Bluetooth, WiFi, and other networks implementing wireless communication. The server may be a single server, a server cluster composed of multiple servers, or a cloud server. The above is only an example, and this embodiment is not limited thereto.
In this embodiment, the original audio data can be obtained by recording with a microphone or a microphone array. The original audio data may be a data file in any of various audio formats obtained by a recording terminal, including but not limited to ACT, REC, MP3, WAV, WMA, VY1, VY2, DVF, MSC, and AIFF; the original audio data may also be Pulse Code Modulation (PCM) audio stream data.
The electronic device may display a recording button on its operation interface. When the user taps the recording button, recording begins: the voice recording system starts working, continuously collects audio data in a loop in a sub-thread through AudioRecord and AudioChunk, and passes the collected audio data to the speech-to-text system so that it can convert the audio data into text. Here, AudioRecord is the Android media recording tool; AudioChunk is a custom data container that holds a byte array and provides a function for converting the byte array into a short array; the byte array is used to receive the audio data returned by AudioRecord.
Step 102: receiving a first marking instruction for a first time progress of the original audio data.
Step 103: receiving a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress.
In this embodiment, the timer system is used to record the time progress during recording and playback of the audio data, and the recording time progress of the original audio data corresponds to the playback time progress.
During recording, the timer system can record various time points, including the total recording duration, the start time point of the marking processing (that is, the first time progress), and the end time point (that is, the second time progress). The first time progress and the second time progress appear in pairs, and there may be one or more first time progresses; the original audio data between each pair of first and second time progresses is the original audio data that requires particular attention. The first time progress may be the start time point of the entire original audio data or some time point in the middle of it; the second time progress may be the end time point of the entire original audio data or some time point in the middle of it.
Step 104: storing the first time progress, the second time progress, and the original audio data.
A segment of speech can be determined by the first time progress and the second time progress, and the speech can be segmented accordingly, where the first time progress is the start time point of the segmented speech and the second time progress is its end time point. When the original audio data includes multiple pairs of first and second time progresses, the original audio data can be divided into multiple speech segments.
When recording starts, the timer system obtains the current time of the electronic device in milliseconds as the start time point of the speech, and the end time point and duration of the speech can then be updated every millisecond through Timer, the Android scheduled-task tool.
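The segmentation of steps 102 to 104 amounts to cutting the stored audio at each (first time progress, second time progress) pair. A minimal Python sketch of that logic, assuming the audio is available as a flat sample array (the patent itself describes an Android implementation; all names here are illustrative):

```python
def segment_audio(samples, sample_rate, mark_pairs):
    """Cut raw audio samples into segments using (start_s, end_s)
    time-progress pairs given in seconds; each second time progress
    must be later than its paired first time progress."""
    segments = []
    for start_s, end_s in mark_pairs:
        if end_s <= start_s:
            raise ValueError("second time progress must be later than the first")
        start_i = int(start_s * sample_rate)  # first time progress -> sample index
        end_i = int(end_s * sample_rate)      # second time progress -> sample index
        segments.append(samples[start_i:end_i])
    return segments
```

With several pairs, the original audio data is divided into several speech segments, one per pair, exactly as described above.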
In some embodiments, after storing the first time progress, the second time progress, and the original audio data, the method further includes:
taking the original audio data between the first time progress and the second time progress as target audio data, and performing marking processing on the target audio data.
The original audio data between the first time progress and the second time progress is the original audio data that requires particular attention. To help the user quickly locate it, that original audio data can be taken as target audio data and marked. In a specific example, the paired first and second time progresses can be marked on the playback progress bar corresponding to the original audio data, or a dedicated display interface can be used to show the information of each pair. For example, if the first time progress is 38 seconds and the second time progress is 58 seconds, these two time points can be marked on the playback progress bar, and the user can use the marked 38-second and 58-second points to locate the original audio data that requires attention; alternatively, the two time points can be shown on a display interface corresponding to the original audio data, and the user can locate the relevant audio from the recorded time points. As a further alternative, the target audio data between the first time progress and the second time progress can be cut out directly and stored in a specific area.
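Marking the paired time progresses on a playback progress bar, as in the 38-second/58-second example above, reduces to mapping each pair to fractional positions along the bar. A short, hedged sketch (how the UI actually renders the markers is not specified by the disclosure):

```python
def progress_markers(mark_pairs, total_s):
    """Map (first, second) time-progress pairs in seconds to
    fractional positions on a playback progress bar of total_s seconds."""
    markers = []
    for start_s, end_s in mark_pairs:
        markers.append((start_s / total_s, end_s / total_s))
    return markers
```

For a 100-second recording with the pair (38, 58), the markers land at 38% and 58% of the bar.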
Since the original audio data between the first time progress and the second time progress is the data that requires particular attention, speech recognition may be performed only on that portion to obtain segmented text data. This reduces the speech recognition workload while still ensuring that the user obtains the key content that needs attention.
Of course, in this embodiment, speech recognition may also be performed on all of the original audio data. In some embodiments, after the original audio data is acquired, the method further includes:
performing speech recognition on the original audio data to obtain first text data.
In this embodiment, the speech-to-text system is used to convert the original audio data into text. In practice, the original audio data can be converted into the first text data by a speech recognition engine using Automatic Speech Recognition (ASR). ASR is a technology that converts human speech into text; its goal is to let a computer "take dictation" of continuous speech from different speakers, which is why it is also called a "speech dictation machine", a technology for converting "sound" into "text". In this embodiment, the speech recognition engine may be the Google, Microsoft, or iFlytek speech recognition engine, which is not limited here; the speech recognition engine converts the speech segments in the original audio data into text information.
Specifically, the original audio format of the original audio data can be converted into a target audio format based on the FFMPEG tool; the original audio data in the target audio format is segmented to obtain target voice data; and the target voice data is input into the speech recognition engine to obtain the first text data. For example, the original audio data is converted from PCM format to MP3 format based on the FFMPEG tool, and the MP3-format audio is segmented to obtain target voice data containing speech segments; in other words, only the audio segments containing human voice may be retained from the MP3-format original audio data. Converting the original audio data to MP3 format makes it convenient for the user to segment and save it.
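The PCM-to-MP3 conversion above can be driven from the FFMPEG command line. A Python sketch follows; it assumes raw 16-bit little-endian mono capture at 16 kHz, which must be stated explicitly because headerless PCM carries no format information (the exact flags depend on how the audio was recorded, and the injectable `run` parameter exists only so the command can be inspected without invoking ffmpeg):

```python
import subprocess

def pcm_to_mp3(pcm_path, mp3_path, sample_rate=16000, channels=1,
               run=subprocess.run):
    """Convert headerless raw PCM (s16le) to MP3 with the ffmpeg CLI."""
    cmd = [
        "ffmpeg", "-y",
        "-f", "s16le",            # raw 16-bit little-endian samples
        "-ar", str(sample_rate),  # capture sample rate
        "-ac", str(channels),     # channel count
        "-i", pcm_path,
        mp3_path,
    ]
    run(cmd, check=True)
    return cmd
```

Splitting the resulting MP3 into voice-only segments (for example by silence detection) is a separate step not shown here.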
In some embodiments, the speech-to-text system may also be a streaming speech recognition system based on the deep-learning Transformer model. The streaming system supports converting while recording, that is, converting audio data into text data as the recording is made, and also supports directly recognizing existing audio data.
After the first marking instruction for the first time progress of the original audio data is received, the first time progress of the original audio data at that moment is recorded, and marking processing of the first text data starts from the first start position, the first start position corresponding to the first time progress.
After the second marking instruction for the second time progress of the original audio data is received, the second time progress of the original audio data at that moment is recorded, and the marking processing of the first text data stops at the first end position, the first end position corresponding to the second time progress.
In this embodiment, the timer system records the time progress during recording and playback of the audio data, the recording time progress of the original audio data corresponding to the playback time progress. The text style system marks the content of the first text data during recording and playback, including marking the first text data with a first color different from black and/or bolding the characters in the first text data, so that the user can easily identify the content requiring attention in the first text data.
During recording, the timer system records the various time points throughout the conversion of the original audio data into the first text data, including the total recording duration, the start time point of the marking processing (the first time progress), and the end time point (the second time progress). The first and second time progresses appear in pairs, there may be one or more first time progresses, and the original audio data between each pair is the data that requires particular attention, that is, the original audio data corresponding to the text that needs correction. Meanwhile, the first time progress corresponds one-to-one with the first start position: after the original audio data corresponding to the first time progress is converted into text, its position in the first text data is the first start position. Likewise, the second time progress corresponds one-to-one with the first end position: after the original audio data corresponding to the second time progress is converted into text, its position in the first text data is the first end position.
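Because the voice data operation system stores the time progress of each character of the speech content, mapping a marking instruction's time progress to the corresponding text position is a lookup in a sorted list of per-character times. A minimal sketch, assuming such a list is available (the lookup strategy is illustrative, not mandated by the disclosure):

```python
import bisect

def text_position_for_time(char_times, t):
    """Given the time progress (seconds) at which each character of the
    recognized text was spoken, return the index of the character being
    spoken at time t; times before the first character map to index 0."""
    i = bisect.bisect_right(char_times, t) - 1
    return max(i, 0)
```

The same lookup works in both directions of the correspondence described above: a first time progress yields a first start position, and a second time progress yields a first end position.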
When recording starts, the timer system obtains the current time of the electronic device in milliseconds as the start time point of the speech, and the end time point and duration of the speech can then be updated every millisecond through Timer, the Android scheduled-task tool.
Specifically, while acquiring the original audio data through microphone or microphone-array recording, the electronic device can display the first text data corresponding to the original audio data to the user in real time on the operation interface. When the user clicks or selects content of the first text data for the (2k-1)-th time, the position of that content in the first text data is recorded as a first start position; when the user clicks or selects content for the 2k-th time, the position of that content is recorded as a first end position; the first text data between the first start position and the first end position is then marked, where k is a positive integer. For example, when the user clicks or selects "then" in the first text data for the third time, the position of "then" is recorded as a first start position; when the user clicks or selects "writing pad" for the fourth time, the position of "writing pad" is recorded as a first end position, and the text between them is marked, for example by bolding and/or coloring. When the user clicks or selects "very good" for the seventh time, its position is recorded as a first start position; when the user clicks or selects "planning" for the eighth time, its position is recorded as a first end position, and the text "a very good plan" between them is marked, for example by bolding and/or coloring.
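The odd/even click behavior above can be expressed as pairing successive click positions: each (2k-1)-th click opens a mark range and the 2k-th click closes it. An illustrative Python sketch (a trailing unpaired click is simply left open, one reasonable choice among several):

```python
def pair_clicks(click_positions):
    """Pair successive click positions into (start, end) mark ranges:
    the (2k-1)-th click opens a range, the 2k-th click closes it.
    An unmatched final click yields no range."""
    pairs = []
    for i in range(0, len(click_positions) - 1, 2):
        pairs.append((click_positions[i], click_positions[i + 1]))
    return pairs
```

Each resulting (start, end) pair is a span of the first text data to be bolded and/or colored.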
In this embodiment, the text style system is used to mark the first text data, for example by coloring or bolding, to record the position of the marked content within the speech content, and to synchronize the time points of the marked content. The speech content is the entire speech string returned by the speech-to-text system; the marked content is the marked string returned by the speech-to-text system during the period from the start mark to the end mark. The start and end positions of the marked content are its positions within the speech content, which for a string are usually given by character indices.
Afterwards, the marked first text data, the first start position, the first end position, the first time progress, the second time progress and the original audio data are stored.
In this embodiment, the voice data operating system is used to store the marked first text data, the first start position, the first end position, the first time progress, the second time progress and the original audio data. The voice data operating system contains a database in which, for each piece of original audio data, it saves the audio file index, the speech content, the speech duration, all mark data, the position of each character of the speech content within the speech, and so on. The audio file index is the storage path of the audio file; the mark data comprises the marked content, mark position and mark time point of each mark; and the position of a character within the speech is the time progress within the speech corresponding to that character.
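For illustration, the kind of record the database might hold can be sketched as the data structures below. All class and field names are hypothetical; they only mirror the items listed above (audio file index, speech content, duration, mark data, per-character time progress).

```python
from dataclasses import dataclass, field

@dataclass
class Mark:
    content: str      # marked string text
    start_index: int  # start position within the speech content (string index)
    end_index: int    # end position within the speech content
    start_ms: int     # first time progress, milliseconds into the audio
    end_ms: int       # second time progress

@dataclass
class VoiceRecord:
    audio_path: str      # audio file index: storage path of the audio file
    speech_content: str  # full transcript returned by the speech-to-text system
    duration_ms: int     # speech duration
    marks: list = field(default_factory=list)          # all mark data
    char_times_ms: list = field(default_factory=list)  # per-character time progress

record = VoiceRecord("/data/rec_001.mp3", "a very good plan", 58000)
record.marks.append(Mark("very good", 2, 11, 38000, 58000))
print(record.marks[0].content)  # -> very good
```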
In this embodiment, when the original audio data is converted into the first text data, erroneous text or key content in it can be marked, the marked first text data can be stored, and the time progress corresponding to the original audio data can be stored. In this way, when the first text data is later proofread, selecting the marked text allows the playback position to be synchronized quickly to the corresponding original audio data according to the stored time progress. This makes it convenient for the user to correct or otherwise process the marked text, spares the user from listening to the original audio data again from beginning to end, improves the efficiency of correcting the transcribed text, and improves the user experience.
In some embodiments, after the marked first text data, the first start position, the first end position, the first time progress, the second time progress and the original audio data are stored, when the first text data needs to be corrected, the method further includes:
receiving a processing instruction for a second position of the first text data, the second position being located between the first start position and the first end position;
In this embodiment, when the first text data needs to be corrected, the voice playback system plays the recorded original audio data while the operation interface displays the corresponding first text data to the user in real time. The first text data includes a first part and a second part. The first part is located between each pair of first start and first end positions, is the content requiring particular attention, and has already been marked; the second part is located outside the first start and first end positions and has not been marked. The first part is where speech-to-text errors may exist, while the second part is unlikely to contain errors; therefore, to improve efficiency when correcting the first text data, only the first part needs to be corrected.
The user may click or select any second position located between the first start position and the first end position; this is treated as receiving a processing instruction for the second position, indicating that the first text data between the first start position and the first end position needs to be corrected. Whenever the user clicks or selects any position between the first start position and the first end position, it is deemed that the first text data between those two positions needs to be corrected.
The corresponding position in the original audio data can then be located quickly according to the pre-stored first time progress corresponding to the first start position and the second time progress corresponding to the first end position, and the original audio data between the first time progress and the second time progress is replayed. Specifically, before a processing instruction for the next second position is received, the original audio data between the first time progress and the second time progress may be played in a loop: playback starts from the first time progress, stops at the second time progress, and then returns to the first time progress and starts again. Alternatively, playback may stop after a preset number of passes, for example after the original audio data has been played once or twice.
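The looping-replay behaviour can be modelled as below. This is a sketch under the assumption that playback is driven by (start, end) windows in milliseconds; the generator name and interface are invented for illustration.

```python
def playback_windows(start_ms, end_ms, max_loops=None):
    """Yield (from_ms, to_ms) playback windows for the audio segment
    between the first and second time progresses.

    With max_loops=None the segment loops until a new processing
    instruction arrives (modelled here as the caller breaking out of
    the loop); otherwise playback stops after max_loops passes.
    """
    loops = 0
    while max_loops is None or loops < max_loops:
        yield (start_ms, end_ms)
        loops += 1

# Play the 38 s - 72 s segment twice, then stop.
windows = list(playback_windows(38000, 72000, max_loops=2))
print(windows)  # -> [(38000, 72000), (38000, 72000)]
```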
A speech recognition engine performs speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result; for example, speech recognition performed on the original audio data between 38 seconds and 1 minute 12 seconds yields the result "then, since there is a writing pad, why use a notepad";
a first part and a second part of the first text data are extracted, the first part being located between the first start position and the first end position, and the second part being located outside the first start position and the first end position; for example, the first part is "then, since there is a writing pad, why a notepad";
the first part is compared with the speech recognition result, and if the first part is inconsistent with the speech recognition result, the first part is replaced with the speech recognition result to obtain second text data comprising the replaced first part and the second part.
When the speech recognition result obtained by passing the original audio data through the speech recognition engine differs greatly from the first part, the original audio data may be input into the speech recognition engine several times to obtain the speech recognition result, so as to improve recognition accuracy. When the first part is inconsistent with the speech recognition result, the first part may be replaced with the speech recognition result, or the first part may be modified to be consistent with the speech recognition result, thereby correcting the first text data and obtaining corrected second text data. For example, the speech recognition result "then, since there is a writing pad, why use a notepad" may be used to replace the first part "then, since there is a writing pad, why a notepad".
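The compare-and-replace step amounts to a plain string substitution over the marked span. The sketch below is illustrative; the index-based interface is an assumption, not the disclosed implementation.

```python
def correct_first_part(first_text, start, end, recognition_result):
    """Replace the marked first part, i.e. the characters at
    [start:end), with the speech recognition result when the two are
    inconsistent; return the text unchanged when they already agree."""
    first_part = first_text[start:end]
    if first_part == recognition_result:
        return first_text  # already consistent, nothing to correct
    return first_text[:start] + recognition_result + first_text[end:]

# Small stand-in for "why a notepad" vs. "why use a notepad".
text = "AAA why a notepad ZZZ"
print(correct_first_part(text, 4, 17, "why use a notepad"))
# -> AAA why use a notepad ZZZ
```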
In some embodiments, after the second text data is obtained, the method further includes:
storing the second text data, the first time progress, the second time progress and the original audio data.
After the first text data is corrected, the voice data operating system saves the second text data obtained by the correction. In this embodiment, since only the original audio data between the first time progress and the second time progress needs to be replayed when correcting the first text data, the efficiency of correcting the voice file can be greatly improved.
In this embodiment, the above scheme can be used to correct the first text data. Alternatively, the original audio data between the first time progress and the second time progress can be played to the user, and the user, after listening to it, manually corrects the first text data to obtain the second text data.
In addition, in this embodiment, after voice recording ends, a voice search can also be performed, and the method further includes:
receiving a voice search instruction input by the user;
performing speech recognition on the voice search instruction, and converting the voice search instruction into a search keyword;
searching for the search keyword in the first text data, and marking the position of the search keyword.
Specifically, a voice search button may be displayed on the operation interface of the electronic device. If the user clicks the voice search button and then inputs speech, this is treated as receiving a voice search instruction input by the user: the voice recording system starts recording and passes the recorded voice data to the speech-to-text system for processing, which converts the voice search instruction into a search keyword. While the user is engaging the voice search button, all speech input by the user is treated as a voice search instruction. After receiving the search keyword, the text style system searches for the keyword in the first text data and marks the positions where it occurs, for example by highlighting them. When the user clicks the first text data at a highlighted position, the corresponding original audio data can also be played, starting from the beginning of the highlighted position and stopping at its end. The technical solution of this embodiment makes it convenient for users to find the content they need in both the text data and the audio data.
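The keyword lookup that drives the highlighting can be sketched as below; this is a minimal illustration, assuming positions are expressed as character-index ranges.

```python
def find_keyword_positions(text, keyword):
    """Return (start, end) index pairs for every occurrence of the
    search keyword in the first text data, for highlighting."""
    positions = []
    i = text.find(keyword)
    while i != -1:
        positions.append((i, i + len(keyword)))
        i = text.find(keyword, i + 1)
    return positions

print(find_keyword_positions("plan the plan", "plan"))  # -> [(0, 4), (9, 13)]
```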
In addition, in this embodiment, after voice recording ends, typo recognition can also be performed, and the method further includes:
receiving a typo recognition instruction input by the user;
identifying typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and marking the typos.
Specifically, a typo recognition button may be displayed on the operation interface of the electronic device. If the user clicks the typo recognition button, this is treated as receiving a typo recognition instruction input by the user; typos in the first text data can then be identified according to the lexicon and the contextual semantic recognition algorithm and marked, for example by highlighting them, to remind the user, who can then correct the typos to improve the accuracy of the text data.
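The disclosure does not specify the contextual semantic recognition algorithm; as a crude stand-in, the sketch below flags a token when it is absent from the lexicon or when its word-pair context has never been observed in bigram statistics. All names and the bigram heuristic are assumptions for illustration only.

```python
def flag_typos(tokens, lexicon, bigram_counts, min_count=1):
    """Return indices of tokens suspected to be typos: a token absent
    from the lexicon, or one whose (previous, current) word pair is
    rarer than min_count in the bigram statistics."""
    flagged = []
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            flagged.append(i)
        elif i > 0 and bigram_counts.get((tokens[i - 1], tok), 0) < min_count:
            flagged.append(i)
    return flagged

lexicon = {"writing", "pad", "plan"}
bigrams = {("writing", "pad"): 12}
print(flag_typos(["writing", "pad", "plam"], lexicon, bigrams))  # -> [2]
```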
An embodiment of the present disclosure further provides a voice data processing apparatus, as shown in Figure 3, including:
an acquisition module 21, configured to acquire original audio data;
The technical solution of this embodiment is applied to an electronic device capable of human-computer interaction with a user. As shown in Figure 2, the electronic device includes a voice recording system, a speech-to-text system, a timer system, a text style system, a voice data operating system, a voice playback system and the like. The electronic device can interact with a back-end server through a network.
Optionally, in this embodiment, the electronic device may be a terminal device configured with a target client and/or a target server; the terminal device may be a microphone or a microphone array, or a terminal device equipped with a microphone. The electronic device may include, but is not limited to, at least one of the following: a mobile phone (such as an Android phone or an iOS phone), a laptop computer, a tablet computer, a handheld computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, and the like. The target client may be a video client, an instant messaging client, a browser client, an education client, etc.; the target server may be a video server, an instant messaging server, a browser server, an education server, etc. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network and a wide area network, and the wireless network includes Bluetooth, Wi-Fi and other networks implementing wireless communication. The server may be a single server, a server cluster composed of multiple servers, or a cloud server. The above is merely an example, and this embodiment is not limited in this respect.
In this embodiment, the original audio data may be obtained by recording with a microphone or a microphone array. The original audio data may be a data file in any of various audio formats obtained by the recording terminal, including but not limited to the ACT, REC, MP3, WAV, WMA, VY1, VY2, DVF, MSC and AIFF formats; the original audio data may also be Pulse Code Modulation (PCM) audio stream data.
The electronic device may display a recording button on the operation interface. When the user clicks the recording button to start recording, the voice recording system begins working and, in a sub-thread, continuously collects audio data in a loop through AudioRecord and AudioChunk, passing the collected audio data to the speech-to-text system so that the speech-to-text system can convert the audio data into text. Here, AudioRecord is the Android media recording tool; AudioChunk is a custom data container that holds a byte array and provides a function for converting the byte array into a short array; and the byte array is used to receive the audio data returned by AudioRecord.
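The byte-array-to-short-array conversion that AudioChunk provides is an ordinary 16-bit PCM reinterpretation. A minimal sketch of that conversion, assuming little-endian signed 16-bit samples as commonly returned by AudioRecord-style APIs:

```python
import struct

def bytes_to_shorts(raw, little_endian=True):
    """Convert a raw 16-bit PCM byte buffer into a list of signed
    16-bit samples, mirroring the byte-array-to-short-array function
    described for AudioChunk."""
    fmt = "<" if little_endian else ">"
    count = len(raw) // 2  # two bytes per 16-bit sample
    return list(struct.unpack(fmt + "%dh" % count, raw[:count * 2]))

# Two little-endian samples: 0x0001 -> 1, 0xFFFF -> -1
print(bytes_to_shorts(b"\x01\x00\xff\xff"))  # -> [1, -1]
```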
a first receiving module 22, configured to receive a first marking instruction for a first time progress of the original audio data;
a second receiving module 23, configured to receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress;
In this embodiment, the timer system is used to record the time progress during recording and playback of the audio data, where the recording time progress of the original audio data corresponds to its playback time progress.
During recording, the timer system can record various time points, including the total recording duration, the start time point of marking (i.e., the first time progress) and the end time point of marking (i.e., the second time progress). The first time progress and the second time progress appear in pairs, there may be one or more first time progresses, and the original audio data between each pair of first and second time progresses is the original audio data requiring particular attention. The first time progress may be the start time point of the entire original audio data or a time point in the middle of it; the second time progress may be the end time point of the entire original audio data or a time point in the middle of it.
a storage module 24, configured to store the first time progress, the second time progress and the original audio data.
A segment of speech can be determined by the first time progress and the second time progress, and the speech can be segmented according to them: the first time progress is the start time point of the segmented speech, and the second time progress is the end time point of the segmented speech. When the original audio data includes multiple pairs of first and second time progresses, the original audio data can be divided into multiple speech segments.
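The segmentation by paired time progresses can be sketched as below; this is an illustrative sketch in which pairs falling outside the recording are simply ignored, an assumption the disclosure does not specify.

```python
def split_by_time_pairs(total_ms, pairs):
    """Given paired (first, second) time progresses in milliseconds,
    return the speech segments they delimit: each first progress is a
    segment's start time point and each second progress its end time
    point. Pairs outside [0, total_ms] are dropped in this sketch."""
    segments = []
    for start_ms, end_ms in sorted(pairs):
        if 0 <= start_ms < end_ms <= total_ms:
            segments.append((start_ms, end_ms))
    return segments

print(split_by_time_pairs(120000, [(38000, 58000), (70000, 95000)]))
# -> [(38000, 58000), (70000, 95000)]
```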
When recording starts, the timer system takes the current time of the electronic device, in milliseconds, as the start time point of the speech, and the Android scheduled-task tool Timer can then update the end time point and the duration of the speech once every millisecond.
In some embodiments, the apparatus further includes:
a marking processing module 26, configured to take the original audio data between the first time progress and the second time progress as target audio data and to mark the target audio data.
The original audio data between the first time progress and the second time progress is the original audio data requiring particular attention. To help the user quickly identify it, the original audio data between the first time progress and the second time progress can be taken as target audio data and marked. In a specific example, the paired first and second time progresses can be indicated on the playback progress bar corresponding to the original audio data, or a dedicated display interface can be used to show the paired time-progress information. For example, if the first time progress is 38 seconds and the second time progress is 58 seconds, these two time points can be marked on the playback progress bar corresponding to the original audio data, and the user can use the marked 38-second and 58-second points to identify the original audio data requiring attention; alternatively, the 38-second and 58-second time points can be displayed on the display interface corresponding to the original audio data, and the user can use the recorded time points to identify that data.
In some embodiments, the apparatus further includes:
a speech recognition module 25, configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
Since the original audio data between the first time progress and the second time progress is the original audio data requiring particular attention, speech recognition may be performed only on that portion to obtain segmented text data. This reduces the speech recognition workload while still ensuring that the user obtains the key content needing attention.
Of course, in this embodiment, speech recognition may also be performed on all of the original audio data. In some embodiments, the speech recognition module 25 is configured to perform speech recognition on the original audio data to obtain first text data; the marking processing module 26 is configured to start marking the first text data from a first start position corresponding to the first time progress, and to stop marking the first text data at a first end position corresponding to the second time progress; and the storage module 24 is configured to store the marked first text data.
In this embodiment, the speech-to-text system is used to convert the original audio data into text. In practice, the original audio data can be converted into the first text data by a speech recognition engine using Automatic Speech Recognition (ASR) technology. ASR is a technology that converts human speech into text; its goal is to enable a computer to "dictate" continuous speech spoken by different people, and it is therefore also called a "voice dictation machine", realizing the conversion from "sound" to "text". In this embodiment, the speech recognition engine may be the Google speech recognition engine, the Microsoft speech recognition engine or the iFlytek speech recognition engine, without limitation; the speech recognition engine converts the speech segments in the original audio data into text information.
Specifically, the original audio format of the original audio data can be converted into a target audio format based on the FFMPEG tool; the original audio data in the target audio format is then segmented to obtain target voice data; and the target voice data is input into the speech recognition engine to obtain the first text data. For example, the original audio data is converted from the PCM format into the MP3 format based on the FFMPEG tool, and the MP3-format original audio data is segmented to obtain target voice data containing speech segments; that is, only the audio segments of the MP3-format original audio data that contain a human voice may be retained. Converting the original audio data into the MP3 format makes it convenient for the user to segment and save the original audio data.
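A PCM-to-MP3 conversion with ffmpeg can be sketched as the command construction below. The sample rate and channel count are assumptions about the capture settings and must match the actual recording; the helper name is invented for illustration.

```python
def build_ffmpeg_pcm_to_mp3_cmd(pcm_path, mp3_path,
                                sample_rate=16000, channels=1):
    """Build an ffmpeg command line converting raw 16-bit
    little-endian PCM into MP3. Raw PCM has no header, so the format,
    sample rate and channel count must be stated explicitly."""
    return [
        "ffmpeg",
        "-f", "s16le",            # raw signed 16-bit little-endian PCM input
        "-ar", str(sample_rate),  # input sample rate
        "-ac", str(channels),     # input channel count
        "-i", pcm_path,
        mp3_path,
    ]

cmd = build_ffmpeg_pcm_to_mp3_cmd("/data/rec_001.pcm", "/data/rec_001.mp3")
print(" ".join(cmd))
# The command could then be run with, e.g., subprocess.run(cmd, check=True).
```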
In some embodiments, the speech-to-text system may also be a streaming speech recognition system based on the deep-learning Transformer model. The streaming speech recognition system supports transcription while recording, that is, converting audio data into text data at the same time as it is recorded, and also supports directly recognizing existing audio data.
After the first marking instruction for the first time progress of the original audio data is received, the first time progress of the original audio data at that moment is recorded, and at the same time marking of the first text data starts from the first start position, which corresponds to the first time progress;
after the second marking instruction for the second time progress of the original audio data is received, the second time progress of the original audio data at that moment is recorded, and at the same time marking of the first text data stops at the first end position, which corresponds to the second time progress.
In this embodiment, the timer system is used to record the time progress during recording and playback of the audio data, where the recording time progress of the original audio data corresponds to its playback time progress. The text style system is used to mark the content of the first text data during recording and playback, including: marking the first text data with a first color, the first color being different from black; and/or bolding the characters in the first text data, so that the user can easily identify the content needing attention in the first text data.
During recording, the timer system records various time points throughout the process of converting the original audio data into the first text data, including the total recording duration, the start time point of marking (i.e., the first time progress) and the end time point of marking (i.e., the second time progress). The first time progress and the second time progress appear in pairs, there may be one or more first time progresses, and the original audio data between each pair of first and second time progresses is the original audio data requiring particular attention, that is, the original audio data corresponding to the text that needs to be corrected. Meanwhile, the first time progresses correspond one-to-one with the first start positions: after the original audio data corresponding to a first time progress is converted into text, its position in the first text data is the first start position. Likewise, the second time progresses correspond one-to-one with the first end positions: after the original audio data corresponding to a second time progress is converted into text, its position in the first text data is the first end position.
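The correspondence between a time progress and a text position can be resolved with a binary search over the per-character time progresses the system stores. This is a minimal sketch under the assumption that the per-character times are kept sorted in ascending order.

```python
import bisect

def time_to_text_position(char_times_ms, progress_ms):
    """Map a time progress (milliseconds) to a character position in
    the first text data, given the time progress at which each
    character was spoken (char_times_ms, sorted ascending)."""
    return bisect.bisect_left(char_times_ms, progress_ms)

# Characters spoken at 0 ms, 500 ms, 1200 ms, 1900 ms, 2600 ms.
char_times = [0, 500, 1200, 1900, 2600]
print(time_to_text_position(char_times, 1200))  # -> 2
```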
When recording starts, the timer system takes the current time of the electronic device, in milliseconds, as the start time point of the speech, and the Android scheduled-task tool Timer can then update the end time point and the duration of the speech once every millisecond.
Specifically, when the electronic device obtains original audio data by recording through a microphone or a microphone array, it may display the first text data corresponding to the original audio data to the user in real time through the operation interface. When the user clicks or selects content of the first text data for the (2k-1)-th time, the position of that content in the first text data is recorded as a first start position; when the user clicks or selects content of the first text data for the 2k-th time, the position of that content in the first text data is recorded as a first end position; and the first text data between the first start position and the first end position is marked, where k is a positive integer. For example, when the user clicks or selects "then" in the first text data for the third time, the position of "then" is recorded as the first start position; when the user clicks or selects "writing pad" in the first text data for the fourth time, the position of "writing pad" is recorded as the first end position, and the text between "then" and "writing pad", "then, since there is a writing pad, why a notepad", is marked, for example by bolding and/or coloring.
When the user clicks or selects "very good" in the first text data for the seventh time, the position of "very good" is recorded as the first start position; when the user clicks or selects "plan" in the first text data for the eighth time, the position of "plan" is recorded as the first end position, and the text "a very good plan" between "very good" and "plan" is marked, for example by bolding and/or coloring.
In this embodiment, the text style system is used to mark the first text data, for example by coloring or bolding, to record the position of the marked content within the voice content, and to synchronize the time points of the marked content. The voice content is the full string of text returned by the speech-to-text system for the whole recording; the marked content is the string returned by the speech-to-text system during the period from the start mark to the end mark. The starting position and ending position of the marked content describe where the marked content lies within the voice content, usually expressed as character indices into the string.
The storage module 24 is specifically configured to store the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress, and the original audio data.
In this embodiment, the voice data operating system is used to store the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress, and the original audio data. The voice data operating system contains a database that stores, for each piece of original audio data, the audio file index, the voice content, the voice duration, all mark data, the position of each character of the voice content within the speech, and so on. The audio file index is the storage path of the audio file; the mark data includes, for each mark, the marked content, the mark positions, and the mark time points; the position of a character within the speech is the time progress in the speech corresponding to that character.
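The per-recording record kept by the voice data operating system might be organized as below. The field names are illustrative assumptions rather than definitions taken from the disclosure:

```python
from dataclasses import dataclass, field

@dataclass
class MarkData:
    content: str       # marked string returned by the speech-to-text system
    start_index: int   # starting position within the voice content
    end_index: int     # ending position within the voice content
    start_time: float  # first time progress, in seconds
    end_time: float    # second time progress, in seconds

@dataclass
class RecordingEntry:
    audio_file_index: str   # storage path of the audio file
    voice_content: str      # full transcribed text
    voice_duration: float   # total length of the recording, in seconds
    marks: list = field(default_factory=list)       # all MarkData for this recording
    char_times: list = field(default_factory=list)  # time progress of each character

entry = RecordingEntry("/data/rec_001.wav", "a very good plan", 72.0)
entry.marks.append(MarkData("very good", 2, 11, 38.0, 41.5))
```

A mark's `start_time`/`end_time` is what later allows the player to jump straight to the matching audio segment.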
In this embodiment, when the original audio data is converted into the first text data, erroneous text or key content can be marked, the marked first text data can be stored, and the corresponding time progress within the original audio data can be stored. When the first text data is later proofread, selecting a marked piece of text allows quick synchronization to the corresponding position in the original audio data according to the stored time progress, which makes it convenient for the user to correct or otherwise process the marked text. This saves the user from listening to the original audio data again from beginning to end, improves the efficiency of correcting the transcribed text, and improves the user experience.
In some embodiments, as shown in Figure 3, the apparatus further includes:
a third receiving module 27, configured to receive a processing instruction for a second position of the first text data, the second position being located between the first starting position and the first ending position.
In this embodiment, when the first text data needs to be corrected, the voice playback system plays the recorded original audio data while the operation interface displays the first text data corresponding to the original audio data to the user in real time. The first text data includes a first part and a second part. The first part lies between each pair of first starting and first ending positions; it is the content that requires attention and has already been marked. The second part lies outside the first starting position and the first ending position and has not been marked. The first part is where speech-to-text errors are likely to occur, while the second part is unlikely to contain errors; therefore, to improve efficiency, only the first part needs to be corrected when correcting the first text data.
The user can click or select any second position located between the first starting position and the first ending position; this is treated as receiving a processing instruction for that second position, meaning that the first text data between the first starting position and the first ending position needs to be corrected. Whenever the user clicks or selects any position between the first starting position and the first ending position, the first text data between those positions is deemed to require correction.
The second processing module 28 is configured to perform speech recognition on the original audio data between the first time progress and the second time progress, and to correct the first text data according to the speech recognition result to obtain second text data.
According to the pre-stored first time progress corresponding to the first starting position and second time progress corresponding to the first ending position, the corresponding position in the original audio data can be located quickly, and the original audio data between the first time progress and the second time progress can be replayed. Specifically, before a processing instruction for the next second position is received, the original audio data between the first time progress and the second time progress can be played in a loop: playback starts at the first time progress, stops at the second time progress, and then returns to the first time progress to start again. Alternatively, playback can stop after a preset number of passes, for example after playing the segment once or twice.
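The looping behavior just described can be sketched as a mapping from elapsed wall-clock time to a position inside the marked segment. This is a bookkeeping sketch under assumed names, with no real audio backend:

```python
def loop_position(elapsed, start, end, max_loops=None):
    """Map time elapsed since playback began to a position inside the
    [start, end) segment, returning to start at the end of each pass.
    Returns None once the preset number of passes has been played."""
    seg = end - start
    if seg <= 0:
        raise ValueError("the second time progress must be later than the first")
    loops_done = int(elapsed // seg)
    if max_loops is not None and loops_done >= max_loops:
        return None  # stop after the preset number of plays
    return start + (elapsed % seg)

# Segment from 38 s to 72 s (1 min 12 s), as in the example below.
print(loop_position(0.0, 38.0, 72.0))                # 38.0 — starts at the first time progress
print(loop_position(40.0, 38.0, 72.0))               # 44.0 — 6 s into the second pass
print(loop_position(35.0, 38.0, 72.0, max_loops=1))  # None — stopped after one play
```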
A speech recognition engine is used to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result. For example, performing speech recognition on the original audio data between 38 seconds and 1 minute 12 seconds yields the speech recognition result "So, why would you use Notepad when there is WordPad";
a first part and a second part of the first text data are extracted, the first part being located between the first starting position and the first ending position, and the second part being located outside the first starting position and the first ending position; for example, the first part is "So, why do you need Notepad when there is WordPad";
the first part is compared with the speech recognition result, and if the first part is inconsistent with the speech recognition result, the first part is replaced with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
When the speech recognition result obtained by passing the original audio data through the speech recognition engine differs significantly from the first part, the original audio data can be fed to the speech recognition engine for several passes of recognition to improve recognition accuracy. When the first part is inconsistent with the speech recognition result, the first text data can be corrected either by replacing the first part with the speech recognition result, or by modifying the first part so that it agrees with the speech recognition result, thereby obtaining the corrected second text data. For example, the first part "So, why do you need Notepad when there is WordPad" can be replaced with the speech recognition result "So, why would you use Notepad when there is WordPad".
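The splice-and-replace correction step can be sketched as follows. The recognition result is supplied as a plain string here; in the described system it would come from the speech recognition engine, and the example sentences are illustrative:

```python
def correct_segment(first_text, start, end, recognized):
    """Replace the first part of first_text (the span [start, end), i.e. the
    marked text between the first starting and ending positions) with the
    re-recognition result when the two disagree; the second part is kept."""
    first_part = first_text[start:end]
    if first_part == recognized:
        return first_text            # already consistent, nothing to correct
    return first_text[:start] + recognized + first_text[end:]

first_text = "So, why do you need Notepad when there is WordPad, said the speaker."
first_part = "So, why do you need Notepad when there is WordPad"
start = first_text.index(first_part)
end = start + len(first_part)
corrected = correct_segment(first_text, start, end,
                            "So, why would you use Notepad when there is WordPad")
print(corrected)
```

Only the marked span changes; the unmarked second part survives the correction untouched.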
In some embodiments, the storage module 24 is further configured to store the second text data, the first time progress, the second time progress, and the original audio data.
In this embodiment, the above scheme can be used to correct the first text data. Alternatively, the original audio data between the first time progress and the second time progress can be played to the user, and the user can manually correct the first text data after listening to it, thereby obtaining the second text data.
In some embodiments, as shown in Figure 4, the second processing module 28 includes:
an extraction sub-module 281, configured to extract a first part and a second part of the first text data, the first part being located between the first starting position and the first ending position, and the second part being located outside the first starting position and the first ending position;
a comparison sub-module 282, configured to compare the first part with the speech recognition result and, if the first part is inconsistent with the speech recognition result, to replace the first part with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
In some embodiments, as shown in Figure 5, the speech recognition module 25 includes:
a conversion sub-module 251, configured to convert the original audio format of the original audio data into a target audio format based on the FFMPEG tool;
a segmentation sub-module 252, configured to segment the original audio data in the target audio format to obtain target voice data;
a processing sub-module 253, configured to input the target voice data into a speech recognition engine.
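A minimal sketch of the conversion and segmentation steps using the FFMPEG command-line tool follows. The target format (16 kHz mono WAV) and the 30-second segment length are illustrative assumptions; the disclosure does not fix them:

```python
def convert_and_split(src_path, dst_pattern, segment_seconds=30):
    """Build an ffmpeg command that converts the original audio to 16 kHz mono
    WAV (a common input format for speech recognition engines) and splits the
    output into fixed-length segments."""
    cmd = [
        "ffmpeg", "-i", src_path,
        "-ac", "1",                             # downmix to mono
        "-ar", "16000",                         # resample to 16 kHz
        "-f", "segment",                        # split the output...
        "-segment_time", str(segment_seconds),  # ...into fixed-length pieces
        dst_pattern,                            # e.g. "out_%03d.wav"
    ]
    return cmd  # in the real system this would be run via subprocess.run(cmd)

cmd = convert_and_split("recording.m4a", "out_%03d.wav")
print(" ".join(cmd))
```

The resulting segment files would then be fed one by one to the speech recognition engine.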
In some embodiments, the apparatus further includes:
a voice search module, configured to receive a voice search instruction input by the user, perform speech recognition on the voice search instruction, convert the voice search instruction into a search keyword, find the search keyword in the first text data, and mark the position where the search keyword is located.
Specifically, a voice search button can be displayed on the operation interface of the electronic device. If the user inputs voice after clicking the voice search button, this is treated as receiving a voice search instruction input by the user: the voice recording system starts recording and passes the recorded voice data to the speech-to-text system for processing, which converts the voice search instruction into a search keyword. After the user clicks the voice search button, any voice input by the user is treated as a voice search instruction. After receiving the search keyword, the text style system finds the search keyword in the first text data and marks the positions where it occurs, for example by highlighting them. When the user clicks the first text data at a highlighted position, the corresponding original audio data can also be played, starting at the beginning of the highlighted span and stopping at its end. The technical scheme of this embodiment makes it convenient for the user to find the required content in both the text data and the audio data.
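The keyword lookup behind the highlighting can be sketched as follows; the highlight rendering itself and the sample sentence are assumptions:

```python
def find_keyword_spans(text, keyword):
    """Return (start, end) index pairs for every occurrence of keyword in text,
    so each span can be highlighted and mapped back to its audio segment."""
    spans, pos = [], 0
    while True:
        idx = text.find(keyword, pos)
        if idx == -1:
            return spans
        spans.append((idx, idx + len(keyword)))
        pos = idx + len(keyword)

text = "a very good plan is still a plan"
spans = find_keyword_spans(text, "plan")
print(spans)  # [(12, 16), (28, 32)]
```

Each span's indices also index into the stored per-character time positions, which is what lets a click on a highlight start playback at the right point in the audio.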
In some embodiments, the apparatus further includes:
a typo recognition module, configured to receive a typo recognition instruction input by the user, identify typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and mark the typos.
Specifically, a typo recognition button can be displayed on the operation interface of the electronic device. If the user clicks the typo recognition button, this is treated as receiving a typo recognition instruction input by the user; typos in the first text data can then be identified according to a lexicon and a contextual semantic recognition algorithm and marked, for example by highlighting, to remind the user. The user can then correct the typos to improve the accuracy of the text data. In some embodiments, the first receiving module 23 is specifically configured to mark the first text data with a first color, the first color being different from black, and/or to bold the text in the first text data.
Embodiments of the present disclosure further provide a voice data processing apparatus, as shown in Figure 6, including a processor 32 and a memory 31. The memory 31 stores a program or instructions executable on the processor 32, and when the program or instructions are executed by the processor 32, the steps of the voice data processing method described above are implemented.
In some embodiments, the processor 32 is configured to obtain original audio data; receive a first marking instruction for a first time progress of the original audio data; receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress; and store the first time progress, the second time progress, and the original audio data.
In some embodiments, the processor 32 is configured to use the original audio data between the first time progress and the second time progress as target audio data, and to mark the target audio data.
In some embodiments, the processor 32 is configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
In some embodiments, the processor 32 is configured to perform speech recognition on the original audio data to obtain first text data; start marking the first text data from a first starting position, the first starting position corresponding to the first time progress; stop marking the first text data at a first ending position, the first ending position corresponding to the second time progress; and store the marked first text data.
In some embodiments, the processor 32 is configured to receive a processing instruction for a second position of the first text data, the second position being located between the first starting position and the first ending position; and to perform speech recognition on the original audio data between the first time progress and the second time progress, and correct the first text data according to the speech recognition result to obtain second text data.
In some embodiments, the processor 32 is configured to play the original audio data between the first time progress and the second time progress in a loop, and to perform speech recognition on the original audio data between the first time progress and the second time progress using a speech recognition engine to obtain a speech recognition result.
In some embodiments, the processor 32 is configured to extract a first part and a second part of the first text data, the first part being located between the first starting position and the first ending position, and the second part being located outside the first starting position and the first ending position; and to compare the first part with the speech recognition result and, if the first part is inconsistent with the speech recognition result, replace the first part with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
In some embodiments, the processor 32 is configured to convert the original audio format of the original audio data into a target audio format based on the FFMPEG tool; segment the original audio data in the target audio format to obtain target voice data; and input the target voice data into a speech recognition engine.
In some embodiments, the processor 32 is configured to receive a voice search instruction input by the user; perform speech recognition on the voice search instruction and convert it into a search keyword; and find the search keyword in the first text data and mark the position where the search keyword is located.
In some embodiments, the processor 32 is configured to receive a typo recognition instruction input by the user, identify typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and mark the typos.
Embodiments of the present disclosure further provide a readable storage medium storing a program or instructions, and when the program or instructions are executed by a processor, the steps of the voice data processing method described above are implemented.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should be noted that, in this document, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element. In addition, it should be noted that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing functions in the order shown or discussed; depending on the functions involved, functions may also be performed in a substantially simultaneous manner or in the reverse order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a computer software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described above. The specific implementations described above are merely illustrative rather than restrictive. Inspired by the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims (23)

  1. A voice data processing method, characterized in that it comprises:
    obtaining original audio data;
    receiving a first marking instruction for a first time progress of the original audio data;
    receiving a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress;
    storing the first time progress, the second time progress, and the original audio data.
  2. The voice data processing method according to claim 1, characterized in that, after storing the first time progress, the second time progress, and the original audio data, the method further comprises:
    using the original audio data between the first time progress and the second time progress as target audio data, and marking the target audio data.
  3. The voice data processing method according to claim 1, characterized in that, after storing the first time progress, the second time progress, and the original audio data, the method further comprises:
    performing speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
  4. The voice data processing method according to claim 1, characterized in that, after obtaining the original audio data, the method further comprises:
    performing speech recognition on the original audio data to obtain first text data;
    after receiving the first marking instruction for the first time progress of the original audio data, the method further comprises:
    starting to mark the first text data from a first starting position, the first starting position corresponding to the first time progress;
    after receiving the second marking instruction for the second time progress of the original audio data, the method further comprises:
    stopping marking the first text data at a first ending position, the first ending position corresponding to the second time progress;
    storing the marked first text data.
  5. The voice data processing method according to claim 4, characterized in that, after storing the marked first text data, the method further comprises:
    receiving a processing instruction for a second position of the first text data, the second position being located between the first starting position and the first ending position;
    performing speech recognition on the original audio data between the first time progress and the second time progress, and correcting the first text data according to the speech recognition result to obtain second text data.
  6. The voice data processing method according to claim 3 or 5, characterized in that performing speech recognition on the original audio data between the first time progress and the second time progress comprises:
    playing the original audio data between the first time progress and the second time progress in a loop;
    performing speech recognition on the original audio data between the first time progress and the second time progress using a speech recognition engine to obtain a speech recognition result.
  7. The voice data processing method according to claim 5, characterized in that correcting the first text data according to the speech recognition result comprises:
    extracting a first part and a second part of the first text data, the first part being located between the first starting position and the first ending position, and the second part being located outside the first starting position and the first ending position;
    comparing the first part with the speech recognition result, and if the first part is inconsistent with the speech recognition result, replacing the first part with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
  8. The voice data processing method according to claim 3 or 4, characterized in that performing speech recognition on the original audio data comprises:
    converting an original audio format of the original audio data into a target audio format based on the FFMPEG tool;
    segmenting the original audio data in the target audio format to obtain target voice data;
    inputting the target voice data into a speech recognition engine.
  9. The voice data processing method according to claim 4, characterized in that the method further comprises:
    receiving a voice search instruction input by a user;
    performing speech recognition on the voice search instruction, and converting the voice search instruction into a search keyword;
    finding the search keyword in the first text data, and marking the position where the search keyword is located.
  10. The voice data processing method according to claim 4, characterized in that the method further comprises:
    receiving a typo recognition instruction input by a user;
    identifying typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and marking the typos.
  11. The voice data processing method according to claim 4, 9 or 10, wherein marking the first text data comprises:
    marking the first text data with a first color, the first color being different from black; and/or
    bolding characters in the first text data.
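If the transcript is rendered as HTML, the color-and/or-bold marking could be expressed as inline styles (an assumed rendering target; the claims do not prescribe HTML):

```python
def mark_html(text: str, color: str = "red", bold: bool = True) -> str:
    """Render marked text in a first color different from black,
    optionally bolded."""
    styles = [f"color:{color}"]
    if bold:
        styles.append("font-weight:bold")
    return f'<span style="{";".join(styles)}">{text}</span>'
```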
  12. A voice data processing apparatus, comprising:
    an acquisition module, configured to acquire original audio data;
    a first receiving module, configured to receive a first marking instruction for a first time progress of the original audio data;
    a second receiving module, configured to receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress;
    a storage module, configured to store the first time progress, the second time progress and the original audio data.
  13. The voice data processing apparatus according to claim 12, wherein the apparatus further comprises:
    a marking processing module, configured to take the original audio data between the first time progress and the second time progress as target audio data, and to mark the target audio data.
  14. The voice data processing apparatus according to claim 12, wherein the apparatus further comprises:
    a speech recognition module, configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
  15. The voice data processing apparatus according to claim 12, wherein the apparatus further comprises:
    a speech recognition module, configured to perform speech recognition on the original audio data to obtain first text data;
    a marking processing module, configured to start marking the first text data from a first start position corresponding to the first time progress, and to stop marking the first text data at a first end position corresponding to the second time progress;
    wherein the storage module is configured to store the marked first text data.
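Mapping the two time progresses to the start and end positions in the transcript presupposes per-word timestamps from the recognition engine. A sketch under that assumption (the `(word, start_s, end_s)` tuple format is illustrative, not mandated by the claims; words are assumed joined by single spaces):

```python
def span_for_times(words, t1: float, t2: float) -> tuple[int, int]:
    """Map a [t1, t2] time window to (start, end) character offsets in the
    transcript built by joining the words with single spaces."""
    start = end = pos = 0
    found_start = False
    for w, ws, we in words:
        if not found_start and we > t1:   # first word still sounding at t1
            start = pos
            found_start = True
        if ws < t2:                       # word begins before the window closes
            end = pos + len(w)
        pos += len(w) + 1                 # +1 for the joining space
    return start, end
```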
  16. The voice data processing apparatus according to claim 15, wherein the apparatus further comprises:
    a third receiving module, configured to receive a processing instruction for a second position of the first text data, the second position being located between the first start position and the first end position;
    a second processing module, configured to perform speech recognition on the original audio data between the first time progress and the second time progress, and to correct the first text data according to a speech recognition result to obtain second text data.
  17. The voice data processing apparatus according to claim 14 or 16, wherein the speech recognition module is specifically configured to loop-play the original audio data between the first time progress and the second time progress, and to perform speech recognition on the original audio data between the first time progress and the second time progress by using a speech recognition engine to obtain a speech recognition result.
  18. The voice data processing apparatus according to claim 16, wherein the second processing module comprises:
    an interception sub-module, configured to extract a first part and a second part of the first text data, the first part being located between the first start position and the first end position, and the second part being located outside the first start position and the first end position;
    a comparison sub-module, configured to compare the first part with the speech recognition result, and if the first part is inconsistent with the speech recognition result, to replace the first part with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
  19. The voice data processing apparatus according to claim 14 or 15, wherein the speech recognition module comprises:
    a conversion sub-module, configured to convert an original audio format of the original audio data into a target audio format based on the FFMPEG tool;
    a segmentation sub-module, configured to segment the original audio data in the target audio format to obtain target voice data;
    a processing sub-module, configured to input the target voice data into a speech recognition engine.
  20. The voice data processing apparatus according to claim 15, wherein the apparatus further comprises:
    a voice search module, configured to receive a voice search instruction input by a user, perform speech recognition on the voice search instruction to convert it into a search keyword, find the search keyword in the first text data, and mark the position of the search keyword.
  21. The voice data processing apparatus according to claim 15, wherein the apparatus further comprises:
    a typo recognition module, configured to receive a typo recognition instruction input by a user, identify typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and mark the typos.
  22. A voice data processing apparatus, comprising a processor and a memory, wherein the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the voice data processing method according to any one of claims 1 to 11.
  23. A readable storage medium, wherein the readable storage medium stores a program or instructions, and the program or instructions, when executed by a processor, implement the steps of the voice data processing method according to any one of claims 1 to 11.
PCT/CN2023/092438 2022-05-25 2023-05-06 Voice data processing method and apparatus WO2023226726A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210578264.X 2022-05-25
CN202210578264.XA CN114999464A (en) 2022-05-25 2022-05-25 Voice data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023226726A1 true WO2023226726A1 (en) 2023-11-30

Family

ID=83030036

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092438 WO2023226726A1 (en) 2022-05-25 2023-05-06 Voice data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN114999464A (en)
WO (1) WO2023226726A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999464A (en) * 2022-05-25 2022-09-02 高创(苏州)电子有限公司 Voice data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986657A (en) * 2020-08-21 2020-11-24 上海明略人工智能(集团)有限公司 Audio recognition method and device, recording terminal, server and storage medium
CN112887480A (en) * 2021-01-22 2021-06-01 维沃移动通信有限公司 Audio signal processing method and device, electronic equipment and readable storage medium
CN113539313A (en) * 2021-07-22 2021-10-22 统信软件技术有限公司 Audio marking method, audio data playing method and computing equipment
WO2022022395A1 (en) * 2020-07-30 2022-02-03 华为技术有限公司 Time marking method and apparatus for text, and electronic device and readable storage medium
CN114079695A (en) * 2020-08-18 2022-02-22 北京有限元科技有限公司 Method, device and storage medium for recording voice call content
CN114999464A (en) * 2022-05-25 2022-09-02 高创(苏州)电子有限公司 Voice data processing method and device


Also Published As

Publication number Publication date
CN114999464A (en) 2022-09-02


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810799

Country of ref document: EP

Kind code of ref document: A1