WO2023226726A1 - Voice data processing method and apparatus - Google Patents

Voice data processing method and apparatus

Info

Publication number
WO2023226726A1
Authority
WO
WIPO (PCT)
Prior art keywords
original audio
audio data
data
voice
speech recognition
Prior art date
Application number
PCT/CN2023/092438
Other languages
French (fr)
Chinese (zh)
Inventor
刘佛圣
吴广杰
Original Assignee
京东方科技集团股份有限公司
高创(苏州)电子有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司 and 高创(苏州)电子有限公司
Publication of WO2023226726A1 publication Critical patent/WO2023226726A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems

Definitions

  • the present disclosure relates to the field of voice processing technology, and in particular, to a voice data processing method and device.
  • voice-to-text can be used to quickly and easily record the entire meeting and generate text records in real time.
  • speech-to-text is affected by multiple factors such as the environment and the speaker's pronunciation, and cannot achieve 100% accuracy.
  • the generated text needs to be manually corrected after the meeting.
  • it is necessary to listen to the recording from beginning to end for correction, which takes a long time, is inefficient, and results in a poor user experience.
  • the technical problem to be solved by this disclosure is to provide a voice data processing method and device that can realize voice segmentation.
  • a voice data processing method including:
  • the first time schedule, the second time schedule and the original audio data are stored.
  • the method further includes:
  • the original audio data between the first time progress and the second time progress is taken as target audio data, and the target audio data is marked.
  • the method further includes:
  • the method further includes:
  • the method further includes:
  • the method further includes:
  • the first text data that has been marked is stored.
  • the method further includes:
  • speech recognition is performed on the original audio data between the first time progress and the second time progress, and the first text data is corrected according to the speech recognition result to obtain second text data.
  • performing speech recognition on the original audio data between the first time progress and the second time progress includes: using a speech recognition engine to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain the speech recognition result.
  • correcting the first text data according to the speech recognition result includes:
  • a first part and a second part of the first text data are intercepted, where the first part is located between the first starting position and the first ending position, and the second part is located outside the first starting position and the first ending position;
  • performing speech recognition on the original audio data includes:
  • the target speech data is input to a speech recognition engine.
  • the method further includes:
  • the search keyword is found in the first text data, and the location of the search keyword is marked.
  • the method further includes:
  • labeling the first text data includes:
  • An embodiment of the present disclosure also provides a voice data processing device, including:
  • a first receiving module configured to receive a first marking instruction for the first time progress of the original audio data
  • a second receiving module configured to receive a second marking instruction for a second time progress of the original audio data, where the second time progress is later than the first time progress;
  • a storage module configured to store the first time progress, the second time progress and the original audio data.
  • the device further includes:
  • a marking processing module configured to use the original audio data between the first time progress and the second time progress as target audio data, and perform marking processing on the target audio data.
  • the device further includes:
  • a speech recognition module configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
  • the device further includes:
  • a speech recognition module used to perform speech recognition on the original audio data to obtain the first text data
  • a marking processing module configured to start marking the first text data from a first starting position, the first starting position corresponding to the first time progress, and to stop marking the first text data at a first ending position, the first ending position corresponding to the second time progress;
  • the storage module is used to store the first text data after mark processing.
  • the device further includes:
  • a third receiving module configured to receive processing instructions for a second position of the first text data, the second position being located between the first starting position and the first ending position;
  • a second processing module used to perform speech recognition on the original audio data between the first time progress and the second time progress, correct the first text data according to the speech recognition result, and obtain the second text data.
  • the speech recognition module is specifically configured to loop-play the original audio data between the first time progress and the second time progress, and to use a speech recognition engine to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result.
  • the second processing module includes:
  • an interception submodule used to intercept the first part and the second part of the first text data, where the first part is located between the first starting position and the first ending position, and the second part is located outside the first starting position and the first ending position;
  • a comparison submodule used to compare the first part with the speech recognition result; if the first part is inconsistent with the speech recognition result, the speech recognition result replaces the first part to obtain second text data comprising the speech recognition result and the second part.
  • the speech recognition module includes:
  • a segmentation sub-module used to segment the original audio data in the target audio format to obtain target voice data
  • a processing submodule is used to input the target speech data to a speech recognition engine.
  • the device further includes:
  • a voice search module configured to receive voice search instructions input by the user; perform voice recognition on the voice search instructions to convert them into search keywords; and search for the search keywords in the first text data and mark their locations.
  • the device further includes:
  • a typo recognition module is configured to receive typo recognition instructions input by the user; identify typos in the first text data according to the vocabulary and contextual semantic recognition algorithms, and mark the typos.
  • An embodiment of the present disclosure also provides a voice data processing device, including a processor and a memory.
  • the memory stores programs or instructions that can be run on the processor; when the programs or instructions are executed by the processor, the steps of the voice data processing method described above are implemented.
  • Embodiments of the present disclosure also provide a readable storage medium on which programs or instructions are stored. When the programs or instructions are executed by a processor, the steps of the voice data processing method as described above are implemented.
  • marking instructions for different time progresses of the original audio data are received, and the time progresses are stored together with the original audio data according to the marking instructions, so that the speech can be segmented according to the recorded time progresses.
  • Figure 1 is a schematic flowchart of a voice data processing method according to an embodiment of the present disclosure
  • Figure 2 is a schematic diagram of the composition of an electronic device according to an embodiment of the present disclosure
  • Figure 3 is a structural block diagram of a voice data processing device according to an embodiment of the present disclosure.
  • FIG. 4 is a structural block diagram of the second processing module according to the embodiment of the present disclosure.
  • Figure 5 is a structural block diagram of a speech recognition module according to an embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of the composition of a voice data processing device according to an embodiment of the present disclosure.
  • Embodiments of the present disclosure provide a voice data processing method and device, which can realize voice segmentation.
  • Embodiments of the present disclosure provide a voice data processing method, as shown in Figure 1, including:
  • Step 101: Obtain original audio data.
  • the technical solution of this embodiment is applied to electronic equipment.
  • the electronic equipment can perform human-computer interaction with the user.
  • the electronic equipment includes a voice recording system, a voice-to-text system, a timer system, a text style system, a voice data operating system, a voice playback system, etc.
  • Electronic devices can interact with backend servers through the network.
  • the above-mentioned electronic device may be a terminal device configured with a target client and/or a target server.
  • the above-mentioned terminal equipment can be a microphone or a microphone array, or a terminal equipment equipped with a microphone.
  • the above-mentioned electronic equipment can include but is not limited to at least one of the following: mobile phones (such as Android phones, iOS phones, etc.), laptops, tablets, handheld computers, MID (Mobile Internet Devices), PAD, desktop computers, smart TVs, etc.
  • the target client can be a video client, an instant messaging client, a browser client, an education client, etc.
  • the target server can be a video server, an instant messaging server, a browser server, an education server, etc.
  • the above-mentioned networks may include but are not limited to: wired networks and wireless networks.
  • the wired networks include local area networks, metropolitan area networks and wide area networks.
  • the wireless networks include Bluetooth, WIFI and other networks that implement wireless communication.
  • the above-mentioned server can be a single server, a server cluster composed of multiple servers, or a cloud server. The above is only an example, and there is no limitation on this in this embodiment.
  • original audio data can be obtained through microphone or microphone array recording.
  • the original audio data can be data files in various audio formats obtained by the recording terminal, including but not limited to ACT, REC, MP3, WAV, WMA, VY1, VY2, DVF, MSC, AIFF and other formats; the original audio data can also be Pulse Code Modulation (PCM) audio stream data.
  • the electronic device can display a recording button on the operation interface.
  • the user clicks the recording button to start recording.
  • the voice recording system starts to work and continuously collects audio data in a sub-thread through AudioRecord and AudioChunk, passing the collected audio data to the speech-to-text system so that it can convert the audio data into text.
  • AudioRecord is an Android media recording tool.
  • AudioChunk is a custom data container that holds a byte array and provides a function for converting the byte array into a short array.
  • the byte array is used to receive the audio data returned by AudioRecord.
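The byte-to-short conversion that AudioChunk provides can be sketched as follows. This is a minimal Python illustration assuming 16-bit little-endian PCM (the format AudioRecord commonly returns); the class and method names are illustrative, not the patent's actual implementation:

```python
import struct

class AudioChunk:
    """Minimal stand-in for the custom data container described above:
    holds a byte array of raw PCM audio and converts it to a short array."""

    def __init__(self, data: bytes):
        self.data = data  # raw bytes as returned by the recorder

    def to_shorts(self) -> list:
        # 16-bit little-endian PCM: every 2 bytes form one signed short sample
        count = len(self.data) // 2
        return list(struct.unpack("<%dh" % count, self.data[: count * 2]))
```

Each completed chunk could then be handed to the speech-to-text system as an array of samples.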
  • Step 102: Receive a first marking instruction for the first time progress of the original audio data;
  • Step 103: Receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress;
  • the timer system is used to record the time progress during audio data recording and playback.
  • the recording time progress of the original audio data corresponds to the playback time progress.
  • the timer system can record various time points, including the total duration of the recording, the starting time point for marking processing (i.e., the first time progress) and the end time point (i.e., the second time progress), where the first time progress and the second time progress appear in pairs.
  • the number of first time progresses can be one or more; the original audio data between each pair of first and second time progresses is the original audio data that needs to be focused on.
  • the first time progress can be the starting time point of the entire original audio data, or a certain time point in the middle of the original audio data; the second time progress can be the end time point of the entire original audio data, or a certain time point in the middle of the original audio data.
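The pairing of first and second time progresses described above can be sketched as a small bookkeeping structure. A hedged Python illustration (the class and field names are assumptions):

```python
class MarkTimeline:
    """Records paired first/second time progresses (in milliseconds)
    of one original audio recording; marks always appear in pairs."""

    def __init__(self):
        self.pairs = []        # completed (first, second) time-progress pairs
        self._pending = None   # a first time progress awaiting its second

    def mark(self, time_ms: int):
        if self._pending is None:
            self._pending = time_ms            # records a first time progress
        else:
            assert time_ms > self._pending     # second must be later
            self.pairs.append((self._pending, time_ms))
            self._pending = None               # pair completed
```

Each completed pair delimits one segment of original audio data that needs attention.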
  • Step 104: Store the first time progress, the second time progress and the original audio data.
  • a segment of speech can be determined through the first time progress and the second time progress, and the speech can be segmented according to the first time progress and the second time progress, where the first time progress is the starting time point of the segmented speech, The second time progress is the end time point of the segmented speech.
  • the original audio data may be divided into multiple speech segments.
  • the timer system obtains the current time of the electronic device in milliseconds as the starting time point of the voice, and then updates the end time point and duration of the voice every millisecond through the Android scheduled task tool Timer.
  • the method further includes:
  • the original audio data between the first time progress and the second time progress is used as target audio data, and the target audio data is marked.
  • the original audio data between the first time progress and the second time progress is the original audio data that needs to be focused on.
  • the paired first time progress and second time progress can be identified in the playback progress bar corresponding to the original audio data, or a dedicated display interface can be used to display the paired first and second time progress information.
  • for example, the two time points of 38 seconds and 58 seconds can be marked in the playback progress bar corresponding to the original audio data, and the user can determine from these two marked time points the original audio data that needs to be focused on; alternatively, the two time points of 38 seconds and 58 seconds are displayed on the display interface corresponding to the original audio data, and the user can determine from the recorded time points the original audio data that needs to be focused on.
  • or, the target audio data between the first time progress and the second time progress can be directly intercepted and stored in a specific area.
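Intercepting the target audio data between the two time progresses amounts to slicing the sample buffer. A minimal sketch, assuming uncompressed samples at a known sample rate (function name and parameters are illustrative):

```python
def intercept_target_audio(samples, sample_rate, first_ms, second_ms):
    """Return the target audio data: the slice of the original samples
    lying between the first and second time progresses (milliseconds)."""
    start = first_ms * sample_rate // 1000   # time progress -> sample index
    end = second_ms * sample_rate // 1000
    return samples[start:end]
```

The returned slice could then be stored in a specific area or passed on for speech recognition.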
  • the original audio data between the first time progress and the second time progress is the original audio data that needs to be focused on
  • speech recognition can be performed only on the original audio data between the first time progress and the second time progress to obtain segmented text data, which reduces the workload of speech recognition and ensures that users get the key content that needs attention.
  • speech recognition can also be performed on all original audio data.
  • the method further includes:
  • the speech-to-text system is used to convert original audio data into text.
  • the original audio data can be converted into the first text data through a speech recognition engine using Automatic Speech Recognition (ASR) technology.
  • ASR is a technology that converts human speech into text; its goal is to enable a computer to take "dictation" of the continuous speech spoken by different people, so it is also called a "voice dictation machine", i.e., a technology for converting "sound" into "text".
  • the speech recognition engine can be Google speech recognition engine, Microsoft speech recognition engine or iFlytek's speech recognition engine, which is not limited here.
  • the speech fragments in the original audio data can be converted into text information through the speech recognition engine.
  • the original audio format of the original audio data can be converted into a target audio format using the FFMPEG tool; the original audio data in the target audio format is segmented to obtain target voice data; and the target voice data is input to the speech recognition engine to obtain the first text data.
  • for example, the original audio data is converted from PCM format to MP3 format using the FFMPEG tool, and the original audio data in MP3 format is segmented to obtain target voice data containing voice segments; that is, only the audio clips of human voice are kept. Converting the original audio data to MP3 format makes it easier for users to segment and save it.
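The PCM-to-MP3 conversion step could be driven from the FFMPEG command line. The sketch below only constructs the argument list (raw PCM input requires the sample format, rate and channel count to be stated explicitly); the parameter values and function name are illustrative assumptions, not the patent's implementation:

```python
def build_ffmpeg_command(pcm_path, mp3_path, sample_rate=16000, channels=1):
    """Build an ffmpeg command line that converts raw 16-bit PCM to MP3."""
    return [
        "ffmpeg",
        "-f", "s16le",             # raw 16-bit little-endian PCM input
        "-ar", str(sample_rate),   # input sample rate
        "-ac", str(channels),      # input channel count
        "-i", pcm_path,            # input file
        mp3_path,                  # output file; format inferred from .mp3
    ]
```

`subprocess.run(build_ffmpeg_command(...))` would then perform the conversion, assuming ffmpeg is installed.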
  • the speech-to-text system can also be a streaming speech recognition system based on the deep learning Transformer model.
  • the streaming speech recognition system supports recording and converting at the same time, that is, converting audio data into text data while recording; it also supports direct recognition of existing audio data.
  • the timer system is used to record the time progress during audio data recording and playback.
  • the recording time progress of the original audio data corresponds to the playback time progress.
  • the text style system is used to mark the content of the first text data during recording and playback, including: marking the first text data with a first color, the first color being different from black; and/or bolding the text in the first text data, so that the user can easily identify the content that needs attention from the first text data.
  • the timer system records various time nodes during the entire process of converting the original audio data into the first text data, including the total duration of the recording, the starting time point for marking processing (i.e., the first time progress) and the end time point (i.e., the second time progress). The first time progress and the second time progress appear in pairs, and the number of first time progresses can be one or more; the original audio data between each pair of first and second time progresses is the original audio data that needs to be focused on, that is, the original audio data corresponding to the text that needs to be corrected.
  • the first time progress corresponds one-to-one with the first starting position: the position in the first text data corresponding to the first time progress is the first starting position. The second time progress corresponds one-to-one with the first ending position: the position in the first text data corresponding to the second time progress is the first ending position.
  • the timer system obtains the current time of the electronic device in milliseconds as the starting time point of the voice, and then updates the end time point and duration of the voice every millisecond through the Android scheduled task tool Timer.
  • the first text data corresponding to the original audio data can be displayed to the user in real time through the operation interface.
  • when the user clicks or selects content of the first text data for the (2k-1)-th time, the position of that content in the first text data is recorded as the first starting position; when the user clicks or selects content for the 2k-th time, the position of that content in the first text data is recorded as the first ending position, and the first text data between the first starting position and the first ending position is marked, where k is a positive integer.
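The odd/even click behaviour, with character positions and time points kept in sync, can be sketched as follows (an illustrative Python sketch; the class name and fields are assumptions):

```python
class TextStyleSystem:
    """Sketch of the text style system: marks spans of the first text data
    and keeps each span's character positions and synchronized time points."""

    def __init__(self, text):
        self.text = text
        self.marks = []     # (start_pos, end_pos, first_ms, second_ms)
        self._open = None   # pending (start_pos, first_ms)

    def click(self, position, time_ms):
        if self._open is None:                 # (2k-1)-th click: start mark
            self._open = (position, time_ms)
        else:                                  # 2k-th click: end mark
            start_pos, first_ms = self._open
            self.marks.append((start_pos, position, first_ms, time_ms))
            self._open = None

    def marked_content(self, k):
        start, end, _, _ = self.marks[k]
        return self.text[start:end]            # string between the positions
```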
  • the text style system is used to mark the first text data, such as coloring and bolding, recording the position of the marked content in the voice content, synchronizing the time points of the marked content, etc.
  • the voice content is the entire voice string text returned by the speech-to-text system;
  • the marked content is the marked string text returned by the speech-to-text system during the time period from the start mark to the end mark.
  • the start position and end position of the marked content are the positions of the marked content in the voice content, usually determined by the string index.
  • the marked-processed first text data, the first starting position, the first ending position, the first time progress, the second time progress and the original audio data are stored.
  • the voice data operating system is used to store the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress and the original audio data.
  • the voice data operating system contains a database.
  • the voice data operating system saves the audio file content of each original audio data, voice content, voice duration, all mark data, the position of each text in the voice content, etc.
  • the audio file index is the storage path of the audio file;
  • the mark data is the mark content, mark position and mark time point of each mark, and the position of the text in the speech is the time progress in the speech corresponding to the text.
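One possible shape for such a stored record is sketched below; the field names are assumptions for illustration, not the patent's actual schema:

```python
def make_voice_record(audio_index, voice_content, duration_ms, marks):
    """Assemble one database record for the voice data operating system."""
    return {
        "audio_file_index": audio_index,   # storage path of the audio file
        "voice_content": voice_content,    # full recognized text
        "voice_duration_ms": duration_ms,  # total recording duration
        "marks": [                         # one entry per marked span
            {"content": c, "start": s, "end": e,
             "first_ms": t1, "second_ms": t2}
            for (c, s, e, t1, t2) in marks
        ],
    }
```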
  • when converting the original audio data into the first text data, erroneous text or key content can be marked, and the marked first text data can be stored together with the corresponding time progresses of the original audio data. When the first text data is proofread later, selecting the marked text quickly synchronizes to the corresponding original audio data according to the stored time progress, which makes it convenient for the user to correct or otherwise process the marked text. This saves users from listening to the original audio data from beginning to end, improves the efficiency of correcting the recorded text, and improves the user experience.
  • after the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress and the original audio data are stored, when the first text data needs to be corrected, the method further includes:
  • the voice playback system when the first text data needs to be corrected, the voice playback system is used to play the recorded original audio data, and at the same time, the first text data corresponding to the original audio data is displayed to the user in real time on the operation interface.
  • the first text data includes a first part and a second part, where the first part is located between each pair of first starting position and first ending position and has been marked as content that needs to be focused on; the second part is located outside the first starting position and the first ending position and is content that has not been marked.
  • the first part is where speech-to-text errors may occur, while the second part is where errors are unlikely; therefore, to improve efficiency when correcting the first text data, only the first part needs to be corrected.
  • the user can click or select any second position between the first starting position and the first ending position; this is deemed receipt of the processing instruction for the second position, indicating that the first text data between the first starting position and the first ending position needs to be corrected.
  • the corresponding position of the original audio data can be quickly located according to the pre-stored first time progress corresponding to the first starting position and the second time progress corresponding to the first ending position, and the original audio data between the first time progress and the second time progress can be replayed.
  • specifically, before the processing instruction for the next second position is received, the original audio data between the first time progress and the second time progress can be played in a loop: playback starts from the first time progress, stops at the second time progress, and then returns to the first time progress to restart.
  • alternatively, playback can stop after a preset number of times, for example after the original audio data has been played once or twice.
  • a speech recognition engine is used to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result; for example, speech recognition is performed on the original audio data between 38 seconds and 1 minute 12 seconds, yielding the speech recognition result "So, why use notepad when you have a writing pad?";
  • a first part and a second part of the first text data are intercepted, where the first part is located between the first starting position and the first ending position, and the second part is located outside the first starting position and the first ending position.
  • the first part is "So, why do you need a notepad when you have a writing pad?";
  • the original audio data can be input to the speech recognition engine for multiple speech recognitions to obtain the speech recognition results to improve the accuracy of speech recognition.
  • the first part can be replaced with the speech recognition result, or the first part can be modified so that it is consistent with the speech recognition result, thereby correcting the first text data and obtaining the corrected second text data. For example, the speech recognition result "So, why use notepad when you have a writing pad?" can replace the first part "So, why do you need a notepad when you have a writing pad?".
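The replacement step can be sketched as a simple string splice, assuming the first starting and ending positions are character indices into the first text data (function name is illustrative):

```python
def correct_first_text(first_text, start, end, recognition_result):
    """Replace the first part (between the first starting and ending
    positions) with the speech recognition result if they differ;
    the second part, outside those positions, is kept unchanged."""
    first_part = first_text[start:end]
    if first_part == recognition_result:
        return first_text                      # already consistent
    return first_text[:start] + recognition_result + first_text[end:]
```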
  • the method further includes:
  • the second text data, the first time progress, the second time progress and the original audio data are stored.
  • the voice data operating system saves the corrected second text data.
  • the efficiency of correcting the voice file can be greatly improved.
  • the above solution can be used to correct the first text data.
  • the original audio data between the first time progress and the second time progress can also be played to the user, and after listening to it the user manually corrects the first text data to obtain the second text data.
  • voice search can also be performed.
  • the method further includes:
  • the search keyword is found in the first text data, and the location of the search keyword is marked.
  • a voice search button can be displayed on the operation interface of the electronic device. If the user clicks the voice search button and then inputs voice, the voice search instruction input by the user is deemed received; the voice recording system starts recording and passes the recorded voice data to the voice-to-text system for processing, which converts the voice search instruction into a search keyword.
  • when the user clicks the voice search button, the voice input by the user is regarded as a voice search instruction.
  • after receiving the search keyword, the text style system searches for the search keyword in the first text data and marks the location of the search keyword, for example by highlighting it.
  • the corresponding original audio data can also be played, starting from the starting position of the highlighted position and stopping playing at the end position of the highlighted position.
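Finding every occurrence of the search keyword can be sketched as follows, a minimal Python sketch returning the character spans that a UI could then highlight:

```python
def find_keyword_positions(text, keyword):
    """Locate every occurrence of the search keyword in the first text
    data, returning (start, end) spans for highlighting."""
    spans, i = [], text.find(keyword)
    while i != -1:
        spans.append((i, i + len(keyword)))
        i = text.find(keyword, i + 1)   # continue past this occurrence
    return spans
```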
  • the method further includes:
  • a typo recognition button can be displayed on the operation interface of the electronic device. If the user clicks the typo recognition button, the typo recognition instruction input by the user is deemed received; typos in the first text data are then identified according to the vocabulary and contextual semantic recognition algorithms and marked, for example highlighted, to remind the user, who can modify the typos to improve the accuracy of the text data.
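A greatly simplified stand-in for the vocabulary-based part of the typo recognition (ignoring the contextual semantic algorithm) might look like this; it is only an illustration of the idea, not the patent's algorithm:

```python
def flag_typos(words, vocabulary):
    """Flag candidate typos: any word not found in the vocabulary,
    returned as (index, word) pairs for marking in the UI."""
    return [(i, w) for i, w in enumerate(words) if w not in vocabulary]
```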
  • An embodiment of the present disclosure also provides a voice data processing device, as shown in Figure 3, including:
  • Acquisition module 21: used to obtain original audio data;
  • the technical solution of this embodiment is applied to electronic equipment.
  • the electronic equipment can perform human-computer interaction with the user.
  • the electronic equipment includes a voice recording system, a voice-to-text system, a timer system, a text style system, a voice data operating system, a voice playback system, etc.
  • Electronic devices can interact with backend servers through the network.
  • the above-mentioned electronic device may be a terminal device configured with a target client and/or a target server.
  • the above-mentioned terminal device may be a microphone or a microphone array, or may be a terminal device configured with a microphone.
  • the above-mentioned electronic devices may include but are not limited to at least one of the following: mobile phones (such as Android phones, iOS phones, etc.), laptops, tablets, handheld computers, MIDs (Mobile Internet Devices), PADs, desktop computers, smart TVs, etc.
  • the target client can be a video client, an instant messaging client, a browser client, an education client, etc.
  • the target server can be a video server, an instant messaging server, a browser server, an education server, etc.
  • the above-mentioned networks may include but are not limited to: wired networks and wireless networks.
  • the wired networks include local area networks, metropolitan area networks and wide area networks.
  • the wireless networks include Bluetooth, WIFI and other networks that implement wireless communication.
  • the above-mentioned server can be a single server, a server cluster composed of multiple servers, or a cloud server. The above is just an example; this embodiment imposes no restriction on it.
  • original audio data can be obtained through microphone or microphone array recording.
  • the original audio data can be data files in various audio formats obtained by the recording terminal, including but not limited to ACT, REC, MP3, WAV, WMA, VY1, VY2, DVF, MSC, AIFF and other formats; the original audio data can also be Pulse Code Modulation (PCM) audio stream data.
  • the electronic device can display a recording button on the operation interface.
  • the user clicks the recording button to start recording.
  • the voice recording system starts to work, continuously collects audio data in a sub-thread through AudioRecord and AudioChunk, and passes the collected audio data to the speech-to-text system so that the speech-to-text system can convert the audio data into text.
  • AudioRecord is an Android media recording tool.
  • AudioChunk is a custom data container, which holds a byte array and provides a function for converting the byte array into a short array.
  • the byte array is used to receive the audio data returned by AudioRecord.
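The byte-to-short conversion described for AudioChunk can be illustrated with a minimal sketch. This is not the document's actual Android class; it is a hypothetical Python stand-in assuming 16-bit little-endian PCM samples, which is what Android's AudioRecord produces in its default 16-bit encoding.

```python
import struct

class AudioChunk:
    """Illustrative stand-in for the custom AudioChunk container described
    above: it holds a byte array of raw 16-bit PCM audio and converts it
    into a list of shorts (one short per sample)."""

    def __init__(self, byte_data: bytes):
        self.byte_data = byte_data

    def to_shorts(self) -> list:
        # 16-bit PCM stores each sample as 2 little-endian bytes.
        count = len(self.byte_data) // 2
        return list(struct.unpack("<%dh" % count, self.byte_data[: count * 2]))

chunk = AudioChunk(b"\x01\x00\xff\xff\x00\x80")
print(chunk.to_shorts())  # [1, -1, -32768]
```

In the Android implementation this conversion would typically be done with `ByteBuffer.asShortBuffer()` or bit arithmetic; the sketch only shows the sample layout.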
  • the first receiving module 22 is configured to receive a first marking instruction for the first time progress of the original audio data
  • the second receiving module 23 is configured to receive a second marking instruction for a second time schedule of the original audio data, where the second time schedule is later than the first time schedule;
  • the timer system is used to record the time progress during audio data recording and playback.
  • the recording time progress of the original audio data corresponds to the playback time progress.
  • the timer system can record various time points, including the total duration of the recording, the starting time point for marking processing (i.e., the first time progress) and the end time point (i.e., the second time progress), where the first time progress and the second time progress appear in pairs.
  • the number of first time progresses can be one or more.
  • the original audio data between each pair of the first time progress and the second time progress is the original audio data that needs to be focused on.
  • the first time progress can be the starting time point of the entire original audio data, or a certain time point in the middle of the original audio data;
  • the second time progress can be the end time point of the entire original audio data, or a certain time point in the middle of the original audio data.
  • Storage module 24, used to store the first time progress, the second time progress and the original audio data.
  • a segment of speech can be determined through the first time progress and the second time progress, and the speech can be segmented according to the first time progress and the second time progress, where the first time progress is the starting time point of the segmented speech, The second time progress is the end time point of the segmented speech.
  • the original audio data may be divided into multiple speech segments.
  • the timer system obtains the current time millisecond value of the electronic device as the starting time point of the voice, and then updates the end time point and voice duration of the voice every one millisecond through the Android scheduled task tool Timer.
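The pairing of first and second time progresses described above can be sketched as simple bookkeeping. The class and method names below are hypothetical (the document only specifies that marks come in pairs and that the second time progress is later than the first); times are in milliseconds, matching the millisecond values the timer system records.

```python
class MarkTimer:
    """Hypothetical sketch of the timer system's mark bookkeeping: first and
    second time progresses (in milliseconds) are stored in pairs, the second
    always later than the first."""

    def __init__(self):
        self.pairs = []       # completed (first, second) pairs
        self._pending = None  # an open first time progress, if any

    def mark_start(self, now_ms: int):
        self._pending = now_ms

    def mark_end(self, now_ms: int):
        if self._pending is None or now_ms <= self._pending:
            raise ValueError("second time progress must be later than the first")
        self.pairs.append((self._pending, now_ms))
        self._pending = None

timer = MarkTimer()
timer.mark_start(38000)  # first time progress: 38 s
timer.mark_end(58000)    # second time progress: 58 s
print(timer.pairs)       # [(38000, 58000)]
```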
  • the device further includes:
  • the marking processing module 26 is configured to use the original audio data between the first time progress and the second time progress as target audio data, and perform marking processing on the target audio data.
  • the original audio data between the first time progress and the second time progress is the original audio data that needs to be focused on.
  • the original audio data between the first time progress and the second time progress can be used as target audio data, and the target audio data is marked.
  • the paired first time progress and second time progress can be identified in the playback progress bar corresponding to the original audio data, or a dedicated display interface can be used to display the paired first and second time progress information.
  • the two time points of 38 seconds and 58 seconds can be marked in the playback progress bar corresponding to the original audio data.
  • the user can determine the original audio data that needs to be focused on from the two marked time points of 38 seconds and 58 seconds; alternatively, the two time points of 38 seconds and 58 seconds are displayed on the display interface corresponding to the original audio data, and the user can determine the original audio data that needs to be focused on from the recorded time points of 38 seconds and 58 seconds.
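Extracting the segment of audio between a pair of time progresses reduces to byte arithmetic on the PCM stream. The sketch below assumes mono 16 kHz, 16-bit audio; neither parameter is specified in the document, so both are passed explicitly.

```python
def slice_pcm(pcm: bytes, start_ms: int, end_ms: int,
              sample_rate: int = 16000, bytes_per_sample: int = 2) -> bytes:
    """Extract the raw PCM between a first and second time progress.
    The sample rate and sample width are assumptions, not values from
    the document; raw PCM carries no header, so they must be known."""
    bytes_per_ms = sample_rate * bytes_per_sample // 1000
    return pcm[start_ms * bytes_per_ms : end_ms * bytes_per_ms]

# One second of mono 16 kHz / 16-bit silence is 32000 bytes.
second = b"\x00" * 32000
audio = second * 3                      # 3 s of audio
segment = slice_pcm(audio, 1000, 2000)  # the middle second
print(len(segment))                     # 32000
```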
  • the device further includes:
  • the speech recognition module 25 is used to perform speech recognition on the original audio data between the first time schedule and the second time schedule to obtain segmented text data.
  • speech recognition can also be performed on all original audio data.
  • the speech recognition module 25 is used to perform speech recognition on the original audio data to obtain the first text data;
  • the marking processing module 26 is used to start marking the first text data from a first starting position, the first starting position corresponding to the first time progress, and to stop marking the first text data from a first end position, the first end position corresponding to the second time progress;
  • the storage module 24 is used to store the first text data that has been marked.
  • the speech-to-text system is used to convert original audio data into text.
  • the original audio data can be converted into the first text data through the speech recognition engine in Automatic Speech Recognition (ASR) technology.
  • ASR is a technology that converts human speech into text. Its goal is to enable a computer to take dictation of continuous speech spoken by different people; it is also called a "voice dictation machine" and is a technology for converting "sound" into "text".
  • the speech recognition engine can be Google speech recognition engine, Microsoft speech recognition engine or iFlytek's speech recognition engine, which is not limited here.
  • the speech fragments in the original audio data can be converted into text information through the speech recognition engine.
  • the original audio format of the original audio data can be converted into a target audio format based on the FFMPEG tool; the original audio data in the target audio format can be segmented to obtain target voice data; and the target voice data can be input to the speech recognition engine to obtain the first text data.
  • the original audio data is converted from PCM format to MP3 format based on the FFMPEG tool, and the original audio data in MP3 format is segmented to obtain target voice data containing voice segments. That is to say, only the audio clips in the MP3-format original audio data that contain human voices are retained. Converting the original audio data to MP3 format makes it easier for users to segment and save the original audio data.
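The PCM-to-MP3 conversion step can be sketched as an FFMPEG invocation. The function below only builds the command line rather than running it; the file names are hypothetical, and since raw PCM has no header, the sample rate and channel count (assumed here to be 16 kHz mono) must be supplied explicitly with `-f s16le`, `-ar` and `-ac`.

```python
def ffmpeg_pcm_to_mp3_cmd(pcm_path: str, mp3_path: str,
                          sample_rate: int = 16000, channels: int = 1) -> list:
    """Build (but do not execute) an FFMPEG command line converting raw
    signed 16-bit little-endian PCM into MP3, as in the format-conversion
    step above. Rate and channel count are assumptions."""
    return [
        "ffmpeg",
        "-f", "s16le",            # input format: raw signed 16-bit LE PCM
        "-ar", str(sample_rate),  # sample rate of the headerless input
        "-ac", str(channels),     # channel count of the headerless input
        "-i", pcm_path,
        mp3_path,
    ]

print(ffmpeg_pcm_to_mp3_cmd("meeting.pcm", "meeting.mp3"))
```

On a device the command would be run with `subprocess.run(cmd, check=True)` or an equivalent native binding.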
  • the speech-to-text system can also be a streaming speech recognition system based on the deep learning Transformer model.
  • the streaming speech recognition system supports recording and transcribing at the same time, that is, converting audio data to text data while recording, and also supports direct recognition of existing audio data.
  • the timer system is used to record the time progress during audio data recording and playback.
  • the recording time progress of the original audio data corresponds to the playback time progress.
  • the text style system is used to mark the content of the first text data during recording and playback, including: marking the first text data with a first color, the first color being different from black; and/or bolding the text in the first text data, so that the user can easily identify the content that needs attention in the first text data.
  • the timer system records various time nodes during the entire process of converting the original audio data into the first text data, including the total duration of the recording, the starting time point for marking processing (i.e., the first time progress) and the end time point (i.e., the second time progress), where the first time progress and the second time progress appear in pairs, and the number of first time progresses can be one or more.
  • the original audio data between each pair of the first time progress and the second time progress is the original audio data that needs to be focused on, that is, the original audio data corresponding to the text that needs to be corrected.
  • the first time progress corresponds one-to-one to the first starting position: after the original audio data corresponding to the first time progress is converted into text, its position in the first text data is the first starting position.
  • the second time progress corresponds one-to-one to the first end position: after the original audio data corresponding to the second time progress is converted into text, its position in the first text data is the first end position.
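The one-to-one mapping from a time progress to a text position can be illustrated if we assume (the document does not specify the format) that the speech-to-text system reports, for each character it emits, the time progress at which that character was recognized.

```python
def position_for_time(alignment: list, t_ms: int) -> int:
    """Map a time progress (ms) to a position in the first text data.
    `alignment` is a hypothetical list of (character, time_ms) pairs in
    recognition order; the position returned is the index of the first
    character recognized at or after t_ms."""
    for index, (_, char_ms) in enumerate(alignment):
        if char_ms >= t_ms:
            return index
    return len(alignment)

alignment = [("h", 0), ("e", 200), ("l", 400), ("l", 600), ("o", 800)]
print(position_for_time(alignment, 400))  # 2 -> the first starting position
print(position_for_time(alignment, 800))  # 4 -> the first end position
```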
  • the timer system obtains the current time millisecond value of the electronic device as the starting time point of the voice, and then updates the end time point and voice duration of the voice every one millisecond through the Android scheduled task tool Timer.
  • the electronic device can obtain the original audio data by recording through a microphone or microphone array.
  • the first text data corresponding to the original audio data is displayed to the user in real time through the operation interface, and when the first marking instruction is received, the position of the current content in the first text data is recorded as the first starting position.
  • when the second marking instruction is received, the position of the current content in the first text data is recorded as the first end position.
  • the first text data between the first starting position and the first end position is marked.
  • the text style system is used to mark the first text data, such as coloring and bolding, recording the position of the marked content in the voice content, synchronizing the time points of the marked content, etc.
  • the voice content is the entire voice string text returned by the speech-to-text system;
  • the marked content is the marked string text returned by the speech-to-text system during the time period from the start mark to the end mark.
  • the start position and end position of the marked content are the positions of the marked content in the voice content; usually the position within the character string is determined by a subscript index.
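Locating the marked string inside the full voice content, as described above, amounts to a substring search returning subscript indices. A minimal sketch (function name and return convention are illustrative, not from the document):

```python
def mark_bounds(voice_content: str, marked_content: str):
    """Return the start and end character positions of the marked content
    within the full voice content (the 'start position' and 'end position'
    described above), or None if it does not occur."""
    start = voice_content.find(marked_content)
    if start == -1:
        return None
    return start, start + len(marked_content)

voice = "the meeting starts at nine in the main hall"
print(mark_bounds(voice, "starts at nine"))  # (12, 26)
```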
  • the storage module 24 is specifically configured to store the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress and the original audio data.
  • the voice data operating system is used to store the marked first text data, the first starting position, the first end position, the first time progress, the second time progress and the original audio data.
  • the voice data operating system contains a database.
  • the voice data operating system saves, for each piece of original audio data, the audio file index, the voice content, the voice duration, all mark data, the position of each text in the voice content, etc.
  • the audio file index is the storage path of the audio file;
  • the mark data consists of, for each mark, the mark content, the mark position and the mark time point; the position of a text in the speech is the time progress in the speech corresponding to that text.
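The database record kept by the voice data operating system might look like the following. The field names and the tuple layout of a mark are hypothetical; the document only lists which pieces of information are saved.

```python
from dataclasses import dataclass, field

@dataclass
class VoiceRecord:
    """Hypothetical shape of one record kept by the voice data operating
    system, following the fields listed above."""
    audio_file_index: str   # storage path of the audio file
    voice_content: str      # full transcribed text
    voice_duration_ms: int  # total voice duration
    # each mark: (mark content, start position, end position, time point ms)
    marks: list = field(default_factory=list)
    # time progress in the speech corresponding to each text position
    char_times: list = field(default_factory=list)

record = VoiceRecord("/audio/meeting01.mp3", "hello world", 1000,
                     marks=[("world", 6, 11, 600)])
print(record.marks[0][0])  # world
```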
  • when converting the original audio data into the first text data, erroneous text or key content may be marked, the marked first text data may be stored, and the corresponding time progress in the original audio data may be stored, so that when the first text data is proofread later, selecting the marked text quickly synchronizes to the corresponding original audio data according to the corresponding time progress.
  • this makes it convenient for the user to correct or otherwise process the marked text, avoids the need for users to listen to the original audio data from beginning to end, improves the efficiency of correcting the recorded text, and improves the user experience.
  • the device further includes:
  • the third receiving module 27 is configured to receive processing instructions for the second position of the first text data, the second position being located between the first starting position and the first ending position;
  • the voice playback system when the first text data needs to be corrected, the voice playback system is used to play the recorded original audio data, and at the same time, the first text data corresponding to the original audio data is displayed to the user in real time on the operation interface.
  • the first text data includes a first part and a second part, where the first part is located between each paired first starting position and first end position and contains content that has been marked as needing attention; the second part is located outside the first starting position and the first end position and contains content that has not been marked.
  • the first part is the part where speech-to-text errors may occur.
  • the second part is the part where errors are unlikely to occur. Therefore, when correcting the first text data, in order to improve efficiency, only the first part needs to be corrected.
  • the user can click or select any second position between the first starting position and the first end position, and the processing instruction for the second position is deemed received.
  • when the user clicks or selects any position between the first starting position and the first end position, it is deemed that the first text data between the first starting position and the first end position needs to be corrected.
  • the second processing module 28 is used to perform speech recognition on the original audio data between the first time schedule and the second time schedule, and correct the first text data according to the speech recognition result to obtain second text data.
  • the original audio data between the first time progress and the second time progress can be played in a loop, that is, playback starts from the first time progress, stops at the second time progress, and then returns to the first time progress to restart; alternatively, playback can stop after a preset number of times, for example after playing once or twice.
  • a speech recognition engine can be used to perform speech recognition on the original audio data between the first time schedule and the second time schedule to obtain a speech recognition result; for example, speech recognition is performed on the original audio data between 38 seconds and 1 minute 12 seconds, and the speech recognition result "So, why use notepad when you have a writing pad?" is obtained;
  • a first part and a second part of the first text data are intercepted, where the first part is located between the first starting position and the first end position, and the second part is located outside the first starting position and the first end position.
  • the first part is "So, why do you need a notepad when you have a writing pad?";
  • the original audio data can be input to the speech recognition engine for multiple speech recognitions to obtain the speech recognition results to improve the accuracy of speech recognition.
  • the first part can be replaced with the speech recognition result, or the first part can be modified so that it is consistent with the speech recognition result, thereby correcting the first text data and obtaining the corrected second text data. For example, the speech recognition result "So, why do you need a notepad when you have a writing pad?" can be used to replace the first part "So, why do you need a notepad when you have a writing pad?"
  • the storage module 24 is also used to store the second text data, the first time progress, the second time progress and the original audio data.
  • the above solution can be used to correct the first text data.
  • the original audio data between the first time progress and the second time progress can also be played to the user, and after listening to it the user manually corrects the first text data to obtain the second text data.
  • the second processing module 28 includes:
  • Interception sub-module 281, used to intercept the first part and the second part of the first text data, where the first part is located between the first starting position and the first end position, and the second part is located outside the first starting position and the first end position;
  • Comparison sub-module 282, used to compare the first part with the speech recognition result; if the first part is inconsistent with the speech recognition result, the speech recognition result is used to replace the first part to obtain second text data including the speech recognition result and the second part.
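The comparison sub-module's splice-in correction can be sketched as string slicing: if the first part (the span between the first starting and end positions) differs from the speech recognition result, the result replaces it while the surrounding second part is kept. Function and variable names are illustrative.

```python
def correct_first_text(first_text: str, start: int, end: int,
                       recognition_result: str) -> str:
    """Sketch of the comparison sub-module: compare the first part
    (first_text[start:end]) with the speech recognition result and, if they
    differ, splice the result in between the untouched second part."""
    first_part = first_text[start:end]
    if first_part == recognition_result:
        return first_text  # already consistent; no correction needed
    return first_text[:start] + recognition_result + first_text[end:]

text = "intro MISRECOGNIZED outro"
print(correct_first_text(text, 6, 19, "corrected words"))
# intro corrected words outro
```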
  • the speech recognition module 25 includes:
  • Conversion sub-module 251 used to convert the original audio format of the original audio data into a target audio format based on the FFMPEG tool;
  • the segmentation module 252 is used to segment the original audio data in the target audio format to obtain target voice data
  • the processing sub-module 253 is used to input the target speech data to the speech recognition engine.
  • the device further includes:
  • a voice search module configured to receive voice search instructions input by the user; perform voice recognition on the voice search instructions, convert the voice search instructions into search keywords; and search for the search keywords in the first text data , and mark the location of the search keyword.
  • a voice search button can be displayed on the operating interface of the electronic device. If the user clicks the voice search button and then inputs voice, the voice search instruction input by the user is deemed received, and the voice recording system starts recording and transmits the recorded voice data to the voice-to-text system.
  • the voice-to-text system processes the recorded voice data and converts the voice search instruction into a search keyword.
  • when the user clicks the voice search button, the voice input by the user is regarded as a voice search instruction.
  • after receiving the search keyword, the text style system searches for the search keyword in the first text data and marks its location, for example by highlighting it.
  • the corresponding original audio data can also be played, starting from the starting position of the highlighted position and stopping playing at the end position of the highlighted position.
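Finding every location of the search keyword in the first text data, so each occurrence can be highlighted, is a repeated substring scan. A minimal sketch (names are illustrative; the starting offset advances by one so overlapping matches are also found):

```python
def find_keyword_positions(first_text: str, keyword: str) -> list:
    """Sketch of the text style system's search step: return the
    (start, end) positions of every occurrence of the search keyword
    in the first text data."""
    positions, start = [], 0
    while True:
        index = first_text.find(keyword, start)
        if index == -1:
            return positions
        positions.append((index, index + len(keyword)))
        start = index + 1

text = "pad or notepad, a pad will do"
print(find_keyword_positions(text, "pad"))  # [(0, 3), (11, 14), (18, 21)]
```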
  • the device further includes:
  • a typo recognition module is configured to receive typo recognition instructions input by the user; identify typos in the first text data according to the vocabulary and contextual semantic recognition algorithms, and mark the typos.
  • a typo recognition button can be displayed on the operation interface of the electronic device. If the user clicks the typo recognition button, the typo recognition instruction input by the user is deemed received; typos in the first text data can then be identified according to the lexicon and contextual semantic recognition algorithm and marked, for example highlighted, to remind the user, and the user can correct the typos to improve the accuracy of the text data.
  • the first receiving module 23 is specifically configured to mark the first text data with a first color, the first color being different from black; and/or to bold the text in the first text data.
  • An embodiment of the present disclosure also provides a voice data processing device, as shown in Figure 6, including a processor 32 and a memory 31.
  • the memory 31 stores programs or instructions that can be run on the processor 32. When the program or instruction is executed by the processor 32, the steps of the voice data processing method as described above are implemented.
  • the processor 32 is configured to obtain original audio data; receive a first marking instruction for a first time progression of the original audio data; and receive a second marking instruction for a second time progression of the original audio data. Mark instructions, the second time schedule is later than the first time schedule; store the first time schedule, the second time schedule and the original audio data.
  • the processor 32 is configured to use the original audio data between the first time schedule and the second time schedule as target audio data, and perform marking processing on the target audio data.
  • the processor 32 is configured to perform speech recognition on the original audio data between the first time schedule and the second time schedule to obtain segmented text data.
  • the processor 32 is used to perform speech recognition on the original audio data to obtain the first text data; start marking the first text data from a first starting position, the first starting position corresponding to the first time progress; stop marking the first text data from a first end position, the first end position corresponding to the second time progress; and store the marked first text data.
  • the processor 32 is configured to receive processing instructions for a second position of the first text data, the second position being located between the first starting position and the first ending position; Speech recognition is performed on the original audio data between the first time schedule and the second time schedule, and the first text data is corrected according to the speech recognition result to obtain second text data.
  • the processor 32 is configured to loop-play the original audio data between the first time progress and the second time progress, and to use a speech recognition engine to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result.
  • the processor 32 is configured to intercept the first part and the second part of the first text data, and the first part is located between the first starting position and the first ending position, so The second part is located outside the first starting position and the first ending position; compare the first part with the speech recognition result, and if the first part is inconsistent with the speech recognition result, use the The speech recognition result replaces the first part, and second text data including the speech recognition result and the second part is obtained.
  • the processor 32 is configured to convert the original audio format of the original audio data into a target audio format based on the FFMPEG tool; perform segmentation processing on the original audio data in the target audio format to obtain Target speech data; input the target speech data to the speech recognition engine.
  • the processor 32 is configured to receive a voice search instruction input by the user; perform voice recognition on the voice search instruction, and convert the voice search instruction into a search keyword; in the first text data Find the search keyword, and mark the location of the search keyword.
  • the processor 32 is configured to receive typo identification instructions input by the user, identify typos in the first text data based on the lexicon and contextual semantic recognition algorithm, and mark the typos.
  • the processor 32 is configured to mark the first text data with a first color, the first color being different from black; and/or to bold the text in the first text data.
  • Embodiments of the present disclosure also provide a readable storage medium on which programs or instructions are stored. When the programs or instructions are executed by a processor, the steps of the voice data processing method as described above are implemented.
  • Computer-readable media includes both persistent and non-volatile, removable and non-removable media that can be implemented by any method or technology for storage of information.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing terminal device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • the method of the embodiments can be implemented by means of software plus the necessary general hardware platform; of course, it can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a computer software product.
  • the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or CD) and includes several instructions to cause a terminal (which can be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in the various embodiments of this application.


Abstract

A voice data processing method and apparatus, belonging to the technical field of voice processing. The voice data processing method comprises: acquiring original audio data (101); receiving a first marking instruction for a first time progress of the original audio data (102); receiving a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress (103); and storing the first time progress, the second time progress and the original audio data (104). The method can achieve voice segmentation.

Description

语音数据处理方法及装置Voice data processing method and device
相关申请的交叉引用Cross-references to related applications
本申请主张在2022年05月25日在中国提交的中国专利申请号No.202210578264.X的优先权,其全部内容通过引用包含于此。This application claims priority to Chinese Patent Application No. 202210578264.X filed in China on May 25, 2022, the entire content of which is incorporated herein by reference.
技术领域Technical field
本公开涉及语音处理技术领域,特别是指一种语音数据处理方法及装置。The present disclosure relates to the field of voice processing technology, and in particular, to a voice data processing method and device.
背景技术Background technique
目前语音输入得到普遍使用,特别是开会时通过语音转文字可以快捷省事完成会议全程录音并实时生成文字纪录。但目前语音转文字受环境和说话人发音等多重因素的影响,并不能达到100%的准确率,会后还需要人为对生成的文字进行校正。相关技术中需要把录音从头到尾听一遍进行校正,耗费时间长,效率低,用户体验不好。At present, voice input is widely used, especially during meetings, voice-to-text can be used to quickly and easily record the entire meeting and generate text records in real time. However, currently speech-to-text is affected by multiple factors such as the environment and the speaker's pronunciation, and cannot achieve 100% accuracy. The generated text needs to be manually corrected after the meeting. In related technologies, it is necessary to listen to the recording from beginning to end for correction, which takes a long time, is inefficient, and has a poor user experience.
发明内容Contents of the invention
本公开要解决的技术问题是提供一种语音数据处理方法及装置,能够实现语音分段。The technical problem to be solved by this disclosure is to provide a voice data processing method and device that can realize voice segmentation.
为解决上述技术问题,本公开的实施例提供技术方案如下:In order to solve the above technical problems, embodiments of the present disclosure provide the following technical solutions:
一方面,提供一种语音数据处理方法,包括:On the one hand, a voice data processing method is provided, including:
获取原始音频数据;Get original audio data;
接收针对所述原始音频数据的第一时间进度的第一标记指令;receiving a first marking instruction for a first time progression of the original audio data;
接收针对所述原始音频数据的第二时间进度的第二标记指令,所述第二时间进度晚于所述第一时间进度;receiving a second marking instruction for a second time schedule of the original audio data, the second time schedule being later than the first time schedule;
存储所述第一时间进度、所述第二时间进度和所述原始音频数据。The first time schedule, the second time schedule and the original audio data are stored.
In some embodiments, after storing the first time progress, the second time progress, and the original audio data, the method further includes:
taking the original audio data between the first time progress and the second time progress as target audio data, and performing marking processing on the target audio data.
In some embodiments, after storing the first time progress, the second time progress, and the original audio data, the method further includes:
performing speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
In some embodiments, after the original audio data is acquired, the method further includes:
performing speech recognition on the original audio data to obtain first text data;
after receiving the first marking instruction for the first time progress of the original audio data, the method further includes:
starting marking processing of the first text data from a first start position, the first start position corresponding to the first time progress;
after receiving the second marking instruction for the second time progress of the original audio data, the method further includes:
stopping the marking processing of the first text data at a first end position, the first end position corresponding to the second time progress; and
storing the first text data after the marking processing.
In some embodiments, after storing the first text data after the marking processing, the method further includes:
receiving a processing instruction for a second position of the first text data, the second position being located between the first start position and the first end position; and
performing speech recognition on the original audio data between the first time progress and the second time progress, and correcting the first text data according to a speech recognition result to obtain second text data.
In some embodiments, performing speech recognition on the original audio data between the first time progress and the second time progress includes:
playing the original audio data between the first time progress and the second time progress in a loop; and
performing speech recognition on the original audio data between the first time progress and the second time progress by using a speech recognition engine to obtain a speech recognition result.
In some embodiments, correcting the first text data according to the speech recognition result includes:
extracting a first part and a second part of the first text data, the first part being located between the first start position and the first end position, and the second part being located outside the first start position and the first end position; and
comparing the first part with the speech recognition result, and if the first part is inconsistent with the speech recognition result, replacing the first part with the speech recognition result to obtain second text data including the speech recognition result and the second part.
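The replacement described here amounts to a string splice: the first part between the start and end positions is swapped for the re-recognition result, while the second part outside those positions is preserved. A minimal Python sketch of that logic (function and parameter names are illustrative, not part of the disclosed apparatus):

```python
def correct_marked_span(text, start, end, recognized):
    """Replace the marked first part text[start:end] with the
    re-recognition result when they differ; the surrounding second
    part (text before start and after end) is kept unchanged."""
    first_part = text[start:end]
    if first_part == recognized:
        return text  # consistent: no correction needed
    return text[:start] + recognized + text[end:]
```

If the first part already matches the recognition result, the original first text data is returned unchanged.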
In some embodiments, performing speech recognition on the original audio data includes:
converting an original audio format of the original audio data into a target audio format based on the FFMPEG tool;
segmenting the original audio data in the target audio format to obtain target voice data; and
inputting the target voice data into a speech recognition engine.
In some embodiments, the method further includes:
receiving a voice search instruction input by a user;
performing speech recognition on the voice search instruction, and converting the voice search instruction into a search keyword; and
searching for the search keyword in the first text data, and performing marking processing on the position where the search keyword is located.
In some embodiments, the method further includes:
receiving a typo recognition instruction input by the user; and
identifying typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and performing marking processing on the typos.
In some embodiments, performing marking processing on the first text data includes:
marking the first text data with a first color, the first color being different from black; and/or
bolding the characters in the first text data.
Embodiments of the present disclosure further provide a voice data processing apparatus, including:
an acquisition module configured to acquire original audio data;
a first receiving module configured to receive a first marking instruction for a first time progress of the original audio data;
a second receiving module configured to receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress; and
a storage module configured to store the first time progress, the second time progress, and the original audio data.
In some embodiments, the apparatus further includes:
a marking processing module configured to take the original audio data between the first time progress and the second time progress as target audio data and perform marking processing on the target audio data.
In some embodiments, the apparatus further includes:
a speech recognition module configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
In some embodiments, the apparatus further includes:
a speech recognition module configured to perform speech recognition on the original audio data to obtain first text data; and
a marking processing module configured to start marking processing of the first text data from a first start position, the first start position corresponding to the first time progress, and to stop the marking processing of the first text data at a first end position, the first end position corresponding to the second time progress;
the storage module being configured to store the first text data after the marking processing.
In some embodiments, the apparatus further includes:
a third receiving module configured to receive a processing instruction for a second position of the first text data, the second position being located between the first start position and the first end position; and
a second processing module configured to perform speech recognition on the original audio data between the first time progress and the second time progress and correct the first text data according to a speech recognition result to obtain second text data.
In some embodiments, the speech recognition module is specifically configured to play the original audio data between the first time progress and the second time progress in a loop, and to perform speech recognition on that original audio data by using a speech recognition engine to obtain a speech recognition result.
In some embodiments, the second processing module includes:
an extraction submodule configured to extract a first part and a second part of the first text data, the first part being located between the first start position and the first end position, and the second part being located outside the first start position and the first end position; and
a comparison submodule configured to compare the first part with the speech recognition result and, if the first part is inconsistent with the speech recognition result, replace the first part with the speech recognition result to obtain second text data including the speech recognition result and the second part.
In some embodiments, the speech recognition module includes:
a conversion submodule configured to convert an original audio format of the original audio data into a target audio format based on the FFMPEG tool;
a segmentation submodule configured to segment the original audio data in the target audio format to obtain target voice data; and
a processing submodule configured to input the target voice data into a speech recognition engine.
In some embodiments, the apparatus further includes:
a voice search module configured to receive a voice search instruction input by a user, perform speech recognition on the voice search instruction, convert the voice search instruction into a search keyword, search for the search keyword in the first text data, and perform marking processing on the position where the search keyword is located.
In some embodiments, the apparatus further includes:
a typo recognition module configured to receive a typo recognition instruction input by the user, identify typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and perform marking processing on the typos.
Embodiments of the present disclosure further provide a voice data processing apparatus, including a processor and a memory, wherein the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the voice data processing method described above.
Embodiments of the present disclosure further provide a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of the voice data processing method described above.
Embodiments of the present disclosure have the following beneficial effects.
In the above solution, after the original audio data is acquired, marking instructions for different time progresses of the original audio data are received, and the original audio data is stored together with those time progresses, so that the speech can be segmented according to the recorded time progresses.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of a voice data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the composition of an electronic device according to an embodiment of the present disclosure;
FIG. 3 is a structural block diagram of a voice data processing apparatus according to an embodiment of the present disclosure;
FIG. 4 is a structural block diagram of a second processing module according to an embodiment of the present disclosure;
FIG. 5 is a structural block diagram of a speech recognition module according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the composition of a voice data processing apparatus according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
To make the technical problems to be solved, the technical solutions, and the advantages of the embodiments of the present disclosure clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
Embodiments of the present disclosure provide a voice data processing method and apparatus capable of segmenting speech.
An embodiment of the present disclosure provides a voice data processing method, as shown in FIG. 1, including:
Step 101: acquiring original audio data.
The technical solution of this embodiment is applied to an electronic device capable of human-computer interaction with a user. As shown in FIG. 2, the electronic device includes a voice recording system, a speech-to-text system, a timer system, a text style system, a voice data operation system, a voice playback system, and the like. The electronic device can interact with a back-end server through a network.
Optionally, in this embodiment, the electronic device may be a terminal device configured with a target client and/or a target server. The terminal device may be a microphone or a microphone array, or a terminal device equipped with a microphone. The electronic device may include, but is not limited to, at least one of the following: a mobile phone (such as an Android phone or an iOS phone), a laptop, a tablet, a handheld computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, and the like. The target client may be a video client, an instant messaging client, a browser client, an education client, etc. The target server may be a video server, an instant messaging server, a browser server, an education server, etc. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network, and a wide area network, and the wireless network includes Bluetooth, WiFi, and other networks implementing wireless communication. The server may be a single server, a server cluster composed of multiple servers, or a cloud server. The above is only an example, and this embodiment is not limited thereto.
In this embodiment, the original audio data can be obtained by recording with a microphone or a microphone array. The original audio data may be a data file in any of various audio formats obtained by a recording terminal, including but not limited to ACT, REC, MP3, WAV, WMA, VY1, VY2, DVF, MSC, and AIFF; the original audio data may also be Pulse Code Modulation (PCM) audio stream data.
The electronic device may display a recording button on its operation interface. When the user taps the recording button, recording begins: the voice recording system starts working, continuously collects audio data in a loop in a sub-thread through AudioRecord and AudioChunk, and passes the collected audio data to the speech-to-text system so that it can convert the audio data into text. Here, AudioRecord is the Android media recording tool; AudioChunk is a custom data container that holds a byte array and provides a function for converting the byte array into a short array; the byte array is used to receive the audio data returned by AudioRecord.
Step 102: receiving a first marking instruction for a first time progress of the original audio data.
Step 103: receiving a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress.
In this embodiment, the timer system is used to record the time progress during recording and playback of the audio data, and the recording time progress of the original audio data corresponds to the playback time progress.
During recording, the timer system can record various time points, including the total recording duration, the start time point of the marking processing (that is, the first time progress), and the end time point (that is, the second time progress). The first time progress and the second time progress appear in pairs, and there may be one or more first time progresses; the original audio data between each pair of first and second time progresses is the original audio data that requires particular attention. The first time progress may be the start time point of the entire original audio data or some time point in the middle of it; the second time progress may be the end time point of the entire original audio data or some time point in the middle of it.
Step 104: storing the first time progress, the second time progress, and the original audio data.
A segment of speech can be determined by the first time progress and the second time progress, and the speech can be segmented accordingly, where the first time progress is the start time point of the segmented speech and the second time progress is its end time point. When the original audio data includes multiple pairs of first and second time progresses, the original audio data can be divided into multiple speech segments.
When recording starts, the timer system obtains the current time of the electronic device in milliseconds as the start time point of the speech, and the end time point and duration of the speech can then be updated every millisecond through Timer, the Android scheduled-task tool.
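The segmentation of steps 102 to 104 amounts to cutting the stored audio at each (first time progress, second time progress) pair. A minimal Python sketch of that logic, assuming the audio is available as a flat sample array (the patent itself describes an Android implementation; all names here are illustrative):

```python
def segment_audio(samples, sample_rate, mark_pairs):
    """Cut raw audio samples into segments using (start_s, end_s)
    time-progress pairs given in seconds; each second time progress
    must be later than its paired first time progress."""
    segments = []
    for start_s, end_s in mark_pairs:
        if end_s <= start_s:
            raise ValueError("second time progress must be later than the first")
        start_i = int(start_s * sample_rate)  # first time progress -> sample index
        end_i = int(end_s * sample_rate)      # second time progress -> sample index
        segments.append(samples[start_i:end_i])
    return segments
```

With several pairs, the original audio data is divided into several speech segments, one per pair, exactly as described above.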
In some embodiments, after storing the first time progress, the second time progress, and the original audio data, the method further includes:
taking the original audio data between the first time progress and the second time progress as target audio data, and performing marking processing on the target audio data.
The original audio data between the first time progress and the second time progress is the original audio data that requires particular attention. To help the user quickly locate it, that original audio data can be taken as target audio data and marked. In a specific example, the paired first and second time progresses can be marked on the playback progress bar corresponding to the original audio data, or a dedicated display interface can be used to show the information of each pair. For example, if the first time progress is 38 seconds and the second time progress is 58 seconds, these two time points can be marked on the playback progress bar, and the user can use the marked 38-second and 58-second points to locate the original audio data that requires attention; alternatively, the two time points can be shown on a display interface corresponding to the original audio data, and the user can locate the relevant audio from the recorded time points. As a further alternative, the target audio data between the first time progress and the second time progress can be cut out directly and stored in a specific area.
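Marking the paired time progresses on a playback progress bar, as in the 38-second/58-second example above, reduces to mapping each pair to fractional positions along the bar. A short, hedged sketch (how the UI actually renders the markers is not specified by the disclosure):

```python
def progress_markers(mark_pairs, total_s):
    """Map (first, second) time-progress pairs in seconds to
    fractional positions on a playback progress bar of total_s seconds."""
    markers = []
    for start_s, end_s in mark_pairs:
        markers.append((start_s / total_s, end_s / total_s))
    return markers
```

For a 100-second recording with the pair (38, 58), the markers land at 38% and 58% of the bar.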
Since the original audio data between the first time progress and the second time progress is the data that requires particular attention, speech recognition may be performed only on that portion to obtain segmented text data. This reduces the speech recognition workload while still ensuring that the user obtains the key content that needs attention.
Of course, in this embodiment, speech recognition may also be performed on all of the original audio data. In some embodiments, after the original audio data is acquired, the method further includes:
performing speech recognition on the original audio data to obtain first text data.
In this embodiment, the speech-to-text system is used to convert the original audio data into text. In practice, the original audio data can be converted into the first text data by a speech recognition engine using Automatic Speech Recognition (ASR). ASR is a technology that converts human speech into text; its goal is to let a computer "take dictation" of continuous speech from different speakers, which is why it is also called a "speech dictation machine", a technology for converting "sound" into "text". In this embodiment, the speech recognition engine may be the Google, Microsoft, or iFlytek speech recognition engine, which is not limited here; the speech recognition engine converts the speech segments in the original audio data into text information.
Specifically, the original audio format of the original audio data can be converted into a target audio format based on the FFMPEG tool; the original audio data in the target audio format is segmented to obtain target voice data; and the target voice data is input into the speech recognition engine to obtain the first text data. For example, the original audio data is converted from PCM format to MP3 format based on the FFMPEG tool, and the MP3-format audio is segmented to obtain target voice data containing speech segments; in other words, only the audio segments containing human voice may be retained from the MP3-format original audio data. Converting the original audio data to MP3 format makes it convenient for the user to segment and save it.
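The PCM-to-MP3 conversion above can be driven from the FFMPEG command line. A Python sketch follows; it assumes raw 16-bit little-endian mono capture at 16 kHz, which must be stated explicitly because headerless PCM carries no format information (the exact flags depend on how the audio was recorded, and the injectable `run` parameter exists only so the command can be inspected without invoking ffmpeg):

```python
import subprocess

def pcm_to_mp3(pcm_path, mp3_path, sample_rate=16000, channels=1,
               run=subprocess.run):
    """Convert headerless raw PCM (s16le) to MP3 with the ffmpeg CLI."""
    cmd = [
        "ffmpeg", "-y",
        "-f", "s16le",            # raw 16-bit little-endian samples
        "-ar", str(sample_rate),  # capture sample rate
        "-ac", str(channels),     # channel count
        "-i", pcm_path,
        mp3_path,
    ]
    run(cmd, check=True)
    return cmd
```

Splitting the resulting MP3 into voice-only segments (for example by silence detection) is a separate step not shown here.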
In some embodiments, the speech-to-text system may also be a streaming speech recognition system based on the deep-learning Transformer model. The streaming system supports converting while recording, that is, converting audio data into text data as the recording is made, and also supports directly recognizing existing audio data.
After the first marking instruction for the first time progress of the original audio data is received, the first time progress of the original audio data at that moment is recorded, and marking processing of the first text data starts from the first start position, the first start position corresponding to the first time progress.
After the second marking instruction for the second time progress of the original audio data is received, the second time progress of the original audio data at that moment is recorded, and the marking processing of the first text data stops at the first end position, the first end position corresponding to the second time progress.
In this embodiment, the timer system records the time progress during recording and playback of the audio data, the recording time progress of the original audio data corresponding to the playback time progress. The text style system marks the content of the first text data during recording and playback, including marking the first text data with a first color different from black and/or bolding the characters in the first text data, so that the user can easily identify the content requiring attention in the first text data.
During recording, the timer system records the various time points throughout the conversion of the original audio data into the first text data, including the total recording duration, the start time point of the marking processing (the first time progress), and the end time point (the second time progress). The first and second time progresses appear in pairs, there may be one or more first time progresses, and the original audio data between each pair is the data that requires particular attention, that is, the original audio data corresponding to the text that needs correction. Meanwhile, the first time progress corresponds one-to-one with the first start position: after the original audio data corresponding to the first time progress is converted into text, its position in the first text data is the first start position. Likewise, the second time progress corresponds one-to-one with the first end position: after the original audio data corresponding to the second time progress is converted into text, its position in the first text data is the first end position.
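Because the voice data operation system stores the time progress of each character of the speech content, mapping a marking instruction's time progress to the corresponding text position is a lookup in a sorted list of per-character times. A minimal sketch, assuming such a list is available (the lookup strategy is illustrative, not mandated by the disclosure):

```python
import bisect

def text_position_for_time(char_times, t):
    """Given the time progress (seconds) at which each character of the
    recognized text was spoken, return the index of the character being
    spoken at time t; times before the first character map to index 0."""
    i = bisect.bisect_right(char_times, t) - 1
    return max(i, 0)
```

The same lookup works in both directions of the correspondence described above: a first time progress yields a first start position, and a second time progress yields a first end position.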
When recording starts, the timer system obtains the current time of the electronic device in milliseconds as the start time point of the speech, and the end time point and duration of the speech can then be updated every millisecond through Timer, the Android scheduled-task tool.
Specifically, while acquiring the original audio data through microphone or microphone-array recording, the electronic device can display the first text data corresponding to the original audio data to the user in real time on the operation interface. When the user clicks or selects content of the first text data for the (2k-1)-th time, the position of that content in the first text data is recorded as a first start position; when the user clicks or selects content for the 2k-th time, the position of that content is recorded as a first end position; the first text data between the first start position and the first end position is then marked, where k is a positive integer. For example, when the user clicks or selects "then" in the first text data for the third time, the position of "then" is recorded as a first start position; when the user clicks or selects "writing pad" for the fourth time, the position of "writing pad" is recorded as a first end position, and the text between them is marked, for example by bolding and/or coloring. When the user clicks or selects "very good" for the seventh time, its position is recorded as a first start position; when the user clicks or selects "planning" for the eighth time, its position is recorded as a first end position, and the text "a very good plan" between them is marked, for example by bolding and/or coloring.
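The odd/even click behavior above can be expressed as pairing successive click positions: each (2k-1)-th click opens a mark range and the 2k-th click closes it. An illustrative Python sketch (a trailing unpaired click is simply left open, one reasonable choice among several):

```python
def pair_clicks(click_positions):
    """Pair successive click positions into (start, end) mark ranges:
    the (2k-1)-th click opens a range, the 2k-th click closes it.
    An unmatched final click yields no range."""
    pairs = []
    for i in range(0, len(click_positions) - 1, 2):
        pairs.append((click_positions[i], click_positions[i + 1]))
    return pairs
```

Each resulting (start, end) pair is a span of the first text data to be bolded and/or colored.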
In this embodiment, the text style system is used to mark the first text data, for example by coloring or bolding, to record the position of the marked content within the speech content, and to synchronize the time points of the marked content. The speech content is the entire speech string returned by the speech-to-text system; the marked content is the marked string returned by the speech-to-text system during the period from the start mark to the end mark. The start and end positions of the marked content are its positions within the speech content, which for a string are usually given by character indices.
Afterwards, the marked first text data, the first start position, the first end position, the first time progress, the second time progress and the original audio data are stored.
In this embodiment, the voice data operating system is used to store the marked first text data, the first start position, the first end position, the first time progress, the second time progress and the original audio data. The voice data operating system contains a database in which, for each piece of original audio data, it saves the audio file index, the speech content, the speech duration, all mark data, the position of each character of the speech content within the speech, and so on. The audio file index is the storage path of the audio file; the mark data comprises the marked content, mark position and mark time point of each mark; and the position of a character within the speech is the time progress within the speech corresponding to that character.
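For illustration, the kind of record the database might hold can be sketched as the data structures below. All class and field names are hypothetical; they only mirror the items listed above (audio file index, speech content, duration, mark data, per-character time progress).

```python
from dataclasses import dataclass, field

@dataclass
class Mark:
    content: str      # marked string text
    start_index: int  # start position within the speech content (string index)
    end_index: int    # end position within the speech content
    start_ms: int     # first time progress, milliseconds into the audio
    end_ms: int       # second time progress

@dataclass
class VoiceRecord:
    audio_path: str      # audio file index: storage path of the audio file
    speech_content: str  # full transcript returned by the speech-to-text system
    duration_ms: int     # speech duration
    marks: list = field(default_factory=list)          # all mark data
    char_times_ms: list = field(default_factory=list)  # per-character time progress

record = VoiceRecord("/data/rec_001.mp3", "a very good plan", 58000)
record.marks.append(Mark("very good", 2, 11, 38000, 58000))
print(record.marks[0].content)  # -> very good
```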
In this embodiment, when the original audio data is converted into the first text data, erroneous text or key content in it can be marked, the marked first text data can be stored, and the time progress corresponding to the original audio data can be stored. In this way, when the first text data is later proofread, selecting the marked text allows the playback position to be synchronized quickly to the corresponding original audio data according to the stored time progress. This makes it convenient for the user to correct or otherwise process the marked text, spares the user from listening to the original audio data again from beginning to end, improves the efficiency of correcting the transcribed text, and improves the user experience.
In some embodiments, after the marked first text data, the first start position, the first end position, the first time progress, the second time progress and the original audio data are stored, when the first text data needs to be corrected, the method further includes:
receiving a processing instruction for a second position of the first text data, the second position being located between the first start position and the first end position;
In this embodiment, when the first text data needs to be corrected, the voice playback system plays the recorded original audio data while the operation interface displays the corresponding first text data to the user in real time. The first text data includes a first part and a second part. The first part is located between each pair of first start and first end positions, is the content requiring particular attention, and has already been marked; the second part is located outside the first start and first end positions and has not been marked. The first part is where speech-to-text errors may exist, while the second part is unlikely to contain errors; therefore, to improve efficiency when correcting the first text data, only the first part needs to be corrected.
The user may click or select any second position located between the first start position and the first end position; this is treated as receiving a processing instruction for the second position, indicating that the first text data between the first start position and the first end position needs to be corrected. Whenever the user clicks or selects any position between the first start position and the first end position, it is deemed that the first text data between those two positions needs to be corrected.
The corresponding position in the original audio data can then be located quickly according to the pre-stored first time progress corresponding to the first start position and the second time progress corresponding to the first end position, and the original audio data between the first time progress and the second time progress is replayed. Specifically, before a processing instruction for the next second position is received, the original audio data between the first time progress and the second time progress may be played in a loop: playback starts from the first time progress, stops at the second time progress, and then returns to the first time progress and starts again. Alternatively, playback may stop after a preset number of passes, for example after the original audio data has been played once or twice.
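The looping-replay behaviour can be modelled as below. This is a sketch under the assumption that playback is driven by (start, end) windows in milliseconds; the generator name and interface are invented for illustration.

```python
def playback_windows(start_ms, end_ms, max_loops=None):
    """Yield (from_ms, to_ms) playback windows for the audio segment
    between the first and second time progresses.

    With max_loops=None the segment loops until a new processing
    instruction arrives (modelled here as the caller breaking out of
    the loop); otherwise playback stops after max_loops passes.
    """
    loops = 0
    while max_loops is None or loops < max_loops:
        yield (start_ms, end_ms)
        loops += 1

# Play the 38 s - 72 s segment twice, then stop.
windows = list(playback_windows(38000, 72000, max_loops=2))
print(windows)  # -> [(38000, 72000), (38000, 72000)]
```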
A speech recognition engine performs speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result; for example, speech recognition performed on the original audio data between 38 seconds and 1 minute 12 seconds yields the result "then, since there is a writing pad, why use a notepad";
a first part and a second part of the first text data are extracted, the first part being located between the first start position and the first end position, and the second part being located outside the first start position and the first end position; for example, the first part is "then, since there is a writing pad, why a notepad";
the first part is compared with the speech recognition result, and if the first part is inconsistent with the speech recognition result, the first part is replaced with the speech recognition result to obtain second text data comprising the replaced first part and the second part.
When the speech recognition result obtained by passing the original audio data through the speech recognition engine differs greatly from the first part, the original audio data may be input into the speech recognition engine several times to obtain the speech recognition result, so as to improve recognition accuracy. When the first part is inconsistent with the speech recognition result, the first part may be replaced with the speech recognition result, or the first part may be modified to be consistent with the speech recognition result, thereby correcting the first text data and obtaining corrected second text data. For example, the speech recognition result "then, since there is a writing pad, why use a notepad" may be used to replace the first part "then, since there is a writing pad, why a notepad".
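The compare-and-replace step amounts to a plain string substitution over the marked span. The sketch below is illustrative; the index-based interface is an assumption, not the disclosed implementation.

```python
def correct_first_part(first_text, start, end, recognition_result):
    """Replace the marked first part, i.e. the characters at
    [start:end), with the speech recognition result when the two are
    inconsistent; return the text unchanged when they already agree."""
    first_part = first_text[start:end]
    if first_part == recognition_result:
        return first_text  # already consistent, nothing to correct
    return first_text[:start] + recognition_result + first_text[end:]

# Small stand-in for "why a notepad" vs. "why use a notepad".
text = "AAA why a notepad ZZZ"
print(correct_first_part(text, 4, 17, "why use a notepad"))
# -> AAA why use a notepad ZZZ
```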
In some embodiments, after the second text data is obtained, the method further includes:
storing the second text data, the first time progress, the second time progress and the original audio data.
After the first text data is corrected, the voice data operating system saves the second text data obtained by the correction. In this embodiment, since only the original audio data between the first time progress and the second time progress needs to be replayed when correcting the first text data, the efficiency of correcting the voice file can be greatly improved.
In this embodiment, the above scheme can be used to correct the first text data. Alternatively, the original audio data between the first time progress and the second time progress can be played to the user, and the user, after listening to it, manually corrects the first text data to obtain the second text data.
In addition, in this embodiment, after voice recording ends, a voice search can also be performed, and the method further includes:
receiving a voice search instruction input by the user;
performing speech recognition on the voice search instruction, and converting the voice search instruction into a search keyword;
searching for the search keyword in the first text data, and marking the position of the search keyword.
Specifically, a voice search button may be displayed on the operation interface of the electronic device. If the user clicks the voice search button and then inputs speech, this is treated as receiving a voice search instruction input by the user: the voice recording system starts recording and passes the recorded voice data to the speech-to-text system for processing, which converts the voice search instruction into a search keyword. While the user is engaging the voice search button, all speech input by the user is treated as a voice search instruction. After receiving the search keyword, the text style system searches for the keyword in the first text data and marks the positions where it occurs, for example by highlighting them. When the user clicks the first text data at a highlighted position, the corresponding original audio data can also be played, starting from the beginning of the highlighted position and stopping at its end. The technical solution of this embodiment makes it convenient for users to find the content they need in both the text data and the audio data.
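The keyword lookup that drives the highlighting can be sketched as below; this is a minimal illustration, assuming positions are expressed as character-index ranges.

```python
def find_keyword_positions(text, keyword):
    """Return (start, end) index pairs for every occurrence of the
    search keyword in the first text data, for highlighting."""
    positions = []
    i = text.find(keyword)
    while i != -1:
        positions.append((i, i + len(keyword)))
        i = text.find(keyword, i + 1)
    return positions

print(find_keyword_positions("plan the plan", "plan"))  # -> [(0, 4), (9, 13)]
```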
In addition, in this embodiment, after voice recording ends, typo recognition can also be performed, and the method further includes:
receiving a typo recognition instruction input by the user;
identifying typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and marking the typos.
Specifically, a typo recognition button may be displayed on the operation interface of the electronic device. If the user clicks the typo recognition button, this is treated as receiving a typo recognition instruction input by the user; typos in the first text data can then be identified according to the lexicon and the contextual semantic recognition algorithm and marked, for example by highlighting them, to remind the user, who can then correct the typos to improve the accuracy of the text data.
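The disclosure does not specify the contextual semantic recognition algorithm; as a crude stand-in, the sketch below flags a token when it is absent from the lexicon or when its word-pair context has never been observed in bigram statistics. All names and the bigram heuristic are assumptions for illustration only.

```python
def flag_typos(tokens, lexicon, bigram_counts, min_count=1):
    """Return indices of tokens suspected to be typos: a token absent
    from the lexicon, or one whose (previous, current) word pair is
    rarer than min_count in the bigram statistics."""
    flagged = []
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            flagged.append(i)
        elif i > 0 and bigram_counts.get((tokens[i - 1], tok), 0) < min_count:
            flagged.append(i)
    return flagged

lexicon = {"writing", "pad", "plan"}
bigrams = {("writing", "pad"): 12}
print(flag_typos(["writing", "pad", "plam"], lexicon, bigrams))  # -> [2]
```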
An embodiment of the present disclosure further provides a voice data processing apparatus, as shown in Figure 3, including:
an acquisition module 21, configured to acquire original audio data;
The technical solution of this embodiment is applied to an electronic device capable of human-computer interaction with a user. As shown in Figure 2, the electronic device includes a voice recording system, a speech-to-text system, a timer system, a text style system, a voice data operating system, a voice playback system and the like. The electronic device can interact with a back-end server through a network.
Optionally, in this embodiment, the electronic device may be a terminal device configured with a target client and/or a target server; the terminal device may be a microphone or a microphone array, or a terminal device equipped with a microphone. The electronic device may include, but is not limited to, at least one of the following: a mobile phone (such as an Android phone or an iOS phone), a laptop computer, a tablet computer, a handheld computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, and the like. The target client may be a video client, an instant messaging client, a browser client, an education client, etc.; the target server may be a video server, an instant messaging server, a browser server, an education server, etc. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network and a wide area network, and the wireless network includes Bluetooth, Wi-Fi and other networks implementing wireless communication. The server may be a single server, a server cluster composed of multiple servers, or a cloud server. The above is merely an example, and this embodiment is not limited in this respect.
In this embodiment, the original audio data may be obtained by recording with a microphone or a microphone array. The original audio data may be a data file in any of various audio formats obtained by the recording terminal, including but not limited to the ACT, REC, MP3, WAV, WMA, VY1, VY2, DVF, MSC and AIFF formats; the original audio data may also be Pulse Code Modulation (PCM) audio stream data.
The electronic device may display a recording button on the operation interface. When the user clicks the recording button to start recording, the voice recording system begins working and, in a sub-thread, continuously collects audio data in a loop through AudioRecord and AudioChunk, passing the collected audio data to the speech-to-text system so that the speech-to-text system can convert the audio data into text. Here, AudioRecord is the Android media recording tool; AudioChunk is a custom data container that holds a byte array and provides a function for converting the byte array into a short array; and the byte array is used to receive the audio data returned by AudioRecord.
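The byte-array-to-short-array conversion that AudioChunk provides is an ordinary 16-bit PCM reinterpretation. A minimal sketch of that conversion, assuming little-endian signed 16-bit samples as commonly returned by AudioRecord-style APIs:

```python
import struct

def bytes_to_shorts(raw, little_endian=True):
    """Convert a raw 16-bit PCM byte buffer into a list of signed
    16-bit samples, mirroring the byte-array-to-short-array function
    described for AudioChunk."""
    fmt = "<" if little_endian else ">"
    count = len(raw) // 2  # two bytes per 16-bit sample
    return list(struct.unpack(fmt + "%dh" % count, raw[:count * 2]))

# Two little-endian samples: 0x0001 -> 1, 0xFFFF -> -1
print(bytes_to_shorts(b"\x01\x00\xff\xff"))  # -> [1, -1]
```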
a first receiving module 22, configured to receive a first marking instruction for a first time progress of the original audio data;
a second receiving module 23, configured to receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress;
In this embodiment, the timer system is used to record the time progress during recording and playback of the audio data, where the recording time progress of the original audio data corresponds to its playback time progress.
During recording, the timer system can record various time points, including the total recording duration, the start time point of marking (i.e., the first time progress) and the end time point of marking (i.e., the second time progress). The first time progress and the second time progress appear in pairs, there may be one or more first time progresses, and the original audio data between each pair of first and second time progresses is the original audio data requiring particular attention. The first time progress may be the start time point of the entire original audio data or a time point in the middle of it; the second time progress may be the end time point of the entire original audio data or a time point in the middle of it.
a storage module 24, configured to store the first time progress, the second time progress and the original audio data.
A segment of speech can be determined by the first time progress and the second time progress, and the speech can be segmented according to them: the first time progress is the start time point of the segmented speech, and the second time progress is the end time point of the segmented speech. When the original audio data includes multiple pairs of first and second time progresses, the original audio data can be divided into multiple speech segments.
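The segmentation by paired time progresses can be sketched as below; this is an illustrative sketch in which pairs falling outside the recording are simply ignored, an assumption the disclosure does not specify.

```python
def split_by_time_pairs(total_ms, pairs):
    """Given paired (first, second) time progresses in milliseconds,
    return the speech segments they delimit: each first progress is a
    segment's start time point and each second progress its end time
    point. Pairs outside [0, total_ms] are dropped in this sketch."""
    segments = []
    for start_ms, end_ms in sorted(pairs):
        if 0 <= start_ms < end_ms <= total_ms:
            segments.append((start_ms, end_ms))
    return segments

print(split_by_time_pairs(120000, [(38000, 58000), (70000, 95000)]))
# -> [(38000, 58000), (70000, 95000)]
```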
When recording starts, the timer system takes the current time of the electronic device, in milliseconds, as the start time point of the speech, and the Android scheduled-task tool Timer can then update the end time point and the duration of the speech once every millisecond.
In some embodiments, the apparatus further includes:
a marking processing module 26, configured to take the original audio data between the first time progress and the second time progress as target audio data and to mark the target audio data.
The original audio data between the first time progress and the second time progress is the original audio data requiring particular attention. To help the user quickly identify it, the original audio data between the first time progress and the second time progress can be taken as target audio data and marked. In a specific example, the paired first and second time progresses can be indicated on the playback progress bar corresponding to the original audio data, or a dedicated display interface can be used to show the paired time-progress information. For example, if the first time progress is 38 seconds and the second time progress is 58 seconds, these two time points can be marked on the playback progress bar corresponding to the original audio data, and the user can use the marked 38-second and 58-second points to identify the original audio data requiring attention; alternatively, the 38-second and 58-second time points can be displayed on the display interface corresponding to the original audio data, and the user can use the recorded time points to identify that data.
In some embodiments, the apparatus further includes:
a speech recognition module 25, configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
Since the original audio data between the first time progress and the second time progress is the original audio data requiring particular attention, speech recognition may be performed only on that portion to obtain segmented text data. This reduces the speech recognition workload while still ensuring that the user obtains the key content needing attention.
Of course, in this embodiment, speech recognition may also be performed on all of the original audio data. In some embodiments, the speech recognition module 25 is configured to perform speech recognition on the original audio data to obtain first text data; the marking processing module 26 is configured to start marking the first text data from a first start position corresponding to the first time progress, and to stop marking the first text data at a first end position corresponding to the second time progress; and the storage module 24 is configured to store the marked first text data.
In this embodiment, the speech-to-text system is used to convert the original audio data into text. In practice, the original audio data can be converted into the first text data by a speech recognition engine using Automatic Speech Recognition (ASR) technology. ASR is a technology that converts human speech into text; its goal is to enable a computer to "dictate" continuous speech spoken by different people, and it is therefore also called a "voice dictation machine", realizing the conversion from "sound" to "text". In this embodiment, the speech recognition engine may be the Google speech recognition engine, the Microsoft speech recognition engine or the iFlytek speech recognition engine, without limitation; the speech recognition engine converts the speech segments in the original audio data into text information.
Specifically, the original audio format of the original audio data can be converted into a target audio format based on the FFMPEG tool; the original audio data in the target audio format is then segmented to obtain target voice data; and the target voice data is input into the speech recognition engine to obtain the first text data. For example, the original audio data is converted from the PCM format into the MP3 format based on the FFMPEG tool, and the MP3-format original audio data is segmented to obtain target voice data containing speech segments; that is, only the audio segments of the MP3-format original audio data that contain a human voice may be retained. Converting the original audio data into the MP3 format makes it convenient for the user to segment and save the original audio data.
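A PCM-to-MP3 conversion with ffmpeg can be sketched as the command construction below. The sample rate and channel count are assumptions about the capture settings and must match the actual recording; the helper name is invented for illustration.

```python
def build_ffmpeg_pcm_to_mp3_cmd(pcm_path, mp3_path,
                                sample_rate=16000, channels=1):
    """Build an ffmpeg command line converting raw 16-bit
    little-endian PCM into MP3. Raw PCM has no header, so the format,
    sample rate and channel count must be stated explicitly."""
    return [
        "ffmpeg",
        "-f", "s16le",            # raw signed 16-bit little-endian PCM input
        "-ar", str(sample_rate),  # input sample rate
        "-ac", str(channels),     # input channel count
        "-i", pcm_path,
        mp3_path,
    ]

cmd = build_ffmpeg_pcm_to_mp3_cmd("/data/rec_001.pcm", "/data/rec_001.mp3")
print(" ".join(cmd))
# The command could then be run with, e.g., subprocess.run(cmd, check=True).
```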
In some embodiments, the speech-to-text system may also be a streaming speech recognition system based on the deep-learning Transformer model. The streaming speech recognition system supports transcription while recording, that is, converting audio data into text data at the same time as it is recorded, and also supports directly recognizing existing audio data.
After the first marking instruction for the first time progress of the original audio data is received, the first time progress of the original audio data at that moment is recorded, and at the same time marking of the first text data starts from the first start position, which corresponds to the first time progress;
after the second marking instruction for the second time progress of the original audio data is received, the second time progress of the original audio data at that moment is recorded, and at the same time marking of the first text data stops at the first end position, which corresponds to the second time progress.
In this embodiment, the timer system is used to record the time progress during recording and playback of the audio data, where the recording time progress of the original audio data corresponds to its playback time progress. The text style system is used to mark the content of the first text data during recording and playback, including: marking the first text data with a first color, the first color being different from black; and/or bolding the characters in the first text data, so that the user can easily identify the content needing attention in the first text data.
During recording, the timer system records various time points throughout the process of converting the original audio data into the first text data, including the total recording duration, the start time point of marking (i.e., the first time progress) and the end time point of marking (i.e., the second time progress). The first time progress and the second time progress appear in pairs, there may be one or more first time progresses, and the original audio data between each pair of first and second time progresses is the original audio data requiring particular attention, that is, the original audio data corresponding to the text that needs to be corrected. Meanwhile, the first time progresses correspond one-to-one with the first start positions: after the original audio data corresponding to a first time progress is converted into text, its position in the first text data is the first start position. Likewise, the second time progresses correspond one-to-one with the first end positions: after the original audio data corresponding to a second time progress is converted into text, its position in the first text data is the first end position.
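The correspondence between a time progress and a text position can be resolved with a binary search over the per-character time progresses the system stores. This is a minimal sketch under the assumption that the per-character times are kept sorted in ascending order.

```python
import bisect

def time_to_text_position(char_times_ms, progress_ms):
    """Map a time progress (milliseconds) to a character position in
    the first text data, given the time progress at which each
    character was spoken (char_times_ms, sorted ascending)."""
    return bisect.bisect_left(char_times_ms, progress_ms)

# Characters spoken at 0 ms, 500 ms, 1200 ms, 1900 ms, 2600 ms.
char_times = [0, 500, 1200, 1900, 2600]
print(time_to_text_position(char_times, 1200))  # -> 2
```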
When recording starts, the timer system takes the current time of the electronic device, in milliseconds, as the start time point of the speech, and the Android scheduled-task tool Timer can then update the end time point and the duration of the speech once every millisecond.
Specifically, when the electronic device obtains original audio data by recording through a microphone or a microphone array, it may display the first text data corresponding to the original audio data to the user in real time through the operation interface. When the user clicks or selects content of the first text data for the (2k-1)-th time, the position of that content in the first text data is recorded as a first start position; when the user clicks or selects content of the first text data for the 2k-th time, the position of that content in the first text data is recorded as a first end position; and the first text data between the first start position and the first end position is marked, where k is a positive integer. For example, when the user clicks or selects "then" in the first text data for the third time, the position of "then" is recorded as the first start position; when the user clicks or selects "writing pad" in the first text data for the fourth time, the position of "writing pad" is recorded as the first end position, and the text between "then" and "writing pad", "then, since there is a writing pad, why a notepad", is marked, for example by bolding and/or coloring.
When the user clicks or selects "very good" in the first text data for the seventh time, the position of "very good" is recorded as the first start position; when the user clicks or selects "plan" in the first text data for the eighth time, the position of "plan" is recorded as the first end position, and the text "a very good plan" between "very good" and "plan" is marked, for example by bolding and/or coloring.
In this embodiment, the text style system is used to mark the first text data, for example by coloring or bolding, to record the position of the marked content within the voice content, and to synchronize the time points of the marked content. The voice content is the full string of text returned by the speech-to-text system for the whole recording; the marked content is the string returned by the speech-to-text system during the period from the start mark to the end mark. The starting position and ending position of the marked content describe where the marked content lies within the voice content, usually expressed as character indices into the string.
The storage module 24 is specifically configured to store the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress, and the original audio data.
In this embodiment, the voice data operating system is used to store the marked first text data, the first starting position, the first ending position, the first time progress, the second time progress, and the original audio data. The voice data operating system contains a database that stores, for each piece of original audio data, the audio file index, the voice content, the voice duration, all mark data, the position of each character of the voice content within the speech, and so on. The audio file index is the storage path of the audio file; the mark data includes, for each mark, the marked content, the mark positions, and the mark time points; the position of a character within the speech is the time progress in the speech corresponding to that character.
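The per-recording record kept by the voice data operating system might be organized as below. The field names are illustrative assumptions rather than definitions taken from the disclosure:

```python
from dataclasses import dataclass, field

@dataclass
class MarkData:
    content: str       # marked string returned by the speech-to-text system
    start_index: int   # starting position within the voice content
    end_index: int     # ending position within the voice content
    start_time: float  # first time progress, in seconds
    end_time: float    # second time progress, in seconds

@dataclass
class RecordingEntry:
    audio_file_index: str   # storage path of the audio file
    voice_content: str      # full transcribed text
    voice_duration: float   # total length of the recording, in seconds
    marks: list = field(default_factory=list)       # all MarkData for this recording
    char_times: list = field(default_factory=list)  # time progress of each character

entry = RecordingEntry("/data/rec_001.wav", "a very good plan", 72.0)
entry.marks.append(MarkData("very good", 2, 11, 38.0, 41.5))
```

A mark's `start_time`/`end_time` is what later allows the player to jump straight to the matching audio segment.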
In this embodiment, when the original audio data is converted into the first text data, erroneous text or key content can be marked, the marked first text data can be stored, and the corresponding time progress within the original audio data can be stored. When the first text data is later proofread, selecting a marked piece of text allows quick synchronization to the corresponding position in the original audio data according to the stored time progress, which makes it convenient for the user to correct or otherwise process the marked text. This saves the user from listening to the original audio data again from beginning to end, improves the efficiency of correcting the transcribed text, and improves the user experience.
In some embodiments, as shown in Figure 3, the apparatus further includes:
a third receiving module 27, configured to receive a processing instruction for a second position of the first text data, the second position being located between the first starting position and the first ending position.
In this embodiment, when the first text data needs to be corrected, the voice playback system plays the recorded original audio data while the operation interface displays the first text data corresponding to the original audio data to the user in real time. The first text data includes a first part and a second part. The first part lies between each pair of first starting and first ending positions; it is the content that requires attention and has already been marked. The second part lies outside the first starting position and the first ending position and has not been marked. The first part is where speech-to-text errors are likely to occur, while the second part is unlikely to contain errors; therefore, to improve efficiency, only the first part needs to be corrected when correcting the first text data.
The user can click or select any second position located between the first starting position and the first ending position; this is treated as receiving a processing instruction for that second position, meaning that the first text data between the first starting position and the first ending position needs to be corrected. Whenever the user clicks or selects any position between the first starting position and the first ending position, the first text data between those positions is deemed to require correction.
The second processing module 28 is configured to perform speech recognition on the original audio data between the first time progress and the second time progress, and to correct the first text data according to the speech recognition result to obtain second text data.
According to the pre-stored first time progress corresponding to the first starting position and second time progress corresponding to the first ending position, the corresponding position in the original audio data can be located quickly, and the original audio data between the first time progress and the second time progress can be replayed. Specifically, before a processing instruction for the next second position is received, the original audio data between the first time progress and the second time progress can be played in a loop: playback starts at the first time progress, stops at the second time progress, and then returns to the first time progress to start again. Alternatively, playback can stop after a preset number of passes, for example after playing the segment once or twice.
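The looping behavior just described can be sketched as a mapping from elapsed wall-clock time to a position inside the marked segment. This is a bookkeeping sketch under assumed names, with no real audio backend:

```python
def loop_position(elapsed, start, end, max_loops=None):
    """Map time elapsed since playback began to a position inside the
    [start, end) segment, returning to start at the end of each pass.
    Returns None once the preset number of passes has been played."""
    seg = end - start
    if seg <= 0:
        raise ValueError("the second time progress must be later than the first")
    loops_done = int(elapsed // seg)
    if max_loops is not None and loops_done >= max_loops:
        return None  # stop after the preset number of plays
    return start + (elapsed % seg)

# Segment from 38 s to 72 s (1 min 12 s), as in the example below.
print(loop_position(0.0, 38.0, 72.0))                # 38.0 — starts at the first time progress
print(loop_position(40.0, 38.0, 72.0))               # 44.0 — 6 s into the second pass
print(loop_position(35.0, 38.0, 72.0, max_loops=1))  # None — stopped after one play
```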
A speech recognition engine is used to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain a speech recognition result. For example, performing speech recognition on the original audio data between 38 seconds and 1 minute 12 seconds yields the speech recognition result "So, why would you use Notepad when there is WordPad";
a first part and a second part of the first text data are extracted, the first part being located between the first starting position and the first ending position, and the second part being located outside the first starting position and the first ending position; for example, the first part is "So, why do you need Notepad when there is WordPad";
the first part is compared with the speech recognition result, and if the first part is inconsistent with the speech recognition result, the first part is replaced with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
When the speech recognition result obtained by passing the original audio data through the speech recognition engine differs significantly from the first part, the original audio data can be fed to the speech recognition engine for several passes of recognition to improve recognition accuracy. When the first part is inconsistent with the speech recognition result, the first text data can be corrected either by replacing the first part with the speech recognition result, or by modifying the first part so that it agrees with the speech recognition result, thereby obtaining the corrected second text data. For example, the first part "So, why do you need Notepad when there is WordPad" can be replaced with the speech recognition result "So, why would you use Notepad when there is WordPad".
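The splice-and-replace correction step can be sketched as follows. The recognition result is supplied as a plain string here; in the described system it would come from the speech recognition engine, and the example sentences are illustrative:

```python
def correct_segment(first_text, start, end, recognized):
    """Replace the first part of first_text (the span [start, end), i.e. the
    marked text between the first starting and ending positions) with the
    re-recognition result when the two disagree; the second part is kept."""
    first_part = first_text[start:end]
    if first_part == recognized:
        return first_text            # already consistent, nothing to correct
    return first_text[:start] + recognized + first_text[end:]

first_text = "So, why do you need Notepad when there is WordPad, said the speaker."
first_part = "So, why do you need Notepad when there is WordPad"
start = first_text.index(first_part)
end = start + len(first_part)
corrected = correct_segment(first_text, start, end,
                            "So, why would you use Notepad when there is WordPad")
print(corrected)
```

Only the marked span changes; the unmarked second part survives the correction untouched.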
In some embodiments, the storage module 24 is further configured to store the second text data, the first time progress, the second time progress, and the original audio data.
In this embodiment, the above scheme can be used to correct the first text data. Alternatively, the original audio data between the first time progress and the second time progress can be played to the user, and the user can manually correct the first text data after listening to it, thereby obtaining the second text data.
In some embodiments, as shown in Figure 4, the second processing module 28 includes:
an extraction sub-module 281, configured to extract a first part and a second part of the first text data, the first part being located between the first starting position and the first ending position, and the second part being located outside the first starting position and the first ending position;
a comparison sub-module 282, configured to compare the first part with the speech recognition result and, if the first part is inconsistent with the speech recognition result, to replace the first part with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
In some embodiments, as shown in Figure 5, the speech recognition module 25 includes:
a conversion sub-module 251, configured to convert the original audio format of the original audio data into a target audio format based on the FFMPEG tool;
a segmentation sub-module 252, configured to segment the original audio data in the target audio format to obtain target voice data;
a processing sub-module 253, configured to input the target voice data into a speech recognition engine.
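A minimal sketch of the conversion and segmentation steps using the FFMPEG command-line tool follows. The target format (16 kHz mono WAV) and the 30-second segment length are illustrative assumptions; the disclosure does not fix them:

```python
def convert_and_split(src_path, dst_pattern, segment_seconds=30):
    """Build an ffmpeg command that converts the original audio to 16 kHz mono
    WAV (a common input format for speech recognition engines) and splits the
    output into fixed-length segments."""
    cmd = [
        "ffmpeg", "-i", src_path,
        "-ac", "1",                             # downmix to mono
        "-ar", "16000",                         # resample to 16 kHz
        "-f", "segment",                        # split the output...
        "-segment_time", str(segment_seconds),  # ...into fixed-length pieces
        dst_pattern,                            # e.g. "out_%03d.wav"
    ]
    return cmd  # in the real system this would be run via subprocess.run(cmd)

cmd = convert_and_split("recording.m4a", "out_%03d.wav")
print(" ".join(cmd))
```

The resulting segment files would then be fed one by one to the speech recognition engine.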
In some embodiments, the apparatus further includes:
a voice search module, configured to receive a voice search instruction input by the user, perform speech recognition on the voice search instruction, convert the voice search instruction into a search keyword, find the search keyword in the first text data, and mark the position where the search keyword is located.
Specifically, a voice search button can be displayed on the operation interface of the electronic device. If the user inputs voice after clicking the voice search button, this is treated as receiving a voice search instruction input by the user: the voice recording system starts recording and passes the recorded voice data to the speech-to-text system for processing, which converts the voice search instruction into a search keyword. After the user clicks the voice search button, any voice input by the user is treated as a voice search instruction. After receiving the search keyword, the text style system finds the search keyword in the first text data and marks the positions where it occurs, for example by highlighting them. When the user clicks the first text data at a highlighted position, the corresponding original audio data can also be played, starting at the beginning of the highlighted span and stopping at its end. The technical scheme of this embodiment makes it convenient for the user to find the required content in both the text data and the audio data.
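The keyword lookup behind the highlighting can be sketched as follows; the highlight rendering itself and the sample sentence are assumptions:

```python
def find_keyword_spans(text, keyword):
    """Return (start, end) index pairs for every occurrence of keyword in text,
    so each span can be highlighted and mapped back to its audio segment."""
    spans, pos = [], 0
    while True:
        idx = text.find(keyword, pos)
        if idx == -1:
            return spans
        spans.append((idx, idx + len(keyword)))
        pos = idx + len(keyword)

text = "a very good plan is still a plan"
spans = find_keyword_spans(text, "plan")
print(spans)  # [(12, 16), (28, 32)]
```

Each span's indices also index into the stored per-character time positions, which is what lets a click on a highlight start playback at the right point in the audio.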
In some embodiments, the apparatus further includes:
a typo recognition module, configured to receive a typo recognition instruction input by the user, identify typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and mark the typos.
Specifically, a typo recognition button can be displayed on the operation interface of the electronic device. If the user clicks the typo recognition button, this is treated as receiving a typo recognition instruction input by the user; typos in the first text data can then be identified according to a lexicon and a contextual semantic recognition algorithm and marked, for example by highlighting, to remind the user. The user can then correct the typos to improve the accuracy of the text data. In some embodiments, the first receiving module 23 is specifically configured to mark the first text data with a first color, the first color being different from black, and/or to bold the text in the first text data.
Embodiments of the present disclosure further provide a voice data processing apparatus, as shown in Figure 6, including a processor 32 and a memory 31. The memory 31 stores a program or instructions executable on the processor 32, and when the program or instructions are executed by the processor 32, the steps of the voice data processing method described above are implemented.
In some embodiments, the processor 32 is configured to obtain original audio data; receive a first marking instruction for a first time progress of the original audio data; receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress; and store the first time progress, the second time progress, and the original audio data.
In some embodiments, the processor 32 is configured to use the original audio data between the first time progress and the second time progress as target audio data, and to mark the target audio data.
In some embodiments, the processor 32 is configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
In some embodiments, the processor 32 is configured to perform speech recognition on the original audio data to obtain first text data; start marking the first text data from a first starting position, the first starting position corresponding to the first time progress; stop marking the first text data at a first ending position, the first ending position corresponding to the second time progress; and store the marked first text data.
In some embodiments, the processor 32 is configured to receive a processing instruction for a second position of the first text data, the second position being located between the first starting position and the first ending position; and to perform speech recognition on the original audio data between the first time progress and the second time progress, and correct the first text data according to the speech recognition result to obtain second text data.
In some embodiments, the processor 32 is configured to play the original audio data between the first time progress and the second time progress in a loop, and to perform speech recognition on the original audio data between the first time progress and the second time progress using a speech recognition engine to obtain a speech recognition result.
In some embodiments, the processor 32 is configured to extract a first part and a second part of the first text data, the first part being located between the first starting position and the first ending position, and the second part being located outside the first starting position and the first ending position; and to compare the first part with the speech recognition result and, if the first part is inconsistent with the speech recognition result, replace the first part with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
In some embodiments, the processor 32 is configured to convert the original audio format of the original audio data into a target audio format based on the FFMPEG tool; segment the original audio data in the target audio format to obtain target voice data; and input the target voice data into a speech recognition engine.
In some embodiments, the processor 32 is configured to receive a voice search instruction input by the user; perform speech recognition on the voice search instruction and convert it into a search keyword; and find the search keyword in the first text data and mark the position where the search keyword is located.
In some embodiments, the processor 32 is configured to receive a typo recognition instruction input by the user, identify typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and mark the typos.
Embodiments of the present disclosure further provide a readable storage medium storing a program or instructions, and when the program or instructions are executed by a processor, the steps of the voice data processing method described above are implemented.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should be noted that, in this document, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element. In addition, it should be noted that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing functions in the order shown or discussed; depending on the functions involved, functions may also be performed in a substantially simultaneous manner or in the reverse order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a computer software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described above. The specific implementations described above are merely illustrative rather than restrictive. Inspired by the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims (23)

  1. A voice data processing method, characterized in that it comprises:
    obtaining original audio data;
    receiving a first marking instruction for a first time progress of the original audio data;
    receiving a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress;
    storing the first time progress, the second time progress, and the original audio data.
  2. The voice data processing method according to claim 1, characterized in that, after storing the first time progress, the second time progress, and the original audio data, the method further comprises:
    using the original audio data between the first time progress and the second time progress as target audio data, and marking the target audio data.
  3. The voice data processing method according to claim 1, characterized in that, after storing the first time progress, the second time progress, and the original audio data, the method further comprises:
    performing speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
  4. The voice data processing method according to claim 1, characterized in that, after obtaining the original audio data, the method further comprises:
    performing speech recognition on the original audio data to obtain first text data;
    after receiving the first marking instruction for the first time progress of the original audio data, the method further comprises:
    starting to mark the first text data from a first starting position, the first starting position corresponding to the first time progress;
    after receiving the second marking instruction for the second time progress of the original audio data, the method further comprises:
    stopping marking the first text data at a first ending position, the first ending position corresponding to the second time progress;
    storing the marked first text data.
  5. The voice data processing method according to claim 4, characterized in that, after storing the marked first text data, the method further comprises:
    receiving a processing instruction for a second position of the first text data, the second position being located between the first starting position and the first ending position;
    performing speech recognition on the original audio data between the first time progress and the second time progress, and correcting the first text data according to the speech recognition result to obtain second text data.
  6. The voice data processing method according to claim 3 or 5, characterized in that performing speech recognition on the original audio data between the first time progress and the second time progress comprises:
    playing the original audio data between the first time progress and the second time progress in a loop;
    performing speech recognition on the original audio data between the first time progress and the second time progress using a speech recognition engine to obtain a speech recognition result.
  7. The voice data processing method according to claim 5, characterized in that correcting the first text data according to the speech recognition result comprises:
    extracting a first part and a second part of the first text data, the first part being located between the first starting position and the first ending position, and the second part being located outside the first starting position and the first ending position;
    comparing the first part with the speech recognition result, and if the first part is inconsistent with the speech recognition result, replacing the first part with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
  8. The voice data processing method according to claim 3 or 4, characterized in that performing speech recognition on the original audio data comprises:
    converting an original audio format of the original audio data into a target audio format based on the FFMPEG tool;
    segmenting the original audio data in the target audio format to obtain target voice data;
    inputting the target voice data into a speech recognition engine.
  9. The voice data processing method according to claim 4, characterized in that the method further comprises:
    receiving a voice search instruction input by a user;
    performing speech recognition on the voice search instruction, and converting the voice search instruction into a search keyword;
    finding the search keyword in the first text data, and marking the position where the search keyword is located.
  10. The voice data processing method according to claim 4, characterized in that the method further comprises:
    receiving a typo recognition instruction input by a user;
    identifying typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and marking the typos.
  11. The voice data processing method according to claim 4, 9 or 10, wherein marking the first text data comprises:
    marking the first text data with a first color, the first color being different from black; and/or
    bolding characters in the first text data.
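If the transcript is rendered as HTML, the color-and/or-bold marking could be expressed as inline styles (an assumed rendering target; the claims do not prescribe HTML):

```python
def mark_html(text: str, color: str = "red", bold: bool = True) -> str:
    """Render marked text in a first color different from black,
    optionally bolded."""
    styles = [f"color:{color}"]
    if bold:
        styles.append("font-weight:bold")
    return f'<span style="{";".join(styles)}">{text}</span>'
```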
  12. A voice data processing apparatus, comprising:
    an acquisition module, configured to acquire original audio data;
    a first receiving module, configured to receive a first marking instruction for a first time progress of the original audio data;
    a second receiving module, configured to receive a second marking instruction for a second time progress of the original audio data, the second time progress being later than the first time progress;
    a storage module, configured to store the first time progress, the second time progress and the original audio data.
  13. The voice data processing apparatus according to claim 12, wherein the apparatus further comprises:
    a marking processing module, configured to take the original audio data between the first time progress and the second time progress as target audio data, and to mark the target audio data.
  14. The voice data processing apparatus according to claim 12, wherein the apparatus further comprises:
    a speech recognition module, configured to perform speech recognition on the original audio data between the first time progress and the second time progress to obtain segmented text data.
  15. The voice data processing apparatus according to claim 12, wherein the apparatus further comprises:
    a speech recognition module, configured to perform speech recognition on the original audio data to obtain first text data;
    a marking processing module, configured to start marking the first text data from a first start position corresponding to the first time progress, and to stop marking the first text data at a first end position corresponding to the second time progress;
    wherein the storage module is configured to store the marked first text data.
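Mapping the two time progresses to the start and end positions in the transcript presupposes per-word timestamps from the recognition engine. A sketch under that assumption (the `(word, start_s, end_s)` tuple format is illustrative, not mandated by the claims; words are assumed joined by single spaces):

```python
def span_for_times(words, t1: float, t2: float) -> tuple[int, int]:
    """Map a [t1, t2] time window to (start, end) character offsets in the
    transcript built by joining the words with single spaces."""
    start = end = pos = 0
    found_start = False
    for w, ws, we in words:
        if not found_start and we > t1:   # first word still sounding at t1
            start = pos
            found_start = True
        if ws < t2:                       # word begins before the window closes
            end = pos + len(w)
        pos += len(w) + 1                 # +1 for the joining space
    return start, end
```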
  16. The voice data processing apparatus according to claim 15, wherein the apparatus further comprises:
    a third receiving module, configured to receive a processing instruction for a second position of the first text data, the second position being located between the first start position and the first end position;
    a second processing module, configured to perform speech recognition on the original audio data between the first time progress and the second time progress, and to correct the first text data according to a speech recognition result to obtain second text data.
  17. The voice data processing apparatus according to claim 14 or 16, wherein the speech recognition module is specifically configured to loop-play the original audio data between the first time progress and the second time progress, and to perform speech recognition on the original audio data between the first time progress and the second time progress by using a speech recognition engine to obtain a speech recognition result.
  18. The voice data processing apparatus according to claim 16, wherein the second processing module comprises:
    an interception sub-module, configured to extract a first part and a second part of the first text data, the first part being located between the first start position and the first end position, and the second part being located outside the first start position and the first end position;
    a comparison sub-module, configured to compare the first part with the speech recognition result, and if the first part is inconsistent with the speech recognition result, to replace the first part with the speech recognition result to obtain second text data comprising the speech recognition result and the second part.
  19. The voice data processing apparatus according to claim 14 or 15, wherein the speech recognition module comprises:
    a conversion sub-module, configured to convert an original audio format of the original audio data into a target audio format based on the FFMPEG tool;
    a segmentation sub-module, configured to segment the original audio data in the target audio format to obtain target voice data;
    a processing sub-module, configured to input the target voice data into a speech recognition engine.
  20. The voice data processing apparatus according to claim 15, wherein the apparatus further comprises:
    a voice search module, configured to receive a voice search instruction input by a user, perform speech recognition on the voice search instruction to convert it into a search keyword, find the search keyword in the first text data, and mark the position of the search keyword.
  21. The voice data processing apparatus according to claim 15, wherein the apparatus further comprises:
    a typo recognition module, configured to receive a typo recognition instruction input by a user, identify typos in the first text data according to a lexicon and a contextual semantic recognition algorithm, and mark the typos.
  22. A voice data processing apparatus, comprising a processor and a memory, wherein the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the voice data processing method according to any one of claims 1 to 11.
  23. A readable storage medium, wherein the readable storage medium stores a program or instructions, and the program or instructions, when executed by a processor, implement the steps of the voice data processing method according to any one of claims 1 to 11.
PCT/CN2023/092438 2022-05-25 2023-05-06 Voice data processing method and apparatus WO2023226726A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210578264.X 2022-05-25
CN202210578264.XA CN114999464A (en) 2022-05-25 2022-05-25 Voice data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023226726A1 true WO2023226726A1 (en) 2023-11-30

Family

ID=83030036

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092438 WO2023226726A1 (en) 2022-05-25 2023-05-06 Voice data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN114999464A (en)
WO (1) WO2023226726A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999464A (en) * 2022-05-25 2022-09-02 高创(苏州)电子有限公司 Voice data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986657A (en) * 2020-08-21 2020-11-24 上海明略人工智能(集团)有限公司 Audio recognition method and device, recording terminal, server and storage medium
CN112887480A (en) * 2021-01-22 2021-06-01 维沃移动通信有限公司 Audio signal processing method and device, electronic equipment and readable storage medium
CN113539313A (en) * 2021-07-22 2021-10-22 统信软件技术有限公司 Audio marking method, audio data playing method and computing equipment
WO2022022395A1 (en) * 2020-07-30 2022-02-03 华为技术有限公司 Time marking method and apparatus for text, and electronic device and readable storage medium
CN114079695A (en) * 2020-08-18 2022-02-22 北京有限元科技有限公司 Method, device and storage medium for recording voice call content
CN114999464A (en) * 2022-05-25 2022-09-02 高创(苏州)电子有限公司 Voice data processing method and device


Also Published As

Publication number Publication date
CN114999464A (en) 2022-09-02


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810799

Country of ref document: EP

Kind code of ref document: A1