WO2019227548A1 - Speech recognition method, device, computer device and storage medium - Google Patents

Speech recognition method, device, computer device and storage medium (语音识别方法、装置、计算机设备及存储介质)

Info

Publication number
WO2019227548A1
Authority
WO
WIPO (PCT)
Prior art keywords
standard
sentence
matched
frame
conversion
Prior art date
Application number
PCT/CN2018/092568
Other languages
English (en)
French (fr)
Inventor
彭捷
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019227548A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Speech classification or search using natural language modelling, using context dependencies, e.g. language models
    • G10L 15/187: Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/26: Speech to text systems
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/12: Use of codes for handling textual entities
    • G06F 40/151: Transformation
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present application relates to the technical field of speech processing, and in particular, to a speech recognition method, device, computer device, and storage medium.
  • a speech recognition method includes: acquiring voice data input by a user according to an original text, and segmenting the voice data into speech segments using a silence detection algorithm; performing recognition and conversion processing on each of the speech segments to obtain conversion sentences and a conversion serial number of each conversion sentence, and creating a corresponding variable storage space for each conversion sentence; preprocessing the original text to obtain standard sentences and a standard serial number of each standard sentence; determining a segmentation length according to the standard sentences, and performing string segmentation on each conversion sentence according to the segmentation length to obtain strings to be matched; for each string to be matched, matching the string against the standard sentences, and storing the standard serial number of a successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken; analyzing the standard serial numbers in the variable storage spaces to obtain incorrectly converted speech segments and the standard sentences corresponding to those speech segments; and storing the incorrectly converted speech segments and their corresponding standard sentences in a speech database as a data set, and training a speech recognition model on the data set, so that the trained speech recognition model corrects errors in voice data in which polyphonic characters or accents of the same type are detected.
  • a speech recognition device includes:
  • a voice segmentation module configured to obtain voice data input by a user according to the original text, and use the silence detection algorithm to segment the voice data into voice segments;
  • a speech recognition module configured to perform recognition conversion processing on each of the speech segments, obtain a conversion sentence and a conversion serial number of each of the conversion sentences, and create a corresponding variable storage space for each of the conversion sentences;
  • a text processing module for preprocessing the original text to obtain a standard sentence and a standard serial number of each of the standard sentences
  • a sentence segmentation module configured to determine a segmentation length according to the standard sentence, and perform string segmentation on each of the conversion sentences according to the segmentation length to obtain a character string to be matched;
  • a text matching module, configured to match each string to be matched against the standard sentences, and to store the standard serial number of a successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string to be matched was taken;
  • An analysis and processing module configured to analyze and process the standard sequence number in the variable storage space to obtain the incorrectly converted speech segment and the standard sentence corresponding to the speech segment;
  • an error correction processing module, configured to store the incorrectly converted speech segments and their corresponding standard sentences in a speech database as a data set, and to train a speech recognition model on the data set, so that the trained speech recognition model corrects errors in voice data in which polyphonic characters or accents of the same type are detected.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the above speech recognition method when executing the computer-readable instructions.
  • one or more non-volatile readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the above speech recognition method.
  • FIG. 1 is a schematic diagram of an application environment of a speech recognition method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a speech recognition method according to an embodiment of the present application.
  • FIG. 3 is a specific flowchart of step S2 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S3 in FIG. 2;
  • FIG. 5 is a specific flowchart of step S5 in FIG. 2;
  • FIG. 6 is a specific flowchart of step S6 in FIG. 2;
  • FIG. 7 is a schematic block diagram of a speech recognition device according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the speech recognition method provided in this application can be applied in the application environment shown in FIG. 1, which includes a server and a client, where the server and the client are connected through a network, and the user performs voice input through the client.
  • the server recognizes the voice input by the user, and trains the voice recognition model according to the recognition result.
  • the client can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • the speech recognition method provided in the embodiment of the present application is applied to a server.
  • FIG. 2 shows a flowchart of the speech recognition method in this embodiment.
  • the method is applied to the server in FIG. 1 to train a speech recognition model.
  • the speech recognition method includes steps S1 to S7, which are detailed as follows:
  • S1 Obtain the voice data input by the user according to the original text, and use the silence detection algorithm to cut the voice data into voice segments.
  • the original text is a text template provided for the user.
  • the user reads aloud according to the text template on the client.
  • the client uploads the recorded voice data to the server, and the server uses the acquired voice data as training samples for training the speech recognition model.
  • it should be noted that transcribing long stretches of voice data consumes considerable system resources, and the server's automatic alignment during the recognition of long voice data lowers recognition accuracy. The server therefore segments the voice data with a silence detection algorithm: the voice data is divided into frames, the frame energy of each speech frame is calculated, and the mute segments of the audio data are determined from the frame energies, so that silence and pauses are recognized accurately and the voice data is segmented sentence by sentence into speech segments shorter than a preset time length. The preset time length may be, for example, 10 seconds, but is not limited to this and can be set according to the needs of the actual application; no limitation is imposed here.
  • S2 Recognize and convert each speech segment, obtain the conversion sentence and the conversion serial number of each conversion sentence, and create a corresponding variable storage space for each conversion sentence.
  • speech recognition is performed on each speech segment to convert it into text form; punctuation marks in the text are deleted and empty text is removed, yielding the conversion sentences. The conversion sentences may be stored in the database in the form of an array or a matrix; a conversion serial number is assigned to each conversion sentence according to the chronological order of the speech segments in the voice data, and a corresponding variable storage space is created for each conversion sentence.
  • preferably, the conversion sentences may be stored in the database as an array, with the conversion sentence of each speech segment as one element. A character array str is defined as the identification information of the voice data: str includes the X+1 elements str[0] to str[X], where str[0] is the first conversion sentence, str[1] the second, and str[X] the (X+1)-th. Correspondingly, str0, str1, ..., strX are the conversion serial numbers of the conversion sentences, and FLAG0, FLAG1, ..., FLAGX are the variable storage spaces corresponding to each conversion sentence.
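  • As a rough illustration of this bookkeeping, the following minimal sketch (function and variable names are ours, not the patent's) stores the converted sentences in an array and pairs each one with an initially empty list playing the role of its variable storage space FLAG0 ... FLAGX:

      # Minimal sketch of the str / FLAG bookkeeping described above.
      def index_conversion_sentences(converted_texts):
          str_arr = list(converted_texts)      # str[0] .. str[X], in chronological order
          flags = [[] for _ in str_arr]        # FLAG0 .. FLAGX, one list per sentence
          return str_arr, flags

      str_arr, flags = index_conversion_sentences(["我是一个中国人", "我为我自豪"])
      # flags[1] will later collect the standard serial numbers matched for str[1].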
  • S3 Preprocess the original text to obtain standard sentences and the standard serial number of each standard sentence.
  • in this embodiment, the original text is segmented into sentences at the punctuation marks, the punctuation marks are deleted, and the original text is traversed; if it contains a non-Chinese character string, that string is converted into Chinese, for example "1" into "一" (one) and "kg" into "千克" (kilogram). After the original text is segmented and converted, the standard sentences are obtained; they may be stored in the database in the form of an array or a matrix, and each standard sentence is assigned a standard serial number.
  • preferably, the standard sentences may be stored in the database as an array, with each standard sentence as one element. A character array arr is defined as the identification information of the original text: arr includes the Y+1 elements arr[0] to arr[Y], where arr[0] is the first standard sentence, arr[1] the second, and arr[Y] the (Y+1)-th. Correspondingly, arr0, arr1, ..., arrY are the standard serial numbers of the standard sentences.
  • S4 Determine the segmentation length according to the standard sentences, and perform string segmentation on each conversion sentence according to the segmentation length to obtain the strings to be matched.
  • in this embodiment, among all the standard sentences, the minimum string length is obtained and determined as the segmentation length, and each conversion sentence is cut into strings of that length to obtain the strings to be matched. For example, for the conversion sentences str[0] = "我是一个中国人" ("I am a Chinese person") and str[1] = "我为我自豪" ("I am proud of myself"), a segmentation length of four characters cuts str[0] into the strings to be matched "我是一个" and "中国人", and str[1] into "我为我自" and "豪", continuing until the string segmentation of every conversion sentence in the array str is complete.
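  • A hedged sketch of step S4, assuming plain Python strings (the chunking by minimum standard-sentence length follows the description; the function name is illustrative):

      # Cut every conversion sentence into pieces of the segmentation length.
      def split_for_matching(conversion_sentences, standard_sentences):
          seg_len = min(len(s) for s in standard_sentences)    # shortest standard sentence
          chunks = []
          for idx, sent in enumerate(conversion_sentences):
              for start in range(0, len(sent), seg_len):       # the tail chunk may be shorter
                  chunks.append((idx, sent[start:start + seg_len]))
          return chunks

    With a segmentation length of 4, "我是一个中国人" yields the chunks "我是一个" and "中国人", matching the example above.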
  • S5 For each string to be matched, match the string against the standard sentences, and store the standard serial number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken.
  • specifically, each string to be matched is matched against the standard sentences; if content consistent with the string to be matched is found in a standard sentence, the match is confirmed as successful, and the standard serial number arrY of the successfully matched standard sentence is stored in the variable storage space FLAGX corresponding to the conversion sentence in which the string originated. A variable storage space can store multiple standard serial numbers.
  • S6 Analyze the standard serial numbers in the variable storage spaces to obtain the incorrectly converted speech segments and the standard sentences corresponding to those speech segments.
  • in this embodiment, the standard serial numbers in each variable storage space are traversed; if identical standard serial numbers exist, only one of them is retained and the rest are deleted. After this de-duplication, the conversion sentences that failed to match are obtained, the speech segments corresponding to those conversion sentences are marked as incorrectly converted speech segments, and the standard sentences corresponding to those speech segments are determined.
  • it should be noted that when a variable storage space is empty, or the standard serial numbers in the variable storage spaces are non-consecutive or duplicated, the conversion sentence corresponding to that variable storage space failed to match: it is text content that speech recognition converted incorrectly, while the standard sentence corresponding to the standard serial number stored in that variable storage space is the correct text content.
  • S7 Store the incorrectly converted speech segments and their corresponding standard sentences in a speech database as a data set, and train the speech recognition model on the data set, so that the trained speech recognition model corrects errors in voice data in which polyphonic characters or accents of the same type are detected.
  • specifically, the incorrectly converted speech segments and their corresponding standard sentences obtained in step S6 are stored in the speech database as a data set; the speech database is the server's speech corpus. Using the data set in the speech database, the server can train the speech recognition model to strengthen its adaptivity, so that the trained model can handle more environments and accents and, when polyphonic characters are detected or voice data with the same type of accent is encountered, has the ability to adjust itself and correct errors, improving the accuracy of the speech recognition model.
  • in this embodiment, the voice data is segmented into speech segments using a silence detection algorithm; after each speech segment is recognized and converted and the original text is preprocessed, the conversion sentences are cut into strings to be matched, the strings are matched against the standard sentences, the standard serial numbers of the successfully matched standard sentences are stored in the variable storage spaces corresponding to the conversion sentences from which the strings were taken, and finally the standard serial numbers in the variable storage spaces are analyzed, and the incorrectly converted speech segments and their corresponding standard sentences are obtained and stored in the speech database. By matching the speech-converted text against the standard text, converted words that are wrong, missing, or redundant can be identified and stored in the speech database for machine model learning, strengthening the adaptivity of the speech recognition model so that it can handle more environments and accents and has the ability to adjust and correct errors, thereby improving its speech recognition accuracy.
  • in one embodiment, a specific implementation of segmenting the voice data into speech segments using the silence detection algorithm mentioned in step S2 is described in detail. Referring to FIG. 3, which shows a specific flowchart of step S2, the details are as follows:
  • S21 Preprocess the voice data to obtain audio data, where the audio data contains the sample values of n sampling points and n is a positive integer.
  • in this embodiment, pulse code modulation (PCM) is used to encode the acquired voice data: the analog signal of the voice data is sampled at one sampling point every preset interval so as to discretize it. The preset interval is determined by the sampling frequency of the PCM coding; the specific sampling frequency can be set from historical experience, for example 8000 Hz, meaning 8000 sampling signals are collected per second, or set according to the actual application, with no restriction here.
  • further, the sampling signals of the n sampling points are quantized, and the quantized digital signals are output as the sample values of the sampling points in the form of binary code groups, yielding the audio data, where n is the product of the duration of the voice data and the sampling frequency.
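  • As one possible reading of S21, assuming the recording is already available as a mono 16-bit PCM WAV file sampled at 8000 Hz, the sample values can be obtained with the Python standard library (the file name is hypothetical):

      import wave
      import array

      # Load the n sample values of the recording (n = duration x sampling frequency).
      def load_pcm_samples(path="recording.wav"):
          with wave.open(path, "rb") as wav:
              raw = wav.readframes(wav.getnframes())
          return list(array.array("h", raw))   # 16-bit signed sample values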
  • S22 Perform framing processing on the audio data according to a preset frame length and a preset step size to obtain K speech frames, where K is a positive integer.
  • in this embodiment, the audio data is divided into non-overlapping frames according to the preset frame length and step size: the frame length is the length of each acquired speech frame and the step size is the time interval between acquired speech frames; when the frame length equals the step size, the speech frames obtained after framing do not overlap. K is the quotient of the duration of the voice data divided by the duration of one speech frame.
  • specifically, the frame length can be set in the range 0.01 s to 0.03 s, over which short period the speech signal is relatively stationary; for example, the frame length may be set to 0.01 s, or according to the needs of the actual application, with no restriction here.
  • for example, if the frame length is 0.01 s, the step size is 0.01 s, and the sampling frequency is 8000 Hz with 8000 sampling signals collected per second, the audio data is framed with 80 sample values per speech frame; if the last speech frame has fewer than 80 sample values, data with sample value 0 is appended to it so that it also contains 80 sample values.
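  • Under the numbers of this example (8000 Hz, frame length = step size = 0.01 s, hence 80 samples per frame), the framing with zero-padding of the last frame can be sketched as:

      # Split the sample values into non-overlapping frames of 80 samples each.
      def frame_audio(samples, frame_len=80):
          frames = []
          for start in range(0, len(samples), frame_len):
              frame = samples[start:start + frame_len]
              frame += [0] * (frame_len - len(frame))   # pad the final short frame with 0s
              frames.append(frame)
          return frames                                  # K speech frames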
  • S23 Calculate the frame energy of each speech frame from the sample values.
  • specifically, the frame energy is the short-time energy of the speech signal and reflects the amount of speech information carried by a speech frame; the frame energy of each speech frame is calculated according to formula (1):
  • Ene[i] = A × sum(Xi²)    (1)
  • where Ene[i] is the frame energy of the i-th speech frame, A is a preset adjustment factor, and sum(Xi²) is the sum of the squares of the sample values of the sampling points contained in the i-th speech frame.
  • it should be noted that the adjustment factor A is preset according to the characteristics of the voice data, to avoid the situation in which the volume of the sentences in the voice data is too low or the background noise too loud, making sentences poorly distinguishable from silence and harming the accuracy of speech segmentation.
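  • Formula (1) translates directly into code; the adjustment factor A = 1.0 below is an illustrative default, since the description only says that A is preset to suit the voice data:

      # Ene[i] = A * sum(Xi^2): short-time energy of one speech frame.
      def frame_energy(frame, A=1.0):
          return A * sum(x * x for x in frame)

      # e.g. energies = [frame_energy(f) for f in frames]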
  • S24 For each speech frame, if the frame energy of the speech frame is less than a preset frame-energy threshold, mark the speech frame as a mute frame.
  • in this embodiment, the frame-energy threshold is a preset parameter: if a calculated frame energy is below it, the corresponding speech frame is marked as a mute frame. The threshold can be set from historical experience, for example 0.5, or from a specific analysis of the calculated frame energies of the speech frames; no limitation is imposed here.
  • S25 If the number of consecutive mute frames detected is greater than a preset mute-frame-count threshold, mark those consecutive mute frames as a mute segment.
  • in this embodiment, the mute-frame-count threshold is a preset parameter: if a run of consecutive mute frames longer than the threshold is detected, the run is marked as a mute segment. The mute-frame-count threshold can likewise be set from historical experience, for example 5, or from a specific analysis of the calculated frame energies; no limitation is imposed here.
  • S26 Determine the split frames of the voice data from the mute segments, and segment the voice data at the split frames to obtain the speech segments.
  • specifically, to ensure that no sentence is cut through and that there is some margin before and after each sentence, the middle frame of each mute segment's run of consecutive frame numbers is used as the separation point; if the number of consecutive frames is even, either of the two middle frame numbers (for example, the smaller one) may be marked as the split frame, with no limitation here.
  • for example, with a frame-energy threshold of 0.5 and a mute-frame-count threshold of 5, suppose the frame energies Ene1, Ene2, Ene8, Ene13, Ene14, Ene15, Ene16, Ene17, and Ene18 are all below 0.5. The frame numbers of the speech frames whose energies fall below the threshold are marked as mute frames, runs of more than 5 consecutive frame numbers are then filtered out, the frame numbers corresponding to Ene13 through Ene18 are marked as a mute segment, the smaller middle frame number of the run is taken, and the 15th speech frame is marked as the split frame.
  • further, the audio data is segmented at the marked split frames, and the frames between split points are combined into independent speech segments.
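  • Putting S24 to S26 together, a minimal sketch using the thresholds from this example (frame-energy threshold 0.5, mute-frame-count threshold 5; list indices are 0-based, so index 14 is the 15th frame):

      # Mark mute frames, find mute segments, and cut at their middle frames.
      def split_on_silence(energies, energy_thr=0.5, count_thr=5):
          mute = [e < energy_thr for e in energies]
          split_frames = []
          i = 0
          while i < len(mute):
              if not mute[i]:
                  i += 1
                  continue
              j = i
              while j < len(mute) and mute[j]:
                  j += 1                          # [i, j) is a run of mute frames
              if j - i > count_thr:               # long enough to be a mute segment
                  split_frames.append(i + (j - i - 1) // 2)   # smaller middle frame
              i = j
          # Cut the frame sequence at the split frames into speech segments.
          segments, prev = [], 0
          for s in split_frames:
              segments.append((prev, s))
              prev = s + 1
          segments.append((prev, len(energies)))
          return segments                          # (start, end) frame index ranges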
  • in this embodiment, the voice data is preprocessed to obtain audio data, the audio data is divided into multiple speech frames, and the frame energy of each speech frame is calculated from the sample values. If a frame's energy is below the preset frame-energy threshold, the frame is marked as a mute frame; if the number of consecutive mute frames exceeds the preset mute-frame-count threshold, the run is marked as a mute segment and the frame number of the split frame is determined; finally the voice data is segmented at the split frames to obtain the speech segments. By framing the voice data, computing each frame's energy, and locating the mute segments from the frame energies, silence and pauses in the voice data are recognized accurately and sentences are segmented correctly, without destroying sentence integrity, achieving correct segmentation of the voice data.
  • in one embodiment, a specific implementation of preprocessing the original text, mentioned in step S3, to obtain the standard sentences and the standard serial number of each standard sentence is described in detail. Referring to FIG. 4, which shows a specific flowchart of step S3, the details are as follows:
  • S31 Segment the original text into sentences according to preset punctuation marks to obtain the segmented sentences.
  • in this embodiment, the preset punctuation marks may be the enumeration comma, comma, semicolon, period, question mark, or exclamation mark, but are not limited to these and can be set according to the needs of the actual application; no limitation is imposed here.
  • specifically, the original text is traversed; whenever a preset punctuation mark is detected, the text is cut at that mark, so that the original text is segmented sentence by sentence into single sentences, and all punctuation marks in the single sentences are deleted, yielding the segmented sentences. For example, the original text "小王只会把心思花在钻法律的空子上，像这样的"聪明人"还是少一点好。" is cut at the preset punctuation marks into two single sentences; after all punctuation marks are deleted, the segmented sentences "小王只会把心思花在钻法律的空子上" and "像这样的聪明人还是少一点好" are obtained.
  • S32 Traverse each segmented sentence; if the segmented sentence contains a non-Chinese character string, convert the non-Chinese character string into Chinese to obtain a standard sentence, and assign a standard serial number to each standard sentence.
  • in this embodiment, character strings include Chinese character strings and non-Chinese character strings. The segmented sentences obtained in step S31 are all traversed and searched; if a segmented sentence is detected to contain a non-Chinese character string, the content of that string is obtained and the string is converted into Chinese.
  • specifically, if the content of the non-Chinese character string is a date, the year, month, and day are converted according to a preset requirement, which is set according to the form in which the speech recognition model recognizes and converts dates; no limitation is imposed here. For example, if the server outputs recognized years, months, and days between the years 1000 and 2500 in digit form, the preset requirement may specifically be that years earlier than 1000 or later than 2500 are converted into Chinese, while months and days are not.
  • further, if the content of the non-Chinese character string is not a date but numeric, a preset Chinese-numeral array {'零','一','二','三','四','五','六','七','八','九'} and a preset digit-weight array {'','十','百','千','万','十万','百万','千万','亿'} are used to convert it, the weight of the ones place being empty. The type of the numeric content is judged first: if it is an integer, each Arabic digit of the integer is taken from left to right, replaced with Chinese using the Chinese-numeral array, and given its weight from the digit-weight array; for example, "213" is converted to "二百一十三" (two hundred and thirteen).
  • if the numeric content is a decimal, it is divided into an integer part and a fractional part. Each Arabic digit of the integer part is taken from left to right, replaced with Chinese using the Chinese-numeral array, and given its weight from the digit-weight array; each Arabic digit of the fractional part is then taken from left to right and replaced with Chinese using the Chinese-numeral array. Finally the decimal point is converted to "点" (point) and placed between the two converted parts; for example, "20.3" is converted to "二十点三" (twenty point three).
  • at the same time, zero-removal is applied after the numeric content is converted. Specifically, if the trailing digits of an integer, or of the integer part of a decimal, are "0", then after conversion only the Chinese up to the rightmost non-zero digit is kept and the Chinese converted from the trailing "0" digits is deleted; for example, "1000" first converts to "一千零百零十零", then "零百零十零" is deleted to give "一千" (one thousand). If a digit "0" lies between two non-zero digits of the integer part, the Chinese converted from that "0" is deleted and replaced with a single "零"; for example, "1001" converts to "一千零一".
  • further, if the content of the non-Chinese character string contains a physical unit, the unit is converted directly into Chinese, for example "kg" into "千克" (kilogram) and "cm" into "厘米" (centimeter); and if it contains a percent sign, the "%" is deleted and "百分之" is added before the Chinese converted from the numeric part, so that "33%" becomes "百分之三十三" (thirty-three percent).
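  • The digit-to-Chinese procedure above can be sketched directly from the two arrays given in the description; the naive digit-plus-weight expansion followed by the two zero-removal rules reproduces the "213", "1000", "1001", "20.3", and "33%" examples (the function names are ours, not the patent's):

      CN_DIGITS = ['零','一','二','三','四','五','六','七','八','九']
      CN_WEIGHTS = ['','十','百','千','万','十万','百万','千万','亿']   # ones place is empty

      def int_to_chinese(num_str):
          n = len(num_str)
          # Step 1: replace each Arabic digit and attach its weight, zeros included.
          groups = [(int(c), CN_DIGITS[int(c)] + CN_WEIGHTS[n - 1 - i])
                    for i, c in enumerate(num_str)]
          # Step 2: delete the Chinese converted from trailing "0" digits.
          while len(groups) > 1 and groups[-1][0] == 0:
              groups.pop()
          # Step 3: collapse internal runs of zero groups into a single '零'.
          out, zero_run = [], False
          for d, text in groups:
              if d == 0:
                  zero_run = True
              else:
                  if zero_run:
                      out.append('零')
                      zero_run = False
                  out.append(text)
          return ''.join(out) or '零'

      def number_to_chinese(num_str):
          if num_str.endswith('%'):                    # "33%" -> "百分之三十三"
              return '百分之' + number_to_chinese(num_str[:-1])
          if '.' in num_str:                           # "20.3" -> "二十点三"
              head, frac = num_str.split('.')
              return int_to_chinese(head) + '点' + ''.join(CN_DIGITS[int(c)] for c in frac)
          return int_to_chinese(num_str)               # "1000" -> "一千", "1001" -> "一千零一"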
  • to better understand this step, the conversion of non-Chinese character strings into Chinese is illustrated with a concrete example. Suppose the original text of the text template is: 2017年5月12日第55位读者阅读第47297章节完成33%预计200天完成阅读。 Then the standard text obtained after conversion is: 2017年5月12日第五十五位读者阅读第四万七千二百九十七章节完成百分之三十三预计二百天完成阅读。 After the standard sentences are obtained, each standard sentence is assigned a standard serial number.
  • in this embodiment, segmenting the original text into sentences at the preset punctuation marks yields the segmented sentences and puts the original text into sentence form, which improves the efficiency of matching against the speech-converted text. After the segmented sentences are obtained, each is traversed and its non-Chinese character strings are converted into Chinese, yielding the standard sentences used for matching against the strings to be matched. This raises the match rate between the standard text and the speech-converted text and avoids lowering speech recognition accuracy merely because the same content is displayed in different forms.
  • in one embodiment, a specific implementation of the process mentioned in step S5 is described in detail: for each string to be matched, matching the string against the standard sentences and storing the standard serial number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken. Referring to FIG. 5, which shows a specific flowchart of step S5, the details are as follows:
  • S51 Set the first standard sentence as the matching start point, and determine the matching range from the start point.
  • in this embodiment, the first standard sentence is set as the matching start point and the matching range is determined from it, for matching against the first string to be matched.
  • the matching range consists of the standard sentences obtained from the start point onward, in standard-serial-number order, up to the matching-range value. The matching-range value may be preset according to the string length of the standard sentences (the longer the standard sentences, the smaller the value) or generated from the number of standard sentences; for example, the matching-range value may be set to 5.
  • S52 Match each string to be matched against the standard sentences within the matching range, in the order of the conversion serial numbers of the conversion sentences; if content consistent with the string to be matched is found in a standard sentence within the matching range, confirm the match as successful, otherwise confirm it as failed.
  • specifically, the strings to be matched, as cut from the conversion sentences, are obtained one after another in conversion-serial-number order and matched against the standard sentences within the matching range in standard-serial-number order; if content consistent with the string is found in a standard sentence, the match is confirmed as successful, otherwise as failed. For example, the string to be matched "我是一个", cut from the conversion sentence str[0], matches content in the standard sentence arr[0] = "我是一个中国人", so the match is confirmed as successful.
  • S53 If the match succeeds, store the standard serial number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken, and use the successfully matched standard sentence as the matching start point of the next string to be matched.
  • specifically, if the standard sentence at the start point does not contain the current string to be matched, further standard sentences within the matching range are fetched for matching; once the string matches a standard sentence within the range, that sentence's standard serial number is stored in the variable storage space corresponding to the conversion sentence from which the string was taken, and the successfully matched standard sentence becomes the start point for the next string.
  • S54 If the match fails, use the next string to be matched against the standard sentences within the matching range, until all strings to be matched have been processed.
  • further, if no content consistent with the string is found within the matching range, the match is confirmed as failed, the matching range is left unchanged, and the next string to be matched is matched against the same range; the strings are taken in conversion-serial-number order until all have been matched, yielding the standard serial numbers stored in each variable storage space.
  • to better understand the matching of strings to be matched against standard sentences, consider a concrete example. Suppose the current string to be matched is "我为我自", cut from str[1], the matching start point is arr[0], and the matching-range value is 5. If arr[0] does not contain the content of the string "我为我自", the next standard sentence arr[1] is used for matching; if content consistent with "我为我自" is found in arr[1], the standard serial number arr1 of the currently matched standard sentence is stored in the variable storage space FLAG1 corresponding to the conversion sentence str[1] from which the string was taken, the successfully matched standard sentence arr[1] becomes the start point of the next string, and the matching range becomes arr[1], arr[2], arr[3], arr[4], and arr[5].
  • if none of the standard sentences arr[0], arr[1], arr[2], arr[3], and arr[4] in the matching range contains the content of the string "我为我自", the match is confirmed as failed, the matching range is left unchanged, and the next string to be matched is matched against the same range. The strings to be matched are obtained in conversion-serial-number order and matched in this way until all have been processed.
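  • A condensed sketch of the S51 to S54 loop, taking the (conversion index, string) chunks in conversion-serial-number order; the matching-range value 5 and the start-point update follow the description, while the names are illustrative:

      RANGE = 5   # matching-range value suggested in the description

      def match_chunks(chunks, standard_sentences, num_conversion_sentences):
          flags = [[] for _ in range(num_conversion_sentences)]   # variable storage spaces
          start = 0                              # S51: first standard sentence as start point
          for conv_idx, chunk in chunks:         # chunks in conversion-serial-number order
              for std_idx in range(start, min(start + RANGE, len(standard_sentences))):
                  if chunk in standard_sentences[std_idx]:
                      flags[conv_idx].append(std_idx)   # store the standard serial number
                      start = std_idx                   # S53: new start point for next chunk
                      break
              # S54: on failure the range is unchanged and the next chunk is tried.
          return flags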
  • in this embodiment, setting the first standard sentence as the matching start point and determining the matching range from it removes the need to match against all standard sentences, improving resource utilization. On success, the matched sentence's standard serial number is stored in the variable storage space corresponding to the conversion sentence from which the string was taken, and the matched sentence becomes the start point of the next string, so matching need not restart from the first standard sentence, improving matching efficiency. On failure, the next string is matched against the same range until all strings have been processed, yielding the standard serial numbers stored in each variable storage space. Matching in conversion-serial-number and standard-serial-number order within a bounded range raises the match rate between the converted text and the original text.
  • in one embodiment, a specific implementation of analyzing the standard serial numbers in the variable storage spaces, mentioned in step S6, to obtain the incorrectly converted speech segments and their corresponding standard sentences is described in detail. Referring to FIG. 6, which shows a specific flowchart of step S6, the details are as follows:
  • S61 De-duplicate the standard serial numbers in each variable storage space: if at least two identical standard serial numbers exist in a variable storage space, retain any one of them and delete the rest.
  • in this embodiment, once all strings to be matched have been matched, the standard serial numbers stored in each variable storage space are de-duplicated; if at least two identical standard serial numbers are detected in a space, any one is retained and the rest are deleted, so that the standard serial numbers stored within each variable storage space are all distinct.
  • S62 If the standard serial numbers across the variable storage spaces are non-consecutive or duplicated, take the variable storage spaces holding the non-consecutive standard serial numbers, or those holding the duplicated standard serial numbers, as spaces to be corrected.
  • specifically, if the standard serial numbers across the variable storage spaces are non-consecutive or duplicated, it is confirmed that the conversion sentences corresponding to those spaces do not match the standard sentences and that the server's recognition and conversion of the voice data contains errors; those spaces are marked as spaces to be corrected.
  • for example, if the standard serial numbers stored in FLAG35 are [arr35, arr37], FLAG35 is marked as a space to be corrected; if FLAG35 stores [arr35] and FLAG36 stores [arr37], FLAG35 and FLAG36 are marked as spaces to be corrected; and if FLAG35 stores [arr35] and FLAG36 also stores [arr35], FLAG35 and FLAG36 are marked as spaces to be corrected.
  • S63 If a variable storage space is empty, take that space and its two adjacent variable storage spaces as spaces to be corrected.
  • specifically, if a variable storage space is empty, it is confirmed that the conversion sentence corresponding to it does not match the standard sentences and that the server's recognition and conversion of the voice data contains an error; that space and its two adjacent variable storage spaces are marked as spaces to be corrected. For example, if FLAG3 stores no standard serial number, the conversion sentence str[3] corresponding to FLAG3 matched no standard sentence, and FLAG2, FLAG3, and FLAG4 are marked as spaces to be corrected.
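  • The analysis of S61 to S63 can be sketched as one pass over the variable storage spaces (0-based indices here; the consecutive/duplicate test compares successive serial numbers across all spaces, as in the FLAG35/FLAG36 examples):

      # Return the indices of the variable storage spaces to be corrected.
      def find_spaces_to_correct(flags):
          deduped = [sorted(set(f)) for f in flags]     # S61: drop duplicates per space
          to_correct = set()
          for i, nums in enumerate(deduped):
              if not nums:                              # S63: empty space and its neighbours
                  to_correct.update(j for j in (i - 1, i, i + 1) if 0 <= j < len(deduped))
          flat = [(i, n) for i, nums in enumerate(deduped) for n in nums]
          for (i, a), (j, b) in zip(flat, flat[1:]):    # S62: successive serial numbers
              if b != a + 1:                            # gap (non-consecutive) or repeat
                  to_correct.update((i, j))
          return sorted(to_correct)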
  • it should be noted that there is no necessary order of execution between steps S62 and S63; they may be executed in parallel, with no limitation here.
  • S64 Determine, from the spaces to be corrected and the standard serial numbers they contain, the incorrectly converted speech segments and the standard sentences corresponding to those speech segments.
  • further, after the standard serial numbers in the variable storage spaces are analyzed, the spaces to be corrected are obtained, and from them and the standard serial numbers they contain, the incorrectly converted speech segments and their corresponding standard sentences are determined among the speech segments and standard sentences.
  • for example, if FLAG35 stores [arr35] and FLAG36 stores [arr37], FLAG35 and FLAG36 are marked as spaces to be corrected; the speech segments whose conversion sentences are str[35] and str[36], corresponding to FLAG35 and FLAG36, are determined to be the incorrectly converted speech segments, and the standard sentences corresponding to the standard serial numbers [arr35] and [arr37] contained in FLAG35 and FLAG36 are obtained as the correct content of those speech segments.
  • in this embodiment, analyzing the standard serial numbers in all the variable storage spaces finds the cases where a conversion sentence does not match the standard sentences; the mismatching variable storage spaces are marked as spaces to be corrected, and from those spaces and the standard serial numbers they contain, the incorrectly converted speech segments and their corresponding standard sentences are obtained. Wrongly converted, missing, or redundant characters or words in the speech-converted text can thus be identified, strengthening the capability of speech error correction.
  • in one embodiment, a speech recognition device is provided, corresponding one-to-one to the speech recognition method of the above embodiments. As shown in FIG. 7, the speech recognition device includes a voice segmentation module 61, a speech recognition module 62, a text processing module 63, a sentence segmentation module 64, a text matching module 65, an analysis processing module 66, and an error correction processing module 67.
  • the detailed description of each function module is as follows:
  • a voice segmentation module 61 configured to obtain voice data input by a user according to the original text, and use the silence detection algorithm to segment the voice data into voice segments;
  • the speech recognition module 62 is configured to perform recognition conversion processing on each speech segment, obtain a conversion sentence and a conversion serial number of each conversion sentence, and create a corresponding variable storage space for each conversion sentence;
  • a text processing module 63 configured to preprocess the original text to obtain a standard sentence and a standard serial number of each standard sentence
  • the sentence segmentation module 64 is configured to determine a segmentation length according to a standard sentence, and perform string segmentation on each conversion sentence according to the segmentation length to obtain a character string to be matched;
  • the text matching module 65 is configured to match each string to be matched against the standard sentences, and to store the standard serial number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken;
  • An analysis processing module 66 is configured to analyze and process a standard serial number in a variable storage space to obtain a speech segment that is incorrectly converted and a standard sentence corresponding to the speech segment;
  • an error correction processing module 67 is configured to store the incorrectly converted speech segments and their corresponding standard sentences in a speech database as a data set, and to train a speech recognition model on the data set, so that the trained speech recognition model corrects errors in voice data in which polyphonic characters or accents of the same type are detected.
  • the speech recognition module 62 includes:
  • a voice data processing unit 621 configured to preprocess the voice data to obtain audio data, where the audio data includes sample values of n sampling points, where n is a positive integer;
  • the audio data framing unit 622 is configured to perform framing processing on the audio data according to a preset frame length and a preset step length to obtain a K-frame voice frame, where K is a positive integer;
  • a frame energy calculation unit 623 configured to calculate a frame energy of each speech frame according to a sample value
  • a mute frame marking unit 624 for each frame of the speech frame, if the frame energy of the speech frame is less than a preset frame energy threshold, mark the speech frame as a mute frame;
  • a mute segment marking unit 625 configured to mark the continuous mute frame as a mute segment if the number of consecutive mute frames is detected to be greater than a preset mute frame number threshold;
  • the voice segment obtaining unit 626 is configured to determine a segmented frame of voice data according to the mute segment, and use the segmented frame to segment the voice data to obtain a voice segment.
  • the text processing module 63 includes:
  • a text segmentation unit 631 configured to segment the original text according to sentences according to preset punctuation marks to obtain segmented sentences
  • a text conversion unit 632 is used to traverse each segmented sentence and, if the segmented sentence contains a non-Chinese character string, convert the non-Chinese character string into Chinese to obtain a standard sentence, and assign a standard serial number to each standard sentence.
  • the text matching module 65 includes:
  • the object creation unit 651 is configured to set the first standard sentence as the starting point of the match, and determine the matching range according to the starting point of the match;
  • the text matching unit 652 is configured to match each string to be matched against the standard sentences within the matching range, in the order of the conversion serial numbers of the conversion sentences, and, if content consistent with the string to be matched is found in a standard sentence within the matching range, confirm the match as successful, and otherwise as failed;
  • the first matching unit 653 is configured to, if the match succeeds, store the standard serial number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken, and use the successfully matched standard sentence as the matching start point of the next string to be matched;
  • the second matching unit 654 is configured to, if the matching fails, use the next character string to be matched with the standard sentence in the matching range until all the character strings to be matched are matched.
  • analysis processing module 66 includes:
  • a data analysis and processing unit 661 is configured to de-duplicate the standard serial numbers in each variable storage space: if at least two identical standard serial numbers exist in a variable storage space, retain any one of them and delete the rest;
  • the first data identification unit 662 is configured to, if the standard serial numbers across the variable storage spaces are non-consecutive or duplicated, take the variable storage spaces holding the non-consecutive standard serial numbers, or those holding the duplicated standard serial numbers, as spaces to be corrected;
  • a second data identification unit 663 configured to: if the variable storage space is empty, use the variable storage space and two adjacent variable storage spaces as a space to be corrected;
  • the target data acquisition unit 664 is configured to determine, from the spaces to be corrected and the standard serial numbers they contain, the incorrectly converted speech segments and the standard sentences corresponding to those speech segments.
  • for the specific limitations of the speech recognition device, reference may be made to the limitations of the speech recognition method above, which are not repeated here. Each module in the above speech recognition device may be implemented in whole or in part by software, by hardware, or by a combination of the two; the above modules may be embedded in, or independent of, the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by a processor to implement a speech recognition method.
  • in one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, it implements the steps of the speech recognition method of the above embodiments, for example steps S1 to S7 shown in FIG. 2, or it implements the functions of the modules/units of the speech recognition device of the above embodiments, for example the functions of modules 61 to 67 shown in FIG. 7. To avoid repetition, details are not repeated here.
  • in one embodiment, a non-volatile readable storage medium is provided, on which computer-readable instructions are stored. When executed by a processor, the computer-readable instructions implement the steps of the speech recognition method of the above embodiments, for example steps S1 to S7 shown in FIG. 2, or the functions of the modules/units of the speech recognition device of the above embodiments, for example the functions of modules 61 to 67 shown in FIG. 7. To avoid repetition, details are not repeated here.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

Provided are a speech recognition method, a device, a computer device, and a storage medium. The speech recognition method includes: acquiring voice data input by a user according to an original text, segmenting the voice data into speech segments (S1), and performing recognition and conversion processing to obtain conversion sentences and conversion serial numbers, creating a variable storage space for each conversion sentence (S2); preprocessing the original text to obtain standard sentences and standard serial numbers (S3); segmenting each conversion sentence to obtain strings to be matched (S4), which are used for matching against the standard sentences, storing the standard serial numbers of the successfully matched standard sentences into the variable storage spaces for analysis (S5); storing the incorrectly converted speech segments and standard sentences thus obtained into a speech database as a data set (S6); and training a speech recognition model on the data set so that the trained speech recognition model corrects errors in detected voice data (S7). The method strengthens the speech error-correction capability of the speech recognition model and improves its accuracy.

Description

Speech recognition method, device, computer device and storage medium
This application is based on, and claims priority from, Chinese invention patent application No. 201810548082.1, filed on May 31, 2018 and entitled "语音识别方法、装置、计算机设备及存储介质" (Speech recognition method, device, computer device and storage medium).
Technical Field
The present application relates to the technical field of speech processing, and in particular to a speech recognition method, a device, a computer device, and a storage medium.
Background
Speech recognition technology has developed rapidly in recent years and its fields of application keep expanding; a wide variety of speech recognition products have appeared on the market. Speech is recognized and converted into text output by speech recognition and conversion tools, which are widely used in model training, media retrieval, subtitle generation, speaker identification, and similar applications.
In practice, however, users come in many kinds, and many people's pronunciation is far from standard. When current speech recognition and conversion tools turn speech into text, pronunciation problems of some users, polyphonic characters in the text, and similar causes prevent the tools from recognizing that voice data accurately, and the tools lack any real error-correction capability. As a result, the text generated by such tools is inconsistent with the correct text content, and the practical results are poor.
Summary
On this basis, it is necessary, in view of the above technical problem, to provide a speech recognition method, a device, a computer device, and a storage medium that can improve the accuracy of speech recognition.
A speech recognition method includes:
acquiring voice data input by a user according to an original text, and segmenting the voice data into speech segments using a silence detection algorithm;
performing recognition and conversion processing on each of the speech segments to obtain conversion sentences and a conversion serial number of each conversion sentence, and creating a corresponding variable storage space for each conversion sentence;
preprocessing the original text to obtain standard sentences and a standard serial number of each standard sentence;
determining a segmentation length according to the standard sentences, and performing string segmentation on each conversion sentence according to the segmentation length to obtain strings to be matched;
for each string to be matched, matching the string against the standard sentences, and storing the standard serial number of a successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken;
analyzing the standard serial numbers in the variable storage spaces to obtain incorrectly converted speech segments and the standard sentences corresponding to those speech segments; and
storing the incorrectly converted speech segments and their corresponding standard sentences in a speech database as a data set, and training a speech recognition model on the data set, so that the trained speech recognition model corrects errors in voice data in which polyphonic characters or accents of the same type are detected.
A speech recognition device includes:
a voice segmentation module, configured to acquire voice data input by a user according to an original text, and to segment the voice data into speech segments using a silence detection algorithm;
a speech recognition module, configured to perform recognition and conversion processing on each of the speech segments to obtain conversion sentences and a conversion serial number of each conversion sentence, and to create a corresponding variable storage space for each conversion sentence;
a text processing module, configured to preprocess the original text to obtain standard sentences and a standard serial number of each standard sentence;
a sentence segmentation module, configured to determine a segmentation length according to the standard sentences, and to perform string segmentation on each conversion sentence according to the segmentation length to obtain strings to be matched;
a text matching module, configured to match each string to be matched against the standard sentences, and to store the standard serial number of a successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken;
an analysis processing module, configured to analyze the standard serial numbers in the variable storage spaces to obtain incorrectly converted speech segments and the standard sentences corresponding to those speech segments; and
an error correction processing module, configured to store the incorrectly converted speech segments and their corresponding standard sentences in a speech database as a data set, and to train a speech recognition model on the data set, so that the trained speech recognition model corrects errors in voice data in which polyphonic characters or accents of the same type are detected.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the above speech recognition method when executing the computer-readable instructions.
One or more non-volatile readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the above speech recognition method.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in describing the embodiments are introduced briefly below. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the speech recognition method in an embodiment of the present application;
FIG. 2 is a flowchart of the speech recognition method in an embodiment of the present application;
FIG. 3 is a specific flowchart of step S2 in FIG. 2;
FIG. 4 is a specific flowchart of step S3 in FIG. 2;
FIG. 5 is a specific flowchart of step S5 in FIG. 2;
FIG. 6 is a specific flowchart of step S6 in FIG. 2;
FIG. 7 is a schematic block diagram of the speech recognition device in an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The speech recognition method provided by the present application can be applied in the application environment of FIG. 1, which includes a server and a client connected through a network: the user performs voice input through the client, and the server recognizes the voice input by the user and trains a speech recognition model according to the recognition result. The client may specifically be, but is not limited to, any of various personal computers, laptops, smartphones, tablets, and portable wearable devices; the server may specifically be implemented as an independent server or as a server cluster composed of multiple servers. The speech recognition method provided by the embodiments of the present application is applied on the server.
In one embodiment, FIG. 2 shows a flowchart of the speech recognition method of this embodiment. The method is applied on the server of FIG. 1 and is used to train a speech recognition model. As shown in FIG. 2, the speech recognition method includes steps S1 to S7, detailed as follows:
S1: Acquire the voice data input by the user according to the original text, and segment the voice data into speech segments using a silence detection algorithm.
In this embodiment, the original text is a text template provided for the user; the user reads it aloud on the client according to the template, the client uploads the recorded voice data to the server, and the server uses the acquired voice data as training samples for training the speech recognition model.
It should be noted that transcribing long stretches of voice data consumes considerable system resources, and the server's automatic alignment during the recognition of long voice data lowers recognition accuracy. The server therefore segments the voice data with a silence detection algorithm: the voice data is divided into frames, the frame energy of each speech frame is calculated, and the mute segments of the audio data are determined from the frame energies, so that silence and pauses in the voice data are recognized accurately and the voice data is segmented sentence by sentence into speech segments shorter than a preset time length for training. The preset time length may specifically be 10 seconds, but is not limited to this and can be set according to the needs of the actual application; no limitation is imposed here.
S2: Perform recognition and conversion processing on each speech segment to obtain the conversion sentences and the conversion serial number of each conversion sentence, and create a corresponding variable storage space for each conversion sentence.
In this embodiment, speech recognition is performed on each speech segment to convert it into text form; punctuation marks in the text are deleted and empty text is removed, yielding the conversion sentences. The conversion sentences may specifically be stored in the database in the form of an array or a matrix; a conversion serial number is assigned to each conversion sentence according to the chronological order of the speech segments in the voice data, and a corresponding variable storage space is created for each conversion sentence.
Preferably, the conversion sentences may be stored in the database as an array, with the conversion sentence of each speech segment as one element. A character array str is defined as the identification information of the voice data: str includes the X+1 elements str[0] to str[X], where str[0] is the first conversion sentence, str[1] the second, and str[X] the (X+1)-th. Correspondingly, str0, str1, ..., strX are the conversion serial numbers of the conversion sentences, and FLAG0, FLAG1, ..., FLAGX are the variable storage spaces corresponding to each conversion sentence.
S3: Preprocess the original text to obtain the standard sentences and the standard serial number of each standard sentence.
In this embodiment, the original text is segmented into sentences at the punctuation marks, the punctuation marks in the original text are deleted, and the original text is traversed; if the original text contains a non-Chinese character string, the string is converted into Chinese, for example "1" into "一" and "kg" into "千克". After the original text has been segmented and converted, the standard sentences are obtained; they may specifically be stored in the database in the form of an array or a matrix, and each standard sentence is assigned a standard serial number.
Preferably, the standard sentences may be stored in the database as an array, with each standard sentence as one element. A character array arr is defined as the identification information of the original text: arr includes the Y+1 elements arr[0] to arr[Y], where arr[0] is the first standard sentence, arr[1] the second, and arr[Y] the (Y+1)-th. Correspondingly, arr0, arr1, ..., arrY are the standard serial numbers of the standard sentences.
S4: Determine the segmentation length according to the standard sentences, and perform string segmentation on each conversion sentence according to the segmentation length to obtain the strings to be matched.
In this embodiment, among all the standard sentences, the minimum string length of the standard sentences is obtained and determined as the segmentation length, and each conversion sentence is cut into strings of that length to obtain the strings to be matched.
For example, for the conversion sentences str[0] = "我是一个中国人" and str[1] = "我为我自豪" in the array str, a segmentation length of four characters cuts "我是一个" and "中国人" from str[0] as strings to be matched, and "我为我自" and "豪" from str[1], continuing until the string segmentation of all the conversion sentences in the array str is complete.
S5: For each string to be matched, match the string against the standard sentences, and store the standard serial number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken.
Specifically, each string to be matched is matched against the standard sentences; if content consistent with the string is found in a standard sentence, the match is confirmed as successful, and the standard serial number arrY of the successfully matched standard sentence is stored in the variable storage space FLAGX corresponding to the conversion sentence in which the string originated. A variable storage space can store multiple standard serial numbers.
S6: Analyze the standard serial numbers in the variable storage spaces to obtain the incorrectly converted speech segments and the standard sentences corresponding to those speech segments.
In this embodiment, the standard serial numbers in each variable storage space are traversed; if identical standard serial numbers exist, only one of them is retained and the rest are deleted.
Specifically, after the standard serial numbers in the variable storage spaces are de-duplicated, the conversion sentences that failed to match are obtained, the speech segments corresponding to those conversion sentences are marked as incorrectly converted speech segments, and the standard sentences corresponding to those speech segments are determined.
It should be noted that when a variable storage space is empty, or the standard serial numbers in the variable storage spaces are non-consecutive or duplicated, the conversion sentence corresponding to that variable storage space failed to match: it is text content that speech recognition converted incorrectly, while the standard sentence corresponding to the standard serial number stored in that variable storage space is the correct text content.
S7: Store the incorrectly converted speech segments and their corresponding standard sentences in a speech database as a data set, and train the speech recognition model on the data set, so that the trained speech recognition model corrects errors in voice data in which polyphonic characters or accents of the same type are detected.
Specifically, the incorrectly converted speech segments and their corresponding standard sentences obtained in step S6 are stored in the speech database as a data set; the speech database is the server's speech corpus. Using the data set in the speech database, the server can train the speech recognition model to strengthen its adaptivity, so that the trained model can handle more environments and accents and, when polyphonic characters are detected or voice data with the same type of accent is encountered, has the ability to adjust itself and correct errors, improving the accuracy of the speech recognition model.
In this embodiment, the voice data is segmented into speech segments using a silence detection algorithm; after each speech segment is recognized and converted and the original text is preprocessed, the conversion sentences are cut into strings to be matched, the strings are matched against the standard sentences, the standard serial numbers of the successfully matched standard sentences are stored in the variable storage spaces corresponding to the conversion sentences from which the strings were taken, and finally the standard serial numbers in the variable storage spaces are analyzed, and the incorrectly converted speech segments and their corresponding standard sentences are obtained and stored in the speech database. By matching the speech-converted text against the standard text, converted words that are wrong, missing, or redundant can be identified and stored in the speech database for machine model learning, strengthening the adaptivity of the speech recognition model so that it can handle more environments and accents and has the ability to adjust and correct errors, thereby improving the accuracy of the speech recognition model.
In one embodiment, a specific implementation of the process, mentioned in step S2, of segmenting the voice data into speech segments using the silence detection algorithm is described in detail.
Referring to FIG. 3, which shows a specific flowchart of step S2, the details are as follows:
S21: Preprocess the voice data to obtain audio data, where the audio data contains the sample values of n sampling points and n is a positive integer.
In this embodiment, pulse code modulation (PCM) is used to encode the acquired voice data: the analog signal of the voice data is sampled at one sampling point every preset interval so as to discretize it. The preset interval is determined by the sampling frequency of the PCM coding; the specific sampling frequency can be set from historical experience, for example 8000 Hz, meaning 8000 sampling signals collected per second, or set according to the actual application, with no restriction here.
Further, the sampling signals of the n sampling points are quantized, and the quantized digital signals are output as the sample values of the sampling points in the form of binary code groups, yielding the audio data, where n is the product of the duration of the voice data and the sampling frequency.
S22: Perform framing processing on the audio data according to a preset frame length and a preset step size to obtain K speech frames, where K is a positive integer.
In this embodiment, the audio data is divided into non-overlapping frames according to the preset frame length and step size: the frame length is the length of each acquired speech frame and the step size is the time interval between acquired speech frames; when the frame length equals the step size, the speech frames obtained after framing do not overlap. K speech frames are obtained, where K is the quotient of the duration of the voice data divided by the duration of one speech frame.
Specifically, the frame length can be set in the range 0.01 s to 0.03 s, over which short period the speech signal is relatively stationary; for example, the frame length may be set to 0.01 s, or according to the needs of the actual application, with no restriction here.
For example, if the frame length is 0.01 s, the step size is 0.01 s, and the sampling frequency is 8000 Hz with 8000 sampling signals collected per second, the audio data is framed with 80 sample values per speech frame; if the last speech frame has fewer than 80 sample values, data with sample value 0 is appended to it so that it also contains 80 sample values.
S23: Calculate the frame energy of each speech frame from the sample values.
Specifically, the frame energy is the short-time energy of the speech signal and reflects the amount of speech information carried by a speech frame; the frame energy of each speech frame is calculated according to formula (1):
Ene[i] = A × sum(Xi²)    (1)
where Ene[i] is the frame energy of the i-th speech frame, A is a preset adjustment factor, and sum(Xi²) is the sum of the squares of the sample values of the sampling points contained in the i-th speech frame.
It should be noted that the adjustment factor A is preset according to the characteristics of the voice data, to avoid the situation in which the volume of the sentences in the voice data is too low or the background noise too loud, making sentences poorly distinguishable from silence and harming the accuracy of speech segmentation.
S24: For each speech frame, if the frame energy of the speech frame is less than a preset frame-energy threshold, mark the speech frame as a mute frame.
In this embodiment, the frame-energy threshold is a preset parameter: if a calculated frame energy is below it, the corresponding speech frame is marked as a mute frame. The threshold can specifically be set from historical experience, for example 0.5, or from a specific analysis of the calculated frame energies of the speech frames; no limitation is imposed here.
S25: If the number of consecutive mute frames detected is greater than a preset mute-frame-count threshold, mark those consecutive mute frames as a mute segment.
In this embodiment, the mute-frame-count threshold is a preset parameter: if a run of consecutive mute frames longer than the threshold is detected, the run is marked as a mute segment. The mute-frame-count threshold can specifically be set from historical experience, for example 5, or from a specific analysis of the calculated frame energies of the speech frames; no limitation is imposed here.
S26: Determine the split frames of the voice data from the mute segments, and segment the voice data at the split frames to obtain the speech segments.
Specifically, to ensure that no sentence is cut through and that there is some margin before and after each sentence, the middle frame of each mute segment's run of consecutive frame numbers is used as the separation point; if the number of consecutive frames is even, either of the two middle frame numbers (for example, the smaller one) may be marked as the split frame, with no limitation here.
For example, with a frame-energy threshold of 0.5 and a mute-frame-count threshold of 5, suppose filtering finds that the frame energies Ene1, Ene2, Ene8, Ene13, Ene14, Ene15, Ene16, Ene17, and Ene18 are all below 0.5. The frame numbers of the speech frames whose energies fall below the threshold are marked as mute frames, runs of more than 5 consecutive frame numbers are then filtered out, the frame numbers corresponding to Ene13 through Ene18 are marked as a mute segment, the smaller middle frame number of the run is taken, and the 15th speech frame is marked as the split frame.
Further, the audio data is segmented at the marked split frames, and the frames between split points are combined into independent speech segments.
In this embodiment, the voice data is preprocessed to obtain audio data, the audio data is divided into multiple speech frames, and the frame energy of each speech frame is calculated from the sample values of the audio data. If a frame's energy is below the preset frame-energy threshold, the frame is marked as a mute frame; further, if the number of consecutive mute frames detected exceeds the preset mute-frame-count threshold, the run is marked as a mute segment and the frame number of the split frame is determined; finally the voice data is segmented at the split frames to obtain the speech segments. By framing the voice data, computing each frame's energy, and locating the mute segments of the audio data from the frame energies, silence and pauses in the voice data are recognized accurately and sentences are segmented correctly, without destroying sentence integrity, achieving correct segmentation of the voice data.
In one embodiment, a specific implementation of preprocessing the original text, mentioned in step S3, to obtain the standard sentences and the standard serial number of each standard sentence is described in detail.
Referring to FIG. 4, which shows a specific flowchart of step S3, the details are as follows:
S31: Segment the original text into sentences according to preset punctuation marks to obtain the segmented sentences.
In this embodiment, the preset punctuation marks may be the enumeration comma, comma, semicolon, period, question mark, or exclamation mark, but are not limited to these and can specifically be set according to the needs of the actual application; no limitation is imposed here.
Specifically, the original text is traversed; whenever a preset punctuation mark is detected, the text is cut at that mark, so that the original text is segmented sentence by sentence into single sentences, and all punctuation marks in the single sentences are deleted, yielding the segmented sentences.
For example, the original text "小王只会把心思花在钻法律的空子上，像这样的"聪明人"还是少一点好。" is cut at the preset punctuation marks into the two single sentences "小王只会把心思花在钻法律的空子上，" and "像这样的"聪明人"还是少一点好。"; after all punctuation marks in the single sentences are deleted, the segmented sentences "小王只会把心思花在钻法律的空子上" and "像这样的聪明人还是少一点好" are obtained.
S32: Traverse each segmented sentence; if the segmented sentence contains a non-Chinese character string, convert the non-Chinese character string into Chinese to obtain a standard sentence, and assign a standard serial number to each standard sentence.
In this embodiment, character strings include Chinese character strings and non-Chinese character strings. The segmented sentences obtained in step S31 are all traversed and searched; if a segmented sentence is detected to contain a non-Chinese character string, the content of that string is obtained and the string is converted into Chinese.
Specifically, if the content of the non-Chinese character string is a date, the year, month, and day are converted according to a preset requirement, which is set according to the form in which the speech recognition model recognizes and converts dates; no limitation is imposed here.
For example, if, after recognizing and converting years, months, and days between the years 1000 and 2500, the server outputs the dates in digit form, the preset requirement may specifically be that years earlier than 1000 or later than 2500 are converted into Chinese, while months and days are not.
Further, if the content of the non-Chinese character string is not a date but numeric, the preset Chinese-numeral array {'零','一','二','三','四','五','六','七','八','九'} and the preset digit-weight array {'','十','百','千','万','十万','百万','千万','亿'} are used to convert the numeric content, the weight of the ones place in the digit-weight array being empty.
Specifically, the type of the numeric content is judged first. If it is an integer, each Arabic digit of the integer is taken from left to right, replaced with Chinese using the preset Chinese-numeral array, and given its weight from the preset digit-weight array; for example, "213" is converted to "二百一十三".
If the numeric content is a decimal, it is divided into an integer part and a fractional part. Each Arabic digit of the integer part is taken from left to right, replaced with Chinese using the preset Chinese-numeral array, and given its weight from the preset digit-weight array; then each Arabic digit of the fractional part is taken from left to right and replaced with Chinese using the preset Chinese-numeral array. Finally the decimal point is converted to "点" and placed between the integer part and the fractional part, joining the converted Chinese of the two parts; for example, "20.3" is converted to "二十点三".
At the same time, zero-removal is applied after the numeric content is converted to Chinese. Specifically, if the trailing digits of an integer, or of the integer part of a decimal, are "0", then after conversion only the Chinese up to the rightmost non-zero digit is kept and the Chinese converted from the trailing "0" digits is deleted; for example, "1000" converts to "一千零百零十零", and after "零百零十零" is deleted, "一千" is obtained.
If, in an integer or the integer part of a decimal, a digit "0" lies between two non-zero digits, then after the numeric content is converted to Chinese, the Chinese converted from that "0" is deleted and replaced with "零"; for example, "1001" converts to "一千零一".
Further, if the content of the non-Chinese character string contains a physical unit, the unit is converted directly into Chinese, for example "kg" into "千克" and "cm" into "厘米".
Further, if the content of the non-Chinese character string contains a percent sign, the "%" is deleted and "百分之" is added before the Chinese converted from the numeric part; for example, "33%" converts to "百分之三十三".
To better understand this step, the conversion of non-Chinese character strings into Chinese to obtain standard sentences is illustrated below with a concrete example.
For example, the original text of the text template is:
2017年5月12日第55位读者阅读第47297章节完成33%预计200天完成阅读。
Then the standard text obtained after conversion is:
2017年5月12日第五十五位读者阅读第四万七千二百九十七章节完成百分之三十三预计二百天完成阅读。
After the standard sentences are obtained, each standard sentence is assigned a standard serial number.
In this embodiment, segmenting the original text into sentences at the preset punctuation marks yields the segmented sentences and puts the original text into sentence form, which improves the efficiency of matching against the speech-converted text. After the segmented sentences are obtained, each is traversed and its non-Chinese character strings are converted into Chinese, yielding the standard sentences used for matching against the strings to be matched; this raises the match rate between the standard text and the speech-converted text and avoids lowering speech recognition accuracy merely because the same content is displayed in different forms.
In one embodiment, a specific implementation of the process mentioned in step S5 is described in detail: for each string to be matched, matching the string against the standard sentences and storing the standard serial number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken.
Referring to FIG. 5, which shows a specific flowchart of step S5, the details are as follows:
S51: Set the first standard sentence as the matching start point, and determine the matching range from the start point.
In this embodiment, the first standard sentence is set as the matching start point and the matching range is determined from it, for matching against the first string to be matched.
The matching range consists of the standard sentences obtained from the start point onward, in standard-serial-number order, up to the matching-range value. The matching-range value may specifically be preset according to the string length of the standard sentences (the longer the standard sentences, the smaller the value) or generated from the number of standard sentences; for example, the matching-range value may be set to 5.
S52: Match each string to be matched against the standard sentences within the matching range, in the order of the conversion serial numbers of the conversion sentences; if content consistent with the string to be matched is found in a standard sentence within the matching range, confirm the match as successful, otherwise confirm it as failed.
Specifically, the strings to be matched, as cut from the conversion sentences, are obtained one after another in the order of the conversion serial numbers of the conversion sentences and matched against the standard sentences within the matching range in the order of the standard serial numbers; if content consistent with the string is found in a standard sentence, the match is confirmed as successful, otherwise as failed.
For example, the string to be matched "我是一个", cut from the conversion sentence str[0], matches content in the standard sentence arr[0] = "我是一个中国人", so the match is confirmed as successful.
S53: If the match succeeds, store the standard serial number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the string was taken, and use the successfully matched standard sentence as the matching start point of the next string to be matched.
Specifically, if the standard sentence at the start point does not contain the current string to be matched, further standard sentences within the matching range are fetched for matching; if the string matches a standard sentence within the range, that sentence's standard serial number is stored in the variable storage space corresponding to the conversion sentence from which the string was taken, and the successfully matched standard sentence becomes the matching start point of the next string.
S54: If the match fails, use the next string to be matched against the standard sentences within the matching range, until all strings to be matched have been processed.
Further, if no content consistent with the string is found within the matching range, the match is confirmed as failed, the matching range is left unchanged, and the next string to be matched is matched against the same range; the strings to be matched are obtained in conversion-serial-number order for matching until all have been processed, yielding the standard serial numbers stored in each variable storage space.
To better understand how the strings to be matched are matched against the standard sentences, the matching is illustrated below with a concrete example.
Suppose the current string to be matched is "我为我自", cut from str[1], the matching start point is arr[0], and the matching-range value is 5.
If arr[0] does not contain the content of the string "我为我自", the next standard sentence arr[1] is used for matching; if content consistent with "我为我自" is found in arr[1], the standard serial number arr1 of the currently matched standard sentence is stored in the variable storage space FLAG1 corresponding to the conversion sentence str[1] from which the string was taken, the successfully matched standard sentence arr[1] becomes the matching start point of the next string, and the matching range becomes arr[1], arr[2], arr[3], arr[4], and arr[5].
If none of the standard sentences arr[0], arr[1], arr[2], arr[3], and arr[4] in the matching range contains the content of the string "我为我自", the match is confirmed as failed, the matching range is left unchanged, and the next string to be matched is matched against the same range.
The strings to be matched are obtained in conversion-serial-number order and matched in the above manner until all have been processed.
In this embodiment, setting the first standard sentence as the matching start point and determining the matching range from it removes the need to match against all standard sentences, improving resource utilization. On success, the matched sentence's standard serial number is stored in the variable storage space corresponding to the conversion sentence from which the string was taken, and the matched sentence becomes the start point of the next string, so matching need not restart from the first standard sentence, improving matching efficiency. On failure, the next string is matched against the same range until all strings have been processed, yielding the standard serial numbers stored in each variable storage space. Matching in conversion-serial-number and standard-serial-number order within a bounded range raises the match rate between the speech-converted text and the original text.
In one embodiment, a specific implementation of step S6 is described in detail, namely analyzing the standard sequence numbers in the variable storage spaces to obtain the wrongly converted speech segments and the standard sentences corresponding to those segments.
Referring to FIG. 6, FIG. 6 shows a specific flowchart of step S6, detailed as follows:
S61: De-duplicate the standard sequence numbers in each variable storage space; if at least two identical standard sequence numbers exist in a variable storage space, keep any one of them and delete the rest.
In this embodiment, after all the character strings to be matched have been matched, the standard sequence numbers stored in the respective variable storage spaces are de-duplicated: if at least two identical standard sequence numbers are detected in a variable storage space, any one of them is kept and the rest are deleted, so that the standard sequence numbers stored in each variable storage space are distinct.
S62: If the standard sequence numbers across all the variable storage spaces are discontinuous or contain repetitions, take the variable storage spaces holding the discontinuous standard sequence numbers, or those holding the repeated standard sequence numbers, as spaces to be corrected.
Specifically, if the standard sequence numbers across all the variable storage spaces are discontinuous or contain repetitions, it is confirmed that the conversion sentences corresponding to those variable storage spaces do not match the standard sentences and that the server made errors in recognizing and converting the speech data; the variable storage spaces holding the discontinuous or repeated standard sequence numbers are taken as spaces to be corrected.
For example, if FLAG35 stores the standard sequence numbers [arr35, arr37], FLAG35 is marked as a space to be corrected; if FLAG35 stores [arr35] and FLAG36 stores [arr37], FLAG35 and FLAG36 are marked as spaces to be corrected; if FLAG35 stores [arr35] and FLAG36 also stores [arr35], FLAG35 and FLAG36 are marked as spaces to be corrected.
S63: If a variable storage space is empty, take that variable storage space and its two adjacent variable storage spaces as spaces to be corrected.
Specifically, if a variable storage space is empty, it is confirmed that the conversion sentence corresponding to that variable storage space does not match the standard sentences and that the server made errors in recognizing and converting the speech data; that variable storage space and its two adjacent variable storage spaces are marked as spaces to be corrected.
For example, if FLAG3 stores no standard sequence number, the conversion sentence str[3] corresponding to FLAG3 does not match the standard sentences, and FLAG2, FLAG3, and FLAG4 are marked as spaces to be corrected.
It should be noted that there is no necessary order of execution between step S62 and step S63; they may be executed in parallel, which is not limited here.
S64: Determine, from the spaces to be corrected and the standard sequence numbers they contain, the wrongly converted speech segments and the standard sentences corresponding to those segments.
Further, after the standard sequence numbers in the variable storage spaces have been analyzed, the spaces to be corrected are obtained, and from these spaces and the standard sequence numbers they contain, the wrongly converted speech segments and the corresponding standard sentences are determined among the speech segments and the standard sentences.
For example, if FLAG35 stores [arr35] and FLAG36 stores [arr37], FLAG35 and FLAG36 are marked as spaces to be corrected. The speech segments corresponding to the conversion sentences str[35] and str[36], which correspond to FLAG35 and FLAG36, are determined to be wrongly converted speech segments, and the standard sentences corresponding to the standard sequence numbers [arr35] and [arr37] contained in FLAG35 and FLAG36 are taken as the correct content for those wrongly converted speech segments.
In this embodiment, by analyzing the standard sequence numbers in all the variable storage spaces, the cases in which conversion sentences fail to match the standard sentences are identified, and the variable storage spaces exhibiting such mismatches are marked as spaces to be corrected. From the spaces to be corrected and the standard sequence numbers they contain, the wrongly converted speech segments and the corresponding standard sentences are obtained, so that wrongly converted, missing, or redundant characters or words in the speech-converted text can be identified, strengthening the speech error-correction capability.
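Continuing the illustrative sketch, the analysis of steps S61 to S64 can be expressed over the flags mapping produced by the matcher above; every name here is again hypothetical and not part of this application.

    def spaces_to_correct(flags, num_conversions):
        # flags: conversion sequence number -> stored standard sequence numbers.
        # S61: de-duplicate within each variable storage space.
        deduped = {i: sorted(set(flags.get(i, []))) for i in range(num_conversions)}
        marked = set()
        # S63: an empty space marks itself and its two neighbours.
        for i, nums in deduped.items():
            if not nums:
                marked.update(j for j in (i - 1, i, i + 1) if 0 <= j < num_conversions)
        # S62: repetition or discontinuity across the stored standard numbers.
        seq = [(i, n) for i in range(num_conversions) for n in deduped[i]]
        for (i1, n1), (i2, n2) in zip(seq, seq[1:]):
            if n2 == n1 or n2 != n1 + 1:             # repeated or non-consecutive
                marked.update({i1, i2})
        return sorted(marked)                        # S64: spaces to be corrected

Applied to flags containing {35: [35], 36: [37]}, the gap between arr35 and arr37 marks spaces 35 and 36, matching the FLAG35/FLAG36 example above.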
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a speech recognition apparatus is provided, the speech recognition apparatus corresponding one-to-one to the speech recognition method in the foregoing embodiments. As shown in FIG. 7, the speech recognition apparatus includes a speech segmentation module 61, a speech recognition module 62, a text processing module 63, a sentence splitting module 64, a text matching module 65, an analysis processing module 66, and an error-correction processing module 67. The functional modules are described in detail as follows:
a speech segmentation module 61, configured to acquire speech data input by a user according to an original text, and segment the speech data into speech segments using a silence detection algorithm;
a speech recognition module 62, configured to perform recognition and conversion processing on each speech segment to obtain conversion sentences and a conversion sequence number of each conversion sentence, and create a corresponding variable storage space for each conversion sentence;
a text processing module 63, configured to preprocess the original text to obtain standard sentences and a standard sequence number of each standard sentence;
a sentence splitting module 64, configured to determine a split length according to the standard sentences, and split each conversion sentence into character strings according to the split length to obtain character strings to be matched;
a text matching module 65, configured to, for each character string to be matched, match the character string against the standard sentences, and store the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken;
an analysis processing module 66, configured to analyze the standard sequence numbers in the variable storage spaces to obtain wrongly converted speech segments and the standard sentences corresponding to those segments;
an error-correction processing module 67, configured to store the wrongly converted speech segments and their corresponding standard sentences in a speech library as a data set, and train a speech recognition model based on the data set, so that the trained speech recognition model corrects speech data in which polyphonic characters or accents of the same type are detected.
Further, the speech segmentation module 61 includes:
a speech data processing unit 621, configured to preprocess the speech data to obtain audio data, wherein the audio data contains sample values at n sampling points, n being a positive integer;
an audio data framing unit 622, configured to frame the audio data according to a preset frame length and a preset step size to obtain K speech frames, K being a positive integer;
a frame energy calculation unit 623, configured to calculate the frame energy of each speech frame from the sample values;
a silent frame marking unit 624, configured to, for each speech frame, mark the speech frame as a silent frame if its frame energy is less than a preset frame energy threshold;
a silence segment marking unit 625, configured to mark consecutive silent frames as a silence segment if the number of consecutive silent frames detected is greater than a preset silent frame count threshold; and
a speech segment acquisition unit 626, configured to determine segmentation frames of the speech data from the silence segments, and segment the speech data using the segmentation frames to obtain the speech segments (an illustrative sketch follows this list).
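As a rough Python illustration of units 621 to 626, the following sketch segments mono PCM samples on low-energy runs; the frame length, step size, thresholds, and the choice of the middle of a silence run as the segmentation frame are all assumptions made for the example, since the application leaves these to preset values.

    import numpy as np

    def split_on_silence(samples, frame_len=400, step=160,
                         energy_thresh=1e6, min_silent_frames=20):
        # samples: np.ndarray of PCM sample values (e.g. 16-bit mono audio).
        # Frame the audio with the preset frame length and step size (units 621-622).
        starts = range(0, len(samples) - frame_len + 1, step)
        energy = [float(np.sum(samples[s:s + frame_len].astype(np.float64) ** 2))
                  for s in starts]                        # frame energy (unit 623)
        silent = [e < energy_thresh for e in energy]      # silent frames (unit 624)
        cuts, run = [], None
        for i, is_silent in enumerate(silent + [False]):  # sentinel flushes the last run
            if is_silent and run is None:
                run = i
            elif not is_silent and run is not None:
                if i - run > min_silent_frames:           # silence segment (unit 625)
                    mid = (run + i) // 2                  # segmentation frame in the run
                    cuts.append(mid * step + frame_len // 2)
                run = None
        return [seg for seg in np.split(samples, cuts) if seg.size]  # unit 626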
Further, the text processing module 63 includes:
a text splitting unit 631, configured to split the original text into sentences according to preset punctuation marks to obtain split sentences; and
a text conversion unit 632, configured to traverse each split sentence and, if the split sentence contains a non-Chinese character string, convert the non-Chinese character string into Chinese to obtain the standard sentence, and assign the standard sequence number to each standard sentence (see the sketch after this list).
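A corresponding sketch for units 631 and 632, reusing the hypothetical number_to_chinese above, might look as follows; the punctuation set is illustrative, and the date and physical-unit rules are again omitted.

    import re

    SENTENCE_PUNCT = '。！？；'                        # preset punctuation (illustrative)
    NUMERIC = re.compile(r'\d+(?:\.\d+)?%?')

    def to_standard_sentences(original_text):
        # Unit 631: split the original text into sentences on preset punctuation.
        sentences = [s for s in re.split('[' + SENTENCE_PUNCT + ']', original_text) if s]
        # Unit 632: convert numeric and percent substrings in place, then number
        # the resulting standard sentences with standard sequence numbers.
        standard = [NUMERIC.sub(lambda m: number_to_chinese(m.group()), s)
                    for s in sentences]
        return list(enumerate(standard))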
Further, the text matching module 65 includes:
an object creation unit 651, configured to set the first standard sentence as the match start point, and determine the match range according to the match start point;
a text matching unit 652, configured to match, in the order of the conversion sequence numbers of the conversion sentences, each character string to be matched against the standard sentences within the match range, and confirm the match as successful if content identical to the character string is found in a standard sentence within the match range, or as failed otherwise;
a first matching unit 653, configured to, if the match succeeds, store the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken, and use the successfully matched standard sentence as the match start point of the next character string to be matched; and
a second matching unit 654, configured to, if the match fails, match the next character string to be matched against the standard sentences within the match range, until all character strings to be matched have been processed.
Further, the analysis processing module 66 includes:
a data analysis processing unit 661, configured to de-duplicate the standard sequence numbers in each variable storage space, and if at least two identical standard sequence numbers exist in a variable storage space, keep any one of them and delete the rest;
a first data marking unit 662, configured to, if the standard sequence numbers across all the variable storage spaces are discontinuous or contain repetitions, take the variable storage spaces holding the discontinuous standard sequence numbers, or those holding the repeated standard sequence numbers, as spaces to be corrected;
a second data marking unit 663, configured to, if a variable storage space is empty, take that variable storage space and its two adjacent variable storage spaces as spaces to be corrected; and
a target data acquisition unit 664, configured to determine, from the spaces to be corrected and the standard sequence numbers they contain, the wrongly converted speech segments and the standard sentences corresponding to those segments.
For specific limitations on the speech recognition apparatus, reference may be made to the limitations on the speech recognition method above, which are not repeated here. The modules in the speech recognition apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device communicates with external terminals via a network connection. The computer-readable instructions, when executed by the processor, implement a speech recognition method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the steps of the speech recognition method in the foregoing embodiments, such as steps S1 to S7 shown in FIG. 2, or implements the functions of the modules/units of the speech recognition apparatus in the foregoing embodiments, such as the functions of modules 61 to 67 shown in FIG. 7. To avoid repetition, details are not repeated here.
In one embodiment, a non-volatile readable storage medium is provided, storing computer-readable instructions that, when executed by a processor, implement the steps of the speech recognition method in the foregoing embodiments, such as steps S1 to S7 shown in FIG. 2, or implement the functions of the modules/units of the speech recognition apparatus in the foregoing embodiments, such as the functions of modules 61 to 67 shown in FIG. 7. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the foregoing embodiments may be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the foregoing methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is merely illustrative. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to accomplish all or part of the functions described above.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A speech recognition method, comprising:
    acquiring speech data input by a user according to an original text, and segmenting the speech data into speech segments using a silence detection algorithm;
    performing recognition and conversion processing on each of the speech segments to obtain conversion sentences and a conversion sequence number of each of the conversion sentences, and creating a corresponding variable storage space for each of the conversion sentences;
    preprocessing the original text to obtain standard sentences and a standard sequence number of each of the standard sentences;
    determining a split length according to the standard sentences, and splitting each of the conversion sentences into character strings according to the split length to obtain character strings to be matched;
    for each of the character strings to be matched, matching the character string against the standard sentences, and storing the standard sequence number of a successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken;
    analyzing the standard sequence numbers in the variable storage spaces to obtain wrongly converted speech segments and the standard sentences corresponding to the speech segments; and
    storing the wrongly converted speech segments and the standard sentences corresponding thereto in a speech library as a data set, and training a speech recognition model based on the data set, so that the trained speech recognition model corrects speech data in which polyphonic characters or accents of a same type are detected.
  2. The speech recognition method according to claim 1, wherein the segmenting the speech data into speech segments using a silence detection algorithm comprises:
    preprocessing the speech data to obtain audio data, wherein the audio data contains sample values at n sampling points, n being a positive integer;
    framing the audio data according to a preset frame length and a preset step size to obtain K speech frames, K being a positive integer;
    calculating a frame energy of each of the speech frames from the sample values;
    for each of the speech frames, marking the speech frame as a silent frame if the frame energy of the speech frame is less than a preset frame energy threshold;
    marking consecutive silent frames as a silence segment if the number of the consecutive silent frames detected is greater than a preset silent frame count threshold; and
    determining segmentation frames of the speech data from the silence segments, and segmenting the speech data using the segmentation frames to obtain the speech segments.
  3. The speech recognition method according to claim 1, wherein the preprocessing the original text to obtain standard sentences and a standard sequence number of each of the standard sentences comprises:
    splitting the original text into sentences according to preset punctuation marks to obtain split sentences; and
    traversing each of the split sentences and, if the split sentence contains a non-Chinese character string, converting the non-Chinese character string into Chinese to obtain the standard sentence, and assigning the standard sequence number to each of the standard sentences.
  4. The speech recognition method according to claim 1, wherein the matching, for each of the character strings to be matched, the character string against the standard sentences and storing the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken comprises:
    setting the first standard sentence as a match start point, and determining a match range according to the match start point;
    matching, in the order of the conversion sequence numbers of the conversion sentences, each of the character strings to be matched against the standard sentences within the match range, and confirming the match as successful if content identical to the character string is found in a standard sentence within the match range, or as failed otherwise;
    if the match succeeds, storing the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken, and using the successfully matched standard sentence as the match start point of the next character string to be matched; and
    if the match fails, matching the next character string to be matched against the standard sentences within the match range, until all of the character strings to be matched have been processed.
  5. The speech recognition method according to claim 1, wherein the analyzing the standard sequence numbers in the variable storage spaces to obtain wrongly converted speech segments and the standard sentences corresponding to the speech segments comprises:
    de-duplicating the standard sequence numbers in each of the variable storage spaces, and if at least two identical standard sequence numbers exist in a variable storage space, keeping any one of them and deleting the rest;
    if the standard sequence numbers across all of the variable storage spaces are discontinuous or contain repetitions, taking the variable storage spaces holding the discontinuous standard sequence numbers, or the variable storage spaces holding the repeated standard sequence numbers, as spaces to be corrected;
    if a variable storage space is empty, taking the variable storage space and its two adjacent variable storage spaces as the spaces to be corrected; and
    determining, from the spaces to be corrected and the standard sequence numbers contained therein, the wrongly converted speech segments and the standard sentences corresponding to the speech segments.
  6. A speech recognition apparatus, comprising:
    a speech segmentation module, configured to acquire speech data input by a user according to an original text, and segment the speech data into speech segments using a silence detection algorithm;
    a speech recognition module, configured to perform recognition and conversion processing on each of the speech segments to obtain conversion sentences and a conversion sequence number of each of the conversion sentences, and create a corresponding variable storage space for each of the conversion sentences;
    a text processing module, configured to preprocess the original text to obtain standard sentences and a standard sequence number of each of the standard sentences;
    a sentence splitting module, configured to determine a split length according to the standard sentences, and split each of the conversion sentences into character strings according to the split length to obtain character strings to be matched;
    a text matching module, configured to, for each of the character strings to be matched, match the character string against the standard sentences, and store the standard sequence number of a successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken;
    an analysis processing module, configured to analyze the standard sequence numbers in the variable storage spaces to obtain wrongly converted speech segments and the standard sentences corresponding to the speech segments; and
    an error-correction processing module, configured to store the wrongly converted speech segments and the standard sentences corresponding thereto in a speech library as a data set, and train a speech recognition model based on the data set, so that the trained speech recognition model corrects speech data in which polyphonic characters or accents of a same type are detected.
  7. The speech recognition apparatus according to claim 6, wherein the speech segmentation module comprises:
    a speech data processing unit, configured to preprocess the speech data to obtain audio data, wherein the audio data contains sample values at n sampling points, n being a positive integer;
    an audio data framing unit, configured to frame the audio data according to a preset frame length and a preset step size to obtain K speech frames, K being a positive integer;
    a frame energy calculation unit, configured to calculate a frame energy of each of the speech frames from the sample values;
    a silent frame marking unit, configured to, for each of the speech frames, mark the speech frame as a silent frame if the frame energy of the speech frame is less than a preset frame energy threshold;
    a silence segment marking unit, configured to mark consecutive silent frames as a silence segment if the number of the consecutive silent frames detected is greater than a preset silent frame count threshold; and
    a speech segment acquisition unit, configured to determine segmentation frames of the speech data from the silence segments, and segment the speech data using the segmentation frames to obtain the speech segments.
  8. The speech recognition apparatus according to claim 6, wherein the text processing module comprises:
    a text splitting unit, configured to split the original text into sentences according to preset punctuation marks to obtain split sentences; and
    a text conversion unit, configured to traverse each of the split sentences and, if the split sentence contains a non-Chinese character string, convert the non-Chinese character string into Chinese to obtain the standard sentence, and assign the standard sequence number to each of the standard sentences.
  9. The speech recognition apparatus according to claim 6, wherein the text matching module is configured to:
    set the first standard sentence as a match start point, and determine a match range according to the match start point;
    match, in the order of the conversion sequence numbers of the conversion sentences, each of the character strings to be matched against the standard sentences within the match range, and confirm the match as successful if content identical to the character string is found in a standard sentence within the match range, or as failed otherwise;
    if the match succeeds, store the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken, and use the successfully matched standard sentence as the match start point of the next character string to be matched; and
    if the match fails, match the next character string to be matched against the standard sentences within the match range, until all of the character strings to be matched have been processed.
  10. The speech recognition apparatus according to claim 6, wherein the analysis processing module is configured to:
    de-duplicate the standard sequence numbers in each of the variable storage spaces, and if at least two identical standard sequence numbers exist in a variable storage space, keep any one of them and delete the rest;
    if the standard sequence numbers across all of the variable storage spaces are discontinuous or contain repetitions, take the variable storage spaces holding the discontinuous standard sequence numbers, or the variable storage spaces holding the repeated standard sequence numbers, as spaces to be corrected;
    if a variable storage space is empty, take the variable storage space and its two adjacent variable storage spaces as the spaces to be corrected; and
    determine, from the spaces to be corrected and the standard sequence numbers contained therein, the wrongly converted speech segments and the standard sentences corresponding to the speech segments.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring speech data input by a user according to an original text, and segmenting the speech data into speech segments using a silence detection algorithm;
    performing recognition and conversion processing on each of the speech segments to obtain conversion sentences and a conversion sequence number of each of the conversion sentences, and creating a corresponding variable storage space for each of the conversion sentences;
    preprocessing the original text to obtain standard sentences and a standard sequence number of each of the standard sentences;
    determining a split length according to the standard sentences, and splitting each of the conversion sentences into character strings according to the split length to obtain character strings to be matched;
    for each of the character strings to be matched, matching the character string against the standard sentences, and storing the standard sequence number of a successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken;
    analyzing the standard sequence numbers in the variable storage spaces to obtain wrongly converted speech segments and the standard sentences corresponding to the speech segments; and
    storing the wrongly converted speech segments and the standard sentences corresponding thereto in a speech library as a data set, and training a speech recognition model based on the data set, so that the trained speech recognition model corrects speech data in which polyphonic characters or accents of a same type are detected.
  12. The computer device according to claim 11, wherein the segmenting the speech data into speech segments using a silence detection algorithm comprises:
    preprocessing the speech data to obtain audio data, wherein the audio data contains sample values at n sampling points, n being a positive integer;
    framing the audio data according to a preset frame length and a preset step size to obtain K speech frames, K being a positive integer;
    calculating a frame energy of each of the speech frames from the sample values;
    for each of the speech frames, marking the speech frame as a silent frame if the frame energy of the speech frame is less than a preset frame energy threshold;
    marking consecutive silent frames as a silence segment if the number of the consecutive silent frames detected is greater than a preset silent frame count threshold; and
    determining segmentation frames of the speech data from the silence segments, and segmenting the speech data using the segmentation frames to obtain the speech segments.
  13. The computer device according to claim 11, wherein the preprocessing the original text to obtain standard sentences and a standard sequence number of each of the standard sentences comprises:
    splitting the original text into sentences according to preset punctuation marks to obtain split sentences; and
    traversing each of the split sentences and, if the split sentence contains a non-Chinese character string, converting the non-Chinese character string into Chinese to obtain the standard sentence, and assigning the standard sequence number to each of the standard sentences.
  14. The computer device according to claim 11, wherein the matching, for each of the character strings to be matched, the character string against the standard sentences and storing the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken comprises:
    setting the first standard sentence as a match start point, and determining a match range according to the match start point;
    matching, in the order of the conversion sequence numbers of the conversion sentences, each of the character strings to be matched against the standard sentences within the match range, and confirming the match as successful if content identical to the character string is found in a standard sentence within the match range, or as failed otherwise;
    if the match succeeds, storing the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken, and using the successfully matched standard sentence as the match start point of the next character string to be matched; and
    if the match fails, matching the next character string to be matched against the standard sentences within the match range, until all of the character strings to be matched have been processed.
  15. The computer device according to claim 11, wherein the analyzing the standard sequence numbers in the variable storage spaces to obtain wrongly converted speech segments and the standard sentences corresponding to the speech segments comprises:
    de-duplicating the standard sequence numbers in each of the variable storage spaces, and if at least two identical standard sequence numbers exist in a variable storage space, keeping any one of them and deleting the rest;
    if the standard sequence numbers across all of the variable storage spaces are discontinuous or contain repetitions, taking the variable storage spaces holding the discontinuous standard sequence numbers, or the variable storage spaces holding the repeated standard sequence numbers, as spaces to be corrected;
    if a variable storage space is empty, taking the variable storage space and its two adjacent variable storage spaces as the spaces to be corrected; and
    determining, from the spaces to be corrected and the standard sequence numbers contained therein, the wrongly converted speech segments and the standard sentences corresponding to the speech segments.
  16. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring speech data input by a user according to an original text, and segmenting the speech data into speech segments using a silence detection algorithm;
    performing recognition and conversion processing on each of the speech segments to obtain conversion sentences and a conversion sequence number of each of the conversion sentences, and creating a corresponding variable storage space for each of the conversion sentences;
    preprocessing the original text to obtain standard sentences and a standard sequence number of each of the standard sentences;
    determining a split length according to the standard sentences, and splitting each of the conversion sentences into character strings according to the split length to obtain character strings to be matched;
    for each of the character strings to be matched, matching the character string against the standard sentences, and storing the standard sequence number of a successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken;
    analyzing the standard sequence numbers in the variable storage spaces to obtain wrongly converted speech segments and the standard sentences corresponding to the speech segments; and
    storing the wrongly converted speech segments and the standard sentences corresponding thereto in a speech library as a data set, and training a speech recognition model based on the data set, so that the trained speech recognition model corrects speech data in which polyphonic characters or accents of a same type are detected.
  17. The non-volatile readable storage medium according to claim 16, wherein the segmenting the speech data into speech segments using a silence detection algorithm comprises:
    preprocessing the speech data to obtain audio data, wherein the audio data contains sample values at n sampling points, n being a positive integer;
    framing the audio data according to a preset frame length and a preset step size to obtain K speech frames, K being a positive integer;
    calculating a frame energy of each of the speech frames from the sample values;
    for each of the speech frames, marking the speech frame as a silent frame if the frame energy of the speech frame is less than a preset frame energy threshold;
    marking consecutive silent frames as a silence segment if the number of the consecutive silent frames detected is greater than a preset silent frame count threshold; and
    determining segmentation frames of the speech data from the silence segments, and segmenting the speech data using the segmentation frames to obtain the speech segments.
  18. The non-volatile readable storage medium according to claim 16, wherein the preprocessing the original text to obtain standard sentences and a standard sequence number of each of the standard sentences comprises:
    splitting the original text into sentences according to preset punctuation marks to obtain split sentences; and
    traversing each of the split sentences and, if the split sentence contains a non-Chinese character string, converting the non-Chinese character string into Chinese to obtain the standard sentence, and assigning the standard sequence number to each of the standard sentences.
  19. The non-volatile readable storage medium according to claim 16, wherein the matching, for each of the character strings to be matched, the character string against the standard sentences and storing the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken comprises:
    setting the first standard sentence as a match start point, and determining a match range according to the match start point;
    matching, in the order of the conversion sequence numbers of the conversion sentences, each of the character strings to be matched against the standard sentences within the match range, and confirming the match as successful if content identical to the character string is found in a standard sentence within the match range, or as failed otherwise;
    if the match succeeds, storing the standard sequence number of the successfully matched standard sentence into the variable storage space corresponding to the conversion sentence from which the character string was taken, and using the successfully matched standard sentence as the match start point of the next character string to be matched; and
    if the match fails, matching the next character string to be matched against the standard sentences within the match range, until all of the character strings to be matched have been processed.
  20. The non-volatile readable storage medium according to claim 16, wherein the analyzing the standard sequence numbers in the variable storage spaces to obtain wrongly converted speech segments and the standard sentences corresponding to the speech segments comprises:
    de-duplicating the standard sequence numbers in each of the variable storage spaces, and if at least two identical standard sequence numbers exist in a variable storage space, keeping any one of them and deleting the rest;
    if the standard sequence numbers across all of the variable storage spaces are discontinuous or contain repetitions, taking the variable storage spaces holding the discontinuous standard sequence numbers, or the variable storage spaces holding the repeated standard sequence numbers, as spaces to be corrected;
    if a variable storage space is empty, taking the variable storage space and its two adjacent variable storage spaces as the spaces to be corrected; and
    determining, from the spaces to be corrected and the standard sequence numbers contained therein, the wrongly converted speech segments and the standard sentences corresponding to the speech segments.
PCT/CN2018/092568 2018-05-31 2018-06-25 Speech recognition method and apparatus, computer device, and storage medium WO2019227548A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810548082.1A CN108766437B (zh) 2018-05-31 2018-05-31 Speech recognition method and apparatus, computer device, and storage medium
CN201810548082.1 2018-05-31

Publications (1)

Publication Number Publication Date
WO2019227548A1 true WO2019227548A1 (zh) 2019-12-05

Family

ID=64000980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/092568 WO2019227548A1 (zh) 2018-05-31 2018-06-25 Speech recognition method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN108766437B (zh)
WO (1) WO2019227548A1 (zh)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109634935A (zh) * 2018-11-07 2019-04-16 重庆海特科技发展有限公司 Speech processing method, storage medium and apparatus
CN109599114A (zh) * 2018-11-07 2019-04-09 重庆海特科技发展有限公司 Speech processing method, storage medium and apparatus
CN109461459A (zh) * 2018-12-07 2019-03-12 平安科技(深圳)有限公司 Speech scoring method and apparatus, computer device and storage medium
CN110059168A (zh) * 2019-01-23 2019-07-26 艾肯特公司 Method for training a natural-intelligence-based human-computer interaction system
CN110085210B (zh) * 2019-03-15 2023-10-13 平安科技(深圳)有限公司 Interactive information testing method and apparatus, computer device and storage medium
CN109948124B (zh) * 2019-03-15 2022-12-23 腾讯科技(深圳)有限公司 Speech file segmentation method and apparatus, and computer device
CN110033769B (zh) * 2019-04-23 2022-09-06 施永兵 Input speech processing method, terminal and computer-readable storage medium
CN110211571B (zh) * 2019-04-26 2023-05-26 平安科技(深圳)有限公司 Wrong sentence detection method and apparatus, and computer-readable storage medium
CN110246493A (zh) * 2019-05-06 2019-09-17 百度在线网络技术(北京)有限公司 Address book contact lookup method and apparatus, and storage medium
CN110047467B (zh) * 2019-05-08 2021-09-03 广州小鹏汽车科技有限公司 Speech recognition method and apparatus, storage medium and control terminal
CN110232923B (zh) * 2019-05-09 2021-05-11 海信视像科技股份有限公司 Voice control instruction generation method and apparatus, and electronic device
CN110126725B (zh) * 2019-05-22 2021-04-13 广州小鹏汽车科技有限公司 Prompting method and apparatus for vehicle dashboard indicator lights, and vehicle
CN110310626A (zh) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 Speech training data generation method, apparatus, device and readable storage medium
CN110502631B (zh) * 2019-07-17 2022-11-04 招联消费金融有限公司 Input information response method and apparatus, computer device and storage medium
US11462208B2 (en) * 2019-09-11 2022-10-04 Oracle International Corporation Implementing a correction model to reduce propagation of automatic speech recognition errors
CN111105785B (zh) * 2019-12-17 2023-06-16 广州多益网络股份有限公司 Method and apparatus for recognizing text prosodic boundaries
CN111046666B (zh) * 2019-12-19 2023-05-05 天津新开心生活科技有限公司 Event recognition method and apparatus, computer-readable storage medium, and electronic device
CN113111652B (zh) * 2020-01-13 2024-02-13 阿里巴巴集团控股有限公司 Data processing method, apparatus and computing device
CN111429880A (zh) * 2020-03-04 2020-07-17 苏州驰声信息科技有限公司 Method, system, apparatus and medium for cutting paragraph audio
CN111161711B (zh) * 2020-04-01 2020-07-03 支付宝(杭州)信息技术有限公司 Method and apparatus for sentence segmentation of streaming speech recognition text
CN111708861B (zh) * 2020-04-29 2024-01-23 平安科技(深圳)有限公司 Matching set acquisition method and apparatus based on dual matching, and computer device
CN111785260B (зh) * 2020-07-08 2023-10-27 泰康保险集团股份有限公司 Sentence segmentation method and apparatus, storage medium, and electronic device
CN112101003B (zh) * 2020-09-14 2023-03-14 深圳前海微众银行股份有限公司 Sentence text segmentation method, apparatus, device and computer-readable storage medium
CN112259092B (zh) * 2020-10-15 2023-09-01 深圳市同行者科技有限公司 Voice broadcast method and apparatus, and voice interaction device
CN112151014B (zh) * 2020-11-04 2023-07-21 平安科技(深圳)有限公司 Method, apparatus, device and storage medium for evaluating speech recognition results
CN112434131B (zh) * 2020-11-24 2023-09-29 平安科技(深圳)有限公司 Artificial-intelligence-based text error detection method and apparatus, and computer device
CN113012701B (zh) * 2021-03-16 2024-03-22 联想(北京)有限公司 Recognition method and apparatus, electronic device and storage medium
CN113672760B (zh) * 2021-08-19 2023-07-11 北京字跳网络技术有限公司 Text correspondence construction method and related device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122507A (zh) * 2010-01-08 2011-07-13 龚澍 Speech error detection method using an artificial neural network for front-end processing
CN103680495A (zh) * 2012-09-26 2014-03-26 中国移动通信集团公司 Speech recognition model training method, apparatus and terminal
CN105374356A (zh) * 2014-08-29 2016-03-02 株式会社理光 Speech recognition method, speech scoring method, speech recognition system and speech scoring system
CN107993653A (zh) * 2017-11-30 2018-05-04 南京云游智能科技有限公司 Automatic mispronunciation correction and updating method and system for speech recognition devices

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04141699A (ja) * 1990-10-02 1992-05-15 Sharp Corp Speech recognition device
JP2001312293A (ja) * 2000-04-28 2001-11-09 Matsushita Electric Ind Co Ltd Speech recognition method and apparatus, and computer-readable storage medium
CN101105939B (zh) * 2007-09-04 2012-07-18 安徽科大讯飞信息科技股份有限公司 Pronunciation guidance method
CN104991889B (zh) * 2015-06-26 2018-02-02 江苏科技大学 Automatic proofreading method for non-multi-character-word errors based on fuzzy word segmentation
CN107045496B (zh) * 2017-04-19 2021-01-05 畅捷通信息技术股份有限公司 Error correction method and apparatus for text after speech recognition

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133309A (zh) * 2020-09-22 2020-12-25 掌阅科技股份有限公司 Method for synchronizing audio and text, computing device and storage medium
CN112133309B (zh) * 2020-09-22 2021-08-24 掌阅科技股份有限公司 Method for synchronizing audio and text, computing device and storage medium
CN113569974A (zh) * 2021-08-04 2021-10-29 网易(杭州)网络有限公司 Programming statement error correction method and apparatus, electronic device and storage medium
CN113569974B (zh) * 2021-08-04 2023-07-18 网易(杭州)网络有限公司 Programming statement error correction method and apparatus, electronic device and storage medium
CN113793593A (зh) * 2021-11-18 2021-12-14 北京优幕科技有限责任公司 Training data generation method and device suitable for speech recognition models

Also Published As

Publication number Publication date
CN108766437B (zh) 2020-06-23
CN108766437A (zh) 2018-11-06

Similar Documents

Publication Publication Date Title
WO2019227548A1 (zh) Speech recognition method and apparatus, computer device and storage medium
CN107220235B (zh) Artificial-intelligence-based speech recognition error correction method, apparatus and storage medium
CN110717031B (zh) Intelligent meeting minutes generation method and system
WO2020186778A1 (zh) Wrong word correction method and apparatus, computer apparatus and storage medium
WO2021042503A1 (zh) Information classification and extraction method and apparatus, computer device and storage medium
WO2020258506A1 (zh) Text information matching degree detection method and apparatus, computer device and storage medium
WO2020224119A1 (zh) Audio corpus screening method and apparatus for speech recognition, and computer device
CN110444198B (zh) Retrieval method and apparatus, computer device and storage medium
WO2022142613A1 (zh) Training corpus expansion method and apparatus, and intent recognition model training method and apparatus
CN108140019B (zh) Language model generation device, language model generation method, and recording medium
CN111814466A (zh) Information extraction method based on machine reading comprehension, and related devices
WO2022095353A1 (zh) Method, apparatus, device and storage medium for evaluating speech recognition results
CN109582787B (zh) Entity classification method and apparatus for corpus data in the field of thermal power generation
CN112287680B (zh) Entity extraction method, apparatus, device and storage medium for medical consultation information
CN111695343A (zh) Wrong word correction method, apparatus, device and storage medium
CN112784009B (zh) Topic word mining method and apparatus, electronic device and storage medium
CN109712616B (zh) Telephone number error correction method and apparatus based on data processing, and computer device
CN113220782A (zh) Method, apparatus, device and medium for generating multivariate test data sources
US9679566B2 (en) Apparatus for synchronously processing text data and voice data
CN113642316A (zh) Chinese text error correction method and apparatus, electronic device and storage medium
CN111144118B (zh) Method, system, device and medium for recognizing named entities in colloquial text
CN110503956B (zh) Speech recognition method and apparatus, medium and electronic device
US11270085B2 (en) Generating method, generating device, and recording medium
US11645474B2 (en) Computer-implemented method for text conversion, computer device, and non-transitory computer readable storage medium
WO2021000412A1 (zh) Text matching degree detection method and apparatus, computer device and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920911

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18920911

Country of ref document: EP

Kind code of ref document: A1