WO2021051564A1 - Speech recognition method, apparatus, computing device, and storage medium - Google Patents

Speech recognition method, apparatus, computing device, and storage medium

Info

Publication number
WO2021051564A1
Authority
WO
WIPO (PCT)
Prior art keywords
standard
audio data
text
question
matching
Prior art date
Application number
PCT/CN2019/117675
Other languages
English (en)
French (fr)
Inventor
王健宗
彭俊清
瞿晓阳
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051564A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • G10L15/26 Speech to text systems

Definitions

  • This application relates to the field of natural language processing technology, and in particular to a voice recognition method, device, computing device, and computer non-volatile readable storage medium.
  • The inventor of the present application recognized that ordinary speakers cannot be expected to enunciate like broadcasters, so the volume of certain words in a sentence may fail to meet the requirements of speech recognition. As a result, the correct content cannot be recognized accurately by a speech recognition model alone, which reduces the accuracy of speech recognition.
  • The purpose of this application is to provide a voice recognition method, device, computing device, and computer non-volatile readable storage medium.
  • A speech recognition method, including:
  • for each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replacing the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result corresponding to the preliminary recognition result;
  • selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.
  • A speech recognition device, including:
  • an input module, configured to input the acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result in text form output by the speech recognition model;
  • a first determining module, configured to determine the audio segment in the audio data corresponding to each text unit in the preliminary recognition result;
  • a replacement module, configured to, for each text unit in the preliminary recognition result, replace the text unit with a placeholder if the sound pressure of the audio segment corresponding to the text unit meets a predetermined condition, so as to obtain an intermediate result corresponding to the preliminary recognition result;
  • a second determining module, configured to determine the degree of matching between the intermediate result and each standard text sequence in a preset standard text library;
  • a recognition module, configured to select, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.
  • A computing device, including a memory and a processor, where the memory is configured to store a voice recognition program for the processor, and the processor is configured to implement the above voice recognition method by executing the voice recognition program.
  • A computer non-volatile readable storage medium storing computer-readable instructions, on which a voice recognition program is stored; when executed by a processor, the voice recognition program implements the above voice recognition method.
  • With the above speech recognition method, device, computing device, and computer non-volatile readable storage medium, audio data is input into the speech recognition model to obtain a preliminary recognition result. Each text unit whose corresponding audio segment has a sound pressure meeting the predetermined condition is then replaced with a placeholder to obtain the intermediate result, and the intermediate result is used for final recognition. This reduces the possibility that text units whose sound pressure does not meet certain requirements make the final recognition result inaccurate, thereby improving the accuracy of speech recognition.
  • Fig. 1 is a schematic diagram showing an application scenario of a voice recognition method according to an exemplary embodiment
  • Fig. 2 is a flow chart showing a method for speech recognition according to an exemplary embodiment
  • FIG. 3 is a flowchart of the steps following step 250 in an embodiment based on the embodiment corresponding to FIG. 2;
  • FIG. 4 is a detailed flowchart of step 240 in an embodiment based on the embodiment corresponding to FIG. 2;
  • Fig. 5 is a block diagram showing a voice recognition device according to an exemplary embodiment
  • Fig. 6 is an exemplary block diagram showing a computing device that implements the above voice recognition method according to an exemplary embodiment
  • Fig. 7 shows a non-volatile computer readable storage medium for realizing the above voice recognition method according to an exemplary embodiment.
  • Speech recognition refers to technology that converts the vocabulary in human speech into computer-readable input, for example converting human speech into a character sequence composed of words, symbols, and the like. Speech and the content it carries are information of two entirely different dimensions, and in the past only humans were able to extract and process the information in speech. With the development of natural language processing technologies such as speech recognition, it has become possible to perform speech recognition with computer equipment and other machines. The speech recognition method provided in this application further improves on existing speech recognition technology and can produce beneficial effects such as improved recognition accuracy.
  • The implementation terminal of this application can be any device with computing, processing, and communication functions. The device can be connected to an external device to receive or send information. It can be a portable mobile device, such as a smart phone, tablet computer, notebook computer, or PDA (Personal Digital Assistant); a fixed device, such as computer equipment, a field terminal, a desktop computer, a server, or a workstation; or a collection of multiple devices, such as a server cluster or the physical infrastructure of cloud computing.
  • The implementation terminal of this application may be a server, a server cluster, or the physical infrastructure of cloud computing.
  • Fig. 1 is a schematic diagram showing an application scenario of a voice recognition method according to an exemplary embodiment. As shown in Figure 1, it includes a server 100, a first user terminal 110, and a second user terminal 120. Each user terminal communicates with the server 100 through a communication link. The communication link can be used to send and receive data.
  • The server 100 is the implementation terminal of this application.
  • Voice data can be entered into a user terminal through a voice recording device (such as a microphone) embedded in or connected to the user terminal. After obtaining the voice data, the user terminal can send it to the server 100, which performs the voice recognition task.
  • The server 100 can return the voice recognition result, that is, the text information corresponding to the voice data entered by the user, to the user terminal that sent the voice data.
  • The server 100 may be embedded with a trained voice recognition model. The voice data may be input into the voice recognition model, which outputs a preliminary recognition result; the server 100 can then perform further recognition on the basis of the preliminary recognition result to obtain the final recognition result.
  • The ellipsis in FIG. 1 indicates that the number of user terminals that establish a communication link with the server 100 and can send voice data to it is not fixed, and can be larger or smaller.
  • Fig. 1 shows only one embodiment of the present application. Although in this embodiment the implementation terminal is a server, and direct acquisition of voice data and voice recognition are performed on two different terminals, in other embodiments or specific applications any of the aforementioned types of terminal may be selected as the implementation terminal as needed, and the two tasks can also be performed on the same terminal. This application does not limit this, and the scope of protection should not be restricted as a result.
  • Fig. 2 is a flow chart showing a method for speech recognition according to an exemplary embodiment. This embodiment can be executed by the server in the embodiment of FIG. 1. As shown in Figure 2, it includes the following steps:
  • Step 210: Input the acquired audio data into a pre-established speech recognition model, and obtain a preliminary recognition result in text form output by the speech recognition model.
  • The audio data can be acquired as an audio stream or as an audio file. The audio file format can be arbitrary, including but not limited to .WAV and .MP3. The audio data can be data received and processed directly by the local end, or data generated by other terminals.
  • Before step 210, the method may further include: receiving audio data sent from a target terminal to obtain the audio data.
  • Calling a configured voice recognition interface invokes the pre-established voice recognition model. By calling this interface, the acquired audio data is input into the pre-established model, and the recognition result returned by the interface is the preliminary recognition result in text form output by the speech recognition model.
  • The pre-established speech recognition model can be any type of trained speech recognition model, can include multiple sub-models or components, and can also be referred to as a speech recognition system.
  • The speech recognition model can be a traditional speech recognition model that includes an acoustic model, a language model, and a decoder, or it can be an end-to-end speech recognition model.
  • Acoustic models in traditional speech recognition models include but are not limited to the GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) model, the DNN-HMM (Deep Neural Network-Hidden Markov Model) model, and the DFSMN (Deep Feedforward Sequential Memory Network) model.
  • Language models include but are not limited to n-gram and Transformer models.
  • End-to-end speech recognition models include but are not limited to the wav2letter++ framework model and the LSTM-CTC (Long Short-Term Memory-Connectionist Temporal Classification) model.
  • The speech recognition process based on the GMM-HMM model is as follows: perform Voice Activity Detection (VAD) on the voice data and cut off the silent parts at the beginning and end; apply pre-emphasis to boost the high-frequency part of the voice data; window the voice data to reduce edge effects at the ends of the speech; divide the voice data into frames; extract features from each frame to obtain a feature matrix of size (acoustic feature dimension) × N, where N is the total number of frames and the feature types include but are not limited to LPC (linear predictive coding) and MFCC (Mel Frequency Cepstrum Coefficient); input the feature matrix into the GMM-HMM acoustic model, which calculates the acoustic model score of each phoneme sequence from the conditional probabilities of phonemes and states given the frames, where the GMM models the distribution of speech features and the HMM models the time sequence; use the language model to obtain the language model score of the text combination corresponding to each phoneme sequence; combine the acoustic model score and the language model score to determine the output phoneme sequence; and use the dictionary to obtain the text corresponding to the output phoneme sequence, which gives the speech recognition result.
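The front-end steps named above (pre-emphasis, framing, windowing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the 0.97 pre-emphasis coefficient, the Hamming window, and the choice to apply the window per frame are all assumptions, since the source does not specify them.

```python
import math

def pre_emphasize(samples, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[i] - coeff * samples[i - 1] for i in range(1, len(samples))
    ]

def split_frames(samples, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]

def hamming(frame):
    """Apply a Hamming window so the frame edges taper toward zero."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [
        x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]

# Toy signal standing in for real audio samples.
signal = [float(i % 8) for i in range(40)]
frames = [hamming(f) for f in split_frames(pre_emphasize(signal), frame_len=16, hop=8)]
```

Feature extraction (for example MFCC) would then run on each windowed frame; it is left out here because it needs a spectral transform that the source does not detail.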
  • Preliminary recognition results include but are not limited to characters, numbers and other symbols, where the language corresponding to the characters can be arbitrary, and the types of characters include but are not limited to Chinese characters, English words, etc.
  • For Chinese, the phonemes can be initials, finals, and the like; for English, the phonemes can be one or more of the 39-phoneme set provided by Carnegie Mellon University.
  • That the preliminary recognition result is output in text form means that the speech recognition model outputs and records the result as text, usually in a text file.
  • The format of the text file includes but is not limited to .doc, .txt, JSON, XML, and HTML.
  • Step 220: Determine the audio segment in the audio data corresponding to each text unit in the preliminary recognition result.
  • A text unit is the basic language unit at a preset text level. For example, for Chinese a text unit can be a single character or a word, and for English a text unit can be a word. The audio segment corresponding to each text unit can be determined using the speech recognition model itself. For example, in the GMM-HMM model each frame of speech data corresponds to a state, each state to a phoneme, and each phoneme to text. Following this chain of correspondence, the speech data frames corresponding to a text unit in the preliminary recognition result can be located in the audio data, and the corresponding audio segment is thereby obtained.
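The frame-to-text chain described above can be illustrated with a small sketch. It assumes a hypothetical per-frame alignment in which each frame is already labeled with the index of the text unit it belongs to; how a given recognizer exposes such an alignment is not stated in the source, and `segments_from_alignment`, `hop`, and `frame_len` are illustrative names.

```python
def segments_from_alignment(frame_to_unit, hop, frame_len):
    """Map each text-unit index to a (start_sample, end_sample) span.

    frame_to_unit[i] is the index of the text unit that frame i was aligned to
    (a hypothetical alignment format, assumed for illustration).
    """
    spans = {}
    for frame_idx, unit in enumerate(frame_to_unit):
        start = frame_idx * hop
        end = start + frame_len
        if unit in spans:
            # Extend the existing span to cover this later frame.
            spans[unit] = (spans[unit][0], end)
        else:
            spans[unit] = (start, end)
    return spans

# Eight frames aligned to three text units.
alignment = [0, 0, 0, 1, 1, 2, 2, 2]
spans = segments_from_alignment(alignment, hop=160, frame_len=400)
```

Each span can then be used to slice the original sample array and obtain the audio segment for that text unit.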
  • Step 230: For each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replace the text unit in the preliminary recognition result with a placeholder to obtain the intermediate result corresponding to the preliminary recognition result.
  • The predetermined condition indicates that the sound pressure of the audio segment corresponding to the text unit is low.
  • The placeholder can be any symbol or combination of symbols, for example &, %, or #.
  • Sound pressure is the ordinate value of the audio segment's sound signal in the waveform diagram and can be used to measure the loudness of the audio signal. Loudness is generally positively correlated with volume, so a low sound pressure in the audio segment corresponding to a text unit means that the text unit was recognized from low-volume audio data.
  • In one embodiment, the predetermined condition is: the maximum value of the sound pressure of the audio segment corresponding to the text unit is lower than a preset sound pressure average value threshold. The maximum value of the sound pressure of an audio segment is the peak amplitude of the sound pressure in that segment.
  • In one embodiment, the predetermined condition is: the minimum value of the sound pressure of the audio segment corresponding to the text unit is lower than a preset sound pressure average value threshold.
  • The advantage of this embodiment is that, because the minimum sound pressure of an audio segment is generally very small, a text unit is replaced with a placeholder as soon as the minimum sound pressure of its audio segment falls below the preset threshold. This lowers the bar for replacement, increases the number of text units replaced with placeholders in the preliminary recognition result, and ensures that the audio segments of all text units retained in the preliminary recognition result have sufficiently high sound pressure, which improves the accuracy of speech recognition to a certain extent.
  • In one embodiment, the predetermined condition is: the average value of the sound pressure of the audio segment corresponding to the text unit is lower than a preset sound pressure average value threshold.
  • The average value of the sound pressure of an audio segment reflects the central tendency of the sound pressure in the segment. The advantage of this embodiment is that using the average sound pressure as the criterion for replacing a text unit with a placeholder strikes a balance between the number of text units retained in the preliminary recognition result and the sound pressure level of their audio segments.
  • To evaluate this condition, the integral of the sound pressure of the audio segment corresponding to the text unit is calculated, and the ratio of the integral to the length of the integration interval is taken as the average sound pressure of the segment. This average is then compared with the preset sound pressure average value threshold to determine whether the sound pressure of the audio segment meets the predetermined condition.
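The average-based condition can be sketched for sampled audio, where the integral divided by the interval length becomes a sum divided by the sample count. One assumption is made loudly here: the average is taken over absolute sample values, because a signed waveform averages to roughly zero; the source does not say how the sign is handled. The names `mean_abs_pressure` and `mask_quiet_units` and the threshold value are illustrative.

```python
def mean_abs_pressure(segment):
    """Discrete stand-in for integral / interval length (absolute values assumed)."""
    if not segment:
        return 0.0
    return sum(abs(x) for x in segment) / len(segment)

def mask_quiet_units(units, segments, threshold, placeholder="&"):
    """Replace each text unit whose segment average is below the threshold."""
    return [
        placeholder if mean_abs_pressure(seg) < threshold else unit
        for unit, seg in zip(units, segments)
    ]

units = ["I", "like", "you"]
segments = [
    [0.5, -0.6, 0.4],      # loud
    [0.01, -0.02, 0.01],   # quiet
    [0.3, -0.4, 0.5],      # loud
]
intermediate = mask_quiet_units(units, segments, threshold=0.1)
print(intermediate)  # ['I', '&', 'you']
```

The maximum-based and minimum-based variants would differ only in the statistic computed per segment.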
  • In one embodiment, the predetermined condition is evaluated as follows: a predetermined number of frames are taken arbitrarily from the audio segment corresponding to the text unit; if the average sound pressure of each of these frames is lower than the preset sound pressure average value threshold, the text unit in the preliminary recognition result is replaced with a placeholder.
  • Sampling indirectly reflects the sound pressure distribution of the entire audio segment. Deciding whether to replace the text unit from a limited number of extracted frames reduces the amount of calculation.
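A sketch of the sampling variant follows. It assumes frames are drawn without replacement using Python's `random` module and, as an assumption not stated in the source, averages absolute sample values within each frame. The function name and parameters are hypothetical.

```python
import random

def is_quiet_by_sampling(frames, num_samples, threshold, rng=None):
    """Return True if every randomly picked frame has a low average sound pressure.

    Averages use absolute sample values (an assumption, see the lead-in).
    """
    if not frames:
        return False
    rng = rng or random.Random()
    picked = rng.sample(frames, min(num_samples, len(frames)))
    return all(
        sum(abs(x) for x in frame) / len(frame) < threshold
        for frame in picked
    )

quiet_frames = [[0.01, -0.02], [0.02, 0.01], [-0.01, 0.0], [0.0, 0.01]]
loud_frames = [[0.5, -0.6], [0.4, 0.7], [-0.8, 0.3], [0.6, 0.5]]
```

Only `num_samples` frames are inspected, which is where the reduction in computation comes from; the trade-off is that the decision depends on which frames happen to be drawn.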
  • The method may further include: merging multiple consecutive placeholders in the preliminary recognition result into one placeholder.
  • In one embodiment, the placeholder obtained by merging is one of the placeholders being merged.
  • In one embodiment, the placeholder obtained by merging is different from the placeholders being merged.
  • Merging multiple consecutive placeholders in the preliminary recognition result into one placeholder includes: starting from the first placeholder in the preliminary recognition result, determining for each placeholder whether the character after it is also a placeholder; if so, merging the placeholder and the placeholder after it into one placeholder.
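The merging rule above amounts to collapsing each run of placeholders into one. A minimal sketch, assuming the result is held as a list of text units and that the merged placeholder is the same symbol as the ones being merged (the first of the two variants described):

```python
def merge_placeholders(units, placeholder="&"):
    """Collapse every run of consecutive placeholders into a single placeholder."""
    merged = []
    for unit in units:
        if unit == placeholder and merged and merged[-1] == placeholder:
            continue  # skip a placeholder that directly follows another one
        merged.append(unit)
    return merged

print(merge_placeholders(["I", "&", "&", "you", "&"]))  # ['I', '&', 'you', '&']
```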
  • Step 240: Determine the degree of matching between the intermediate result and each standard text sequence in the preset standard text library.
  • The basic elements of a standard text sequence can be symbols such as words and numbers, and a standard text sequence can be a phrase, a sentence, or a paragraph.
  • In one embodiment, step 240 may include: for each standard text sequence, obtaining the ratio of the number of text units included in both the standard text sequence and the intermediate result to the number of all text units included in the intermediate result, and using this ratio as the degree of matching between the intermediate result and the standard text sequence.
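The ratio-based matching degree can be sketched as below. Two points are assumptions, because the source does not settle them: placeholders are not counted as text units in the denominator, and membership is checked per unit without regard to position. The function name is illustrative.

```python
def overlap_ratio(intermediate, standard, placeholder="&"):
    """Share of the intermediate result's text units that also occur in the standard sequence.

    Placeholders are excluded from the count (an assumption).
    """
    kept = [u for u in intermediate if u != placeholder]
    if not kept:
        return 0.0
    standard_units = set(standard)
    shared = sum(1 for u in kept if u in standard_units)
    return shared / len(kept)

intermediate = ["I", "&", "you"]
print(overlap_ratio(intermediate, ["I", "like", "you"]))  # 1.0
print(overlap_ratio(intermediate, ["I", "love", "him"]))  # 0.5
```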
  • In one embodiment, step 240 may include: using a preset dictionary to build a vector for the intermediate result and for each standard text sequence in the preset standard text library; and, for each standard text sequence, using the Euclidean distance between the vector of the standard text sequence and the vector of the intermediate result as the degree of matching between the standard text sequence and the intermediate result.
  • The dictionary records the vector element value corresponding to each word, and words with similar semantics have similar vector element values.
  • For example, if the intermediate result is "I&Love&You", the vector generated for it can be (35, 450, 37); if a standard text sequence in the standard text library is "I like you", the vector generated for it can be (35, 452, 37). The similarity between the intermediate result and the standard text sequence is then obtained by calculating the Euclidean distance between the two vectors.
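The worked example above can be reproduced with a toy dictionary. The dictionary contents and function names are assumptions chosen to match the numbers in the text; how sequences of different lengths are compared is not specified in the source, so this sketch simply rejects them. Note that a smaller distance indicates a closer match.

```python
import math

# Hypothetical dictionary: semantically close words get close values.
toy_dictionary = {"I": 35, "love": 450, "like": 452, "you": 37}

def to_vector(units, dictionary):
    """Look up the vector element value for each text unit."""
    return [dictionary[u] for u in units]

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

intermediate_vec = to_vector(["I", "love", "you"], toy_dictionary)  # [35, 450, 37]
standard_vec = to_vector(["I", "like", "you"], toy_dictionary)      # [35, 452, 37]
distance = euclidean_distance(intermediate_vec, standard_vec)
print(distance)  # 2.0
```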
  • Step 250: Based on the matching degree, select a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.
  • The target standard text sequence is the standard text sequence selected as the final recognition result from the standard text library. In this sense, the target standard text sequence and the final recognition result are the same.
  • In one embodiment, selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result includes: obtaining from the standard text library the standard text sequence with the greatest matching degree as the target standard text sequence, and using the target standard text sequence as the final recognition result.
  • In one embodiment, selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result includes: obtaining from the standard text library the standard text sequences whose matching degree is greater than a predetermined matching degree threshold as candidate standard text sequences; and selecting any one of the candidate standard text sequences as the target standard text sequence, which is used as the final recognition result.
  • In some cases the matching degree cannot fully and objectively determine whether a standard text sequence should be selected as the target standard text sequence, that is, the final recognition result. In particular, when several standard text sequences match the intermediate result sufficiently well, one with a smaller matching degree may be the more suitable final recognition result. The advantage of this embodiment is therefore that all standard text sequences with a sufficiently large matching degree are equally likely to be selected as the final recognition result, which improves the fairness of recognition.
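The two selection strategies can be sketched side by side. The sketch assumes matching degrees are stored in a dictionary keyed by standard text sequence and that a larger value means a better match; the function names and scores are illustrative.

```python
import random

def pick_best(scores):
    """Strategy 1: the sequence with the greatest matching degree."""
    return max(scores, key=scores.get)

def pick_any_above(scores, threshold, rng=None):
    """Strategy 2: any candidate above the threshold, chosen uniformly at random."""
    rng = rng or random.Random()
    candidates = [seq for seq, degree in scores.items() if degree > threshold]
    return rng.choice(candidates) if candidates else None

scores = {"I like you": 0.9, "I love you": 0.8, "See you later": 0.2}
best = pick_best(scores)
any_candidate = pick_any_above(scores, threshold=0.5)
```

With a distance-based matching degree such as the Euclidean variant, the comparison direction would need to be reversed, since a smaller distance means a closer match.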
  • In one embodiment, the audio data is question audio data, the standard text library is a standard question library, the standard text sequence is a standard question, and each standard question corresponds to a standard answer. The method may further include: Step 260: Obtain the standard answer corresponding to the final recognition result.
  • In one embodiment, standard questions and their standard answers are stored in correspondence in the standard question library. Once the final recognition result, that is, the selected standard question, is obtained, the standard answer stored in correspondence with it is retrieved as the standard answer corresponding to the final recognition result.
  • In one embodiment, a standard question and standard answer correspondence database is preset, which stores the identifier of each standard question together with the corresponding standard answer, and the standard text library further includes the identifier corresponding to each standard question. Obtaining the standard answer corresponding to the final recognition result includes: obtaining the identifier corresponding to the final recognition result from the standard text library; and retrieving, from the correspondence database, the standard answer stored for that identifier as the standard answer corresponding to the final recognition result.
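The identifier-based lookup can be sketched with two in-memory mappings standing in for the standard text library and the correspondence database. The question text, identifier format, and answer are invented for illustration only.

```python
# Hypothetical stand-ins for the standard text library and the correspondence database.
question_ids = {"How do I reset my password?": "Q001"}
answers_by_id = {"Q001": "Open Settings and choose Reset password."}

def lookup_answer(final_result, question_ids, answers_by_id):
    """Resolve the recognized standard question to its identifier, then to its answer."""
    question_id = question_ids.get(final_result)
    if question_id is None:
        return None
    return answers_by_id.get(question_id)

answer = lookup_answer("How do I reset my password?", question_ids, answers_by_id)
```

Keeping answers keyed by identifier rather than by question text means an answer can be updated or shared without touching the standard text library.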
  • Step 270: Output the standard answer.
  • The local end can output the standard answer in any manner.
  • In one embodiment, the acquired audio data is audio data received by the local end from a target terminal, and outputting the standard answer includes: sending the standard answer to the target terminal so that the target terminal can display the standard answer.
  • In one embodiment, the local end has a display screen, and outputting the standard answer includes: printing the standard answer to the display screen of the local end.
  • In one embodiment, the local end has a display unit, and outputting the standard answer includes: pushing a pop-up window containing the standard answer to the display unit of the local end.
  • In one embodiment, the matching degree is a first matching degree, and the standard question library further includes standard audio data corresponding to each standard question. Selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result includes: selecting candidate standard questions from the standard questions in the standard question library based on the first matching degree; obtaining the standard audio data corresponding to each candidate standard question from the standard text library; determining the degree of matching between each piece of standard audio data and the question audio data as a second matching degree; and selecting, according to the second matching degree of the standard audio data corresponding to each candidate standard question, the target standard question from the candidate standard questions as the final recognition result.
  • The advantage of this embodiment is that, after several candidate standard questions have been selected according to the first matching degree, the target standard question is further selected according to the second matching degree between the standard audio data and the question audio data. The final recognition result thus depends on both the first and the second matching degree, which improves its accuracy.
  • In one embodiment, selecting candidate standard questions from the standard questions in the standard question library based on the first matching degree includes: taking the standard questions in the standard question library whose first matching degree is greater than a preset first matching degree threshold as the candidate standard questions. Selecting the target standard question from the candidate standard questions as the final recognition result according to the second matching degree includes: taking the candidate standard question whose standard audio data has the largest second matching degree as the target standard question, and using the target standard question as the final recognition result.
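The two-stage selection can be sketched as a filter followed by an argmax. The sketch assumes both matching degrees are precomputed and that larger values mean better matches; the names and scores are hypothetical.

```python
def select_target_question(first_scores, second_scores, first_threshold):
    """Filter by the text-based first matching degree, then rank by the audio-based second one."""
    candidates = [q for q, s in first_scores.items() if s > first_threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda q: second_scores[q])

first_scores = {"q1": 0.9, "q2": 0.8, "q3": 0.3}    # text-level matching
second_scores = {"q1": 0.4, "q2": 0.7, "q3": 0.99}  # audio-level matching
target = select_target_question(first_scores, second_scores, first_threshold=0.5)
print(target)  # q2
```

In this example q3 has the best audio match but is never considered, because it fails the first-stage threshold.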
  • In one embodiment, determining the degree of matching between each piece of standard audio data and the question audio data as the second matching degree includes: dividing the standard audio data and the question audio data into frames; extracting the feature vector of each frame of audio data; constructing the feature matrix of the question audio data and of each piece of standard audio data from the feature vectors of their frames; and, for each piece of standard audio data, determining the similarity between its feature matrix and the feature matrix of the question audio data as the degree of matching between that standard audio data and the question audio data.
  • In one embodiment, before the standard audio data and the question audio data are divided into frames, the method further includes: scaling the standard audio data in the time dimension to the same length as the question audio data. Dividing the standard audio data and the question audio data into frames then includes: dividing the question audio data and the scaled standard audio data into frames.
  • The advantage of this embodiment is that, after scaling in the time dimension, the standard audio data and the question audio data have the same length, so the resulting feature matrices of the standard audio data and the question audio data have the same size, which makes it easy to calculate the similarity of the feature matrices.
  • In one embodiment, the extracted feature vector of each frame of audio data is a vector composed of MFCC features.
  • The frames into which the question audio data is divided can be obtained directly.
  • In one embodiment, constructing the feature matrix of the question audio data and of each piece of standard audio data includes: for the question audio data or each piece of standard audio data, arranging the feature vectors of its frames in frame order to obtain the feature matrix of that audio data. Determining, for each piece of standard audio data, the similarity between its feature matrix and the feature matrix of the question audio data includes: flattening the feature matrix of each piece of standard audio data and the feature matrix of the question audio data into one-dimensional vectors; and, for each piece of standard audio data, determining the Euclidean distance between the one-dimensional vector of its feature matrix and the one-dimensional vector of the feature matrix of the question audio data, which is taken as the similarity between the two feature matrices.
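The flatten-and-compare step can be sketched as follows, assuming each feature matrix is a list of per-frame feature vectors in frame order and that both matrices already have the same size (for example after time scaling). The feature values are placeholders rather than real MFCCs, and the function names are illustrative.

```python
import math

def flatten(matrix):
    """Concatenate the rows (per-frame feature vectors) into one vector."""
    return [value for row in matrix for value in row]

def matrix_distance(a, b):
    """Euclidean distance between two same-sized feature matrices after flattening."""
    flat_a, flat_b = flatten(a), flatten(b)
    if len(flat_a) != len(flat_b):
        raise ValueError("feature matrices must have the same size")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)))

question_features = [[0.0, 0.0], [0.0, 0.0]]
standard_features = [[3.0, 0.0], [0.0, 4.0]]
print(matrix_distance(question_features, standard_features))  # 5.0
```

As with the text vectors, a smaller distance corresponds to more similar audio.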
  • the determining the degree of matching between each standard audio data and the question audio data as the second degree of matching includes: expanding each standard audio data to match the question in the time dimension. Sentence audio data of the same length; select a predetermined number of equidistant time points within the time length; obtain the sound pressure values of each standard audio data and the question audio data at the selected time points, and target For each standard audio data or the question audio data, the sound pressure value of the audio data at each time point is formed into a vector; for each standard audio data, the vector of the standard audio data and the question audio data are obtained The Euclidean distance of the vector of is used as the second degree of matching between the standard audio data and the question audio data.
  • In summary, the speech recognition method provided in the embodiment of FIG. 2 inputs audio data into the speech recognition model to obtain a preliminary recognition result, replaces the text units in the preliminary recognition result whose corresponding audio segments have a sound pressure satisfying the predetermined condition with placeholders to obtain an intermediate result, and uses the intermediate result for final recognition. This reduces the likelihood that text units whose sound pressure does not meet certain requirements make the final recognition result inaccurate, and thereby improves the accuracy of speech recognition.
  • FIG. 4 is a detailed flowchart of step 240 of an embodiment shown according to the embodiment corresponding to FIG. 2. As shown in Figure 4, it includes the following steps:
  • Step 241: For each standard text sequence, obtain the ratio of the number of text units contained in both the standard text sequence and the intermediate result to the number of all text units contained in the intermediate result, as a first ratio.
  • If a text unit in a standard text sequence also exists in the intermediate result, that text unit is a text unit contained in both the standard text sequence and the intermediate result.
  • Step 242: For each standard text sequence, determine, in the intermediate result, the placeholders located between the text units contained in both the standard text sequence and the intermediate result, as target placeholders.
  • In the intermediate result, the text units whose corresponding audio segments have a sound pressure satisfying the predetermined condition have already been replaced with placeholders. Placeholders may therefore also exist between the text units that the standard text sequence and the intermediate result have in common.
  • Step 243: For each standard text sequence and each target placeholder, obtain the two text units immediately before and after the target placeholder in the intermediate result, and determine whether a placeholder exists between the two text units in the standard text sequence that are the same as those two text units.
  • The target placeholders are determined from the text units contained in both the intermediate result and the standard text sequence, so the two text units before and after a target placeholder in the intermediate result also appear in the corresponding standard text sequence, and a placeholder may exist between those matching text units.
  • Step 244: If so, mark the placeholder as a corresponding placeholder.
  • Step 245: For each standard text sequence, obtain the ratio of the number of corresponding placeholders determined for the standard text sequence to the number of target placeholders, as a second ratio.
  • Corresponding placeholders are selected from among the target placeholders, so the number of corresponding placeholders is generally smaller than the number of target placeholders.
  • Step 246: Determine the matching degree between the intermediate result and each standard text sequence in the preset standard text library based on the first ratio and the second ratio obtained for each standard text sequence.
  • In one embodiment, for each standard text sequence, the weighted sum of the first ratio and the second ratio obtained for that standard text sequence is taken as the matching degree between the intermediate result and the standard text sequence.
  • In summary, the advantage of the embodiment shown in FIG. 4 is that it combines two indicators, the ratio of text unit counts and the ratio of corresponding placeholder counts, to determine the matching degree between the intermediate result and the standard text sequence, which improves the accuracy of the determined matching degree to some extent.
  • the present application also provides a voice recognition device, and the following are device embodiments of the present application.
  • Fig. 5 is a block diagram showing a speech recognition device according to an exemplary embodiment. As shown in FIG. 5, the device 500 includes:
  • the input module 510 is configured to input the acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result in text form output by the speech recognition model;
  • the first determining module 520 is configured to determine the audio segment in the audio data that corresponds to each text unit in the preliminary recognition result;
  • the replacement module 530 is configured to, for each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replace the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result corresponding to the preliminary recognition result;
  • the second determining module 540 is configured to determine the matching degree between the intermediate result and each standard text sequence in the preset standard text library;
  • the recognition module 550 is configured to select a target standard text sequence from each standard text sequence in the standard text library as a final recognition result based on the matching degree.
  • the computing equipment includes:
  • at least one processor; and
  • a memory communicatively connected with the at least one processor; wherein
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the speech recognition method shown in any of the above exemplary embodiments.
  • the computing device 600 according to this embodiment of the present application will be described below with reference to FIG. 6.
  • the computing device 600 shown in FIG. 6 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the computing device 600 is represented in the form of a general-purpose computing device.
  • the components of the computing device 600 may include, but are not limited to: the aforementioned at least one processing unit 610, the aforementioned at least one storage unit 620, and a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610).
  • The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to the various exemplary embodiments of this application described in the "Methods of Embodiments" section of this specification.
  • the storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 621 and/or a cache storage unit 622, and may further include a read-only storage unit (ROM) 623.
  • the storage unit 620 may also include a program/utility tool 624 having a set of (at least one) program module 625.
  • Such program modules 625 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • The computing device 600 may also communicate with one or more external devices 800 (such as keyboards, pointing devices, and Bluetooth devices), with one or more devices that enable a user to interact with the computing device 600, and/or with any device (such as a router or modem) that enables the computing device 600 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 650.
  • the computing device 600 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 660. As shown in the figure, the network adapter 660 communicates with other modules of the computing device 600 through the bus 630.
  • The example embodiments described here can be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of this application can therefore be embodied as a software product, which can be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, server, terminal device, or network device) to execute the method according to the embodiments of this application.
  • According to a fourth aspect of this application, a computer non-volatile readable storage medium is also provided, on which is stored a program product capable of implementing the method described above in this specification.
  • In some possible implementations, various aspects of this application can also be implemented in the form of a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to the various exemplary embodiments of this application described in the "Exemplary Method" section of this specification.
  • Referring to FIG. 7, a computer non-volatile readable storage medium 700 for implementing the above method according to an embodiment of this application is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code and can run on a terminal device such as a personal computer.
  • However, the program product of this application is not limited to this.
  • In this document, the computer non-volatile readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
  • the program code used to perform the operations of this application can be written in any combination of one or more programming languages.
  • The programming languages include object-oriented programming languages, such as Java and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • Where a remote computing device is involved, it can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).

Abstract

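A speech recognition method, apparatus, computing device, and computer non-volatile readable storage medium, relating to the field of natural language processing. The method includes: inputting audio data into a speech recognition model to obtain an output preliminary recognition result (210); determining the audio segment in the audio data that corresponds to each text unit in the preliminary recognition result (220); for each text unit, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replacing the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result (230); determining the matching degree between the intermediate result and each standard text sequence in a preset standard text library (240); and, based on the matching degree, selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result (250). The method reduces the likelihood of inaccurate speech recognition caused by text units whose corresponding audio segments have a sound pressure that does not meet certain requirements, and thereby improves the accuracy of speech recognition.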

Description

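Speech recognition method, apparatus, computing device, and storage medium
Technical Field
This application is based on and claims priority to Chinese patent application No. CN 201910877191.2, filed on September 17, 2019 and entitled "Speech recognition method, apparatus, medium, and electronic device", the entire contents of which are incorporated herein by reference.
This application relates to the field of natural language processing technology, and in particular to a speech recognition method, apparatus, computing device, and computer non-volatile readable storage medium.
Background
With the development of the mobile Internet, technologies related to natural language processing, including speech recognition, have developed rapidly. At present, speech recognition is commonly implemented by building a speech recognition model and feeding the speech data entered by a user into that model, which outputs the corresponding text information and thereby completes speech recognition.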
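Summary of the Invention
Technical Problem
However, the inventors of this application have realized that ordinary people cannot speak the way a broadcaster does, so the volume of some words in a sentence fails to meet the specific requirements of speech recognition. As a result, a speech recognition model used on its own cannot accurately recognize the correct content, which lowers the accuracy of speech recognition.
Solution to the Problem
Technical Solution
In the field of natural language processing technology, in order to solve the above technical problem, the purpose of this application is to provide a speech recognition method, apparatus, computing device, and computer non-volatile readable storage medium.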
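In a first aspect, a speech recognition method is provided, including:
inputting acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result, in text form, output by the speech recognition model;
determining the audio segment in the audio data that corresponds to each text unit in the preliminary recognition result;
for each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replacing the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result corresponding to the preliminary recognition result;
determining the matching degree between the intermediate result and each standard text sequence in a preset standard text library; and
based on the matching degree, selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.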
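In a second aspect, a speech recognition apparatus is provided, including:
an input module, configured to input acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result, in text form, output by the speech recognition model;
a first determining module, configured to determine the audio segment in the audio data that corresponds to each text unit in the preliminary recognition result;
a replacement module, configured to, for each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replace the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result corresponding to the preliminary recognition result;
a second determining module, configured to determine the matching degree between the intermediate result and each standard text sequence in a preset standard text library; and
a recognition module, configured to select, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.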
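In a third aspect, a computing device is provided, including a memory and a processor, the memory being used to store a speech recognition program of the processor, and the processor being configured to implement the above speech recognition method by executing the speech recognition program.
In a fourth aspect, a computer non-volatile readable storage medium storing computer-readable instructions is provided, on which a speech recognition program is stored, the speech recognition program implementing the above speech recognition method when executed by a processor.
The above speech recognition method, apparatus, computing device, and computer non-volatile readable storage medium input audio data into the speech recognition model to obtain a preliminary recognition result, replace the text units in the preliminary recognition result whose corresponding audio segments have a sound pressure satisfying the predetermined condition with placeholders to obtain an intermediate result, and use the intermediate result for final recognition. This reduces the likelihood that text units whose sound pressure does not meet certain requirements make the final recognition result inaccurate, and thereby improves the accuracy of speech recognition.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit this application.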
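Advantageous Effects of the Invention
Brief Description of the Drawings
Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario of a speech recognition method according to an exemplary embodiment;
FIG. 2 is a flowchart of a speech recognition method according to an exemplary embodiment;
FIG. 3 is a flowchart of the steps after step 250 in an embodiment based on the embodiment corresponding to FIG. 2;
FIG. 4 is a detailed flowchart of step 240 in an embodiment based on the embodiment corresponding to FIG. 2;
FIG. 5 is a block diagram of a speech recognition apparatus according to an exemplary embodiment;
FIG. 6 is an example block diagram of a computing device implementing the above speech recognition method according to an exemplary embodiment;
FIG. 7 shows a computer non-volatile readable storage medium implementing the above speech recognition method according to an exemplary embodiment.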
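Embodiments of the Invention
Mode for the Invention
Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
In addition, the drawings are only schematic illustrations of this application and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, so repeated descriptions of them are omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities.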
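This application first provides a speech recognition method. Speech recognition is a technology that converts the words in human speech into computer-readable input; for example, converting human speech into a character sequence made up of text, symbols, and the like is speech recognition. Speech and the content carried by speech are two entirely different dimensions of information, and for a long time only humans could extract and process the information in speech. With the development of natural language processing technologies such as speech recognition, it has become possible to perform speech recognition with machines such as computer devices. The speech recognition method provided by this application is a further improvement on existing speech recognition technology and can produce a series of beneficial effects, such as improving the accuracy of speech recognition.
The implementing terminal of this application may be any device with computing, processing, and communication functions. The device may be connected to external devices to receive or send information. It may be a portable mobile device, such as a smartphone, tablet computer, notebook computer, or PDA (Personal Digital Assistant); a fixed device, such as a computer device, field terminal, desktop computer, server, or workstation; or a collection of multiple devices, such as a server cluster or the physical infrastructure of cloud computing.
Optionally, the implementing terminal of this application may be a server, a server cluster, or the physical infrastructure of cloud computing.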
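FIG. 1 is a schematic diagram of an application scenario of a speech recognition method according to an exemplary embodiment. As shown in FIG. 1, the scenario includes a server 100, a first user terminal 110, and a second user terminal 120. Each user terminal is connected to the server 100 through a communication link that can be used to send and receive data. In this embodiment, the server 100 is the implementing terminal of this application. When the user of a user terminal needs to enter speech and have it converted into the corresponding text information, the user can enter speech data through a speech input device (such as a microphone) that is built into or connected to the user terminal. After receiving the speech data, the user terminal can send it to the server 100, which performs the speech recognition task. When the task is complete, the server 100 can return the speech recognition result, that is, the text information corresponding to the speech data entered by the user, to the user terminal that sent the speech data. Specifically, a trained speech recognition model may be embedded in the server 100. When the server 100 receives speech data sent by a user through a user terminal, it can input the speech data into the speech recognition model, which outputs a preliminary recognition result; the server 100 can then perform further recognition on the basis of the preliminary recognition result to obtain the final recognition result. In addition, the ellipsis in FIG. 1 indicates that the number of user terminals that establish a communication link with the server 100 and can send speech data to it is not fixed and may be larger or smaller.
It is worth noting that FIG. 1 is only one embodiment of this application. Although in the embodiment of FIG. 1 the implementing terminal is a server, and the direct acquisition of speech data and the speech recognition are carried out on two different terminals, in other embodiments or specific applications any of the aforementioned types of terminal may be chosen as the implementing terminal as needed, and the two tasks of directly acquiring speech data and performing speech recognition may also be executed on the same terminal. This application places no limitation on this, and the scope of protection of this application should not be limited in any way as a result.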
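FIG. 2 is a flowchart of a speech recognition method according to an exemplary embodiment. This embodiment may be executed by the server in the embodiment of FIG. 1. As shown in FIG. 2, the method includes the following steps:
Step 210: Input the acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result, in text form, output by the speech recognition model.
The audio data may be acquired as an audio stream or as an audio file. The audio file may be in any format, including but not limited to .WAV and .MP3. The audio data may be data received and processed directly by the local end, or data generated by a terminal other than the local end.
In one embodiment, before step 210, the method may further include: receiving audio data sent from a target terminal, so as to acquire the audio data.
In one embodiment, calling a configured speech recognition interface invokes the pre-established speech recognition model. The acquired audio data is input into the pre-established speech recognition model by calling the configured speech recognition interface, and the recognition result returned by the interface is taken as the preliminary recognition result, in text form, output by the speech recognition model.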
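The pre-established speech recognition model may be any type of trained speech recognition model. It may include multiple sub-models or components and may also be called a speech recognition system. For example, the speech recognition model may be a traditional speech recognition model comprising an acoustic model, a language model, a decoder, and other parts, or it may be an end-to-end speech recognition model. Acoustic models in traditional speech recognition models include but are not limited to the GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) model, the DNN-HMM (Deep Neural Network-Hidden Markov Model) model, and the DFSMN (Deep Feedforward Sequential Memory Network) model; language models include but are not limited to n-gram and Transformer models. End-to-end speech recognition models include but are not limited to wav2letter++ framework models and LSTM-CTC (Long Short-Term Memory-Connectionist Temporal Classification) models.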
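In one embodiment, the speech recognition process based on the GMM-HMM speech recognition model is as follows: perform voice activity detection (VAD) on the speech data and cut off the silent portions at its beginning and end; apply pre-emphasis to the speech data to boost its high-frequency components; apply windowing to the speech data to reduce edge effects at the ends of the speech; divide the speech data into frames; extract features from each frame to obtain a feature matrix of size (acoustic feature dimension) × N, where N is the total number of frames and the feature types include but are not limited to LPC (linear predictive coding) and MFCC (Mel-frequency cepstral coefficients); input the feature matrix into the GMM-HMM acoustic model, so that the acoustic model can compute the acoustic model score of each phoneme sequence from the conditional probabilities of the phonemes and states of the frames, where the GMM models the distribution of the speech features and the HMM models the temporal order of the sequence; use the language model to obtain the language model score of the word combination corresponding to each phoneme sequence; and combine the acoustic model score and the language model score of the phoneme sequences to determine the output phoneme sequence, then use a dictionary to obtain the text corresponding to the output phoneme sequence, which yields the speech recognition result.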
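The front-end stages of this pipeline (pre-emphasis, framing, and windowing) can be illustrated with a minimal Python sketch. The function names, the pre-emphasis coefficient 0.97, and the choice of a Hamming window are assumptions made for illustration; the source does not specify them. The sketch applies the window to each frame after framing, which is a common arrangement, and it omits VAD and feature extraction.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def split_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames of frame_len samples, advancing by hop samples."""
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Apply a Hamming window to one frame to soften its edges (assumed window type)."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i, x in enumerate(frame)]

# Hypothetical usage on a short synthetic signal.
samples = [math.sin(0.1 * t) for t in range(400)]
frames = [hamming(f) for f in split_frames(pre_emphasis(samples), frame_len=160, hop=80)]
```

Each windowed frame would then go to a feature extractor (for example MFCC) to form one column of the feature matrix described above.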
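The preliminary recognition result includes but is not limited to symbols such as text and digits. The text may be in any language, and the types of text include but are not limited to Chinese characters and English words. For Chinese, phonemes may be initials, finals, and so on; for English, phonemes may be one or more phonemes from the 39-phoneme set provided by Carnegie Mellon University.
That the preliminary recognition result output by the speech recognition model is in text form means that it is output and recorded as text, usually as a text file. Text file formats include but are not limited to .doc, .txt, JSON, XML, and HTML.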
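Step 220: Determine the audio segment in the audio data that corresponds to each text unit in the preliminary recognition result.
A text unit is a preset basic language unit at the text level. For example, for Chinese a text unit may be a single character or a word, and for English a text unit may be a word. The speech recognition model itself can be used to determine the audio segment in the audio data that corresponds to a text unit in the preliminary recognition result. For example, in a GMM-HMM model each frame of speech data corresponds to a state, states correspond to phonemes, and phonemes correspond to text; following this chain of correspondences, the frames of speech data corresponding to a text unit in the preliminary recognition result can be identified in the audio data, which gives the corresponding audio segment.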
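Step 230: For each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replace the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result corresponding to the preliminary recognition result.
In one embodiment, the predetermined condition indicates that the sound pressure of the audio segment corresponding to the text unit is low.
A placeholder may be any type of symbol or combination of symbols, for example &, %, or #. Sound pressure is the value on the vertical axis of the waveform of the sound signal corresponding to an audio segment and can be used to measure the loudness of the audio signal. Loudness is generally positively correlated with volume, so a low sound pressure in the audio segment corresponding to a text unit means that the text unit was recognized from audio data of relatively low volume.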
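In one embodiment, the predetermined condition is that the maximum sound pressure of the audio segment corresponding to the text unit is lower than a preset sound pressure average threshold. The maximum sound pressure of an audio segment is the amplitude of the sound pressure within the segment. The advantage of this embodiment is that this condition raises the bar for replacing a text unit with a placeholder and reduces the number of text units in the preliminary recognition result that are replaced, so that the preliminary recognition result retains more of the original recognition information.
In one embodiment, the predetermined condition is that the minimum sound pressure of the audio segment corresponding to the text unit is lower than the preset sound pressure average threshold. The minimum sound pressure of an audio segment is generally a very small value, and a text unit is replaced with a placeholder as soon as the minimum sound pressure of its audio segment falls below the threshold. The advantage of this embodiment is that this lowers the bar for replacement and increases the number of text units that are replaced, so that the text units retained in the preliminary recognition result all correspond to audio segments with sufficiently high sound pressure, which can improve recognition precision to some extent.
In one embodiment, the predetermined condition is that the average sound pressure of the audio segment corresponding to the text unit is lower than the preset sound pressure average threshold. The average sound pressure of an audio segment reflects the central tendency of the sound pressure within the segment. The advantage of this embodiment is that using the average as the criterion for replacement strikes a balance between the number of text units retained in the preliminary recognition result and the sound pressure of the audio segments corresponding to the retained text units.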
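In one embodiment, the integral of the sound pressure of the audio segment corresponding to the text unit is computed, and the ratio of the integral to the length of the integration interval is taken as the average sound pressure of that audio segment. Comparing this average with the preset sound pressure average threshold determines whether the sound pressure of the audio segment satisfies the predetermined condition.
In one embodiment, the predetermined condition is applied as follows: a predetermined number of frames are taken arbitrarily from the audio segment corresponding to the text unit; if the average sound pressure of the selected frames is lower than the preset sound pressure average threshold, the text unit in the preliminary recognition result is replaced with a placeholder. The advantage of this embodiment is that sampling indirectly reflects the sound pressure distribution of the whole audio segment, so deciding from a sample of frames whether a text unit should be replaced can reduce the amount of computation to some extent.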
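Step 230 with the average-based condition can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the `&` placeholder, and the threshold value are hypothetical, and the sketch averages the absolute sample values, because the plain mean of a signed waveform is close to zero. The source does not say how the average is taken.

```python
from statistics import mean

PLACEHOLDER = "&"  # assumed placeholder symbol; the source allows any symbol

def mean_abs_pressure(samples):
    """Average of absolute sample values (assumption: absolute amplitude stands in for sound pressure)."""
    return mean(abs(s) for s in samples) if samples else 0.0

def to_intermediate(units, segments, threshold):
    """Replace each text unit whose audio segment is too quiet with the placeholder."""
    result = []
    for unit, segment in zip(units, segments):
        quiet = mean_abs_pressure(segment) < threshold
        result.append(PLACEHOLDER if quiet else unit)
    return result

# Hypothetical usage: the middle unit was spoken very quietly.
units = ["我", "爱", "你"]
segments = [[0.5, -0.6], [0.01, -0.02], [0.4, -0.5]]
intermediate = to_intermediate(units, segments, threshold=0.1)
```

Swapping `mean_abs_pressure` for a maximum or minimum over the segment would give the stricter or looser variants described in the other embodiments.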
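In one embodiment, after step 230, the method may further include: merging multiple consecutive placeholders in the preliminary recognition result into one placeholder.
In one embodiment, the merged placeholder is one of the placeholders that were merged.
In one embodiment, the merged placeholder differs from each of the placeholders that were merged.
In one embodiment, merging multiple consecutive placeholders in the preliminary recognition result into one placeholder includes: starting from the first placeholder in the preliminary recognition result, for each placeholder, determining whether the character following it is a placeholder; and if so, merging the placeholder and the placeholder following it into one placeholder.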
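Merging consecutive placeholders can be sketched in a few lines of Python. The function name and the `&` placeholder are assumptions for illustration, and the sketch covers only the variant in which the merged placeholder is the same symbol as the ones it replaces.

```python
def merge_placeholders(units, placeholder="&"):
    """Collapse every run of consecutive placeholders into a single placeholder."""
    merged = []
    for unit in units:
        # Skip a placeholder that directly follows another placeholder.
        if unit == placeholder and merged and merged[-1] == placeholder:
            continue
        merged.append(unit)
    return merged

# Hypothetical usage on a character-level intermediate result.
print("".join(merge_placeholders(list("我&&&你&"))))
```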
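Step 240: Determine the matching degree between the intermediate result and each standard text sequence in the preset standard text library.
The basic elements of a standard text sequence may be symbols such as text and digits. A standard text sequence may be a phrase, a sentence, or a paragraph.
In one embodiment, step 240 may include: for each standard text sequence, obtaining the ratio of the number of text units contained in both the standard text sequence and the intermediate result to the number of all text units contained in the intermediate result, as the matching degree between the intermediate result and the standard text sequence.
The more text units a standard text sequence and the intermediate result have in common, the more similar they are. The ratio of the number of text units contained in both the standard text sequence and the intermediate result to the number of all text units contained in the intermediate result can therefore be used as the matching degree between the intermediate result and the standard text sequence.
In one embodiment, step 240 may include: using a preset dictionary to build a vector for the intermediate result and for each standard text sequence in the preset standard text library; and, for each standard text sequence, taking the Euclidean distance between the vector of the standard text sequence and the vector of the intermediate result as the matching degree between the standard text sequence and the intermediate result. The dictionary records the vector element value corresponding to each word, and semantically similar words have similar vector element values. For example, if the intermediate result is "我&爱&你" ("I & love & you"), the vector generated for it may be (35, 450, 37); if a standard text sequence in the standard text library is "我喜欢你" ("I like you"), the vector generated for it may be (35, 452, 37). The similarity between the intermediate result and the standard text sequence can then be obtained by computing the Euclidean distance between the two vectors.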
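The Euclidean distance for the two example vectors can be checked with a short Python snippet. The vectors are the ones given in the text; how a dictionary would produce them is not specified in the source and is not modelled here. A smaller distance means the two vectors are closer.

```python
import math

intermediate_vec = (35, 450, 37)  # vector given in the text for "我&爱&你"
standard_vec = (35, 452, 37)      # vector given in the text for "我喜欢你"

# Only the second component differs (by 2), so the distance is sqrt(0 + 4 + 0).
distance = math.dist(intermediate_vec, standard_vec)
print(distance)  # 2.0
```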
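Step 250: Based on the matching degree, select a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.
The target standard text sequence is the standard text sequence, among those in the standard text library, that is selected as the final recognition result; in this sense, the target standard text sequence and the final recognition result are the same.
In one embodiment, selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result based on the matching degree includes: obtaining from the standard text library the standard text sequence with the highest matching degree as the target standard text sequence, and using the target standard text sequence as the final recognition result.
In one embodiment, selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result based on the matching degree includes: obtaining from the standard text library the standard text sequences whose matching degree is greater than a predetermined matching degree threshold, as candidate standard text sequences; and arbitrarily choosing one of the candidate standard text sequences as the target standard text sequence and using it as the final recognition result.
In some cases the matching degree cannot fully and objectively measure whether a standard text sequence should be selected as the target standard text sequence, that is, the final recognition result. In particular, when several standard text sequences all match the intermediate result sufficiently well, one of them with a lower matching degree may be better suited to be the final recognition result. The advantage of this embodiment is therefore that every standard text sequence whose matching degree with the intermediate result is sufficiently high has the same chance of being selected as the final recognition result, which improves the fairness of recognition.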
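The two selection strategies for step 250 can be sketched in Python as below. The function names and the dictionary-of-scores representation are assumptions for illustration; the sketch treats a larger matching degree as a better match, as the two embodiments do.

```python
import random

def select_best(scores):
    """Return the standard text sequence with the highest matching degree."""
    return max(scores, key=scores.get)

def select_any_above(scores, threshold, rng=random):
    """Pick one sequence at random among those whose matching degree exceeds the threshold."""
    candidates = [seq for seq, score in scores.items() if score > threshold]
    return rng.choice(candidates) if candidates else None

# Hypothetical matching degrees for three standard text sequences.
scores = {"我爱你": 0.9, "我喜欢你": 0.8, "你好": 0.2}
best = select_best(scores)
picked = select_any_above(scores, threshold=0.5)
```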
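In one embodiment, the audio data is question audio data, the standard text library is a standard question library, the standard text sequences are standard questions, and each standard question corresponds to a standard answer. Referring to FIG. 3, after step 250 the method may further include: Step 260: Obtain the standard answer corresponding to the final recognition result.
In one embodiment, standard questions and their corresponding standard answers are stored together in the standard question library. By querying the standard question library, the standard answer stored with the final recognition result (that is, the selected standard question) is obtained as the standard answer corresponding to the final recognition result.
In one embodiment, a database of correspondences between standard questions and standard answers is provided in advance, in which the identifiers of standard questions and the corresponding standard answers are stored together, and the standard text library further includes an identifier corresponding to each standard question. Obtaining the standard answer corresponding to the final recognition result includes: obtaining from the standard text library the identifier corresponding to the final recognition result; and obtaining from the correspondence database the standard answer stored with that identifier, as the standard answer corresponding to the final recognition result.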
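Step 270: Output the standard answer.
The local end may output the standard answer in any manner.
In one embodiment, the acquired audio data is audio data received by the local end from a target terminal, and outputting the standard answer includes: sending the standard answer to the target terminal so that the target terminal can display the standard answer.
In one embodiment, the local end has a display screen, and outputting the standard answer includes: printing the standard answer on the display screen of the local end.
In one embodiment, the local end has a display unit, and outputting the standard answer includes: pushing a pop-up window containing the standard answer to the display unit of the local end.
In one embodiment, outputting the standard answer includes: sending the standard answer by email to a preset email address.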
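In one embodiment, for the embodiment shown in FIG. 3, the matching degree is a first matching degree, and the standard question library further includes standard audio data corresponding to each standard question. Selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result based on the matching degree includes: selecting candidate standard questions from the standard questions in the standard question library based on the first matching degree; obtaining from the standard text library the standard audio data corresponding to each candidate standard question; determining the matching degree between each standard audio data and the question audio data, as a second matching degree; and selecting a target standard question from the candidate standard questions as the final recognition result according to the second matching degree of the standard audio data corresponding to each candidate standard question.
The advantage of this embodiment is that, after several candidate standard questions are selected according to the first matching degree, the target standard question is selected as the final recognition result according to the second matching degree between the standard audio data and the question audio data. The final recognition result therefore depends on both the first matching degree and the second matching degree, which improves the accuracy of the final recognition result.
In one embodiment, selecting candidate standard questions from the standard questions in the standard question library based on the first matching degree includes: selecting from the standard question library the standard questions whose first matching degree is greater than a preset first matching degree threshold, as candidate standard questions. Selecting a target standard question from the candidate standard questions as the final recognition result according to the second matching degree of the corresponding standard audio data includes: taking the candidate standard question whose standard audio data has the highest second matching degree as the target standard question, and using the target standard question as the final recognition result.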
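The two-stage selection can be sketched in Python as follows. The function name and the score dictionaries are hypothetical, and the sketch follows the embodiment literally: it keeps the questions whose first matching degree exceeds the threshold, then returns the candidate with the highest second matching degree.

```python
def select_target_question(first_scores, second_scores, first_threshold):
    """Filter by the first matching degree, then pick the candidate with the highest second matching degree."""
    candidates = [q for q, score in first_scores.items() if score > first_threshold]
    if not candidates:
        return None  # no standard question passed the text-level filter
    return max(candidates, key=lambda q: second_scores[q])

# Hypothetical matching degrees for three standard questions.
first = {"q1": 0.9, "q2": 0.7, "q3": 0.3}   # text-level (first) matching degree
second = {"q1": 0.4, "q2": 0.8, "q3": 0.9}  # audio-level (second) matching degree
target = select_target_question(first, second, first_threshold=0.5)
```

In this example `q3` has the highest second matching degree but is never considered, because it fails the first-stage threshold.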
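In one embodiment, determining the matching degree between each standard audio data and the question audio data as the second matching degree includes: dividing the standard audio data and the question audio data into frames, respectively; extracting the feature vector of each frame of audio data; constructing the feature matrix of the question audio data and of each standard audio data from the feature vectors of their frames; and, for each standard audio data, determining the similarity between its feature matrix and the feature matrix of the question audio data as the matching degree between the standard audio data and the question audio data.
In one embodiment, before the standard audio data and the question audio data are each divided into frames, the method further includes: scaling the standard audio data in the time dimension to the same length as the question audio data. Dividing the standard audio data and the question audio data into frames then includes: dividing the question audio data and the scaled standard audio data into frames, respectively.
The advantage of this embodiment is that scaling the standard audio data in the time dimension gives the scaled standard audio data and the question audio data the same length, so that the resulting feature matrices of the standard audio data and the question audio data have the same size, which makes it easy to compute the similarity of the feature matrices.
In one embodiment, the feature vector extracted for each frame of audio data is a vector composed of MFCC features.
In one embodiment, because the question audio data has already been divided into frames before the preliminary recognition result is obtained, the frames into which the question audio data was divided can be obtained directly.
In one embodiment, constructing the feature matrix of the question audio data and of each standard audio data from the feature vectors of their frames includes: for the question audio data or each standard audio data, arranging the feature vectors of its frames in frame order to obtain the feature matrix of that audio data. Determining, for each standard audio data, the similarity between its feature matrix and the feature matrix of the question audio data as the matching degree includes: flattening the feature matrix of each standard audio data and the feature matrix of the question audio data into one-dimensional vectors; and, for each standard audio data, taking the Euclidean distance between the one-dimensional vector of its feature matrix and the one-dimensional vector of the feature matrix of the question audio data as the similarity between the two feature matrices, and using that similarity as the matching degree between the standard audio data and the question audio data.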
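Flattening two feature matrices and measuring the Euclidean distance between them can be sketched in plain Python. The function names are hypothetical, the matrices are stored as lists of per-frame feature vectors, and the tiny 2 × 2 matrices stand in for real MFCC features. The sketch assumes both matrices already have the same size, which is what the time scaling described above is meant to ensure.

```python
import math

def flatten(matrix):
    """Concatenate the per-frame feature vectors, in frame order, into one vector."""
    return [value for frame in matrix for value in frame]

def matrix_distance(standard_matrix, question_matrix):
    """Euclidean distance between two equally sized feature matrices after flattening."""
    a, b = flatten(standard_matrix), flatten(question_matrix)
    if len(a) != len(b):
        raise ValueError("feature matrices must have the same size")
    return math.dist(a, b)

# Hypothetical 2-frame, 2-feature matrices.
standard = [[1.0, 2.0], [3.0, 4.0]]
question = [[1.0, 2.0], [3.0, 7.0]]
print(matrix_distance(standard, question))  # 3.0
```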
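In one embodiment, determining the matching degree between each standard audio data and the question audio data as the second matching degree includes: scaling each standard audio data in the time dimension to the same length as the question audio data; selecting a predetermined number of equidistant time points within that time length; obtaining the sound pressure values of each standard audio data and of the question audio data at the selected time points and, for each standard audio data or the question audio data, forming a vector from its sound pressure values at those time points; and, for each standard audio data, taking the Euclidean distance between the vector of the standard audio data and the vector of the question audio data as the second matching degree between the standard audio data and the question audio data.
For example, if the time length is 200 ms and the predetermined number is 9, the 9 equidistant time points within the time length are determined as follows. First, the spacing of the time points is determined as 200 ms / (9 + 1) = 20 ms. Then, starting from the beginning of the time length, a time point is selected every 20 ms until the selected time point is the end of the time length. All the time points lying between the beginning and the end of the time length are taken as the predetermined number of equidistant time points selected within the time length.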
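The time-point rule in this example can be written as a small Python function. The function name is an assumption; the arithmetic follows the text: the spacing is the duration divided by the number of points plus one, and the two end points are excluded.

```python
def equidistant_points(duration_ms, count):
    """Return `count` equally spaced time points strictly inside (0, duration_ms)."""
    step = duration_ms / (count + 1)
    return [step * k for k in range(1, count + 1)]

points = equidistant_points(200, 9)
print(points)  # [20.0, 40.0, 60.0, 80.0, 100.0, 120.0, 140.0, 160.0, 180.0]
```

Sampling the sound pressure of each audio signal at these points gives the vectors whose Euclidean distance is used as the second matching degree.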
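In summary, the speech recognition method provided in the embodiment of FIG. 2 inputs audio data into the speech recognition model to obtain a preliminary recognition result, replaces the text units in the preliminary recognition result whose corresponding audio segments have a sound pressure satisfying the predetermined condition with placeholders to obtain an intermediate result, and uses the intermediate result for final recognition. This reduces the likelihood that text units whose sound pressure does not meet certain requirements make the final recognition result inaccurate, and thereby improves the accuracy of speech recognition.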
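FIG. 4 is a detailed flowchart of step 240 in an embodiment based on the embodiment corresponding to FIG. 2. As shown in FIG. 4, it includes the following steps:
Step 241: For each standard text sequence, obtain the ratio of the number of text units contained in both the standard text sequence and the intermediate result to the number of all text units contained in the intermediate result, as a first ratio.
If a text unit in a standard text sequence also exists in the intermediate result, that text unit is a text unit contained in both the standard text sequence and the intermediate result.
Step 242: For each standard text sequence, determine, in the intermediate result, the placeholders located between the text units contained in both the standard text sequence and the intermediate result, as target placeholders.
In the intermediate result, the text units whose corresponding audio segments have a sound pressure satisfying the predetermined condition have already been replaced with placeholders. Placeholders may therefore also exist between the text units that the standard text sequence and the intermediate result have in common.
Step 243: For each standard text sequence and each target placeholder, obtain the two text units immediately before and after the target placeholder in the intermediate result, and determine whether a placeholder exists between the two text units in the standard text sequence that are the same as those two text units.
The target placeholders are determined from the text units contained in both the intermediate result and the standard text sequence, so the two text units before and after a target placeholder in the intermediate result also appear in the corresponding standard text sequence, and a placeholder may exist between those matching text units.
Step 244: If so, mark the placeholder as a corresponding placeholder.
Step 245: For each standard text sequence, obtain the ratio of the number of corresponding placeholders determined for the standard text sequence to the number of target placeholders, as a second ratio.
Corresponding placeholders are selected from among the target placeholders, so the number of corresponding placeholders is generally smaller than the number of target placeholders.
Step 246: Determine the matching degree between the intermediate result and each standard text sequence in the preset standard text library based on the first ratio and the second ratio obtained for each standard text sequence.
In one embodiment, for each standard text sequence, the weighted sum of the first ratio and the second ratio obtained for that standard text sequence is taken as the matching degree between the intermediate result and the standard text sequence.
In summary, the advantage of the embodiment shown in FIG. 4 is that it combines two indicators, the ratio of text unit counts and the ratio of corresponding placeholder counts, to determine the matching degree between the intermediate result and the standard text sequence, which improves the accuracy of the determined matching degree to some extent.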
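Steps 241 to 246 can be sketched in Python as below, under several loudly stated assumptions. The source does not say what counts as a placeholder in a standard text sequence, so this sketch assumes it means that at least one text unit lies between the two neighbouring units in the standard sequence. It also assumes that placeholders are not counted as text units in the first ratio, that text units are single characters, that the weights are 0.5 each, and that the second ratio is 0 when there are no target placeholders. All function names are hypothetical.

```python
PLACEHOLDER = "&"  # assumed placeholder symbol

def first_ratio(intermediate, standard):
    """Step 241: shared text units divided by all text units of the intermediate result."""
    units = [u for u in intermediate if u != PLACEHOLDER]
    if not units:
        return 0.0
    return sum(1 for u in units if u in standard) / len(units)

def second_ratio(intermediate, standard):
    """Steps 242-245: corresponding placeholders divided by target placeholders."""
    targets = corresponding = 0
    for i in range(1, len(intermediate) - 1):
        if intermediate[i] != PLACEHOLDER:
            continue
        before, after = intermediate[i - 1], intermediate[i + 1]
        if PLACEHOLDER in (before, after):
            continue  # neighbours must be real text units
        if before not in standard or after not in standard:
            continue  # neighbours must be shared with the standard sequence
        targets += 1  # step 242: this is a target placeholder
        start = standard.index(before)
        try:
            end = standard.index(after, start + 1)
        except ValueError:
            continue
        if end - start >= 2:  # assumed reading of step 243: a gap exists in the standard sequence
            corresponding += 1  # step 244
    return corresponding / targets if targets else 0.0

def match_score(intermediate, standard, w1=0.5, w2=0.5):
    """Step 246: weighted sum of the two ratios (weights are assumed values)."""
    return w1 * first_ratio(intermediate, standard) + w2 * second_ratio(intermediate, standard)

print(match_score(list("我&你"), list("我爱你")))  # 1.0
print(match_score(list("我&你"), list("我你好")))  # 0.5
```

In the first call the placeholder sits where the standard sequence has an extra unit, so both ratios are 1. In the second call the two shared units are adjacent in the standard sequence, so the second ratio is 0 and only the first ratio contributes.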
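This application also provides a speech recognition apparatus. The following are apparatus embodiments of this application.
FIG. 5 is a block diagram of a speech recognition apparatus according to an exemplary embodiment. As shown in FIG. 5, the apparatus 500 includes:
an input module 510, configured to input acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result, in text form, output by the speech recognition model;
a first determining module 520, configured to determine the audio segment in the audio data that corresponds to each text unit in the preliminary recognition result;
a replacement module 530, configured to, for each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replace the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result corresponding to the preliminary recognition result;
a second determining module 540, configured to determine the matching degree between the intermediate result and each standard text sequence in the preset standard text library; and
a recognition module 550, configured to select, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.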
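According to a third aspect of this application, a computing device is also provided that performs all or some of the steps of any of the speech recognition methods shown above. The computing device includes:
at least one processor; and
a memory communicatively connected with the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the speech recognition method shown in any of the above exemplary embodiments.
Those skilled in the art will understand that various aspects of this application can be implemented as a system, a method, or a program product. Accordingly, various aspects of this application can take the form of an entirely hardware implementation, an entirely software implementation (including firmware, microcode, and the like), or an implementation combining hardware and software aspects, which may collectively be referred to here as a "circuit", "module", or "system".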
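The computing device 600 according to this embodiment of this application is described below with reference to FIG. 6. The computing device 600 shown in FIG. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of this application.
As shown in FIG. 6, the computing device 600 takes the form of a general-purpose computing device. The components of the computing device 600 may include, but are not limited to: the aforementioned at least one processing unit 610, the aforementioned at least one storage unit 620, and a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610).
The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to the various exemplary embodiments of this application described in the "Methods of Embodiments" section of this specification.
The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 621 and/or a cache storage unit 622, and may further include a read-only storage unit (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set of (at least one) program modules 625. Such program modules 625 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.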
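The computing device 600 may also communicate with one or more external devices 800 (such as keyboards, pointing devices, and Bluetooth devices), with one or more devices that enable a user to interact with the computing device 600, and/or with any device (such as a router or modem) that enables the computing device 600 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 650. The computing device 600 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660. As shown in the figure, the network adapter 660 communicates with the other modules of the computing device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computing device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of this application can therefore be embodied as a software product, which can be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, server, terminal device, or network device) to execute the method according to the embodiments of this application.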
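According to a fourth aspect of this application, a computer non-volatile readable storage medium is also provided, on which is stored a program product capable of implementing the method described above in this specification. In some possible implementations, various aspects of this application can also be implemented in the form of a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to the various exemplary embodiments of this application described in the "Exemplary Method" section of this specification.
Referring to FIG. 7, a computer non-volatile readable storage medium 700 for implementing the above method according to an embodiment of this application is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code and can run on a terminal device such as a personal computer. However, the program product of this application is not limited to this. In this document, the computer non-volatile readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.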
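The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on a readable medium can be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.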
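The program code for performing the operations of this application can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above drawings are only schematic illustrations of the processing included in the methods according to the exemplary embodiments of this application and are not intended to be limiting. It is easy to understand that the processing shown in the drawings does not indicate or limit the chronological order of that processing. It is also easy to understand that the processing may be executed synchronously or asynchronously, for example in multiple modules.
It should be understood that this application is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.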

Claims (20)

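  1. A speech recognition method, comprising:
    inputting acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result, in text form, output by the speech recognition model;
    determining the audio segment in the audio data that corresponds to each text unit in the preliminary recognition result;
    for each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replacing the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result corresponding to the preliminary recognition result;
    determining the matching degree between the intermediate result and each standard text sequence in a preset standard text library; and
    based on the matching degree, selecting a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.
  2. The method according to claim 1, wherein the audio data is question audio data, the standard text library is a standard question library, the standard text sequences are standard questions, each standard question corresponds to a standard answer, and after selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result, the method further comprises:
    obtaining the standard answer corresponding to the final recognition result; and
    outputting the standard answer.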
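  3. The method according to claim 2, wherein the matching degree is a first matching degree, the standard question library further includes standard audio data corresponding to each standard question, and selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result comprises:
    selecting candidate standard questions from the standard questions in the standard question library based on the first matching degree;
    obtaining from the standard text library the standard audio data corresponding to each candidate standard question;
    determining the matching degree between each standard audio data and the question audio data, as a second matching degree; and
    selecting a target standard question from the candidate standard questions as the final recognition result according to the second matching degree of the standard audio data corresponding to each candidate standard question.
  4. The method according to claim 2, wherein the matching degree is a first matching degree, the standard question library further includes standard audio data corresponding to each standard question, and selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result comprises:
    selecting candidate standard questions from the standard questions in the standard question library based on the first matching degree;
    obtaining from the standard text library the standard audio data corresponding to each candidate standard question;
    determining the matching degree between each standard audio data and the question audio data, as a second matching degree; and
    selecting a target standard question from the candidate standard questions as the final recognition result according to the second matching degree of the standard audio data corresponding to each candidate standard question.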
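  5. The method according to claim 1, wherein the predetermined condition is that the average sound pressure of the audio segment corresponding to the text unit is lower than a preset sound pressure average threshold.
  6. The method according to claim 1, wherein determining the matching degree between the intermediate result and each standard text sequence in the preset standard text library comprises:
    for each standard text sequence, obtaining the ratio of the number of text units contained in both the standard text sequence and the intermediate result to the number of all text units contained in the intermediate result, as a first ratio;
    for each standard text sequence, determining, in the intermediate result, the placeholders located between the text units contained in both the standard text sequence and the intermediate result, as target placeholders;
    for each standard text sequence and each target placeholder, obtaining the two text units immediately before and after the target placeholder in the intermediate result, and determining whether a placeholder exists between the two text units in the standard text sequence that are the same as those two text units;
    if so, marking the placeholder as a corresponding placeholder;
    for each standard text sequence, obtaining the ratio of the number of corresponding placeholders determined for the standard text sequence to the number of target placeholders, as a second ratio; and
    determining the matching degree between the intermediate result and each standard text sequence in the preset standard text library based on the first ratio and the second ratio obtained for each standard text sequence.
  7. The method according to claim 1, wherein selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result comprises:
    obtaining from the standard text library the standard text sequence with the highest matching degree as the target standard text sequence, and using the target standard text sequence as the final recognition result.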
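  8. A speech recognition apparatus, comprising:
    an input module, configured to input acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result, in text form, output by the speech recognition model;
    a first determining module, configured to determine the audio segment in the audio data that corresponds to each text unit in the preliminary recognition result;
    a replacement module, configured to, for each text unit in the preliminary recognition result, if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, replace the text unit in the preliminary recognition result with a placeholder to obtain an intermediate result corresponding to the preliminary recognition result;
    a second determining module, configured to determine the matching degree between the intermediate result and each standard text sequence in a preset standard text library; and
    a recognition module, configured to select, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.
  9. The apparatus according to claim 8, wherein the audio data is question audio data, the standard text library is a standard question library, the standard text sequences are standard questions, each standard question corresponds to a standard answer, and the recognition module is further configured to, after selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result:
    obtain the standard answer corresponding to the final recognition result; and
    output the standard answer.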
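  10. The apparatus according to claim 9, wherein the matching degree is a first matching degree, the standard question library further includes standard audio data corresponding to each standard question, and selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result comprises:
    selecting candidate standard questions from the standard questions in the standard question library based on the first matching degree;
    obtaining from the standard text library the standard audio data corresponding to each candidate standard question;
    determining the matching degree between each standard audio data and the question audio data, as a second matching degree; and
    selecting a target standard question from the candidate standard questions as the final recognition result according to the second matching degree of the standard audio data corresponding to each candidate standard question.
  11. The apparatus according to claim 10, wherein determining the matching degree between each standard audio data and the question audio data as the second matching degree comprises:
    dividing the standard audio data and the question audio data into frames, respectively;
    extracting the feature vector of each frame of audio data;
    constructing the feature matrix of the question audio data and of each standard audio data from the feature vectors of the frames of the standard audio data and the question audio data; and
    for each standard audio data, determining the similarity between the feature matrix of the standard audio data and the feature matrix of the question audio data, as the matching degree between the standard audio data and the question audio data.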
  12. 根据权利要求8所述的装置,其中,所述预定条件为:与文本单位对应的音频片段的声压的平均值低于预设声压平均值阈值。
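  13. The apparatus according to claim 8, wherein determining the matching degree between the intermediate result and each standard text sequence in the preset standard text library comprises:
    for each standard text sequence, obtaining the ratio of the number of text units contained in both the standard text sequence and the intermediate result to the number of all text units contained in the intermediate result, as a first ratio;
    for each standard text sequence, determining, in the intermediate result, the placeholders located between the text units contained in both the standard text sequence and the intermediate result, as target placeholders;
    for each standard text sequence and each target placeholder, obtaining the two text units before and after the target placeholder in the intermediate result, and determining whether a placeholder exists in the standard text sequence between the two text units that are identical to those two text units;
    if so, marking the placeholder as a corresponding placeholder;
    for each standard text sequence, obtaining the ratio of the number of corresponding placeholders determined for the standard text sequence to the number of target placeholders, as a second ratio;
    determining, based on the first ratio and the second ratio obtained for each standard text sequence, the matching degree between the intermediate result and each standard text sequence in the preset standard text library.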
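  14. The apparatus according to claim 8, wherein selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result comprises:
    obtaining, from the standard text library, the standard text sequence with the highest matching degree as the target standard text sequence, and using the target standard text sequence as the final recognition result.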
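  15. A computing device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform:
    inputting acquired audio data into a pre-established speech recognition model to obtain a preliminary recognition result in text form output by the speech recognition model;
    determining, for each text unit in the preliminary recognition result, the corresponding audio segment in the audio data;
    for each text unit in the preliminary recognition result, replacing the text unit in the preliminary recognition result with a placeholder if the sound pressure of the audio segment corresponding to the text unit satisfies a predetermined condition, to obtain an intermediate result corresponding to the preliminary recognition result;
    determining the matching degree between the intermediate result and each standard text sequence in a preset standard text library;
    selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result.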
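  16. The computing device according to claim 15, wherein the audio data is question audio data, the standard text library is a standard question library, the standard text sequence is a standard question, each standard question corresponds to a standard answer, and, after selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result, the computer-readable instructions, when executed by the processor, further cause the processor to perform:
    obtaining the standard answer corresponding to the final recognition result;
    outputting the standard answer.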
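  17. The computing device according to claim 16, wherein the matching degree is a first matching degree, the standard question library further includes standard audio data corresponding to each standard question, and selecting, based on the matching degree, a target standard text sequence from the standard text sequences in the standard text library as the final recognition result comprises:
    selecting candidate standard questions from the standard questions in the standard question library based on the first matching degree;
    obtaining, from the standard text library, the standard audio data corresponding to each candidate standard question;
    determining the matching degree between each piece of standard audio data and the question audio data, as a second matching degree;
    selecting, according to the second matching degree of the standard audio data corresponding to each candidate standard question, a target standard question from the candidate standard questions as the final recognition result.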
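  18. The computing device according to claim 17, wherein determining the matching degree between each piece of standard audio data and the question audio data, as a second matching degree, comprises:
    dividing the standard audio data and the question audio data into frames respectively;
    extracting a feature vector from each frame of audio data;
    constructing, according to the feature vectors of the frames of the standard audio data and of the question audio data, a feature matrix for the question audio data and for each piece of standard audio data respectively;
    for each piece of standard audio data, determining the similarity between the feature matrix of the standard audio data and the feature matrix of the question audio data, as the matching degree between the standard audio data and the question audio data.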
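  19. The computing device according to claim 15, wherein the predetermined condition is: the average sound pressure of the audio segment corresponding to a text unit is lower than a preset average sound pressure threshold.
  20. A computer non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 7.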
PCT/CN2019/117675 2019-09-17 2019-11-12 Speech recognition method, apparatus, computing device, and storage medium WO2021051564A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910877191.2A CN110503956B (zh) 2019-09-17 2019-09-17 Speech recognition method, apparatus, medium, and electronic device
CN201910877191.2 2019-09-17

Publications (1)

Publication Number Publication Date
WO2021051564A1 (zh) 2021-03-25

Family

ID=68592054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117675 WO2021051564A1 (zh) 2019-09-17 2019-11-12 Speech recognition method, apparatus, computing device, and storage medium

Country Status (2)

Country Link
CN (1) CN110503956B (zh)
WO (1) WO2021051564A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053392A (zh) * 2021-03-26 2021-06-29 京东数字科技控股股份有限公司 Speech recognition method, speech recognition apparatus, electronic device, and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
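CN111552777B (zh) * 2020-04-24 2023-09-26 北京达佳互联信息技术有限公司 Audio recognition method and apparatus, electronic device, and storage medium
CN115881128B (zh) * 2023-02-07 2023-05-02 北京合思信息技术有限公司 Voice behavior interaction method and apparatus based on historical matching degree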

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
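CN104952219A (zh) * 2015-06-18 2015-09-30 惠州Tcl移动通信有限公司 Method for finding articles based on a smart device, and smart device
JP2017016131A (ja) * 2015-06-30 2017-01-19 三星電子株式会社Samsung Electronics Co.,Ltd. Speech recognition apparatus and method, and electronic device
CN108428446A (zh) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Speech recognition method and apparatus
CN109920414A (zh) * 2019-01-17 2019-06-21 平安城市建设科技(深圳)有限公司 Human-machine question answering method, apparatus, device, and storage medium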

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8825478B2 (en) * 2011-01-10 2014-09-02 Nuance Communications, Inc. Real time generation of audio content summaries
US9082414B2 (en) * 2011-09-27 2015-07-14 General Motors Llc Correcting unintelligible synthesized speech
US20150058006A1 (en) * 2013-08-23 2015-02-26 Xerox Corporation Phonetic alignment for user-agent dialogue recognition
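CN110162770B (zh) * 2018-10-22 2023-07-21 腾讯科技(深圳)有限公司 Word expansion method, apparatus, device, and medium
CN110111798B (zh) * 2019-04-29 2023-05-05 平安科技(深圳)有限公司 Speaker recognition method, terminal, and computer-readable storage medium
CN110136687B (zh) * 2019-05-20 2021-06-15 深圳市数字星河科技有限公司 Method for cloning accent and prosody based on voice training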

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
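CN113053392A (zh) * 2021-03-26 2021-06-29 京东数字科技控股股份有限公司 Speech recognition method, speech recognition apparatus, electronic device, and medium
CN113053392B (zh) * 2021-03-26 2024-04-05 京东科技控股股份有限公司 Speech recognition method, speech recognition apparatus, electronic device, and medium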

Also Published As

Publication number Publication date
CN110503956A (zh) 2019-11-26
CN110503956B (zh) 2023-05-12

Similar Documents

Publication Publication Date Title
US11848001B2 (en) Systems and methods for providing non-lexical cues in synthesized speech
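CN107134279B (zh) Voice wake-up method, apparatus, terminal, and storage medium
CN109313896B (zh) Scalable dynamic class language modeling method, system for generating utterance transcriptions, and computer-readable medium
CN109887497B (zh) Modeling method, apparatus, and device for speech recognition
WO2020147404A1 (zh) Text-to-speech synthesis method, apparatus, computer device, and computer non-volatile readable storage medium
US11688391B2 (en) Mandarin and dialect mixed modeling and speech recognition
WO2020043123A1 (zh) Named entity recognition method, named entity recognition apparatus, device, and medium
CN109686383B (zh) Speech analysis method, apparatus, and storage medium
CN110277088B (zh) Intelligent speech recognition method, apparatus, and computer-readable storage medium
WO2021051564A1 (zh) Speech recognition method, apparatus, computing device, and storage medium
WO2021051514A1 (zh) Speech recognition method, apparatus, computer device, and non-volatile storage medium
US11151996B2 (en) Vocal recognition using generally available speech-to-text systems and user-defined vocal training
JP2015187684A (ja) Unsupervised learning method, learning device, and learning program for N-gram language model
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN112669842A (zh) Human-machine dialogue control method, apparatus, computer device, and storage medium
CN111243599A (zh) Speech recognition model construction method, apparatus, medium, and electronic device
CN110852075B (zh) Speech transcription method and apparatus with automatic punctuation, and readable storage medium
CN112346696A (zh) Voice comparison for virtual assistants
CN117043856A (zh) Efficient streaming non-recurrent on-device end-to-end model
Thennattil et al. Phonetic engine for continuous speech in Malayalam
WO2023035529A1 (zh) Intent recognition-based intelligent information query method, apparatus, device, and medium
CN111883133B (zh) Customer service speech recognition method, apparatus, server, and storage medium
CN110895938B (zh) Speech correction system and speech correction method
CN109036379B (zh) Speech recognition method, device, and storage medium
CN113421587B (zh) Speech evaluation method, apparatus, computing device, and storage medium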

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946130

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946130

Country of ref document: EP

Kind code of ref document: A1