CN113516994B - Real-time voice recognition method, device, equipment and medium - Google Patents

Real-time voice recognition method, device, equipment and medium

Info

Publication number
CN113516994B
CN113516994B (application CN202110374258.8A)
Authority
CN
China
Prior art keywords
voice
sentence
word
text
break
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110374258.8A
Other languages
Chinese (zh)
Other versions
CN113516994A (en)
Inventor
刘轶
聂吉昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
Peking University Shenzhen Graduate School
Original Assignee
PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PKU-HKUST SHENZHEN-HONGKONG INSTITUTION and Peking University Shenzhen Graduate School
Priority to CN202110374258.8A
Publication of CN113516994A
Application granted
Publication of CN113516994B
Legal status: Active (anticipated expiration not listed)

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L21/0208: Noise filtering (under G10L21/02, Speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L15/142: Hidden Markov Models [HMMs] (under G10L15/14, Speech classification or search using statistical models)
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/26: Speech to text systems
    • G10L17/00: Speaker identification or verification techniques
    • G10L25/87: Detection of discrete points within a voice signal (under G10L25/78, Detection of presence or absence of voice signals)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a real-time voice recognition method comprising the following steps: acquiring a first voice stream input by a user, and removing noise from the first voice stream in real time through a preset first voiceprint feature to obtain a second voice stream, wherein the first voiceprint feature is a voiceprint feature extracted from a historical voice set of the user; performing endpoint detection on the second voice stream in real time through preset sentence break features, likewise extracted from the user's historical voice set, to judge whether a voice starting point appears; and, when a voice starting point is detected, performing voice recognition on the second voice stream in real time to obtain a voice text. The application further relates to a corresponding real-time voice recognition apparatus, device and storage medium. The method and apparatus can solve the problems of inefficient voice recognition and poorly readable recognition results.

Description

Real-time voice recognition method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular to a real-time speech recognition method, apparatus, device and storage medium.
Background
With the development of science and technology, artificial intelligence technology is being applied ever more widely across many fields. Speech recognition is an important application of artificial intelligence and is used in numerous domains. For example, in customer service question answering, recognizing the user's speech can locate the user's intention, so that personalized service is provided in a targeted manner; or the service attitude and service level of different customer service agents can be judged by recognizing their speech, so that their work can be supervised.
Most existing real-time speech recognition methods acquire the user's entire speech, recognize it as a whole, and output the recognized speech text to the user. When the whole speech is long it contains a large amount of content, which makes recognition inefficient; and because the raw recognition output is returned directly as the result, any errors it contains are passed on, so the readability of the speech recognition result is low.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present application provides a real-time speech recognition method, apparatus and storage medium.
In a first aspect, the present application provides a real-time speech recognition method, including:
acquiring a first voice stream input by a user, and removing noise of the first voice stream in real time through a preset first voiceprint feature to obtain a second voice stream, wherein the first voiceprint feature is a voiceprint feature extracted from a historical voice set of the user;
performing endpoint detection on the second voice stream in real time through preset sentence break features to judge whether a voice starting point appears, wherein the sentence break features are extracted from the historical voice set of the user;
and when the voice starting point is detected, performing voice recognition on the second voice stream in real time to obtain a voice text.
Optionally, before the step of acquiring the first voice stream input by the user, the method includes:
acquiring the historical voice set of a user;
and extracting the first voiceprint feature and the sentence break feature from the historical speech set.
Optionally, the removing noise from the first voice stream according to the first voiceprint feature to obtain a second voice stream includes:
extracting a second voiceprint feature at each moment in the first voice stream;
calculating the similarity between each second voiceprint feature and the first voiceprint feature;
and eliminating from the first voice stream the portions whose second voiceprint features have a similarity smaller than a preset similarity threshold, so as to obtain the second voice stream.
Optionally, the sentence break features include a sentence break duration threshold and habitual sentence break words, and the extracting the sentence break duration threshold from the historical speech set includes:
counting the sentence break duration of each sentence break made by the user in the historical voice set;
calculating the user's average sentence break duration from these durations, and determining the sentence break duration threshold according to the average duration; and,
the extracting the habitual sentence break words from the historical voice set includes:
counting the words before each of the user's sentence breaks in the historical voice set to obtain a sentence break word set;
and calculating the occurrence frequency of each word in the sentence break word set, and taking words whose frequency is greater than a preset frequency threshold as the habitual sentence break words.
Optionally, the performing endpoint detection on the second voice stream in real time through preset sentence break features to determine whether a voice starting point occurs includes:
detecting in real time whether the second voice stream pauses;
when a pause is detected, recording the duration of the pause, and identifying the words appearing in the second voice stream in the unit time before the pause;
comparing the duration with the sentence break duration threshold, and comparing the words with the habitual sentence break words;
and if the duration is greater than the sentence break duration threshold and the word is a habitual sentence break word, determining the voice starting point of the second voice stream once it is no longer paused.
Optionally, the performing voice recognition on the second voice stream to obtain a voice text includes:
performing convolution, pooling and multiple fully-connected processing operations on the second voice stream to obtain voice vectors;
matching the voice vectors according to a preset character vector table to obtain an initial text;
and performing text completion on the initial text to obtain a voice text.
Optionally, the performing text completion on the initial text to obtain a voice text includes:
performing word segmentation on the initial text to obtain text word segments;
selecting a target word segment from the text word segments, and performing semantic relevance detection on the target word segment to obtain a correlation coefficient between the target word segment and the text word segments before and after it;
and, when the correlation coefficient is smaller than a preset correlation coefficient threshold, correcting the target word segment by using a replacement word to obtain the voice text.
Optionally, the correcting the target word segment by using the replacement word to obtain a voice text includes:
acquiring a replaceable word, and calculating the pre-correlation coefficient between the text word segment before the target word segment and the replaceable word;
calculating the post-correlation coefficient between the text word segment after the target word segment and the replaceable word;
judging whether the pre-correlation coefficient and the post-correlation coefficient are both greater than the correlation coefficient threshold;
if at least one of the pre-correlation coefficient and the post-correlation coefficient is less than or equal to the correlation coefficient threshold, returning to the step of acquiring a replaceable word to obtain a new replaceable word;
and if the pre-correlation coefficient and the post-correlation coefficient are both greater than the correlation coefficient threshold, replacing the target word segment with the replaceable word to obtain the voice text.
In a second aspect, the present application provides a real-time speech recognition apparatus, the apparatus comprising:
the noise removal module is used for acquiring a first voice stream input by a user, and removing noise of the first voice stream in real time through a preset first voiceprint feature to obtain a second voice stream, wherein the first voiceprint feature is a voiceprint feature extracted from a historical voice set of the user;
the endpoint identification module is used for carrying out endpoint detection on the second voice stream in real time through preset sentence break characteristics to judge whether a voice starting point appears or not, wherein the sentence break characteristics are sentence break characteristics extracted from a historical voice set of a user;
and the voice recognition module is used for performing voice recognition on the second voice stream in real time to obtain a voice text when the voice starting point is detected.
In a third aspect, a real-time voice recognition device is provided, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the real-time speech recognition method according to any embodiment of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the real-time speech recognition method according to any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the method, the device, the electronic equipment and the computer readable storage medium provided by the embodiment of the application, the noise of the first voice stream of the user can be removed through the first voiceprint feature, the amount of useless information in the first voice stream is reduced, the noise in the first voice stream is accurately removed, the accuracy in subsequent voice recognition is improved, and the efficiency in voice recognition can be improved; the second voice stream is segmented according to the sentence-breaking characteristics, so that voice recognition of the complete voice stream is avoided, and the recognition efficiency is improved; and the text completion is carried out on the voice text obtained by voice recognition, so that errors in the voice text are reduced, and the readability of the voice text is improved. The problems of low efficiency of voice recognition and poor readability of a recognition result can be solved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a real-time speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of endpoint detection provided in an embodiment of the present application;
fig. 3 is a schematic flow chart of text completion according to an embodiment of the present application;
fig. 4 is a schematic block diagram of an apparatus for speech recognition according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device for speech recognition according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of a real-time speech recognition method according to an embodiment of the present application. In this embodiment, the real-time speech recognition method includes:
s1, acquiring a first voice stream input by a user, and removing noise of the first voice stream in real time through a preset first voiceprint feature to obtain a second voice stream, wherein the first voiceprint feature is a voiceprint feature extracted from a historical voice set of the user.
In the embodiment of the present application, the first voice stream of the user can be continuously acquired by a microphone pre-installed at the user terminal. For example, after the user clicks a voice input button in the user terminal, the microphone continuously captures sound in the environment to obtain the user's first voice stream; or, after the user executes a voice input instruction in the user terminal, the instruction controls the microphone to start continuously capturing sound in the environment so as to acquire the user's first voice stream.
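As a minimal illustrative sketch of this continuous capture step, the snippet below streams microphone audio into a queue using the Python sounddevice library; the 16 kHz sample rate and 100 ms block size are assumptions chosen for illustration, not values fixed by the method.

```python
import queue

import sounddevice as sd  # assumed third-party dependency for microphone capture

audio_frames = queue.Queue()  # buffers blocks of the first voice stream

def on_audio(indata, frames, time_info, status):
    # Called by the audio driver for each captured block.
    if status:
        print(status)  # report overruns rather than failing silently
    audio_frames.put(indata.copy())

# Assumed parameters: 16 kHz mono capture in 100 ms blocks.
stream = sd.InputStream(samplerate=16000, channels=1,
                        blocksize=1600, callback=on_audio)
stream.start()  # capture runs continuously until stream.stop() is called
```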
In one application scenario of the present application, the captured first voice stream also contains background noise in addition to the user's voice. The embodiment of the present application therefore performs noise reduction on the first voice stream in real time according to the preset first voiceprint feature, removing the parts of the first voice stream that do not belong to the user's voice so as to obtain the second voice stream. The second voice stream is the user's voice containing no, or only a little, noise.
In detail, the first voiceprint feature is a voiceprint feature extracted from a historical speech collection of a user.
In one embodiment of the present application, before the step of obtaining the first voice stream input by the user, the method includes:
acquiring the historical voice set of a user;
and extracting the first voiceprint feature and the sentence break feature from the historical speech set.
The sentence break features comprise a sentence break duration threshold and habitual sentence break words.
In the embodiment of the present application, the historical speech set includes a plurality of pieces of historical speech of the user, for example, the historical speech set includes a plurality of call records of the user, or includes a plurality of consultation speech records of the user, and the like.
Embodiments of the present application can extract the first voiceprint feature and the sentence break features from the historical speech set by means of a preset convolutional neural network, a decibel meter, a filter, or a combination thereof, wherein the first voiceprint feature includes, but is not limited to: volume, amplitude and voice spectral density. The sentence break features describe the user's pause habits when speaking, for example at which words the user tends to pause and for how long the user pauses.
In one embodiment of the application, the voice intensity of each utterance in the historical voice set can be detected through a decibel meter, and the volume of each historical voice calculated from its voice intensity and duration; a speech waveform of each historical voice is generated by a filter, and the amplitude is calculated from the waveform. The first voiceprint feature is thereby extracted from the user's historical voice set.
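As a rough sketch of such low-level features, the snippet below computes decibel-scaled volume, peak amplitude and a spectral-density estimate with NumPy; the exact formulas are assumptions for illustration, since the text does not fix them.

```python
import numpy as np

def voiceprint_features(waveform: np.ndarray, sample_rate: int) -> dict:
    """Coarse voiceprint features of a mono waveform scaled to [-1, 1]."""
    duration = len(waveform) / sample_rate
    rms = np.sqrt(np.mean(waveform ** 2))
    volume_db = 20.0 * np.log10(rms + 1e-10)     # decibel-scaled loudness, as with a dB meter
    amplitude = float(np.max(np.abs(waveform)))  # peak amplitude of the speech waveform
    power = np.abs(np.fft.rfft(waveform)) ** 2   # crude voice spectral density estimate
    return {
        "volume_db": float(volume_db),
        "amplitude": amplitude,
        "spectral_density": power / max(duration, 1e-10),
    }
```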
Furthermore, each historical voice in the historical voice set can be analyzed by a preset convolutional neural network to determine where the user breaks sentences in each historical voice and how long each break lasts, from which the user's sentence break features are counted.
In detail, the sentence break features include a sentence break duration threshold and habitual sentence break words, the sentence break duration threshold being used to judge whether a pause during the user's speech is a sentence break. For example, when the user stops speaking, if the pause duration is less than the sentence break duration threshold it is determined that the user is making a sentence break, and if the pause duration is greater than or equal to the sentence break duration threshold it is determined that the user has finished speaking.
The habitual sentence break words can likewise be used to judge whether a pause during the user's speech is a sentence break: when a pause occurs, if the word appearing at the pause is a habitual sentence break word it is confirmed that the user is making a sentence break, and if the word appearing at the pause is not a habitual sentence break word it is confirmed that the user has finished speaking.
Specifically, the extracting the sentence break duration threshold from the historical speech set includes:
counting the sentence break duration of each sentence break made by the user in the historical voice set;
calculating the user's average sentence break duration from these durations, and determining the sentence break duration threshold according to the average duration; and,
the extracting the habitual sentence break words from the historical voice set includes:
counting the words before each of the user's sentence breaks in the historical voice set to obtain a sentence break word set;
and calculating the occurrence frequency of each word in the sentence break word set, and taking words whose frequency is greater than a preset frequency threshold as the habitual sentence break words.
For example, if the statistics show that the user made 3 sentence breaks with durations of 2 s, 3 s and 4 s, the average duration is calculated to be 3 s; with a preset proportional value of four thirds, the sentence break duration threshold is calculated to be 4 s.
Likewise, if the statistics of the words spoken before each sentence break show word A appearing 10 times, word B 50 times and word C 40 times, then the frequency of word A is one tenth, of word B one half and of word C two fifths; with a preset frequency threshold of three tenths, words B and C are confirmed as the habitual sentence break words.
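This statistic can be sketched as follows, assuming the pause durations and pre-pause words have already been extracted from the historical voice set; the 4/3 ratio and 0.3 frequency threshold mirror the example above.

```python
from collections import Counter

def sentence_break_features(pause_durations, pre_pause_words,
                            ratio=4 / 3, freq_threshold=0.3):
    """Derive the pause-duration threshold and the habitual sentence break words."""
    average = sum(pause_durations) / len(pause_durations)
    duration_threshold = average * ratio   # e.g. mean 3 s with ratio 4/3 gives 4 s
    counts = Counter(pre_pause_words)
    total = sum(counts.values())
    habitual = {w for w, c in counts.items() if c / total > freq_threshold}
    return duration_threshold, habitual

# Worked example from the text: pauses of 2 s, 3 s and 4 s; words A/B/C before breaks.
threshold, habitual_words = sentence_break_features(
    [2.0, 3.0, 4.0], ["A"] * 10 + ["B"] * 50 + ["C"] * 40)
# threshold == 4.0 and habitual_words == {"B", "C"}
```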
In this embodiment, the removing noise from the first voice stream according to the first voiceprint feature to obtain a second voice stream includes:
extracting a second voiceprint feature at each moment in the first voice stream;
calculating the similarity between each second voiceprint feature and the first voiceprint feature;
and eliminating from the first voice stream the portions whose second voiceprint features have a similarity smaller than a preset similarity threshold, so as to obtain the second voice stream.
In detail, the step of extracting the second voiceprint feature at each time in the first voice stream is the same as the step of extracting the first voiceprint feature in the historical voice set in step S1, and is not described herein again.
In one embodiment of the application, the second voiceprint features and the first voiceprint feature can be converted into vectors, and the similarity of the converted vectors calculated using an algorithm with a similarity calculation function, such as the cosine distance or Euclidean distance algorithm. The parts of the first voice stream that do not belong to the user's voice are then eliminated according to the result of the similarity calculation, realizing the noise reduction of the first voice stream.
For example, if the similarity between the second voiceprint features of the 10th to 15th segments of input voice and the first voiceprint feature is calculated to be smaller than the similarity threshold, the parts of the 10th to 15th segments that do not match the first voiceprint feature are removed, and the parts that do match it are retained.
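A minimal sketch of this per-segment filtering, assuming each moment of the first voice stream has been embedded as a voiceprint vector, and using cosine similarity with an illustrative 0.8 threshold:

```python
import numpy as np

def denoise_stream(segments, segment_embeddings, reference_embedding,
                   similarity_threshold=0.8):
    """Keep only the segments whose voiceprint matches the user's reference.

    segments: audio chunks of the first voice stream, one per moment.
    segment_embeddings: second voiceprint feature vectors, one per segment.
    reference_embedding: the user's first voiceprint feature vector.
    """
    reference = reference_embedding / np.linalg.norm(reference_embedding)
    kept = []
    for segment, embedding in zip(segments, segment_embeddings):
        cosine = float(np.dot(embedding / np.linalg.norm(embedding), reference))
        if cosine >= similarity_threshold:  # below threshold: treated as noise, dropped
            kept.append(segment)
    return kept                             # the second voice stream
```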
By removing noise from the first voice stream through the first voiceprint feature, the embodiment of the application reduces the amount of useless information in the first voice stream and removes its noise accurately, which improves the accuracy of subsequent voice recognition and thereby the efficiency of voice recognition.
And S2, performing endpoint detection on the second voice stream in real time through preset sentence break features to judge whether a voice starting point appears, wherein the sentence break features are extracted from the historical voice set of the user.
According to the embodiment of the application, endpoint detection can be performed on the second voice stream in real time through the sentence break duration threshold and the habitual sentence break words among the sentence break features, so as to judge whether a voice starting point appears.
For example, the second voice stream is continuously monitored, and when a pause is detected it is timed. When the pause time is less than the sentence break duration threshold, the user is considered to be making a sentence break and not yet finished speaking; when the pause time is greater than or equal to the sentence break duration threshold, it is confirmed that the user has finished speaking, and the pause position is determined to be the endpoint of the second voice stream.
Alternatively, the second voice stream is continuously monitored; when a pause is detected it is timed, and the words appearing in the second voice stream before the pause are identified. When the pause time is less than the sentence break duration threshold and the word is a habitual sentence break word, the user is considered to be making a sentence break and not yet finished speaking; when the pause time is greater than or equal to the sentence break duration threshold and the word is not a habitual sentence break word, it is confirmed that the user has finished speaking, and the pause position is determined to be the endpoint of the second voice stream.
Further, in another embodiment of the present application, referring to fig. 2, the performing endpoint detection on the second voice stream in real time through preset sentence break features to determine whether a voice starting point occurs includes:
S21, detecting in real time whether the second voice stream pauses;
S22, when a pause is detected, recording the duration of the pause, and identifying the words appearing in the second voice stream in the unit time before the pause;
S23, comparing the duration with the sentence break duration threshold, and comparing the words with the habitual sentence break words;
S24, if the duration is greater than the sentence break duration threshold and the word is a habitual sentence break word, determining the voice starting point of the second voice stream once it is no longer paused.
In detail, the unit time can be user-defined. When a pause is detected, the words appearing in the second voice stream in the unit time before the pause can be identified using a model with a speech recognition function, which avoids recognizing the complete voice stream, improves recognition efficiency, and thereby improves the efficiency of endpoint detection on the second voice stream.
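The pause classification behind this detection can be sketched as below, following the paragraph-level description earlier in this section; the pause detector and the recognizer supplying the last word are assumed to exist elsewhere, and the function names are illustrative.

```python
def classify_pause(pause_duration, last_word,
                   duration_threshold, habitual_words):
    """Classify a detected pause in the second voice stream.

    Returns "sentence_break" while the user is judged to be mid-utterance,
    "endpoint" when the utterance is judged complete, and "undecided" when
    the two cues disagree and monitoring should simply continue.
    """
    if pause_duration < duration_threshold and last_word in habitual_words:
        return "sentence_break"  # short pause at a habitual break word: keep listening
    if pause_duration >= duration_threshold and last_word not in habitual_words:
        return "endpoint"        # long pause, no break word: the user has finished
    return "undecided"
```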
And S3, when the voice starting point is detected, performing voice recognition on the second voice stream in real time to obtain a voice text.
In the embodiment of the present application, when a voice starting point is detected, a preset speech recognition model may be used to perform speech recognition on the second voice stream in real time to obtain a voice text. Such speech recognition models include, but are not limited to, models based on the HMM (Hidden Markov Model) and models based on the GMM (Gaussian Mixture Model).
In this embodiment of the application, performing speech recognition on the second voice stream to obtain a voice text includes:
performing convolution, pooling and multiple fully-connected processing operations on the second voice stream to obtain voice vectors;
matching the voice vectors against a preset character vector table to obtain an initial text;
and performing text completion on the initial text to obtain the voice text.
The multiple fully-connected processing generally means two layers of fully-connected processing, that is, performing fully-connected processing twice on the pooled speech segments, which increases the model's capacity and improves the accuracy of the resulting voice vectors.
The character vector table contains a plurality of characters and the voice vector corresponding to each character; the characters matching the voice vectors can be looked up through the table so as to obtain the voice text.
Performing voice recognition on the second voice stream through the preset voice recognition model improves the accuracy of the recognized voice text.
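Purely as an illustration of the matching step, the sketch below assumes the convolution, pooling and fully-connected stages have already produced one voice vector per recognized unit, and uses a nearest-neighbour Euclidean lookup against the character vector table; the table's dict structure is an assumption.

```python
import numpy as np

def vectors_to_text(speech_vectors, char_vector_table):
    """Map each voice vector to its nearest character in the vector table.

    char_vector_table: dict from character to its reference vector
    (assumed to represent the preset character vector table).
    """
    chars = list(char_vector_table)
    refs = np.stack([char_vector_table[c] for c in chars])  # (num_chars, dim)
    decoded = []
    for vector in speech_vectors:                # one vector per recognized unit
        distances = np.linalg.norm(refs - vector, axis=1)  # Euclidean distance
        decoded.append(chars[int(np.argmin(distances))])
    return "".join(decoded)                      # the initial text
```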
In another embodiment of the application, an acoustic model may also be adopted to perform speech recognition on the second voice stream to obtain the voice text. The acoustic model is built by modelling the pronunciation of each word, so as to create a database containing a plurality of words and the standard pronunciation corresponding to each word. The user's pronunciation at each moment in the second voice stream is collected, and the pronunciation at each moment is then probability-matched against the words in this pre-built database, thereby performing speech recognition on the second voice stream and obtaining the voice text.
When the second voice stream is recognized through the acoustic model, no feature extraction processing such as convolution and pooling needs to be performed on the speech, which helps improve the efficiency with which speech recognition obtains the voice text.
In an actual application scenario of the application, the readability of the recognized voice text may be reduced by limited model accuracy or by external environmental factors. For example, if external noise is strong at a certain moment, then after the noise of the first voice stream is removed according to the first voiceprint feature in step S1, the real voice at that moment may be removed with it, leaving a gap at that point in the obtained voice text; or the obtained voice text may contain wrongly written characters. The embodiment of the invention therefore performs text completion on the voice text to improve its readability.
In the embodiment of the application, text completion can be performed on the initial text by using a model with natural language processing capability, such as an NLP model or a BERT model, to obtain the voice text.
In an embodiment of the present invention, referring to fig. 3, the performing text completion on the initial text to obtain a voice text includes:
S31, performing word segmentation on the initial text to obtain text word segments;
S32, selecting a target word segment from the text word segments, and performing semantic relevance detection on the target word segment to obtain a correlation coefficient between the target word segment and the text word segments before and after it;
and S33, when the correlation coefficient is smaller than a preset correlation coefficient threshold, correcting the target word segment by using a replacement word to obtain the voice text.
In detail, target word segments can be selected from the text word segments in turn, a semantic vector constructed for each text word segment through models such as word2vec or other NLP models, and the correlation coefficient obtained by calculating the difference between the semantic vector of the target word segment and those of the text word segments before and after it.
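One plausible sketch of such a coefficient uses cosine similarity over word vectors; the word_vec lookup and the choice of the minimum neighbour similarity are assumptions for illustration, since the text only requires some coefficient between the target word segment and its neighbours.

```python
import numpy as np

def correlation(vec_a, vec_b):
    """Cosine similarity used as the semantic correlation coefficient."""
    return float(np.dot(vec_a, vec_b) /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

def target_correlation(tokens, index, word_vec):
    """Correlate tokens[index] with the text word segments before and after it.

    word_vec: callable mapping a token to its semantic vector, e.g. a
    word2vec lookup assumed to be trained elsewhere.
    """
    target = word_vec(tokens[index])
    neighbours = [word_vec(tokens[i])
                  for i in (index - 1, index + 1) if 0 <= i < len(tokens)]
    return min(correlation(target, n) for n in neighbours)
```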
Specifically, when the target word segment is corrected, a synonym of the target word segment can be used to replace it; or the correlation coefficients between the text word segments before and after the target word segment and a preset replaceable word are calculated, and when these correlation coefficients are greater than or equal to the preset correlation coefficient threshold, the replaceable word is used to replace the target word segment to obtain the voice text. The replaceable words may be given in advance by the user.
In another embodiment of the present invention, the correcting the target word segment by using the replacement word to obtain a voice text includes:
acquiring a replaceable word, and calculating the pre-correlation coefficient between the text word segment before the target word segment and the replaceable word;
calculating the post-correlation coefficient between the text word segment after the target word segment and the replaceable word;
judging whether the pre-correlation coefficient and the post-correlation coefficient are both greater than the correlation coefficient threshold;
if at least one of the pre-correlation coefficient and the post-correlation coefficient is less than or equal to the correlation coefficient threshold, returning to the step of acquiring a replaceable word to obtain a new replaceable word;
and if the pre-correlation coefficient and the post-correlation coefficient are both greater than the correlation coefficient threshold, replacing the target word segment with the replaceable word to obtain the voice text.
Further, if at least one of the pre-correlation coefficient and the post-correlation coefficient is less than or equal to the correlation coefficient threshold, a new replaceable word is acquired and the calculation repeated until the target word segment is replaced and the voice text obtained.
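The retry loop of this correction step might look like the following sketch; the candidate list, the word_vec lookup and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def _corr(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def correct_target_word(prev_vec, next_vec, candidates, word_vec,
                        corr_threshold=0.5):
    """Pick the first replaceable word correlated with both neighbouring segments.

    candidates: iterable of replaceable words (e.g. synonyms or words given
    in advance by the user); word_vec maps a word to its semantic vector.
    Returns the chosen word, or None when no candidate passes the threshold.
    """
    for candidate in candidates:
        vec = word_vec(candidate)
        pre = _corr(prev_vec, vec)    # pre-correlation coefficient
        post = _corr(next_vec, vec)   # post-correlation coefficient
        if pre > corr_threshold and post > corr_threshold:
            return candidate          # both checks pass: replace the target word
        # otherwise fall through: acquire the next candidate and recalculate
    return None
```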
The method provided by the embodiment of the application can remove noise from the user's first voice stream through the first voiceprint feature, reducing the amount of useless information in the first voice stream; accurately removing this noise improves the accuracy of subsequent voice recognition and thereby its efficiency. Segmenting the second voice stream according to the sentence break features avoids performing voice recognition on the complete voice stream at once, improving recognition efficiency. Performing text completion on the voice text obtained by recognition reduces the errors it contains and improves its readability. The problems of inefficient voice recognition and poorly readable recognition results can thus be solved.
As shown in fig. 4, an embodiment of the present application provides a schematic block diagram of a real-time speech recognition apparatus 10, where the real-time speech recognition apparatus 10 includes: a noise removal module 11, an endpoint recognition module 12 and a speech recognition module 13.
The noise removing module 11 is configured to obtain a first voice stream input by a user, and perform noise removal on the first voice stream in real time through a preset first voiceprint feature to obtain a second voice stream, where the first voiceprint feature is a voiceprint feature extracted from a historical voice set of the user;
the endpoint recognition module 12 is configured to perform endpoint detection on the second voice stream in real time through preset sentence break characteristics to determine whether a voice starting point occurs, where the sentence break characteristics are sentence break characteristics extracted from a historical voice set of a user;
the voice recognition module 13 is configured to perform voice recognition on the second voice stream in real time when a voice starting point is detected, so as to obtain a voice text.
In detail, in the embodiment of the present application, each module in the real-time speech recognition apparatus 10 adopts the same technical means as the real-time speech recognition method described in fig. 1 to 3, and can produce the same technical effect, and is not described herein again.
As shown in fig. 5, the embodiment of the present application provides a speech recognition device, which includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, wherein the processor 111, the communication interface 112, and the memory 113 complete mutual communication via the communication bus 114,
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111, when configured to execute the program stored in the memory 113, implements the real-time speech recognition method provided in any one of the foregoing method embodiments, including:
acquiring a first voice stream input by a user, and removing noise of the first voice stream in real time through a preset first voiceprint feature to obtain a second voice stream, wherein the first voiceprint feature is a voiceprint feature extracted from a historical voice set of the user;
performing endpoint detection on the second voice stream in real time through preset sentence break features to judge whether a voice starting point appears, wherein the sentence break features are extracted from the historical voice set of the user;
and when the voice starting point is detected, performing voice recognition on the second voice stream in real time to obtain a voice text.
The communication bus 114 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 114 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 112 is used for communication between the above-described electronic apparatus and other apparatuses.
The memory 113 may include a Random Access Memory (RAM), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory 113 may also be at least one storage device located remotely from the processor 111.
The processor 111 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the integrated circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components.
The present application also provides a computer readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the real-time speech recognition method provided by any one of the foregoing method embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (ssd)), among others.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A real-time speech recognition method, the method comprising:
acquiring a first voice stream input by a user, and removing noise of the first voice stream in real time through a preset first voiceprint feature to obtain a second voice stream, wherein the first voiceprint feature is a voiceprint feature extracted from a historical voice set of the user;
performing endpoint detection on the second voice stream in real time through preset sentence break features to judge whether a voice starting point appears, wherein the sentence break features are extracted from the historical voice set of the user;
when a voice starting point is detected, performing voice recognition on the second voice stream in real time to obtain a voice text;
wherein the sentence break features include a sentence break duration threshold and habitual sentence break words, and the method further comprises:
extracting the sentence break duration threshold from the historical voice set by: counting the sentence break duration of each sentence break made by the user in the historical voice set; calculating the user's average sentence break duration from these durations, and determining the sentence break duration threshold according to the average duration;
and extracting the habitual sentence break words from the historical voice set by: counting the words before each of the user's sentence breaks in the historical voice set to obtain a sentence break word set, calculating the occurrence frequency of each word in the sentence break word set, and taking words whose frequency is greater than a preset frequency threshold as the habitual sentence break words.
2. The real-time speech recognition method of claim 1, wherein the step of acquiring the first voice stream input by the user is preceded by:
acquiring the historical voice set of a user;
and extracting the first voiceprint feature and the sentence break feature from the historical speech set.
3. The method according to claim 1, wherein the removing noise from the first voice stream according to the first voiceprint feature to obtain a second voice stream comprises:
extracting a second voiceprint feature at each moment in the first voice stream;
calculating the similarity between each second voiceprint feature and the first voiceprint feature;
and eliminating from the first voice stream the portions whose second voiceprint features have a similarity smaller than a preset similarity threshold, so as to obtain the second voice stream.
4. The real-time speech recognition method according to claim 1, wherein the performing endpoint detection on the second voice stream in real time through preset sentence break features to determine whether a voice starting point occurs comprises:
detecting in real time whether the second voice stream pauses;
when a pause is detected, recording the duration of the pause, and identifying the words appearing in the second voice stream in the unit time before the pause;
comparing the duration with the sentence break duration threshold, and comparing the words with the habitual sentence break words;
and if the duration is greater than the sentence break duration threshold and the word is a habitual sentence break word, determining the pause as the voice starting point of the second voice stream.
5. The method according to any one of claims 1 to 4, wherein performing speech recognition on the second speech stream to obtain a speech text comprises:
performing convolution, pooling and multiple fully-connected processing operations on the second voice stream to obtain voice vectors;
matching the voice vectors according to a preset character vector table to obtain an initial text;
and performing text completion on the initial text to obtain a voice text.
6. The real-time speech recognition method of claim 5, wherein the performing text completion on the initial text to obtain a voice text comprises:
performing word segmentation on the initial text to obtain text word segments;
selecting a target word segment from the text word segments, and performing semantic relevance detection on the target word segment to obtain a correlation coefficient between the target word segment and the text word segments before and after it;
and, when the correlation coefficient is smaller than a preset correlation coefficient threshold, correcting the target word segment by using a replacement word to obtain the voice text.
7. The real-time speech recognition method of claim 6, wherein the correcting the target word segment by using the replacement word to obtain a voice text comprises:
acquiring a replaceable word, and calculating the pre-correlation coefficient between the text word segment before the target word segment and the replaceable word;
calculating the post-correlation coefficient between the text word segment after the target word segment and the replaceable word;
judging whether the pre-correlation coefficient and the post-correlation coefficient are both greater than the correlation coefficient threshold;
if at least one of the pre-correlation coefficient and the post-correlation coefficient is less than or equal to the correlation coefficient threshold, returning to the step of acquiring a replaceable word to obtain a new replaceable word;
and if the pre-correlation coefficient and the post-correlation coefficient are both greater than the correlation coefficient threshold, replacing the target word segment with the replaceable word to obtain the voice text.
8. A real-time speech recognition apparatus, the apparatus comprising:
the voice processing device comprises a noise removing module, a voice analyzing module and a voice processing module, wherein the noise removing module is used for acquiring a first voice stream input by a user, and removing noise of the first voice stream in real time through a preset first voiceprint characteristic to obtain a second voice stream, and the first voiceprint characteristic is a voiceprint characteristic extracted from a historical voice set of the user;
the endpoint identification module is used for carrying out endpoint detection on the second voice stream in real time through preset sentence break characteristics to judge whether a voice starting point appears or not, wherein the sentence break characteristics are sentence break characteristics extracted from a historical voice set of a user;
the voice recognition module is used for carrying out voice recognition on the second voice stream in real time when a voice starting point is detected to obtain a voice text;
wherein the sentence-break characteristics include a sentence-break duration threshold and a habitual sentence-break word, the apparatus is further configured to:
extracting the sentence break duration threshold from the historical voice set; counting sentence break duration of each sentence break of the historical voice set users; calculating the average sentence break duration of a user according to the sentence break duration, and determining the sentence break duration threshold according to the average duration;
extracting the habitual sentence-breaking words from the historical voice set; the method comprises the following steps: and counting words before the user breaks a sentence every time in the historical voice set to obtain a sentence breaking word set, calculating the occurrence frequency of each word in the sentence breaking word set, and taking the word with the frequency greater than a preset frequency threshold value as the habit sentence breaking word.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the real-time speech recognition method of any one of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the real-time speech recognition method according to any one of claims 1 to 7.
CN202110374258.8A 2021-04-07 2021-04-07 Real-time voice recognition method, device, equipment and medium Active CN113516994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110374258.8A CN113516994B (en) 2021-04-07 2021-04-07 Real-time voice recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110374258.8A CN113516994B (en) 2021-04-07 2021-04-07 Real-time voice recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113516994A CN113516994A (en) 2021-10-19
CN113516994B (en) 2022-04-26 (granted)

Family

ID=78062168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110374258.8A Active CN113516994B (en) 2021-04-07 2021-04-07 Real-time voice recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113516994B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023308A (en) * 2021-12-17 2022-02-08 广州讯飞易听说网络科技有限公司 Method and system for processing punctuation of voice sentence
CN115810346A (en) * 2023-02-17 2023-03-17 深圳市北科瑞声科技股份有限公司 Voice recognition method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06161485A (en) * 1992-11-24 1994-06-07 Nippon Telegr & Teleph Corp <Ntt> Synthesized speech pause setting system
CN106486130A (en) * 2015-08-25 2017-03-08 百度在线网络技术(北京)有限公司 Noise elimination, audio recognition method and device
CN107068147A (en) * 2015-10-19 2017-08-18 谷歌公司 Sound end is determined
CN107481718A (en) * 2017-09-20 2017-12-15 广东欧珀移动通信有限公司 Audio recognition method, device, storage medium and electronic equipment
CN111402880A (en) * 2020-03-24 2020-07-10 联想(北京)有限公司 Data processing method and device and electronic equipment
CN111737980A (en) * 2020-06-22 2020-10-02 桂林电子科技大学 Method for correcting English text word use errors
CN112397052A (en) * 2020-11-19 2021-02-23 康键信息技术(深圳)有限公司 VAD sentence-breaking test method, VAD sentence-breaking test device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170069309A1 (en) * 2015-09-03 2017-03-09 Google Inc. Enhanced speech endpointing


Also Published As

Publication number Publication date
CN113516994A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
CN109473106B (en) Voiceprint sample collection method, voiceprint sample collection device, voiceprint sample collection computer equipment and storage medium
CN108182937B (en) Keyword recognition method, device, equipment and storage medium
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN109493850B (en) Growing type dialogue device
CN111797632B (en) Information processing method and device and electronic equipment
CN113516994B (en) Real-time voice recognition method, device, equipment and medium
CN111145733B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
CN104143326A (en) Voice command recognition method and device
CN108039181B (en) Method and device for analyzing emotion information of sound signal
CN110544470B (en) Voice recognition method and device, readable storage medium and electronic equipment
US10971149B2 (en) Voice interaction system for interaction with a user by voice, voice interaction method, and program
CN112257437A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN112468659A (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
CN115102789B (en) Anti-communication network fraud studying, judging, early warning and intercepting comprehensive platform
CN112509561A (en) Emotion recognition method, device, equipment and computer readable storage medium
CN115394318A (en) Audio detection method and device
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN113782026A (en) Information processing method, device, medium and equipment
CN113782036A (en) Audio quality evaluation method and device, electronic equipment and storage medium
CN115810346A (en) Voice recognition method, device, equipment and medium
CN115512687B (en) Voice sentence-breaking method and device, storage medium and electronic equipment
JP3735209B2 (en) Speaker recognition apparatus and method
CN111640423A (en) Word boundary estimation method and device and electronic equipment
CN110580899A (en) Voice recognition method and device, storage medium and computing equipment
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant