CN115240655A - Chinese voice recognition system and method based on deep learning
- Publication number
- CN115240655A (application CN202210848331.5A)
- Authority
- CN
- China
- Prior art keywords
- chinese
- voice
- recognized
- text
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a Chinese speech recognition system and method based on deep learning, relating to the technical field of speech recognition. The system comprises: a voice acquisition module for receiving Chinese speech segments to be recognized in real time and sorting them in time order; a voice recognition module for constructing a Chinese speech recognition model and, following the sorting result, sequentially inputting the acquired segments into the model for recognition to obtain a speech text; and a correction module for performing grammar correction on the obtained speech text based on a preset Chinese grammar to obtain the final speech recognition text. By constructing a Chinese speech recognition model that recognizes the acquired segments in sequence and by correcting the recognized speech text according to Chinese grammar, the accuracy of Chinese speech recognition is safeguarded and the recognition effect is improved.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a Chinese voice recognition system and method based on deep learning.
Background
At present, with the rapid improvement of computer processing capability, speech recognition technology is developing quickly, and its application is increasingly changing how people work and live. It is widely used in fields such as voice input systems, voice control systems, and intelligent dialogue and inquiry systems.
However, most existing speech recognition systems only perform simple recognition of the input speech and cannot check the recognized text against Chinese semantics. As a result, logic errors or grammar errors remain in the recognition result, wrongly written characters in the result cannot be effectively corrected, and the recognition effect is greatly reduced.
Therefore, the invention provides a Chinese speech recognition system and method based on deep learning.
Disclosure of Invention
The invention provides a Chinese voice recognition system and method based on deep learning, which are used for sequentially recognizing an acquired Chinese voice segment to be recognized by constructing a Chinese voice recognition model and correcting a recognized voice text according to Chinese grammar, so that the accuracy of Chinese voice recognition is ensured, and the effect of Chinese voice recognition is improved.
The invention provides a Chinese speech recognition system based on deep learning, which comprises:
the voice acquisition module is used for receiving the Chinese voice segment to be recognized in real time and sequencing the Chinese voice segment to be recognized based on a time sequence;
the voice recognition module is used for constructing a Chinese voice recognition model and sequentially inputting the acquired Chinese voice segments to be recognized into the Chinese voice recognition model for voice recognition based on the sequencing result to obtain a voice text;
and the correction module is used for carrying out grammar correction on the obtained voice text based on the preset Chinese grammar to obtain the final voice recognition text.
Preferably, in the Chinese speech recognition system based on deep learning, the voice acquisition module includes:
the voice acquisition unit is used for monitoring the current acoustic characteristics of the user in real time and determining the current voice state of the user based on the acoustic characteristics, wherein the voice state is either voiced or unvoiced;
and the voice recording unit is used for acquiring the Chinese voice uttered by the user when the voice state is voiced, and storing the acquired Chinese voice to obtain the Chinese voice segment to be recognized.
Preferably, in the Chinese speech recognition system based on deep learning, the voice recording unit includes:
the voice processing subunit is used for acquiring the obtained Chinese voice segment to be recognized, and performing spectrum analysis on the Chinese voice segment to be recognized to obtain an audio map corresponding to the Chinese voice segment to be recognized;
the voice screening subunit is used for determining, based on the audio map, a first peak frequency point corresponding to the Chinese voice segment to be recognized at each moment, acquiring a noise audio map corresponding to a noise signal, and determining a second peak frequency point of the noise signal based on the noise audio map;
the voice screening subunit is further configured to compare the first peak frequency point with the second peak frequency point, screen out a target peak frequency point at which the first peak frequency point is greater than the second peak frequency point, and determine the Chinese voice segment to be recognized corresponding to the target peak frequency point as an effective Chinese voice segment to be recognized.
Preferably, in the Chinese speech recognition system based on deep learning, the voice acquisition module includes:
the time determining unit is used for acquiring the obtained Chinese speech segment to be recognized and processing the Chinese speech segment to be recognized to obtain a speech signal corresponding to each frame;
the time determining unit is further configured to determine time domain information of the Chinese speech segment to be recognized based on the speech signal corresponding to each frame, and match the time domain information with the speech signal corresponding to each frame;
and the sorting unit is used for determining a time sequence corresponding to the Chinese speech segment to be recognized based on the matching result and sorting the Chinese speech segment to be recognized based on the ascending sequence of the time sequence, wherein the Chinese speech segment to be recognized is at least one segment.
Preferably, in the Chinese speech recognition system based on deep learning, the sorting unit includes:
the result acquiring subunit is used for acquiring the sequencing result of the Chinese speech segment to be recognized and determining the target number of the Chinese speech segment to be recognized based on the sequencing result;
the tag obtaining subunit is used for extracting the acoustic features of the Chinese speech segment to be recognized and determining the speech type of the Chinese speech segment to be recognized based on the acoustic features;
and the marking subunit is used for acquiring a target number of marking labels from a preset label database based on the voice type and marking the Chinese voice segment to be recognized based on the target number of marking labels.
Preferably, in the Chinese speech recognition system based on deep learning, the voice recognition module includes:
the data acquisition unit is used for acquiring a voice training text and calling accents of different timbres from a preset voice library to read the voice training text aloud, obtaining audio data of the voice training text as read in the accents of different timbres;
the data processing unit is used for preprocessing the audio data, converting the audio data into a corresponding spectrogram based on a preprocessing result, and determining an effective area in the audio data based on the spectrogram;
the model construction unit is used for determining the characteristic parameters of the audio data based on the effective region, acquiring the corresponding relation between Chinese pinyin and Chinese characters, training the characteristic parameters based on the corresponding relation, and constructing a Chinese voice recognition model based on the training result;
the speech recognition unit is used for sequentially inputting the acquired Chinese speech segments to be recognized into the Chinese speech recognition model, analyzing the received Chinese speech segments to be recognized based on a preset syntax analysis tree in the Chinese speech recognition model, and determining a starting point and an ending point of each sentence in the Chinese speech segments to be recognized;
the speech recognition unit is used for performing first splitting on each Chinese speech segment to be recognized based on the starting point and the end point, obtaining a sentence set of each Chinese speech segment to be recognized based on a first splitting result, and extracting syllable attributes contained in each sentence of Chinese speech in the sentence set;
the voice recognition unit is used for carrying out second splitting on the Chinese voice of each sentence based on the syllable attribute and obtaining Chinese words contained in the Chinese voice of each sentence based on a second splitting result;
the voice recognition unit is also used for extracting pronunciation characteristics of the Chinese vocabulary, and processing the pronunciation characteristics based on the corresponding relation between the Chinese pinyin and the Chinese characters to obtain vocabulary texts corresponding to the Chinese vocabulary;
and the text splicing unit is used for splicing the vocabulary texts corresponding to the Chinese vocabularies contained in each sentence of Chinese voice to obtain the voice text corresponding to the Chinese voice segment to be recognized.
Preferably, the Chinese speech recognition system based on deep learning includes:
the voice recognition subunit is configured to acquire the sentence set of each Chinese voice segment to be recognized obtained based on the first splitting result, construct an acoustic model at the same time, and perform acoustic recognition on each sentence of Chinese voice in the sentence set based on the acoustic model;
the identity determining subunit is used for determining the sound characteristics corresponding to the Chinese speech of the adjacent sentence based on the acoustic recognition result and comparing the sound characteristics corresponding to the Chinese speech of the adjacent sentence;
and the result determining subunit is used for determining that the users corresponding to the Chinese voices of the adjacent sentences are the same when the comparison result determines that the sound characteristics corresponding to the Chinese voices of the adjacent sentences are consistent, and uniformly labeling the voice texts corresponding to the Chinese voices of the adjacent sentences, otherwise, determining that the users corresponding to the Chinese voices of the adjacent sentences are different, and performing distinguishing labeling on the voice texts corresponding to the Chinese voices of the adjacent sentences.
Preferably, the Chinese speech recognition system based on deep learning includes:
the text acquisition unit is used for acquiring the Chinese speech segment to be recognized, constructing a pronunciation change recognition model at the same time, and inputting the Chinese speech segment to be recognized into the pronunciation change recognition model for processing to obtain the intonation information of the Chinese speech segment to be recognized;
the intention determining unit is used for acquiring a voice text obtained by identifying the Chinese voice segment to be identified and combining the intonation information and the voice text to determine the target intention of the Chinese voice segment to be identified;
the semantic determining unit is used for performing semantic analysis on the voice text based on the target intention to obtain a semantic analysis result, acquiring a preset Chinese grammar checking rule and performing grammar checking on the voice text based on the semantic analysis result;
the grammar correcting unit is used for determining a target position of the abnormal voice text in the voice text when the grammar checking result judges that the voice text has wrong grammar, and determining the logic relation of the context of the abnormal voice text based on the target position;
the grammar correction unit is used for splitting the abnormal voice text at the target position to obtain N text keywords, and rearranging the N text keywords based on the logic relation and a preset Chinese grammar rule to obtain a corrected voice text;
the text verification unit is used for performing character verification on the corrected voice text based on the target intention, determining the difference characters in the voice text based on a verification result and determining the target pinyin of the difference characters;
the character replacing unit is used for mapping the target pinyin and each preset noun in a preset noun library one by one and determining a target replacing character based on a mapping result;
the character replacing unit is also used for replacing the difference characters based on the target replacing characters and obtaining a final voice recognition text based on a replacing result.
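The character replacement step above can be illustrated with a minimal sketch. It assumes toneless pinyin and a tiny hand-written character table; `CHAR_PINYIN`, `NOUN_LIBRARY`, and `replace_by_pinyin` are hypothetical names introduced only for illustration, and the patent does not specify how the mapping is implemented.

```python
from typing import Tuple

# Hypothetical toy table of toneless pinyin per character (illustration only).
CHAR_PINYIN = {"语": "yu", "雨": "yu", "音": "yin", "识": "shi", "十": "shi", "别": "bie"}

# Hypothetical preset noun library.
NOUN_LIBRARY = ["语音", "识别"]


def to_pinyin(word: str) -> Tuple[str, ...]:
    """Map each character to its pinyin; unknown characters map to themselves."""
    return tuple(CHAR_PINYIN.get(ch, ch) for ch in word)


def replace_by_pinyin(difference_word: str) -> str:
    """Return the first preset noun whose pinyin matches, else the input unchanged."""
    target = to_pinyin(difference_word)
    for noun in NOUN_LIBRARY:  # one-by-one comparison with each preset noun
        if to_pinyin(noun) == target:
            return noun
    return difference_word


print(replace_by_pinyin("雨音"))
```

A real system would need tone-aware pinyin and a way to rank several homophone candidates; this sketch simply takes the first match.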
Preferably, the Chinese speech recognition system based on deep learning includes:
the voice recognition text acquisition unit is used for acquiring a final voice recognition text and determining the text size of the final voice recognition text;
and the capacity allocation unit is used for allocating a target storage space in a preset storage area based on the text size and storing the final voice recognition text in the target storage space.
The invention provides a Chinese speech recognition method based on deep learning, which comprises the following steps:
step 1: receiving Chinese voice segments to be recognized in real time, and sequencing the Chinese voice segments to be recognized based on a time sequence;
step 2: constructing a Chinese voice recognition model, and inputting the acquired Chinese voice segments to be recognized into the Chinese voice recognition model in sequence based on the sequencing result to perform voice recognition to obtain a voice text;
step 3: carrying out grammar correction on the obtained voice text based on the preset Chinese grammar to obtain the final voice recognition text.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a block diagram of a deep learning based Chinese speech recognition system according to an embodiment of the present invention;
FIG. 2 is a block diagram of a speech acquisition module of a deep learning-based Chinese speech recognition system according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for Chinese speech recognition based on deep learning according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it should be understood that they are presented herein only to illustrate and explain the present invention and not to limit the present invention.
Example 1:
This embodiment provides a Chinese speech recognition system based on deep learning, as shown in FIG. 1, including:
the voice acquisition module is used for receiving the Chinese voice segment to be recognized in real time and sequencing the Chinese voice segment to be recognized based on a time sequence;
the voice recognition module is used for constructing a Chinese voice recognition model and sequentially inputting the acquired Chinese voice segments to be recognized into the Chinese voice recognition model for voice recognition based on the sequencing result to obtain a voice text;
and the correction module is used for carrying out grammar correction on the obtained voice text based on the preset Chinese grammar to obtain the final voice recognition text.
In this embodiment, the Chinese speech segment to be recognized refers to a received set of sentences for speech recognition, where each sentence is one speech segment.
In this embodiment, the time sequence represents the order in which the different Chinese speech segments to be recognized occur, i.e. the order in which they were spoken.
In this embodiment, sorting the Chinese speech segments to be recognized based on the time sequence means arranging the acquired segments in time order.
In this embodiment, the speech text refers to the text information obtained by recognizing a received Chinese speech segment, that is, the Chinese character information corresponding to that segment.
In this embodiment, the preset Chinese grammar is set in advance and includes rules defining the positions of subjects and verbs and the logical relationships between them.
The beneficial effects of the above technical scheme are: the Chinese speech recognition model is constructed to sequentially recognize the acquired Chinese speech segments to be recognized and correct the recognized speech text according to the Chinese grammar, so that the accuracy of Chinese speech recognition is ensured, and the effect of Chinese speech recognition is improved.
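The three modules of this embodiment can be sketched as a small pipeline. This is only an illustrative outline under stated assumptions: the `Segment` fields and the `recognize` and `correct` callables are hypothetical placeholders, since the patent does not disclose concrete interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start_time: float  # hypothetical field: seconds since capture began
    audio: bytes       # raw audio; the source does not fix a format


def run_pipeline(
    segments: List[Segment],
    recognize: Callable[[Segment], str],
    correct: Callable[[str], str],
) -> List[str]:
    """Sort segments by time, recognize each in order, then correct the text."""
    ordered = sorted(segments, key=lambda s: s.start_time)  # acquisition module
    texts = [recognize(s) for s in ordered]                 # recognition module
    return [correct(t) for t in texts]                      # correction module
```

Passing the recognizer and corrector in as functions keeps the sketch independent of any particular model or grammar checker.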
Example 2:
On the basis of embodiment 1, this embodiment provides a Chinese speech recognition system based on deep learning, as shown in FIG. 2, where the voice acquisition module includes:
the voice acquisition unit is used for monitoring the current acoustic characteristics of the user in real time and determining the current voice state of the user based on the acoustic characteristics, wherein the voice state is either voiced or unvoiced;
and the voice recording unit is used for acquiring the Chinese voice uttered by the user when the voice state is voiced, and storing the acquired Chinese voice to obtain the Chinese voice segment to be recognized.
In this embodiment, the acoustic feature is used to determine whether the user currently has speaking behavior.
The beneficial effects of the above technical scheme are: by accurately judging the current speaking behavior of the user, the generated Chinese speech is acquired and stored in time when the user speaks, and convenience is brought to accurate and effective recognition of the Chinese speech of the user.
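A minimal sketch of the voiced/unvoiced decision and recording described in this embodiment follows. It assumes that the acoustic characteristic is short-time energy and that a fixed threshold separates voiced from unvoiced frames; the patent names neither, so `frame_energy`, `record_voiced`, and the threshold value are hypothetical.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def record_voiced(frames: Sequence[Sequence[float]], threshold: float = 0.01) -> List[List[float]]:
    """Collect runs of consecutive voiced frames into stored speech segments."""
    segments: List[List[float]] = []
    current: List[float] = []
    for frame in frames:
        if frame_energy(frame) >= threshold:  # voice state: voiced
            current.extend(frame)
        elif current:                         # voice state: unvoiced, close the segment
            segments.append(current)
            current = []
    if current:                               # flush a segment still open at the end
        segments.append(current)
    return segments
```

Each returned list corresponds to one stored Chinese speech segment to be recognized.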
Example 3:
On the basis of embodiment 2, this embodiment provides a Chinese speech recognition system based on deep learning, where the voice recording unit includes:
the voice processing subunit is used for acquiring the obtained Chinese voice segment to be recognized, and performing spectrum analysis on the Chinese voice segment to be recognized to obtain an audio map corresponding to the Chinese voice segment to be recognized;
the voice screening subunit is used for determining, based on the audio map, a first peak frequency point corresponding to the Chinese voice segment to be recognized at each moment, acquiring a noise audio map corresponding to a noise signal, and determining a second peak frequency point of the noise signal based on the noise audio map;
the voice screening subunit is further configured to compare the first peak frequency point with the second peak frequency point, screen out a target peak frequency point at which the first peak frequency point is greater than the second peak frequency point, and determine the Chinese voice segment to be recognized corresponding to the target peak frequency point as an effective Chinese voice segment to be recognized.
In this embodiment, the audio map refers to the result of converting the Chinese speech segment to be recognized into a corresponding audio representation, so that the effective speech signal can be distinguished from the noise signal in the segment.
In this embodiment, the first peak frequency point refers to the magnitude of the audio value corresponding to each frame of the Chinese speech segment to be recognized in the time domain.
In this embodiment, the noise audio map refers to the audio representations corresponding to various kinds of noise.
In this embodiment, the second peak frequency point refers to the magnitude of the audio value corresponding to the noise.
In this embodiment, the target peak frequency point refers to a Chinese speech signal in which the first peak frequency point is greater than the second peak frequency point.
In this embodiment, the effective Chinese speech segment to be recognized refers to the speech signal, free of other interference, that is obtained after removing the noise signals from the Chinese speech segment to be recognized.
The beneficial effects of the above technical scheme are: the obtained Chinese speech segment to be recognized is converted into the corresponding audio map, and the audio map corresponding to the noise signal is obtained at the same time, so that the noise signal in the Chinese speech segment to be recognized is eliminated through the audio map, the effectiveness of the Chinese speech segment to be recognized is guaranteed, and the recognition effect of the Chinese speech segment to be recognized is improved.
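The peak comparison in this embodiment might look roughly like the sketch below. It interprets the "peak frequency point" as the largest spectral magnitude of a frame, computed with a plain discrete Fourier transform, which is one possible reading and not something the patent states. The names `peak_magnitude` and `screen_frames` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def peak_magnitude(frame: Sequence[float]) -> float:
    """Largest DFT magnitude over the non-negative frequency bins of a frame."""
    n = len(frame)
    best = 0.0
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        best = max(best, abs(acc))
    return best


def screen_frames(frames: Sequence[Sequence[float]], noise_frame: Sequence[float]) -> List[Sequence[float]]:
    """Keep only frames whose peak exceeds the peak of the reference noise frame."""
    noise_peak = peak_magnitude(noise_frame)  # second peak frequency point
    return [f for f in frames if peak_magnitude(f) > noise_peak]  # first vs second
```

The direct DFT is quadratic in the frame length and is used here only to keep the sketch dependency-free; a practical system would use an FFT.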
Example 4:
On the basis of embodiment 1, this embodiment provides a Chinese speech recognition system based on deep learning, where the voice acquisition module includes:
the time determining unit is used for acquiring the obtained Chinese speech segment to be recognized and processing the Chinese speech segment to be recognized to obtain a speech signal corresponding to each frame;
the time determining unit is further configured to determine time domain information of the Chinese speech segment to be recognized based on the speech signal corresponding to each frame, and match the time domain information with the speech signal corresponding to each frame;
and the sorting unit is used for determining a time sequence corresponding to the Chinese speech segment to be recognized based on the matching result and sorting the Chinese speech segment to be recognized based on the ascending sequence of the time sequence, wherein the Chinese speech segment to be recognized is at least one segment.
In this embodiment, the time domain information refers to the time range covered by the received Chinese speech segment to be recognized.
In this embodiment, the time sequence represents the specific time corresponding to each Chinese speech segment to be recognized.
The beneficial effects of the above technical scheme are: the specific time sequence of each Chinese speech segment to be recognized is confirmed by determining the time domain information related to the Chinese speech segment to be recognized, so that the obtained Chinese speech segments to be recognized are conveniently sequenced through the specific time sequence, the recognition efficiency of the Chinese speech segments to be recognized is improved, and the recognition effect of the Chinese speech segments to be recognized is guaranteed.
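As a rough sketch of the time determination and ascending sort in this embodiment, the code below derives each segment's time range from frame positions. The fixed frame duration and the `FramedSegment` fields are assumptions made for illustration; the patent does not say how frames are timestamped.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FramedSegment:
    name: str
    first_frame_index: int  # hypothetical: position of the first frame in the capture stream
    frame_count: int


def time_range(seg: FramedSegment, frame_seconds: float = 0.02) -> Tuple[float, float]:
    """Time domain information: (start, end) of the segment in seconds."""
    start = seg.first_frame_index * frame_seconds
    return (start, start + seg.frame_count * frame_seconds)


def sort_segments(segs: List[FramedSegment], frame_seconds: float = 0.02) -> List[FramedSegment]:
    """Sort segments in ascending order of their start time."""
    return sorted(segs, key=lambda s: time_range(s, frame_seconds)[0])
```

Because `sorted` is stable, segments with identical start times keep their arrival order.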
Example 5:
On the basis of embodiment 4, this embodiment provides a Chinese speech recognition system based on deep learning, where the sorting unit includes:
the result acquiring subunit is used for acquiring the sorting result of the Chinese speech segment to be recognized and determining the target number of the Chinese speech segment to be recognized based on the sorting result;
the tag obtaining subunit is used for extracting the acoustic features of the Chinese speech segment to be recognized and determining the speech type of the Chinese speech segment to be recognized based on the acoustic features;
and the marking subunit is used for acquiring a target number of marking labels from a preset label database based on the voice type and marking the Chinese voice segment to be recognized based on the target number of marking labels.
In this embodiment, the target number is the specific number of Chinese speech segments to be recognized that have been acquired.
In this embodiment, the acoustic features refer to the sound characteristics of the Chinese speech segment to be recognized, including timbre and intonation.
In this embodiment, the preset label database is set in advance and stores the marking labels corresponding to different voice types.
In this embodiment, the marking labels are marking symbols that distinguish the different Chinese speech segments to be recognized; they allow the segments to be told apart quickly and make it easy to determine the speech type of each segment.
The beneficial effects of the above technical scheme are: the acoustic characteristics of the Chinese speech segment to be recognized are determined, and the speech type of the speech segment to be recognized is accurately and effectively judged according to the acoustic characteristics, so that the fact that different Chinese speech segments to be recognized are marked by selecting proper marking labels according to the speech type is facilitated, the orderliness of recognition of the Chinese speech segment to be recognized is guaranteed, and meanwhile, the recognition efficiency and the accuracy are facilitated to be improved.
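The labeling step could be sketched as follows. Everything concrete here is a labeled assumption: the two speech types, the rule that a rising intonation slope marks a question, and the contents of `LABEL_DB` are hypothetical stand-ins for the patent's unspecified speech types and preset label database.

```python
from typing import Dict, List

# Hypothetical preset label database: speech type -> label prefix.
LABEL_DB: Dict[str, str] = {"statement": "S", "question": "Q"}


def classify(intonation_slope: float) -> str:
    """Assumed rule: rising intonation indicates a question."""
    return "question" if intonation_slope > 0 else "statement"


def label_segments(intonation_slopes: List[float]) -> List[str]:
    """Give each sorted segment one label, numbered per speech type."""
    counters: Dict[str, int] = {}
    labels: List[str] = []
    for slope in intonation_slopes:  # one entry per segment, in sorted order
        speech_type = classify(slope)
        counters[speech_type] = counters.get(speech_type, 0) + 1
        labels.append(f"{LABEL_DB[speech_type]}{counters[speech_type]}")
    return labels


print(label_segments([0.3, -0.1, 0.2]))
```

The number of labels returned equals the number of segments, which corresponds to the target number in the text.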
Example 6:
On the basis of embodiment 1, this embodiment provides a Chinese speech recognition system based on deep learning, where the voice recognition module includes:
the data acquisition unit is used for acquiring a voice training text and calling accents of different timbres from a preset voice library to read the voice training text aloud, obtaining audio data of the voice training text as read in the accents of different timbres;
the data processing unit is used for preprocessing the audio data, converting the audio data into a corresponding spectrogram based on a preprocessing result, and determining an effective area in the audio data based on the spectrogram;
the model construction unit is used for determining the characteristic parameters of the audio data based on the effective region, acquiring the corresponding relation between Chinese pinyin and Chinese characters, training the characteristic parameters based on the corresponding relation, and constructing a Chinese voice recognition model based on the training result;
the speech recognition unit is used for sequentially inputting the acquired Chinese speech segments to be recognized into the Chinese speech recognition model, analyzing the received Chinese speech segments to be recognized based on a preset syntax analysis tree in the Chinese speech recognition model, and determining a starting point and an ending point of each sentence in the Chinese speech segments to be recognized;
the speech recognition unit is used for performing first splitting on each Chinese speech segment to be recognized based on the starting point and the end point, obtaining a sentence set of each Chinese speech segment to be recognized based on a first splitting result, and extracting syllable attributes contained in each sentence of Chinese speech in the sentence set;
the voice recognition unit is used for carrying out second splitting on the Chinese voice of each sentence based on the syllable attribute and obtaining Chinese words contained in the Chinese voice of each sentence based on a second splitting result;
the voice recognition unit is also used for extracting pronunciation characteristics of the Chinese vocabulary, and processing the pronunciation characteristics based on the corresponding relation between the Chinese pinyin and the Chinese characters to obtain vocabulary texts corresponding to the Chinese vocabulary;
and the text splicing unit is used for splicing the vocabulary text corresponding to the Chinese vocabulary contained in each sentence of Chinese speech to obtain the speech text corresponding to the Chinese speech segment to be recognized.
In this embodiment, the preset voice library is used for storing accents with different timbres, so that the Chinese speech recognition model can be trained accurately and effectively.
In this embodiment, the speech training text is set in advance, and the text information corresponding to the speech is known.
In this embodiment, the preprocessing may be denoising or the like processing on the audio data.
In this embodiment, the spectrogram refers to a spectral analysis view, the abscissa of which is time, the ordinate of which is frequency, and the coordinate point value of which is voice data energy.
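As a minimal sketch of the spectrogram described here (time on the abscissa, frequency on the ordinate, energy as the point value), the following Python computes a short-time Fourier transform with the standard library only. The function name `spectrogram`, the frame length, the hop size, the Hann window and the 1 kHz test tone are all assumptions made for illustration; the source does not specify how the spectrogram is computed.

```python
import cmath
import math

def spectrogram(signal, sample_rate, frame_len=64, hop=32):
    """Return (times, freqs, energy); energy[i][k] is the energy at times[i], freqs[k]."""
    # Hann window to reduce spectral leakage (an assumed choice, not from the source).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    freqs = [k * sample_rate / frame_len for k in range(n_bins)]
    times, energy = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):  # naive DFT, adequate for a short sketch
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            row.append(abs(coeff) ** 2)
        times.append((start + frame_len / 2) / sample_rate)  # frame centre in seconds
        energy.append(row)
    return times, freqs, energy

sample_rate = 8000
# Hypothetical test input: 512 samples of a 1 kHz sine tone.
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]
times, freqs, energy = spectrogram(tone, sample_rate)

# Frequency bin with the largest energy summed over all frames.
peak_bin = max(range(len(freqs)), key=lambda k: sum(row[k] for row in energy))
peak_freq = freqs[peak_bin]
```

A production system would more likely use an FFT routine from a numerical library; the naive transform is used here only to keep the sketch self-contained.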
In this embodiment, the effective region refers to the speech segments carrying key characterization information, which are extracted by filtering the acquired audio data.
In this embodiment, the characteristic parameter refers to a value of the audio data and a fluctuation range corresponding to the intonation.
In this embodiment, the preset syntax analysis tree is set in advance and is used for analyzing the acquired Chinese speech segment to be recognized according to Chinese grammar, which helps improve the accuracy and efficiency of recognition.
In this embodiment, the starting point and the end point are used to characterize the beginning and end of each sentence.
In this embodiment, the first splitting refers to splitting the Chinese speech segment to be recognized into a plurality of sentences in units of sentences.
In this embodiment, the sentence set refers to a set obtained by splitting a Chinese speech segment to be recognized into a plurality of sentences.
In this embodiment, the syllable attribute refers to whether the words contained in each sentence are monosyllabic or disyllabic.
In this embodiment, the second splitting refers to splitting each sentence of Chinese speech into a plurality of Chinese words in units of words.
In this embodiment, the pronunciation characteristics refer to pronunciation characteristics of each Chinese vocabulary.
In this embodiment, the vocabulary text refers to the Chinese characters corresponding to the vocabulary speech in each sentence of Chinese speech.
In this embodiment, analyzing the received Chinese speech segment to be recognized based on the preset syntax analysis tree in the Chinese speech recognition model includes:
acquiring an obtained Chinese voice segment to be recognized, converting the Chinese voice segment to be recognized into a corresponding feature vector, and determining a feature sequence corresponding to the Chinese voice segment to be recognized based on the feature vector;
calculating the word sequence recognized by the Chinese speech recognition model to the Chinese speech segment to be recognized based on the characteristic sequence, and calculating the recognition accuracy of the Chinese speech segment to be recognized based on the word sequence, wherein the method specifically comprises the following steps:
calculating the recognized word sequence of the Chinese speech segment to be recognized according to the following formula:
$$M = \arg\max_{m}\left[\log_2 P(\alpha \mid m) + \eta \cdot \log_2 P(m)\right];$$
wherein M represents the recognized word sequence of the Chinese speech segment to be recognized; P(α|m) represents the acoustic model, namely the probability that the output acoustic feature is the feature sequence α given that the preset word sequence is m, with value range (0,1); P(m) represents the language model, namely the probability of the preset word sequence m in the feature sequence, with value range (0,1); η represents an adjustable parameter with value range (0,1); argmax[·] represents the function that returns the argument maximizing the bracketed expression, here the word sequence for which the acoustic model and the language model together best satisfy the condition for recognizing the Chinese speech segment to be recognized;
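The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decode`, the candidate list and all probability values are hypothetical, and a real decoder would search a much larger hypothesis space rather than enumerate three candidates.

```python
import math

def decode(candidates, eta=0.5):
    """Pick the word sequence m that maximises log2 P(alpha|m) + eta * log2 P(m)."""
    def score(item):
        _, p_acoustic, p_language = item
        return math.log2(p_acoustic) + eta * math.log2(p_language)
    return max(candidates, key=score)[0]

# Hypothetical candidates: (word sequence m, P(alpha|m), P(m)).
candidates = [
    (("你好", "世界"), 0.020, 0.30),
    (("你好", "视界"), 0.025, 0.01),
    (("拟好", "世界"), 0.004, 0.02),
]

best_balanced = decode(candidates, eta=0.5)    # language model carries real weight
best_acoustic = decode(candidates, eta=0.01)   # language model almost ignored
```

With η = 0.5 the language model outweighs the small acoustic advantage of the second candidate, so the first sequence is chosen; with η close to 0 the acoustic score dominates and the second sequence wins. This is the trade-off the adjustable parameter η controls.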
determining the total number K of recognized words of the Chinese speech segment to be recognized based on the word sequence M;
calculating the recognition accuracy of the Chinese speech segment to be recognized according to the following formula:
wherein the quantity computed by the formula is the recognition accuracy of the Chinese speech segment to be recognized, with value range (0,1); ω represents an error factor with value range (0.02,0.05); K represents the total number of recognized words of the Chinese speech segment to be recognized; δ represents the number of wrongly recognized words of the Chinese speech segment to be recognized; σ represents the number of missed words of the Chinese speech segment to be recognized;
comparing the calculated identification accuracy with a preset accuracy;
if the recognition accuracy is greater than or equal to the preset accuracy, judging that the Chinese speech segment to be recognized is qualified for recognition;
otherwise, judging that the Chinese speech segment to be recognized is unqualified in recognition, and performing speech recognition on the Chinese speech segment to be recognized again until the recognition accuracy is greater than or equal to the preset accuracy.
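The qualification check and re-recognition loop can be sketched as follows. Because the accuracy formula is not reproduced in this text, the sketch takes the accuracy computation as a caller-supplied function; `recognize`, `estimate_accuracy`, the default threshold and the `max_attempts` cap are all hypothetical. The source repeats recognition until the threshold is met and does not mention a cap, which is added here only so the sketch cannot loop forever.

```python
def recognize_until_qualified(segment, recognize, estimate_accuracy,
                              preset_accuracy=0.9, max_attempts=5):
    """Re-run recognition until the accuracy reaches the preset accuracy."""
    for attempt in range(1, max_attempts + 1):
        words = recognize(segment)
        accuracy = estimate_accuracy(words)
        if accuracy >= preset_accuracy:      # recognition is qualified
            return words, accuracy, attempt
    raise RuntimeError("recognition did not reach the preset accuracy")

# Stub recogniser for illustration: the first pass is poor, the second is good.
_passes = iter([(["今天", "天汽"], 0.70), (["今天", "天气"], 0.95)])
_state = {}

def fake_recognize(segment):
    _state["words"], _state["accuracy"] = next(_passes)
    return _state["words"]

def fake_accuracy(words):
    return _state["accuracy"]

words, accuracy, attempt = recognize_until_qualified(
    b"raw-audio", fake_recognize, fake_accuracy)
```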
The feature vector refers to a statement vector about the Chinese speech segment to be recognized, which is obtained after vectorization processing is performed on the Chinese speech segment to be recognized.
The feature sequence refers to a vocabulary sequence corresponding to each feature vector.
The preset accuracy is set in advance and is used for judging whether the recognition accuracy of the Chinese speech segment to be recognized meets the preset requirement or not.
The beneficial effects of the above technical scheme are as follows: a voice training text is obtained and read aloud in accents of different timbres, so that audio data covering those accents is obtained effectively; the audio data corresponding to the voice training text is then processed and used for training, so that the Chinese speech recognition model is constructed accurately and reliably. The acquired Chinese speech segments to be recognized are split and recognized by this model, which safeguards the recognition accuracy and recognition effect and ensures that the resulting speech text is accurate and valid.
Example 7:
on the basis of embodiment 6, this embodiment provides a Chinese speech recognition system based on deep learning, where the speech recognition unit includes:
the voice recognition subunit is configured to acquire the sentence set of each Chinese speech segment to be recognized obtained based on the first splitting result, construct an acoustic model at the same time, and perform acoustic recognition on each sentence of Chinese speech in the sentence set based on the acoustic model;
the identity determining subunit is used for determining the sound characteristics corresponding to the Chinese speech of the adjacent sentence based on the acoustic recognition result and comparing the sound characteristics corresponding to the Chinese speech of the adjacent sentence;
and the result determining subunit is used for determining that the users corresponding to the Chinese voices of the adjacent sentences are the same and uniformly labeling the voice texts corresponding to the Chinese voices of the adjacent sentences when the comparison result determines that the sound characteristics corresponding to the Chinese voices of the adjacent sentences are consistent, or else, determining that the users corresponding to the Chinese voices of the adjacent sentences are different and distinguishing and labeling the voice texts corresponding to the Chinese voices of the adjacent sentences.
In this embodiment, the acoustic model is used to analyze the sound characteristics of the Chinese speech segment to be recognized, including timbre and tone.
In this embodiment, the sound feature may be, for example, the thickness of the voice in the Chinese speech of adjacent sentences.
In this embodiment, unified labeling refers to labeling the Chinese speech of adjacent sentences as belonging to the same user, so that the obtained speech texts can be told apart by speaker.
In this embodiment, distinguishing labeling refers to labeling the Chinese speech of adjacent sentences as belonging to different users, so that the obtained speech texts can be told apart by speaker.
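The adjacent-sentence comparison can be sketched as below. The source does not say how sound features are represented or compared, so this sketch assumes each sentence has already been reduced to a numeric feature vector and uses cosine similarity with a hypothetical threshold of 0.95; the names `cosine` and `label_speakers` and the feature values are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def label_speakers(sentence_features, threshold=0.95):
    """Label each sentence by comparing its sound features with the previous sentence."""
    labels = []
    current = 0
    for i, features in enumerate(sentence_features):
        if i > 0 and cosine(sentence_features[i - 1], features) < threshold:
            current += 1  # features differ: treat as a different user
        labels.append(f"user_{current}")
    return labels

# Hypothetical two-dimensional sound features for four consecutive sentences.
features = [(1.0, 0.2), (0.9, 0.25), (0.1, 1.0), (0.12, 0.95)]
labels = label_speakers(features)
```

Because only adjacent sentences are compared, a speaker who returns later in the segment receives a new label in this sketch; linking such turns back to an earlier user would need an additional step that the text does not describe.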
The beneficial effects of the above technical scheme are as follows: an acoustic model is constructed and used to recognize the acoustic features of the Chinese speech of adjacent sentences in the Chinese speech segment to be recognized, so that the users corresponding to different sentences can be judged accurately. This makes it easier to manage the recognized speech text in an orderly way and improves the recognition effect of the Chinese speech segment to be recognized.
Example 8:
on the basis of embodiment 1, this embodiment provides a Chinese speech recognition system based on deep learning, where the correction module includes:
the text acquisition unit is used for acquiring the Chinese speech segment to be recognized, constructing a pronunciation change recognition model, inputting the Chinese speech segment to be recognized into the pronunciation change recognition model and processing the Chinese speech segment to be recognized to obtain intonation information of the Chinese speech segment to be recognized;
the intention determining unit is used for acquiring a voice text obtained by identifying the Chinese voice segment to be identified and combining the intonation information and the voice text to determine the target intention of the Chinese voice segment to be identified;
the semantic determining unit is used for performing semantic analysis on the voice text based on the target intention to obtain a semantic analysis result, acquiring a preset Chinese grammar checking rule and performing grammar checking on the voice text based on the semantic analysis result;
the grammar correcting unit is used for determining a target position of the abnormal voice text in the voice text when the grammar checking result judges that the voice text has wrong grammar, and determining the logic relation of the context of the abnormal voice text based on the target position;
the grammar correction unit is used for splitting the abnormal voice text at the target position to obtain N text keywords, and rearranging the N text keywords based on the logic relation and a preset Chinese grammar rule to obtain a corrected voice text;
the text verification unit is used for performing character verification on the corrected voice text based on the target intention, determining the difference characters in the voice text based on a verification result and determining the target pinyin of the difference characters;
the character replacing unit is used for mapping the target pinyin and each preset noun in a preset noun library one by one and determining a target replacing character based on a mapping result;
the character replacing unit is also used for replacing the difference characters based on the target replacing characters and obtaining a final voice recognition text based on a replacing result.
In this embodiment, the pronunciation change recognition model is used for recognizing pronunciation change of the Chinese speech segment to be recognized.
In this embodiment, the intonation information refers to the intonation change condition corresponding to the Chinese speech segment to be recognized, so as to facilitate the determination of the speech intention of the user.
In this embodiment, the target intent refers to an expression purpose corresponding to the Chinese speech segment to be recognized.
In this embodiment, the semantic analysis refers to analyzing the acquired voice text to determine the meaning of the voice text expression.
In this embodiment, the preset Chinese grammar checking rule is set in advance and is used to check the grammar of the Chinese speech segment to be recognized.
In this embodiment, the abnormal speech text refers to text information corresponding to an incorrect grammar existing in the acquired speech text.
In this embodiment, the target position refers to a position condition of the abnormal speech text in the obtained speech text.
In this embodiment, the text keyword refers to a Chinese vocabulary contained in a sentence in which the abnormal speech text is located.
In this embodiment, the preset Chinese grammar rule is set in advance.
In this embodiment, word checking refers to checking words in the obtained speech text so as to determine whether an error word exists therein.
In this embodiment, the difference word refers to an error word existing in the obtained speech text.
In this embodiment, the target pinyin refers to the pronunciation corresponding to the difference characters.
In this embodiment, the preset noun library is set in advance and is used for storing different characters.
In this embodiment, the target replacement character refers to a Chinese character that is a homophone of the difference character but is written differently.
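The homophone replacement step can be sketched as follows. The pinyin table, the noun library and the function names are hypothetical stand-ins; a real system would use a complete pinyin resource and the preset noun library, and the positions of the difference characters would come from the text verification unit rather than being passed in by hand.

```python
# Hypothetical pinyin table and noun library, kept tiny for illustration.
PINYIN = {"今": "jin", "天": "tian", "气": "qi", "汽": "qi",
          "车": "che", "很": "hen", "好": "hao"}
NOUN_LIBRARY = ["天气", "汽车"]

def to_pinyin(word):
    """Map a word to its tuple of syllables (tones ignored in this sketch)."""
    return tuple(PINYIN[ch] for ch in word)

def replace_difference_words(words, difference_indices, noun_library=NOUN_LIBRARY):
    """Replace flagged words with a library noun that has the same pinyin."""
    by_pinyin = {to_pinyin(noun): noun for noun in noun_library}
    corrected = list(words)
    for i in difference_indices:
        target = by_pinyin.get(to_pinyin(words[i]))
        if target is not None:   # keep the original word if no homophone is found
            corrected[i] = target
    return corrected

recognized = ["今天", "天汽", "很好"]
final_text = replace_difference_words(recognized, difference_indices=[1])
```

Here the flagged word 天汽 shares the pinyin "tian qi" with the library noun 天气 and is replaced by it, while unflagged words are left untouched.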
The beneficial effects of the above technical scheme are: by carrying out grammar check on the obtained voice text, the error grammar in the voice text is corrected in time when grammar errors exist, and then the Chinese character form in the voice text is checked after grammar correction, so that the accuracy of the finally obtained voice recognition text is ensured, and the recognition effect of the Chinese voice segment to be recognized is ensured.
Example 9:
on the basis of embodiment 1, this embodiment provides a Chinese speech recognition system based on deep learning, where the correction module includes:
the voice recognition text acquisition unit is used for acquiring a final voice recognition text and determining the text size of the final voice recognition text;
and the capacity allocation unit is used for allocating a target storage space in a preset storage area based on the text size and storing the final voice recognition text in the target storage space.
In this embodiment, the preset storage area is set in advance and includes storage spaces of different sizes.
In this embodiment, the target storage space refers to a storage area for storing the acquired final speech recognition text.
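A minimal sketch of the capacity allocation, assuming that the text size is measured as the UTF-8 byte length and that the smallest sufficiently large free space is chosen (a best-fit policy). Both choices, along with the name `allocate_storage` and the slot sizes, are assumptions; the text only says that a target storage space is allocated based on the text size.

```python
def allocate_storage(text, free_slots):
    """Return (text size in bytes, size of the smallest free slot that fits)."""
    size = len(text.encode("utf-8"))
    fitting = [slot for slot in free_slots if slot >= size]
    if not fitting:
        raise ValueError("no storage space is large enough for the text")
    return size, min(fitting)

# Hypothetical preset storage area with spaces of different sizes, in bytes.
free_slots = [16, 64, 256]
size, slot = allocate_storage("今天天气很好", free_slots)
```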
The beneficial effects of the above technical scheme are as follows: determining the size of the final speech recognition text makes it convenient to allocate a matching storage space and store the text in it, which improves how the recognition results of the Chinese speech segment to be recognized are stored and safeguards the recognition effect.
Example 10:
this embodiment provides a Chinese speech recognition method based on deep learning, as shown in Fig. 3, including:
step 1: receiving Chinese voice segments to be recognized in real time, and sequencing the Chinese voice segments to be recognized based on a time sequence;
step 2: constructing a Chinese voice recognition model, and inputting the acquired Chinese voice segments to be recognized into the Chinese voice recognition model in sequence based on the sequencing result to perform voice recognition to obtain a voice text;
step 3: carrying out grammar correction on the obtained voice text based on the preset Chinese grammar to obtain the final voice recognition text.
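The three steps can be sketched as a small pipeline. Everything here is a stand-in: the `Segment` type, the `run_pipeline` function and the two stub callables are hypothetical, and the stubs replace the trained recognition model and the grammar correction described in the other embodiments.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    timestamp: float   # receive time in seconds
    audio: bytes       # raw audio payload

def run_pipeline(segments, recognize, correct):
    """Sort segments by time, recognise each one in order, then correct the text."""
    ordered = sorted(segments, key=lambda s: s.timestamp)   # step 1
    texts = [recognize(segment) for segment in ordered]     # step 2
    return [correct(text) for text in texts]                # step 3

# Stubs for illustration: "recognition" decodes bytes, "correction" trims spaces.
def stub_recognize(segment):
    return segment.audio.decode("utf-8")

def stub_correct(text):
    return text.strip()

segments = [
    Segment(2.0, " 很好 ".encode("utf-8")),
    Segment(1.0, "今天天气".encode("utf-8")),
]
final_texts = run_pipeline(segments, stub_recognize, stub_correct)
```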
The beneficial effects of the above technical scheme are: the Chinese speech recognition model is constructed to sequentially recognize the acquired Chinese speech segments to be recognized and correct the recognized speech text according to the Chinese grammar, so that the accuracy of Chinese speech recognition is ensured, and the effect of Chinese speech recognition is improved.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. A Chinese speech recognition system based on deep learning, comprising:
the voice acquisition module is used for receiving the Chinese voice segment to be recognized in real time and sequencing the Chinese voice segment to be recognized based on a time sequence;
the voice recognition module is used for constructing a Chinese voice recognition model and sequentially inputting the acquired Chinese voice segments to be recognized into the Chinese voice recognition model for voice recognition based on the sequencing result to obtain a voice text;
and the correction module is used for carrying out grammar correction on the obtained voice text based on the preset Chinese grammar to obtain the final voice recognition text.
2. The system of claim 1, wherein the speech acquisition module comprises:
the voice acquisition unit is used for monitoring the current acoustic characteristics of the user in real time and determining the current voice state of the user based on the acoustic characteristics, wherein the voice state comprises a voiced sound and an unvoiced sound;
and the voice recording unit is used for acquiring the Chinese voice sent by the user when the voice state is voiced, and storing the acquired Chinese voice to obtain the Chinese voice segment to be recognized.
3. The system of claim 2, wherein the voice recording unit comprises:
the voice processing subunit is used for acquiring the obtained Chinese voice segment to be recognized, and performing spectrum analysis on the Chinese voice segment to be recognized to obtain an audio spectrum corresponding to the Chinese voice segment to be recognized;
the voice screening subunit is used for determining a first peak frequency point corresponding to the Chinese voice segment to be recognized at each moment based on the audio spectrum, acquiring a noise audio spectrum corresponding to a noise signal, and determining a second peak frequency point of the noise signal based on the noise audio spectrum;
the voice screening subunit is configured to compare the first peak frequency point with the second peak frequency point, screen out a target peak frequency point at which the first peak frequency point is greater than the second peak frequency point, and determine the Chinese voice segment to be recognized corresponding to the target peak frequency point as an effective Chinese voice segment to be recognized.
4. The system of claim 1, wherein the speech acquisition module comprises:
the time determining unit is used for acquiring the obtained Chinese speech segment to be recognized and processing the Chinese speech segment to be recognized to obtain a speech signal corresponding to each frame;
the time determining unit is further configured to determine time domain information of the Chinese speech segment to be recognized based on the speech signal corresponding to each frame, and match the time domain information with the speech signal corresponding to each frame;
and the sorting unit is used for determining a time sequence corresponding to the Chinese speech segment to be recognized based on the matching result and sorting the Chinese speech segment to be recognized based on the ascending sequence of the time sequence, wherein the Chinese speech segment to be recognized is at least one segment.
5. The deep learning based Chinese speech recognition system of claim 4, wherein the sorting unit comprises:
the result acquiring subunit is used for acquiring the sequencing result of the Chinese speech segment to be recognized and determining the target number of the Chinese speech segment to be recognized based on the sequencing result;
the tag obtaining subunit is used for extracting the acoustic features of the Chinese speech segment to be recognized and determining the speech type of the Chinese speech segment to be recognized based on the acoustic features;
and the marking subunit is used for acquiring a target number of marking labels from a preset label database based on the voice type and marking the Chinese voice segment to be recognized based on the target number of marking labels.
6. The system of claim 1, wherein the speech recognition module comprises:
the data acquisition unit is used for acquiring a voice training text and calling accents with different timbres from a preset voice library to read the voice training text, so as to obtain audio data of the voice training text read in accents of different timbres;
the data processing unit is used for preprocessing the audio data, converting the audio data into a corresponding spectrogram based on a preprocessing result, and determining an effective area in the audio data based on the spectrogram;
the model construction unit is used for determining the characteristic parameters of the audio data based on the effective region, acquiring the corresponding relation between Chinese pinyin and Chinese characters, training the characteristic parameters based on the corresponding relation, and constructing a Chinese voice recognition model based on the training result;
the speech recognition unit is used for sequentially inputting the acquired Chinese speech segments to be recognized into the Chinese speech recognition model, analyzing the received Chinese speech segments to be recognized based on a preset syntax analysis tree in the Chinese speech recognition model, and determining a starting point and an ending point of each sentence in the Chinese speech segments to be recognized;
the voice recognition unit is used for performing first splitting on each Chinese voice segment to be recognized based on the starting point and the end point, obtaining a sentence set of each Chinese voice segment to be recognized based on a first splitting result, and extracting syllable attributes contained in each sentence of Chinese voice in the sentence set;
the voice recognition unit is used for carrying out second splitting on each sentence of Chinese voice based on the syllable attribute and obtaining Chinese vocabulary contained in each sentence of Chinese voice based on a second splitting result;
the voice recognition unit is also used for extracting pronunciation characteristics of the Chinese vocabulary, and processing the pronunciation characteristics based on the corresponding relation between the Chinese pinyin and the Chinese characters to obtain vocabulary texts corresponding to the Chinese vocabulary;
and the text splicing unit is used for splicing the vocabulary text corresponding to the Chinese vocabulary contained in each sentence of Chinese speech to obtain the speech text corresponding to the Chinese speech segment to be recognized.
7. The system of claim 6, wherein the speech recognition unit comprises:
the voice recognition subunit is configured to acquire the sentence set of each Chinese voice segment to be recognized obtained based on the first splitting result, construct an acoustic model at the same time, and perform acoustic recognition on each sentence of Chinese voice in the sentence set based on the acoustic model;
the identity determining subunit is used for determining the sound characteristics corresponding to the Chinese speech of the adjacent sentence based on the acoustic recognition result and comparing the sound characteristics corresponding to the Chinese speech of the adjacent sentence;
and the result determining subunit is used for determining that the users corresponding to the Chinese voices of the adjacent sentences are the same and uniformly labeling the voice texts corresponding to the Chinese voices of the adjacent sentences when the comparison result determines that the sound characteristics corresponding to the Chinese voices of the adjacent sentences are consistent, or else, determining that the users corresponding to the Chinese voices of the adjacent sentences are different and distinguishing and labeling the voice texts corresponding to the Chinese voices of the adjacent sentences.
8. The system of claim 1, wherein the correction module comprises:
the text acquisition unit is used for acquiring the Chinese speech segment to be recognized, constructing a pronunciation change recognition model at the same time, and inputting the Chinese speech segment to be recognized into the pronunciation change recognition model for processing to obtain the intonation information of the Chinese speech segment to be recognized;
the intention determining unit is used for acquiring a voice text obtained by identifying the Chinese voice segment to be identified and combining the intonation information and the voice text to determine the target intention of the Chinese voice segment to be identified;
the semantic determining unit is used for performing semantic analysis on the voice text based on the target intention to obtain a semantic analysis result, acquiring a preset Chinese grammar checking rule and performing grammar checking on the voice text based on the semantic analysis result;
the grammar correcting unit is used for determining a target position of the abnormal voice text in the voice text when the grammar checking result judges that the voice text has wrong grammar, and determining the logic relation of the context of the abnormal voice text based on the target position;
the grammar correction unit is used for splitting the abnormal voice text at the target position to obtain N text keywords, and rearranging the N text keywords based on the logic relation and a preset Chinese grammar rule to obtain a corrected voice text;
the text verification unit is used for performing character verification on the corrected voice text based on the target intention, determining the difference characters in the voice text based on a verification result and determining the target pinyin of the difference characters;
the character replacing unit is used for mapping the target pinyin and each preset noun in a preset noun library one by one and determining a target replacing character based on a mapping result;
and the character replacing unit is also used for replacing the difference characters based on the target replacing characters and obtaining a final voice recognition text based on a replacing result.
9. The system of claim 1, wherein the correction module comprises:
the voice recognition text acquisition unit is used for acquiring a final voice recognition text and determining the text size of the final voice recognition text;
and the capacity allocation unit is used for allocating a target storage space in a preset storage area based on the text size and storing the final voice recognition text in the target storage space.
10. A Chinese speech recognition method based on deep learning is characterized by comprising the following steps:
step 1: receiving Chinese voice segments to be recognized in real time, and sequencing the Chinese voice segments to be recognized based on a time sequence;
step 2: constructing a Chinese voice recognition model, and inputting the acquired Chinese voice segments to be recognized into the Chinese voice recognition model in sequence based on the sequencing result to perform voice recognition to obtain a voice text;
step 3: carrying out grammar correction on the obtained voice text based on the preset Chinese grammar to obtain the final voice recognition text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210848331.5A CN115240655A (en) | 2022-07-19 | 2022-07-19 | Chinese voice recognition system and method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115240655A (en) | 2022-10-25 |
Family
ID=83674272
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210848331.5A Pending CN115240655A (en) | 2022-07-19 | 2022-07-19 | Chinese voice recognition system and method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115240655A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116597821A (en) * | 2023-07-17 | 2023-08-15 | 深圳市国硕宏电子有限公司 | Intelligent customer service voice recognition method and system based on deep learning |
CN117558269A (en) * | 2024-01-11 | 2024-02-13 | 深圳波洛斯科技有限公司 | Voice recognition method, device, medium and electronic equipment |
CN117558269B (en) * | 2024-01-11 | 2024-03-15 | 深圳波洛斯科技有限公司 | Voice recognition method, device, medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||