CN117831573B - Multi-mode-based language barrier crowd speech recording analysis method and system


Info

Publication number: CN117831573B (application CN202410254551.4A)
Authority: CN (China)
Prior art keywords: sequence, character, comparison, text, semantic
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN202410254551.4A
Other languages: Chinese (zh)
Other versions: CN117831573A
Inventors: 钟晓云, 巩湘红, 李爱晶, 李娇, 张苑
Current Assignee: Qindao University Of Technology
Original Assignee: Qindao University Of Technology
Application filed by Qindao University Of Technology (filing date 2024-03-06)
Priority to CN202410254551.4A (priority date 2024-03-06)
Publication of CN117831573A: 2024-04-05
Application granted; publication of CN117831573B: 2024-05-14


Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a multimodal speech recording analysis method and system for people with language disorders, relating to the technical field of data processing. The method comprises the following steps: obtaining voice template information; parsing the speech recording through a voice recognition model to obtain a comparison text sequence; obtaining a comparison pinyin sequence; carrying out semantic deviation analysis to obtain a text semantic deviation coefficient; when the text semantic deviation coefficient is greater than or equal to a first deviation coefficient threshold, obtaining a deviation character set; carrying out semantic deviation analysis on the reference pinyin sequence and the comparison pinyin sequence with the deviation character set as a constraint to obtain a pinyin semantic deviation coefficient; and sending the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal. The invention solves the technical problem in the prior art that one-sided speech recording analysis of people with language disorders yields analysis results of low reliability, and achieves the technical effect of improving analysis accuracy by carrying out speech recording analysis from multiple dimensions.

Description

Multi-mode-based language barrier crowd speech recording analysis method and system
Technical Field
The invention relates to the technical field of data processing, and in particular to a multimodal speech recording analysis method and system for people with language disorders.
Background
Language disorder is a common functional impairment of language use. At present, staff mainly score the vocalization manner, voice, and passage-reading performance of people with language disorders against an evaluation table; however, this evaluation is subject to considerable interference arising from differences in staff ability. Therefore, to determine the degree of language disorder, one existing approach records the speech of people with language disorders to obtain analysis material and then compares the obtained material with the corresponding standard speech to derive the degree of language disorder.
However, such comparison is usually a mechanical match of the recording against the standard speech, which ignores the personal particularities of people with speech disorders and therefore carries a large error. In the actual analysis process only the surface text of the recording is compared; the comparison method is one-dimensional, and the analysis result cannot provide an accurate basis for judging the degree of language use of people with language disorders. Because the speech recordings are analyzed one-sidedly, the accuracy of the analysis results is not high.
Disclosure of Invention
The application provides a multimodal speech recording analysis method and system for people with language disorders, which are used to solve the technical problem in the prior art that one-sided speech recording analysis of people with language disorders yields analysis results of low reliability.
In view of these problems, the application provides a multimodal speech recording analysis method and system for people with language disorders.
In a first aspect of the application, a multimodal speech recording analysis method for people with language disorders is provided, the method comprising the following steps:
obtaining voice template information, wherein the voice template information comprises a reference character sequence and a reference pinyin sequence, and the reference character sequence has no repeated characters;
receiving a speech recording of the voice template information made by a target user, and parsing the speech recording through a voice recognition model to obtain a comparison text sequence;
processing the comparison text sequence through a pinyin matching table to obtain a comparison pinyin sequence;
carrying out semantic deviation analysis on the reference character sequence and the comparison text sequence to obtain a text semantic deviation coefficient;
when the text semantic deviation coefficient is greater than or equal to a first deviation coefficient threshold, obtaining a deviation character set;
carrying out semantic deviation analysis on the reference pinyin sequence and the comparison pinyin sequence with the deviation character set as a constraint to obtain a pinyin semantic deviation coefficient;
and sending the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal.
In a second aspect of the present application, a multimodal speech recording analysis system for people with language disorders is provided, the system comprising:
The voice template information obtaining module is used for obtaining voice template information, wherein the voice template information comprises a reference character sequence and a reference pinyin sequence, and the reference character sequence has no repeated characters;
The comparison text sequence obtaining module is used for receiving a speech recording of the voice template information made by the target user, and parsing the speech recording through a voice recognition model to obtain a comparison text sequence;
The comparison pinyin sequence obtaining module is used for processing the comparison text sequence through a pinyin matching table to obtain a comparison pinyin sequence;
The semantic deviation coefficient obtaining module is used for carrying out semantic deviation analysis on the reference character sequence and the comparison text sequence to obtain a text semantic deviation coefficient;
The deviation character set obtaining module is used for obtaining a deviation character set when the text semantic deviation coefficient is greater than or equal to a first deviation coefficient threshold;
The pinyin semantic deviation coefficient obtaining module is used for carrying out semantic deviation analysis on the reference pinyin sequence and the comparison pinyin sequence with the deviation character set as a constraint to obtain a pinyin semantic deviation coefficient;
and the deviation coefficient sending module is used for sending the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal.
One or more technical solutions provided by the application have at least the following technical effects or advantages:
The application obtains voice template information, wherein the voice template information comprises a reference character sequence and a reference pinyin sequence and the reference character sequence has no repeated characters; receives a speech recording of the voice template information made by a target user and parses the speech recording through a voice recognition model to obtain a comparison text sequence; processes the comparison text sequence through a pinyin matching table to obtain a comparison pinyin sequence; performs semantic deviation analysis on the reference character sequence and the comparison text sequence to obtain a text semantic deviation coefficient; when the text semantic deviation coefficient is greater than or equal to a first deviation coefficient threshold, obtains a deviation character set; at this point performs semantic deviation analysis on the reference pinyin sequence and the comparison pinyin sequence with the deviation character set as a constraint to obtain a pinyin semantic deviation coefficient; and then sends the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal. The technical effect of carrying out reliable speech recording analysis for people with language disorders and improving analysis accuracy is thereby achieved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of the multimodal speech recording analysis method for people with language disorders provided by an embodiment of the application;
FIG. 2 is a schematic flow chart of obtaining a comparison text sequence in the multimodal speech recording analysis method for people with language disorders provided by an embodiment of the application;
FIG. 3 is a schematic flow chart of sending an interval duration anomaly coefficient to a management terminal in the multimodal speech recording analysis method for people with language disorders provided by an embodiment of the application;
FIG. 4 is a schematic structural diagram of the multimodal speech recording analysis system for people with language disorders provided by an embodiment of the application.
Reference numerals: voice template information obtaining module 11, comparison text sequence obtaining module 12, comparison pinyin sequence obtaining module 13, semantic deviation coefficient obtaining module 14, deviation character set obtaining module 15, pinyin semantic deviation coefficient obtaining module 16, and deviation coefficient sending module 17.
Detailed Description
The application provides a multimodal speech recording analysis method and system for people with language disorders, which are used to solve the technical problem in the prior art that one-sided speech recording analysis of people with language disorders yields analysis results of low reliability.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the terms "comprises" and "comprising," along with any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
Embodiment one:
As shown in FIG. 1, the application provides a multimodal speech recording analysis method for people with language disorders, comprising the following steps:
S100: obtaining voice template information, wherein the voice template information comprises a reference character sequence and a reference pinyin sequence, and the reference character sequence has no repeated characters;
In one possible embodiment, the voice template information provides standardized reading content for the speech recording analysis of people with language disorders, and comprises a reference character sequence and a reference pinyin sequence. The reference character sequence has no repeated characters, which prevents repeated characters in the template from making the degree of recovery of people with language disorders hard to judge accurately. The reference character sequence is a paragraph or short dialogue without repeated characters, in which every character has a fixed position. The reference pinyin sequence is the pinyin corresponding to the characters in the reference character sequence: each pinyin corresponds to one character of the reference character sequence and occupies the same position as that character.
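To make the template structure concrete, the following is a minimal sketch of what the voice template information could look like, assuming a simple dictionary layout; the field names and the example passage are illustrative assumptions, not the patent's actual data format.

```python
# A minimal sketch of the voice template information described above.
# Field names and example content are illustrative assumptions.
voice_template = {
    # reference character sequence: a short passage with no repeated characters
    "reference_chars": ["今", "天", "风", "很", "大"],
    # reference pinyin sequence: one pinyin per character, same positions
    "reference_pinyin": ["jin1", "tian1", "feng1", "hen3", "da4"],
}

def has_no_repeats(chars: list[str]) -> bool:
    """Check the template constraint: no repeated characters in the reference sequence."""
    return len(chars) == len(set(chars))

assert has_no_repeats(voice_template["reference_chars"])
assert len(voice_template["reference_chars"]) == len(voice_template["reference_pinyin"])
```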
By obtaining the voice template information, a basic recording material template is provided for reliable speech recording analysis of people with language disorders, laying the groundwork for the subsequent analysis of their degree of recovery.
S200: receiving speech sound records of a target user on the voice template information, and analyzing the speech sound records through a voice recognition model to obtain a comparison text sequence;
Further, in step S200 of the embodiment of the application, receiving a speech recording of the voice template information made by the target user and parsing the speech recording through a voice recognition model to obtain a comparison text sequence further comprises:
receiving a speech recording of the voice template information made by the target user, and parsing the speech recording through a voice recognition model to obtain a first recognition character sequence;
and performing repeated-character store-and-delete processing on the first recognition character sequence to obtain the comparison text sequence.
Further, as shown in fig. 2, the step S200 of the embodiment of the present application further includes:
randomly extracting a first initial recognition character of the first recognition character sequence, and deleting the first initial recognition character from the first recognition character sequence to obtain a second recognition character sequence;
traversing the second recognition character sequence based on the first initial recognition character to perform semantic distance analysis, obtaining a plurality of semantic distance feature values;
extracting the feature recognition character set whose semantic distance feature values are equal to 0;
traversing the first initial recognition character and the feature recognition character set, alternately retaining and deleting each to obtain a plurality of recognition character sequence combinations, and setting the plurality of recognition character sequence combinations as the comparison text sequence.
In one possible embodiment, the target user is any user with a language disorder, such as a patient with dyslexia, a compulsive speech disorder, or aphasia. The target user records the reading of the voice template information through equipment such as a voice recorder or a microphone, and the recording is sent to the voice recognition model for parsing and recognition to obtain the comparison text sequence. The comparison text sequence is the sequence obtained by converting the target user's speech recording into characters and then applying repeated-character store-and-delete processing and combination analysis.
Optionally, the speech recording of the voice template information made by the target user is sent to the speech recognition main channel of the voice recognition model, where the recording undergoes text analysis and is converted into the first recognition character sequence. The voice recognition model analyzes and recognizes the target user's recording of the voice template information from multiple dimensions, and comprises a speech recognition main channel, a nasal sound classification channel, and an accent classification channel. The first recognition character sequence is the target user's speech recording converted into characters and output in order.
Optionally, because the target user may read characters of the reference character sequence in the voice template repeatedly, repeated characters can exist in the first recognition character sequence and interfere with analyzing the target user's degree of recovery from the language disorder; therefore repeated-character store-and-delete processing is performed on the first recognition character sequence to obtain a more accurate comparison text sequence.
In one possible embodiment, one recognition character is randomly extracted from the first recognition character sequence as the first initial recognition character and taken as the repeated-character recognition object. The first initial recognition character is deleted from the first recognition character sequence to obtain the second recognition character sequence. The second recognition character sequence is the object in which repeated characters are to be recognized, that is, the characters identical to the first initial recognition character need to be extracted from it.
Semantic distance analysis is then performed between the first initial recognition character and each recognition character in the second recognition character sequence, yielding a plurality of semantic distance feature values that reflect the semantic distance between two recognition characters. The larger the semantic distance feature value, the farther a character in the second recognition character sequence is from the first initial recognition character; the smaller the value, the closer it is. Optionally, the first initial recognition character and the characters of the second recognition character sequence are converted into word vectors, the cosine similarity between the word vector of the first initial recognition character and each word vector from the second recognition character sequence is computed with the cosine similarity formula, and the results are used to derive the plurality of semantic distance feature values.
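A minimal sketch of this semantic distance analysis follows. The `word_vector` embedding lookup is a stand-in for whatever word-vector model is used, and taking 1 minus the cosine similarity as the feature value is an assumption consistent with a feature value of 0 marking an identical character.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_distance_features(initial_char: str, remaining: list[str],
                               word_vector) -> list[float]:
    """Distance feature values between the first initial recognition character
    and every character of the second recognition character sequence.
    `word_vector` is an assumed callable mapping a character to its vector."""
    v0 = word_vector(initial_char)
    return [1.0 - cosine_similarity(v0, word_vector(c)) for c in remaining]
```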
In one possible embodiment, a semantic distance feature value equal to 0 indicates a repetition of the first initial recognition character. The recognition characters whose semantic distance feature values equal 0 are extracted to form the feature recognition character set, namely the set of recognition characters in the second recognition character sequence that repeat the first initial recognition character.
The first initial recognition character and the characters of the feature recognition character set are then alternately retained in and deleted from the first recognition character sequence, producing a plurality of recognition character sequence combinations, which are added to the comparison text sequence. For example, when the reference character sequence is "one, two, three, four" and the first recognition character sequence is "two, one, two, three, four", "two" is taken as the first initial recognition character, and the repeated occurrences of "two" form the corresponding feature recognition character set. The recognition character sequence combinations obtained after deletion are "two, one, three, four", "one, two, three, four", and "one, three, four". Each repeated character "two" may have been read twice by the target user by accident, or read repeatedly because of the target user's language disorder, so the character combinations obtained after retention need to be examined one by one. This ensures the reliability of the speech recording analysis of the target user.
Optionally, a second initial recognition character is again randomly extracted from the first recognition character sequence and the repeated-character store-and-delete analysis is repeated, with the resulting recognition character sequence combinations added to the comparison text sequence. The comparison text sequence is obtained once every recognition character in the first recognition character sequence has undergone the repeated-character store-and-delete analysis.
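Under the interpretation given by the "one, two, three, four" example, the store-and-delete step can be sketched as follows: every occurrence of a repeated character is independently retained or deleted, and each resulting variant joins the comparison text sequence. This is an illustrative reading, not the patent's exact procedure.

```python
from itertools import product

def store_delete_combinations(chars: list[str]) -> list[list[str]]:
    """All retain/delete variants over the occurrences of repeated characters."""
    repeated = {c for c in chars if chars.count(c) > 1}
    positions = [i for i, c in enumerate(chars) if c in repeated]
    variants: list[list[str]] = []
    for keep_mask in product([True, False], repeat=len(positions)):
        dropped = {p for p, keep in zip(positions, keep_mask) if not keep}
        variant = [c for i, c in enumerate(chars) if i not in dropped]
        if variant not in variants:  # deduplicate identical variants
            variants.append(variant)
    return variants

# ["two","one","two","three","four"] yields the original plus
# ["two","one","three","four"], ["one","two","three","four"], ["one","three","four"]
print(store_delete_combinations(["two", "one", "two", "three", "four"]))
```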
S300: processing the comparison text sequence through a pinyin matching table to obtain a comparison pinyin sequence;
In one embodiment, the pinyin matching table is a table of mapping relationships between different characters and their corresponding standard pinyin, preset by those skilled in the art. Pinyin matching is performed on each of the recognition character sequence combinations in the comparison text sequence according to the mapping relationships in the pinyin matching table, obtaining the comparison pinyin sequence. The comparison pinyin sequence combinations correspond one to one with the recognition character sequence combinations in the comparison text sequence.
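A minimal sketch of this mapping step follows; the table entries are illustrative, and a real matching table would cover the full character inventory.

```python
# Illustrative pinyin matching table: character -> standard pinyin (with tone).
PINYIN_TABLE = {"一": "yi1", "二": "er4", "三": "san1", "四": "si4"}

def to_pinyin_sequence(char_sequence: list[str]) -> list[str]:
    """One comparison pinyin sequence per character sequence, position for position."""
    return [PINYIN_TABLE[c] for c in char_sequence]

# one pinyin sequence per recognition character sequence combination
comparison_pinyin = [to_pinyin_sequence(seq)
                     for seq in [["二", "一", "三", "四"], ["一", "二", "三", "四"]]]
```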
S400: carrying out semantic deviation analysis on the reference character sequence and the comparison character sequence to obtain a character semantic deviation coefficient;
Further, the step S400 of the embodiment of the present application further includes:
Constructing a text semantic deviation coefficient evaluation function:

$$P=\frac{1}{M}\sum_{k=1}^{M}\frac{1}{n}\sum_{i=1}^{n_{k}}\mathbb{1}\!\left[d\!\left(x_{i},y_{i}^{k}\right)>d_{0}\right]$$

wherein $P$ represents the text semantic deviation coefficient, $x_{i}$ represents the i-th reference character of the reference character sequence, $y_{i}^{k}$ represents the i-th comparison character of the k-th comparison character sequence, $d(x_{i},y_{i}^{k})$ represents the semantic distance between the i-th reference character and the i-th comparison character, $d_{0}$ represents the distance threshold, $n$ represents the number of characters of the reference character sequence, $n_{k}$ represents the number of characters of the k-th comparison character sequence, and $M$ represents the total number of comparison character sequences;
And traversing the reference character sequence and the comparison character sequence to perform semantic deviation calculation according to the character semantic deviation coefficient evaluation function, and obtaining the character semantic deviation coefficient.
In the embodiment of the application, after the comparison text sequence is obtained, the text semantic deviation coefficient evaluation function is used to quantitatively analyze the degree of semantic deviation between the comparison text sequence and the reference character sequence, yielding the text semantic deviation coefficient. The text semantic deviation coefficient describes the degree of deviation between the character semantics in the target user's speech recording and the character semantics expressed by the reference character sequence of the voice template information.
Optionally, the text semantic deviation coefficient evaluation function quantifies the degree of semantic deviation between the comparison text sequence and the reference character sequence. The reference character sequence and the comparison text sequence are input into the evaluation function, thereby obtaining the text semantic deviation coefficient. By taking the various conditions of the target user during recording into account in an overall analysis, the accuracy of the character semantic deviation analysis is improved.
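The evaluation function, in the form reconstructed above, can be sketched as follows; the `distance` argument stands in for the semantic distance measure, and the position-by-position pairing of reference and comparison characters is an assumption drawn from the variable definitions.

```python
# A sketch of the evaluation function as reconstructed above: the fraction of
# position-aligned character pairs whose semantic distance exceeds the
# threshold d0, averaged over the M comparison sequences. The aggregation in
# the patent's original (image-only) formula may differ.
def text_semantic_deviation(reference: list[str],
                            comparisons: list[list[str]],
                            distance, d0: float) -> float:
    n, M = len(reference), len(comparisons)
    total = 0.0
    for comp in comparisons:
        exceed = sum(1 for x, y in zip(reference, comp) if distance(x, y) > d0)
        total += exceed / n
    return total / M
```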
S500: when the semantic deviation coefficient of the characters is larger than or equal to a first deviation coefficient threshold value, a deviation character set is obtained;
In one embodiment, the first deviation coefficient threshold is the minimum deviation coefficient, set by those skilled in the art, at which the target user's speech recording analysis is deemed not to pass. Whether the text semantic deviation coefficient is greater than or equal to the first deviation coefficient threshold is judged; if so, the target user's speech recording analysis does not pass, and the recognition characters that fail the comparison are added to the deviation character set for deeper analysis. If not, the target user's speech recording analysis passes, and the text semantic deviation coefficient is sent to the management terminal.
Optionally, the deviation character set is the set of characters in the first recognition character sequence, obtained from the target user's speech recording analysis, that deviate strongly from the reference character sequence.
S600: carrying out semantic deviation analysis on the reference pinyin sequence and the comparison pinyin sequence by taking the deviation text set as constraint to obtain a pinyin semantic deviation coefficient;
S700: and sending the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal.
In the embodiment of the application, the deviation character set is used as a constraint for matching against the comparison pinyin sequence; for the successfully matched pinyin, semantic deviation analysis is performed on the reference pinyin sequence and the comparison pinyin sequence under the same calculation principle as the text semantic deviation coefficient, obtaining the pinyin semantic deviation coefficient. The pinyin semantic deviation coefficient describes, from the pinyin perspective, the degree of semantic deviation reflected in the target user's speech recording.
Further, the text semantic deviation coefficient and the pinyin semantic deviation coefficient are sent to the management terminal, the terminal that collects and stores the results of the target user's speech recording analysis. The technical effect of reliably analyzing the speech recordings of people with language disorders from multiple dimensions is thereby achieved.
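One possible reading of "the deviation character set as constraint" is that only the positions whose characters fell into the deviation set are compared at the pinyin level; the following sketch implements that reading, which is an assumption rather than the patent's stated procedure.

```python
# A sketch of S600 under the assumed reading: restrict the comparison to the
# positions of characters in the deviation set, then apply the same deviation
# calculation to their pinyin.
def pinyin_semantic_deviation(ref_chars: list[str], ref_pinyin: list[str],
                              comp_pinyin_seqs: list[list[str]],
                              deviation_set: set[str],
                              distance, d0: float) -> float:
    idx = [i for i, c in enumerate(ref_chars) if c in deviation_set]
    total = 0.0
    for seq in comp_pinyin_seqs:
        pairs = [(ref_pinyin[i], seq[i]) for i in idx if i < len(seq)]
        exceed = sum(1 for r, c in pairs if distance(r, c) > d0)
        total += exceed / max(len(idx), 1)
    return total / len(comp_pinyin_seqs)
```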
Further, as shown in fig. 3, step S700 of the embodiment of the present application further includes:
The voice template information also comprises a reference interval duration sequence of the reference character sequence;
The voice recognition model is provided with a timing component, and the comparison text sequence is provided with a comparison interval duration sequence;
comparing the reference interval duration sequence with the comparison interval duration sequence to obtain an interval duration anomaly coefficient;
and sending the interval duration anomaly coefficient to a management terminal.
In an embodiment of the present application, the voice template information further includes a reference interval duration sequence of the reference character sequence, which records, following the character order of the reference character sequence, the interval duration between every two adjacent characters. For example, when the reference character sequence of the voice template information is read at standard speed, the interval duration between two characters is 0.2 seconds; when a comma lies between two characters, a pause occurs during reading and the interval duration is 0.3 seconds.
In one embodiment, the timing component is embedded in the voice recognition model and times the interval between every two adjacent characters in the target user's speech recording, yielding the comparison interval duration sequence of the comparison text sequence. The comparison interval duration sequence is compared against the reference interval duration sequence, and the ratio of the number of interval durations that fail the comparison to the total number of interval durations in the reference interval duration sequence gives the interval duration anomaly coefficient. The interval duration anomaly coefficient reflects the degree of abnormality of the inter-character interval durations in the target user's speech recording: the larger the coefficient, the more severe the target user's pausing abnormality. The interval duration anomaly coefficient is sent to the management terminal.
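A minimal sketch of this comparison follows; the matching tolerance is an assumed parameter, since the text does not specify how an individual interval duration is judged to fail the comparison.

```python
def interval_anomaly_coefficient(reference_intervals: list[float],
                                 comparison_intervals: list[float],
                                 tolerance: float = 0.05) -> float:
    """Ratio of failed interval comparisons to the total number of reference
    intervals; `tolerance` (seconds) is an assumed matching criterion."""
    failed = sum(1 for r, c in zip(reference_intervals, comparison_intervals)
                 if abs(r - c) > tolerance)
    # intervals missing from the recording also count as failed comparisons
    failed += max(0, len(reference_intervals) - len(comparison_intervals))
    return failed / len(reference_intervals)

# e.g. standard pacing of 0.2 s, 0.3 s after a comma, as in the example above
print(interval_anomaly_coefficient([0.2, 0.3, 0.2], [0.25, 0.9, 0.2]))  # 1/3
```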
Further, step S700 of the embodiment of the present application further includes:
the voice recognition model further comprises a nasal classification channel and/or an accent classification channel, wherein the nasal classification channel marks a nasal trigger tag for the comparison text sequence, and the accent classification channel marks an accent trigger tag for the comparison text sequence;
judging whether the comparison text sequence carries the nasal sound trigger tag and/or the accent trigger tag;
if so, generating a nasal sound trigger signal and/or an accent trigger signal and sending the generated signal to the management terminal.
In one embodiment of the application, the nasal sound classification channel intelligently analyzes whether nasal sound exists in the target user's speech recording and marks the comparison text sequences that contain nasal sound with the nasal sound trigger tag. The accent classification channel intelligently classifies whether accent exists in the target user's speech recording and marks the comparison text sequences that contain accent with the accent trigger tag.
Optionally, whether the comparison text sequence carries the nasal sound trigger tag, the accent trigger tag, or both is judged; if so, a nasal sound trigger signal, an accent trigger signal, or both are generated, indicating that nasal sound and/or accent exist in the comparison text sequence, and the generated trigger signals are sent to the management terminal.
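The dispatch logic can be sketched as follows; `send_to_terminal` is a placeholder for the actual transport to the management terminal.

```python
def dispatch_trigger_signals(tags: set[str], send_to_terminal) -> None:
    """Generate and forward trigger signals for whichever tags are present."""
    if "nasal" in tags:
        send_to_terminal("nasal_trigger_signal")
    if "accent" in tags:
        send_to_terminal("accent_trigger_signal")

dispatch_trigger_signals({"nasal"}, print)  # -> nasal_trigger_signal
```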
Further, the voice recognition model construction step includes:
configuring a voice recognition network topology, a nasal classification network topology and an accent classification network topology;
Configuring output rules: marking the output of the voice recognition network topology with the nasal sound trigger tag when the output of the nasal sound classification network topology equals 1, and marking the output of the voice recognition network topology with the accent trigger tag when the output of the accent classification network topology equals 1;
Collecting a training recording data set, a recognition character label data set, a nasal sound label data set, and an accent label data set, wherein a training recording that contains nasal sound and whose proportion of nasal-sound characters is greater than a preset proportion is labeled 1, and otherwise labeled 0; and a training recording that contains accent and whose proportion of accented characters is greater than the preset proportion is labeled 1, and otherwise labeled 0;
training the voice recognition network topology with the recognition character label data set as supervision and the training recording data set as input to obtain the voice recognition main channel; training the nasal sound classification network topology with the nasal sound label data set as supervision data and the training recording data set as input to obtain the nasal sound classification channel; and training the accent classification network topology with the accent label data set as supervision data and the training recording data set as input to obtain the accent classification channel;
and setting the voice recognition main channel, the nasal sound classification channel, and the accent classification channel as parallel channels that share the input information, configuring the output fully connected layer based on the output rules, and generating the voice recognition model.
In one possible embodiment, the voice recognition model is a functional model that intelligently analyzes the target user's speech recording from three dimensions: speech recognition, nasal sound classification, and accent classification. The voice recognition network topology, nasal sound classification network topology, and accent classification network topology are the basic frameworks of the speech recognition main channel, the nasal sound classification channel, and the accent classification channel respectively. Optionally, the output rule is: when the output of the nasal sound classification network topology equals 1, the output of the voice recognition network topology is marked with the nasal sound trigger tag; when the output of the accent classification network topology equals 1, the output is marked with the accent trigger tag. The nasal sound trigger tag indicates that nasal sound exists in the comparison text sequence, and the accent trigger tag indicates that accent exists in it.
Optionally, a training recording data set, a recognition character label data set, a nasal sound label data set, and an accent label data set are collected; a training recording that contains nasal sound and whose proportion of nasal-sound characters exceeds a preset proportion is labeled 1, and otherwise 0, and a training recording that contains accent and whose proportion of accented characters exceeds the preset proportion is labeled 1, and otherwise 0. The preset proportion is set by those skilled in the art and may be, for example, 70%.
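The labeling rule can be sketched as follows, assuming the 70% value mentioned above as the preset proportion.

```python
PRESET_RATIO = 0.70  # assumed preset proportion, per the example above

def binary_label(affected_char_count: int, total_char_count: int) -> int:
    """Label 1 when the proportion of nasal (or accented) characters exceeds
    the preset proportion, otherwise 0."""
    return 1 if affected_char_count / total_char_count > PRESET_RATIO else 0

print(binary_label(8, 10))  # 0.8 > 0.7 -> 1
print(binary_label(3, 10))  # 0.3      -> 0
```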
The voice recognition network topology is trained with the recognition character label data set as supervision and the training recording data set as input, with supervised training continuing until the topology converges, yielding the speech recognition main channel. Optionally, the voice recognition network topology is constructed based on a convolutional neural network. The nasal sound classification network topology is trained, with the nasal sound label data set as supervision data and the training recording data set as input, until its output converges, yielding the nasal sound classification channel. Further, the accent classification network topology, also constructed based on a convolutional neural network, is trained with the accent label data set as supervision data and the training recording data set as input until its output converges, yielding the accent classification channel.
Optionally, the speech recognition main channel, the nasal sound classification channel, and the accent classification channel are set as parallel channels that share the input information, and the output fully connected layer is configured based on the output rules to generate the voice recognition model. This achieves the technical effect of improving the operating efficiency of the voice recognition model.
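A sketch of such a three-channel model is given below in PyTorch. The convolutional layer sizes and pooling are assumptions; the text specifies only convolutional-neural-network topologies, parallel channels sharing one input, and a fully connected output layer configured by the output rules.

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    """Three parallel CNN channels over shared input features (assumed shapes)."""
    def __init__(self, feat_dim: int = 80, vocab_size: int = 4000):
        super().__init__()
        def conv_stack(out_dim: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(128, out_dim, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.recognition_channel = conv_stack(256)  # speech recognition main channel
        self.nasal_channel = conv_stack(64)         # nasal sound classification channel
        self.accent_channel = conv_stack(64)        # accent classification channel
        self.fc_out = nn.Linear(256, vocab_size)    # output fully connected layer
        self.fc_nasal = nn.Linear(64, 1)
        self.fc_accent = nn.Linear(64, 1)

    def forward(self, features: torch.Tensor):
        # features: (batch, feat_dim, time); batch size 1 assumed for the tags
        text_logits = self.fc_out(self.recognition_channel(features))
        nasal = torch.sigmoid(self.fc_nasal(self.nasal_channel(features)))
        accent = torch.sigmoid(self.fc_accent(self.accent_channel(features)))
        # output rule: tag the recognition output when a channel outputs 1
        tags = {"nasal": bool(nasal.round().item()),
                "accent": bool(accent.round().item())}
        return text_logits, tags
```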
In summary, the embodiment of the application has at least the following technical effects:
The application obtains voice template information, providing a recording template for the target user; then receives the target user's speech recording of the voice template information and intelligently analyzes the recording through a voice recognition model to obtain a comparison text sequence, improving the quality of the comparison characters; further processes the comparison text sequence through a pinyin matching table to obtain a comparison pinyin sequence; performs semantic deviation analysis on the reference character sequence and the comparison text sequence to obtain a text semantic deviation coefficient; when the text semantic deviation coefficient is greater than or equal to a first deviation coefficient threshold, obtains a deviation character set; at this point performs semantic deviation analysis on the reference pinyin sequence and the comparison pinyin sequence with the deviation character set as a constraint to obtain a pinyin semantic deviation coefficient; and then sends the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal. The technical effect of analyzing the speech recording from multiple dimensions, comprehensively determining the language ability of people with language disorders, and improving the reliability of analysis results is thereby achieved.
Embodiment two:
Based on the same inventive concept as the multimodal speech recording analysis method for people with language disorders in the foregoing embodiment, as shown in FIG. 4, the application provides a multimodal speech recording analysis system for people with language disorders; the system and method embodiments of the application are based on the same inventive concept. The system comprises:
a voice template information obtaining module 11, configured to obtain voice template information, where the voice template information includes a reference character sequence and a reference pinyin sequence, and the reference character sequence has no repeated characters;
The comparison text sequence obtaining module 12 is configured to receive a speech recording of the target user on the voice template information, and parse the speech recording through a voice recognition model to obtain a comparison text sequence;
the comparison pinyin sequence obtaining module 13 is configured to process the comparison text sequence through a pinyin matching table to obtain a comparison pinyin sequence;
A semantic deviation coefficient obtaining module 14, configured to perform semantic deviation analysis on the reference text sequence and the comparison text sequence, to obtain a text semantic deviation coefficient;
The deviation text set obtaining module 15 is configured to obtain a deviation text set when the text semantic deviation coefficient is greater than or equal to a first deviation coefficient threshold;
the pinyin semantic deviation coefficient obtaining module 16 is configured to perform semantic deviation analysis on the reference pinyin sequence and the comparison pinyin sequence with the deviation text set as a constraint, so as to obtain a pinyin semantic deviation coefficient;
And the deviation coefficient sending module 17 is used for sending the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal.
Further, the deviation coefficient sending module 17 is configured to perform the following steps:
The voice template information also comprises a reference interval duration sequence of the reference character sequence;
The voice recognition model is provided with a timing component, and the comparison text sequence is provided with a comparison interval duration sequence;
comparing the reference interval duration sequence with the comparison interval duration sequence to obtain an interval duration anomaly coefficient;
and sending the interval duration anomaly coefficient to a management terminal.
Further, the deviation coefficient sending module 17 is configured to perform the following steps:
the voice recognition model further comprises a nasal classification channel and/or an accent classification channel, wherein the nasal classification channel marks a nasal trigger tag for the comparison text sequence, and the accent classification channel marks an accent trigger tag for the comparison text sequence;
judging whether the comparison text sequence carries the nasal sound trigger tag and/or the accent trigger tag;
if so, generating a nasal sound trigger signal and/or an accent trigger signal and sending the generated signal to the management terminal.
Further, the deviation coefficient sending module 17 is configured to perform the following steps:
configuring a voice recognition network topology, a nasal classification network topology and an accent classification network topology;
Configuring output rules: marking the output of the voice recognition network topology with the nasal sound trigger tag when the output of the nasal sound classification network topology equals 1, and marking the output of the voice recognition network topology with the accent trigger tag when the output of the accent classification network topology equals 1;
Collecting a training recording data set, a recognition character label data set, a nasal sound label data set, and an accent label data set, wherein a training recording that contains nasal sound and whose proportion of nasal-sound characters is greater than a preset proportion is labeled 1, and otherwise labeled 0; and a training recording that contains accent and whose proportion of accented characters is greater than the preset proportion is labeled 1, and otherwise labeled 0;
training the voice recognition network topology with the recognition character label data set as supervision and the training recording data set as input to obtain the voice recognition main channel; training the nasal sound classification network topology with the nasal sound label data set as supervision data and the training recording data set as input to obtain the nasal sound classification channel; and training the accent classification network topology with the accent label data set as supervision data and the training recording data set as input to obtain the accent classification channel;
and setting the voice recognition main channel, the nasal sound classification channel, and the accent classification channel as parallel channels that share the input information, configuring the output fully connected layer based on the output rules, and generating the voice recognition model.
Further, the comparison text sequence obtaining module 12 is configured to execute the following steps:
Receiving speech sound recording of a target user on the voice template information, and analyzing the speech sound recording through a voice recognition model to obtain a first recognition text sequence;
and performing repeated-character store-and-delete processing on the first recognition character sequence to obtain the comparison text sequence.
Further, the comparison text sequence obtaining module 12 is configured to execute the following steps:
randomly extracting a first initial recognition character of the first recognition character sequence, and deleting the first initial recognition character from the first recognition character sequence to obtain a second recognition character sequence;
traversing the second recognition character sequence based on the first initial recognition character to perform semantic distance analysis, obtaining a plurality of semantic distance feature values;
extracting the feature recognition character set whose semantic distance feature values are equal to 0;
traversing the first initial recognition character and the feature recognition character set, alternately retaining and deleting each to obtain a plurality of recognition character sequence combinations, and setting the plurality of recognition character sequence combinations as the comparison text sequence.
Further, the semantic deviation coefficient obtaining module 14 is configured to perform the following steps:
Constructing a text semantic deviation coefficient evaluation function:

$$P=\frac{1}{M}\sum_{k=1}^{M}\frac{1}{n}\sum_{i=1}^{n_{k}}\mathbb{1}\!\left[d\!\left(x_{i},y_{i}^{k}\right)>d_{0}\right]$$

wherein $P$ represents the text semantic deviation coefficient, $x_{i}$ represents the i-th reference character of the reference character sequence, $y_{i}^{k}$ represents the i-th comparison character of the k-th comparison character sequence, $d(x_{i},y_{i}^{k})$ represents the semantic distance between the i-th reference character and the i-th comparison character, $d_{0}$ represents the distance threshold, $n$ represents the number of characters of the reference character sequence, $n_{k}$ represents the number of characters of the k-th comparison character sequence, and $M$ represents the total number of comparison character sequences;
And traversing the reference character sequence and the comparison character sequence to perform semantic deviation calculation according to the character semantic deviation coefficient evaluation function, and obtaining the character semantic deviation coefficient.
It should be noted that the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.
The specification and figures are merely exemplary illustrations of the present application and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, the present application is intended to include such modifications and alterations insofar as they come within the scope of the application or the equivalents thereof.

Claims (5)

1. A multimodal speech recording analysis method for people with language disorders, characterized in that the method comprises the following steps:
obtaining voice template information, wherein the voice template information comprises a reference character sequence and a reference pinyin sequence, and the reference character sequence has no repeated characters;
receiving a speech recording of the voice template information made by a target user, and parsing the speech recording through a voice recognition model to obtain a comparison text sequence;
processing the comparison text sequence through a pinyin matching table to obtain a comparison pinyin sequence;
carrying out semantic deviation analysis on the reference character sequence and the comparison text sequence to obtain a text semantic deviation coefficient;
when the text semantic deviation coefficient is greater than or equal to a first deviation coefficient threshold, obtaining a deviation character set;
carrying out semantic deviation analysis on the reference pinyin sequence and the comparison pinyin sequence with the deviation character set as a constraint to obtain a pinyin semantic deviation coefficient;
and sending the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal;
wherein carrying out semantic deviation analysis on the reference character sequence and the comparison text sequence to obtain the text semantic deviation coefficient comprises the following steps:
Constructing a text semantic deviation coefficient evaluation function:

$$P=\frac{1}{M}\sum_{k=1}^{M}\frac{1}{n}\sum_{i=1}^{n_{k}}\mathbb{1}\!\left[d\!\left(x_{i},y_{i}^{k}\right)>d_{0}\right]$$

wherein $P$ represents the text semantic deviation coefficient, $x_{i}$ represents the i-th reference character of the reference character sequence, $y_{i}^{k}$ represents the i-th comparison character of the k-th comparison character sequence, $d(x_{i},y_{i}^{k})$ represents the semantic distance between the i-th reference character and the i-th comparison character, $d_{0}$ represents the distance threshold, $n$ represents the number of characters of the reference character sequence, $n_{k}$ represents the number of characters of the k-th comparison character sequence, and $M$ represents the total number of comparison character sequences;
And traversing the reference character sequence and the comparison character sequence to perform semantic deviation calculation according to the character semantic deviation coefficient evaluation function, and obtaining the character semantic deviation coefficient.
2. The method as recited in claim 1, further comprising:
The voice template information also comprises a reference interval duration sequence of the reference character sequence;
The voice recognition model is provided with a timing component, and the comparison text sequence is provided with a comparison interval duration sequence;
comparing the reference interval duration sequence with the comparison interval duration sequence to obtain an interval duration anomaly coefficient;
and sending the interval duration anomaly coefficient to a management terminal.
3. The method as recited in claim 1, further comprising:
the voice recognition model further comprises a nasal classification channel and/or an accent classification channel, wherein the nasal classification channel marks a nasal trigger tag for the comparison text sequence, and the accent classification channel marks an accent trigger tag for the comparison text sequence;
judging whether the comparison text sequence carries the nasal sound trigger tag and/or the accent trigger tag;
if so, generating a nasal sound trigger signal and/or an accent trigger signal and sending the generated signal to the management terminal.
4. The method of claim 3, wherein the speech recognition model building step comprises:
configuring a voice recognition network topology, a nasal classification network topology and an accent classification network topology;
Configuring output rules: marking the output of the voice recognition network topology with the nasal sound trigger tag when the output of the nasal sound classification network topology equals 1, and marking the output of the voice recognition network topology with the accent trigger tag when the output of the accent classification network topology equals 1;
Collecting a training recording data set, a recognition character label data set, a nasal sound label data set, and an accent label data set, wherein a training recording that contains nasal sound and whose proportion of nasal-sound characters is greater than a preset proportion is labeled 1, and otherwise labeled 0; and a training recording that contains accent and whose proportion of accented characters is greater than the preset proportion is labeled 1, and otherwise labeled 0;
training the voice recognition network topology with the recognition character label data set as supervision and the training recording data set as input to obtain the voice recognition main channel; training the nasal sound classification network topology with the nasal sound label data set as supervision data and the training recording data set as input to obtain the nasal sound classification channel; and training the accent classification network topology with the accent label data set as supervision data and the training recording data set as input to obtain the accent classification channel;
and setting the voice recognition main channel, the nasal sound classification channel, and the accent classification channel as parallel channels that share the input information, configuring the output fully connected layer based on the output rules, and generating the voice recognition model.
5. A multimodal speech recording analysis system for people with language disorders, characterized in that the system comprises:
The voice template information acquisition module is used for acquiring voice template information, wherein the voice template information comprises a reference character sequence and a reference pinyin sequence, and the reference character sequence has no repeated characters;
The comparison text sequence obtaining module is used for receiving the speech sound record of the target user on the voice template information, analyzing the speech sound record through a voice recognition model and obtaining a comparison text sequence;
The comparison pinyin sequence obtaining module is used for processing the comparison text sequence through the pinyin matching table to obtain a comparison pinyin sequence;
The semantic deviation coefficient obtaining module is used for carrying out semantic deviation analysis on the reference character sequence and the comparison character sequence to obtain character semantic deviation coefficients;
The deviation text set obtaining module is used for obtaining a deviation text set when the text semantic deviation coefficient is greater than or equal to a first deviation coefficient threshold value;
The Pinyin semantic deviation coefficient obtaining module is used for carrying out semantic deviation analysis on the reference Pinyin sequence and the comparison Pinyin sequence by taking the deviation character set as constraint to obtain a Pinyin semantic deviation coefficient;
The deviation coefficient sending module is used for sending the text semantic deviation coefficient and the pinyin semantic deviation coefficient to a management terminal;
The semantic deviation coefficient obtaining module is further configured to perform the following steps:
Constructing a text semantic deviation coefficient evaluation function:

$$\sigma_{w}=\frac{1}{M}\sum_{k=1}^{M}\frac{\sum_{i=1}^{\min\left(N,N_{k}\right)}\mathbf{1}\left[d\left(a_{i},b_{k,i}\right)>d_{0}\right]+\left|N-N_{k}\right|}{N}$$

wherein $\sigma_{w}$ represents the text semantic deviation coefficient; $a_{i}$ represents the $i$-th reference character of the reference character sequence; $b_{k,i}$ represents the $i$-th comparison character of the $k$-th comparison character sequence; $d\left(a_{i},b_{k,i}\right)$ represents the semantic distance between the $i$-th reference character and the $i$-th comparison character; $d_{0}$ represents the distance threshold; $\mathbf{1}\left[\cdot\right]$ is the indicator function; $N$ represents the number of characters of the reference character sequence; $N_{k}$ represents the number of characters of the $k$-th comparison character sequence; and $M$ represents the total number of comparison character sequences;
And traversing the reference character sequence and the comparison character sequences to perform semantic deviation calculation according to the text semantic deviation coefficient evaluation function, obtaining the text semantic deviation coefficient.
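To make the evaluation concrete, the sketch below computes the coefficient in the reconstructed form above: position-wise character pairs whose semantic distance exceeds the threshold count as deviations, missing or extra characters are penalized, and the result is averaged over the M comparison sequences. The 0/1 toy distance is a stand-in; the patent does not specify how the semantic distance between characters is computed.

```python
def text_semantic_deviation(reference, comparisons, distance, d0):
    """Compute the text semantic deviation coefficient sketched above.

    reference:    reference character sequence (string)
    comparisons:  list of M comparison character sequences (strings)
    distance:     callable giving the semantic distance between two characters
    d0:           distance threshold above which characters count as deviating
    """
    N, M = len(reference), len(comparisons)
    total = 0.0
    for comp in comparisons:
        Nk = len(comp)
        # Count position-wise deviations over the overlapping prefix.
        deviations = sum(
            1 for a, b in zip(reference, comp) if distance(a, b) > d0
        )
        # Missing or extra characters also count as deviations (assumed).
        total += (deviations + abs(N - Nk)) / N
    return total / M


# Toy usage with a hypothetical 0/1 distance (identical characters -> 0).
coeff = text_semantic_deviation(
    "天气很好", ["天气很好", "天汽很"], lambda a, b: 0.0 if a == b else 1.0, 0.5
)
print(round(coeff, 3))  # 0.25: one substituted + one missing character over M=2
```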
CN202410254551.4A 2024-03-06 2024-03-06 Multi-mode-based language barrier crowd speech recording analysis method and system Active CN117831573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410254551.4A CN117831573B (en) 2024-03-06 2024-03-06 Multi-mode-based language barrier crowd speech recording analysis method and system


Publications (2)

Publication Number Publication Date
CN117831573A 2024-04-05
CN117831573B (en) 2024-05-14

Family

ID=90524482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410254551.4A Active CN117831573B (en) 2024-03-06 2024-03-06 Multi-mode-based language barrier crowd speech recording analysis method and system

Country Status (1)

Country Link
CN (1) CN117831573B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103680503A (en) * 2012-08-31 2014-03-26 中瀚国际语识有限公司 Semantic identification method
WO2015139497A1 (en) * 2014-03-19 2015-09-24 北京奇虎科技有限公司 Method and apparatus for determining similar characters in search engine
CN105808197A (en) * 2014-12-30 2016-07-27 联想(北京)有限公司 Information processing method and electronic device
CN110164435A (en) * 2019-04-26 2019-08-23 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN110534100A (en) * 2019-08-27 2019-12-03 北京海天瑞声科技股份有限公司 A kind of Chinese speech proofreading method and device based on speech recognition
CN112509566A (en) * 2020-12-22 2021-03-16 北京百度网讯科技有限公司 Voice recognition method, device, equipment, storage medium and program product
JP6858913B1 (en) * 2020-09-24 2021-04-14 株式会社ビジョナリスト Foreign language learning equipment, foreign language learning systems, foreign language learning methods, programs, and recording media
WO2022121251A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Method and apparatus for training text processing model, computer device and storage medium
CN114822545A (en) * 2022-03-25 2022-07-29 华南理工大学 Method for improving speech recognition rate in professional field
WO2023035525A1 (en) * 2021-09-10 2023-03-16 平安科技(深圳)有限公司 Speech recognition error correction method and system, and apparatus and storage medium
CN117238276A (en) * 2023-11-10 2023-12-15 深圳市托普思维商业服务有限公司 Analysis correction system based on intelligent voice data recognition
CN117499528A (en) * 2023-05-18 2024-02-02 马上消费金融股份有限公司 Method, device, equipment and storage medium for detecting session quality

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Phonetic entry error correction method and device based on artificial intelligence
CN108615526B (en) * 2018-05-08 2020-07-07 腾讯科技(深圳)有限公司 Method, device, terminal and storage medium for detecting keywords in voice signal
CN111860506B (en) * 2020-07-24 2024-03-29 北京百度网讯科技有限公司 Method and device for recognizing characters

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Zhenhua; Gao Ruize. An improved Chinese string matching algorithm for smart home scenarios. Journal of Nanchang Hangkong University (Natural Sciences), 2018, (02), full text. *


Similar Documents

Publication Publication Date Title
CN110136727B (en) Speaker identification method, device and storage medium based on speaking content
CN108074576B (en) Speaker role separation method and system under interrogation scene
Clemins et al. Automatic classification and speaker identification of African elephant (Loxodonta africana) vocalizations
CN101076851B (en) Spoken language identification system and method for training and operating the said system
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN110413764B (en) Long text enterprise name recognition method based on pre-built word stock
Jancovic et al. Bird species recognition using unsupervised modeling of individual vocalization elements
CN109544104A (en) A kind of recruitment data processing method and device
CN109192194A (en) Voice data mask method, device, computer equipment and storage medium
CN108877769B (en) Method and device for identifying dialect type
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN112151014A (en) Method, device and equipment for evaluating voice recognition result and storage medium
CN112468659A (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
CN112818742A (en) Expression ability dimension evaluation method and device for intelligent interview
KR20230129094A (en) Method and Apparatus for Emotion Recognition in Real-Time Based on Multimodal
Xia et al. Confidence based acoustic event detection
JP4717872B2 (en) Speaker information acquisition system and method using voice feature information of speaker
US20220157322A1 (en) Metadata-based diarization of teleconferences
Anidjar et al. Hybrid speech and text analysis methods for speaker change detection
CN116150651A (en) AI-based depth synthesis detection method and system
CN117831573B (en) Multi-mode-based language barrier crowd speech recording analysis method and system
CN101266793B (en) Device and method for reducing recognition error via context relation in dialog bouts
CN112734604A (en) Device for providing multi-mode intelligent case report and record generation method thereof
Bancroft et al. Exploring the intersection between speaker verification and emotion recognition
US20230402030A1 (en) Embedded Dictation Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant