CN117334186B - Speech recognition method and NLP platform based on machine learning - Google Patents


Info

Publication number
CN117334186B
Authority
CN
China
Prior art keywords
key information
sub
centroids
centroid
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311337518.XA
Other languages
Chinese (zh)
Other versions
CN117334186A (en
Inventor
高辉杰
庄志远
孙岚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhicheng Pengzhan Technology Co ltd
Original Assignee
Beijing Zhicheng Pengzhan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhicheng Pengzhan Technology Co ltd
Priority to CN202311337518.XA
Publication of CN117334186A
Application granted
Publication of CN117334186B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a machine learning-based speech recognition method and an NLP platform. Key information in a voice signal is recognized by a target speech recognition algorithm that has been debugged in advance through machine learning, which improves the accuracy of extracting key information from speech. During debugging, a plurality of basic sub-key information centroids are preset for the sample semantic description set under the same key information semantic label, so that the sample semantic descriptions under one label are divided among several sub-categories, and the basic sub-key information centroid corresponding to each sample semantic description in the set is determined. The basic sub-key information centroids are then adjusted based on the similarity scores between the sample semantic descriptions and their corresponding basic sub-key information centroids. The method requires relatively few debugging samples, improves debugging efficiency while maintaining the recognition accuracy of the algorithm, saves computation, and reduces cost.

Description

Speech recognition method and NLP platform based on machine learning
Technical Field
The disclosure relates to the field of data processing, in particular to a machine learning-based voice recognition method and an NLP platform.
Background
Speech recognition is a technique that converts spoken language into intelligible text or commands. It involves a number of key steps, of which recognizing the key information in a speech signal is an important one. In the prior art, key information is identified by a speech recognition algorithm. When that algorithm is debugged, speech segments from a large number of speech signal samples whose speech features belong to the same information content are clustered, which helps the algorithm generalize across different expressions of the same information content. However, relying on large-scale samples reduces the debugging efficiency of the algorithm. In addition, preparing and using massive training samples inevitably introduces mislabeled samples, which bias the debugging result so that the expected accuracy cannot be met. This is the problem to be solved.
Disclosure of Invention
The invention aims to provide a machine learning-based voice recognition method and an NLP platform so as to solve the problems.
Embodiments of the present disclosure are implemented as follows:
In a first aspect, an embodiment of the present disclosure provides a machine learning-based speech recognition method applied to an NLP platform, the method including:
Acquiring a voice signal to be recognized, and inputting the voice signal to be recognized into a target voice recognition algorithm, wherein the target voice recognition algorithm is obtained by debugging in advance based on a voice sample;
The target voice recognition algorithm is used for recognizing the voice signal to be recognized, so that a key information recognition result in the voice signal to be recognized is obtained;
The target voice recognition algorithm comprises the following steps when in debugging:
Performing semantic description mining on a plurality of voice learning samples annotated with the same key information semantic tags in a voice learning sample set based on a voice recognition algorithm to be debugged to obtain a sample semantic description set under the same key information semantic tags, wherein the voice learning sample set comprises voice learning samples annotated with the plurality of key information semantic tags;
presetting a plurality of basic sub-key information centroids for the sample semantic description set, and determining basic sub-key information centroids respectively corresponding to a plurality of sample semantic descriptions in the sample semantic description set;
Adjusting the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the basic sub-key information centroids respectively corresponding to the plurality of sample semantic descriptions to obtain one or more target sub-key information centroids under the same key information semantic label;
And determining algorithm cost based on target sub-key information centroids under a plurality of key information semantic marks and sample semantic descriptions belonging to each target sub-key information centroid, and debugging the voice recognition algorithm to be debugged based on the algorithm cost.
As an implementation manner, the adjusting the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the basic sub-key information centroids respectively corresponding to the plurality of sample semantic descriptions to obtain one or more target sub-key information centroids under the same key information semantic label includes:
based on similarity scores between the sample semantic descriptions and basic sub-key information centroids corresponding to the sample semantic descriptions, evaluating whether isolated semantic descriptions exist in the sample semantic descriptions, wherein the similarity scores between the isolated semantic descriptions and the basic sub-key information centroids are smaller than a first preset score;
When the plurality of sample semantic descriptions have isolated semantic descriptions, based on the isolated semantic descriptions, newly adding basic sub-key information centroids, and adjusting the isolated semantic descriptions to belong to the newly added basic sub-key information centroids; the target sub-key information centroid comprises a newly added basic sub-key information centroid and a plurality of preset basic sub-key information centroids.
As one embodiment, the evaluating whether the plurality of sample semantic descriptions have isolated semantic descriptions based on similarity scores between the plurality of sample semantic descriptions and the respective corresponding basic sub-key information centroids includes:
For any basic sub-key information centroid, determining a first preset score corresponding to the basic sub-key information centroid based on the mean of the similarity scores between the basic sub-key information centroid and the sample semantic descriptions belonging to it, together with the dispersion coefficient of those similarity scores;
evaluating whether the sample semantic descriptions belonging to the basic sub-key information centroids have the sample semantic descriptions with similarity scores smaller than the first preset scores or not;
And determining the sample semantic descriptions with similarity scores smaller than the first preset scores with the basic sub-key information centroid as isolated semantic descriptions not close to the basic sub-key information centroid.
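The isolated-description check described above can be illustrated with a minimal sketch. The disclosure states only that the first preset score is derived from the mean and a dispersion measure of the similarity scores; the concrete rule used here (mean minus k times the population standard deviation), the function name find_isolated, and the factor k are assumptions made for illustration.

```python
import statistics

def find_isolated(similarities, k=1.5):
    """Flag member descriptions that sit far from their centroid.

    similarities: similarity scores between one basic sub-key information
    centroid and the sample semantic descriptions assigned to it.
    The threshold rule (mean - k * std) is an assumed, illustrative choice.
    """
    mean = statistics.fmean(similarities)
    spread = statistics.pstdev(similarities)
    threshold = mean - k * spread  # stands in for the "first preset score"
    isolated = [i for i, s in enumerate(similarities) if s < threshold]
    return threshold, isolated

# Four descriptions close to the centroid and one outlier.
scores = [0.92, 0.90, 0.88, 0.91, 0.35]
threshold, isolated = find_isolated(scores)
print(threshold, isolated)
```

In this sketch the description at index 4 falls below the threshold and would seed a newly added basic sub-key information centroid.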
As an implementation manner, the adjusting the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the basic sub-key information centroids respectively corresponding to the plurality of sample semantic descriptions to obtain one or more target sub-key information centroids under the same key information semantic label includes:
Based on similarity scores between the sample semantic descriptions and the corresponding basic sub-key information centroids, evaluating whether the basic sub-key information centroids have special sub-key information centroids or not, wherein an average result of the similarity scores between the sample semantic descriptions covered by the special sub-key information centroids and the special sub-key information centroids is not more than a second preset score;
When the plurality of basic sub-key information centroids include a special sub-key information centroid, deleting the special sub-key information centroid and the sample semantic descriptions belonging to it, and obtaining the remaining basic sub-key information centroids and the sample semantic descriptions belonging to the remaining basic sub-key information centroids; wherein the target sub-key information centroids include the remaining basic sub-key information centroids.
As one embodiment, the evaluating whether the plurality of basic sub-key information centroids include a special sub-key information centroid based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids includes:
For any basic sub-key information centroid, evaluating whether an average result of similarity scores between the basic sub-key information centroid and sample semantic descriptions belonging to the basic sub-key information centroid is not greater than a second preset score;
and when the average result of similarity scores between the basic sub-key information centroid and the sample semantic descriptions belonging to the basic sub-key information centroid is not more than the second preset score, determining that the basic sub-key information centroid is a special sub-key information centroid.
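A minimal sketch of the deletion step described above follows. The function name drop_special_centroids, the dictionary layout, and the example threshold 0.5 are illustrative assumptions; the disclosure does not fix a value for the second preset score. Treating a centroid with no members as special is also an assumption of this sketch.

```python
import statistics

def drop_special_centroids(members, second_preset_score):
    """Split centroids into kept and removed ones.

    members: {centroid_id: [similarity scores of its member descriptions]}
    A centroid is "special" when the mean similarity of its members is not
    greater than second_preset_score. Empty centroids are also removed,
    which is an assumption of this sketch.
    """
    kept, removed = {}, []
    for centroid_id, scores in members.items():
        if scores and statistics.fmean(scores) > second_preset_score:
            kept[centroid_id] = scores
        else:
            removed.append(centroid_id)
    return kept, removed

members = {
    "c0": [0.90, 0.85],
    "c1": [0.30, 0.40, 0.20],  # loosely grouped, likely mislabeled samples
    "c2": [0.80],
}
kept, removed = drop_special_centroids(members, second_preset_score=0.5)
print(sorted(kept), removed)
```

The member descriptions of a removed centroid are discarded together with it, which is how mislabeled samples drop out of debugging without a separate denoising pass.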
As an implementation manner, the adjusting the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the basic sub-key information centroids respectively corresponding to the plurality of sample semantic descriptions to obtain one or more target sub-key information centroids under the same key information semantic label includes:
Determining one or more groups of quasi-fusion sub-key information centroids in the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids, wherein each group of quasi-fusion sub-key information centroids comprises a plurality of approximate basic sub-key information centroids;
Fusing each of the one or more groups of quasi-fusion sub-key information centroids to obtain one or more fused sub-key information centroids, and reassigning the sample semantic descriptions that belonged to the quasi-fusion sub-key information centroids to the corresponding fused sub-key information centroid; the target sub-key information centroids include the fused sub-key information centroids.
As one embodiment, the determining one or more groups of quasi-fusion sub-key information centroids in the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids includes:
Determining a third preset score corresponding to each basic sub-key information centroid based on the mean of the similarity scores between that basic sub-key information centroid and the sample semantic descriptions belonging to it, together with the dispersion coefficient of those similarity scores;
For any basic sub-key information centroid, evaluating whether a similarity score between the basic sub-key information centroid and the rest of basic sub-key information centroids in the plurality of basic sub-key information centroids is greater than or equal to a fourth preset score, wherein the fourth preset score is the maximum similarity score in third preset scores corresponding to the basic sub-key information centroid and the rest of basic sub-key information centroids;
Determining that the other basic sub-key information centroid is similar to the basic sub-key information centroid when the similarity score between the two is greater than or equal to the maximum similarity score;
and determining the basic sub-key information centroid and one or more other basic sub-key information centroids similar to the basic sub-key information centroid as a set of quasi-fusion sub-key information centroids.
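The grouping and fusion of similar centroids can be sketched as follows. Several choices here are assumptions made for illustration and are not stated in the disclosure: the third preset score is taken as mean minus k times the population standard deviation, grouping is done greedily, cosine similarity is used as the similarity score, and a fused centroid is the element-wise mean of its group.

```python
import math
import statistics

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def third_preset_score(member_sims, k=1.0):
    # Assumed rule: mean of member similarities minus k standard deviations.
    return statistics.fmean(member_sims) - k * statistics.pstdev(member_sims)

def fusion_groups(centroids, member_sims, k=1.0):
    """Greedily group centroids whose mutual similarity reaches the larger
    of their two third preset scores (the "fourth preset score")."""
    thirds = [third_preset_score(s, k) for s in member_sims]
    groups, used = [], set()
    for i in range(len(centroids)):
        if i in used:
            continue
        group = [i]
        for j in range(i + 1, len(centroids)):
            if j in used:
                continue
            fourth = max(thirds[i], thirds[j])
            if cosine_similarity(centroids[i], centroids[j]) >= fourth:
                group.append(j)
        if len(group) > 1:
            used.update(group)
            groups.append(group)
    return groups

def fuse(centroids, group):
    """Fuse a group into one centroid by element-wise mean (assumed rule)."""
    dim = len(centroids[0])
    return [statistics.fmean([centroids[i][d] for i in group]) for d in range(dim)]

centroids = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
member_sims = [[0.90, 0.95, 0.92], [0.93, 0.90, 0.94], [0.88, 0.90, 0.91]]
groups = fusion_groups(centroids, member_sims)
fused = [fuse(centroids, g) for g in groups]
print(groups, fused)
```

Here the first two centroids point in nearly the same direction and are merged, while the third stays separate.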
As an implementation manner, the determining the algorithm cost based on the target sub-key information centroids under the semantic labels of the plurality of key information and the sample semantic descriptions of the target sub-key information centroids comprises:
Determining a sub-cost corresponding to the sample semantic description based on a similarity score between the sample semantic description and the target sub-key information centroid to which the sample semantic description belongs and a similarity score between the sample semantic description and other target sub-key information centroids, wherein the other target sub-key information centroids comprise target sub-key information centroids which are not greater than a fifth preset score except the target sub-key information centroids to which the sample semantic description belongs in the plurality of target sub-key information centroids under the plurality of key information semantic marks;
And determining the algorithm cost based on the sub-cost corresponding to each sample semantic description belonging to each target sub-key information centroid.
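One possible form of the per-sample cost term and the overall algorithm cost is sketched below. The disclosure does not give a formula, so the softmax cross-entropy form, the temperature parameter, and the averaging over samples are all assumptions. The selection of the other centroids by the fifth preset score is not modeled; the caller is assumed to pass in the similarities to the already selected other centroids.

```python
import math

def sub_cost(own_sim, other_sims, temperature=0.1):
    """Assumed per-sample cost: negative log of the softmax weight that the
    sample's own centroid receives among own and other centroids."""
    logits = [own_sim / temperature] + [s / temperature for s in other_sims]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

def algorithm_cost(samples, temperature=0.1):
    """Average the per-sample costs.

    samples: list of (similarity to own centroid, [similarities to others]).
    """
    costs = [sub_cost(own, others, temperature) for own, others in samples]
    return sum(costs) / len(costs)

well_placed = sub_cost(0.9, [0.1, 0.2])   # close to own centroid
badly_placed = sub_cost(0.4, [0.5, 0.3])  # closer to another centroid
total = algorithm_cost([(0.9, [0.1, 0.2]), (0.4, [0.5, 0.3])])
print(well_placed, badly_placed, total)
```

Under this assumed form, a sample costs little when it is much closer to its own centroid than to the others, so minimizing the cost pulls descriptions toward their own centroid and away from the rest.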
As an implementation manner, the determining the basic sub-key information centroid corresponding to each of the plurality of sample semantic descriptions in the sample semantic description set includes:
For any sample semantic description in the sample semantic description set, obtaining the similarity scores between the sample semantic description and each basic sub-key information centroid;
And determining the basic sub-key information centroid corresponding to the maximum similarity score as the basic sub-key information centroid to which the sample semantic description belongs.
In a second aspect, embodiments of the present disclosure provide an NLP platform, comprising:
one or more processors;
A memory;
One or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, which when executed by the one or more processors, implement the methods described above.
The present disclosure has at least the beneficial effects:
According to the machine learning-based speech recognition method and NLP platform provided by the present disclosure, key information in a voice signal is recognized by a target speech recognition algorithm that has been debugged in advance through machine learning, which improves the accuracy of extracting key information from speech. The plurality of basic sub-key information centroids are adjusted based on the similarity scores between the sample semantic descriptions and their corresponding basic sub-key information centroids, yielding one or more target sub-key information centroids under the same key information semantic label. These centroids fit the actual distribution of the sample semantic descriptions under that label more closely: the sample semantic descriptions under the same label are gathered to more accurate target sub-key information centroids, which reduces the label confusion caused by mislabeled interference information that may exist under different key information semantic labels. The speech recognition algorithm to be debugged is then debugged based on the target sub-key information centroids and the sample semantic descriptions belonging to them. As a result, even when the voice learning sample set may contain mislabeled interference voice signals, the debugged algorithm generalizes better to them, and a well-performing algorithm can be obtained with appropriate parameter adjustment alone. The requirement on the number of debugging samples is low, debugging efficiency is improved while the recognition accuracy of the algorithm is maintained, computation is saved, and cost is reduced.
In the following description, other features will be partially set forth. Upon review of the ensuing disclosure and the accompanying figures, those skilled in the art will in part discover these features or will be able to ascertain them through production or use thereof. The features of the present application may be implemented and obtained by practicing or using the various aspects of the methods, tools, and combinations that are set forth in the detailed examples described below.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments of the present disclosure will be briefly introduced below.
Fig. 1 is a flowchart of a machine learning-based speech recognition method provided in an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a debugging flow of a target speech recognition algorithm provided in an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a functional module architecture of a speech recognition device according to an embodiment of the disclosure.
Fig. 4 is a schematic diagram of the composition of an NLP platform according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described below with reference to the accompanying drawings in the embodiments of the present disclosure. The terminology used in the description of the embodiments of the disclosure is for the purpose of describing particular embodiments of the disclosure only and is not intended to be limiting of the disclosure.
The machine learning-based speech recognition method in the embodiments of the disclosure is executed by an NLP platform, i.e., a natural language processing (NLP) platform, which is a tool for building, deploying, and managing NLP applications. The NLP platform provides functions and services such as text processing, speech recognition, lexical analysis, sentiment analysis, and machine translation, helps developers build and test NLP applications more efficiently, and can be deployed and scaled quickly. In the embodiments of the disclosure, the NLP platform may be, but is not limited to, a server, a personal computer, a notebook computer, a tablet computer, and the like. When the NLP platform is a server, the server includes, but is not limited to, a single network server, a server group formed by a plurality of network servers, or a cloud formed by a large number of computers or network servers based on cloud computing, where cloud computing is a type of distributed computing in which a group of loosely coupled computers forms one super virtual computer. The NLP platform can operate independently to implement the disclosure, or can access a network and implement the disclosure through interaction with other NLP platforms in the network. The network where the NLP platform is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN, and the like.
The embodiment of the disclosure provides a voice recognition method based on machine learning, which is applied to an NLP platform, as shown in FIG. 1, and comprises the following steps:
Step S10, a voice signal to be recognized is obtained, the voice signal to be recognized is input into a target voice recognition algorithm, and the target voice recognition algorithm is obtained by debugging in advance based on a voice sample.
Step S20, the voice signal to be recognized is recognized through a target voice recognition algorithm, and a key information recognition result in the voice signal to be recognized is obtained.
The voice signal to be recognized is a voice segment from which key information needs to be extracted, for example, a user's voice signal collected by enterprise communication equipment (such as a pickup tablet, a work mobile phone, a fixed phone box, or a recording earphone). The user's voice signal contains key information to be identified, such as place names, person names, tasks, and events. Recognizing the voice signal with the target speech recognition algorithm debugged in advance yields the key information recognition result for the signal.
In general speech recognition, the algorithm must recognize the same information across different phonemes and voiceprints. To this end, during debugging of the speech recognition algorithm, identical key information (such as a place name) is annotated with the same information label, for example a text label. Speech samples carrying the same information label are clustered into one feature set (i.e., a feature cluster) in the feature domain, and that cluster is represented by a single centroid (or cluster center). When the algorithm is debugged, the algorithm cost is obtained from the difference between the information label of a speech sample and the algorithm output, and the algorithm is debugged based on that cost. In the prior art, a large number of speech samples are required to ensure the accuracy of speech recognition. Because some of these samples carry wrong labels, the final accuracy of the algorithm often falls short of expectations. To counter this, the prior art adds a sample denoising step to clean out mislabeled speech samples, which increases the debugging workload, raises cost, and reduces efficiency. The embodiments of the disclosure aim to overcome these technical problems, and the implementation is described below. Referring to Fig. 2, debugging the target speech recognition algorithm provided in the embodiments of the disclosure includes the following steps:
Step S110, semantic description mining is carried out on a plurality of voice learning samples which are annotated with the same key information semantic tags in the voice learning sample set based on the voice recognition algorithm to be debugged, and a sample semantic description set under the same key information semantic tags is obtained.
The voice learning sample set comprises voice learning samples annotated with a plurality of key information semantic tags, i.e., voice samples for debugging algorithms.
In order to identify specific content of voice information in a voice signal, the method and the device preset a plurality of information contents, acquire confidence degrees of the voice signal belonging to the information contents, and further determine target information contents corresponding to the voice signal based on the confidence degrees.
Optionally, the speech recognition algorithm to be debugged comprises at least a semantic description mining operator and a classification operator. The semantic description mining operator extracts the semantic description of the voice signal to be recognized. The semantic description is the extracted semantic feature information of the voice signal and may be a feature vector, built for example from features such as the short-time energy, Mel-frequency cepstral coefficients, and short-time zero-crossing rate of the voice signal. The classification operator, which is a classifier such as softmax, obtains the confidence that the semantic description belongs to each information content. The confidences are obtained from the similarity scores between the semantic description and the key information centroid semantic description corresponding to each information content. A similarity score is a variable that evaluates how similar the two are, and can be obtained by calculating the distance between the corresponding feature vectors, such as the cosine distance or the Euclidean distance. The information content with the maximum confidence is determined as the information content of the voice information in the voice signal to be recognized. In some embodiments, further operator structures, such as affine operators, convolution operators, and pooling operators, may precede the classification operator; this is not specifically limited.
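The classification step can be illustrated with a minimal sketch: similarity scores between a semantic description and one centroid per information content are turned into confidences with softmax, and the content with the highest confidence is selected. The function names, the two-dimensional vectors, and the content names place_name and person_name are hypothetical; the use of cosine similarity is one of the options the text mentions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(description, centroids_by_content):
    """Return the most likely information content and all confidences."""
    names = list(centroids_by_content)
    sims = [cosine_similarity(description, centroids_by_content[n]) for n in names]
    confidences = softmax(sims)
    best = max(range(len(names)), key=confidences.__getitem__)
    return names[best], dict(zip(names, confidences))

# Hypothetical centroids for two information contents.
centroids_by_content = {"place_name": [1.0, 0.0], "person_name": [0.0, 1.0]}
label, confidences = classify([0.9, 0.3], centroids_by_content)
print(label, confidences)
```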
The sample semantic description set under the same key information semantic label may be regarded as a set made up of a plurality of sample semantic descriptions of a plurality of speech learning samples under the same key information semantic label. The voice recognition algorithm to be debugged can perform semantic description mining (namely, the process of semantic feature mining is completed) on a plurality of voice learning samples with a plurality of key information semantic marks, so as to obtain sample semantic description sets under the plurality of key information semantic marks, wherein the sample semantic description sets are feature sets of the mined voice learning samples. The semantic description mining mode of the voice recognition algorithm to be debugged is not limited, and the general feature mining mode can be referred to, for example, mining is performed through a convolution operator.
Step S120, presetting a plurality of basic sub-key information centroids for the sample semantic description set, and determining the basic sub-key information centroids respectively corresponding to the plurality of sample semantic descriptions in the sample semantic description set.
The voice learning sample set comprises a plurality of voice learning samples with key information semantic marks, and the number of basic sub-key information centroids preset by a voice recognition algorithm to be debugged, namely the number of initialized clustering centers, can be set according to the super-parameters. For example, if a speech learning sample containing 1000 information contents is set to generate 20 basic sub-key information centroids for each information content, the speech recognition algorithm to be debugged can randomly construct 20000 basic sub-key information centroids, that is, 20 basic sub-key information centroids are preset for each key information semantic mark.
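The centroid count in the example above can be reproduced with a short sketch that randomly constructs the preset centroids per label. The function name init_centroids, the label naming scheme, the vector dimension of 8, and the Gaussian initialization are assumptions for illustration; the text specifies only that the centroids are randomly constructed and that their number is set as a hyperparameter.

```python
import random

def init_centroids(labels, per_label, dim, seed=0):
    """Randomly construct per_label centroid vectors for every label."""
    rng = random.Random(seed)
    return {
        label: [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(per_label)]
        for label in labels
    }

# 1000 information contents, 20 basic sub-key information centroids each.
labels = [f"info_{i}" for i in range(1000)]
centroids = init_centroids(labels, per_label=20, dim=8)
total = sum(len(group) for group in centroids.values())
print(total)  # 1000 * 20 centroids in total
```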
In the above steps, the key information centroid is a class of key information, and based on the generated centroid, the sub-key information centroid is a sub-centroid (sub-cluster center) generated for the class, and the basic sub-key information centroid and the initial sub-key information centroid are called as initial sub-key information centroid. The basic sub-key information centroid is, for example, a vector, a matrix or a tensor, and can be randomly constructed by the voice recognition algorithm to be debugged, and can be considered as a parameter which can be learned by the voice recognition algorithm to be debugged. According to the method and the device, the basic sub-key information centroids are preset for the sample semantic description set, the basic sub-key information centroids corresponding to the sample semantic descriptions in the sample semantic description set are determined, preliminary information identification is completed in comparison, the basic sub-key information centroids are adjusted subsequently, different disturbance information can be identified adaptively, and the purpose of accurate voice identification is achieved.
Optionally, determining the basic sub-key information centroids respectively corresponding to the plurality of sample semantic descriptions in the sample semantic description set includes: for any sample semantic description in the sample semantic description set, obtaining the similarity scores between the sample semantic description and each basic sub-key information centroid, and determining the basic sub-key information centroid with the maximum similarity score (that is, the minimum feature distance) as the basic sub-key information centroid to which the sample semantic description belongs. In this way, the basic sub-key information centroid to which each sample semantic description belongs can be determined.
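The assignment rule can be sketched as follows. The disclosure does not name a similarity measure, so cosine similarity is used here purely as an assumption, and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_to_centroids(descriptions, centroids):
    """Return, for each description, the index of its most similar centroid."""
    assignments = []
    for d in descriptions:
        scores = [cosine_similarity(d, c) for c in centroids]
        assignments.append(max(range(len(centroids)), key=scores.__getitem__))
    return assignments

# Toy data: two centroids and three sample semantic descriptions.
centroids = [[1.0, 0.0], [0.0, 1.0]]
descriptions = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
assignments = assign_to_centroids(descriptions, centroids)
```

In this toy case the first and third descriptions are assigned to centroid 0 and the second to centroid 1.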
Step S130, adjusting the basic sub-key information centroids based on similarity scores between the sample semantic descriptions and the basic sub-key information centroids corresponding to the sample semantic descriptions, so as to obtain one or more target sub-key information centroids under the same key information semantic label.
Since the interference voice signals in the voice learning sample set may contain different information contents, the plurality of basic sub-key information centroids preset in step S120 may fail to identify the interference information of the current voice learning sample set. To improve generalization to different interference information, the present disclosure further adjusts the basic sub-key information centroids. Specifically, the basic sub-key information centroids under each key information semantic mark are adjusted through operations such as sub-key information centroid construction, sub-key information centroid fusion and sub-key information centroid deletion, so as to obtain the target sub-key information centroids under each key information semantic mark.
For example, if the voice learning sample set of the next batch includes voice learning samples of 20 information contents (that is, voice learning samples annotated with 20 key information semantic marks), 10 basic sub-key information centroids are introduced for each information content. The basic sub-key information centroids under each information content are then adjusted through sub-key information centroid construction, sub-key information centroid fusion and sub-key information centroid deletion, yielding, for instance, 5 target sub-key information centroids under the first information content, 2 under the second, 5 under the third, ..., and 3 under the twentieth, which automatically aligns the centroids with the actual sample distribution in the feature domain. In the feature domain, the sample semantic description of each voice learning sample is clustered toward the target sub-key information centroid to which it belongs; the remaining target sub-key information centroids that have high similarity scores with the sample semantic description are ignored, and the voice recognition algorithm to be debugged is debugged with the objective of keeping the sample semantic description away from the remaining target sub-key information centroids that have low similarity scores with it, which effectively avoids the marking confusion caused by fine-grained interference voice signals.
Optionally, adjusting the plurality of basic sub-key information centroids based on the similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids, to obtain one or more target sub-key information centroids under the same key information semantic label, includes: determining, based on those similarity scores, whether the sample semantic descriptions belonging to each basic sub-key information centroid include sample semantic descriptions that are not close to that centroid, whether the plurality of basic sub-key information centroids include a basic sub-key information centroid whose subordinate sample semantic descriptions are generally not close to it, and whether the plurality of basic sub-key information centroids include basic sub-key information centroids with high similarity scores between one another. If there are sample semantic descriptions that are not close to the basic sub-key information centroid to which they belong, that is, if the actual number of information contents in the sample semantic description set under the same key information semantic label is larger than the number of basic sub-key information centroids, a new basic sub-key information centroid is generated for those sample semantic descriptions so as to cover the sample semantic descriptions of all actual information contents. If the sample semantic descriptions belonging to a basic sub-key information centroid are not close to it, that is, if the similarity scores between that centroid and its subordinate sample semantic descriptions are small, the centroid is judged to be a special sub-key information centroid of no value, and the special sub-key information centroid and the sample semantic descriptions belonging to it are deleted. If there are basic sub-key information centroids with large similarity scores between one another, those centroids are fused (that is, combined): a large similarity score indicates with high confidence that the centroids represent the same information content, and fusion gathers the corresponding sample semantic descriptions into the same class.
The adjusted target sub-key information centroids may include one or more of the basic sub-key information centroids preset in step S120, the basic sub-key information centroids newly constructed in step S130, and the fused basic sub-key information centroids. After the basic sub-key information centroids under each key information semantic label are adjusted, one or more target sub-key information centroids under each key information semantic label are obtained.
Step S140, determining algorithm cost based on target sub-key information centroids under a plurality of key information semantic marks and sample semantic descriptions belonging to each target sub-key information centroid, and debugging a voice recognition algorithm to be debugged based on the algorithm cost.
Based on steps S110 to S130, the target sub-key information centroids under the plurality of key information semantic marks and the sample semantic descriptions belonging to each target sub-key information centroid can be obtained, and thus the prediction information output by the voice recognition algorithm to be debugged for each voice learning sample is obtained. Because each voice learning sample is annotated with a key information semantic mark, the algorithm cost can be determined based on the similarity scores between the sample semantic description of the voice learning sample and the target sub-key information centroids under its own key information semantic mark, together with the similarity scores between that sample semantic description and the remaining target sub-key information centroids under the remaining key information semantic marks. Optionally, determining the algorithm cost based on the plurality of target sub-key information centroids under the plurality of key information semantic marks and the sample semantic descriptions belonging to the respective target sub-key information centroids includes: for any sample semantic description belonging to any target sub-key information centroid, obtaining the similarity score between the sample semantic description and the target sub-key information centroid to which it belongs, and the similarity scores between the sample semantic description and the remaining target sub-key information centroids; then determining the sub-cost corresponding to the sample semantic description (its individual contribution to the algorithm cost) based on these similarity scores; and adding up the sub-costs corresponding to the sample semantic descriptions belonging to each target sub-key information centroid to obtain the algorithm cost. The cost may be computed with a general cost function, for example a cross-entropy cost function or a relative-entropy cost function, which is not specifically limited.
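One possible reading of the cross-entropy option is sketched below: the similarity scores are treated as logits, and the sub-cost of a sample is the negative log-softmax of the score for its own centroid. This is an assumption made for illustration; the disclosure leaves the exact cost function open, and the function names are hypothetical.

```python
import math

def sub_cost(own_score, other_scores):
    """Cross-entropy sub-cost: negative log-softmax of the own-centroid score."""
    scores = [own_score] + list(other_scores)
    peak = max(scores)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum - own_score

def algorithm_cost(samples):
    """Add up the sub-costs; each sample is (own_score, other_scores)."""
    return sum(sub_cost(own, others) for own, others in samples)

# Toy similarity scores for two sample semantic descriptions.
samples = [(0.9, [0.1, -0.2]), (0.8, [0.3])]
cost = algorithm_cost(samples)
```

Under this reading, the sub-cost shrinks as the own-centroid score grows relative to the scores of the remaining centroids, and it is zero when there are no remaining centroids to compare against.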
In practice, voice learning samples with the same information content may be annotated with different key information semantic marks and thus become interference voice signals, so a remaining target sub-key information centroid mentioned above may have the same information content as the current sample semantic description. If the algorithm cost were determined so that the sample semantic description is kept away from all remaining target sub-key information centroids, the sample semantic description would also be kept away from remaining target sub-key information centroids with the same information content, which is clearly unreasonable. Therefore, the remaining target sub-key information centroids with higher similarity scores with the current sample semantic description are ignored when the algorithm cost is determined.
Accordingly, optionally, determining the algorithm cost based on the target sub-key information centroids under the plurality of key information semantic marks and the sample semantic descriptions belonging to each target sub-key information centroid includes: for each sample semantic description covered by any target sub-key information centroid, determining the sub-cost corresponding to the sample semantic description based on the similarity score between the sample semantic description and the target sub-key information centroid to which it belongs and the similarity scores between the sample semantic description and the remaining target sub-key information centroids, where the remaining target sub-key information centroids comprise those target sub-key information centroids under the plurality of key information semantic marks, other than the one to which the sample semantic description belongs, whose similarity score with the sample semantic description is not larger than a fifth preset score; and determining the algorithm cost based on the sub-costs corresponding to the sample semantic descriptions belonging to each target sub-key information centroid. In this way, when the algorithm cost is determined, the sample semantic description is clustered toward the target sub-key information centroid to which it belongs, the remaining target sub-key information centroids with higher similarity scores with the current sample semantic description are ignored, and the sample semantic description is kept away from the remaining target sub-key information centroids with lower similarity scores, which relieves the marking confusion caused by fine-grained interference voice signals and makes the debugging of the algorithm more accurate.
In other words, when the sub-cost corresponding to a sample semantic description is obtained, every remaining target sub-key information centroid that is used is a target sub-key information centroid, among those under the plurality of key information semantic marks, whose similarity score with the sample semantic description is not larger than the fifth preset score. If the similarity score between the sample semantic description and a remaining target sub-key information centroid is larger than the fifth preset score, the similarity is considered high and that centroid is disregarded; otherwise, if the similarity score is not larger than the fifth preset score, the similarity is considered low and that centroid is retained.
A remaining target sub-key information centroid may in fact have either the same information content as the sample semantic description or a different one. To avoid keeping the sample semantic description away from a remaining target sub-key information centroid with the same information content, only the remaining target sub-key information centroids whose similarity score with the sample semantic description is not greater than the fifth preset score are used. Optionally, the fifth preset score may be determined based on the average result and the individual discrete coefficient (which may be the standard deviation or the variance) of the similarity scores between each remaining target sub-key information centroid and the sample semantic descriptions belonging to it. In other words, each remaining target sub-key information centroid corresponds to its own fifth preset score. For example, the fifth preset score is a weighted sum of the average result and the individual discrete coefficient.
For example, the fifth preset score is: $g_{m,h_m} = k_{m,h_m} + \alpha \cdot l_{m,h_m}$
Here $m$ denotes the $m$-th information content, $h_m$ denotes the $h_m$-th remaining target sub-key information centroid under the $m$-th information content, $g_{m,h_m}$ is the fifth preset score corresponding to that remaining target sub-key information centroid, $k_{m,h_m}$ is the average result of the similarity scores between the $h_m$-th remaining target sub-key information centroid and the sample semantic descriptions belonging to it, $l_{m,h_m}$ is the individual discrete coefficient of those similarity scores, and $\alpha$ is a preset weighting coefficient.
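The threshold and the resulting filtering of remaining centroids can be sketched as follows. The sketch assumes the population standard deviation as the individual discrete coefficient (the text also allows the variance); the function names and the toy scores are hypothetical.

```python
import statistics

def fifth_preset_score(member_scores, alpha):
    """g = k + alpha * l for one remaining centroid.

    k is the mean and l the (population) standard deviation of the similarity
    scores between the centroid and its own member descriptions.
    """
    k = statistics.mean(member_scores)
    l = statistics.pstdev(member_scores)
    return k + alpha * l

def retained_centroids(scores_to_others, thresholds):
    """Indices of remaining centroids whose similarity to the sample does not
    exceed that centroid's own fifth preset score; the others are ignored."""
    return [i for i, (s, g) in enumerate(zip(scores_to_others, thresholds)) if s <= g]

# Member similarity scores of three remaining centroids (toy values).
member_scores = [[0.6, 0.8], [0.2, 0.4], [0.5, 0.5]]
thresholds = [fifth_preset_score(m, alpha=1.0) for m in member_scores]
# Similarity of one sample semantic description to each remaining centroid.
kept = retained_centroids([0.9, 0.1, 0.5], thresholds)
```

Here the first remaining centroid is ignored because the sample's score of 0.9 exceeds its threshold of about 0.8, while the second and third are retained and would enter the sub-cost.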
In the embodiment of the disclosure, determining the algorithm cost based on the sub-costs corresponding to the sample semantic descriptions belonging to each target sub-key information centroid includes: adding up the sub-costs corresponding to the sample semantic descriptions belonging to each target sub-key information centroid to obtain the algorithm cost. Optionally, debugging the voice recognition algorithm to be debugged based on the algorithm cost includes: adjusting the algorithm parameters of the voice recognition algorithm to be debugged with a gradient optimization strategy based on the algorithm cost, and repeating steps S110 to S140 until the algorithm converges.
According to the embodiment of the disclosure, a plurality of basic sub-key information centroids are preset for the sample semantic description set under the same key information semantic mark, and the basic sub-key information centroid corresponding to each sample semantic description in the set is determined; this is equivalent to presetting a plurality of pieces of sub-information for the sample semantic descriptions under the same key information semantic mark, so as to cover the interference information that may exist under that mark. The plurality of basic sub-key information centroids are adjusted based on the similarity scores between the sample semantic descriptions and their respectively corresponding basic sub-key information centroids, so that the resulting one or more target sub-key information centroids better fit the actual distribution of the sample semantic descriptions under the key information semantic mark; that is, the sample semantic descriptions under the same key information semantic mark are gathered to more accurate target sub-key information centroids, which reduces the marking confusion caused by the various marking errors that may exist under different key information semantic marks. The voice recognition algorithm to be debugged is debugged based on the target sub-key information centroids and the sample semantic descriptions belonging to them, so that, when the algorithm is trained on a voice learning sample set that may contain interference voice signals, its generalization to mislabeled interference voice signals is improved. An algorithm with a good debugging effect can be obtained with only appropriate parameter adjustment, and the required debugging sample size is low; on the premise of ensuring the recognition accuracy of the algorithm, the efficiency of the debugging process is increased, computation is saved, and cost is reduced.
When the basic sub-key information centroids are adjusted, if the actual number of information contents in the sample semantic description set under the same key information semantic mark is larger than the number of basic sub-key information centroids, for example, there are 5 basic sub-key information centroids while the sample semantic description set under the same key information semantic mark actually contains sample semantic descriptions of 6 information contents, a new basic sub-key information centroid is generated to cover the information contents other than those represented by the preset basic sub-key information centroids. Optionally, step S130, adjusting the plurality of basic sub-key information centroids based on the similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids to obtain one or more target sub-key information centroids under the same key information semantic label, includes:
S131, evaluating, based on the similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids, whether the plurality of sample semantic descriptions include isolated semantic descriptions.
An isolated semantic description is a sample semantic description whose similarity score with the basic sub-key information centroid to which it belongs is smaller than a first preset score.
S132, if the plurality of sample semantic descriptions include isolated semantic descriptions, newly adding a basic sub-key information centroid based on the isolated semantic descriptions, and adjusting the isolated semantic descriptions to belong to the newly added basic sub-key information centroid.
The target sub-key information centroids comprise the newly added basic sub-key information centroid and the plurality of preset basic sub-key information centroids. In S131, a similarity score between an isolated semantic description and the basic sub-key information centroid to which it belongs that is smaller than the first preset score indicates that the isolated semantic description is a sample semantic description not close to that centroid. Optionally, the first preset score may be determined based on the average result and the individual discrete coefficient of the similarity scores between a basic sub-key information centroid and the sample semantic descriptions belonging to it, that is, each basic sub-key information centroid corresponds to its own first preset score. A first preset score determined in this way helps identify isolated semantic descriptions more accurately.
Optionally, in S131, evaluating whether the plurality of sample semantic descriptions include isolated semantic descriptions based on the similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids includes: for any basic sub-key information centroid, determining the first preset score corresponding to the centroid based on the average result and the individual discrete coefficient of the similarity scores between the centroid and the sample semantic descriptions belonging to it; evaluating whether the sample semantic descriptions belonging to the centroid include sample semantic descriptions whose similarity score with the centroid is smaller than the first preset score; and determining the sample semantic descriptions whose similarity score with the centroid is smaller than the first preset score as isolated semantic descriptions not close to the centroid. Determining the first preset score corresponding to a basic sub-key information centroid includes: subtracting the weighted individual discrete coefficient from the average result corresponding to the centroid, and taking the difference as the first preset score corresponding to the centroid.
If the similarity score between a sample semantic description and the basic sub-key information centroid to which it belongs is smaller than the first preset score, the sample semantic description is quite likely interference information whose information content differs from that of the centroid, that is, an isolated semantic description not close to the centroid. The isolated semantic descriptions not close to their basic sub-key information centroids are singled out and removed from those centroids, and a new basic sub-key information centroid is constructed for them, which reduces intra-class inconsistency. Based on S131, one or more isolated semantic descriptions may be obtained. If only one isolated semantic description is obtained, it is directly deleted and no basic sub-key information centroid is added; if a plurality of isolated semantic descriptions are obtained, the average result or a weighted average result of the plurality of isolated semantic descriptions is taken as the newly added basic sub-key information centroid.
According to the embodiment of the disclosure, the isolated semantic descriptions that are not close to the basic sub-key information centroids to which they belong are selected from the plurality of sample semantic descriptions based on the similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids, and a basic sub-key information centroid is newly added for them. This covers the information contents other than those represented by the preset basic sub-key information centroids, reduces the intra-class contradiction caused by marking errors, and improves the feature description accuracy of the voice recognition algorithm to be debugged.
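Steps S131 and S132 can be sketched as follows for a single basic sub-key information centroid. The sketch assumes the population standard deviation as the individual discrete coefficient, a weight of 1.0, and a plain (unweighted) average for the new centroid; it also reads the single-isolated-description case as "no centroid is added". All names are hypothetical.

```python
import statistics

def first_preset_score(member_scores, weight):
    """Mean of the member similarity scores minus the weighted dispersion."""
    return statistics.mean(member_scores) - weight * statistics.pstdev(member_scores)

def split_isolated(descriptions, member_scores, weight=1.0):
    """Separate the members of one centroid into (kept, isolated)."""
    threshold = first_preset_score(member_scores, weight)
    kept, isolated = [], []
    for d, s in zip(descriptions, member_scores):
        (isolated if s < threshold else kept).append(d)
    return kept, isolated

def new_centroid_from(isolated):
    """Average the isolated descriptions; None if fewer than two exist."""
    if len(isolated) < 2:
        return None
    dim = len(isolated[0])
    return [sum(d[i] for d in isolated) / len(isolated) for i in range(dim)]

# Six members of one centroid; the last two have low similarity to it.
descriptions = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [1.0, 0.2], [0.0, 1.0], [0.0, 3.0]]
member_scores = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
kept, isolated = split_isolated(descriptions, member_scores)
new_centroid = new_centroid_from(isolated)
```

In this toy case the threshold is roughly 0.26, so the two low-scoring descriptions are isolated and their average, `[0.0, 2.0]`, becomes the newly added centroid.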
When the basic sub-key information centroids are adjusted, if the similarity scores between a basic sub-key information centroid and the sample semantic descriptions belonging to it are small, the basic sub-key information centroid is regarded as a special sub-key information centroid, and the special sub-key information centroid and the sample semantic descriptions belonging to it are deleted. Optionally, step S130, adjusting the plurality of basic sub-key information centroids based on the similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids to obtain one or more target sub-key information centroids under the same key information semantic label, includes:
S133, evaluating whether a special sub-key information centroid exists in the plurality of basic sub-key information centroids or not based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids.
The average result of similarity scores between the sample semantic descriptions covered by the special sub-key information centroid and the special sub-key information centroid is not larger than a second preset score.
S134, deleting the special sub-key information centroid and the sample semantic description belonging to the special sub-key information centroid if the special sub-key information centroid exists in the plurality of basic sub-key information centroids, and obtaining the rest basic sub-key information centroids and the sample semantic description belonging to the rest basic sub-key information centroids.
Wherein the target sub-critical information centroid includes the remaining base sub-critical information centroid.
Optionally, in step S133, evaluating whether the plurality of basic sub-key information centroids include a special sub-key information centroid based on the similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids includes: for any basic sub-key information centroid, evaluating whether the average result of the similarity scores between the centroid and the sample semantic descriptions belonging to it is not more than a second preset score; and, when the average result is not more than the second preset score, determining the basic sub-key information centroid as a special sub-key information centroid. The second preset score is set according to actual needs and is not specifically limited. The average result of the similarity scores between a basic sub-key information centroid and the sample semantic descriptions belonging to it can represent how densely those sample semantic descriptions are distributed around the centroid, since a plurality of sample semantic descriptions with the same information content gather together in the feature domain. If the average result for a basic sub-key information centroid is not greater than the second preset score, the sample semantic descriptions belonging to the centroid are not concentrated around it: they may have different information contents, or the voice learning samples corresponding to them may be unclear, so that the sample semantic descriptions cannot be accurately gathered to a matching basic sub-key information centroid. Such a basic sub-key information centroid is therefore determined as a special sub-key information centroid. In S134, deleting the special sub-key information centroid and the sample semantic descriptions belonging to it means that the algorithm cost is not obtained based on them; the remaining basic sub-key information centroids are the basic sub-key information centroids other than the special sub-key information centroids among the plurality of basic sub-key information centroids.
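Steps S133 and S134 can be sketched as follows. The centroid identifiers, the toy scores and the second preset score of 0.5 are hypothetical; the disclosure leaves the second preset score to be set according to actual needs. Every centroid is assumed to have at least one member description.

```python
import statistics

def drop_special_centroids(member_scores_by_centroid, second_preset_score):
    """Split centroid ids into (remaining, special).

    A centroid is special when the mean similarity between it and its member
    descriptions is not larger than the second preset score.
    """
    remaining, special = [], []
    for cid, scores in member_scores_by_centroid.items():
        if statistics.mean(scores) <= second_preset_score:
            special.append(cid)
        else:
            remaining.append(cid)
    return remaining, special

# Similarity scores between each centroid and its member descriptions.
member_scores_by_centroid = {"c0": [0.9, 0.8], "c1": [0.2, 0.3], "c2": [0.7, 0.6]}
remaining, special = drop_special_centroids(member_scores_by_centroid, 0.5)
```

Here `c1` is identified as special; it and its member descriptions would be excluded when the algorithm cost is computed.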
Optionally, in step S130, determining and fusing one or more groups of quasi-fusion sub-key information centroids in the plurality of basic sub-key information centroids, based on the similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids, includes:
S135, determining one or more groups of quasi-fusion sub-key information centroids in the basic sub-key information centroids based on similarity scores between the sample semantic descriptions and the basic sub-key information centroids respectively corresponding to the sample semantic descriptions, wherein each group of quasi-fusion sub-key information centroids comprises a plurality of approximate basic sub-key information centroids.
S136, fusing each of the one or more groups of quasi-fusion sub-key information centroids to obtain one or more fusion sub-key information centroids, and adjusting the sample semantic descriptions corresponding to each group of quasi-fusion sub-key information centroids to belong to the corresponding fusion sub-key information centroid.
The target sub-critical information centroid comprises a fusion sub-critical information centroid.
If the similarity score between basic sub-key information centroids is large, those basic sub-key information centroids are considered likely to belong to the same information content, and the approximate basic sub-key information centroids are fused. In practice, each key information semantic mark may cover speech learning samples of one information content or of a plurality of information contents, and speech learning samples of the same information content may be annotated with different key information semantic marks. The plurality of basic sub-key information centroids may therefore include one or more groups of quasi-fusion sub-key information centroids, and each group may include basic sub-key information centroids under the current key information semantic mark as well as basic sub-key information centroids under the remaining key information semantic marks.
Optionally, S135 may obtain a similarity score between every two basic sub-key information centroids in the plurality of basic sub-key information centroids, and determine two basic sub-key information centroids whose similarity score is greater than a fourth preset score to be approximate, so that a plurality of mutually approximate basic sub-key information centroids are determined as a group of quasi-fusion sub-key information centroids.
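The pairwise variant of S135 can be sketched in Python as below. The names `quasi_fusion_groups`, `centroid_ids` and `pair_scores` are hypothetical, and the pairwise similarity scores are assumed to be precomputed. The text does not say how approximate pairs are collected into a group; this sketch assumes transitive grouping (connected components via union-find).

```python
def quasi_fusion_groups(centroid_ids, pair_scores, fourth_preset_score):
    """Group centroids whose pairwise similarity exceeds the threshold.

    centroid_ids: list of centroid ids
    pair_scores:  dict mapping an (id_a, id_b) tuple to a similarity score
    """
    parent = {cid: cid for cid in centroid_ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score > fourth_preset_score:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for cid in centroid_ids:
        groups.setdefault(find(cid), []).append(cid)
    # Only groups holding several centroids are candidates for fusion.
    return [group for group in groups.values() if len(group) > 1]
```

Under the transitive assumption, two centroids can end up in the same group through a shared neighbour even if their own similarity score is below the threshold; a stricter grouping rule would need a different merge step.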
Optionally, in S135, determining one or more groups of quasi-fusion sub-key information centroids in the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids includes: determining a third preset score corresponding to each basic sub-key information centroid based on the individual discrete coefficient and the average result of the similarity scores between that basic sub-key information centroid and the sample semantic descriptions belonging to it; for any basic sub-key information centroid, evaluating whether the similarity score between the basic sub-key information centroid and each remaining basic sub-key information centroid in the plurality of basic sub-key information centroids is greater than or equal to a fourth preset score, wherein the fourth preset score is the larger of the third preset score corresponding to the basic sub-key information centroid and the third preset score corresponding to the remaining basic sub-key information centroid; determining that the remaining basic sub-key information centroid is approximate to the basic sub-key information centroid when the similarity score between them is greater than or equal to the fourth preset score; and determining the basic sub-key information centroid and the one or more remaining basic sub-key information centroids approximate to it as a group of quasi-fusion sub-key information centroids. In this way, one or more groups of quasi-fusion sub-key information centroids can be effectively determined.
Determining the third preset score corresponding to each basic sub-key information centroid, based on the individual discrete coefficient and the average result of the similarity scores between that basic sub-key information centroid and the sample semantic descriptions belonging to it, comprises: adding the average result corresponding to the basic sub-key information centroid to the weighted value of the individual discrete coefficient, and determining the sum as the third preset score corresponding to that basic sub-key information centroid. Evaluating, for any basic sub-key information centroid, whether its similarity score with each remaining basic sub-key information centroid is greater than or equal to the fourth preset score (the larger of the two centroids' third preset scores) can be regarded as evaluating whether the two centroids are highly similar, or whether their feature distributions correspond to each other. If the similarity score is greater than or equal to the fourth preset score, the two basic sub-key information centroids are considered highly similar, or their feature distribution ranges are considered to correspond, and the remaining basic sub-key information centroid is determined to be approximate to the basic sub-key information centroid.
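The threshold computation described above can be sketched in Python as follows. This passage does not define the individual discrete coefficient or the weight, so the sketch assumes the coefficient of variation (population standard deviation divided by the mean) of a centroid's member similarity scores and a hypothetical `weight` parameter; all function names are illustrative.

```python
from statistics import mean, pstdev


def third_preset_score(member_scores, weight=1.0):
    """Average member similarity plus a weighted dispersion term.

    member_scores: similarity scores between one centroid and its members.
    The dispersion term is ASSUMED to be the coefficient of variation.
    """
    avg = mean(member_scores)
    dispersion = pstdev(member_scores) / avg if avg else 0.0
    return avg + weight * dispersion


def is_approximate(centroid_similarity, scores_a, scores_b, weight=1.0):
    """Compare two centroids against the larger of their third preset scores."""
    fourth_preset_score = max(
        third_preset_score(scores_a, weight),
        third_preset_score(scores_b, weight),
    )
    return centroid_similarity >= fourth_preset_score
```

Because the fourth preset score is derived per pair from the centroids' own member statistics, tightly clustered centroids demand a higher mutual similarity before they are treated as approximate.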
Optionally, in S136, fusing the one or more groups of quasi-fusion sub-key information centroids respectively to obtain one or more fusion sub-key information centroids includes: for each group of quasi-fusion sub-key information centroids, determining the average result of the plurality of basic sub-key information centroids included in the group as the fusion sub-key information centroid obtained by fusing that group.
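The fusion step of S136 can be sketched in Python as below, taking the element-wise mean of the centroids in a group and reassigning their sample semantic descriptions to the fused centroid. The dictionary layout and the names `fuse_group` and `fused_id` are assumptions for illustration.

```python
def fuse_group(group, centroids, members, fused_id):
    """Replace the centroids in `group` by their element-wise average.

    group:     list of centroid ids forming one quasi-fusion group
    centroids: dict mapping a centroid id to its vector
    members:   dict mapping a centroid id to its assigned description vectors
    fused_id:  id given to the new fusion centroid (hypothetical naming)
    """
    vectors = [centroids[cid] for cid in group]
    dim = len(vectors[0])
    fused = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

    # Centroids outside the group are carried over unchanged.
    new_centroids = {c: v for c, v in centroids.items() if c not in group}
    new_members = {c: list(d) for c, d in members.items() if c not in group}

    # Descriptions of every fused centroid now belong to the fusion centroid.
    new_centroids[fused_id] = fused
    new_members[fused_id] = [d for cid in group for d in members.get(cid, [])]
    return new_centroids, new_members
```

The function returns new dictionaries rather than mutating its inputs, so the same starting state can be reused when several groups are fused in turn.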
Based on the same principle as the method shown in fig. 1, there is also provided in an embodiment of the present disclosure a speech recognition apparatus 10, as shown in fig. 3, the apparatus 10 including:
The voice acquisition module 11 is configured to acquire a voice signal to be recognized, and input the voice signal to be recognized into a target voice recognition algorithm, where the target voice recognition algorithm is obtained by debugging in advance based on a voice sample;
The voice recognition module 12 is configured to recognize the voice signal to be recognized by using the target voice recognition algorithm, so as to obtain a key information recognition result in the voice signal to be recognized;
The algorithm debugging module 13 is configured to debug the target voice recognition algorithm, and specifically includes the following submodules:
The semantic mining module 131 is configured to perform semantic description mining on a plurality of voice learning samples annotated with the same key information semantic mark in a voice learning sample set based on a voice recognition algorithm to be debugged, so as to obtain a sample semantic description set under the same key information semantic mark, where the voice learning sample set includes voice learning samples annotated with a plurality of key information semantic marks;
A sub-centroid setting module 132, configured to preset a plurality of basic sub-key information centroids for the sample semantic description set, and determine basic sub-key information centroids respectively corresponding to a plurality of sample semantic descriptions in the sample semantic description set;
The similarity determining module 133 is configured to adjust the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the basic sub-key information centroids respectively corresponding to the plurality of sample semantic descriptions, so as to obtain one or more target sub-key information centroids under the same key information semantic mark;
The cost debugging module 134 is configured to determine an algorithm cost based on the target sub-key information centroids under the plurality of key information semantic labels and the sample semantic descriptions belonging to each target sub-key information centroid, and debug the to-be-debugged speech recognition algorithm based on the algorithm cost.
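The cost computed by the cost debugging module 134 can be illustrated with a rough Python sketch. The text states only that each per-sample cost depends on the similarity score to the sample's own target centroid and on the similarity scores to other target centroids. The softmax cross-entropy form, the `temperature` parameter, and averaging over samples are assumptions of this sketch, and the selection of other centroids by the fifth preset score is left out.

```python
from math import exp, log


def sub_cost(own_score, other_scores, temperature=0.1):
    """Per-sample cost: low when the own-centroid score dominates the others.

    ASSUMED form: negative log-softmax of the own-centroid similarity.
    """
    logits = [own_score / temperature] + [s / temperature for s in other_scores]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + log(sum(exp(x - peak) for x in logits))
    return log_sum - logits[0]


def algorithm_cost(samples, temperature=0.1):
    """Average the per-sample costs.

    samples: list of (own_score, other_scores) pairs, one per description.
    """
    costs = [sub_cost(own, others, temperature) for own, others in samples]
    return sum(costs) / len(costs)
```

With this form, the cost falls as a sample semantic description moves closer to its own target centroid and further from the others, which is the behaviour the debugging step relies on.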
The above embodiment describes the speech recognition device 10 from the viewpoint of a virtual module, and the following describes an NLP platform from the viewpoint of a physical module, specifically as follows:
An embodiment of the present disclosure provides an NLP platform. As shown in fig. 4, the NLP platform 100 includes a processor 101 and a memory 103. The processor 101 is coupled to the memory 103, for example via a bus 102. Optionally, the NLP platform 100 may also include a transceiver 104. It should be noted that, in practical applications, the number of transceivers 104 is not limited to one, and the structure of the NLP platform 100 does not constitute a limitation on the embodiments of the present disclosure.
The processor 101 may be a CPU, general-purpose processor, GPU, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 101 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 102 may include a path to transfer information between the aforementioned components. Bus 102 may be a PCI bus, an EISA bus, or the like, and may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean that there is only one bus or only one type of bus.
Memory 103 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disks, laser disks, optical disks, digital versatile disks, blu-ray disks, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
Memory 103 is used to store application code for executing the schemes of the present disclosure and is controlled for execution by processor 101. The processor 101 is configured to execute application code stored in the memory 103 to implement what is shown in any of the method embodiments described above.
The embodiment of the disclosure provides an NLP platform, which comprises: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and, when executed by the one or more processors, implement the methods described above. According to this technical scheme, key information in a voice signal is identified by a target voice recognition algorithm that has been debugged in advance; because the algorithm is obtained through machine learning and debugging, the accuracy of acquiring voice key information can be improved. In the debugging process, a plurality of basic sub-key information centroids are preset for the sample semantic description set under the same key information semantic mark, and the basic sub-key information centroid corresponding to each of the plurality of sample semantic descriptions in the set is determined, so that a plurality of pieces of sub-information are preset for the sample semantic descriptions under the same key information semantic mark to cover interference information that may exist under that mark. The plurality of basic sub-key information centroids are then adjusted based on the similarity scores between the sample semantic descriptions and their respectively corresponding basic sub-key information centroids, so as to obtain one or more target sub-key information centroids under the same key information semantic mark. The target sub-key information centroids better fit the real distribution of the sample semantic descriptions under the key information semantic mark; that is, the sample semantic descriptions under the same mark can be gathered to more accurate target sub-key information centroids, which reduces the mark confusion caused by wrongly marked interference information that may exist under different key information semantic marks. Finally, the voice recognition algorithm to be debugged is debugged based on the target sub-key information centroids and the sample semantic descriptions belonging to them. When the voice learning sample set may contain interfering voice signals, this increases the generalization of the debugged algorithm to wrongly marked interfering voice signals. An algorithm with a good debugging effect can be obtained with only appropriate parameter adjustment, the requirement on the number of debugging samples is low, and, on the premise of ensuring recognition precision, the efficiency of the debugging process is increased, computation is saved, and cost is reduced.
The disclosed embodiments provide a computer readable storage medium having a computer program stored thereon, which when run on a processor, enables the processor to perform the corresponding content of the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing describes only some embodiments of the present disclosure. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principles of the present disclosure, and these improvements and modifications should also be considered as falling within the protection scope of the present disclosure.

Claims (9)

1. A machine learning-based speech recognition method, applied to an NLP platform, the method comprising:
Acquiring a voice signal to be recognized, and inputting the voice signal to be recognized into a target voice recognition algorithm, wherein the target voice recognition algorithm is obtained by debugging in advance based on a voice sample;
The target voice recognition algorithm is used for recognizing the voice signal to be recognized, so that a key information recognition result in the voice signal to be recognized is obtained;
The target voice recognition algorithm comprises the following steps when in debugging:
Performing semantic description mining on a plurality of voice learning samples annotated with the same key information semantic tags in a voice learning sample set based on a voice recognition algorithm to be debugged to obtain a sample semantic description set under the same key information semantic tags, wherein the voice learning sample set comprises voice learning samples annotated with the plurality of key information semantic tags;
presetting a plurality of basic sub-key information centroids for the sample semantic description set, and determining basic sub-key information centroids respectively corresponding to a plurality of sample semantic descriptions in the sample semantic description set;
Adjusting the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the basic sub-key information centroids respectively corresponding to the plurality of sample semantic descriptions to obtain one or more target sub-key information centroids under the same key information semantic label;
Determining algorithm cost based on target sub-key information centroids under a plurality of key information semantic marks and sample semantic descriptions belonging to each target sub-key information centroid, and debugging the voice recognition algorithm to be debugged based on the algorithm cost;
The determining the algorithm cost based on the target sub-key information centroids under the plurality of key information semantic marks and the sample semantic descriptions of the target sub-key information centroids comprises the following steps:
Determining a sub-cost corresponding to the sample semantic description based on a similarity score between the sample semantic description and the target sub-key information centroid to which the sample semantic description belongs and a similarity score between the sample semantic description and other target sub-key information centroids, wherein the other target sub-key information centroids comprise target sub-key information centroids which are not greater than a fifth preset score except the target sub-key information centroids to which the sample semantic description belongs in the plurality of target sub-key information centroids under the plurality of key information semantic marks;
And determining the algorithm cost based on the sub-cost corresponding to each sample semantic description belonging to each target sub-key information centroid.
2. The method of claim 1, wherein adjusting the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids to obtain one or more target sub-key information centroids under the same key information semantic mark comprises:
based on similarity scores between the sample semantic descriptions and basic sub-key information centroids corresponding to the sample semantic descriptions, evaluating whether isolated semantic descriptions exist in the sample semantic descriptions, wherein the similarity scores between the isolated semantic descriptions and the basic sub-key information centroids are smaller than a first preset score;
When the plurality of sample semantic descriptions have isolated semantic descriptions, based on the isolated semantic descriptions, newly adding basic sub-key information centroids, and adjusting the isolated semantic descriptions to belong to the newly added basic sub-key information centroids; the target sub-key information centroid comprises a newly added basic sub-key information centroid and a plurality of preset basic sub-key information centroids.
3. The method of claim 2, wherein evaluating whether there is an isolated semantic description in the plurality of sample semantic descriptions based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids comprises:
For any basic sub-key information centroid, determining a first preset score corresponding to the basic sub-key information centroid based on an average result of similarity scores between the basic sub-key information centroid and sample semantic descriptions belonging to the basic sub-key information centroid and individual discrete coefficients;
evaluating whether the sample semantic descriptions belonging to the basic sub-key information centroids have the sample semantic descriptions with similarity scores smaller than the first preset scores or not;
And determining the sample semantic descriptions with similarity scores smaller than the first preset scores with the basic sub-key information centroid as isolated semantic descriptions not close to the basic sub-key information centroid.
4. A method according to any one of claims 1 to 3, wherein adjusting the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids to obtain one or more target sub-key information centroids under the same key information semantic mark comprises:
Based on similarity scores between the sample semantic descriptions and the corresponding basic sub-key information centroids, evaluating whether the basic sub-key information centroids have special sub-key information centroids or not, wherein an average result of the similarity scores between the sample semantic descriptions covered by the special sub-key information centroids and the special sub-key information centroids is not more than a second preset score;
Deleting the special sub-key information centroid and the sample semantic descriptions belonging to the special sub-key information centroid when a special sub-key information centroid exists among the plurality of basic sub-key information centroids, and obtaining the remaining basic sub-key information centroids and the sample semantic descriptions belonging to the remaining basic sub-key information centroids; wherein the target sub-key information centroid includes the remaining basic sub-key information centroids.
5. The method of claim 4, wherein evaluating whether a special sub-key information centroid is present in the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids comprises:
For any basic sub-key information centroid, evaluating whether an average result of similarity scores between the basic sub-key information centroid and sample semantic descriptions belonging to the basic sub-key information centroid is not greater than a second preset score;
and when the average result of similarity scores between the basic sub-key information centroid and the sample semantic descriptions belonging to the basic sub-key information centroid is not more than the second preset score, determining that the basic sub-key information centroid is a special sub-key information centroid.
6. The method of claim 1, wherein adjusting the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids to obtain one or more target sub-key information centroids under the same key information semantic mark comprises:
Determining one or more groups of quasi-fusion sub-key information centroids in the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids, wherein each group of quasi-fusion sub-key information centroids comprises a plurality of approximate basic sub-key information centroids;
Respectively fusing the one or more groups of quasi-fusion sub-key information centroids to obtain one or more fusion sub-key information centroids, and reassigning the sample semantic descriptions corresponding to each group of quasi-fusion sub-key information centroids so that they belong to the corresponding fusion sub-key information centroid; the target sub-key information centroid includes the fusion sub-key information centroid.
7. The method of claim 6, wherein the determining one or more groups of quasi-fusion sub-key information centroids in the plurality of basic sub-key information centroids based on similarity scores between the plurality of sample semantic descriptions and the respectively corresponding basic sub-key information centroids comprises:
Determining a third preset score corresponding to each basic sub-key information centroid based on the average result of similarity scores between each basic sub-key information centroid and the sample semantic descriptions belonging to each basic sub-key information centroid and individual discrete coefficients;
For any basic sub-key information centroid, evaluating whether a similarity score between the basic sub-key information centroid and the rest of basic sub-key information centroids in the plurality of basic sub-key information centroids is greater than or equal to a fourth preset score, wherein the fourth preset score is the maximum similarity score in third preset scores corresponding to the basic sub-key information centroid and the rest of basic sub-key information centroids;
Determining that the remaining basic sub-key information centroid is approximate to the basic sub-key information centroid when a similarity score between the basic sub-key information centroid and the remaining basic sub-key information centroid is greater than or equal to the maximum similarity score;
and determining the basic sub-key information centroid and one or more other basic sub-key information centroids approximate to the basic sub-key information centroid as a group of quasi-fusion sub-key information centroids.
8. The method of claim 1, wherein the determining basic sub-key information centroids respectively corresponding to a plurality of sample semantic descriptions in the sample semantic description set comprises:
Obtaining similarity scores between the sample semantic descriptions and the centroids of the basic sub-key information according to any sample semantic description in the sample semantic description set;
And determining the basic sub-key information centroid corresponding to the maximum similarity score as the basic sub-key information centroid to which the sample semantic description belongs.
9. An NLP platform, comprising:
one or more processors;
A memory;
One or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, which when executed by the one or more processors, implement the method of any of claims 1-8.
CN202311337518.XA 2023-10-13 2023-10-13 Speech recognition method and NLP platform based on machine learning Active CN117334186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311337518.XA CN117334186B (en) 2023-10-13 2023-10-13 Speech recognition method and NLP platform based on machine learning


Publications (2)

Publication Number Publication Date
CN117334186A CN117334186A (en) 2024-01-02
CN117334186B true CN117334186B (en) 2024-04-30

Family

ID=89279009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311337518.XA Active CN117334186B (en) 2023-10-13 2023-10-13 Speech recognition method and NLP platform based on machine learning

Country Status (1)

Country Link
CN (1) CN117334186B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232114A (en) * 2019-05-06 2019-09-13 平安科技(深圳)有限公司 Sentence intension recognizing method, device and computer readable storage medium
CN111586469A (en) * 2020-05-12 2020-08-25 腾讯科技(深圳)有限公司 Bullet screen display method and device and electronic equipment
CN111932296A (en) * 2020-07-20 2020-11-13 中国建设银行股份有限公司 Product recommendation method and device, server and storage medium
WO2021082786A1 (en) * 2019-10-30 2021-05-06 腾讯科技(深圳)有限公司 Semantic understanding model training method and apparatus, and electronic device and storage medium
CN114298122A (en) * 2021-10-22 2022-04-08 腾讯科技(深圳)有限公司 Data classification method, device, equipment, storage medium and computer program product
WO2022116442A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech sample screening method and apparatus based on geometry, and computer device and storage medium
CN115238068A (en) * 2022-06-21 2022-10-25 中国科学院自动化研究所 Voice transcription text clustering method and device, electronic equipment and storage medium
CN116403569A (en) * 2023-04-06 2023-07-07 平安科技(深圳)有限公司 Speech recognition method, device, computer equipment and medium based on artificial intelligence
CN117034955A (en) * 2023-05-06 2023-11-10 中国工商银行股份有限公司 Telephone traffic text intention recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Weakly-Supervised Semantic Segmentation Approach Based on the Centroid Loss: Application to Quality Control and Inspection; Kai Yao, et al.; IEEE Access; 20210505; full text *

Also Published As

Publication number Publication date
CN117334186A (en) 2024-01-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240410

Address after: Room 1205-43, 12th Floor, Building 3, No. 8 East Road, Automobile Museum, Fengtai District, Beijing, 100000

Applicant after: Beijing Zhicheng Pengzhan Technology Co.,Ltd.

Country or region after: China

Address before: No. 10, 13th Floor, Building A-3, Zone II (Phase 6), National Geospatial Information Industry Base, No. 3, Wudayuan Fourth Road, Donghu New Technology Development Zone, Wuhan City, Hubei Province, 430074

Applicant before: WUHAN SAISIYUN TECHNOLOGY Co.,Ltd.

Country or region before: China

GR01 Patent grant