CN118230720B - Voice semantic recognition method based on AI and TWS earphone - Google Patents


Info

Publication number
CN118230720B
Authority
CN
China
Prior art keywords
voice
sequence
speech
network
target
Prior art date
Legal status
Active
Application number
CN202410623558.9A
Other languages
Chinese (zh)
Other versions
CN118230720A (en)
Inventor
胡孝健
盛子浩
魏开发
徐怀党
Current Assignee
Shenzhen Shengjiali Electronics Co ltd
Original Assignee
Shenzhen Shengjiali Electronics Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Shengjiali Electronics Co ltd
Priority to CN202410623558.9A
Publication of CN118230720A
Application granted
Publication of CN118230720B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00 Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Otolaryngology (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Neurosurgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides an AI-based voice semantic recognition method and a TWS earphone. A target feature representation network is obtained through debugging; the respective feature representations of the input voice and the candidate voices are extracted and matched, and the best-matching candidate voice is output. Long voice can thus be processed and converted into accurate, concise voice for output, which improves the communication efficiency and quality for users. In the network debugging stage, starting from the architecture of an existing machine learning network, the initial distribution mapping table in the distribution vector mapping table is extended through an extended distribution mapping table; the voice coverage area of the initial distribution mapping table is the initial voice coverage area, so the voice coverage area of the distribution vector mapping table extends from the initial voice coverage area to a target voice coverage area. Because the target voice coverage area is larger than the initial voice coverage area, the resulting target feature representation network can handle long voice sequences whose voice coverage area exceeds the initial voice coverage area.

Description

Voice semantic recognition method based on AI and TWS earphone
Technical Field
The application relates to the field of data processing, and in particular to an AI-based voice semantic recognition method and a TWS earphone.
Background
With the continuous development of artificial intelligence technology, AI-based speech semantic recognition methods have been widely applied in fields such as hearing assistance, smart homes and medical assistance. However, existing speech semantic recognition methods still face challenges when processing long speech sequences. In the hearing aid field, hearing-impaired users have limited ability to pick up information from the long speech of a conversation partner, which affects communication efficiency and quality.
Conventional hearing aid products, such as TWS headsets with embedded speech recognition systems, are typically designed to handle shorter speech segments with a limited speech coverage area. When faced with speech whose length exceeds what the system was designed for, such systems often cannot efficiently recognize and parse the complete semantic information. If long speech sequences are simply split into multiple shorter speech signals that are processed separately, important information in the speech sequence may be lost or corrupted, affecting the accuracy and integrity of recognition. In addition, existing speech recognition systems face difficulties in debugging and optimization: because of the complexity of the network architecture and the large number of parameters, globally debugging the entire network often requires significant time and computing resources, and feature representations that have already been learned effectively may be destroyed during debugging, degrading system performance. A new network debugging method is therefore needed to overcome these defects of the prior art and to provide a voice semantic recognition scheme and product that can efficiently process long voice sequences while maintaining high recognition accuracy.
Disclosure of Invention
The application aims to provide an AI-based voice semantic recognition method and a TWS earphone. The technical scheme of the embodiments of the application is realized as follows. In a first aspect, an embodiment of the application provides an AI-based speech semantic recognition method, including: acquiring a first voice binary group sample consisting of a first to-be-recognized source voice sequence and a target hearing aid voice matched with the first to-be-recognized source voice sequence; acquiring a machine learning network to be debugged, wherein the distribution vector mapping table of the machine learning network to be debugged comprises an initial distribution mapping table and an extended distribution mapping table, the voice coverage area of the initial distribution mapping table is the initial voice coverage area, and the extended distribution mapping table is a distribution mapping table for extending the initial distribution mapping table; extending the initial distribution mapping table based on the extended distribution mapping table so that the voice coverage area of the distribution vector mapping table extends from the initial voice coverage area to a target voice coverage area, where the target voice coverage area is larger than the initial voice coverage area; performing initial assignment on the network learnable variables of the machine learning network to be debugged to obtain a basic machine learning network, wherein the starting learnable variables of the extended distribution mapping table in the basic machine learning network are obtained by arbitrary assignment, the starting learnable variables of the remaining network learnable variables in the basic machine learning network are obtained by initial assignment from the network learnable variables of a pre-debugged public feature representation network, the remaining network learnable variables are all network learnable variables of the basic machine learning network other than the extended distribution mapping table, and the remaining network learnable variables include the initial distribution mapping table; debugging the extended distribution mapping table of the basic machine learning network based on the first voice binary group sample to obtain a target feature representation network; acquiring a target to-be-recognized source voice sequence and a plurality of candidate voice sequences to be processed; obtaining a first voice sequence feature of the target to-be-recognized source voice sequence based on the target feature representation network, and obtaining a second voice sequence feature of each candidate voice sequence to be processed based on the target feature representation network; respectively acquiring a sixth commonality metric coefficient between the first voice sequence feature and each second voice sequence feature; and outputting the candidate voice sequence to be processed corresponding to the second voice sequence feature with the largest sixth commonality metric coefficient.
As one embodiment, the debugging the extended distribution mapping table of the basic machine learning network based on the first voice tuple sample to obtain a target feature representation network includes: debugging an extended distribution mapping table of the basic machine learning network based on the first voice binary group sample to obtain a transition machine learning network; acquiring a second voice binary group sample consisting of a second to-be-recognized source voice sequence and target hearing-aid voice matched with the second to-be-recognized source voice sequence; and performing fine optimization on all network learnable variables of the transition machine learning network based on the second voice binary group sample to obtain the target feature representation network.
As an implementation manner, the number of first voice tuple samples is multiple, and each first voice tuple sample includes a first to-be-recognized source voice sequence and a target hearing aid voice matched with it. Debugging the extended distribution mapping table of the basic machine learning network based on the first voice tuple samples to obtain a transition machine learning network includes: for the first to-be-recognized source voice sequence in each first voice tuple sample, acquiring a positive voice example and a negative voice example of the first to-be-recognized source voice sequence, where the positive voice example is the target hearing aid voice belonging to the same first voice tuple sample as the first to-be-recognized source voice sequence, and the negative voice example is a target hearing aid voice belonging to a different first voice tuple sample; for each first to-be-recognized source voice sequence, outputting, based on the basic machine learning network, a first voice feature of the first to-be-recognized source voice sequence, a second voice feature of its positive voice example, and a third voice feature of its negative voice example; acquiring a first commonality metric coefficient between the first to-be-recognized source voice sequence and its positive voice example based on the first voice feature and the second voice feature, and acquiring a second commonality metric coefficient between the first to-be-recognized source voice sequence and its negative voice example based on the first voice feature and the third voice feature; generating a first evaluation function based on the first and second commonality metric coefficients; and debugging the extended distribution mapping table of the basic machine learning network based on the first evaluation function to obtain the transition machine learning network.
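The patent does not fix the exact form of the first evaluation function. As an illustrative sketch only, a contrastive (triplet-style) loss built from cosine similarity, used here as a stand-in for the commonality metric coefficient, could look like the following, with `f_src`, `f_pos`, `f_neg` standing for the first, second and third voice features of a batch; all names and the margin value are assumptions:

```python
import torch
import torch.nn.functional as F

def first_evaluation_function(f_src: torch.Tensor, f_pos: torch.Tensor,
                              f_neg: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Contrastive-style evaluation built from the two commonality coefficients.

    Inputs are (batch, dim) feature tensors produced by the basic machine
    learning network.
    """
    first_coeff = F.cosine_similarity(f_src, f_pos)    # source vs. positive example
    second_coeff = F.cosine_similarity(f_src, f_neg)   # source vs. negative example
    # Small when the positive example is closer to the source than the negative one.
    return F.relu(second_coeff - first_coeff + margin).mean()
```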
As one embodiment, the number of second speech binary group samples is multiple, and each second speech binary group sample includes a second to-be-recognized source speech sequence and a target hearing aid speech matched with it. Performing fine optimization on all network learnable variables of the transition machine learning network based on the second speech binary group samples to obtain the target feature representation network includes: for the second to-be-recognized source speech sequence in each second speech binary group sample, acquiring a positive speech example and a negative speech example of the second to-be-recognized source speech sequence, where the positive speech example is the target hearing aid speech belonging to the same second speech binary group sample as the second to-be-recognized source speech sequence, and the negative speech example is a target hearing aid speech belonging to a different second speech binary group sample; for each second to-be-recognized source speech sequence, outputting, based on the transition machine learning network, a fourth speech feature of the second to-be-recognized source speech sequence, a fifth speech feature of its positive speech example, and a sixth speech feature of its negative speech example; acquiring a third commonality metric coefficient between the second to-be-recognized source speech sequence and its positive speech example based on the fourth and fifth speech features, and acquiring a fourth commonality metric coefficient between the second to-be-recognized source speech sequence and its negative speech example based on the fourth and sixth speech features; determining a second evaluation function based on the third and fourth commonality metric coefficients; generating a target evaluation function based on the second evaluation function; and performing fine optimization on all network learnable variables of the transition machine learning network based on the target evaluation function to obtain the target feature representation network.
As one embodiment, the generating the target evaluation function based on the second evaluation function includes: acquiring the remaining network learnable variables of the transition machine learning network; determining a network-learnable variable evaluation function based on the starting learnable variable and the adjusted learnable variable of the remaining network-learnable variables; the target evaluation function is generated based on the second evaluation function and the network-learnable variable evaluation function.
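The network-learnable variable evaluation function is likewise not spelled out in the text. One plausible reading, sketched below under that assumption, is a penalty on how far the remaining learnable variables have drifted from their starting values, added to the second evaluation function; all names here are hypothetical:

```python
import torch

def target_evaluation_function(second_eval: torch.Tensor,
                               start_vars: dict,
                               current_vars: dict,
                               weight: float = 0.01) -> torch.Tensor:
    """Second evaluation plus a drift penalty on the remaining learnable variables.

    start_vars holds the starting values (captured before fine optimization),
    current_vars the adjusted values of the same variables.
    """
    drift = sum(((current_vars[k] - start_vars[k].detach()) ** 2).sum()
                for k in start_vars)
    return second_eval + weight * drift
```

The drift term discourages the fine optimization from destroying the feature representation already learned by the pre-debugged variables.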
As one embodiment, the voice coverage area of the target hearing aid voice matched with the first to-be-recognized source voice sequence is larger than the initial voice coverage area, and acquiring the first voice binary group sample consisting of the first to-be-recognized source voice sequence and the target hearing aid voice matched with it includes: determining the target hearing aid voice among voice sequences whose voice coverage area is larger than the initial voice coverage area, and determining a first to-be-recognized source voice sequence matched with the target hearing aid voice; and generating the first voice binary group sample based on the first to-be-recognized source voice sequence and the target hearing aid voice matched with it.
As one embodiment, determining the target hearing aid voice among voice sequences whose voice coverage area is larger than the initial voice coverage area includes: screening, from a public voice information base, voice sequences whose voice coverage area is larger than the initial voice coverage area as candidate voice sequences; and determining the target hearing aid speech based on the candidate voice sequences. Determining a first to-be-recognized source voice sequence matched with the target hearing aid voice includes: taking the to-be-recognized source voice sequence matched with the target hearing aid voice in the public voice information base as the first to-be-recognized source voice sequence matched with the target hearing aid voice.
As one embodiment, the number of candidate voice sequences is multiple, and determining the target hearing aid voice based on the candidate voice sequences includes: for each candidate voice sequence, determining the distribution of the core content of the candidate voice sequence within the candidate voice sequence, where the core content is the information matched with the first to-be-recognized source voice sequence corresponding to the candidate voice sequence; and if the distribution is the target distribution information, removing candidate voice sequences from the plurality of candidate voice sequences according to a set percentage to obtain the target hearing aid voice.
As an embodiment, determining the distribution of the core content of the candidate speech sequence within the candidate speech sequence includes: dividing the candidate speech sequence according to the initial voice coverage area to obtain a plurality of speech subsequences; respectively obtaining fifth commonality metric coefficients between the plurality of speech subsequences and the first to-be-recognized source speech sequence corresponding to the candidate speech sequence; and taking the position of the speech subsequence with the largest fifth commonality metric coefficient as the distribution of the core content within the candidate speech sequence.
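For illustration, the subsequence matching of this embodiment might be sketched as follows, assuming `encode` is the feature representation network, `candidate` a 1-D speech tensor, `source_feat` the feature of the matched source sequence, and `init_len` the number of samples in the initial voice coverage area; all of these names are placeholders:

```python
import torch
import torch.nn.functional as F

def core_content_position(candidate: torch.Tensor, source_feat: torch.Tensor,
                          encode, init_len: int) -> int:
    """Index of the subsequence of `candidate` most similar to the source sequence."""
    chunks = torch.split(candidate, init_len)        # subsequences of initial coverage
    fifth_coeffs = torch.stack(
        [F.cosine_similarity(encode(chunk), source_feat, dim=0) for chunk in chunks])
    return int(fifth_coeffs.argmax())                # position of the core content
```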
As one embodiment, determining the target hearing aid voice among voice sequences whose voice coverage area is larger than the initial voice coverage area includes: determining, from voice materials without matched hearing aid voice, a voice sequence whose voice coverage area is larger than the initial voice coverage area as the target hearing aid voice. Determining a first to-be-recognized source voice sequence matched with the target hearing aid voice includes: constructing, based on a voice generation network and according to a constraint template comprising the target hearing aid voice and a guiding command, the matched voice, thereby obtaining a first to-be-recognized source voice sequence matched with the target hearing aid voice.
In a second aspect, the application provides a TWS earphone comprising a memory and a processor, the memory storing a target feature representation network and a computer program executable on the processor, and the processor executing the computer program to implement the following steps: acquiring a target to-be-recognized source voice sequence and a plurality of candidate voice sequences to be processed; obtaining a first voice sequence feature of the target to-be-recognized source voice sequence based on the target feature representation network, and obtaining a second voice sequence feature of each candidate voice sequence to be processed based on the target feature representation network; respectively acquiring a sixth commonality metric coefficient between the first voice sequence feature and each second voice sequence feature; and outputting the candidate voice sequence to be processed corresponding to the second voice sequence feature with the largest sixth commonality metric coefficient. The debugging process of the target feature representation network comprises the following steps: acquiring a first voice binary group sample consisting of a first to-be-recognized source voice sequence and a target hearing aid voice matched with the first to-be-recognized source voice sequence; acquiring a machine learning network to be debugged, wherein the distribution vector mapping table of the machine learning network to be debugged comprises an initial distribution mapping table and an extended distribution mapping table, the voice coverage area of the initial distribution mapping table is the initial voice coverage area, and the extended distribution mapping table is a distribution mapping table for extending the initial distribution mapping table; extending the initial distribution mapping table based on the extended distribution mapping table so that the voice coverage area of the distribution vector mapping table extends from the initial voice coverage area to a target voice coverage area, where the target voice coverage area is larger than the initial voice coverage area; performing initial assignment on the network learnable variables of the machine learning network to be debugged to obtain a basic machine learning network, wherein the starting learnable variables of the extended distribution mapping table in the basic machine learning network are obtained by arbitrary assignment, the starting learnable variables of the remaining network learnable variables are obtained by initial assignment from the network learnable variables of a pre-debugged public feature representation network, the remaining network learnable variables are all network learnable variables of the basic machine learning network other than the extended distribution mapping table, and the remaining network learnable variables include the initial distribution mapping table; and debugging the extended distribution mapping table of the basic machine learning network based on the first voice binary group sample to obtain the target feature representation network.
The beneficial effects of the application include the following. The application provides an AI-based voice semantic recognition method and a TWS earphone. A target feature representation network is obtained through debugging; the respective feature representations of the input voice and the candidate voices are extracted and matched based on the target feature representation network, and the best-matching candidate voice is output. Long voice can thus be processed and converted into accurate, concise voice for output, improving the communication efficiency and quality for users. In the debugging stage of the target feature representation network, starting from the architecture of an existing machine learning network, the initial distribution mapping table in the distribution vector mapping table is extended through the extended distribution mapping table; the voice coverage area of the initial distribution mapping table is the initial voice coverage area, so the voice coverage area of the distribution vector mapping table extends from the initial voice coverage area to a target voice coverage area. Because the target voice coverage area is larger than the initial voice coverage area, the target feature representation network obtained by debugging the machine learning network to be debugged can process long voice sequences whose voice coverage area is larger than the initial voice coverage area. When the network is debugged, initial assignment is performed on the network learnable variables of the machine learning network to be debugged to obtain a basic machine learning network. The starting learnable variables of the extended distribution mapping table in the basic machine learning network are obtained by arbitrary assignment, while the starting learnable variables of the remaining network learnable variables are obtained by initial assignment from the network learnable variables of a pre-debugged public feature representation network; the remaining network learnable variables are all network learnable variables of the basic machine learning network other than the extended distribution mapping table, and they include the initial distribution mapping table. The pre-debugged remaining network learnable variables can be kept fixed during debugging, and only the extended distribution mapping table of the basic machine learning network is debugged based on the acquired first voice binary group sample to obtain the target feature representation network. In this way, the voice semantics learned by the pre-debugged feature representation network are preserved, and the network debugging speed is increased. The application extends the voice coverage area of the distribution vector mapping table of the network to be debugged and debugs only local network learnable variables of the machine learning network to be debugged, so that the maximum voice coverage area of the target feature representation network is enlarged while the semantic information learned from previously obtained voice sequences is preserved and debugging efficiency is ensured. Therefore, when a long voice sequence is processed, no sequence splitting or other segmentation is needed, so the semantic information of the voice sequence is more complete and the accuracy of the returned hearing aid voice is improved.
In the following description, other features will be partially set forth. Upon review of the ensuing disclosure and the accompanying figures, those skilled in the art will in part discover these features or will be able to ascertain them through production or use thereof. The features of the present application may be implemented and obtained by practicing or using the various aspects of the methods, tools, and combinations that are set forth in the detailed examples described below.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is an application scenario schematic diagram of an AI-based speech semantic recognition method according to an embodiment of the present application.
Fig. 2 is a flowchart of a voice semantic recognition method based on AI according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a TWS earphone according to an embodiment of the present application.
Detailed Description
The present application will be further described in detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description reference is made to "some embodiments," "as one implementation/scheme," "in one implementation," which describe a subset of all possible embodiments, but it is to be understood that "some embodiments," "as one implementation/scheme," "in one implementation," can be the same subset or different subsets of all possible embodiments, and can be combined with each other without conflict.
In the following description, the terms "first", "second", "third", and the like are used merely to distinguish similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", "third", and the like may be interchanged with a particular order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
The AI-based speech semantic recognition method provided by the embodiment of the application can be executed by a computing system in combination with TWS headphones (or terminal equipment), and the terminal equipment can be a notebook computer, a tablet computer, a desktop computer, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a special message device, a portable game device) and other types of terminals. The implementation process of the AI-based voice semantic recognition method comprises two stages of machine learning network debugging and application, wherein the machine learning network debugging stage is executed by adopting a computing system, and the application is executed by adopting TWS (time wavelength) headphones (or terminal equipment).
Fig. 1 is a schematic diagram of an application scenario of an AI-based speech semantic recognition method according to an embodiment of the present application. The application scenario provided by the embodiment of the application includes the terminal device 100, the network 200 and the TWS earphone 300, and the communication connection between the terminal device 100 and the TWS earphone 300 is realized through the network 200 (such as a bluetooth network). In the application process, the speaker voice signal (i.e. the source voice sequence) is received through the TWS earphone 300, then sent to the terminal device 100, processed through the target feature representation network in the terminal device 100, and the processing result is sent to the TWS earphone 300 for playing; or the terminal equipment 100 collects the voice signals of the interlocutors, processes the voice signals through the target feature representation network in the terminal equipment 100, and sends the processing results to the TWS earphone 300 for playing; or receives the voice signal of the interlocutor through the TWS earphone 300, processes the voice signal by the target feature representation network in the TWS earphone 300, and plays the processing result. The specific examples are not limited.
The embodiment of the application provides an AI-based voice semantic recognition method, as shown in fig. 2, which comprises the following steps:
Step S110, a first voice binary group sample composed of a first to-be-recognized source voice sequence and target hearing aid voice matched with the first to-be-recognized source voice sequence is obtained.
In step S110, specific speech data is acquired, which will be used in subsequent training to optimize the machine learning model. Specifically, the computer system needs to acquire two types of voice data in this step: the first to-be-recognized source voice sequence and the target hearing aid voice matched with it. The two are combined to form a first speech two-tuple sample. In practical applications, such sample collection typically occurs in certain usage scenarios, for example in hearing aid systems designed for hearing-impaired patients. A hearing-impaired patient may wear smart TWS (True Wireless Stereo) earphones, which not only provide conventional audio playback but can also receive and analyze speech signals in the environment in real time and then convey the information to the patient in a more concise and clear manner.
When the interlocutor speaks a longer segment of speech to the patient, that segment is recorded as the first to-be-recognized source voice sequence. This piece of speech may contain multiple sentences and complex information, and it may be difficult for a hearing-impaired patient to understand and memorize such a long passage. The TWS headset therefore receives the speech signal and processes it through an internal machine learning model. In the network debugging stage, in order to train and optimize the model, a simplified version consistent with the semantics of the original speech is required as supervision information, namely the target hearing aid speech. This simplified version is typically much shorter and clearer, containing only the core information of the original speech, so that the patient can quickly understand and respond.
For example, if the interlocutor says: "The weather today is really nice and sunny. Would you like to take a walk in the park with me?", this segment of speech is recorded as the first to-be-recognized source voice sequence, while the corresponding target hearing aid voice may be: "Nice weather today. Walk to the park?" This simplified version retains the primary intent of the original speech while removing redundant and modifying words.
In step S110, the computer system needs to collect a plurality of such first speech tuple samples for use in training the machine learning network in subsequent steps. The samples help the model learn how to extract key information from complex voice signals and generate simple and clear hearing-aid voices, so that the communication experience of hearing-impaired patients is improved.
Step S120, a machine learning network to be debugged is obtained, wherein a distribution vector mapping table of the machine learning network to be debugged includes an initial distribution mapping table and an extended distribution mapping table, the voice coverage areas of the initial distribution mapping table are initial voice coverage areas, and the extended distribution mapping table is a distribution mapping table for extending the initial distribution mapping table.
In step S120, a distribution vector mapping table is obtained, which converts the sequence position information of each frame of speech in the speech sequence into a vector representation that can be understood and used by the model.
The machine learning network to be debugged refers to a machine learning model or network that needs to be trained and adjusted. This network does not converge in the initial state and requires optimization of its performance by debugging (training). For example, a Deep Neural Network (DNN) may be used as a machine learning network to be debugged, by training to improve its accuracy in recognizing speech.
The distribution vector map is a data structure for converting sequence position information of each frame of speech in the speech sequence into a vector representation that can be understood and used by the model. In brief, it helps a machine learning network understand the temporal order and positional relationship in a speech sequence. For example, assume that there is a sequence of voices, each frame of voice corresponding to a point in time. The distribution vector map may convert these points in time into a series of vectors that are then input into the machine learning network, helping the network understand the timing structure of the speech.
The initial distribution mapping table is a part of a distribution vector mapping table specifically designed to cover an initial speech interval, i.e. to convert the position information in the initial speech sequence into vectors. For example, if the speech sequence is 5 seconds at maximum, the initial distribution map can handle the position transitions of all speech frames within these 5 seconds.
The extended distribution mapping table is an extension to the initial distribution mapping table for handling speech sequences that exceed the initial speech coverage area. When the length of the voice sequence exceeds the coverage of the initial distribution mapping table, the extended distribution mapping table is activated to ensure that the entire voice sequence is properly processed. For example, if the user's voice exceeds 5 seconds, say, 10 seconds, then the extended profile map is responsible for processing the additional 5 seconds of voice, converting its location information into a vector for use by the machine learning network.
The voice coverage area refers to the length or time range of a voice sequence that the distribution mapping table can handle. For the initial distribution mapping table, it has an initial voice coverage area; and an extended distribution mapping table is used to extend the coverage area. For example, if the initial distribution mapping table has a 5 second speech coverage, it can process a maximum of 5 seconds of speech sequence. If longer speech needs to be processed, this coverage area needs to be extended by means of an extended distribution mapping table.
Specifically, in step S120, the computer system obtains a machine learning network to be debugged (i.e., requiring training). This network has a key data structure, namely a distributed vector mapping table, which, as mentioned above, functions to convert the sequence position information of each frame of speech in the input speech sequence into a vector representation that the model can understand and use, also known as a position-coding embedding matrix.
In one application scenario, it is assumed that an intelligent speech hearing aid is being developed that is able to understand long-spread speech and perform hearing-aid speech return based on the speech. Because the speaker's voice may be very long, beyond the processing power of conventional voice recognition systems, a machine learning network capable of processing long sequences of voice is needed. In this network, the distribution vector map can convert the position information of each frame of speech in the speech sequence (e.g., what part of the speech sequence this frame is) into a vector form that the model can understand. This conversion process is necessary for machine learning models because it helps the models understand the order and context of speech, thereby more accurately recognizing and understanding the speech content.
The distribution vector mapping table consists of two parts: an initial distribution mapping table and an extended distribution mapping table. The initial distribution map is designed to cover an initial speech interval, i.e. the maximum length of the speech sequence it can handle. This length may be a temporal length or a number of sampling points, depending on the implementation of the network. For example, if the initial voice coverage interval is set to 10 seconds of voice, then the initial distribution map needs to be able to process all voice frames within the 10 seconds. However, the speaker's voice may exceed this initial interval, which is where the extended distribution mapping table functions. The extended distribution mapping table is designed to extend the capabilities of the initial distribution mapping table so that it can handle longer speech sequences. In this way, even if the user's voice is out of the original designed coverage, the machine learning network can still accurately recognize and understand.
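To make the role of the distribution vector mapping table concrete, the sketch below models it as a learnable position-embedding table looked up by frame index. This is an illustrative PyTorch-style reading rather than the patent's implementation, and the sizes (`init_len`, `embed_dim`) are assumptions:

```python
import torch
import torch.nn as nn

class DistributionVectorMap(nn.Module):
    """Position-embedding table: one learnable vector per speech-frame position."""
    def __init__(self, init_len: int = 500, embed_dim: int = 256):
        super().__init__()
        # Initial distribution mapping table: covers the initial voice
        # coverage area of init_len frames.
        self.initial_table = nn.Embedding(init_len, embed_dim)

    def forward(self, num_frames: int) -> torch.Tensor:
        # Convert each frame's sequence position into a vector the model can use.
        positions = torch.arange(num_frames)
        return self.initial_table(positions)          # (num_frames, embed_dim)

# Example: a 3-second utterance at 100 frames per second -> 300 position vectors.
pos_vectors = DistributionVectorMap()(300)
```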
Step S130, extending the initial distribution mapping table based on the extended distribution mapping table, so that the voice coverage area of the distribution vector mapping table extends from the initial voice coverage area to a target voice coverage area, where the target voice coverage area is larger than the initial voice coverage area.
In step S130, the initial distribution mapping table is extended to enlarge the voice coverage area so as to accommodate longer voice sequences. For example, suppose a speech recognition system for TWS earphones is being developed that needs to accurately recognize and respond to a user's long speech. In the initial phase, the machine learning network of the system may only be able to process short speech, such as speech within 5 seconds. However, a speaker may talk for 10 seconds or more in actual communication, which requires the system to process longer speech sequences. To achieve this, the initial distribution mapping table is extended in step S130. Specifically, the extended distribution mapping table is used to extend the voice coverage area of the initial distribution mapping table: the initial distribution mapping table would otherwise cover 5 seconds of speech, but after extension the new distribution vector mapping table can cover longer speech, say 10 seconds or more.
This extension process is to represent longer speech sequence position information by adding additional vectors on the basis of the initial distribution mapping table. These additional vectors are generated from the extended distribution map. The extended distribution vector mapping table will have a larger speech coverage area, thereby enabling longer speech to be processed. In this way, the TWS headset voice recognition system can better adapt to the actual demands of users, and accurate recognition and response can be achieved whether the voice is a phrase or a long voice. The TWS earphone has the advantages that the user experience is greatly improved, and the TWS earphone has stronger competitiveness and practicability in the field of voice recognition.
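Continuing the same illustrative reading, the extension of step S130 can be pictured as appending extra embedding rows after the initial table so that every frame of a longer sequence still has a position vector; the sizes are again assumptions:

```python
import torch
import torch.nn as nn

init_len, ext_len, embed_dim = 500, 500, 256        # assumed sizes

initial_table = nn.Embedding(init_len, embed_dim)    # initial voice coverage area
extended_table = nn.Embedding(ext_len, embed_dim)    # extension rows

def position_vectors(num_frames: int) -> torch.Tensor:
    """Look up frame positions in the combined (initial + extended) table."""
    # Concatenation grows the voice coverage area from init_len frames to
    # init_len + ext_len frames, i.e. to the target voice coverage area.
    full_table = torch.cat([initial_table.weight, extended_table.weight], dim=0)
    return full_table[:num_frames]

vecs = position_vectors(750)      # a 750-frame sequence is now fully covered
```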
Step S140, performing initial assignment on the network learnable variables of the machine learning network to be debugged, to obtain a basic machine learning network, where initial learnable variables of an extended distribution mapping table in the basic machine learning network are obtained by performing arbitrary assignment, initial learnable variables of remaining network learnable variables in the basic machine learning network are obtained by performing initial assignment on network learnable variables of a pre-debugging completed public feature representation network, the remaining network learnable variables are network learnable variables except for the extended distribution mapping table in all network learnable variables of the basic machine learning network, and the remaining network learnable variables include the initial distribution mapping table.
Specifically, network-learnable variables in a machine learning network to be debugged are initialized. These network-learnable variables are parameters that the machine learning network needs to adjust during the training process, such as weights, biases, learning rates, etc., that are trained to optimize the performance of the network.
In step S140, the initial learnable variables of the extended distribution mapping table are arbitrarily assigned, i.e. randomly generated, because the extended distribution mapping table is used to process speech sequences beyond the initial speech coverage, and this part of the data may not be sufficient or diverse at the beginning of training, and thus these variables are initialized by arbitrary assignment for adjustment and optimization in the subsequent training process. For the remaining network-learnable variables in the underlying machine-learning network, their starting learnable variables are initially assigned based on the network-learnable variables of the pre-debugging (i.e., pre-train) completed public feature representation network. This means that the variables are initialized with existing, pre-trained parameters of the common feature representation network, such as the open-source feature representation model. This may speed training and improve network performance because these pre-trained parameters already contain some feature extraction and representation capabilities.
For example, assume a machine learning network for speech recognition is to be initialized. In this network, an extended distribution mapping table is used to handle longer speech sequences. In the initialization phase, the computing system may arbitrarily assign values to the variables of the extended distribution mapping table, for example initializing them with random numbers. Other parts of the network, such as the initial distribution mapping table and the other network layers, can be initialized by reference to the parameters of a trained public feature representation network (such as an open-source speech recognition model). In this way, a basic machine learning network with a certain level of performance can be quickly built, laying a solid foundation for subsequent training and optimization. This initialization method not only improves training efficiency but also allows the network to adapt better to the specific speech recognition task.
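A minimal sketch of the initial assignment in step S140 follows, assuming the pre-debugged public feature representation network is available as a parameter dictionary (`pretrained_state`) and that the extension rows can be identified by the name `extended_table`; both are assumptions for illustration:

```python
import torch
import torch.nn as nn

def build_base_network(model: nn.Module, pretrained_state: dict) -> nn.Module:
    """Initial assignment of step S140 (illustrative)."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "extended_table" in name:
                # Extension rows: arbitrary (random) assignment, learned later.
                nn.init.normal_(param, mean=0.0, std=0.02)
            elif name in pretrained_state:
                # Remaining learnable variables, including the initial table:
                # copied from the pre-debugged public feature network.
                param.copy_(pretrained_state[name])
    return model
```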
Step S150, based on the first speech binary group sample, debugging the extended distribution mapping table of the basic machine learning network to obtain a target feature representation network.
Step S150 is performed after the initialization of the underlying machine learning network has been completed, primarily for the purpose of debugging the extended distribution mapping table to obtain a more accurate and efficient target feature representation network. This is achieved primarily by training the network using the first speech tuple samples.
Specifically, the extended distribution mapping table in the basic machine learning network is trained using the first speech tuple samples. The goal of the training is to optimize the parameters of the extended distribution mapping table so that the network can better handle long voice sequences beyond the initial voice coverage area. During training, the voice samples are input into the network, and the weights and biases of the extended distribution mapping table are adjusted through a back-propagation algorithm to minimize the difference between the network's prediction and the real label. This process is repeated until the performance of the network reaches a preset standard or the number of training rounds reaches a preset upper limit.
For example, suppose a machine learning network for long speech recognition is being developed. In step S150, the extended distribution mapping table is trained using the first speech tuple samples. Through continuous iteration and optimization, the parameters of the extended distribution mapping table are gradually adjusted towards an optimal state, so that the network can more accurately recognize the information in a long voice sequence. After training is completed, the target feature representation network is obtained; this network not only contains the optimized extended distribution mapping table but also retains the initial distribution mapping table and the pre-trained parameters of the other network layers. This structure enables TWS (True Wireless Stereo) earphones embedded with the target feature representation network to process voice inputs of various lengths more efficiently, improving user experience and recognition accuracy.
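The training of step S150 can be sketched as follows, assuming the first voice two-tuple samples are served as (source, positive, negative) triples and that cosine similarity stands in for the commonality metric coefficient; parameter names and the margin are placeholders, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def debug_extended_table(base_net, loader, epochs: int = 3, lr: float = 1e-4):
    """Step S150 (illustrative): train only the extended distribution mapping table."""
    # Keep the pre-debugged variables fixed; only the extension rows stay trainable.
    for name, param in base_net.named_parameters():
        param.requires_grad = "extended_table" in name
    optimizer = torch.optim.Adam(
        [p for p in base_net.parameters() if p.requires_grad], lr=lr)

    for _ in range(epochs):
        for src, pos, neg in loader:           # first voice two-tuple samples
            f_src, f_pos, f_neg = base_net(src), base_net(pos), base_net(neg)
            # Pull the matched hearing aid voice closer than the mismatched one.
            loss = F.relu(F.cosine_similarity(f_src, f_neg)
                          - F.cosine_similarity(f_src, f_pos) + 0.2).mean()
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return base_net        # transition / target feature representation network
```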
Steps S110-S150 complete the debugging process of the target feature representation network. Starting from the architecture of an existing machine learning network, the initial distribution mapping table in the distribution vector mapping table is extended through the extended distribution mapping table; the voice coverage area of the initial distribution mapping table is the initial voice coverage area, so the voice coverage area of the distribution vector mapping table extends from the initial voice coverage area to the target voice coverage area. Because the target voice coverage area is larger than the initial voice coverage area, the target feature representation network obtained by debugging can process long voice sequences whose voice coverage area is larger than the initial voice coverage area. When the network is debugged, initial assignment is performed on the network learnable variables of the machine learning network to be debugged to obtain a basic machine learning network. The starting learnable variables of the extended distribution mapping table in the basic machine learning network are obtained by arbitrary assignment, while the starting learnable variables of the remaining network learnable variables are obtained by initial assignment from the network learnable variables of a pre-debugged public feature representation network; the remaining network learnable variables are all network learnable variables of the basic machine learning network other than the extended distribution mapping table, and they include the initial distribution mapping table. The pre-debugged remaining network learnable variables can be kept fixed during debugging, and only the extended distribution mapping table of the basic machine learning network is debugged based on the acquired first voice binary group sample to obtain the target feature representation network. In this way, the voice semantics learned by the pre-debugged feature representation network are preserved and the network debugging speed is increased. The application extends the voice coverage area of the distribution vector mapping table of the network to be debugged and debugs only local network learnable variables, so that the maximum voice coverage area of the target feature representation network is enlarged while the semantic information learned from previously obtained voice sequences is preserved and debugging efficiency is ensured. Therefore, when a long voice sequence is processed, no sequence splitting or other segmentation is needed, so the semantic information of the voice sequence is more complete and the accuracy of the returned hearing aid voice is improved.
The following steps S160 to S190 are processes of completing speech semantic recognition for speech response by applying the target feature representation network, where the target feature representation network may be installed in the TWS earphone, or may be installed in a terminal device that is communicatively connected to the TWS earphone, for example, in a smart phone, and will be described below by taking installation in the TWS earphone as an example. Specifically, the process includes:
step S160, a target to-be-identified source voice sequence and a plurality of candidate voice sequences to be processed are obtained.
In step S160, the TWS earphone receives, via its built-in microphone or other sound capturing device, speech signals from the external environment, which signals constitute a target to-be-identified source speech sequence. At the same time, a plurality of candidate speech sequences to be processed are obtained, which may be speech content that is simplified or refined in different ways. The candidate speech sequences to be processed may be obtained from other devices of the network, such as a server or terminal device, for example, via a cell phone connected to the TWS headset Bluetooth.
In one example scenario, assume that a hearing-impaired person wearing a TWS headset is talking with others. When the other party speaks a relatively complex passage, the TWS headset first captures the complete voice signal, namely the target to-be-recognized source voice sequence. Because hearing-impaired people may have difficulty understanding lengthy or complex sentences, the TWS headset requires an internal processing mechanism to reduce the speech to a more understandable version.
In this process, the TWS headset generates a plurality of simplified candidate speech sequences according to a predetermined algorithm or model. These candidate speech sequences may contain the main information of the source speech but express it in a more concise and direct manner. For example, if the source speech is "Let's meet at the library entrance at three o'clock tomorrow afternoon; we need to discuss the project plan for next week," one of the candidate speech sequences might be "Tomorrow 3 pm, library, discuss the project plan."
The candidate voice sequences can be obtained from a preset voice library and then generated, and in practical application, the TWS earphone can select the most suitable candidate voice sequence according to the preference and the setting of the user and play the most suitable candidate voice sequence to the hearing impaired, so that the hearing impaired can better understand and respond to voice information of the opposite party.
Step S170, obtaining a first speech sequence feature of the target to-be-recognized source speech sequence based on the target feature representation network, and obtaining a second speech sequence feature of each of the candidate speech sequences to be processed based on the target feature representation network.
In step S170, the TWS headset processes the target to-be-identified source speech sequence and each candidate speech sequence to be processed, respectively, using the target feature representation network, thereby extracting their feature representations.
The target feature representation network is obtained through the debugging in steps S110-S150 and serves as the machine learning model used to extract voice features. For the target to-be-recognized source voice sequence, the TWS earphone inputs it into the target feature representation network, which outputs a first voice sequence feature, such as a voice vector; this vector captures the key acoustic characteristics and linguistic information in the source voice sequence and is the basis for subsequent voice matching and recognition.
Likewise, for each candidate speech sequence to be processed, the TWS headset inputs it into the target feature representation network, respectively, resulting in a second speech sequence feature corresponding to each candidate speech sequence. These feature vectors will be used in the subsequent similarity calculation and matching process.
Step S180, obtaining sixth commonality measurement coefficients between the first speech sequence feature and each of the second speech sequence features.
In step S180, the TWS earphone calculates a common metric coefficient between the first speech sequence feature of the target to-be-identified source speech sequence and the second speech sequence feature of each candidate speech sequence to be processed, which reflects the degree of similarity between the two sets of features.
For example, assume that the TWS headset has extracted features of a target to-be-identified source speech sequence (first speech sequence features) and features of a plurality of candidate speech sequences to be processed (second speech sequence features) through a target feature representation network. To determine which candidate speech sequence is most similar to the target to-be-identified source speech sequence, i.e., most accurately conveys the primary content of the source speech, the TWS headset calculates a commonality metric between these features.
The commonality metric coefficient may be calculated by various algorithms, such as cosine similarity, euclidean distance, or pearson correlation coefficient. Taking cosine similarity as an example, it measures the similarity of two vectors in direction, with a value range between-1 and 1. The closer the value is to 1, the closer the direction of the two vectors is, i.e. the higher the similarity of the two speech sequences in content. In actual operation, the TWS headset calculates a commonality metric between the first speech sequence feature and each of the second speech sequence features, respectively. For example, if there is a candidate speech sequence whose feature vector is very close in direction to that of the target to-be-identified source speech sequence, then the coefficient of commonality between them will be high, meaning that this candidate speech sequence is likely to accurately convey the primary content of the source speech.
In this way, the TWS headset can objectively evaluate the similarity of each candidate speech sequence to the target to-be-identified source speech sequence, thereby providing data support for subsequent selection of the best candidate speech sequence. This is particularly important for users such as hearing impaired people who require voice assistance because it helps them understand the conversation more accurately, improving communication efficiency.
Step S190, outputting the candidate voice sequence to be processed corresponding to the second voice sequence feature with the largest sixth commonality measurement coefficient.
In step S190, the TWS earphone finds the second speech sequence feature with the largest sixth commonality metric coefficient according to the sixth commonality metric coefficient calculated previously, and the candidate speech sequence to be processed corresponding to the feature is the sequence most similar to the target to-be-identified source speech sequence.
For example, assume that the TWS headset has calculated a coefficient of commonality metric between a plurality of candidate speech sequences and a target to-be-identified source speech sequence. These coefficients reflect the degree of similarity of each candidate speech sequence to the target speech sequence. The TWS headset now needs to select the most similar one of these candidate speech sequences to output. In performing step S190, the TWS earphone first compares the magnitudes of all the commonality metric coefficients, finding the largest one thereof. The second speech sequence feature corresponding to this largest common metric coefficient represents the candidate speech sequence most similar to the target to-be-identified source speech sequence. Next, the TWS earpiece outputs this most similar candidate speech sequence.
For example, suppose the target to-be-identified source speech sequence is "please meet at the library gate at three pm", while the candidate speech sequences include "see you at the library at three pm", "see you at the library gate at four pm", and other, less relevant sequences. By calculating the commonality metric coefficients, the TWS earphone finds that the candidate speech sequence "see you at the library at three pm" has the highest similarity with the target speech sequence. Thus, in step S190, the TWS headset will select and output this candidate speech sequence. This output may be in the form of speech that is played through the speakers of the TWS headset. In this way, the user can clearly receive the simplified voice content with the main information retained, a function that can greatly improve communication efficiency, especially in a noisy environment or in situations where the other party's intention needs to be understood quickly.
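A minimal sketch of steps S180 and S190 taken together might look as follows; the candidate texts and feature vectors are invented for illustration, and cosine similarity is used as one possible choice of commonality metric.

```python
import numpy as np

def commonality(a, b):
    # One possible sixth commonality metric coefficient: cosine similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

first_feature = np.array([0.80, 0.30, 0.50])                 # target to-be-identified source speech
candidates = {                                               # hypothetical candidate speech sequences
    "see you at the library at three pm": np.array([0.78, 0.32, 0.49]),
    "see you at the library gate at four pm": np.array([0.60, 0.45, 0.30]),
    "unrelated utterance": np.array([0.10, 0.90, 0.20]),
}

scores = {text: commonality(first_feature, feat) for text, feat in candidates.items()}
best_candidate = max(scores, key=scores.get)                 # largest commonality metric coefficient
print(best_candidate)                                        # this sequence would be played back to the user
```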
As an implementation manner, the step S150 of debugging the extended distribution mapping table of the basic machine learning network based on the first speech tuple sample to obtain the target feature representation network may specifically include:
step S151, based on the first speech tuple sample, debugging an extended distribution mapping table of the basic machine learning network to obtain a transitional machine learning network.
In step S151, the computer system uses the first speech tuple sample to debug-train the extended distribution mapping table of the underlying machine learning network, so that the network can learn and understand the mapping relationship from the source speech to the target hearing aid speech more accurately. In the debugging process, the computer system continuously adjusts parameters and mapping relations of the network according to the sample data so as to improve the recognition and conversion capability of the network to the voice characteristics. This process continues to obtain a transitional machine learning network (intermediate network) that has a preliminary capability in recognizing and processing speech features, but requires further optimization and tuning.
As an implementation manner, the number of the first voice binary group samples is multiple, and each first voice binary group sample comprises a first to-be-identified source voice sequence and target hearing aid voice matched with the first to-be-identified source voice sequence. In the embodiment of the present application, step S151, based on the first speech tuple sample, debugs the extended distribution mapping table of the basic machine learning network to obtain a transitional machine learning network, which may specifically include:
Step S1511, for a first to-be-identified source voice sequence in each of the first voice binary group samples, acquiring an active voice example and a passive voice example of the first to-be-identified source voice sequence, where the active voice example of the first to-be-identified source voice sequence is a target hearing assistance voice belonging to the same first voice binary group sample as the first to-be-identified source voice sequence, and the passive voice example of the first to-be-identified source voice sequence is a target hearing assistance voice belonging to a different first voice binary group sample from the first to-be-identified source voice sequence.
In step S1511, the computer system performs the following operations for each first speech tuple sample:
1. Acquisition of positive speech examples (positive samples):
A positive speech example is the target hearing assistance speech that matches the first to-be-identified source speech sequence. In other words, it is the correct output or label for the source speech sequence. For example, assume that the first to-be-identified source speech sequence is "please open the window" and that the matching target hearing aid speech is "open the window". In this example, "open the window" is the positive speech example of the source speech sequence "please open the window".
2. Acquisition of negative speech examples (negative samples):
A negative speech example is target hearing assistance speech from a first speech tuple sample other than the one containing the current first to-be-identified source speech sequence. Continuing with the example above, if the source speech sequence of another first speech tuple sample is "please close the door", its target hearing assistance speech is "close the door". In this case, "close the door" can be used as a negative speech example of the source speech sequence "please open the window".
In practice, the computer system may randomly select negative speech examples from a large number of first speech tuple samples or according to some policy to increase the generalization ability and robustness of the model.
In this way, step S1511 provides two key training references for each first to-be-identified source speech sequence: one is the correct output (active speech example) for guiding the model to learn the correct mapping; the other is erroneous output (negative speech example) to help the model distinguish between different speech sequences and to improve its recognition. The application of the method can significantly improve the performance of the machine learning network in speech recognition and conversion tasks.
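The pairing of each first to-be-identified source speech sequence with its positive example and randomly drawn negative examples could be sketched as below; the sample sentences and the random-sampling policy are illustrative assumptions rather than a prescribed procedure.

```python
import random

# Hypothetical first speech two-tuple samples: (source speech sequence, matched target hearing-aid speech).
samples = [
    ("please open the window", "open the window"),
    ("please close the door", "close the door"),
    ("please turn on the light", "turn on the light"),
]

def build_triples(samples, num_negatives=1, seed=0):
    """For every source sequence, keep its own target as the positive example and
    draw targets of the other samples as negative examples."""
    rng = random.Random(seed)
    triples = []
    for i, (source, positive) in enumerate(samples):
        other_targets = [t for j, (_, t) in enumerate(samples) if j != i]
        negatives = rng.sample(other_targets, k=min(num_negatives, len(other_targets)))
        triples.append((source, positive, negatives))
    return triples

for source, positive, negatives in build_triples(samples):
    print(source, "->", positive, "| negatives:", negatives)
```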
Step S1512, for each first to-be-identified source voice sequence, outputting a first voice feature of the first to-be-identified source voice sequence, a second voice feature of a positive voice example of the first to-be-identified source voice sequence and a third voice feature of a negative voice example of the first to-be-identified source voice sequence based on the basic machine learning network.
In step S1512, the computer system performs feature extraction on each first to-be-identified source speech sequence and its corresponding positive and negative speech examples using the underlying machine learning network. Specifically, the following operations are performed:
1. Extracting first voice characteristics of a first to-be-identified source voice sequence:
The first speech feature refers to data extracted from the source speech sequence that is representative of its acoustic characteristics. For example, assuming that the first to-be-identified source speech sequence is the speech signal corresponding to "please open the window", the computer system may extract features such as pitch, timbre, and speech speed through the underlying machine learning network, and encode them as a feature vector, such as [0.8, 0.3, 0.5, ...], which is the first speech feature of the speech sequence.
2. Extracting second speech features of the positive speech examples:
A positive speech example is the target hearing-aid speech that matches the source speech sequence, and its speech features are also extracted by the computer system. Continuing with the example above, "open the window" is the target hearing aid voice that matches "please open the window"; the system extracts the corresponding features from this voice, such as pitch, articulation, etc., and forms a second voice feature vector, such as [0.7, 0.4, 0.6, ...].
3. Extracting third speech features of the negative speech examples:
A negative speech example is target hearing-aid speech from a different speech two-tuple sample, and the computer system also extracts its features. Assuming that the negative speech example is "close the door", the computer system extracts its characteristic speech features, such as pitch fluctuation, loudness, etc., from this segment of speech and forms a third speech feature vector, such as [0.2, 0.9, 0.1, ...]. These feature vectors are the basis for subsequent processing and learning by the machine learning model. By comparing and analyzing these feature vectors, the model is able to learn how to extract useful information from the complex speech signal and to perform speech recognition and conversion accordingly, as sketched below.
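A toy PyTorch encoder illustrating this kind of feature extraction is sketched here; the GRU architecture, the MFCC dimensionality, and the random input frames are all assumptions standing in for the basic machine learning network.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Stand-in for the basic machine learning network: maps a frame sequence to one feature vector."""
    def __init__(self, n_mfcc: int = 13, hidden: int = 32, feat_dim: int = 8):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:   # frames: (batch, time, n_mfcc)
        _, last_hidden = self.rnn(frames)                      # last_hidden: (1, batch, hidden)
        return self.proj(last_hidden[-1])                      # (batch, feat_dim)

encoder = SpeechEncoder()
source_frames   = torch.randn(1, 120, 13)   # "please open the window" (simulated MFCC frames)
positive_frames = torch.randn(1, 60, 13)    # "open the window"
negative_frames = torch.randn(1, 70, 13)    # "close the door"

first_speech_feature  = encoder(source_frames)    # first speech feature
second_speech_feature = encoder(positive_frames)  # second speech feature (positive example)
third_speech_feature  = encoder(negative_frames)  # third speech feature (negative example)
```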
Step S1513, based on the first voice feature and the second voice feature, acquiring a first commonality metric coefficient between the first to-be-identified source voice sequence and the positive voice example of the first to-be-identified source voice sequence, and based on the first voice feature and the third voice feature, acquiring a second commonality metric coefficient between the first to-be-identified source voice sequence and the negative voice example of the first to-be-identified source voice sequence.
In step S1513, the computer system calculates the similarity using the previously extracted speech features. Specifically, the following operations are performed:
First, a first commonality metric coefficient between the first to-be-recognized source voice sequence and its positive voice example is calculated, i.e., a similarity evaluation result:
The first commonality metric coefficient reflects the degree of association between the source speech sequence and the target hearing assistance speech that it matches. For example, if the first to-be-identified source speech sequence is "please open the window" and its positive speech example is "open the window", the computer system compares the first speech feature and the second speech feature of the two, covering aspects such as pitch and speech speed, and calculates a value to represent the similarity between them. This value is the first commonality metric coefficient, which may be a value between 0 and 1, with values closer to 1 indicating greater similarity.
Then, a second commonality metric coefficient between the first to-be-identified source voice sequence and its negative voice example is calculated:
The second commonality metric coefficient reflects the degree of similarity between the source speech sequence and target hearing assistance speech from other speech two-tuple samples. Assuming that the negative speech example is "close the door", the computer system compares the first speech feature of "please open the window" with the third speech feature of "close the door", and likewise calculates a similarity value. This value is the second commonality metric coefficient, which should also be a value between 0 and 1, but typically it will be smaller than the first commonality metric coefficient, since the degree to which the negative speech example matches the source speech sequence should be lower than that of the positive speech example.
In practical applications, the method for calculating the similarity may include cosine similarity, Euclidean distance, and the like. For example, cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. The computer system uses the extracted speech feature vectors to calculate these similarity coefficients according to specific algorithms and formulas.
These commonality metrics play an important role in the subsequent training and optimization process, and they help the system to understand which features are critical to distinguishing between different speech sequences, thereby improving the accuracy of speech recognition and conversion. In a hearing-aid scene, accurate similarity calculation is helpful for a system to better recognize a voice instruction of a user and convert the voice instruction into clear and understandable target hearing-aid voice, so that user experience is improved.
Step S1514 generates a first evaluation function based on the first and second commonality metric coefficients.
In step S1514, an evaluation function, also called a loss function, is generated, which is used to quantify the accuracy of the model prediction. In this step, the computer system constructs an evaluation function using the previously calculated first commonality measure coefficient (similarity between the source speech sequence and its positive speech examples) and the second commonality measure coefficient (similarity between the source speech sequence and its negative speech examples). The main purpose of this evaluation function is to guide the training of the machine learning model so that it can better learn the mapping relationship from the source speech sequence to the target hearing aid speech.
Specifically, the design of the evaluation function (loss function) is generally based on the following principles:
1. Maximizing the similarity between a source speech sequence and its positive speech examples: this means that the loss function should give a favorable evaluation (i.e. a low loss value) when the network is able to correctly map the source speech sequence to its corresponding target hearing aid speech.
2. Minimizing the similarity between the source speech sequence and its negative speech examples: this means that the model should be able to distinguish between different speech sequences, avoiding erroneous mapping of the source speech sequence onto other, uncorrelated target hearing aid speech. When the model makes such a false mapping, the loss function should give an unfavorable evaluation (i.e., a higher loss value).
By way of specific example, assume that there is a source speech sequence "please open a window", with the positive speech example being "open a window" and the negative speech example being "close a door". If the model is able to accurately map "please open a window" to "open a window", the first commonality metric coefficient will be high, at which point the loss function calculates a lower loss value, indicating that the prediction of the model is accurate. Conversely, if the model erroneously maps "please open the window" to "close the door," the second commonality metric coefficient would be relatively high (although it should still be lower than the first commonality metric coefficient), at which point the loss function calculates a high loss value to reflect the prediction error of the model. In practical applications, the specific form of the loss function may vary from model to model and from application scenario to application scenario. For example, in deep learning, common loss functions include mean square error loss, cross entropy loss, and the like. These loss functions are all calculated from the differences between the predicted output of the model and the actual labels, thereby guiding the training and optimization process of the model.
Illustratively, the first evaluation function may be the following formula:
L1 = -(1/d) · Σ_{x=1}^{d} log[ exp(S(wx, zx)) / ( exp(S(wx, zx)) + Σ_{y} exp(S(wx, zxy')) ) ]
wherein L1 is the first evaluation function, d is the number of first voice tuple samples, wx is the first to-be-recognized source voice sequence in the x-th first voice tuple sample, zx is the target hearing-aid voice in the x-th first voice tuple sample, that is, the positive voice example of the first to-be-recognized source voice sequence in the x-th first voice tuple sample, x ≤ d, zxy' is the y-th negative voice example corresponding to wx, and y ≤ d. S(wx, zx) is the first commonality metric coefficient and S(wx, zxy') is the second commonality metric coefficient.
In the embodiment of the application, the loss function is carefully designed and the first commonality measurement coefficient and the second commonality measurement coefficient are utilized for training, so that the machine learning model can be helped to more accurately identify and convert the voice command of the user, and the performance and the user experience of the hearing-aid device are improved.
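One concrete realisation of such a contrastive first evaluation function is the InfoNCE-style loss sketched below; the batch construction (each row's positive is the same-index target, all other rows act as its negatives) and the temperature parameter are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def first_evaluation_function(source_feats: torch.Tensor,
                              target_feats: torch.Tensor,
                              temperature: float = 1.0) -> torch.Tensor:
    """source_feats, target_feats: (d, feat_dim). Row x of target_feats is the positive
    example of row x of source_feats; the other rows serve as its negative examples."""
    source = F.normalize(source_feats, dim=-1)
    target = F.normalize(target_feats, dim=-1)
    sim = source @ target.t() / temperature        # S(wx, z): d x d commonality matrix
    labels = torch.arange(sim.size(0))             # the diagonal entries are the positives
    return F.cross_entropy(sim, labels)            # mean of -log softmax of the positive entry per row

d, feat_dim = 4, 8
loss = first_evaluation_function(torch.randn(d, feat_dim), torch.randn(d, feat_dim))
```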
Step S1515, debugging the extended distribution mapping table of the basic machine learning network based on the first evaluation function, to obtain the transition machine learning network.
In step S1515, the computer system debugs the extended distribution map of the underlying machine learning network using the previously constructed first evaluation function. Specifically, the parameters in the extended distribution mapping table are fine-tuned according to the loss value calculated by the first evaluation function. These parameters may include weights, biases, etc., which directly affect the prediction results of the model. The goal of the debugging is to find a set of parameters that minimize the loss value of the model on the training data.
To illustrate with a specific scenario, assume that the basic machine learning network is a deep neural network used to map the voice command of the user to the corresponding target hearing assistance voice. During the training process, the computer system trains this network using a large number of voice data pairs. Each pair of voice data includes a source voice sequence and a target hearing assistance voice. The computer system first extracts the voice characteristics and obtains a preliminary mapping relation through the basic machine learning network. Then, the computer system debugs the parameters in the extended distribution mapping table according to the loss value calculated by the first evaluation function, so that the predicted result of the basic machine learning network is closer to the target hearing aid voice. During the debugging process, the computer system may update the parameters in the mapping table by using an optimization algorithm such as gradient descent. Through multiple rounds of iterative training and adjustment, the computer system finally obtains an optimized transitional machine learning network. This network may have better predictive performance than the underlying machine learning network and may be able to more accurately recognize and translate the user's voice instructions.
In practical application of hearing aid devices, the performance improvement of the transitional machine learning network means that users can interact with the devices more smoothly and naturally, so that user experience and satisfaction are improved. For example, when a speaker makes a "please open a window" voice, the optimized transitional machine learning network can more accurately recognize this voice and generate a clear, intelligible target hearing assistance voice "open a window" to help hearing impaired patients quickly understand the speaker's intent.
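The table-only debugging of step S1515 can be pictured with the PyTorch sketch below, which freezes the front-debugged variables and runs gradient descent on the extended distribution mapping table alone; the layer sizes, the toy forward pass, and the placeholder loss are invented purely for illustration.

```python
import torch

encoder = torch.nn.Linear(13, 8)                             # stands in for the pre-debugged variables
for p in encoder.parameters():
    p.requires_grad = False                                  # remaining network learnable variables stay fixed

initial_table  = torch.randn(512, 8)                         # initial distribution mapping table (kept fixed)
extended_table = torch.nn.Parameter(torch.randn(1536, 8))    # extended table, arbitrarily initialised

optimizer = torch.optim.SGD([extended_table], lr=1e-3)       # gradient descent on the extended table only

for _ in range(3):                                           # a few illustrative debugging iterations
    frames = torch.randn(4, 13)                              # simulated speech frames
    feats = encoder(frames) + extended_table[:4]             # toy forward pass that consumes the table
    loss = feats.pow(2).mean()                               # placeholder for the first evaluation function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```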
Step S152, a second voice binary group sample composed of a second to-be-recognized source voice sequence and target hearing aid voice matched with the second to-be-recognized source voice sequence is obtained.
The data structure of the second speech binary group sample may refer to the first speech binary group sample, and will not be described herein.
And step S153, performing fine optimization on all network learnable variables of the transition machine learning network based on the second voice binary group sample to obtain the target feature representation network.
In step S153, the computer system fine-tunes all network learnable variables of the transitional machine learning network using the second speech tuple sample. These network-learnable variables include parameters such as weights, biases, etc. of the model, which have an important impact on the performance of the model. The goal of the fine tuning is to achieve better performance of the transitional machine learning network on the new data set (i.e., the second speech two-tuple sample) by making small adjustments to these parameters.
Specifically, the computer system first loads the transitional machine learning network and uses the second speech tuple sample as input data. The computer system then calculates the gap between the predicted and actual results of the model under the current parameters, e.g., as measured by an evaluation function. The computer system then uses an optimization algorithm (e.g., gradient descent) to update the learnable variables of the transitional machine learning network to minimize the value of the loss function. This process is iterated until the performance of the transitional machine learning network reaches a preset standard or the number of iterations reaches an upper limit.
For example, suppose that a user says "please help me open the window" in a voice that is slurred or fast. Prior to fine tuning, the transitional machine learning network may not accurately recognize the speech signal to which this utterance corresponds. By fine tuning with second speech tuple samples containing such speech, the transitional machine learning network can learn the pronunciation characteristics and speech-speed habits of the speech, thereby more accurately recognizing and converting it. Finally, the fine-tuned model is used as the target feature representation network, so that the user's voice can be recognized and converted more accurately, and the performance and user experience of the equipment (such as the TWS earphone) applying this network are improved.
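The fine optimisation of step S153 could then proceed roughly as below, unfreezing every learnable variable of the transitional network and taking small gradient steps on the second speech two-tuple samples; the network shape, learning rate, and MSE objective are placeholder assumptions.

```python
import torch

transitional_network = torch.nn.Sequential(                   # stands in for the transitional machine learning network
    torch.nn.Linear(13, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))
for p in transitional_network.parameters():
    p.requires_grad = True                                    # all weights and biases become tunable again

optimizer = torch.optim.Adam(transitional_network.parameters(), lr=1e-4)   # small steps for fine tuning

second_samples = [(torch.randn(4, 13), torch.randn(4, 8))]    # simulated second speech two-tuple batches
for frames, target_feat in second_samples:
    loss = torch.nn.functional.mse_loss(transitional_network(frames), target_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```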
As an implementation manner, the number of the second speech tuple samples is plural, each second speech tuple sample includes a second to-be-recognized source speech sequence and a target hearing assistance speech matched with the second to-be-recognized source speech sequence, and the step S153 of performing fine optimization on all network learnable variables of the transition machine learning network based on the second speech tuple samples to obtain the target feature representation network specifically may include:
Step S1531, for the second to-be-recognized source speech sequence in each second speech binary group sample, obtaining an active speech example and a passive speech example of the second to-be-recognized source speech sequence, where the active speech example of the second to-be-recognized source speech sequence is a target hearing assistance speech of the same second speech binary group sample as the second to-be-recognized source speech sequence, and the passive speech example of the second to-be-recognized source speech sequence is a target hearing assistance speech of a second speech binary group sample different from the second to-be-recognized source speech sequence.
In step S1531, the computer system processes for each second speech tuple sample consisting of the second sequence of source speech to be recognized and the target hearing aid speech matching it.
For each second to-be-identified source speech sequence, the computer system obtains its positive speech instance and negative speech instance. Both examples play a vital role in the training process. An example of a positive speech is a target hearing aid speech belonging to the same second speech tuple sample as the second to-be-recognized source speech sequence, i.e. a correct response corresponding to the source speech sequence. In contrast, the negative speech example is the target hearing aid speech belonging to a different second speech tuple sample than the second to-be-recognized source speech sequence. This means that these examples are not expected responses of the source speech sequence, but responses of other speech or erroneous responses. By distinguishing between positive and negative speech examples, the computer system is able to more effectively train and optimize the machine learning model so that it can respond more accurately and sensitively in the face of the user's speech. The step not only improves the performance of the network and enhances the user experience, but also ensures that the voice of the user can be summarized and transmitted correctly and timely.
Step S1532, for each second to-be-identified source speech sequence, outputting fourth speech features of the second to-be-identified source speech sequence, fifth speech features of positive speech examples of the second to-be-identified source speech sequence, and sixth speech features of negative speech examples of the second to-be-identified source speech sequence based on the transitional machine learning network.
Step S1532 is that in optimizing the transition machine learning network, for each second to-be-identified source speech sequence, the computer system utilizes the transition machine learning network to extract and output relevant speech features. These features play a key role in the training and optimization of the model. Specifically, for each second to-be-identified source speech sequence, the computer system outputs three types of speech features through the transitional machine learning network:
1. fourth speech feature of the second to-be-recognized source speech sequence: this is a characteristic representation of the source speech sequence itself, possibly containing information of the tempo, pitch, timbre etc. of the speech signal.
2. Fifth speech feature of positive speech example of the second to-be-recognized source speech sequence: this is the characteristic of the desired response corresponding to the source speech sequence, i.e., the positive sample characteristic.
3. Sixth speech feature of negative speech example of the second to-be-recognized source speech sequence: this is a feature of an incorrect or unexpected response, i.e., a negative sample feature.
These features may be multidimensional vectors containing various information extracted from the speech signal, such as MFCCs (Mel-frequency cepstral coefficients), spectrograms, phoneme recognition results, etc. By analyzing and comparing these features, the computer system is able to evaluate the recognition capabilities of the transitional machine learning network for different speech sequences and optimize the network accordingly. Specifically, in step S1532, the computer system may utilize a transitional machine learning network (which may be a deep learning model, such as a recurrent neural network (RNN), long short-term memory network (LSTM), or Transformer, etc.) to extract these key features. These features will then be used to calculate a commonality metric to measure the similarity between the source speech sequence and its positive and negative examples, thereby providing guidance for fine tuning of the network.
Step S1533, based on the fourth voice feature and the fifth voice feature, obtains a third common metric coefficient between the second to-be-identified source voice sequence and the positive voice example of the second to-be-identified source voice sequence, and based on the fourth voice feature and the sixth voice feature, obtains a fourth common metric coefficient between the second to-be-identified source voice sequence and the negative voice example of the second to-be-identified source voice sequence.
In step S1533, the computer system calculates co-metric coefficients for measuring similarity between the source speech sequence and its positive and negative speech examples using the speech features extracted in the previous step.
Specifically, the computer system first calculates a third commonality metric coefficient based on a fourth speech feature (i.e., a feature of the second to-be-identified source speech sequence) and a fifth speech feature (i.e., a feature of the active speech example). This coefficient reflects the degree of similarity between the source speech sequence and its correct response.
Likewise, the computer system also calculates a fourth commonality metric coefficient based on the fourth speech feature and the sixth speech feature (i.e., the features of the negative speech examples). This coefficient measures the similarity between the source speech sequence and its incorrect or unexpected response.
The computation of the commonality metric coefficients typically involves mathematical operations such as cosine similarity, Euclidean distance, etc., which can be used to quantify the similarity between two feature vectors. In a hearing aid scenario, the calculation of these coefficients helps to evaluate and improve the performance of the model, ensuring that the device can more accurately understand and respond to the user's voice instructions. Through this step, the computer system is able to more accurately understand the model's ability to recognize and process speech instructions, thereby providing a solid basis for subsequent model optimization. This not only improves hearing aid accuracy and response speed, but also greatly improves user experience, ensuring that hearing-impaired patients can smoothly and accurately receive the source voice information.
Step S1534 determines a second evaluation function based on the third and fourth commonality metric coefficients.
Specifically, the computer system determines the second evaluation function based on a third co-metric coefficient (measuring similarity between the source speech sequence and its correct response) and a fourth co-metric coefficient (measuring similarity between the source speech sequence and its incorrect response). The objective of the design of this evaluation function is to enable the model to maximize the similarity between the source speech and its correct response while minimizing the similarity to the erroneous response during the training process. In practical applications, the evaluation function may be a complex mathematical expression that comprehensively considers a plurality of factors including the magnitude of the commonality metric coefficient, the accuracy of model prediction, the confidence of the prediction result, and the like. For example, a simple loss function may be based on the difference of two common metric coefficients, i.e. it is desirable that the third common metric coefficient is as large as possible and the fourth common metric coefficient is as small as possible.
It is understood that the manner of determining the second evaluation function may refer to the formula of determining the first evaluation function in the aforementioned step S1514.
Step S1535 generates a target evaluation function based on the second evaluation function.
Specifically, the computer system constructs a final objective evaluation function based on the second evaluation function determined in the previous step, in combination with other possible constraints, regularization terms, or a priori knowledge. The target evaluation function not only considers the similarity between the source voice sequence and the correct response of the source voice sequence, but also possibly comprises factors such as complexity, generalization capability and the like of the model so as to ensure that the trained model can accurately recognize and process voice instructions, and has good generalization performance.
In constructing the target evaluation function, the computer system may employ various techniques, such as adding regularization terms to prevent model overfitting, or introducing a priori knowledge to constrain the parameter space of the model. The technical means are to enable the target evaluation function to reflect the performance of the model more comprehensively and accurately, so that the model is guided to approach the global optimal solution continuously in the training process.
For example, in constructing the objective evaluation function, a regularization term may be introduced to constrain the complexity of the network to prevent the occurrence of overfitting. Thus, the final objective evaluation function may be:
Objective = L1 + λ·Complexity, where λ is a trade-off coefficient for adjusting the weight between the loss term and the complexity term. In this way, the model is ensured to accurately recognize voice commands while maintaining good generalization capability and stability.
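A minimal sketch of such an objective, using the squared weight norm as one possible complexity term and an arbitrary trade-off coefficient, could be:

```python
import torch

def objective(l1_loss: torch.Tensor, network: torch.nn.Module, lam: float = 1e-4) -> torch.Tensor:
    # Objective = L1 + lambda * Complexity, with the squared L2 norm of all weights as one possible complexity measure.
    complexity = sum(p.pow(2).sum() for p in network.parameters())
    return l1_loss + lam * complexity

net = torch.nn.Linear(8, 8)
total = objective(torch.tensor(0.7), net)   # 0.7 stands in for a previously computed L1 value
```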
As one embodiment, the step S1535 generates the target evaluation function based on the second evaluation function, specifically includes:
Step S15351 of obtaining the remaining network learnable variables of the transitional machine learning network;
Step S15352 of determining a network-learnable variable evaluation function based on the initial learnable variable and the adjusted learnable variable of the remaining network-learnable variables;
Step S15353 generates the target evaluation function based on the second evaluation function and the network-learnable variable evaluation function.
In this embodiment, step S15351 obtains the remaining network learnable variables of the transition machine learning network. The computer system determines which network parameters (e.g., weights and biases) are learnable and have not been fixed or optimized. These learnable variables are critical in the machine learning model training process, as they are continually adjusted during the training process to minimize the loss function. For example, the transitional machine learning network is a deep learning neural network that includes multiple layers and nodes, each node having corresponding weights and bias parameters. Some of these parameters may have been optimized in a previous training step, while the remaining network-learnable variables are needed for this step.
In step S15352, a network-learnable variable evaluation function is determined based on the initial learnable variable and the adjusted learnable variable of the remaining network-learnable variables. At this step, the computer system first sets initial values of these variables (initial learnable variables) and then continuously adjusts these variables (adjusted learnable variables) during the training process. Based on the variations in these variables, the computer system defines an evaluation function (i.e., a network-learnable variable evaluation function) for evaluating the impact of these variable adjustments on the model performance.
Taking the weights in the neural network as an example, the initial weights may be randomly given at the time of network initialization, and the adjusted weights are adjusted according to the feedback of the loss function during training. The network-learnable variable evaluation function may be defined based on a change in model prediction accuracy before and after weight adjustment.
As one embodiment, the network-learnable variable evaluation function may be a mean square error function. The formula may be referred to as follows:
L2 = (1/R) · Σ_{x=1}^{R} (θx' − θx)²
where L2 is the mean square error between the starting and adjusted learnable variables of the remaining network learnable variables, R is the number of remaining network learnable variables, θx' is the adjusted learnable variable of the x-th remaining network learnable variable, θx is the starting learnable variable of the x-th remaining network learnable variable, and x ≤ R.
In step S15353, the target evaluation function is generated based on the second evaluation function and the network-learnable variable evaluation function. In this step, the computer system combines the second evaluation function (loss function based on the commonality metric) determined in the previous step with the network-learnable variable evaluation function to generate a comprehensive objective evaluation function. This objective evaluation function takes into account not only the behavior of the model on the training data (embodied by the second evaluation function), but also the impact of the model parameter adjustment on the performance (embodied by the network-learnable variable evaluation function).
In particular, the target evaluation function may be a weighted sum of the second evaluation function and the network-learnable variable evaluation function. By minimizing this objective evaluation function, the computer system can simultaneously optimize the prediction accuracy of the model and the parameter adjustment strategy, thereby improving the performance of the hearing aid device and enabling it to more accurately communicate the source speech to the hearing impaired patient.
An exemplary formula for the objective evaluation function is as follows:
Lt = L1 + λ·L2
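The combination Lt = L1 + λ·L2 could be sketched as follows; the parameter snapshot, the toy network, and the λ value are assumptions used only to show how the mean-square drift penalty enters the target evaluation function.

```python
import torch

def network_variable_evaluation(start_params, current_params) -> torch.Tensor:
    # L2: mean square error between the starting and the adjusted remaining learnable variables.
    squared = [(c - s).pow(2).sum() for s, c in zip(start_params, current_params)]
    total_count = sum(s.numel() for s in start_params)
    return torch.stack(squared).sum() / total_count

def target_evaluation(l1_loss, start_params, current_params, lam: float = 0.1) -> torch.Tensor:
    # Lt = L1 + lambda * L2: penalises drifting too far from the transitional network's variables.
    return l1_loss + lam * network_variable_evaluation(start_params, current_params)

net = torch.nn.Linear(8, 8)
start_snapshot = [p.detach().clone() for p in net.parameters()]   # taken before fine optimisation begins
lt = target_evaluation(torch.tensor(0.5), start_snapshot, list(net.parameters()))
```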
Step S1536, performing fine optimization on all network learnable variables of the transition machine learning network based on the target evaluation function to obtain the target feature representation network.
In step S1536, all network learnable variables of the transitional machine learning network are finely optimized to obtain a final target feature representation network. This process, often referred to as "fine-tuning", aims to further improve the performance of the model by adjusting its parameters. In particular implementations, the computer system fine-tunes all network-learnable variables in the transitional machine learning network according to the objective evaluation function generated in the previous step. These learnable variables typically include parameters such as weights, biases, etc. of the neural network. The goal of the fine tuning is to minimize the target evaluation function, which means that the model's performance on the training data will be optimal. During the fine tuning process, the computer system may use an optimization algorithm (e.g., gradient descent) to gradually adjust the network parameters. Each step of adjustment determines the direction and step size of parameter update according to the gradient information of the target evaluation function. Through multiple iterative updates, the model will gradually approach the optimal solution.
In the fine tuning process, the computer system adjusts each layer of parameters in the network step by step according to the objective evaluation function. These adjustments may include changing the connection weights of neurons, adjusting bias terms, and the like. By fine tuning, the neural network will gradually learn how to extract effective feature representations from the input speech signal, thereby more accurately identifying the speech content. Eventually, the trimmed neural network will become the target feature representation network.
As an implementation manner, the speech coverage area of the target hearing assistance speech matched with the first to-be-identified source speech sequence is larger than the initial speech coverage area, and step S110, obtaining a first speech binary group sample composed of the first to-be-identified source speech sequence and the target hearing assistance speech matched with it, may specifically include:
Step S111, determining the target hearing-aid voice from among voice sequences whose voice coverage areas are larger than the initial voice coverage area, and determining a first to-be-recognized source voice sequence matched with the target hearing-aid voice;
Step S112, generating the first speech binary group sample based on the first to-be-identified source speech sequence and the target hearing assistance speech matched with the first to-be-identified source speech sequence.
In step S111, the computer system first looks for the target hearing aid voice within a larger voice coverage area. Here, "voice coverage area" refers to the signal duration or the number of samples of the voice sequence, and the sequences searched exceed the initial voice coverage area. For example, if the initial voice coverage area is 5 seconds of voice signal, then in step S111 the computer system may select the target hearing aid voice from voice signals of 6 seconds or longer. After determining the target hearing assistance voice, the computer system further determines a first to-be-identified source voice sequence that matches the target hearing assistance voice. By "matched" is meant that both are consistent in content, i.e., they convey the same semantic information.
In step S112, the computer system combines the first to-be-identified source speech sequence with the target hearing-aid speech matched with the first to-be-identified source speech sequence to generate a first speech binary group sample. This sample will be used as part of the training data for subsequent training of the feature representation network.
In one embodiment, in step S111, the determining the target hearing assistance voice from among voice sequences whose voice coverage areas are greater than the initial voice coverage area may specifically include: screening, from a public voice information base, voice sequences whose voice coverage areas are larger than the initial voice coverage area as candidate voice sequences; and determining the target hearing assistance speech based on the candidate speech sequences.
In the step S111, determining a first to-be-recognized source voice sequence that is matched with the target hearing assistance voice may specifically include: and taking the to-be-recognized source voice sequence matched with the target hearing aid voice in the public voice information base as a first to-be-recognized source voice sequence matched with the target hearing aid voice.
In step S1111, the computer system first filters, from the public speech information base, the speech sequences whose speech coverage areas are larger than the initial speech coverage area, and uses these sequences as candidate speech sequences. For example, if the initial voice coverage area corresponds to 5 seconds, the computer system looks for those voice sequences that are longer than 5 seconds. This is done to ensure that there is sufficient speech data for subsequent processing and analysis.
In step S1112, the computer system determines a target hearing assistance voice from the candidate voice sequence. This step may involve an evaluation of the quality, clarity, content, etc. of the speech sequence to ensure that the selected target hearing aid speech is of high quality and suitable for use in a hearing aid device. For example, the computer system may select those speech sequences that are pronounced, at moderate speech speeds, and have no significant noise disturbance as the target hearing aid speech.
After the target hearing aid voice is determined, a source voice sequence to be recognized, which is matched with the target hearing aid voice, is found from the public voice information base. By "matched" is meant that both are consistent in content, i.e., they convey the same semantic information. The computer system takes the matched source voice sequences to be recognized as first source voice sequences to be recognized, which are matched with target hearing aid voices. This is done to ensure accuracy and consistency of the training data, thereby improving the training effect of the feature representation network.
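The screening and matching described above might look like the following sketch; the public voice information base, the 5-second initial coverage area, and the tuple layout are hypothetical.

```python
# Hypothetical entries of a public voice information base:
# (candidate voice text, duration in seconds, matched source voice text).
public_voice_base = [
    ("see you at the library at three pm", 6.2, "please meet at the library gate at three pm"),
    ("open the window", 1.4, "please open the window"),
    ("tomorrow afternoon it will rain, bring an umbrella", 7.8, "please remind me about tomorrow's weather"),
]

INITIAL_COVERAGE_SECONDS = 5.0   # assumed initial voice coverage area

candidates = [entry for entry in public_voice_base if entry[1] > INITIAL_COVERAGE_SECONDS]
first_samples = [(source, target) for target, _, source in candidates]   # (source sequence, target hearing-aid voice)
print(first_samples)
```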
As one embodiment, the number of the candidate voice sequences is a plurality, and the determining the target hearing assistance voice based on the candidate voice sequences may specifically include: for each candidate voice sequence, determining the distribution condition of core content of the candidate voice sequence in the candidate voice sequence, wherein the core content is information matched with a first to-be-identified source voice sequence corresponding to the candidate voice sequence; and if the distribution condition is the target distribution information, removing the candidate voice sequences from the plurality of candidate voice sequences according to the set percentage to obtain the target hearing aid voice.
Specifically, the computer system processes each candidate speech sequence to determine the distribution of its core content in the sequence. The term "core content" as used herein refers to information that matches the first to-be-identified source speech sequence corresponding to a candidate speech sequence, i.e., semantic content that is communicated in common with the first to-be-identified source speech sequence. To determine the distribution of core content, the computer system may employ speech recognition techniques to extract text information from the speech sequence and determine matching portions by text alignment. The computer system judges whether the target distribution information is met according to the distribution condition of the core content. If so, the computer system clears a part of the candidate voice sequences according to the set percentage, and finally the target hearing-aid voice is obtained. The "target distribution information" herein is a preset distribution condition: for example, the core content should be located in the middle part of the voice sequence, or the proportion of the core content should reach a certain threshold. The computer system determines whether to retain each candidate speech sequence as a candidate for the target hearing aid speech by comparing the core content distribution of the sequence with the target distribution information. For example, if the computer system requires that the core content occupy at least 60% of the speech sequence, only candidate speech sequences that meet this condition will be retained. In summary, step S1112 finally determines the target hearing assistance voice by determining the distribution of the core content in the candidate voice sequences and screening the voice sequences meeting the conditions according to the target distribution information. This process helps ensure that the selected target hearing assistance speech meets certain requirements, improving the accuracy and efficiency of subsequent feature representation network training.
As an embodiment, the determining the distribution of the core content of the candidate speech sequence in the candidate speech sequence may specifically include: dividing the candidate voice sequences according to the initial voice coverage areas to obtain a plurality of voice subsequences; respectively obtaining fifth commonality measurement coefficients between the plurality of voice subsequences and a first to-be-recognized source voice sequence corresponding to the candidate voice sequence; and taking the voice subsequence with the maximum fifth commonality measurement coefficient as the distribution condition of the core content of the candidate voice sequence in the candidate voice sequence.
In this embodiment, the computer system first segments the candidate speech sequence according to the initial speech coverage area to obtain a plurality of speech subsequences. For example, if the initial voice coverage area corresponds to 5 seconds and the candidate voice sequence is 15 seconds in total, then the computer system segments this candidate voice sequence into three 5-second voice subsequences. This is done to facilitate subsequent separate analysis of each subsequence.
The computer system then obtains fifth commonality metric coefficients between each of the voice subsequences and the first to-be-identified source voice sequence corresponding to the candidate voice sequence. This coefficient is used to measure the similarity or commonality between two speech sequences. For example, cosine similarity, the Pearson correlation coefficient, etc. may be used to calculate this commonality metric coefficient. The system compares the similarity of each voice subsequence to the first to-be-identified source voice sequence to determine which subsequence contains the most core content.
Then, the computer system determines the voice subsequence with the largest coefficient as the distribution situation of the core content in the candidate voice sequence according to the calculated fifth commonality measurement coefficient. This means that this sub-sequence matches the first to-be-identified source speech sequence most closely and therefore most likely contains the core content. For example, if the fifth co-metric coefficient for three voice sub-sequences is 0.7, 0.9, and 0.6, respectively, then the system would consider that sub-sequence with coefficient 0.9 to contain core content.
In summary, the distribution of the core content in the candidate speech sequence is determined by dividing the candidate speech sequence, calculating the commonality metric coefficient and finding the speech subsequence with the largest coefficient. This process helps the computer system more accurately recognize important information in the speech sequence, providing powerful support for subsequent feature representation network training.
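The division-and-scoring procedure could be sketched as below; the frame rate, the pooled-feature comparison, and the random data are assumptions that merely illustrate locating the subsequence with the largest fifth commonality metric coefficient.

```python
import numpy as np

def split_by_coverage(frames: np.ndarray, frames_per_window: int):
    # Divide a candidate sequence (time, feat) into subsequences of the initial coverage length.
    return [frames[i:i + frames_per_window] for i in range(0, len(frames), frames_per_window)]

def commonality(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
candidate_frames = rng.random((300, 13))       # ~15 s candidate at an assumed 20 frames per second
source_feature = rng.random(13)                # feature of the first to-be-identified source sequence

subsequences = split_by_coverage(candidate_frames, frames_per_window=100)   # three 5 s subsequences
fifth_coefficients = [commonality(sub.mean(axis=0), source_feature) for sub in subsequences]
core_index = int(np.argmax(fifth_coefficients))   # subsequence most likely to hold the core content
print(core_index, fifth_coefficients)
```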
As another embodiment, in step S111, the determining the target hearing assistance voice from among voice sequences whose voice coverage areas are greater than the initial voice coverage area may specifically include: determining, from voice material that does not contain hearing aid voice, a voice sequence whose voice coverage area is larger than the initial voice coverage area as the target hearing aid voice.
Based on this, the determining a first to-be-identified source speech sequence that matches the target hearing assistance speech comprises: performing voice construction on the target hearing aid voice based on a voice generation network, according to a constraint template comprising the target hearing aid voice and a guidance command, to obtain a first to-be-recognized source voice sequence matched with the target hearing aid voice.
Specifically, in the above embodiment, the computer system determines the target hearing assistance voice and the first to-be-recognized source voice sequence matched with the target hearing assistance voice through a specific method.
First, the computer system screens out, from speech material without hearing aid speech, speech sequences having a speech coverage area greater than the initial speech coverage area. Such material without hearing aid speech may come from various public or non-public speech libraries, which contain speech sequences of various coverage lengths. For example, the computer system may sort out, from a large voice database, all voice sequences that are longer than 10 seconds. During this process, the system excludes material that already contains hearing aid speech to ensure that the selected speech sequences are original and unprocessed. Once the speech sequences without hearing aid speech whose speech coverage areas are greater than the initial speech coverage area have been selected, the system determines these sequences as target hearing aid voices. These target hearing aid voices will be used in the subsequent voice construction and recognition processes.
Next, to determine a first to-be-identified source voice sequence that matches the target hearing aid voice, the computer system uses a constraint template (also called a prompt) that contains the target hearing aid voice and the guidance command. This constraint template provides the necessary guidance and constraints for the speech generation network to ensure that the generated speech sequence is consistent in content with the target hearing aid speech. Based on this constraint template, the computer system may use a generative speech neural network (i.e., a voice generation network) to perform voice construction on the target hearing assistance voice. This neural network may be constructed based on a Transformer, an RNN (recurrent neural network), or other advanced speech generation models. During the training process, the network learns how to generate a speech sequence that matches the target hearing aid speech according to a given constraint template. Through this process, the computer system is able to generate a first to-be-identified source speech sequence that matches the target hearing assistance speech. These sequences are not only highly consistent in content with the target hearing aid speech, but also preserve the characteristics of the original speech, such as tone quality and intonation. This is critical for subsequent speech recognition and the development of hearing aid devices.
According to the embodiment, the determination of the target hearing aid voice and the generation of the first to-be-identified source voice sequence matched with the target hearing aid voice are realized by screening the materials without the hearing aid voice and constructing the voice by utilizing the generated voice neural network. The method not only improves the accuracy of voice recognition, but also provides more natural and real voice experience for users of hearing-aid equipment.
In summary, the present application provides an AI-based voice semantic recognition method, which obtains a target feature representation network through debugging, extracts the respective feature representations of the input voice and the candidate voices based on this network for matching, and completes the output of the selected candidate voice; it can process long voice, convert it into accurate and concise voice for output, and thereby improve the user's communication efficiency and quality. In the debugging stage of the target feature representation network, based on the network architecture of an existing machine learning network, the initial distribution mapping table in the distribution vector mapping table is extended through the extended distribution mapping table; the voice coverage area of the initial distribution mapping table is the initial voice coverage area, so the voice coverage area of the distribution vector mapping table extends from the initial voice coverage area to a target voice coverage area. Because the target voice coverage area is larger than the initial voice coverage area, the target feature representation network obtained by debugging the machine learning network to be debugged can process long voice sequences whose voice coverage areas are larger than the initial voice coverage area. When the network is debugged, initial assignment is carried out on the network learnable variables of the machine learning network to be debugged to obtain the basic machine learning network. The initial learnable variables of the extended distribution mapping table in the basic machine learning network are obtained by arbitrary assignment, while the initial learnable variables of the remaining network learnable variables are obtained by initial assignment based on the network learnable variables of a public feature representation network whose front debugging has been completed; the remaining network learnable variables are all network learnable variables of the basic machine learning network except the extended distribution mapping table, and they include the initial distribution mapping table. The front-debugged remaining network learnable variables can be kept fixed during debugging, and only the extended distribution mapping table of the basic machine learning network is debugged based on the acquired first voice binary group samples to obtain the target feature representation network. On this basis, the voice semantics learned by the pre-debugged feature representation network can be maintained, which speeds up network debugging. The application extends the voice coverage area of the distribution vector mapping table of the network to be debugged and debugs only a local portion of the network learnable variables of the machine learning network to be debugged, so as to increase the maximum voice coverage area of the target feature representation network while maintaining the semantic information of the previously obtained voice sequences and ensuring debugging efficiency. Therefore, when a long voice sequence is processed, sequence splitting or other segmentation processing is not needed, so the semantic information of the voice sequence remains more complete and the accuracy of the returned hearing-aid voice is improved.
It should be noted that, if the technical scheme of the present application relates to personal or private information, the product applying the technical scheme of the present application explicitly informs the personal information processing rule and obtains personal autonomous consent before processing the personal information. If the technical scheme of the application relates to sensitive personal information, the product applying the technical scheme of the application obtains individual consent before processing the sensitive personal information, simultaneously meets the requirement of 'explicit consent', and simultaneously collects the information within the scope of laws and regulations. For example, a clear and remarkable mark is set at a personal information acquisition device such as a camera to inform that the personal information acquisition range is entered, personal information is acquired, and if the personal voluntarily enters the acquisition range, the personal information is considered as consent to be acquired; or on the device for processing the personal information, under the condition that obvious identification/information is utilized to inform the personal information processing rule, personal authorization is obtained by popup information or a person is requested to upload personal information and the like; the personal information processing rule may include information such as a personal information processor, a personal information processing purpose, a processing mode, and a type of personal information to be processed.
It should be noted that, in the embodiments of the present application, if the method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be embodied essentially, or in the part contributing to the related art, in the form of a software product stored in a storage medium, including several instructions for causing a TWS earphone (or another device such as a personal computer, a server, or a network device) to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
The embodiment of the application provides a TWS earphone, which comprises a memory and a processor, wherein the memory stores a target feature representation network and a computer program capable of running on the processor, and the processor implements the following steps when executing the computer program: acquiring a target to-be-recognized source voice sequence and a plurality of candidate voice sequences to be processed; obtaining first voice sequence features of the target to-be-recognized source voice sequence based on the target feature representation network, and obtaining second voice sequence features of each to-be-processed candidate voice sequence based on the target feature representation network; respectively acquiring a sixth commonality measurement coefficient between the first voice sequence feature and each second voice sequence feature; and outputting the to-be-processed candidate voice sequence corresponding to the second voice sequence feature with the largest sixth commonality measurement coefficient. The debugging process of the target feature representation network comprises the following steps: acquiring a first voice binary group sample consisting of a first to-be-recognized source voice sequence and a target hearing aid voice matched with the first to-be-recognized source voice sequence; acquiring a machine learning network to be debugged, wherein a distribution vector mapping table of the machine learning network to be debugged comprises an initial distribution mapping table and an extension distribution mapping table, the voice coverage areas of the initial distribution mapping table are the initial voice coverage areas, and the extension distribution mapping table is a distribution mapping table for extending the initial distribution mapping table; extending the initial distribution mapping table based on the extension distribution mapping table so that the voice coverage areas of the distribution vector mapping table extend from the initial voice coverage areas to target voice coverage areas, wherein the target voice coverage areas are larger than the initial voice coverage areas; performing initial assignment on the network learnable variables of the machine learning network to be debugged to obtain a basic machine learning network, wherein the initial learnable variables of the extension distribution mapping table in the basic machine learning network are obtained by arbitrary assignment, the initial learnable variables of the remaining network learnable variables in the basic machine learning network are obtained by initial assignment based on the network learnable variables of a public feature representation network for which front debugging has been completed, and the remaining network learnable variables are all network learnable variables of the basic machine learning network other than the extension distribution mapping table and comprise the initial distribution mapping table; and debugging the extension distribution mapping table of the basic machine learning network based on the first voice binary group sample to obtain the target feature representation network.
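As a non-authoritative illustration of the matching steps performed on the TWS earphone, the sketch below assumes that the target feature representation network maps a whole voice sequence to a single feature vector and that the commonality measurement coefficient is cosine similarity; both choices, and every identifier, are assumptions rather than details taken from the application.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_hearing_aid_voice(feature_net, source_sequence, candidate_sequences):
    """Return the to-be-processed candidate whose feature is most similar to the
    feature of the target to-be-recognized source voice sequence."""
    first_feature = feature_net(source_sequence)               # first voice sequence feature
    second_features = torch.stack(
        [feature_net(candidate) for candidate in candidate_sequences]
    )                                                          # second voice sequence features
    # Sixth commonality measurement coefficient, assumed here to be cosine similarity.
    coefficients = F.cosine_similarity(
        first_feature.unsqueeze(0), second_features, dim=-1
    )
    best = int(torch.argmax(coefficients))
    return candidate_sequences[best]
```

The candidate with the largest coefficient is returned directly; any tie-breaking or thresholding behaviour is left open here, since the application does not specify it.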
Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method. The computer readable storage medium may be transitory or non-transitory.
Embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, performs some or all of the steps of the above-described method. The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
Fig. 3 is a schematic hardware entity diagram of a TWS earphone according to an embodiment of the present application. As shown in Fig. 3, the hardware entity of the TWS earphone 300 includes a processor 310, a communication interface 320, and a memory 330. The processor 310 generally controls the overall operation of the TWS earphone 300. The communication interface 320 enables the TWS earphone to communicate with other terminals or servers over a network. The memory 330 is configured to store instructions and applications executable by the processor 310, and may also cache data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 310 and the various modules of the TWS earphone 300; the memory 330 may be implemented by a FLASH memory (FLASH) or a random access memory (Random Access Memory, RAM). Data are transferred between the processor 310, the communication interface 320, and the memory 330 via a bus 340. It should be noted here that the description of the storage medium and apparatus embodiments above is similar to that of the method embodiments described above, with similar benefits as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and the apparatus of the present application, please refer to the description of the method embodiments of the present application.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by program instructions running on relevant hardware; the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes: a mobile storage device, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program codes.
Alternatively, if the above-described integrated units of the present application are implemented in the form of software functional modules and sold or used as separate products, they may also be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the related art, in the form of a software product stored in a storage medium, including several instructions for causing a TWS earphone (or another device such as a personal computer, a server, or a network device) to perform all or part of the method according to the embodiments of the present application. The aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The foregoing is merely an embodiment of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art could readily conceive of changes or substitutions within the technical scope disclosed by the present application, and such changes and substitutions are intended to be covered by the protection scope of the present application.

Claims (11)

1. An AI-based voice semantic recognition method, the method comprising:
acquiring a first voice binary group sample consisting of a first to-be-recognized source voice sequence and a target hearing aid voice matched with the first to-be-recognized source voice sequence;
Acquiring a machine learning network to be debugged, wherein a distribution vector mapping table of the machine learning network to be debugged comprises an initial distribution mapping table and an extension distribution mapping table, the voice coverage areas of the initial distribution mapping table are the initial voice coverage areas, and the extension distribution mapping table is a distribution mapping table for extending the initial distribution mapping table;
Extending the initial distribution mapping table based on the extended distribution mapping table so that the voice coverage areas of the distribution vector mapping table extend from the initial voice coverage areas to target voice coverage areas, wherein the target voice coverage areas are larger than the initial voice coverage areas;
Performing initial assignment on network learnable variables of the machine learning network to be debugged to obtain a basic machine learning network, wherein initial learnable variables of an extended distribution mapping table in the basic machine learning network are obtained by arbitrary assignment, initial learnable variables of the remaining network learnable variables in the basic machine learning network are obtained by initial assignment based on network learnable variables of a public feature representation network for which front debugging has been completed, and the remaining network learnable variables are all network learnable variables of the basic machine learning network other than the extended distribution mapping table and comprise the initial distribution mapping table;
Debugging an extended distribution mapping table of the basic machine learning network based on the first voice binary group sample to obtain a target feature representation network;
acquiring a target to-be-recognized source voice sequence and a plurality of candidate voice sequences to be processed;
obtaining first voice sequence features of the target to-be-recognized source voice sequence based on the target feature representation network, and obtaining second voice sequence features of each to-be-processed candidate voice sequence based on the target feature representation network;
Respectively acquiring a sixth commonality measurement coefficient between the first voice sequence feature and each second voice sequence feature;
and outputting the candidate voice sequence to be processed corresponding to the second voice sequence feature with the largest sixth commonality measurement coefficient.
2. The method of claim 1, wherein the debugging the extended distribution mapping table of the basic machine learning network based on the first voice binary group sample to obtain a target feature representation network comprises:
Debugging an extended distribution mapping table of the basic machine learning network based on the first voice binary group sample to obtain a transition machine learning network;
acquiring a second voice binary group sample consisting of a second to-be-recognized source voice sequence and a target hearing aid voice matched with the second to-be-recognized source voice sequence;
and performing fine optimization on all network learnable variables of the transition machine learning network based on the second voice binary group sample to obtain the target feature representation network.
3. The method of claim 2, wherein the number of the first voice binary group samples is plural, each first voice binary group sample comprises a first to-be-recognized source voice sequence and a target hearing aid voice matched with the first to-be-recognized source voice sequence, and the debugging the extended distribution mapping table of the basic machine learning network based on the first voice binary group samples to obtain a transition machine learning network comprises:
For the first to-be-recognized source voice sequence in each first voice binary group sample, acquiring a positive voice example and a negative voice example of the first to-be-recognized source voice sequence, wherein the positive voice example of the first to-be-recognized source voice sequence is the target hearing aid voice belonging to the same first voice binary group sample as the first to-be-recognized source voice sequence, and the negative voice example of the first to-be-recognized source voice sequence is a target hearing aid voice belonging to a different first voice binary group sample from the first to-be-recognized source voice sequence;
For each first to-be-recognized source voice sequence, outputting, based on the basic machine learning network, a first voice feature of the first to-be-recognized source voice sequence, a second voice feature of the positive voice example of the first to-be-recognized source voice sequence, and a third voice feature of the negative voice example of the first to-be-recognized source voice sequence;
Acquiring a first commonality measurement coefficient between the first to-be-recognized source voice sequence and the positive voice example of the first to-be-recognized source voice sequence based on the first voice feature and the second voice feature, and acquiring a second commonality measurement coefficient between the first to-be-recognized source voice sequence and the negative voice example of the first to-be-recognized source voice sequence based on the first voice feature and the third voice feature;
generating a first evaluation function based on the first and second commonality measurement coefficients;
And debugging an extended distribution mapping table of the basic machine learning network based on the first evaluation function to obtain the transition machine learning network.
4. The method of claim 2, wherein the number of the second voice binary group samples is plural, each second voice binary group sample comprises a second to-be-recognized source voice sequence and a target hearing aid voice matched with the second to-be-recognized source voice sequence, and the performing fine optimization on all network learnable variables of the transition machine learning network based on the second voice binary group samples to obtain the target feature representation network comprises:
For the second to-be-recognized source voice sequence in each second voice binary group sample, acquiring a positive voice example and a negative voice example of the second to-be-recognized source voice sequence, wherein the positive voice example of the second to-be-recognized source voice sequence is the target hearing aid voice belonging to the same second voice binary group sample as the second to-be-recognized source voice sequence, and the negative voice example of the second to-be-recognized source voice sequence is a target hearing aid voice belonging to a different second voice binary group sample from the second to-be-recognized source voice sequence;
Outputting, for each second to-be-recognized source voice sequence and based on the transition machine learning network, a fourth voice feature of the second to-be-recognized source voice sequence, a fifth voice feature of the positive voice example of the second to-be-recognized source voice sequence, and a sixth voice feature of the negative voice example of the second to-be-recognized source voice sequence;
Acquiring a third commonality measurement coefficient between the second to-be-recognized source voice sequence and the positive voice example of the second to-be-recognized source voice sequence based on the fourth voice feature and the fifth voice feature, and acquiring a fourth commonality measurement coefficient between the second to-be-recognized source voice sequence and the negative voice example of the second to-be-recognized source voice sequence based on the fourth voice feature and the sixth voice feature;
Determining a second evaluation function based on the third and fourth commonality measurement coefficients;
Generating a target evaluation function based on the second evaluation function;
And performing fine optimization on all network learnable variables of the transition machine learning network based on the target evaluation function to obtain the target feature representation network.
5. The method of claim 4, wherein the generating a target evaluation function based on the second evaluation function comprises:
acquiring the remaining network learnable variables of the transition machine learning network;
determining a network learnable variable evaluation function based on the initial learnable variables and the adjusted learnable variables of the remaining network learnable variables;
The target evaluation function is generated based on the second evaluation function and the network-learnable variable evaluation function.
6. The method according to any one of claims 1-5, wherein a voice coverage area of the target hearing aid voice matched with the first to-be-recognized source voice sequence is larger than the initial voice coverage area, and the acquiring a first voice binary group sample consisting of a first to-be-recognized source voice sequence and a target hearing aid voice matched with the first to-be-recognized source voice sequence comprises:
Determining the target hearing aid voice from voice sequences whose voice coverage areas are larger than the initial voice coverage area, and determining a first to-be-recognized source voice sequence matched with the target hearing aid voice;
and generating the first voice binary group sample based on the first to-be-recognized source voice sequence and the target hearing aid voice matched with the first to-be-recognized source voice sequence.
7. The method of claim 6, wherein the determining the target hearing aid voice from voice sequences whose voice coverage areas are larger than the initial voice coverage area comprises:
screening, from a public voice information base, voice sequences whose voice coverage areas are larger than the initial voice coverage area as candidate voice sequences;
determining the target hearing aid voice based on the candidate voice sequence;
the determining a first to-be-recognized source voice sequence matched with the target hearing aid voice comprises the following steps:
And taking the to-be-recognized source voice sequence matched with the target hearing aid voice in the public voice information base as a first to-be-recognized source voice sequence matched with the target hearing aid voice.
8. The method of claim 7, wherein the number of the candidate voice sequences is plural, and the determining the target hearing aid voice based on the candidate voice sequences comprises:
For each candidate voice sequence, determining the distribution condition of core content of the candidate voice sequence in the candidate voice sequence, wherein the core content is information matched with a first to-be-recognized source voice sequence corresponding to the candidate voice sequence;
and if the distribution condition is the target distribution information, removing the candidate voice sequence from the plurality of candidate voice sequences according to a set percentage to obtain the target hearing aid voice.
9. The method of claim 8, wherein the determining a distribution of core content of the candidate speech sequence in the candidate speech sequence comprises:
Dividing the candidate voice sequences according to the initial voice coverage areas to obtain a plurality of voice subsequences;
respectively obtaining fifth commonality measurement coefficients between the plurality of voice subsequences and a first to-be-recognized source voice sequence corresponding to the candidate voice sequence;
And taking the voice subsequence with the maximum fifth commonality measurement coefficient as the distribution condition of the core content of the candidate voice sequence in the candidate voice sequence.
10. The method of claim 6, wherein the determining the target hearing aid voice from voice sequences whose voice coverage areas are larger than the initial voice coverage area comprises:
Determining, from voice materials without hearing aid voice, a voice sequence whose voice coverage area is larger than the initial voice coverage area as the target hearing aid voice;
the determining a first to-be-recognized source voice sequence matched with the target hearing aid voice comprises the following steps:
And processing the target hearing aid voice based on a voice generating network according to a constraint template comprising the target hearing aid voice and the guiding command, to obtain a first to-be-recognized source voice sequence matched with the target hearing aid voice.
11. A TWS earphone, comprising a memory and a processor, wherein the memory stores a target feature representation network and a computer program executable on the processor, and the processor implements the following steps when executing the computer program:
acquiring a target to-be-recognized source voice sequence and a plurality of candidate voice sequences to be processed;
obtaining first voice sequence features of the target to-be-recognized source voice sequence based on the target feature representation network, and obtaining second voice sequence features of each to-be-processed candidate voice sequence based on the target feature representation network;
Respectively acquiring a sixth commonality measurement coefficient between the first voice sequence feature and each second voice sequence feature;
Outputting a candidate voice sequence to be processed corresponding to the second voice sequence feature with the largest sixth commonality measurement coefficient;
wherein the debugging process of the target feature representation network comprises the following steps:
acquiring a first voice binary group sample consisting of a first to-be-recognized source voice sequence and a target hearing aid voice matched with the first to-be-recognized source voice sequence;
Acquiring a machine learning network to be debugged, wherein a distribution vector mapping table of the machine learning network to be debugged comprises an initial distribution mapping table and an extension distribution mapping table, the voice coverage areas of the initial distribution mapping table are the initial voice coverage areas, and the extension distribution mapping table is a distribution mapping table for extending the initial distribution mapping table;
Extending the initial distribution mapping table based on the extended distribution mapping table so that the voice coverage areas of the distribution vector mapping table extend from the initial voice coverage areas to target voice coverage areas, wherein the target voice coverage areas are larger than the initial voice coverage areas;
Performing initial assignment on network learnable variables of the machine learning network to be debugged to obtain a basic machine learning network, wherein initial learnable variables of an extended distribution mapping table in the basic machine learning network are obtained by arbitrary assignment, initial learnable variables of the remaining network learnable variables in the basic machine learning network are obtained by initial assignment based on network learnable variables of a public feature representation network for which front debugging has been completed, and the remaining network learnable variables are all network learnable variables of the basic machine learning network other than the extended distribution mapping table and comprise the initial distribution mapping table;
and debugging an extended distribution mapping table of the basic machine learning network based on the first voice binary group sample to obtain a target feature representation network.
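For readers who want a concrete picture of the evaluation functions referred to in claims 3-5 above, the sketch below shows one plausible construction: a margin-based function built from the commonality measurement coefficients of the positive and negative voice examples, plus a penalty that keeps the remaining network learnable variables close to their initial values during fine optimization. The margin form, the L2 penalty, and every identifier are assumptions; the claims do not fix these functional forms.

```python
import torch
import torch.nn.functional as F

def evaluation_function(anchor_feat, positive_feat, negative_feat, margin=0.2):
    """First/second evaluation function (assumed form): push the coefficient of
    the positive voice example above that of the negative example by `margin`."""
    pos_coeff = F.cosine_similarity(anchor_feat, positive_feat, dim=-1)
    neg_coeff = F.cosine_similarity(anchor_feat, negative_feat, dim=-1)
    return F.relu(margin - pos_coeff + neg_coeff).mean()

def variable_evaluation_function(model, initial_state, weight=1e-3):
    """Claim-5-style term (assumed form): penalise drift of the remaining
    network learnable variables from their initial, pre-debugged values."""
    drift = torch.zeros(())
    for name, param in model.named_parameters():
        if name in initial_state:                 # remaining network learnable variables
            drift = drift + (param - initial_state[name].to(param)).pow(2).sum()
    return weight * drift

def target_evaluation_function(model, initial_state, anchor, positive, negative):
    """Target evaluation function: second evaluation function plus the
    network learnable variable evaluation function."""
    return (evaluation_function(anchor, positive, negative)
            + variable_evaluation_function(model, initial_state))
```

In this reading, the first debugging stage would minimise `evaluation_function` over the extended distribution mapping table only, while the fine-optimization stage would minimise `target_evaluation_function` over all variables.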
CN202410623558.9A 2024-05-20 2024-05-20 Voice semantic recognition method based on AI and TWS earphone Active CN118230720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410623558.9A CN118230720B (en) 2024-05-20 2024-05-20 Voice semantic recognition method based on AI and TWS earphone

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410623558.9A CN118230720B (en) 2024-05-20 2024-05-20 Voice semantic recognition method based on AI and TWS earphone

Publications (2)

Publication Number Publication Date
CN118230720A CN118230720A (en) 2024-06-21
CN118230720B true CN118230720B (en) 2024-07-19

Family

ID=91506313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410623558.9A Active CN118230720B (en) 2024-05-20 2024-05-20 Voice semantic recognition method based on AI and TWS earphone

Country Status (1)

Country Link
CN (1) CN118230720B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645956A (en) * 2023-06-16 2023-08-25 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis system, electronic device, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021225198A1 (en) * 2020-05-08 2021-11-11 엘지전자 주식회사 Artificial intelligence device for recognizing speech and method thereof
CN115101061A (en) * 2022-07-14 2022-09-23 京东科技信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645956A (en) * 2023-06-16 2023-08-25 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis system, electronic device, and storage medium

Also Published As

Publication number Publication date
CN118230720A (en) 2024-06-21

Similar Documents

Publication Publication Date Title
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
WO2021143326A1 (en) Speech recognition method and apparatus, and device and storage medium
Zhou et al. Deep Speaker Embedding Extraction with Channel-Wise Feature Responses and Additive Supervision Softmax Loss Function.
CN111009237B (en) Voice recognition method and device, electronic equipment and storage medium
CN110517664A (en) Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
Wang et al. Speaker recognition using convolutional neural network with minimal training data for smart home solutions
CN113314119A (en) Voice recognition intelligent household control method and device
CN112837669A (en) Voice synthesis method and device and server
US20240096332A1 (en) Audio signal processing method, audio signal processing apparatus, computer device and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
WO2024139805A1 (en) Audio processing method and related device
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN118230720B (en) Voice semantic recognition method based on AI and TWS earphone
KR102663654B1 (en) Adaptive visual speech recognition
Tamm et al. Pre-trained speech representations as feature extractors for speech quality assessment in online conferencing applications
US12002451B1 (en) Automatic speech recognition
CN116129856A (en) Training method of speech synthesis model, speech synthesis method and related equipment
KR20230120790A (en) Speech Recognition Healthcare Service Using Variable Language Model
CN114333846A (en) Speaker identification method, device, electronic equipment and storage medium
Mandel et al. Learning a concatenative resynthesis system for noise suppression
Ma et al. Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
Maiti et al. Concatenative Resynthesis Using Twin Networks.
CN117854509B (en) Training method and device for whisper speaker recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant