WO2021189903A1 - Audio-based user state identification method and apparatus, and electronic device and storage medium - Google Patents

Audio-based user state identification method and apparatus, and electronic device and storage medium

Info

Publication number
WO2021189903A1
WO2021189903A1 PCT/CN2020/131983 CN2020131983W WO2021189903A1 WO 2021189903 A1 WO2021189903 A1 WO 2021189903A1 CN 2020131983 W CN2020131983 W CN 2020131983W WO 2021189903 A1 WO2021189903 A1 WO 2021189903A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
target
spectrogram
voice signal
training
Prior art date
Application number
PCT/CN2020/131983
Other languages
French (fr)
Chinese (zh)
Inventor
魏文琦
王健宗
贾雪丽
张之勇
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021189903A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an audio-based user state recognition method, device, electronic equipment, and storage medium.
  • An audio-based user state recognition method provided in this application includes:
  • the user state recognition model is used to recognize the to-be-recognized spectrogram to obtain a user state recognition result.
  • The present application also provides an audio-based user state recognition device, which includes:
  • a model generation module, used to obtain an audio training set, perform feature conversion on each audio in the audio training set to obtain a target spectrogram set, and, based on the attention mechanism and small sample learning, use the target spectrogram set to train the pre-built deep learning network model to obtain the user state recognition model;
  • a state recognition module, used to perform feature conversion on the audio of the user to be recognized, when that audio is received, to obtain the spectrogram to be recognized, and to use the user state recognition model to recognize the spectrogram to be recognized and obtain the user state recognition result.
  • This application also provides an electronic device, which includes:
  • a memory storing at least one instruction; and
  • a processor that executes the instructions stored in the memory to implement the following steps:
  • the user state recognition model is used to recognize the to-be-recognized spectrogram to obtain a user state recognition result.
  • the present application also provides a computer-readable storage medium in which at least one instruction is stored, and when the at least one instruction is executed by a processor in an electronic device, the following steps are implemented:
  • the user state recognition model is used to recognize the to-be-recognized spectrogram to obtain a user state recognition result.
  • FIG. 1 is a schematic flowchart of an audio-based user state recognition method provided by an embodiment of this application;
  • FIG. 2 is a schematic diagram of a detailed process of obtaining a target spectrogram set in an audio-based user state recognition method provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of the modules of an audio-based user state recognition device provided by an embodiment of this application;
  • FIG. 4 is a schematic diagram of the internal structure of an electronic device that implements an audio-based user state recognition method provided by an embodiment of this application.
  • This application provides an audio-based user state recognition method.
  • FIG. 1 is a schematic flowchart of an audio-based user state recognition method provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the audio-based user state recognition method includes:
  • the audio training set is a collection of audios containing initial tags.
  • The initial tags are the user's disease conditions, such as acute bronchitis, chronic pharyngitis, pertussis, or fever. Further, since a user's cough audio has distinct sound features under different disease conditions, the audio training set is preferably a collection of cough audio corresponding to different disease conditions. The sound features are the frequency-domain characteristics of the cough audio, which can be represented by a spectrogram.
  • The embodiment of the present application performs feature conversion on the audio training set to obtain the target spectrogram set as follows:
  • each audio in the audio training set is resampled to obtain the corresponding digital voice signal.
  • The present application uses an analog-to-digital converter to resample each audio in the audio training set.
  • A pre-emphasis operation is then performed on each audio in the audio training set.
  • Performing the pre-emphasis operation on each audio in the audio training set includes: resampling each audio in the audio training set to obtain the corresponding digital voice signal;
  • pre-emphasizing the digital voice signal to obtain a standard digital voice signal, and collecting all the standard digital voice signals to obtain a voice signal set.
  • x(t) is the digital voice signal;
  • t is the time;
  • y(t) is the standard digital voice signal;
  • the remaining parameter is the preset adjustment value of the pre-emphasis operation; preferably, its value lies in the range [0.9, 1.0].
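  • The pre-emphasis formula itself did not survive in this text. A minimal sketch follows, assuming the standard first-order pre-emphasis filter y(t) = x(t) - μ·x(t-1), which is consistent with the symbols defined above. The symbol μ for the adjustment value, the default of 0.97, and the function name `pre_emphasize` are assumptions made for illustration, not details confirmed by the source.

```python
import numpy as np


def pre_emphasize(x, mu=0.97):
    """Apply y(t) = x(t) - mu * x(t-1) to a 1-D digital voice signal.

    `mu` is the adjustment value; the text only states that it should
    lie in [0.9, 1.0], so 0.97 is an assumed default.
    """
    x = np.asarray(x, dtype=np.float64)
    if x.size == 0:
        return x.copy()
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.concatenate(([x[0]], x[1:] - mu * x[:-1]))
```

  • Pre-emphasis of this kind boosts the high-frequency part of the signal relative to the low-frequency part, which is why a constant (zero-frequency) input is attenuated almost to zero after the first sample.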
  • The standard digital voice signals in the voice signal set only reflect how the audio changes in the time domain and do not directly show its audio characteristics.
  • Therefore, to make the audio characteristics more intuitive and clear, feature conversion is performed on each standard digital voice signal in the voice signal set.
  • Performing feature conversion on each standard digital voice signal in the voice signal set includes: using a preset sound processing algorithm to map each standard digital voice signal in the voice signal set to the frequency domain to obtain the corresponding target spectrogram, and collecting all the target spectrograms to obtain the target spectrogram set.
  • the sound processing algorithm described in this application is the Mel filter algorithm.
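  • The source names the Mel filter algorithm but gives no parameters. The sketch below shows one common way to map a time-domain signal to a log-mel spectrogram using only NumPy. The sample rate, FFT size, hop length, number of mel bands, the Hann window, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        # Each filter is the positive part of the lower of the two slopes.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mel_spectrogram(signal, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    """Return a log-mel spectrogram of shape (n_mels, n_frames)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < n_fft:
        signal = np.pad(signal, (0, n_fft - signal.size))
    n_frames = 1 + (signal.size - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, n_fft//2 + 1)
    mel = mel_filterbank(n_mels, n_fft, sample_rate) @ power.T
    return 10.0 * np.log10(mel + 1e-10)                    # small offset avoids log(0)
```

  • Note that the number of frames depends on the length of the input signal, so recordings of different durations yield spectrograms of different widths.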
  • The above steps only perform feature conversion on each audio in the audio training set and do not affect the initial label corresponding to each audio, so each target spectrogram in the target spectrogram set has a corresponding initial label.
  • The target spectrogram set is used to train the pre-built deep learning network model to obtain an audio-based user state recognition model.
  • Training the pre-built deep learning network model by using the target spectrogram set includes:
  • Step A: The target spectrogram set is divided into a training set and a test set.
  • Dividing the set this way allows the model being trained to be tested continuously against the test set, which enhances its robustness. Dividing the target spectrogram set into a training set and a test set includes: classifying each target spectrogram in the target spectrogram set according to its initial label to obtain the corresponding classified target spectrogram sets; randomly taking a preset number of target spectrograms out of each classified target spectrogram set as a test subset, and using the complement of the test subset in that classified target spectrogram set as a training subset; and collecting all the training subsets to obtain the training set and all the test subsets to obtain the test set.
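  • The per-label split in Step A can be sketched as follows. The representation of samples as `(spectrogram, label)` pairs, the function name `split_by_label`, and the `seed` parameter are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_label(samples, n_test_per_label, seed=None):
    """Split (spectrogram, label) pairs into a training set and a test set.

    For every label, `n_test_per_label` randomly chosen samples form the
    test subset and the remaining samples of that label form the training
    subset. The subsets are then merged across labels.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)

    training_set, test_set = [], []
    for label, items in by_label.items():
        if len(items) <= n_test_per_label:
            raise ValueError(f"not enough samples for label {label!r}")
        shuffled = items[:]
        rng.shuffle(shuffled)
        test_set.extend(shuffled[:n_test_per_label])
        training_set.extend(shuffled[n_test_per_label:])
    return training_set, test_set
```

  • Drawing the same number of test samples from every label keeps each disease condition represented in the test set, which matters when the training set is small.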
  • Step B: Use the training set to train the deep learning network to obtain an initial recognition model, test the initial recognition model on the test set to obtain a loss value, and return to Step A when the loss value is greater than a preset threshold.
  • When the loss value is less than or equal to the preset threshold, the initial recognition model is used as the user state recognition model.
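  • The control flow of Steps A and B can be sketched as a loop that repeats until the loss value falls to the threshold. The callables `split`, `train`, and `evaluate` are hypothetical placeholders for the splitting, training, and testing procedures. The `max_rounds` limit and the choice to pass the previous model back into `train` are assumptions, since the source does not say whether the loop is bounded or whether training restarts from scratch.

```python
def train_until_threshold(samples, split, train, evaluate, threshold, max_rounds=10):
    """Repeat Step A (split) and Step B (train and test) until loss <= threshold.

    Returns (model, loss, rounds_used).
    """
    model = None
    for round_index in range(max_rounds):
        training_set, test_set = split(samples)      # Step A
        model = train(model, training_set)           # Step B: train
        loss = evaluate(model, test_set)             # Step B: test
        if loss <= threshold:
            return model, loss, round_index + 1
    raise RuntimeError("loss did not reach the threshold within max_rounds")
```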
  • the deep learning network in the embodiment of the present application is a convolutional neural network.
  • The images in the target spectrogram set may differ in size, so the features that the deep learning network model extracts from them during training have different dimensions and cannot be trained uniformly.
  • Therefore, before using the training set to train the pre-built deep learning network, the embodiment of the present application adds an attention mechanism processing layer before the fully connected layer of the deep learning network model to align the image features. The attention mechanism processing layer aligns features according to their differing dimensions. For example:
  • the image feature a that the deep learning network model extracts from target spectrogram A is a D*T1 matrix;
  • the image feature b that the deep learning network model extracts from target spectrogram B is a D*T2 matrix;
  • the attention mechanism processing layer multiplies image feature a by a preset T1*1 weight matrix to obtain a D-dimensional vector, and multiplies image feature b by a preset T2*1 weight matrix to obtain a D-dimensional vector, so that image feature a and image feature b are aligned.
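  • The alignment described above amounts to multiplying a D*T feature matrix by a T*1 weight matrix so that the time dimension is collapsed. A minimal sketch follows. The function names and the uniform weights used in the usage example are assumptions, since the source does not state how the preset weight matrices are chosen.

```python
import numpy as np


def align_feature(feature, weights):
    """Collapse a (D, T) feature matrix to a D-dimensional vector.

    `weights` is a (T, 1) weight matrix (a flat length-T sequence is also
    accepted); the result is feature @ weights, flattened to shape (D,).
    """
    feature = np.asarray(feature, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64).reshape(-1, 1)
    if feature.ndim != 2 or feature.shape[1] != weights.shape[0]:
        raise ValueError("feature must be (D, T) and weights must be (T, 1)")
    return (feature @ weights).ravel()


def uniform_weights(t):
    """Hypothetical preset weights: every time step contributes equally."""
    return np.full((t, 1), 1.0 / t)
```

  • For instance, a feature of shape (4, 3) and a feature of shape (4, 7) both become vectors of length 4 after `align_feature`, so they can be fed to the same fully connected layer.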
  • The embodiment of the present application needs to test the initial recognition model to verify its recognition ability, which facilitates training and adjusting the model.
  • The recognition categories of the initial recognition model in the embodiment of the present application are the same as the categories of the initial tags in the target spectrogram set.
  • For example, if the initial tags in the target spectrogram set fall into two categories, chronic pharyngitis and fever, the recognition categories of the initial recognition model are the same two categories.
  • Testing the initial recognition model according to the test set to obtain a loss value includes: extracting the feature vector corresponding to each of the initial tags in the initial recognition model to obtain a target feature vector; using the initial recognition model to perform feature extraction on each target spectrogram in the test subset to obtain a test feature vector; calculating the distance between the target feature vector corresponding to each initial tag and the test feature vector to obtain a loss distance value; and calculating the average of all the loss distance values to obtain the loss value.
  • The embodiment of the present application uses the Euclidean distance to calculate the distance between the target feature vector corresponding to each of the initial tags and the test feature vector.
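  • The loss computation described above can be sketched as the mean Euclidean distance between each test feature vector and the target feature vector of its label. The dictionary layout of `target_vectors` and the function name are assumptions made for illustration.

```python
import numpy as np


def mean_euclidean_loss(target_vectors, test_vectors, test_labels):
    """Average Euclidean distance between test vectors and their label's target.

    target_vectors: dict mapping each label to its target feature vector.
    test_vectors:   sequence of test feature vectors.
    test_labels:    label of each test vector, in the same order.
    """
    distances = [
        np.linalg.norm(
            np.asarray(vector, dtype=np.float64)
            - np.asarray(target_vectors[label], dtype=np.float64)
        )
        for vector, label in zip(test_vectors, test_labels)
    ]
    if not distances:
        raise ValueError("at least one test vector is required")
    return float(np.mean(distances))
```

  • With targets `{"fever": [1, 0], "chronic pharyngitis": [0, 1]}`, the test vectors `[1, 0]` (fever) and `[0, 0]` (chronic pharyngitis) give distances 0 and 1, so the loss value is 0.5.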
  • The different recognition categories of the initial recognition model correspond to different fully connected layer nodes, and the fully connected layer nodes have a corresponding order.
  • The embodiment of the present application obtains the output values of the fully connected layer nodes corresponding to each recognition category of the initial recognition model and combines them in the order of those nodes to obtain the corresponding target feature vector. Further, the embodiment of the present application inputs each target spectrogram in the test subset into the initial recognition model, obtains, according to the initial label of that target spectrogram, the output values of the fully connected layer nodes corresponding to the matching recognition category, and combines them in the order of those nodes to obtain the test feature vector.
  • the audio training set may be stored in a blockchain node.
  • the audio of the user to be identified is of the same category as the audio in the audio training set.
  • For example, if the audio training set is a collection of cough audio, the audio of the user to be identified is the user's cough audio.
  • the method for performing feature conversion on the audio of the user to be identified in the embodiment of the present application is the same as the above-mentioned method for performing feature conversion on each audio of the audio training set.
  • the user status recognition result is the user's health status, such as acute bronchitis, chronic pharyngitis, pertussis, and fever.
  • Feature conversion is performed on each audio in the audio training set to obtain the target spectrogram set, which makes the features of that audio clearer and more intuitive and increases the accuracy of subsequent model training. Based on the attention mechanism and small sample learning, the target spectrogram set is used to train a pre-built deep learning network model to obtain a user state recognition model, which enhances the robustness and training accuracy of the model on a small training set. Feature conversion is performed on the audio of the user to be identified to obtain the spectrogram to be identified, which makes the audio features of that user clearer and more intuitive and improves the recognition accuracy of the model. The user state recognition model then recognizes the spectrogram to be identified to obtain the user state recognition result. Training the model on a small amount of easily obtained audio data reduces the data resources consumed by training, and the user state can be recognized from the user's audio alone, which makes the model more practical.
  • FIG. 3 is a functional block diagram of the audio-based user state recognition device of the present application.
  • the audio-based user state recognition apparatus 100 described in this application can be installed in an electronic device.
  • the audio-based user state recognition device may include a model generation module 101 and a state recognition module 102.
  • A module described in the present application may also be called a unit. It refers to a series of computer program segments that are stored in the memory of an electronic device, can be executed by the processor of that device, and perform fixed functions.
  • The functions of each module/unit are as follows:
  • The model generation module 101 is used to obtain an audio training set, perform feature conversion on each audio in the audio training set to obtain a target spectrogram set, and, based on the attention mechanism and small sample learning, use the target spectrogram set to train the pre-built deep learning network model to obtain the user state recognition model.
  • the audio training set is a collection of audios containing initial tags.
  • The initial tags are the user's disease conditions, such as acute bronchitis, chronic pharyngitis, pertussis, or fever. Further, since a user's cough audio has distinct sound features under different disease conditions, the audio training set is preferably a collection of cough audio corresponding to different disease conditions. The sound features are the frequency-domain characteristics of the cough audio, which can be represented by a spectrogram.
  • The model generation module 101 in this embodiment of the present application performs feature conversion on the audio training set to obtain the target spectrogram set as follows:
  • each audio in the audio training set is resampled to obtain the corresponding digital voice signal.
  • The present application uses an analog-to-digital converter to resample each audio in the audio training set.
  • A pre-emphasis operation is then performed on each audio in the audio training set.
  • The pre-emphasis operation on each audio in the audio training set includes: resampling each audio in the audio training set to obtain the corresponding digital voice signal;
  • pre-emphasizing the digital voice signal to obtain a standard digital voice signal, and collecting all the standard digital voice signals to obtain a voice signal set.
  • The model generation module 101 uses the following formula to perform the pre-emphasis operation:
  • x(t) is the digital voice signal;
  • t is the time;
  • y(t) is the standard digital voice signal;
  • the remaining parameter is the preset adjustment value of the pre-emphasis operation; preferably, its value lies in the range [0.9, 1.0].
  • The standard digital voice signals in the voice signal set only reflect how the audio changes in the time domain and do not directly show its audio characteristics.
  • Therefore, to make the audio characteristics more intuitive and clear, feature conversion is performed on each standard digital voice signal in the voice signal set.
  • The model generation module 101 in the embodiment of the present application performs feature conversion on each standard digital voice signal in the voice signal set as follows: using a preset sound processing algorithm to map each standard digital voice signal in the voice signal set to the frequency domain to obtain the corresponding target spectrogram, and collecting all the target spectrograms to obtain the target spectrogram set.
  • the sound processing algorithm described in this application is the Mel filter algorithm.
  • The above steps only perform feature conversion on each audio in the audio training set and do not affect the initial label corresponding to each audio, so each target spectrogram in the target spectrogram set has a corresponding initial label.
  • The target spectrogram set is used to train the pre-built deep learning network model to obtain an audio-based user state recognition model.
  • The model generation module 101 trains the pre-built deep learning network model as follows:
  • Step A: The target spectrogram set is divided into a training set and a test set.
  • Dividing the set this way allows the model being trained to be tested continuously against the test set, which enhances its robustness. Dividing the target spectrogram set into a training set and a test set includes: classifying each target spectrogram in the target spectrogram set according to its initial label to obtain the corresponding classified target spectrogram sets; randomly taking a preset number of target spectrograms out of each classified target spectrogram set as a test subset, and using the complement of the test subset in that classified target spectrogram set as a training subset; and collecting all the training subsets to obtain the training set and all the test subsets to obtain the test set.
  • Step B: Use the training set to train the deep learning network to obtain an initial recognition model, test the initial recognition model on the test set to obtain a loss value, and return to Step A when the loss value is greater than a preset threshold.
  • When the loss value is less than or equal to the preset threshold, the initial recognition model is used as the user state recognition model.
  • the deep learning network in the embodiment of the present application is a convolutional neural network.
  • The images in the target spectrogram set may differ in size, so the features that the deep learning network model extracts from them during training have different dimensions and cannot be trained uniformly.
  • Therefore, before using the training set to train the pre-built deep learning network, the embodiment of the present application adds an attention mechanism processing layer before the fully connected layer of the deep learning network model to align the image features. The attention mechanism processing layer aligns features according to their differing dimensions. For example:
  • the image feature a that the deep learning network model extracts from target spectrogram A is a D*T1 matrix;
  • the image feature b that the deep learning network model extracts from target spectrogram B is a D*T2 matrix;
  • the attention mechanism processing layer multiplies image feature a by a preset T1*1 weight matrix to obtain a D-dimensional vector, and multiplies image feature b by a preset T2*1 weight matrix to obtain a D-dimensional vector, so that image feature a and image feature b are aligned.
  • The embodiment of the present application needs to test the initial recognition model to verify its recognition ability, which facilitates training and adjusting the model.
  • The recognition categories of the initial recognition model in the embodiment of the present application are the same as the categories of the initial tags in the target spectrogram set.
  • For example, if the initial tags in the target spectrogram set fall into two categories, chronic pharyngitis and fever, the recognition categories of the initial recognition model are the same two categories.
  • The model generation module 101 in the embodiment of the present application obtains the loss value as follows: extracting the feature vector corresponding to each of the initial tags in the initial recognition model to obtain the target feature vector; using the initial recognition model to perform feature extraction on each target spectrogram in the test subset to obtain a test feature vector; calculating the distance between the target feature vector corresponding to each initial tag and the test feature vector to obtain a loss distance value; and calculating the average of all the loss distance values to obtain the loss value.
  • The embodiment of the present application uses the Euclidean distance to calculate the distance between the target feature vector corresponding to each of the initial tags and the test feature vector.
  • The model generation module 101 described in this embodiment of the application obtains the output values of the fully connected layer nodes corresponding to each recognition category of the initial recognition model and combines them in the order of those nodes to obtain the corresponding target feature vector. Further, the model generation module 101 inputs each target spectrogram in the test subset into the initial recognition model, obtains, according to the initial label of that target spectrogram, the output values of the fully connected layer nodes corresponding to the matching recognition category, and combines them in the order of those nodes to obtain the test feature vector.
  • the audio training set may be stored in a blockchain node.
  • The state recognition module 102 is configured to, when receiving the audio of the user to be recognized, perform feature conversion on that audio to obtain the spectrogram to be recognized, and to use the user state recognition model to recognize the spectrogram to be recognized and obtain the user state recognition result.
  • The audio of the user to be identified is of the same category as the audio in the audio training set.
  • For example, if the audio training set is a collection of cough audio, the audio of the user to be identified is the user's cough audio.
  • the method for performing feature conversion on the audio of the user to be identified in the embodiment of the present application is the same as the above-mentioned method for performing feature conversion on each audio of the audio training set.
  • the user status recognition result is the user's disease condition, such as acute bronchitis, chronic pharyngitis, whooping cough, and fever.
  • FIG. 4 is a schematic structural diagram of an electronic device that implements an audio-based user state recognition method according to the present application.
  • the electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and running on the processor 10, such as an audio-based user state recognition program.
  • The memory 11 includes at least one type of readable storage medium, such as flash memory, a mobile hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, for example, a mobile hard disk of the electronic device 1.
  • The memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), or a Secure Digital (SD) card equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the electronic device 1, such as the code of an audio-based user status recognition program, etc., but also to temporarily store data that has been output or will be output.
  • In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips.
  • The processor 10 is the control core (Control Unit) of the electronic device. It uses various interfaces and lines to connect the components of the entire electronic device, and performs the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the audio-based user state recognition program) and calling the data stored in the memory 11.
  • The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection and communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 4 only shows an electronic device with certain components. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown in the figure, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power source (such as a battery) for supplying power to various components.
  • The power source may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device.
  • the power supply may also include any components such as one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, and power status indicators.
  • the electronic device 1 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface.
  • The network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may also include a user interface.
  • the user interface may be a display and an input unit (such as a keyboard).
  • the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the electronic device 1 and to display a visualized user interface.
  • the audio-based user state recognition program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions. When running in the processor 10, it can realize:
  • the user state recognition model is used to recognize the to-be-recognized spectrogram to obtain a user state recognition result.
  • the integrated module/unit of the electronic device 1 can be stored in a computer-readable storage medium, which can be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
  • the computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the following steps:
  • the user state recognition model is used to recognize the to-be-recognized spectrogram to obtain a user state recognition result.
  • the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function, etc.; the storage data area may store data created through the use of a blockchain node, etc.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

An audio-based user state identification method and apparatus, and an electronic device and a computer-readable storage medium. The method comprises: acquiring an audio training set, and performing feature conversion on each piece of audio in the audio training set so as to obtain a target sonogram set (S1); on the basis of an attention mechanism and small sample learning, training a pre-constructed deep learning network model by using the target sonogram set so as to obtain a user state identification model (S2); when audio of a user to be subjected to identification is received, performing feature conversion on the audio of said user so as to obtain a sonogram to be subjected to identification (S3); and identifying said sonogram by using the user state identification model so as to obtain a user state identification result (S4). In addition, the present application further relates to blockchain technology, and the audio training set can be stored in a blockchain. By using the method, the consumption of data resources is reduced, and the practicability of a model is enhanced.

Description

Audio-based user state recognition method, device, electronic equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 9, 2020, with application number CN202011074898.9 and titled "Audio-based user state recognition method, device and storage medium", the entire content of which is incorporated in this application by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to an audio-based user state recognition method, device, electronic equipment, and storage medium.
Background
As the concept of smart living gradually gains popularity, user state has become its core concern, so recognizing user state has become very important, for example recognizing a user's current health state. Especially when infectious diseases are widespread, it is important to know everyone's health state at all times. Under normal circumstances, users need to go to a hospital and have a doctor perform a physical examination to learn about their health. However, hospitals are themselves full of germs, so going to a hospital for an examination carries a risk of infection.
Technical problem
The inventors realized that, at present, a large number of users' medical images (such as chest X-rays) are usually used to train a machine learning model that performs user state recognition to determine a user's health state. However, a large number of medical images consumes a large amount of data resources, and the high threshold for obtaining users' medical images makes this approach impractical and difficult to popularize.
Technical solutions
An audio-based user state recognition method provided in this application includes:
acquiring an audio training set, and performing feature conversion on each audio in the audio training set to obtain a target spectrogram set;
based on an attention mechanism and small sample learning, training a pre-built deep learning network model by using the target spectrogram set to obtain a user state recognition model;
when audio of a user to be recognized is received, performing feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
recognizing the spectrogram to be recognized by using the user state recognition model to obtain a user state recognition result.
This application also provides an audio-based user state recognition device, which includes:
a model generation module, configured to acquire an audio training set, and perform feature conversion on each audio in the audio training set to obtain a target spectrogram set; and, based on an attention mechanism and small sample learning, train a pre-built deep learning network model by using the target spectrogram set to obtain a user state recognition model;
a state recognition module, configured to, when audio of a user to be recognized is received, perform feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized; and recognize the spectrogram to be recognized by using the user state recognition model to obtain a user state recognition result.
This application also provides an electronic device, which includes:
a memory, storing at least one instruction; and
a processor, executing the instructions stored in the memory to implement the following steps:
acquiring an audio training set, and performing feature conversion on each audio in the audio training set to obtain a target spectrogram set;
based on an attention mechanism and small sample learning, training a pre-built deep learning network model by using the target spectrogram set to obtain a user state recognition model;
when audio of a user to be recognized is received, performing feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
recognizing the spectrogram to be recognized by using the user state recognition model to obtain a user state recognition result.
This application also provides a computer-readable storage medium in which at least one instruction is stored, and when the at least one instruction is executed by a processor in an electronic device, the following steps are implemented:
acquiring an audio training set, and performing feature conversion on each audio in the audio training set to obtain a target spectrogram set;
based on an attention mechanism and small sample learning, training a pre-built deep learning network model by using the target spectrogram set to obtain a user state recognition model;
when audio of a user to be recognized is received, performing feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
recognizing the spectrogram to be recognized by using the user state recognition model to obtain a user state recognition result.
Description of the drawings
FIG. 1 is a schematic flowchart of an audio-based user state recognition method provided by an embodiment of this application;
FIG. 2 is a detailed schematic flowchart of obtaining the target spectrogram set in the audio-based user state recognition method provided by an embodiment of this application;
FIG. 3 is a schematic module diagram of an audio-based user state recognition device provided by an embodiment of this application;
FIG. 4 is a schematic diagram of the internal structure of an electronic device that implements the audio-based user state recognition method provided by an embodiment of this application;
The realization of the purpose, the functional characteristics, and the advantages of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
Embodiments of the present invention
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
This application provides an audio-based user state recognition method. FIG. 1 is a schematic flowchart of the audio-based user state recognition method provided by an embodiment of this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.
In this embodiment, the audio-based user state recognition method includes:
S1. Acquire an audio training set, and perform feature conversion on each audio in the audio training set to obtain a target spectrogram set;
In the embodiment of this application, the audio training set is a collection of audio with initial labels. Preferably, the initial labels are users' disease conditions, for example: acute bronchitis, chronic pharyngitis, pertussis, fever. Further, since a user's cough audio has corresponding sound features under different disease conditions, the audio training set is preferably a collection of cough audio corresponding to different disease conditions, where the sound features are the frequency-domain features of the cough audio, which can be represented by a spectrogram.
Further, to make the features of each audio in the audio training set more intuitive and clear for the subsequent model, the embodiment of this application performs feature conversion on the audio training set to obtain the target spectrogram set, including:
S11. Resample each audio in the audio training set to obtain a corresponding digital voice signal;
In the embodiment of this application, to facilitate data processing of each audio in the audio training set, each audio in the audio training set is resampled to obtain the corresponding digital voice signal. Preferably, the embodiment of this application uses a digital-to-analog converter to resample each audio in the audio training set.
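As an illustration of step S11 only, the following minimal sketch resamples an already digitized recording in software by linear interpolation. This is a swapped-in technique for illustration, not the converter-based resampling the embodiment mentions; the function name `resample_linear` and the 44.1 kHz and 16 kHz rates are assumptions that do not come from the application.

```python
import numpy as np

def resample_linear(signal: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a 1-D signal from orig_sr to target_sr by linear interpolation."""
    duration = len(signal) / orig_sr
    n_target = int(round(duration * target_sr))
    # Sample times (in seconds) of the original and the resampled signal.
    t_orig = np.arange(len(signal)) / orig_sr
    t_new = np.arange(n_target) / target_sr
    return np.interp(t_new, t_orig, signal)

# Example: a 1-second, 44.1 kHz tone brought to a hypothetical 16 kHz rate.
audio = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
resampled = resample_linear(audio, orig_sr=44100, target_sr=16000)
```

Linear interpolation applies no anti-aliasing filter, so a practical pipeline would more likely use a band-limited resampler; the sketch only shows how every recording can be brought to one common sampling rate.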
S12. Pre-emphasize the digital voice signal to obtain a standard digital voice signal;
S13. Summarize all the standard digital voice signals to obtain a voice signal set;
In the embodiment of this application, to compensate for the audio information lost during acquisition of the audio training set, a pre-emphasis operation is performed on each audio in the audio training set.
In detail, in the embodiment of this application, performing the pre-emphasis operation on each audio in the audio training set includes: resampling each audio in the audio training set to obtain the corresponding digital voice signal; pre-emphasizing the digital voice signal to obtain a standard digital voice signal; and summarizing all the standard digital voice signals to obtain a voice signal set.
In detail, the embodiment of this application uses the following formula to perform the pre-emphasis operation:
y(t) = x(t) - μx(t-1)
where x(t) is the digital voice signal, t is time, y(t) is the standard digital voice signal, and μ is the preset adjustment value of the pre-emphasis operation. Preferably, the value range of μ is [0.9, 1.0].
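The pre-emphasis formula can be sketched in a few lines of NumPy. The function name `pre_emphasis`, the default μ = 0.97 (a value inside the stated range), and the choice to pass the first sample through unchanged are assumptions for illustration; the application does not specify how the first sample is handled.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply y(t) = x(t) - mu * x(t-1) to a non-empty 1-D digital voice signal."""
    x = np.asarray(x, dtype=np.float64)
    y = np.empty_like(x)
    # Assumption: the first sample has no predecessor, so it is passed through.
    y[0] = x[0]
    y[1:] = x[1:] - mu * x[:-1]
    return y

signal = np.array([1.0, 2.0, 3.0, 4.0])
emphasized = pre_emphasis(signal, mu=0.9)   # [1.0, 1.1, 1.2, 1.3]
```

Subtracting a scaled copy of the previous sample acts as a first-order high-pass filter, which boosts the higher frequencies relative to the lower ones.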
S14. Perform feature conversion on each standard digital voice signal in the voice signal set to obtain the target spectrogram set.
In the embodiment of this application, a standard digital voice signal in the voice signal set only reflects how the audio changes in the time domain and cannot reflect its audio features. To reflect the audio features of the standard digital voice signal and make them more intuitive and clear, feature conversion is performed on each standard digital voice signal in the voice signal set.
In detail, in the embodiment of this application, performing feature conversion on each standard digital voice signal in the voice signal set includes: using a preset sound processing algorithm to map each standard digital voice signal in the voice signal set to the frequency domain to obtain a corresponding target spectrogram, and summarizing all the target spectrograms to obtain the target spectrogram set.
Preferably, the sound processing algorithm in this application is the Mel filter algorithm.
In the embodiment of this application, the above steps only perform feature conversion on each audio in the audio training set and do not affect the initial label corresponding to each audio, so each target spectrogram in the target spectrogram set has a corresponding initial label.
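One common way to realize a Mel-filter-based conversion is a log-mel spectrogram: frame the signal, take the power spectrum of each frame, and apply triangular filters spaced evenly on the mel scale. The sketch below is a minimal NumPy version under that assumption; the application does not give the details of its Mel filter algorithm, and the frame length, hop size, number of filters, and sampling rate are illustrative values.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft//2 + 1)."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Return a log-mel spectrogram of shape (n_mels, n_frames); needs len(signal) >= n_fft."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # (n_frames, n_fft//2 + 1)
    mel = mel_filterbank(n_mels, n_fft, sr) @ power.T     # (n_mels, n_frames)
    return 10.0 * np.log10(mel + 1e-10)                   # small offset avoids log(0)

sr = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)     # 1 second of a 1 kHz tone
spec = mel_spectrogram(tone, sr=sr)                      # shape (40, 97)
```

The number of frames grows with the length of the recording, so recordings of different durations produce spectrograms of different widths.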
S2. Based on an attention mechanism and small sample learning, train a pre-built deep learning network model by using the target spectrogram set to obtain a user state recognition model;
In the embodiment of this application, since the number of samples in the audio training set is small, to ensure the training accuracy and robustness of the subsequent model, the pre-built deep learning network model is trained with the target spectrogram set based on the attention mechanism and small sample learning, to obtain an audio-based user state recognition model.
In detail, in the embodiment of this application, training the pre-built deep learning network model by using the target spectrogram set includes:
Step A: Divide the target spectrogram set into a training set and a test set;
In the embodiment of this application, since the sample data in the target spectrogram set is scarce and difficult to obtain, directly using the whole target spectrogram set as the training set would make the subsequent model less robust. Therefore, the embodiment of this application divides the target spectrogram set into a training set and a test set, and enhances the robustness of the model by repeatedly testing and adjusting the trained model with the test set. Dividing the target spectrogram set into a training set and a test set includes: classifying each target spectrogram in the target spectrogram set according to its initial label to obtain a corresponding classified target spectrogram set; randomly taking a preset number of target spectrograms from each classified target spectrogram set as a test subset, and using the complement of the test subset in that classified target spectrogram set as a training subset; and summarizing all the training subsets to obtain the training set and all the test subsets to obtain the test set. Preferably, the preset number in the embodiment of this application is 1.
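The per-label split in Step A can be sketched as follows. The function name `split_by_label`, the representation of samples as (spectrogram, label) pairs, and the placeholder strings standing in for spectrograms are assumptions for illustration.

```python
import random
from collections import defaultdict

def split_by_label(samples, n_test_per_label=1, seed=None):
    """samples: list of (spectrogram, label) pairs. Returns (train_set, test_set)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)

    train_set, test_set = [], []
    for items in by_label.values():
        # Randomly draw the test subset; the remaining items form the training subset.
        test_idx = set(rng.sample(range(len(items)), n_test_per_label))
        for i, item in enumerate(items):
            (test_set if i in test_idx else train_set).append(item)
    return train_set, test_set

samples = [("spec_a1", "fever"), ("spec_a2", "fever"), ("spec_a3", "fever"),
           ("spec_b1", "chronic pharyngitis"), ("spec_b2", "chronic pharyngitis")]
train_set, test_set = split_by_label(samples, n_test_per_label=1, seed=0)
```

With the preferred preset number of 1, every label contributes exactly one spectrogram to the test set, and calling the function again with a different seed yields a different split.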
Step B: Train the deep learning network with the training set to obtain an initial recognition model, and test the initial recognition model with the test set to obtain a loss value. When the loss value is greater than a preset threshold, return to Step A; when the loss value is less than or equal to the preset threshold, use the initial recognition model as the user state recognition model.
Preferably, the deep learning network in the embodiment of this application is a convolutional neural network.
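The control flow of Steps A and B can be sketched as a loop that re-splits, trains, and tests until the loss reaches the threshold. Everything named here (`train_until_threshold`, the callables `split_fn`, `train_fn`, `loss_fn`, the `max_rounds` guard, and the toy stand-ins) is hypothetical. Whether training continues from the previous model or restarts after returning to Step A is not stated in the application; the sketch assumes it continues.

```python
def train_until_threshold(samples, split_fn, train_fn, loss_fn, threshold, max_rounds=100):
    """Repeat Step A (split) and Step B (train, test) until loss <= threshold."""
    model = None
    for round_idx in range(max_rounds):
        train_set, test_set = split_fn(samples)      # Step A: new train/test split
        model = train_fn(model, train_set)           # Step B: train (assumed to continue)
        loss = loss_fn(model, test_set)              # Step B: test
        if loss <= threshold:
            return model, loss, round_idx + 1
    raise RuntimeError("loss did not reach the threshold within max_rounds")

# Toy stand-ins so the sketch runs: the "model" is a counter and the loss halves each round.
def toy_split(samples):
    return samples[:-1], samples[-1:]

def toy_train(model, train_set):
    return (model or 0) + 1

def toy_loss(model, test_set):
    return 1.0 / (2 ** model)

model, loss, rounds = train_until_threshold(
    [1, 2, 3], toy_split, toy_train, toy_loss, threshold=0.2)
```

The `max_rounds` guard is an addition of the sketch so that the loop cannot run forever if the threshold is never reached.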
In the embodiment of this application, the audio in the audio training set may differ in duration, so the images in the target spectrogram set differ in size, and the features that the deep learning network model extracts from the target spectrograms during training therefore have different dimensions and cannot be trained on uniformly. To make better use of the data in the audio training set, before training the deep learning network with the training set, the embodiment of this application adds an attention mechanism processing layer before the fully connected layer of the deep learning network model to align image features. The attention mechanism processing layer is a network that aligns features whose dimensions differ. For example, suppose the image feature a extracted from target spectrogram A by the deep learning network model is a D*T1 matrix, and the image feature b extracted from target spectrogram B is a D*T2 matrix. The attention mechanism processing layer multiplies image feature a by a preset T1*1 weight matrix to convert it into a D-dimensional vector, and multiplies image feature b by a preset T2*1 weight matrix to convert it into a D-dimensional vector, thereby aligning image feature a and image feature b.
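The alignment example above comes down to one matrix product per feature map. The sketch below uses uniform weights as a stand-in for the "preset weight matrix", whose values the application does not specify; the function name `attention_pool` and the sizes D = 8, T1 = 30, T2 = 45 are likewise assumptions.

```python
import numpy as np

def attention_pool(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Collapse a (D, T) feature map to a length-D vector using a (T, 1) weight matrix."""
    d, t = features.shape
    assert weights.shape == (t, 1), "weight matrix must be T x 1"
    return (features @ weights).reshape(d)

rng = np.random.default_rng(0)
D, T1, T2 = 8, 30, 45
feat_a = rng.normal(size=(D, T1))   # image feature a, D x T1
feat_b = rng.normal(size=(D, T2))   # image feature b, D x T2

# Assumption: uniform weights stand in for the unspecified preset weight matrices.
w_a = np.full((T1, 1), 1.0 / T1)
w_b = np.full((T2, 1), 1.0 / T2)

vec_a = attention_pool(feat_a, w_a)   # length-D vector
vec_b = attention_pool(feat_b, w_b)   # length-D vector, same size as vec_a
```

With uniform weights the product is a mean over the time axis; in an attention layer the weights would typically depend on the input, so that informative frames contribute more.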
Further, since the number of samples in the training set is small, the embodiment of this application needs to test the initial recognition model to verify its recognition ability, so that the model can be trained and adjusted accordingly.
In detail, the recognition categories of the initial recognition model in the embodiment of this application are the same as the categories of the initial labels in the target spectrogram set. For example, if the target spectrogram set contains two kinds of initial labels, chronic pharyngitis and fever, then the initial recognition model has the same two recognition categories, chronic pharyngitis and fever. Further, in the embodiment of this application, testing the initial recognition model according to the test set to obtain the loss value includes: extracting the feature vector corresponding to each kind of initial label in the initial recognition model to obtain a target feature vector; using the initial recognition model to perform feature extraction on each target spectrogram in the test subset to obtain a test feature vector; calculating, for each kind of initial label, the distance between the corresponding target feature vector and test feature vector to obtain a loss distance value; and calculating the average of all the loss distance values to obtain the loss value. Preferably, the embodiment of this application uses the Euclidean distance to calculate the distance between the target feature vector and the test feature vector corresponding to each kind of initial label.
Further, those skilled in the art will know that different recognition categories of the initial recognition model are connected to different fully connected layer nodes, and that the fully connected layer nodes have a corresponding order. The embodiment of this application obtains the output values of the fully connected layer nodes corresponding to each recognition category of the initial recognition model and combines them in the order of those nodes to obtain the corresponding target feature vector. Further, the embodiment of this application inputs each target spectrogram in the test subset into the initial recognition model and, according to the initial label corresponding to that target spectrogram, obtains the output values of the fully connected layer nodes of the corresponding recognition category in the initial recognition model and combines them in the order of those nodes to obtain the test feature vector.
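The loss computation described above, a mean of per-label Euclidean distances, can be sketched as follows. The function name `distance_loss`, the dictionary layout, and the example vectors are assumptions; the sketch also assumes one test feature vector per label, which matches the preferred preset number of 1.

```python
import numpy as np

def distance_loss(target_vectors: dict, test_vectors: dict) -> float:
    """Mean Euclidean distance between per-label target and test feature vectors."""
    distances = [
        np.linalg.norm(np.asarray(target_vectors[label]) - np.asarray(test_vectors[label]))
        for label in target_vectors
    ]
    return float(np.mean(distances))

# Hypothetical feature vectors for two labels.
target_vectors = {"chronic pharyngitis": [0.0, 0.0], "fever": [1.0, 1.0]}
test_vectors = {"chronic pharyngitis": [3.0, 4.0], "fever": [1.0, 2.0]}
loss = distance_loss(target_vectors, test_vectors)   # (5.0 + 1.0) / 2 = 3.0
```

A smaller loss means the test spectrograms produce feature vectors close to those the model associates with their labels.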
In another embodiment of this application, to ensure data privacy, the audio training set may be stored in a blockchain node.
S3. When audio of a user to be recognized is received, perform feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
In the embodiment of this application, the audio of the user to be recognized is of the same category as the audio in the audio training set. Preferably, in the embodiment of this application, the audio of the user to be recognized is the user's cough audio.
Further, in the embodiment of this application, the method for performing feature conversion on the audio of the user to be recognized is the same as the above method for performing feature conversion on each audio in the audio training set.
S4. Recognize the spectrogram to be recognized by using the user state recognition model to obtain a user state recognition result.
In the embodiment of this application, the user state recognition result is the user's health state, for example: acute bronchitis, chronic pharyngitis, pertussis, fever.
In the embodiment of this application, feature conversion is performed on each audio in the audio training set to obtain the target spectrogram set, which makes the features of the audio in the training set clearer and more intuitive and improves the accuracy of subsequent model training. Based on the attention mechanism and small sample learning, the pre-built deep learning network model is trained with the target spectrogram set to obtain the user state recognition model, which enhances the robustness and training accuracy of the model under a small-sample training set. Feature conversion is performed on the audio of the user to be recognized to obtain the spectrogram to be recognized, which makes the audio features of that user clearer and more intuitive and improves the recognition accuracy of the subsequent model. The spectrogram to be recognized is recognized with the user state recognition model to obtain the user state recognition result. Because a small amount of more easily obtained audio data is used to train the model, the data resources consumed by model training are reduced, and because only the user's audio is needed to recognize the user's state, the practicality of the model is enhanced.
FIG. 3 is a functional module diagram of the audio-based user state recognition device of this application.
The audio-based user state recognition device 100 described in this application can be installed in an electronic device. According to the functions implemented, the audio-based user state recognition device may include a model generation module 101 and a state recognition module 102. A module described in this application, which may also be called a unit, refers to a series of computer program segments that can be executed by the processor of the electronic device and can complete a fixed function, and that are stored in the memory of the electronic device.
In this embodiment, the functions of each module/unit are as follows:
The model generation module 101 is configured to acquire an audio training set, and perform feature conversion on each audio in the audio training set to obtain a target spectrogram set; and, based on an attention mechanism and small sample learning, train a pre-built deep learning network model by using the target spectrogram set to obtain a user state recognition model.
In this embodiment, the audio training set is a set of audio clips that carry initial labels. Preferably, the initial label is the user's disease condition, for example acute bronchitis, chronic pharyngitis, whooping cough, or fever. Further, because a user's cough audio has characteristic sound features under different disease conditions, the audio training set is preferably a set of cough audio clips corresponding to different disease conditions, where the sound features are frequency-domain features of the cough audio and can be represented by a spectrogram.
Further, to make the features of each audio clip in the audio training set more intuitive and clear for the subsequent model, the model generation module 101 in this embodiment performs feature conversion on the audio training set to obtain the target spectrogram set as follows:
Resampling each audio clip in the audio training set to obtain a corresponding digital voice signal;
In this embodiment, to facilitate data processing, each audio clip in the audio training set is resampled to obtain the corresponding digital voice signal. Preferably, this embodiment uses an analog-to-digital converter to resample each audio clip in the audio training set.
Performing pre-emphasis on the digital voice signal to obtain a standard digital voice signal;
Aggregating all the standard digital voice signals to obtain a voice signal set;
In this embodiment, a pre-emphasis operation is performed on each audio clip in the audio training set to compensate for audio information lost while the audio training set was acquired.
In detail, in this embodiment, performing the pre-emphasis operation on each audio clip in the audio training set includes: resampling each audio clip in the audio training set to obtain the corresponding digital voice signal; performing pre-emphasis on the digital voice signal to obtain a standard digital voice signal; and aggregating all the standard digital voice signals to obtain the voice signal set.
In detail, the model generation module 101 in this embodiment performs the pre-emphasis operation using the following formula:
y(t) = x(t) - μ·x(t-1)
where x(t) is the digital voice signal, t is time, y(t) is the standard digital voice signal, and μ is a preset adjustment value for the pre-emphasis operation. Preferably, μ takes a value in the range [0.9, 1.0].
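The pre-emphasis formula above can be illustrated with a minimal Python sketch. The function name `pre_emphasis`, the default μ of 0.97, and the handling of the first sample are assumptions made for illustration; the text only fixes the formula and the preferred range of μ.

```python
from typing import List, Sequence


def pre_emphasis(signal: Sequence[float], mu: float = 0.97) -> List[float]:
    """Apply y(t) = x(t) - mu * x(t-1) to a sampled signal.

    mu = 0.97 is an assumed default inside the preferred range [0.9, 1.0].
    The first sample has no predecessor, so it is passed through unchanged
    (an assumption: the text does not say how t = 0 is handled).
    """
    if not signal:
        return []
    out = [float(signal[0])]
    for t in range(1, len(signal)):
        out.append(signal[t] - mu * signal[t - 1])
    return out


# A flat stretch is strongly attenuated, while the jump at the end is kept.
samples = [1.0, 1.0, 1.0, 2.0]
emphasized = pre_emphasis(samples, mu=0.9)
```

Because each output subtracts most of the previous sample, slowly varying content shrinks while rapid changes, which carry the high-frequency content, are preserved.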
Performing feature conversion on each standard digital voice signal in the voice signal set to obtain the target spectrogram set.
In this embodiment, a standard digital voice signal in the voice signal set reflects only how the audio changes in the time domain and does not reveal its audio features. To expose those features and make them more intuitive and clear, feature conversion is performed on each standard digital voice signal in the voice signal set.
In detail, the model generation module 101 in this embodiment performs feature conversion on each standard digital voice signal in the voice signal set as follows: using a preset sound processing algorithm, each standard digital voice signal in the voice signal set is mapped to the frequency domain to obtain a corresponding target spectrogram, and all the target spectrograms are aggregated to obtain the target spectrogram set.
Preferably, the sound processing algorithm in this application is a Mel filtering algorithm.
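The text names Mel filtering without giving details, so the following Python sketch is only one plausible reading. It assumes the widely used conversion mel = 2595·log10(1 + f/700), triangular filters, and a power spectrum per frame that has already been computed (for example by a short-time Fourier transform). All function names and parameters are hypothetical.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    # Common Hz-to-mel conversion (an assumed convention, not from the text).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters: int, n_bins: int, sample_rate: int) -> List[List[float]]:
    """Build triangular filters over n_bins linear bins spanning 0..sample_rate/2."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # n_filters + 2 points equally spaced on the mel scale: edges and centres.
    points = [mel_to_hz(max_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * nyquist / (n_bins - 1) for k in range(n_bins)]
    bank: List[List[float]] = []
    for m in range(1, n_filters + 1):
        low, centre, high = points[m - 1], points[m], points[m + 1]
        row = []
        for f in bin_hz:
            if low < f <= centre:
                row.append((f - low) / (centre - low))    # rising edge
            elif centre < f < high:
                row.append((high - f) / (high - centre))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def apply_filterbank(power_frame: List[float], bank: List[List[float]]) -> List[float]:
    """Map one frame of a power spectrum to mel-band energies."""
    return [sum(w * p for w, p in zip(row, power_frame)) for row in bank]


bank = mel_filterbank(n_filters=4, n_bins=17, sample_rate=16000)
mel_frame = apply_filterbank([1.0] * 17, bank)
```

Stacking the mel-band energies of successive frames column by column gives a mel spectrogram, which is the kind of image a convolutional network can take as input.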
In this embodiment, the above steps only perform feature conversion on each audio clip in the audio training set and do not affect the initial label of each clip, so every target spectrogram in the target spectrogram set has a corresponding initial label.
In this embodiment, because the audio training set contains very few samples, the pre-built deep learning network model is trained with the target spectrogram set on the basis of an attention mechanism and small-sample learning, in order to preserve the training accuracy and robustness of the subsequent model. This yields the audio-based user state recognition model.
In detail, in this embodiment, the model generation module 101 trains the pre-built deep learning network model as follows:
Step A: dividing the target spectrogram set into a training set and a test set;
In this embodiment, the sample data in the target spectrogram set is scarce and hard to obtain, and using the whole target spectrogram set directly as the training set would make the subsequent model less robust. This embodiment therefore divides the target spectrogram set into a training set and a test set, and strengthens the model's robustness by repeatedly testing and adjusting the trained model against the test set. Dividing the target spectrogram set into a training set and a test set includes: classifying each target spectrogram in the target spectrogram set according to its initial label to obtain a corresponding classified target spectrogram set; randomly taking a preset number of target spectrograms from each classified target spectrogram set as a test subset, and using the complement of the test subset within that classified target spectrogram set as a training subset; aggregating all the training subsets to obtain the training set; and aggregating all the test subsets to obtain the test set. Preferably, the preset number in this embodiment is 1.
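The per-label split described above can be sketched in Python as follows. The function name `split_by_label`, the `(spectrogram, label)` tuple layout, and the error raised when a label has too few samples are assumptions for illustration, not details taken from the application.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Sample = Tuple[object, Hashable]  # (target spectrogram, initial label)


def split_by_label(
    samples: Sequence[Sample],
    test_per_label: int = 1,  # the "preset number"; 1 is the preferred value
    rng: Optional[random.Random] = None,
) -> Tuple[List[Sample], List[Sample]]:
    """Hold out test_per_label random samples of every label as the test set."""
    rng = rng or random.Random()
    groups: Dict[Hashable, List[Sample]] = defaultdict(list)
    for sample in samples:
        groups[sample[1]].append(sample)

    training_set: List[Sample] = []
    test_set: List[Sample] = []
    for label, group in groups.items():
        # Assumption: every label must keep at least one training sample.
        if len(group) <= test_per_label:
            raise ValueError(f"label {label!r} has too few samples to split")
        test_indices = set(rng.sample(range(len(group)), test_per_label))
        for index, sample in enumerate(group):
            (test_set if index in test_indices else training_set).append(sample)
    return training_set, test_set


spectrograms = [
    ("spec_0", "fever"), ("spec_1", "fever"), ("spec_2", "fever"),
    ("spec_3", "chronic pharyngitis"), ("spec_4", "chronic pharyngitis"),
    ("spec_5", "chronic pharyngitis"),
]
training_set, test_set = split_by_label(spectrograms, rng=random.Random(0))
```

Splitting inside each label group guarantees that every disease condition appears in both sets, which a purely random split over a very small dataset would not.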
Step B: training the deep learning network with the training set to obtain an initial recognition model, and testing the initial recognition model against the test set to obtain a loss value; when the loss value is greater than a preset threshold, returning to step A; when the loss value is less than or equal to the preset threshold, using the initial recognition model as the user state recognition model.
Preferably, the deep learning network in this embodiment is a convolutional neural network.
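The control flow of steps A and B can be sketched as a loop. In this Python sketch the `split`, `train`, and `evaluate` callables are hypothetical placeholders for the operations described in the text, and the `max_rounds` safeguard is an added assumption, since the application does not bound the number of repetitions. Whether training resumes from the previous model or restarts is also not stated, so the sketch leaves that to the `train` callable.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # sample type
M = TypeVar("M")  # model type


def train_until_threshold(
    samples: Sequence[S],
    split: Callable[[Sequence[S]], Tuple[List[S], List[S]]],
    train: Callable[[List[S]], M],
    evaluate: Callable[[M, List[S]], float],
    threshold: float,
    max_rounds: int = 100,  # assumed safeguard, not part of the text
) -> Tuple[M, float]:
    """Repeat step A (split) and step B (train, test) until the loss is low enough."""
    for _ in range(max_rounds):
        training_set, test_set = split(samples)   # step A
        model = train(training_set)               # step B: train
        loss = evaluate(model, test_set)          # step B: test
        if loss <= threshold:
            return model, loss
    raise RuntimeError("loss never reached the threshold")


# Toy stand-ins so the loop can run: the loss improves on each round.
fake_losses = iter([0.9, 0.5, 0.1])
model, loss = train_until_threshold(
    samples=[1, 2, 3],
    split=lambda s: (list(s[:-1]), list(s[-1:])),
    train=lambda training_set: ("model", len(training_set)),
    evaluate=lambda m, test_set: next(fake_losses),
    threshold=0.2,
)
```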
In this embodiment, the audio clips in the audio training set may differ in duration, so the images in the target spectrogram set differ in size. As a result, the features that the deep learning network model extracts from the target spectrograms during training have different dimensions and cannot be trained on uniformly. To make better use of the data in the audio training set, before the deep learning network is trained with the training set, an attention mechanism processing layer is added in front of the fully connected layer of the deep learning network model to align the image features. The attention mechanism processing layer is a network that aligns features whose dimensions differ. For example, the image feature a that the deep learning network model extracts from target spectrogram A is a D×T1 matrix, and the image feature b that it extracts from target spectrogram B is a D×T2 matrix. The attention mechanism processing layer multiplies image feature a by a preset T1×1 weight matrix to convert it into a D-dimensional (D×1) matrix, and multiplies image feature b by a preset T2×1 weight matrix to convert it into a D-dimensional (D×1) matrix, thereby aligning image features a and b.
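The alignment in the example above is a matrix-vector product, which the following Python sketch illustrates. The function name `align_features` and the uniform weights are assumptions; the text only says the weight matrices are preset and does not say how their values are chosen.

```python
from typing import List

Matrix = List[List[float]]


def align_features(feature: Matrix, weights: List[float]) -> List[float]:
    """Collapse a D x T feature matrix to D values using a T x 1 weight column."""
    if any(len(row) != len(weights) for row in feature):
        raise ValueError("weight length must equal the number of columns T")
    return [sum(value * w for value, w in zip(row, weights)) for row in feature]


# Two features with the same D = 2 but different lengths T1 = 3 and T2 = 2.
feature_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
feature_b = [[1.0, 1.0], [2.0, 2.0]]

# Uniform weights are used purely for illustration.
vector_a = align_features(feature_a, [1.0 / 3.0] * 3)
vector_b = align_features(feature_b, [0.5, 0.5])
```

Both results have length D regardless of T, so spectrograms of different durations can feed the same fully connected layer.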
Further, because the training set contains few samples, this embodiment needs to test the initial recognition model to verify its recognition ability, which makes it easier to adjust the model during training.
In detail, the recognition categories of the initial recognition model in this embodiment are the same as the categories of the initial labels in the target spectrogram set. For example, if the target spectrogram set has two initial labels, chronic pharyngitis and fever, then the initial recognition model has the same two recognition categories, chronic pharyngitis and fever. Further, the model generation module 101 in this embodiment obtains the loss value as follows: extracting the feature vector corresponding to each initial label in the initial recognition model to obtain a target feature vector; using the initial recognition model to perform feature extraction on each target spectrogram in the test subset to obtain a test feature vector; calculating the distance between the target feature vector corresponding to each initial label and the test feature vector to obtain a loss distance value; and calculating the average of all the loss distance values to obtain the loss value. Preferably, this embodiment uses the Euclidean distance to calculate the distance between the target feature vector corresponding to each initial label and the test feature vector.
Further, as those skilled in the art will know, the different recognition categories of the initial recognition model are connected to different fully connected layer nodes, and the fully connected layer nodes have a fixed order. The model generation module 101 in this embodiment obtains the output values of the fully connected layer nodes corresponding to each recognition category of the initial recognition model and combines them in the order of those nodes to obtain the corresponding target feature vector. Further, the model generation module 101 inputs each target spectrogram in the test subset into the initial recognition model and, according to the initial label of that target spectrogram, obtains the output values of the fully connected layer nodes of the corresponding recognition category in the initial recognition model and combines them in the order of those nodes to obtain the test feature vector.
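The loss computation described above can be sketched in Python as the mean Euclidean distance between each test feature vector and the target feature vector of the same label. Pairing vectors by label is one reading of the text, and the function name `loss_value` and the data layout are assumptions for illustration.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def loss_value(
    target_vectors: Dict[Hashable, Sequence[float]],
    test_vectors: List[Tuple[Hashable, Sequence[float]]],
) -> float:
    """Average the Euclidean distances between test vectors and their label's target vector."""
    if not test_vectors:
        raise ValueError("at least one test feature vector is required")
    distances = [
        math.dist(target_vectors[label], vector)  # one loss distance value
        for label, vector in test_vectors
    ]
    return sum(distances) / len(distances)


# Hypothetical feature vectors for the two labels used in the example above.
targets = {"fever": [0.0, 0.0], "chronic pharyngitis": [1.0, 1.0]}
tests = [("fever", [3.0, 4.0]), ("chronic pharyngitis", [1.0, 1.0])]
loss = loss_value(targets, tests)  # distances 5.0 and 0.0, so the mean is 2.5
```

`math.dist` requires Python 3.8 or later.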
In another embodiment of this application, to ensure data privacy, the audio training set may be stored in a blockchain node.
The state recognition module 102 is configured to, upon receiving audio of a user to be recognized, perform feature conversion on that audio to obtain a spectrogram to be recognized, and to recognize the spectrogram to be recognized using the user state recognition model to obtain a user state recognition result.
In this embodiment, the audio of the user to be recognized is of the same category as the audio in the audio training set. Preferably, in this embodiment, the audio of the user to be recognized is the user's cough audio.
Further, in this embodiment, the method for performing feature conversion on the audio of the user to be recognized is the same as the method described above for performing feature conversion on each audio clip in the audio training set.
In this embodiment, the user state recognition result is the user's disease condition, for example acute bronchitis, chronic pharyngitis, whooping cough, or fever.
FIG. 4 is a schematic structural diagram of an electronic device that implements the audio-based user state recognition method of this application.
The electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an audio-based user state recognition program.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, card-type memory (for example SD or DX memory), magnetic memory, a magnetic disk, an optical disk, and so on. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. In other embodiments the memory 11 may be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store application software installed on the electronic device 1 and various kinds of data, such as the code of the audio-based user state recognition program, but also to temporarily store data that has been output or is about to be output.
In some embodiments the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including a combination of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The processor 10 is the control unit of the electronic device. It connects the components of the entire electronic device through various interfaces and lines, and performs the functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example the audio-based user state recognition program) and calling the data stored in the memory 11.
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to provide connection and communication between the memory 11, the at least one processor 10, and other components.
FIG. 4 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 4 does not limit the electronic device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) that powers the components. Preferably, the power supply is logically connected to the at least one processor 10 through a power management apparatus, so that charging management, discharging management, power consumption management, and similar functions are implemented through the power management apparatus. The power supply may also include one or more DC or AC power sources, recharging apparatuses, power failure detection circuits, power converters or inverters, power status indicators, and other such components. The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described further here.
Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may also include a user interface, which may be a display or an input unit (such as a keyboard). Optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or display unit, as appropriate, and is used to display information processed in the electronic device 1 and to display a visual user interface.
It should be understood that the embodiments are for illustration only and that the scope of the patent application is not limited by this structure.
The audio-based user state recognition program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run on the processor 10, can implement:
Obtaining an audio training set, and performing feature conversion on each audio clip in the audio training set to obtain a target spectrogram set;
Training a pre-built deep learning network model with the target spectrogram set, based on an attention mechanism and small-sample learning, to obtain a user state recognition model;
When audio of a user to be recognized is received, performing feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
Recognizing the spectrogram to be recognized using the user state recognition model to obtain a user state recognition result.
Specifically, for how the processor 10 implements the above instructions, refer to the description of the relevant steps in the embodiment corresponding to FIG. 1; the details are not repeated here.
Further, if the modules/units integrated in the electronic device 1 are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium, which may be volatile or non-volatile. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, computer memory, or read-only memory (ROM).
The computer-readable storage medium stores a computer program which, when executed by a processor, implements the following steps:
Obtaining an audio training set, and performing feature conversion on each audio clip in the audio training set to obtain a target spectrogram set;
Training a pre-built deep learning network model with the target spectrogram set, based on an attention mechanism and small-sample learning, to obtain a user state recognition model;
When audio of a user to be recognized is received, performing feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
Recognizing the spectrogram to be recognized using the user state recognition model to obtain a user state recognition result.
Further, the computer-usable storage medium may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function, and the like; the data storage area may store data created through the use of blockchain nodes, and the like.
In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware, or in hardware combined with software functional modules.
To those skilled in the art, it is evident that this application is not limited to the details of the exemplary embodiments above, and that it can be implemented in other specific forms without departing from its spirit or essential characteristics.
The embodiments should therefore be regarded in every respect as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the description above, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced by this application. No reference sign in the claims should be regarded as limiting the claim concerned.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
In addition, the word "including" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses recited in a system claim may also be implemented by a single unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or replaced with equivalents without departing from their spirit and scope.

Claims (20)

  1. An audio-based user state recognition method, wherein the method comprises:
    obtaining an audio training set, and performing feature conversion on each audio clip in the audio training set to obtain a target spectrogram set;
    training a pre-built deep learning network model with the target spectrogram set, based on an attention mechanism and small-sample learning, to obtain a user state recognition model;
    when audio of a user to be recognized is received, performing feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
    recognizing the spectrogram to be recognized using the user state recognition model to obtain a user state recognition result.
  2. The audio-based user state recognition method according to claim 1, wherein performing feature conversion on each audio clip in the audio training set to obtain the target spectrogram set comprises:
    resampling each audio clip in the audio training set to obtain a corresponding digital voice signal;
    performing pre-emphasis on the digital voice signal to obtain a standard digital voice signal;
    aggregating all the standard digital voice signals to obtain a voice signal set;
    performing feature conversion on each standard digital voice signal in the voice signal set to obtain the target spectrogram set.
  3. The audio-based user state recognition method according to claim 2, wherein performing feature conversion on each standard digital voice signal in the voice signal set to obtain the target spectrogram set comprises:
    mapping each standard digital voice signal in the voice signal set to the frequency domain using a preset sound processing algorithm to obtain a corresponding target spectrogram;
    aggregating all the target spectrograms to obtain the target spectrogram set.
  4. The audio-based user state recognition method according to claim 1, wherein training the pre-built deep learning network model with the target spectrogram set to obtain the user state recognition model comprises:
    randomly dividing the target spectrogram set into a training set and a test set;
    training the deep learning network model with the training set to obtain an initial recognition model;
    testing the initial recognition model against the test set to obtain a loss value;
    when the loss value is greater than a preset threshold, returning to the step of randomly dividing the target spectrogram set into a training set and a test set;
    when the loss value is less than or equal to the preset threshold, using the initial recognition model as the user state recognition model.
  5. The audio-based user state recognition method according to claim 4, wherein randomly dividing the target spectrogram set into a training set and a test set comprises:
    classifying each target spectrogram in the target spectrogram set according to its corresponding initial label to obtain a corresponding classified target spectrogram set;
    randomly selecting a preset number of target spectrograms from the classified target spectrogram set as a test subset, and using the complement of the test subset within the classified target spectrogram set as a training subset;
    collecting all of the training subsets to obtain the training set;
    collecting all of the test subsets to obtain the test set.
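The per-label split in claim 5 resembles a stratified split. The sketch below assumes the preset number is the same for every label and that each label group holds more samples than that number; the function name, the `seed` parameter, and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, n_test_per_class, seed=None):
    """Split (sample, label) pairs into train and test sets label by label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)       # classify by initial label

    train_set, test_set = [], []
    for label, group in by_label.items():
        if n_test_per_class >= len(group):
            raise ValueError(f"not enough samples for label {label!r}")
        # Randomly pick the test subset; the complement is the training subset.
        test_idx = set(rng.sample(range(len(group)), n_test_per_class))
        for i, sample in enumerate(group):
            target = test_set if i in test_idx else train_set
            target.append((sample, label))
    return train_set, test_set


# Twelve placeholder spectrogram names spread over three hypothetical labels.
samples = [f"spec_{i}" for i in range(12)]
labels = ["label_a", "label_b", "label_c"] * 4
train_set, test_set = stratified_split(samples, labels, n_test_per_class=1, seed=0)
```

Drawing the test subset inside each label group keeps every label represented in both sets, which matters when only a few samples per label are available.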
  6. The audio-based user state recognition method according to claim 5, wherein testing the initial recognition model on the test set to obtain a loss value comprises:
    extracting, from the initial recognition model, the feature vector corresponding to each initial label to obtain a target feature vector;
    performing feature extraction on each target spectrogram in the test set using the initial recognition model to obtain a corresponding test feature vector;
    calculating, for each initial label, the distance between the corresponding target feature vector and the test feature vector to obtain a loss distance value;
    calculating the average of all the loss distance values to obtain the loss value.
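The loss in claim 6 can be illustrated with a small worked example. The sketch assumes Euclidean distance and assumes each test feature vector is compared with the target feature vector of its own label; the claim does not fix the distance metric, and the function name and labels `"A"` and `"B"` are hypothetical.

```python
import math


def average_distance_loss(target_vectors, test_vectors):
    """Mean distance between test features and their label's target feature.

    target_vectors: dict mapping label -> target feature vector
    test_vectors:   list of (label, test feature vector) pairs
    Assumption: Euclidean distance (math.dist, Python 3.8+).
    """
    distances = [math.dist(target_vectors[label], vec)
                 for label, vec in test_vectors]
    return sum(distances) / len(distances)


targets = {"A": [0.0, 0.0], "B": [1.0, 1.0]}
tests = [("A", [3.0, 4.0]), ("B", [1.0, 1.0])]
loss = average_distance_loss(targets, tests)  # (5.0 + 0.0) / 2 = 2.5
```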
  7. The audio-based user state recognition method according to any one of claims 1 to 6, wherein the audio training set is a set of cough audio corresponding to different disease conditions.
  8. An audio-based user state recognition device, wherein the device comprises:
    a model generation module configured to acquire an audio training set, perform feature conversion on each audio in the audio training set to obtain a target spectrogram set, and, based on an attention mechanism and few-shot learning, train a pre-built deep learning network model using the target spectrogram set to obtain a user state recognition model;
    a state recognition module configured to, when audio of a user to be recognized is received, perform feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized, and recognize the spectrogram to be recognized using the user state recognition model to obtain a user state recognition result.
  9. An electronic device, wherein the electronic device comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the following steps:
    acquiring an audio training set, and performing feature conversion on each audio in the audio training set to obtain a target spectrogram set;
    training a pre-built deep learning network model using the target spectrogram set, based on an attention mechanism and few-shot learning, to obtain a user state recognition model;
    when audio of a user to be recognized is received, performing feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
    recognizing the spectrogram to be recognized using the user state recognition model to obtain a user state recognition result.
  10. The electronic device according to claim 9, wherein performing feature conversion on each audio in the audio training set to obtain the target spectrogram set comprises:
    resampling each audio in the audio training set to obtain a corresponding digital voice signal;
    performing pre-emphasis on the digital voice signal to obtain a standard digital voice signal;
    collecting all of the standard digital voice signals to obtain a voice signal set;
    performing feature conversion on each standard digital voice signal in the voice signal set to obtain the target spectrogram set.
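The resampling and pre-emphasis steps recited above can be sketched in plain Python. This is an illustration under stated assumptions: resampling is done by linear interpolation without an anti-aliasing filter, pre-emphasis uses the common first-order filter y[n] = x[n] - alpha * x[n-1], and the default alpha of 0.97 is a conventional choice rather than a value taken from the claims. Both function names are hypothetical.

```python
def resample_linear(signal, src_rate, dst_rate):
    """Resample by linear interpolation (no anti-aliasing; sketch only)."""
    n_out = int(len(signal) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[0] = x[0], y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


down = resample_linear([0.0, 2.0, 4.0, 6.0], src_rate=4, dst_rate=2)
up = resample_linear([0.0, 2.0], src_rate=1, dst_rate=2)
emphasized = pre_emphasis([1.0, 1.0, 1.0], alpha=0.5)
```

Resampling brings recordings made at different rates to one common rate, and pre-emphasis flattens the spectrum before the frequency-domain mapping.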
  11. The electronic device according to claim 10, wherein performing feature conversion on each standard digital voice signal in the voice signal set to obtain the target spectrogram set comprises:
    mapping each standard digital voice signal in the voice signal set to the frequency domain using a preset sound processing algorithm to obtain a corresponding target spectrogram;
    collecting all of the target spectrograms to obtain the target spectrogram set.
  12. The electronic device according to claim 9, wherein training the pre-built deep learning network model using the target spectrogram set to obtain the user state recognition model comprises:
    randomly dividing the target spectrogram set into a training set and a test set;
    training the deep learning network model using the training set to obtain an initial recognition model;
    testing the initial recognition model on the test set to obtain a loss value;
    when the loss value is greater than a preset threshold, returning to the step of randomly dividing the target spectrogram set into a training set and a test set;
    when the loss value is less than or equal to the preset threshold, using the initial recognition model as the user state recognition model.
  13. The electronic device according to claim 12, wherein randomly dividing the target spectrogram set into a training set and a test set comprises:
    classifying each target spectrogram in the target spectrogram set according to its corresponding initial label to obtain a corresponding classified target spectrogram set;
    randomly selecting a preset number of target spectrograms from the classified target spectrogram set as a test subset, and using the complement of the test subset within the classified target spectrogram set as a training subset;
    collecting all of the training subsets to obtain the training set;
    collecting all of the test subsets to obtain the test set.
  14. The electronic device according to claim 13, wherein testing the initial recognition model on the test set to obtain a loss value comprises:
    extracting, from the initial recognition model, the feature vector corresponding to each initial label to obtain a target feature vector;
    performing feature extraction on each target spectrogram in the test set using the initial recognition model to obtain a corresponding test feature vector;
    calculating, for each initial label, the distance between the corresponding target feature vector and the test feature vector to obtain a loss distance value;
    calculating the average of all the loss distance values to obtain the loss value.
  15. The electronic device according to any one of claims 9 to 14, wherein the audio training set is a set of cough audio corresponding to different disease conditions.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    acquiring an audio training set, and performing feature conversion on each audio in the audio training set to obtain a target spectrogram set;
    training a pre-built deep learning network model using the target spectrogram set, based on an attention mechanism and few-shot learning, to obtain a user state recognition model;
    when audio of a user to be recognized is received, performing feature conversion on the audio of the user to be recognized to obtain a spectrogram to be recognized;
    recognizing the spectrogram to be recognized using the user state recognition model to obtain a user state recognition result.
  17. The computer-readable storage medium according to claim 16, wherein performing feature conversion on each audio in the audio training set to obtain the target spectrogram set comprises:
    resampling each audio in the audio training set to obtain a corresponding digital voice signal;
    performing pre-emphasis on the digital voice signal to obtain a standard digital voice signal;
    collecting all of the standard digital voice signals to obtain a voice signal set;
    performing feature conversion on each standard digital voice signal in the voice signal set to obtain the target spectrogram set.
  18. The computer-readable storage medium according to claim 17, wherein performing feature conversion on each standard digital voice signal in the voice signal set to obtain the target spectrogram set comprises:
    mapping each standard digital voice signal in the voice signal set to the frequency domain using a preset sound processing algorithm to obtain a corresponding target spectrogram;
    collecting all of the target spectrograms to obtain the target spectrogram set.
  19. The computer-readable storage medium according to claim 16, wherein training the pre-built deep learning network model using the target spectrogram set to obtain the user state recognition model comprises:
    randomly dividing the target spectrogram set into a training set and a test set;
    training the deep learning network model using the training set to obtain an initial recognition model;
    testing the initial recognition model on the test set to obtain a loss value;
    when the loss value is greater than a preset threshold, returning to the step of randomly dividing the target spectrogram set into a training set and a test set;
    when the loss value is less than or equal to the preset threshold, using the initial recognition model as the user state recognition model.
  20. The computer-readable storage medium according to claim 19, wherein randomly dividing the target spectrogram set into a training set and a test set comprises:
    classifying each target spectrogram in the target spectrogram set according to its corresponding initial label to obtain a corresponding classified target spectrogram set;
    randomly selecting a preset number of target spectrograms from the classified target spectrogram set as a test subset, and using the complement of the test subset within the classified target spectrogram set as a training subset;
    collecting all of the training subsets to obtain the training set;
    collecting all of the test subsets to obtain the test set.
PCT/CN2020/131983 2020-10-09 2020-11-27 Audio-based user state identification method and apparatus, and electronic device and storage medium WO2021189903A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011074898.9 2020-10-09
CN202011074898.9A CN112233700A (en) 2020-10-09 2020-10-09 Audio-based user state identification method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2021189903A1 (en) 2021-09-30

Family

ID=74120698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131983 WO2021189903A1 (en) 2020-10-09 2020-11-27 Audio-based user state identification method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112233700A (en)
WO (1) WO2021189903A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117476036A (en) * 2023-12-27 2024-01-30 广州声博士声学技术有限公司 Environmental noise identification method, system, equipment and medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116509371A (en) * 2022-01-21 2023-08-01 华为技术有限公司 Audio detection method and electronic equipment
CN114373484A (en) * 2022-03-22 2022-04-19 南京邮电大学 Voice-driven small sample learning method for Parkinson disease multi-symptom characteristic parameters
CN114722884B (en) * 2022-06-08 2022-09-30 深圳市润东来科技有限公司 Audio control method, device and equipment based on environmental sound and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104321015A (en) * 2012-03-29 2015-01-28 昆士兰大学 A method and apparatus for processing patient sounds
CN106073706A (en) * 2016-06-01 2016-11-09 中国科学院软件研究所 A kind of customized information towards Mini-mental Status Examination and audio data analysis method and system
CN106202952A (en) * 2016-07-19 2016-12-07 南京邮电大学 A kind of Parkinson disease diagnostic method based on machine learning
CN106847262A (en) * 2016-12-28 2017-06-13 华中农业大学 A kind of porcine respiratory disease automatic identification alarm method
WO2019023879A1 (en) * 2017-07-31 2019-02-07 深圳和而泰智能家居科技有限公司 Cough sound recognition method and device, and storage medium
WO2019119050A1 (en) * 2017-12-21 2019-06-27 The University Of Queensland A method for analysis of cough sounds using disease signatures to diagnose respiratory diseases

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108205535A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 The method and its system of Emotion tagging
CN111666960B (en) * 2019-03-06 2024-01-19 南京地平线机器人技术有限公司 Image recognition method, device, electronic equipment and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117476036A (en) * 2023-12-27 2024-01-30 广州声博士声学技术有限公司 Environmental noise identification method, system, equipment and medium
CN117476036B (en) * 2023-12-27 2024-04-09 广州声博士声学技术有限公司 Environmental noise identification method, system, equipment and medium

Also Published As

Publication number Publication date
CN112233700A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
WO2021189903A1 (en) Audio-based user state identification method and apparatus, and electronic device and storage medium
WO2022116420A1 (en) Speech event detection method and apparatus, electronic device, and computer storage medium
WO2021232594A1 (en) Speech emotion recognition method and apparatus, electronic device, and storage medium
WO2022121176A1 (en) Speech synthesis method and apparatus, electronic device, and readable storage medium
WO2021189855A1 (en) Image recognition method and apparatus based on ct sequence, and electronic device and medium
WO2022105179A1 (en) Biological feature image recognition method and apparatus, and electronic device and readable storage medium
WO2021208696A1 (en) User intention analysis method, apparatus, electronic device, and computer storage medium
WO2022194062A1 (en) Disease label detection method and apparatus, electronic device, and storage medium
WO2022222943A1 (en) Department recommendation method and apparatus, electronic device and storage medium
WO2022247005A1 (en) Method and apparatus for identifying target object in image, electronic device and storage medium
WO2021151291A1 (en) Disease risk analysis method, apparatus, electronic device, and computer storage medium
CN107729928A (en) Information acquisition method and device
CN113064994A (en) Conference quality evaluation method, device, equipment and storage medium
WO2022222942A1 (en) Method and apparatus for generating question and answer record, electronic device, and storage medium
CN113806434A (en) Big data processing method, device, equipment and medium
CN113157739B (en) Cross-modal retrieval method and device, electronic equipment and storage medium
CN113205814B (en) Voice data labeling method and device, electronic equipment and storage medium
CN113434542B (en) Data relationship identification method and device, electronic equipment and storage medium
WO2022227171A1 (en) Method and apparatus for extracting key information, electronic device, and medium
WO2022141867A1 (en) Speech recognition method and apparatus, and electronic device and readable storage medium
CN111933274A (en) Disease classification diagnosis method and device, electronic equipment and storage medium
CN116844711A (en) Disease auxiliary identification method and device based on deep learning
CN116702776A (en) Multi-task semantic division method, device, equipment and medium based on cross-Chinese and western medicine
CN116578704A (en) Text emotion classification method, device, equipment and computer readable medium
CN111859985B (en) AI customer service model test method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20927189

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20927189

Country of ref document: EP

Kind code of ref document: A1