WO2022134833A1 - Speech signal processing method, apparatus, device and storage medium - Google Patents

Speech signal processing method, apparatus, device and storage medium

Info

Publication number
WO2022134833A1
WO2022134833A1 (PCT/CN2021/126111)
Authority
WO
WIPO (PCT)
Prior art keywords
target
short
voice
preset
voice signal
Prior art date
Application number
PCT/CN2021/126111
Other languages
English (en)
French (fr)
Inventor
赵沁
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022134833A1 publication Critical patent/WO2022134833A1/zh

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state

Definitions

  • the present application relates to the field of speech signal processing of artificial intelligence, and in particular, to a method, apparatus, device and storage medium for processing speech signals.
  • the present application provides a voice signal processing method, apparatus, device and storage medium, which are used to improve the recognition accuracy of valid short speech.
  • a first aspect of the present application provides a method for processing a speech signal, including:
  • the target short speech segment is matched with the preset short speech segments and classification label extraction is performed in sequence to obtain a target classification label, and the target classification label includes interrogative tone, normal statement tone and/or false-alarm noise;
  • the to-be-processed voice signal is filtered to obtain a target voice signal.
  • a second aspect of the present application provides a voice signal processing device, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
  • the target short speech segment is matched with the preset short speech segments and classification label extraction is performed in sequence to obtain a target classification label, and the target classification label includes interrogative tone, normal statement tone and/or false-alarm noise;
  • the to-be-processed voice signal is filtered to obtain a target voice signal.
  • a third aspect of the present application provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions are executed on a computer, the computer is caused to perform the following steps:
  • the target short speech segment is matched with the preset short speech segments and classification label extraction is performed in sequence to obtain a target classification label, and the target classification label includes interrogative tone, normal statement tone and/or false-alarm noise;
  • the to-be-processed voice signal is filtered to obtain a target voice signal.
  • a fourth aspect of the present application provides a device for processing a voice signal, including:
  • a recognition and extraction module configured to acquire a to-be-processed speech signal, perform short speech segment recognition on the to-be-processed speech signal to obtain a target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain a target audio feature;
  • a matching and extraction module, configured to match the target short speech segment with the preset short speech segments and extract classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise;
  • a first classification module, configured to classify the target audio feature through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence level corresponding to the initial recognition type;
  • a judgment and determination module configured to judge whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determine the initial recognition type as a target recognition type;
  • a filtering module configured to filter the to-be-processed voice signal according to the target recognition type to obtain a target voice signal.
  • in the technical solution provided by the present application, a target classification label including interrogative tone, normal statement tone and/or false-alarm noise is obtained from the preset short speech segments according to the target short speech segment of the speech signal to be processed; the target audio features are classified through the target neural network model and the target classification label to obtain an initial recognition type and a target confidence; and the to-be-processed voice signal is filtered according to the target recognition type to obtain the target voice signal. Combining short speech segments with text output makes it possible to effectively judge and promptly identify the speaker's emotion and expressed content, as well as interrogative sentences and background noise, thereby improving the recognition accuracy of valid short speech.
  • FIG. 1 is a schematic diagram of an embodiment of a method for processing a speech signal in an embodiment of the present application
  • FIG. 2 is a schematic diagram of another embodiment of a method for processing a speech signal in an embodiment of the present application
  • FIG. 3 is a schematic diagram of an embodiment of an apparatus for processing a speech signal in an embodiment of the present application
  • FIG. 4 is a schematic diagram of another embodiment of an apparatus for processing a speech signal in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of a device for processing a speech signal in an embodiment of the present application.
  • Embodiments of the present application provide a voice signal processing method, apparatus, device, and storage medium, which improve the recognition accuracy of valid short speech.
  • an embodiment of the voice signal processing method in the embodiment of the present application includes:
  • the execution subject of the present application may be a device for processing voice signals, and may also be a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application take the server as an execution subject as an example for description.
  • the server can sequentially perform data cleaning, data integration and signal conversion on voice information sent through a preset interface to obtain an initial voice signal, perform pre-emphasis and windowed framing on the initial voice signal, and perform signal enhancement and voice endpoint detection to obtain the voice signal to be processed. Alternatively, the server can send an acquisition instruction to a preset voice collector or voice acquisition device so that the collector acquires the initial voice signal, after which pre-emphasis, windowed framing, signal enhancement and voice endpoint detection are performed to obtain the voice signal to be processed.
  • the server can obtain the target short speech segment by performing short speech segment recognition on the speech signal to be processed according to a preset short speech recognition rule, and the short speech recognition rule may include the speech duration and short-term energy of the target short speech segment.
  • the server can also perform short speech segment recognition on the speech signal to be processed according to the preset short speech recognition rules to obtain an initial short speech segment, perform speech recognition and text conversion on the initial short speech segment through an automatic speech recognition (ASR) algorithm to obtain an initial short speech text, and determine whether the initial short speech text is a monosyllabic word; if so, the initial short speech segment corresponding to the text is determined as the target short speech segment, and if not, the initial short speech segment is discarded or marked.
  • match the target short speech segment with the preset short speech segments and extract the classification label in sequence to obtain the target classification label, where the target classification label includes interrogative tone, normal statement tone and/or false-alarm noise.
  • the server may generate a target key for the target short voice segment and perform key-value matching against a hash table of preset short voice segments stored in a preset database to obtain the matched short voice segment corresponding to the target short voice segment; alternatively, the server may retrieve the preset database through a preset inverted index, or calculate the semantic similarity between the target short voice segment and the preset short voice segments.
  • the server extracts the classification label information of the target short speech segment through a preset label extraction algorithm to obtain the target classification label.
  • through the fully connected network layers in the preset target neural network model, the server classifies the target audio features and calculates probability values based on the target classification label, obtaining the initial recognition type corresponding to each target short speech segment and the confidence corresponding to the initial recognition type; this confidence is a probability value.
  • specifically, the server can use multiple classifiers in the preset target neural network model (there are multiple fully connected network layers, each corresponding to one classifier) to classify the target audio features and calculate probability values based on the target classification label, obtaining multiple recognition results and multiple initial confidences for each target short speech segment; the initial confidences are sorted in descending order, the top-ranked initial confidence is determined as the target confidence, and the recognition result corresponding to it is determined as the initial recognition type.
  • the server determines whether the target confidence is greater than the preset threshold; if so, the initial recognition type is determined as the target recognition type, and if not, the initial recognition type is set to a default type, which can be used to indicate a normal statement tone. After obtaining the target recognition type, the server can retrieve the initial historical short voice segments stored in the preset database according to the target short voice segment to obtain the corresponding target historical short voice segment, which carries classification label information from which its historical recognition type can be obtained. The server calculates the error value between the target recognition type and the historical recognition type and judges whether it exceeds a preset target error value: if so, the target recognition type and its corresponding target short voice segment are sent to a preset review terminal; if not, a correspondence between the target recognition type and the corresponding target short voice segment is created and persisted to a cache, which improves the recognition accuracy of the target recognition type.
  • for example, the target recognition types are interrogative tone, normal statement tone and false-alarm noise, and the speech signal to be processed includes three target short speech segments: target short speech segment 1, target short speech segment 2 and target short speech segment 3. The server classifies the speech signal to be processed according to the target recognition type, obtaining voice signal 1 corresponding to the interrogative tone, voice signal 2 corresponding to the normal statement tone, and voice signal 3 corresponding to the false-alarm noise; voice signal 3 is deleted from the to-be-processed voice signal, yielding a target voice signal containing voice signals 1 and 2.
  • in this embodiment of the application, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise is obtained, and the target audio features are classified through the target neural network model and the target classification label to obtain the initial recognition type and the target confidence; the to-be-processed voice signal is then filtered according to the target recognition type to obtain the target voice signal. Combining short speech segments with text output makes it possible to effectively judge and promptly identify the speaker's emotion and expressed content, as well as interrogative sentences and background noise, thereby improving the recognition accuracy of valid short speech.
  • referring to FIG. 2, another embodiment of the voice signal processing method in the embodiments of the present application includes:
  • acquiring a speech signal to be processed, performing short speech segment recognition on the speech signal to be processed to obtain a target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain the target audio features.
  • specifically, the server collects the to-be-processed voice signal through a preset voice collector and sequentially performs preprocessing, voice segment recognition and segmentation, and voice-to-text conversion on it, obtaining voice segments and the initial text information corresponding to each segment; monosyllabic words in the initial text information are identified to obtain target text information, and the voice segment corresponding to the target text information is determined as the target short speech segment; frame speech extraction is then performed on the target short speech segment according to a preset frame length and inter-frame overlap to obtain frame speech segments, and audio feature extraction is performed on the frame speech segments to obtain the target audio features.
  • for example, the server collects the voice signal to be processed by invoking a preset microphone or another preset voice collector, performs signal-enhancement preprocessing on it to obtain an enhanced voice signal, performs voice endpoint detection on the enhanced signal to obtain voice endpoints, and segments the enhanced signal according to those endpoints, thereby realizing voice segment recognition and segmentation and obtaining voice segments. Speech recognition and speech-to-text conversion are performed on the segments through the ASR algorithm to obtain the initial text information; monosyllabic words in the initial text are detected, and the target short speech segments corresponding to the monosyllabic words are obtained. Each frame of short speech in the target short speech segment is extracted according to the preset frame length and inter-frame overlap to obtain the frame speech segments; the frame length is 25 ms and the inter-frame overlap is 50%. The audio features of the frame speech segments are then extracted to obtain the target audio features, which include at least two of spectral features, Mel-frequency cepstral features, first-order and second-order difference features, volume features, and fundamental frequency features.
  • specifically, before the server acquires the to-be-processed speech signal and performs the above short speech segment recognition and frame audio feature extraction, it obtains type-annotated short speech segment training samples and performs frame audio feature extraction on them to obtain audio feature samples; the short speech segment training samples include label information for interrogative tone, normal statement tone and false-alarm noise. A preset ten-fold cross-validation algorithm classifies the audio feature samples into a training set and a validation set; the preset initial neural network model is trained on the training set to obtain a candidate neural network model, which is validated on the validation set to obtain a validation result; the candidate neural network model is then iteratively updated through a preset loss function, an optimizer and the validation result to obtain the target neural network model.
  • for example, the server obtains initial voice signal training samples, performs signal enhancement, voice endpoint detection and voice segment segmentation on them to obtain voice segment training samples, and performs text conversion and short-voice screening on these through the preset ASR algorithm to obtain short voice segment training samples. The short voice segment training samples are sent to a preset annotation terminal for labeling; labeling may also be done manually through the annotation terminal or by calling a preset annotation tool, yielding type-annotated short voice segment training samples.
  • the initial neural network model adopts a fully connected network structure; the loss function is the cross-entropy function (though the loss function is not limited to cross-entropy), the optimizer is the Adam optimizer, the learning rate is 10^(-4), and the batch size is 256. The network structure and model parameters of the candidate neural network model are iteratively updated; after 100 training cycles, the optimal model is selected according to the accuracy of the validation results to obtain the target neural network model. The optimizer may include at least one of a Momentum optimizer, an Adam optimizer, and a root mean square propagation (RMSprop) optimizer.
  • the server obtains a first error value between the verification result and the label information, and calculates a second error value of the candidate neural network model through a preset loss function; a target error value is determined according to the first error value and the second error value; through the optimizer, the model parameters and/or network structure of the candidate neural network model are iteratively updated until the target error value is less than a preset error value, and the target neural network model is obtained.
  • for example, the server calculates the similarity between the verification result and the label information and determines one minus that similarity as the first error value, and calculates the second error value of the candidate neural network model through the preset loss function. The sum, or a weighted sum, of the first error value and the second error value gives the target error value. Through the optimizer, the model parameters (hyperparameters) of the candidate neural network model are iteratively adjusted, and/or network layers are added to or deleted from the candidate neural network model, or the connection mode of its multiple network frameworks is adjusted, until the target error value is less than the preset error value and the loss function converges, yielding the target neural network model.
  • match the target short speech segment with the preset short speech segments and extract classification labels in sequence to obtain a target classification label, where the target classification label includes interrogative tone, normal statement tone and/or false-alarm noise.
  • specifically, the server calculates the short-term energy similarity and the audio feature similarity between the target speech segment and the preset short speech segments; the short-term energy similarity and the audio feature similarity are weighted and summed to obtain a target similarity; from the preset short speech segments, the target short speech segments whose target similarity is greater than a preset similarity are obtained, and the classification labels of the target short speech segments are extracted through a preset label extraction algorithm to obtain the target classification label.
  • besides the short-term energy similarity and the audio feature similarity, the server can also calculate the text similarity and the emotional feature similarity between the target speech segment and the preset short speech segments; the short-term energy, audio feature, text and emotional feature similarities are weighted and summed to obtain the target similarity, and the server judges whether the target similarity is greater than the preset target similarity: if so, the preset short speech segment corresponding to the target similarity is determined as the matched short speech segment; if not, a null value is returned and execution stops. The server then extracts the classification label of the target short speech segment through the preset label extraction algorithm to obtain the target classification label.
  • specifically, the server sequentially performs audio-emphasis feature matrix calculation and feature fusion on the target audio features through the attention mechanism layer in the preset target neural network model to obtain a fusion feature matrix; the target neural network model includes an attention mechanism layer and multi-layer fully connected layers. Through the multi-layer fully connected layers and the target classification label, multi-level classification and probability value calculation are performed on the fusion feature matrix to obtain the initial recognition type and the target confidence corresponding to the initial recognition type.
  • the server calculates the attention matrix of the target audio features through the attention mechanism layer in the preset target neural network model to obtain the audio-emphasis feature matrix, which is matrix-multiplied or matrix-added with the target audio features to obtain the fusion feature matrix; the multi-layer fully connected layers are connected in a preset serial mode, that is, the output of the previous fully connected layer is the input of the next fully connected layer.
  • the execution process of step 204 is similar to that of step 104 above, and details are not repeated here.
  • the server can split the speech signal to be processed into speech segments, delete the split segments that meet a preset type condition, and splice the remaining segments according to the time order and sequence of the speech signal to be processed to obtain the target speech signal. For example, the target recognition types are interrogative tone, normal statement tone and false-alarm noise, and the preset type condition is false-alarm noise. Splitting the speech signal to be processed yields segments A1 (corresponding to the normal statement tone), A2 (corresponding to false-alarm noise) and A3 (corresponding to the interrogative tone); A2 meets the preset type condition and is deleted, and A1 and A3 are spliced according to the time order and sequence of the speech signal to obtain the target speech signal A1A3.
  • the voice assistance information includes service information, answer information, and information of the called assistant robot corresponding to the target voice signal.
  • the voice signal processing method can be applied to an intelligent dialogue assistant decision-making system.
  • the server corresponding to the intelligent dialogue-assisted decision-making system performs voice recognition on the target voice signal to obtain voice text, performs entity recognition on the voice text to obtain entities, and retrieves the voice-assistance knowledge graph in the preset database according to those entities to obtain the voice assistance information corresponding to the target voice signal. The voice assistance information includes, but is not limited to, the business information corresponding to the voice, answer information and called assistant robot information. After obtaining the voice assistance information, the server can perform corresponding operations according to it, such as displaying business process information, conducting voice dialogue and invoking assistant robots. This improves the accuracy of matching voice assistance information, effectively avoids the problem of outputting background noise as valid speech segments, and avoids the added recognition burden and error rate that subsequent processing of and response to erroneous text from noise segments would cause, improving the efficiency and accuracy of the intelligent dialogue-assisted decision-making system, helping to improve its understanding ability and subsequent decision accuracy, and greatly improving the user experience. Since the technique is based on the speech segments output by ASR and the corresponding text output, no additional data processing is needed and it is easy to integrate into existing intelligent dialogue-assisted decision-making systems.
  • in this embodiment, not only are short speech segments and text output combined so that the speaker's emotion and expressed content, as well as interrogative sentences and background noise, can be effectively judged and promptly recognized, improving the recognition accuracy of valid short speech; the corresponding voice assistance information is also matched from the preset database according to the target voice signal, which improves the accuracy of matching the voice assistance information.
  • an embodiment of the apparatus for processing voice signals in the embodiments of the present application includes:
  • the identification and extraction module 301 is configured to obtain the speech signal to be processed, perform short speech segment recognition on the speech signal to be processed, obtain the target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain the target audio feature;
  • the matching extraction module 302 is used to sequentially perform matching and classification label extraction between the target short speech segment and the preset short speech segment, to obtain a target classification label, and the target classification label includes interrogative tone, normal statement tone and/or false alarm noise;
  • the first classification module 303 is used to classify the target audio feature through the preset target neural network model and the target classification label to obtain the initial recognition type and the target confidence level corresponding to the initial recognition type;
  • the judgment and determination module 304 is used to judge whether the target confidence is greater than the preset threshold, and if the target confidence is greater than the preset threshold, the initial recognition type is determined as the target recognition type;
  • the filtering module 305 is configured to filter the speech signal to be processed according to the target recognition type to obtain the target speech signal.
  • each module in the above-mentioned voice signal processing apparatus corresponds to each step in the above-mentioned voice signal processing method embodiment, and the functions and implementation process thereof will not be repeated here.
  • in this embodiment of the application, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise is obtained, and the target audio features are classified through the target neural network model and the target classification label to obtain the initial recognition type and the target confidence; the to-be-processed voice signal is then filtered according to the target recognition type to obtain the target voice signal. Combining short speech segments with text output makes it possible to effectively judge and promptly identify the speaker's emotion and expressed content, as well as interrogative sentences and background noise, thereby improving the recognition accuracy of valid short speech.
  • another embodiment of the apparatus for processing a speech signal in the embodiment of the present application includes:
  • the identification and extraction module 301 is configured to obtain the speech signal to be processed, perform short speech segment recognition on the speech signal to be processed, obtain the target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain the target audio feature;
  • the matching extraction module 302 is used to sequentially perform matching and classification label extraction between the target short speech segment and the preset short speech segment, to obtain a target classification label, and the target classification label includes interrogative tone, normal statement tone and/or false alarm noise;
  • the first classification module 303 is used to classify the target audio feature through the preset target neural network model and the target classification label to obtain the initial recognition type and the target confidence level corresponding to the initial recognition type;
  • the judgment and determination module 304 is used to judge whether the target confidence is greater than the preset threshold, and if the target confidence is greater than the preset threshold, the initial recognition type is determined as the target recognition type;
  • the filtering module 305 is used for filtering the to-be-processed voice signal according to the target recognition type to obtain the target voice signal;
  • the matching module 306 is configured to match the corresponding voice assistance information from the preset database according to the target voice signal, where the voice assistance information includes business information corresponding to the target voice signal, answer information and information of the called assistant robot.
  • the identification and extraction module 301 can also be specifically used for:
  • Collect the to-be-processed voice signal through a preset voice collector, and sequentially perform preprocessing, voice segment recognition and segmentation, and voice-to-text conversion on the to-be-processed voice signal to obtain the voice fragment and the initial text information corresponding to the voice fragment;
  • identify monosyllabic words in the initial text information to obtain target text information, and determine the speech segment corresponding to the target text information as the target short speech segment;
  • perform frame speech extraction on the target short speech segment according to the preset frame length and inter-frame overlap to obtain frame speech segments, and perform audio feature extraction on the frame speech segments to obtain the target audio features.
  • the matching extraction module 302 can also be specifically configured to:
  • calculate the short-term energy similarity and the audio feature similarity between the target speech segment and the preset short speech segments, and weight and sum the short-term energy similarity and the audio feature similarity to obtain a target similarity;
  • from the preset short speech segments, obtain the target short speech segments whose target similarity is greater than the preset similarity, and extract the classification labels of the target short speech segments through the preset label extraction algorithm to obtain the target classification labels.
  • the first classification module 303 can also be specifically configured to:
  • sequentially perform audio-emphasis feature matrix calculation and feature fusion on the target audio features through the attention mechanism layer in the preset target neural network model to obtain a fusion feature matrix; the target neural network model includes an attention mechanism layer and multi-layer fully connected layers;
  • perform multi-level classification and probability value calculation on the fusion feature matrix through the multi-layer fully connected layers and the target classification label to obtain the initial recognition type and the target confidence corresponding to the initial recognition type.
  • optionally, the device for processing voice signals further includes:
  • a feature extraction module 307, configured to obtain type-annotated short voice segment training samples and perform frame audio feature extraction on the training samples to obtain audio feature samples; the short voice segment training samples include label information for interrogative tone, normal statement tone and false-alarm noise;
  • the second classification module 308 is configured to classify the audio feature samples into a training set and a validation set through a preset ten-fold cross-validation algorithm
  • the training verification module 309 is used to train the preset initial neural network model through the training set to obtain the candidate neural network model, and to verify the candidate neural network model through the verification set to obtain the verification result;
  • the updating module 310 is configured to iteratively update the candidate neural network model through the preset loss function, the optimizer and the verification result to obtain the target neural network model.
  • the update module 310 can also be specifically configured to:
  • obtain a first error value between the verification result and the label information, and calculate a second error value of the candidate neural network model through the preset loss function; determine a target error value according to the first error value and the second error value; and iteratively update the model parameters and/or network structure of the candidate neural network model through the optimizer until the target error value is less than the preset error value, obtaining the target neural network model.
  • each module and each unit in the above voice signal processing apparatus corresponds to each step in the above voice signal processing method embodiment, and their functions and implementation processes are not repeated here.
  • in this embodiment, not only are short speech segments and text output combined so that the speaker's emotion and expressed content, as well as interrogative sentences and background noise, can be effectively judged and promptly recognized, improving the recognition accuracy of valid short speech; the corresponding voice assistance information is also matched from the preset database according to the target voice signal, which improves the accuracy of matching the voice assistance information.
  • FIGS. 3 and 4 above describe the voice signal processing apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the following describes the voice signal processing device in the embodiments of the present application in detail from the perspective of hardware processing.
  • the voice signal processing device 500 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) that store application programs 533 or data 532.
  • the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the voice signal processing device 500 .
  • the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the voice signal processing device 500.
  • the voice signal processing device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • the present application also provides a device for processing voice signals, including: a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected through a line; the at least one processor invokes the instructions in the memory, so that the voice signal processing device executes the steps of the above voice signal processing method.
  • the present application also provides a computer-readable storage medium, and the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer performs the following steps:
  • the target short speech segment is matched with the preset short speech segments and classification label extraction is performed in sequence to obtain a target classification label, and the target classification label includes interrogative tone, normal statement tone and/or false-alarm noise;
  • the to-be-processed voice signal is filtered to obtain a target voice signal.
  • the computer-readable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required by at least one function, and the like, and the stored data area may store data created according to the use of blockchain nodes, and the like.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solutions of the present application, in essence, or the parts contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

This application relates to the technical field of artificial intelligence and provides a speech signal processing method, apparatus, device and storage medium, which are used to improve the recognition accuracy of valid short speech. The speech signal processing method includes: acquiring target short speech segments of a speech signal to be processed and extracting target audio features of the target short speech segments; obtaining a target classification label from preset short speech segments according to the target short speech segment, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise; classifying the target audio features through a target neural network model and the target classification label to obtain an initial recognition type and a target confidence; determining an initial recognition type whose target confidence is greater than a preset threshold as the target recognition type; and filtering the speech signal to be processed according to the target recognition type to obtain a target speech signal. In addition, this application also relates to blockchain technology: the speech signal to be processed can be stored in a blockchain.

Description

Speech signal processing method, apparatus, device and storage medium
This application claims priority to the Chinese patent application No. 202011545242.0, filed with the Chinese Patent Office on December 23, 2020 and entitled "Speech signal processing method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech signal processing in artificial intelligence, and in particular to a speech signal processing method, apparatus, device and storage medium.
Background
In recent years, with the vigorous development of deep learning and reinforcement learning, intelligent dialogue systems, as a core technology in the field of artificial intelligence, have been widely applied. Natural language understanding is an important link in an intelligent dialogue system. Existing natural language understanding approaches focus almost entirely on text processing, and, to improve the fluency and efficiency of human-computer interaction, intelligent dialogue systems have introduced text emotion processing.
However, the inventors realized that under existing natural language understanding approaches, some short utterances have little or no associated text information; the speaker's emotion and expressed content therefore cannot be judged effectively, resulting in low recognition accuracy for valid short speech.
Summary
This application provides a speech signal processing method, apparatus, device and storage medium, which are used to improve the recognition accuracy of valid short speech.
A first aspect of this application provides a speech signal processing method, including:
acquiring a speech signal to be processed, performing short speech segment recognition on the speech signal to be processed to obtain a target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain target audio features;
matching the target short speech segment with preset short speech segments and extracting classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise;
classifying the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
judging whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determining the initial recognition type as a target recognition type;
filtering the speech signal to be processed according to the target recognition type to obtain a target speech signal.
A second aspect of this application provides a speech signal processing device, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
acquiring a speech signal to be processed, performing short speech segment recognition on the speech signal to be processed to obtain a target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain target audio features;
matching the target short speech segment with preset short speech segments and extracting classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise;
classifying the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
judging whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determining the initial recognition type as a target recognition type;
filtering the speech signal to be processed according to the target recognition type to obtain a target speech signal.
A third aspect of this application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
acquiring a speech signal to be processed, performing short speech segment recognition on the speech signal to be processed to obtain a target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain target audio features;
matching the target short speech segment with preset short speech segments and extracting classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise;
classifying the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
judging whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determining the initial recognition type as a target recognition type;
filtering the speech signal to be processed according to the target recognition type to obtain a target speech signal.
A fourth aspect of this application provides a speech signal processing apparatus, including:
a recognition and extraction module, configured to acquire a speech signal to be processed, perform short speech segment recognition on the speech signal to be processed to obtain a target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain target audio features;
a matching and extraction module, configured to match the target short speech segment with preset short speech segments and extract classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise;
a first classification module, configured to classify the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
a judgment and determination module, configured to judge whether the target confidence is greater than a preset threshold and, if the target confidence is greater than the preset threshold, determine the initial recognition type as a target recognition type;
a filtering module, configured to filter the speech signal to be processed according to the target recognition type to obtain a target speech signal.
In the technical solution provided by this application, a target classification label including interrogative tone, normal statement tone and/or false-alarm noise is obtained from preset short speech segments according to the target short speech segment of the speech signal to be processed; the target audio features are classified through the target neural network model and the target classification label to obtain an initial recognition type and a target confidence; and the speech signal to be processed is filtered according to the target recognition type to obtain the target speech signal. Combining short speech segments with text output makes it possible to effectively judge and promptly identify the speaker's emotion and expressed content, as well as interrogative sentences and background noise, thereby improving the recognition accuracy of valid short speech.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an embodiment of the speech signal processing method in an embodiment of this application;
FIG. 2 is a schematic diagram of another embodiment of the speech signal processing method in an embodiment of this application;
FIG. 3 is a schematic diagram of an embodiment of the speech signal processing apparatus in an embodiment of this application;
FIG. 4 is a schematic diagram of another embodiment of the speech signal processing apparatus in an embodiment of this application;
FIG. 5 is a schematic diagram of an embodiment of the speech signal processing device in an embodiment of this application.
Detailed Description
The embodiments of this application provide a speech signal processing method, apparatus, device and storage medium, which improve the recognition accuracy of valid short speech.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in an order other than that illustrated or described here. Furthermore, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product or device.
For ease of understanding, the specific flow of an embodiment of this application is described below. Referring to FIG. 1, one embodiment of the speech signal processing method in the embodiments of this application includes the following steps.
101. Acquire a speech signal to be processed, perform short speech segment recognition on it to obtain a target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain target audio features.
It can be understood that the execution subject of this application may be a speech signal processing apparatus, a terminal, or a server, which is not specifically limited here. The embodiments of this application are described using a server as the execution subject.
The server may sequentially perform data cleaning, data integration and signal conversion on voice information sent through a preset interface to obtain an initial speech signal, perform pre-emphasis and windowed framing on the initial speech signal, and perform signal enhancement and voice endpoint detection to obtain the speech signal to be processed. Alternatively, the server may send an acquisition instruction to a preset voice collector or voice acquisition device so that the collector acquires the initial speech signal, after which pre-emphasis, windowed framing, signal enhancement and voice endpoint detection are performed to obtain the speech signal to be processed.
The server may perform short speech segment recognition on the speech signal to be processed according to preset short speech recognition rules to obtain the target short speech segment; the short speech recognition rules may cover the speech duration and short-term energy of a target short speech segment. Alternatively, the server may first obtain an initial short speech segment under the preset rules, perform speech recognition and text conversion on it through an automatic speech recognition (ASR) algorithm to obtain an initial short speech text, and judge whether the initial short speech text is a monosyllabic word: if so, the initial short speech segment corresponding to the text is determined as the target short speech segment; if not, the initial short speech segment is discarded or marked.
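As a minimal illustration of the monosyllable screening described above, the following Python sketch keeps only candidate segments whose transcript is a single word; asr_transcribe is a hypothetical stand-in for any ASR engine, not an API named in this application.

    def screen_short_segments(segments, asr_transcribe):
        # segments: candidate initial short speech segments (audio arrays)
        targets = []
        for audio in segments:
            text = asr_transcribe(audio)          # speech recognition + text conversion
            if len(text.strip().split()) == 1:    # single-word ("monosyllabic") transcript
                targets.append((audio, text))     # keep as a target short speech segment
            # otherwise the initial short segment is discarded or marked
        return targets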
102. Match the target short speech segment with preset short speech segments and extract classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise.
The server may generate a target key for the target short speech segment and perform key-value matching against a hash table of preset short speech segments stored in a preset database to obtain the matched short speech segment corresponding to the target short speech segment. Alternatively, the server may retrieve the preset database through a preset inverted index to obtain the matched short speech segment, or calculate the semantic similarity, text similarity and emotional feature similarity between the target short speech segment and the preset short speech segments, take the mean or a weighted sum of these similarities as the final similarity, and judge whether the final similarity is greater than a preset target value: if so, the corresponding preset short speech segment is determined as the matched segment; if not, a null value is returned. The server then extracts the classification label information of the matched short speech segment through a preset label extraction algorithm to obtain the target classification label.
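The hash-table matching route can be illustrated with a short Python sketch; deriving the target key from the segment's raw bytes with MD5 is purely an illustrative assumption, since the application does not fix how the key is generated.

    import hashlib

    preset_table = {}  # hash table of preset short speech segments, keyed by target key

    def make_key(segment_bytes: bytes) -> str:
        # generate the target key of a short speech segment (hash choice is illustrative)
        return hashlib.md5(segment_bytes).hexdigest()

    def match_preset(segment_bytes: bytes):
        # key-value matching against the preset table; None plays the role of the null value
        return preset_table.get(make_key(segment_bytes))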
103. Classify the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type.
Through the fully connected network layers in the preset target neural network model, the server classifies the target audio features and computes probability values based on the target classification label, obtaining the initial recognition type corresponding to each target short speech segment and the confidence corresponding to the initial recognition type; this confidence is a probability value.
Specifically, the server may use multiple classifiers in the preset target neural network model (there are multiple fully connected network layers, one per classifier) to classify the target audio features and compute probability values based on the target classification label, obtaining multiple recognition results and multiple initial confidences for each target short speech segment. The initial confidences are sorted in descending order; the top-ranked initial confidence is determined as the target confidence, and the recognition result corresponding to it is determined as the initial recognition type.
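A minimal sketch of this descending-order selection, assuming each classifier head has already produced a (recognition result, probability) pair:

    def pick_initial_type(head_outputs):
        # head_outputs: list of (recognition_result, probability) pairs,
        # one per fully connected classifier head
        ranked = sorted(head_outputs, key=lambda pair: pair[1], reverse=True)
        initial_type, target_confidence = ranked[0]   # top-ranked confidence wins
        return initial_type, target_confidence

For example, pick_initial_type([("interrogative", 0.72), ("statement", 0.21), ("noise", 0.07)]) would return ("interrogative", 0.72).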
104. Judge whether the target confidence is greater than a preset threshold; if the target confidence is greater than the preset threshold, determine the initial recognition type as the target recognition type.
The server judges whether the target confidence is greater than the preset threshold. If so, the initial recognition type is determined as the target recognition type; if not, the initial recognition type is set to a default type, which may be used to indicate a normal statement tone. After obtaining the target recognition type, the server may retrieve the initial historical short speech segments stored in the preset database according to the target short speech segment to obtain the corresponding target historical short speech segment, which carries classification label information from which its historical recognition type can be obtained. The server calculates the error value between the target recognition type and the historical recognition type and judges whether it exceeds a preset target error value: if so, the target recognition type and the corresponding target short speech segment are sent to a preset review terminal; if not, a correspondence between the target recognition type and the corresponding target short speech segment is created and persisted to a cache, which improves the recognition accuracy of the target recognition type.
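A minimal sketch of this confidence gate, where the threshold value is a preset parameter not fixed by the application and DEFAULT_TYPE stands in for the normal-statement fallback:

    DEFAULT_TYPE = "normal_statement"   # default type indicating normal statement tone

    def decide_type(initial_type: str, confidence: float, threshold: float) -> str:
        # promote the initial type only when the confidence exceeds the preset threshold
        return initial_type if confidence > threshold else DEFAULT_TYPE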
105. Filter the speech signal to be processed according to the target recognition type to obtain a target speech signal.
For example, the target recognition types are interrogative tone, normal statement tone and false-alarm noise, and the speech signal to be processed includes three target short speech segments: target short speech segment 1, target short speech segment 2 and target short speech segment 3. The server classifies the speech signal to be processed according to the target recognition type, obtaining speech signal 1 corresponding to the interrogative tone, speech signal 2 corresponding to the normal statement tone, and speech signal 3 corresponding to the false-alarm noise; speech signal 3 is deleted from the speech signal to be processed, yielding a target speech signal containing speech signals 1 and 2.
In this embodiment of the application, a target classification label including interrogative tone, normal statement tone and/or false-alarm noise is obtained from the preset short speech segments according to the target short speech segment of the speech signal to be processed; the target audio features are classified through the target neural network model and the target classification label to obtain the initial recognition type and the target confidence; and the speech signal to be processed is filtered according to the target recognition type to obtain the target speech signal. Combining short speech segments with text output makes it possible to effectively judge and promptly identify the speaker's emotion and expressed content, as well as interrogative sentences and background noise, thereby improving the recognition accuracy of valid short speech.
Referring to FIG. 2, another embodiment of the speech signal processing method in the embodiments of this application includes the following steps.
201. Acquire a speech signal to be processed, perform short speech segment recognition on it to obtain a target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain target audio features.
Specifically, the server collects the speech signal to be processed through a preset voice collector and sequentially performs preprocessing, speech segment recognition and segmentation, and speech-to-text conversion on it, obtaining speech segments and the initial text information corresponding to each segment. Monosyllabic words in the initial text information are identified to obtain target text information, and the speech segment corresponding to the target text information is determined as the target short speech segment. According to a preset frame length and inter-frame overlap, frame-level speech extraction is performed on the target short speech segment to obtain frame speech segments, and audio feature extraction is performed on the frame speech segments to obtain the target audio features.
For example, the server collects the speech signal to be processed by invoking a preset microphone or another preset voice collector, performs signal-enhancement preprocessing on it to obtain an enhanced speech signal, performs voice endpoint detection on the enhanced signal to obtain voice endpoints, and segments the enhanced signal according to those endpoints, thereby realizing speech segment recognition and segmentation and obtaining speech segments. Speech recognition and speech-to-text conversion are performed on the segments through the ASR algorithm to obtain the initial text information; monosyllabic words in the initial text information are detected, and the target short speech segments corresponding to the monosyllabic words are obtained. Each frame of short speech in the target short speech segment is extracted according to the preset frame length and inter-frame overlap to obtain the frame speech segments; the frame length is 25 ms and the inter-frame overlap is 50%. The audio features of the frame speech segments are then extracted to obtain the target audio features, which include at least two of spectral features, Mel-frequency cepstral features, first-order and second-order difference features, volume features, and fundamental frequency features.
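A minimal numpy sketch of the framing step with the parameters stated above (25 ms frames, 50% inter-frame overlap); the 16 kHz sampling rate is an assumption, and the actual feature computation (MFCCs, difference features, volume, fundamental frequency) is left to any audio library.

    import numpy as np

    def frame_segment(signal: np.ndarray, sr: int = 16000,
                      frame_ms: float = 25.0, overlap: float = 0.5) -> np.ndarray:
        frame_len = int(sr * frame_ms / 1000)        # 25 ms -> 400 samples at 16 kHz
        hop = int(frame_len * (1 - overlap))         # 50% overlap -> 200-sample hop
        n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
        return np.stack([signal[i * hop: i * hop + frame_len]
                         for i in range(n_frames)])  # shape: (n_frames, frame_len)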
Specifically, before the server acquires the speech signal to be processed and obtains the target audio features as described above, it acquires type-annotated short speech segment training samples and performs frame audio feature extraction on them to obtain audio feature samples; the short speech segment training samples include label information for interrogative tone, normal statement tone and false-alarm noise. A preset ten-fold cross-validation algorithm classifies the audio feature samples into a training set and a validation set. The preset initial neural network model is trained on the training set to obtain a candidate neural network model, which is validated on the validation set to obtain a validation result; the candidate neural network model is then iteratively updated through a preset loss function, an optimizer and the validation result to obtain the target neural network model.
For example, the server acquires initial speech signal training samples, performs signal enhancement, voice endpoint detection and speech segment segmentation on them to obtain speech segment training samples, and performs text conversion and short-speech screening on these through the preset ASR algorithm to obtain short speech segment training samples. The short speech segment training samples are sent to a preset annotation terminal for labeling; labeling may also be done manually through the annotation terminal or by calling a preset annotation tool. This yields type-annotated short speech segment training samples whose annotations cover interrogative tone, normal statement tone and false-alarm noise, such as "question", "statement" and "noise". Frame audio feature extraction on the training samples yields the audio feature samples, which the preset ten-fold cross-validation algorithm splits into a training set and a validation set. The initial neural network model adopts a fully connected network structure; the loss function is the cross-entropy function (though the loss function is not limited to cross-entropy), the optimizer is the Adam optimizer, the learning rate is 10^(-4), and the batch size is 256. Using the cross-entropy function, the network structure and model parameters of the candidate neural network model are iteratively updated; after 100 training cycles, the optimal model is selected according to the accuracy of the validation results, yielding the target neural network model. When training the initial neural network model, a pre-trained model may be combined for training and iterative updating, and the optimizer may include at least one of a Momentum optimizer, an Adam optimizer, and a root mean square propagation (RMSprop) optimizer.
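The stated training configuration (fully connected network, cross-entropy loss, Adam, learning rate 10^(-4), batch size 256, 100 cycles) can be sketched in PyTorch as follows; the input dimension and hidden width are illustrative assumptions, while the three-class output mirrors the interrogative/statement/noise labels.

    import torch
    import torch.nn as nn

    model = nn.Sequential(                       # fully connected network structure
        nn.Linear(128, 256), nn.ReLU(),
        nn.Linear(256, 3),                       # interrogative / statement / noise
    )
    criterion = nn.CrossEntropyLoss()            # cross-entropy loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train(loader, epochs: int = 100):        # loader yields batches of size 256
        for _ in range(epochs):                  # 100 training cycles
            for features, labels in loader:
                optimizer.zero_grad()
                loss = criterion(model(features), labels)
                loss.backward()
                optimizer.step()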
Specifically, the server obtains a first error value between the validation result and the label information and computes a second error value of the candidate neural network model through the preset loss function; a target error value is determined according to the first error value and the second error value; and through the optimizer, the model parameters and/or network structure of the candidate neural network model are iteratively updated until the target error value is less than a preset error value, yielding the target neural network model.
For example, the server calculates the similarity between the validation result and the label information and takes one minus that similarity as the first error value, computes the second error value of the candidate neural network model through the preset loss function, and takes the sum, or a weighted sum, of the first and second error values as the target error value. Through the optimizer, the model parameters (hyperparameters) of the candidate model are iteratively adjusted, and/or network layers are added to or deleted from the candidate model, or the connection mode of its multiple network frameworks is adjusted, until the target error value is less than the preset error value and the loss function converges, yielding the target neural network model.
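A minimal sketch of the error combination just described; the weights are illustrative assumptions (equal weights recover the plain sum).

    def target_error(similarity: float, loss_value: float,
                     w1: float = 0.5, w2: float = 0.5) -> float:
        # first error value: one minus the validation/label similarity
        first_error = 1.0 - similarity
        # target error: (weighted) sum of the first and second error values
        return w1 * first_error + w2 * loss_value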
202. Match the target short speech segment with preset short speech segments and extract classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise.
Specifically, the server calculates the short-term energy similarity and the audio feature similarity between the target speech segment and the preset short speech segments; the short-term energy similarity and the audio feature similarity are weighted and summed to obtain a target similarity; from the preset short speech segments, the target short speech segments whose target similarity is greater than a preset similarity are obtained, and their classification labels are extracted through a preset label extraction algorithm to obtain the target classification label.
Besides the short-term energy similarity and the audio feature similarity, the server can also calculate the text similarity and the emotional feature similarity between the target speech segment and the preset short speech segments; the short-term energy, audio feature, text and emotional feature similarities are then weighted and summed to obtain the target similarity. The server judges whether the target similarity is greater than a preset target similarity: if so, the preset short speech segment corresponding to the target similarity is determined as the matched short speech segment; if not, a null value is returned and execution stops. The server then extracts the classification label of the target short speech segment through the preset label extraction algorithm to obtain the target classification label.
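A minimal sketch of this weighted-sum matching; the weight values and the threshold are assumptions for illustration, not values fixed by the application, and score_fn maps a preset segment to its target similarity with the target segment.

    def target_similarity(energy_sim, audio_sim, text_sim=0.0, emotion_sim=0.0,
                          weights=(0.4, 0.4, 0.1, 0.1)):
        # weighted sum of short-term energy, audio-feature, text and
        # emotion-feature similarities (weights are illustrative)
        sims = (energy_sim, audio_sim, text_sim, emotion_sim)
        return sum(w * s for w, s in zip(weights, sims))

    def match_segment(preset_segments, score_fn, preset_threshold):
        # pick the best-scoring preset segment; None stands for the null value
        best = max(preset_segments, key=score_fn, default=None)
        if best is not None and score_fn(best) > preset_threshold:
            return best
        return None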
203. Classify the target audio features through the preset target neural network model and the target classification label to obtain the initial recognition type and the target confidence corresponding to the initial recognition type.
Specifically, through the attention mechanism layer in the preset target neural network model, the server sequentially performs audio-emphasis feature matrix computation and feature fusion on the target audio features to obtain a fusion feature matrix; the target neural network model includes an attention mechanism layer and multi-layer fully connected layers. Through the multi-layer fully connected layers and the target classification label, multi-level classification and probability value calculation are performed on the fusion feature matrix to obtain the initial recognition type and the target confidence corresponding to it.
The server computes the attention matrix of the target audio features through the attention mechanism layer in the preset target neural network model to obtain the audio-emphasis feature matrix, which is matrix-multiplied or matrix-added with the target audio features to obtain the fusion feature matrix. The multi-layer fully connected layers are connected in a preset serial mode, that is, the output of the previous fully connected layer is the input of the next. Through these layers and based on the target classification label, multi-level classification and probability value calculation on the fusion feature matrix yield the initial recognition type and the corresponding target confidence, improving the accuracy with which the initial recognition type and its target confidence are obtained.
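A minimal PyTorch sketch of this model shape, under stated assumptions: the layer sizes are illustrative, and elementwise weighting is used here as one simple stand-in for the matrix multiplication or addition fusion described above.

    import torch
    import torch.nn as nn

    class ShortSpeechClassifier(nn.Module):
        def __init__(self, dim: int = 128, n_classes: int = 3):
            super().__init__()
            self.attn = nn.Linear(dim, dim)          # attention-matrix computation
            self.fc = nn.Sequential(                 # serially connected FC layers:
                nn.Linear(dim, 64), nn.ReLU(),       # output of one feeds the next
                nn.Linear(64, n_classes),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            weights = torch.softmax(self.attn(x), dim=-1)  # audio-emphasis weights
            fused = weights * x                            # feature fusion (elementwise variant)
            return torch.softmax(self.fc(fused), dim=-1)   # per-class probability values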
204. Judge whether the target confidence is greater than the preset threshold; if the target confidence is greater than the preset threshold, determine the initial recognition type as the target recognition type.
The execution process of step 204 is similar to that of step 104 above and is not repeated here.
205. Filter the speech signal to be processed according to the target recognition type to obtain the target speech signal.
According to the target recognition type, the server may split the speech signal to be processed into speech segments, delete the split segments that meet a preset type condition, and splice the remaining segments according to the time order and sequence of the speech signal to be processed to obtain the target speech signal. For example, the target recognition types are interrogative tone, normal statement tone and false-alarm noise, and the preset type condition is false-alarm noise. Splitting the speech signal to be processed yields segments A1 (corresponding to the normal statement tone), A2 (corresponding to false-alarm noise) and A3 (corresponding to the interrogative tone); A2 meets the preset type condition and is deleted, and A1 and A3 are spliced according to the time order and sequence of the speech signal to obtain the target speech signal A1A3.
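A minimal sketch of this delete-and-splice step, assuming each split segment carries a start time and its recognition type; the tuple layout is an illustrative choice.

    def filter_signal(labeled_segments, noise_label="false_alarm_noise"):
        # labeled_segments: list of (start_time, audio_chunk, recognition_type)
        kept = [seg for seg in sorted(labeled_segments, key=lambda seg: seg[0])
                if seg[2] != noise_label]            # delete noise segments (e.g., A2)
        return [chunk for _, chunk, _ in kept]       # splice remainder in time order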
206. Match corresponding voice assistance information from a preset database according to the target speech signal; the voice assistance information includes business information, answer information and called assistant robot information corresponding to the target speech signal.
For example, this speech signal processing method can be applied in an intelligent dialogue-assisted decision-making system. The server corresponding to the system performs speech recognition on the target speech signal to obtain speech text, performs entity recognition on the text to obtain entities, and retrieves a voice-assistance knowledge graph in the preset database according to the entities to obtain the voice assistance information corresponding to the target speech signal; the voice assistance information includes, but is not limited to, the business information corresponding to the speech, answer information, and called assistant robot information. After obtaining the voice assistance information, the server can perform corresponding operations according to it, such as displaying business process information, conducting voice dialogue and calling assistant robots. This improves the accuracy of matching voice assistance information, effectively avoids outputting background noise as valid speech segments, and avoids the added recognition burden and error rate that subsequent processing of and response to erroneous text from noise segments would cause. It thereby improves the efficiency and accuracy of the intelligent dialogue-assisted decision-making system, helps improve the system's understanding ability and subsequent decision accuracy, and greatly improves the user experience. Since the technique is based on the speech segments output by ASR and the corresponding text output, no additional data processing is needed and it is easy to integrate into existing intelligent dialogue-assisted decision-making systems.
In this embodiment of the application, not only are short speech segments and text output combined so that the speaker's emotion and expressed content, as well as interrogative sentences and background noise, can be effectively judged and promptly identified, improving the recognition accuracy of valid short speech; the corresponding voice assistance information is also matched from the preset database according to the target speech signal, improving the accuracy of matching voice assistance information.
The speech signal processing method in the embodiments of this application has been described above; the speech signal processing apparatus in the embodiments of this application is described below. Referring to FIG. 3, one embodiment of the speech signal processing apparatus in the embodiments of this application includes:
a recognition and extraction module 301, configured to acquire a speech signal to be processed, perform short speech segment recognition on it to obtain a target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain target audio features;
a matching and extraction module 302, configured to match the target short speech segment with preset short speech segments and extract classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise;
a first classification module 303, configured to classify the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
a judgment and determination module 304, configured to judge whether the target confidence is greater than a preset threshold and, if the target confidence is greater than the preset threshold, determine the initial recognition type as the target recognition type;
a filtering module 305, configured to filter the speech signal to be processed according to the target recognition type to obtain a target speech signal.
The functions of the modules in the above speech signal processing apparatus correspond to the steps in the method embodiments above; their functions and implementation processes are not repeated here.
In this embodiment of the application, a target classification label including interrogative tone, normal statement tone and/or false-alarm noise is obtained from the preset short speech segments according to the target short speech segment of the speech signal to be processed; the target audio features are classified through the target neural network model and the target classification label to obtain the initial recognition type and the target confidence; and the speech signal to be processed is filtered according to the target recognition type to obtain the target speech signal. Combining short speech segments with text output makes it possible to effectively judge and promptly identify the speaker's emotion and expressed content, as well as interrogative sentences and background noise, thereby improving the recognition accuracy of valid short speech.
Referring to FIG. 4, another embodiment of the speech signal processing apparatus in the embodiments of this application includes:
a recognition and extraction module 301, configured to acquire a speech signal to be processed, perform short speech segment recognition on it to obtain a target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain target audio features;
a matching and extraction module 302, configured to match the target short speech segment with preset short speech segments and extract classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise;
a first classification module 303, configured to classify the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
a judgment and determination module 304, configured to judge whether the target confidence is greater than a preset threshold and, if the target confidence is greater than the preset threshold, determine the initial recognition type as the target recognition type;
a filtering module 305, configured to filter the speech signal to be processed according to the target recognition type to obtain a target speech signal;
a matching module 306, configured to match corresponding voice assistance information from a preset database according to the target speech signal, the voice assistance information including business information, answer information and called assistant robot information corresponding to the target speech signal.
Optionally, the recognition and extraction module 301 may be specifically configured to:
collect the speech signal to be processed through a preset voice collector, and sequentially perform preprocessing, speech segment recognition and segmentation, and speech-to-text conversion on the speech signal to be processed to obtain speech segments and the initial text information corresponding to the speech segments;
identify monosyllabic words in the initial text information to obtain target text information, and determine the speech segment corresponding to the target text information as the target short speech segment;
perform frame-level speech extraction on the target short speech segment according to a preset frame length and inter-frame overlap to obtain frame speech segments, and perform audio feature extraction on the frame speech segments to obtain the target audio features.
Optionally, the matching and extraction module 302 may be specifically configured to:
calculate the short-term energy similarity and the audio feature similarity between the target speech segment and the preset short speech segments;
weight and sum the short-term energy similarity and the audio feature similarity to obtain a target similarity;
obtain, from the preset short speech segments, the target short speech segments whose target similarity is greater than a preset similarity, and extract the classification labels of the target short speech segments through a preset label extraction algorithm to obtain the target classification label.
Optionally, the first classification module 303 may be specifically configured to:
sequentially perform audio-emphasis feature matrix computation and feature fusion on the target audio features through the attention mechanism layer in the preset target neural network model to obtain a fusion feature matrix, the target neural network model including an attention mechanism layer and multi-layer fully connected layers;
perform multi-level classification and probability value calculation on the fusion feature matrix through the multi-layer fully connected layers and the target classification label to obtain the initial recognition type and the target confidence corresponding to the initial recognition type.
Optionally, the speech signal processing apparatus further includes:
a feature extraction module 307, configured to acquire type-annotated short speech segment training samples and perform frame audio feature extraction on the training samples to obtain audio feature samples, the short speech segment training samples including label information for interrogative tone, normal statement tone and false-alarm noise;
a second classification module 308, configured to classify the audio feature samples into a training set and a validation set through a preset ten-fold cross-validation algorithm;
a training and validation module 309, configured to train a preset initial neural network model on the training set to obtain a candidate neural network model, and to validate the candidate neural network model on the validation set to obtain a validation result;
an updating module 310, configured to iteratively update the candidate neural network model through a preset loss function, an optimizer and the validation result to obtain the target neural network model.
Optionally, the updating module 310 may be specifically configured to:
obtain a first error value between the validation result and the label information, and compute a second error value of the candidate neural network model through the preset loss function;
determine a target error value according to the first error value and the second error value;
iteratively update the model parameters and/or network structure of the candidate neural network model through the optimizer until the target error value is less than a preset error value, obtaining the target neural network model.
The functions of the modules and units in the above speech signal processing apparatus correspond to the steps in the method embodiments above; their functions and implementation processes are not repeated here.
In this embodiment of the application, not only are short speech segments and text output combined so that the speaker's emotion and expressed content, as well as interrogative sentences and background noise, can be effectively judged and promptly identified, improving the recognition accuracy of valid short speech; the corresponding voice assistance information is also matched from the preset database according to the target speech signal, improving the accuracy of matching voice assistance information.
FIGS. 3 and 4 above describe the speech signal processing apparatus in the embodiments of this application in detail from the perspective of modular functional entities; the speech signal processing device in the embodiments of this application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a speech signal processing device provided by an embodiment of this application. The speech signal processing device 500 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPUs) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the speech signal processing device 500. Furthermore, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the speech signal processing device 500.
The speech signal processing device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art can understand that the device structure shown in FIG. 5 does not constitute a limitation on the speech signal processing device, which may include more or fewer components than shown, or combine certain components, or use a different component arrangement.
This application also provides a speech signal processing device, comprising: a memory and at least one processor, the memory storing instructions, the memory and the at least one processor being interconnected through a line; the at least one processor invokes the instructions in the memory, so that the speech signal processing device executes the steps of the above speech signal processing method.
This application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
acquiring a speech signal to be processed, performing short speech segment recognition on the speech signal to be processed to obtain a target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain target audio features;
matching the target short speech segment with preset short speech segments and extracting classification labels in sequence to obtain a target classification label, the target classification label including interrogative tone, normal statement tone and/or false-alarm noise;
classifying the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
judging whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determining the initial recognition type as a target recognition type;
filtering the speech signal to be processed according to the target recognition type to obtain a target speech signal.
Further, the computer-readable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required by at least one function, and the like, and the stored data area may store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and so on.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (20)

  1. A speech signal processing method, wherein the speech signal processing method comprises:
    acquiring a speech signal to be processed, performing short speech segment recognition on the speech signal to be processed to obtain a target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain target audio features;
    matching the target short speech segment with preset short speech segments and extracting classification labels in sequence to obtain a target classification label, the target classification label comprising interrogative tone, normal statement tone and/or false-alarm noise;
    classifying the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
    judging whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determining the initial recognition type as a target recognition type;
    filtering the speech signal to be processed according to the target recognition type to obtain a target speech signal.
  2. The speech signal processing method according to claim 1, wherein acquiring the speech signal to be processed, performing short speech segment recognition on the speech signal to be processed to obtain the target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain the target audio features comprises:
    collecting the speech signal to be processed through a preset voice collector, and sequentially performing preprocessing, speech segment recognition and segmentation, and speech-to-text conversion on the speech signal to be processed to obtain speech segments and initial text information corresponding to the speech segments;
    identifying monosyllabic words in the initial text information to obtain target text information, and determining the speech segment corresponding to the target text information as the target short speech segment;
    performing frame-level speech extraction on the target short speech segment according to a preset frame length and inter-frame overlap to obtain frame speech segments, and performing audio feature extraction on the frame speech segments to obtain the target audio features.
  3. The speech signal processing method according to claim 1, wherein matching the target short speech segment with the preset short speech segments and extracting classification labels in sequence to obtain the target classification label comprises:
    calculating the short-term energy similarity and the audio feature similarity between the target speech segment and the preset short speech segments;
    weighting and summing the short-term energy similarity and the audio feature similarity to obtain a target similarity;
    obtaining, from the preset short speech segments, the target short speech segments whose target similarity is greater than a preset similarity, and extracting the classification labels of the target short speech segments through a preset label extraction algorithm to obtain the target classification label.
  4. The speech signal processing method according to claim 1, wherein classifying the target audio features through the preset target neural network model and the target classification label to obtain the initial recognition type and the target confidence corresponding to the initial recognition type comprises:
    sequentially performing audio-emphasis feature matrix computation and feature fusion on the target audio features through the attention mechanism layer in the preset target neural network model to obtain a fusion feature matrix, the target neural network model comprising an attention mechanism layer and multi-layer fully connected layers;
    performing multi-level classification and probability value calculation on the fusion feature matrix through the multi-layer fully connected layers and the target classification label to obtain the initial recognition type and the target confidence corresponding to the initial recognition type.
  5. The speech signal processing method according to claim 1, wherein before acquiring the speech signal to be processed, performing short speech segment recognition on it to obtain the target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain the target audio features, the method further comprises:
    acquiring type-annotated short speech segment training samples, and performing frame audio feature extraction on the short speech segment training samples to obtain audio feature samples, the short speech segment training samples comprising label information for interrogative tone, normal statement tone and false-alarm noise;
    classifying the audio feature samples into a training set and a validation set through a preset ten-fold cross-validation algorithm;
    training a preset initial neural network model on the training set to obtain a candidate neural network model, and validating the candidate neural network model on the validation set to obtain a validation result;
    iteratively updating the candidate neural network model through a preset loss function, an optimizer and the validation result to obtain the target neural network model.
  6. 根据权利要求5所述的语音信号的处理方法,其中,所述通过预置的损失函数、优化器和所述验证结果,对所述候选神经网络模型进行迭代更新,得到目标神经网络模型,包括:
    获取所述验证结果与所述标签信息之间的第一误差值,并通过预置的损失函数计算所述候选神经网络模型的第二误差值;
    根据所述第一误差值和所述第二误差值确定目标误差值;
    通过所述优化器,对所述候选神经网络模型的模型参数和/或网络结构进行迭代更新,直至所述目标误差值小于预设误差值,得到目标神经网络模型。
  7. The speech signal processing method according to any one of claims 1-6, wherein after filtering the speech signal to be processed according to the target recognition type to obtain the target speech signal, the method further comprises:
    matching corresponding speech auxiliary information from a preset database according to the target speech signal, the speech auxiliary information comprising business information, answer information and invoked auxiliary robot information corresponding to the target speech signal.
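In the simplest reading, the database matching of claim 7 reduces to a keyed lookup. The sketch below uses an in-memory dictionary whose entries are entirely invented; the real preset database and its schema are outside the application's text.

    # Invented preset database: target recognition type -> auxiliary information.
    PRESET_DB = {
        "interrogative": {"business": "account inquiry",
                          "answer": "scripted FAQ reply",
                          "robot": "qa-assistant"},
        "declarative": {"business": "order confirmation",
                        "answer": "acknowledgement",
                        "robot": "crm-assistant"},
    }

    def match_auxiliary_info(target_type):
        # Returns business info, answer info and the auxiliary robot to invoke,
        # or None when the database has no matching entry.
        return PRESET_DB.get(target_type)

    print(match_auxiliary_info("interrogative"))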
  8. A speech signal processing device, comprising a memory, a processor and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
    acquiring a speech signal to be processed, performing short-speech-segment recognition on the speech signal to be processed to obtain a target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain target audio features;
    sequentially matching the target short speech segment against preset short speech segments and extracting classification labels to obtain a target classification label, the target classification label comprising an interrogative tone, a normal declarative tone and/or false-alarm noise;
    classifying the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
    judging whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determining the initial recognition type as a target recognition type;
    filtering the speech signal to be processed according to the target recognition type to obtain a target speech signal.
  9. The speech signal processing device according to claim 8, wherein the processor further implements the following steps when executing the computer-readable instructions:
    collecting the speech signal to be processed through a preset speech collector, and sequentially performing preprocessing, speech segment recognition and segmentation, and speech-to-text conversion on the speech signal to be processed to obtain speech segments and initial text information corresponding to the speech segments;
    recognizing monosyllabic words in the initial text information to obtain target text information, and determining the speech segment corresponding to the target text information as the target short speech segment;
    performing frame speech extraction on the target short speech segment according to a preset frame length and a preset inter-frame overlap to obtain frame speech segments, and performing audio feature extraction on the frame speech segments to obtain the target audio features.
  10. The speech signal processing device according to claim 8, wherein the processor further implements the following steps when executing the computer-readable instructions:
    calculating a short-time energy similarity and an audio feature similarity between the target short speech segment and the preset short speech segments;
    performing weighted summation on the short-time energy similarity and the audio feature similarity to obtain a target similarity;
    acquiring, from the preset short speech segments, matched segments whose target similarity is greater than a preset similarity, and extracting classification labels of the matched segments through a preset label extraction algorithm to obtain the target classification label.
  11. The speech signal processing device according to claim 8, wherein the processor further implements the following steps when executing the computer-readable instructions:
    sequentially performing audio-weighted feature matrix calculation and feature fusion on the target audio features through an attention mechanism layer in the preset target neural network model to obtain a fused feature matrix, the target neural network model comprising the attention mechanism layer and multiple fully connected layers;
    performing multi-level classification and probability calculation on the fused feature matrix through the multiple fully connected layers and the target classification label to obtain the initial recognition type and the target confidence corresponding to the initial recognition type.
  12. The speech signal processing device according to claim 8, wherein the processor further implements the following steps when executing the computer-readable instructions:
    acquiring type-annotated short-speech-segment training samples, and performing frame audio feature extraction on the training samples to obtain audio feature samples, the training samples comprising label information of the interrogative tone, the normal declarative tone and the false-alarm noise;
    dividing the audio feature samples into a training set and a validation set through a preset ten-fold cross-validation algorithm;
    training a preset initial neural network model with the training set to obtain a candidate neural network model, and validating the candidate neural network model with the validation set to obtain a validation result;
    iteratively updating the candidate neural network model through a preset loss function, an optimizer and the validation result to obtain the target neural network model.
  13. The speech signal processing device according to claim 12, wherein the processor further implements the following steps when executing the computer-readable instructions:
    acquiring a first error value between the validation result and the label information, and calculating a second error value of the candidate neural network model through the preset loss function;
    determining a target error value according to the first error value and the second error value;
    iteratively updating model parameters and/or a network structure of the candidate neural network model through the optimizer until the target error value is less than a preset error value, to obtain the target neural network model.
  14. The speech signal processing device according to any one of claims 8-13, wherein the processor further implements the following step when executing the computer-readable instructions:
    matching corresponding speech auxiliary information from a preset database according to the target speech signal, the speech auxiliary information comprising business information, answer information and invoked auxiliary robot information corresponding to the target speech signal.
  15. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring a speech signal to be processed, performing short-speech-segment recognition on the speech signal to be processed to obtain a target short speech segment, and performing frame audio feature extraction on the target short speech segment to obtain target audio features;
    sequentially matching the target short speech segment against preset short speech segments and extracting classification labels to obtain a target classification label, the target classification label comprising an interrogative tone, a normal declarative tone and/or false-alarm noise;
    classifying the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
    judging whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determining the initial recognition type as a target recognition type;
    filtering the speech signal to be processed according to the target recognition type to obtain a target speech signal.
  16. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    collecting the speech signal to be processed through a preset speech collector, and sequentially performing preprocessing, speech segment recognition and segmentation, and speech-to-text conversion on the speech signal to be processed to obtain speech segments and initial text information corresponding to the speech segments;
    recognizing monosyllabic words in the initial text information to obtain target text information, and determining the speech segment corresponding to the target text information as the target short speech segment;
    performing frame speech extraction on the target short speech segment according to a preset frame length and a preset inter-frame overlap to obtain frame speech segments, and performing audio feature extraction on the frame speech segments to obtain the target audio features.
  17. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    calculating a short-time energy similarity and an audio feature similarity between the target short speech segment and the preset short speech segments;
    performing weighted summation on the short-time energy similarity and the audio feature similarity to obtain a target similarity;
    acquiring, from the preset short speech segments, matched segments whose target similarity is greater than a preset similarity, and extracting classification labels of the matched segments through a preset label extraction algorithm to obtain the target classification label.
  18. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    sequentially performing audio-weighted feature matrix calculation and feature fusion on the target audio features through an attention mechanism layer in the preset target neural network model to obtain a fused feature matrix, the target neural network model comprising the attention mechanism layer and multiple fully connected layers;
    performing multi-level classification and probability calculation on the fused feature matrix through the multiple fully connected layers and the target classification label to obtain the initial recognition type and the target confidence corresponding to the initial recognition type.
  19. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    acquiring type-annotated short-speech-segment training samples, and performing frame audio feature extraction on the training samples to obtain audio feature samples, the training samples comprising label information of the interrogative tone, the normal declarative tone and the false-alarm noise;
    dividing the audio feature samples into a training set and a validation set through a preset ten-fold cross-validation algorithm;
    training a preset initial neural network model with the training set to obtain a candidate neural network model, and validating the candidate neural network model with the validation set to obtain a validation result;
    iteratively updating the candidate neural network model through a preset loss function, an optimizer and the validation result to obtain the target neural network model.
  20. A speech signal processing apparatus, wherein the speech signal processing apparatus comprises:
    a recognition and extraction module configured to acquire a speech signal to be processed, perform short-speech-segment recognition on the speech signal to be processed to obtain a target short speech segment, and perform frame audio feature extraction on the target short speech segment to obtain target audio features;
    a matching and extraction module configured to sequentially match the target short speech segment against preset short speech segments and extract classification labels to obtain a target classification label, the target classification label comprising an interrogative tone, a normal declarative tone and/or false-alarm noise;
    a first classification module configured to classify the target audio features through a preset target neural network model and the target classification label to obtain an initial recognition type and a target confidence corresponding to the initial recognition type;
    a judgment and determination module configured to judge whether the target confidence is greater than a preset threshold, and if the target confidence is greater than the preset threshold, determine the initial recognition type as a target recognition type;
    a filtering module configured to filter the speech signal to be processed according to the target recognition type to obtain a target speech signal.
PCT/CN2021/126111 2020-12-23 2021-10-25 Speech signal processing method, apparatus, device and storage medium WO2022134833A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011545242.0 2020-12-23
CN202011545242.0A CN112735383A (zh) 2020-12-23 2020-12-23 Speech signal processing method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2022134833A1 (zh)

Family

ID=75605032

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126111 WO2022134833A1 (zh) 2020-12-23 2021-10-25 Speech signal processing method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN112735383A (zh)
WO (1) WO2022134833A1 (zh)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735383A (zh) 2020-12-23 2021-04-30 深圳壹账通智能科技有限公司 Speech signal processing method, apparatus, device and storage medium
CN113220933A (zh) 2021-05-12 2021-08-06 北京百度网讯科技有限公司 Method and apparatus for classifying audio segments, and electronic device
CN113592262B (zh) 2021-07-16 2022-10-21 深圳昌恩智能股份有限公司 Safety monitoring method and system for online ride-hailing vehicles
CN113436634B (zh) 2021-07-30 2023-06-20 中国平安人寿保险股份有限公司 Voiceprint-recognition-based speech classification method and apparatus, and related device


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231278A (zh) * 2011-06-10 2011-11-02 安徽科大讯飞信息科技股份有限公司 Method and system for automatically adding punctuation in speech recognition
CN105427858A (zh) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for automatic speech classification
CN105654942A (zh) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Statistical-parameter-based speech synthesis method for interrogative and exclamatory sentences
CN106710588A (zh) * 2016-12-20 2017-05-24 科大讯飞股份有限公司 Method, apparatus and system for sentence-type recognition of speech data
CN111028827A (zh) * 2019-12-10 2020-04-17 深圳追一科技有限公司 Emotion-recognition-based interaction processing method, apparatus, device and storage medium
CN111681653A (zh) * 2020-04-28 2020-09-18 平安科技(深圳)有限公司 Call control method and apparatus, computer device, and storage medium
CN112735383A (zh) * 2020-12-23 2021-04-30 深圳壹账通智能科技有限公司 Speech signal processing method, apparatus, device and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033734A (zh) * 2022-08-11 2022-09-09 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, computer device, and storage medium
CN115033734B (zh) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, computer device, and storage medium
CN115062678A (zh) * 2022-08-19 2022-09-16 山东能源数智云科技有限公司 Training method for a device fault detection model, and fault detection method and apparatus
CN115631743A (zh) * 2022-12-07 2023-01-20 中诚华隆计算机技术有限公司 High-precision speech recognition method and system based on a speech chip
CN115631743B (zh) * 2022-12-07 2023-03-21 中诚华隆计算机技术有限公司 High-precision speech recognition method and system based on a speech chip
CN115631448A (zh) * 2022-12-19 2023-01-20 广州佰锐网络科技有限公司 Audio and video quality inspection processing method and system
CN115631448B (zh) * 2022-12-19 2023-04-04 广州佰锐网络科技有限公司 Audio and video quality inspection processing method and system
CN117061788A (zh) * 2023-10-08 2023-11-14 中国地质大学(武汉) Automated short-video supervision and early-warning method, device, and storage device
CN117061788B (zh) * 2023-10-08 2023-12-19 中国地质大学(武汉) Automated short-video supervision and early-warning method, device, and storage device
CN117935787A (zh) * 2024-03-22 2024-04-26 摩尔线程智能科技(北京)有限责任公司 Data screening and annotation method and apparatus, electronic device, and storage medium
CN117935787B (zh) * 2024-03-22 2024-05-31 摩尔线程智能科技(北京)有限责任公司 Data screening and annotation method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN112735383A (zh) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022134833A1 (zh) Speech signal processing method, apparatus, device and storage medium
WO2021208719A1 (zh) Speech-based emotion recognition method, apparatus, device and storage medium
WO2020182153A1 (zh) Method for speech recognition based on adaptive language, and related apparatus
WO2021174757A1 (zh) Speech emotion recognition method and apparatus, electronic device, and computer-readable storage medium
CN105931644B (zh) Speech recognition method and mobile terminal
CN109461446B (zh) Method, apparatus, system and storage medium for recognizing a user's target request
CN103514170B (zh) Text classification method and apparatus for speech recognition
WO2021103712A1 (zh) Neural-network-based speech keyword detection method, apparatus and system
WO2016119604A1 (zh) Voice information search method, apparatus and server
JP5017534B2 (ja) Drinking state determination device and drinking state determination method
WO2022134798A1 (zh) Natural-language-based sentence segmentation method, apparatus, device and storage medium
CN116110405B (zh) Semi-supervised-learning-based speaker recognition method and device for ground-air communication
CN112151015A (zh) Keyword detection method and apparatus, electronic device, and storage medium
WO2022134834A1 (zh) Potential event prediction method, apparatus, device and storage medium
JP2004094257A (ja) Method and apparatus for generating decision-tree questions for speech processing
CN112466284B (zh) Mask speech identification method
WO2020238681A1 (zh) Audio processing method and device, and human-machine interaction system
CN116050419B (zh) Unsupervised recognition method and system for knowledge entities in scientific literature
CN111145761B (zh) Model training method, voiceprint verification method, system, device and medium
CN111091809B (zh) Regional accent recognition method and apparatus based on deep feature fusion
CN117115581A (zh) Intelligent misoperation early-warning method and system based on multimodal deep learning
CN116978367A (zh) Speech recognition method and apparatus, electronic device, and storage medium
CN112037772B (zh) Multimodality-based response obligation detection method, system and apparatus
Cai et al. Deep speaker embeddings with convolutional neural network on supervector for text-independent speaker recognition
CN113470652A (zh) Speech recognition and processing method based on the industrial internet

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908834

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.11.2023)