CN116543753A - Speech recognition method, speech recognition device, electronic apparatus, and storage medium

Speech recognition method, speech recognition device, electronic apparatus, and storage medium

Info

Publication number: CN116543753A
Application number: CN202310636139.4A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: target, voice, word, sentence, candidate sentence
Legal status: Pending
Inventors: 赵梦原, 王健宗, 程宁
Original and current assignee: Ping An Technology Shenzhen Co Ltd (application filed by Ping An Technology Shenzhen Co Ltd)

Classifications

    • G — PHYSICS; G10 — MUSICAL INSTRUMENTS; ACOUSTICS; G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/10 — Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/142 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/24 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00–G10L 21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice recognition method, a voice recognition device, an electronic device and a storage medium, and belongs to the technical field of financial technology. The method comprises the following steps: acquiring target voice data; extracting acoustic features of the target voice data to obtain target acoustic features; decoding the target voice data based on a decoding model to obtain a target word graph, wherein the target word graph comprises word nodes and a voice feature sequence, and the voice feature sequence comprises at least two voice words; constructing a plurality of candidate sentence paths based on the word nodes and the voice feature sequence; performing preliminary scoring on the voice words of each candidate sentence path to obtain preliminary word scores; performing scoring correction on the preliminary word scores based on the target acoustic features to obtain target word scores; and screening the candidate sentence paths according to the target word scores to obtain a target sentence path, and splicing the voice words of the target sentence path to obtain target sentence data. The method and the device can improve the accuracy of voice recognition.

Description

Speech recognition method, speech recognition device, electronic apparatus, and storage medium
Technical Field
The present disclosure relates to the technical field of financial technology, and in particular, to a voice recognition method, a voice recognition device, an electronic device, and a storage medium.
Background
With the development of network, communication and computer technologies, enterprises are becoming increasingly electronic, remote, virtual and networked, and more and more online enterprises are emerging. Communication and dialogue between clients and enterprises have also evolved from face-to-face consultation and interaction to exchanges based on remote means such as the network and the telephone. Against this background, intelligent voice interaction is widely applied in fields such as finance, logistics and customer service.
For example, financial transaction platforms based on voice interaction face a large volume of telephone voice services every day and handle diversified customer service needs, including pre-sales consultation, purchase, after-sales service, complaints and the like. During telephone service, the intelligent customer service robot needs to deal with different service objects and react appropriately. If the intelligent customer service cannot accurately identify the appeal expressed by a service object in the voice data during the dialogue, the service response fed back based on the voice data cannot meet the object's needs, which affects the service quality and object satisfaction.
Most current voice recognition methods rely on neural network models, and the recognition process of a common neural network model is complex and prone to partial semantic recognition errors, which affects the accuracy of voice recognition. How to improve the accuracy of voice recognition has therefore become an urgent technical problem to be solved.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a voice recognition method, a voice recognition device, an electronic device, and a storage medium, which aim to improve the accuracy of voice recognition.
To achieve the above object, a first aspect of an embodiment of the present application proposes a speech recognition method, the method including:
acquiring target voice data;
extracting acoustic features of the target voice data to obtain target acoustic features;
decoding the target voice data based on a preset decoding model to obtain a target word graph; the target word graph comprises word nodes and a voice feature sequence, wherein the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes;
constructing a plurality of candidate sentence paths based on word nodes of the target word graph and the voice feature sequence;
performing preliminary scoring on the voice words in each candidate sentence path to obtain preliminary word scores;
performing scoring correction on the preliminary word score based on the target acoustic features to obtain a target word score;
and screening the candidate sentence paths according to the target word score to obtain a target sentence path, and splicing the voice words in the target sentence path to obtain target sentence data.
In some embodiments, the extracting the acoustic feature of the target voice data to obtain a target acoustic feature includes:
inputting the target voice data into a preset acoustic model, wherein the acoustic model comprises a time domain convolution layer and a full connection layer;
performing feature extraction on the target voice data based on the time domain convolution layer to obtain a preliminary acoustic feature;
and carrying out feature screening on the preliminary acoustic features based on the full connection layer to obtain the target acoustic features.
In some embodiments, the word nodes include a start node, an end node, and a plurality of intermediate nodes, and the constructing a plurality of candidate sentence paths based on the word nodes of the target word graph and the speech feature sequence includes:
calculating a first target weight between the start node and the intermediate node according to a preset algorithm;
calculating a second target weight between the end node and the intermediate node according to the preset algorithm;
calculating a third target weight between each intermediate node and other intermediate nodes according to the preset algorithm;
and traversing the voice word of each intermediate node according to the first target weight, the second target weight and the third target weight to obtain a plurality of candidate sentence paths.
In some embodiments, the performing scoring correction on the preliminary word score based on the target acoustic feature to obtain a target word score includes:
scoring the voice words in each candidate sentence path again to obtain intermediate word scores;
and performing scoring correction on the preliminary word scores according to a preset formula, the target acoustic features and the intermediate word scores to obtain the target word scores.
In some embodiments, the filtering the candidate sentence paths according to the target word score to obtain a target sentence path, and splicing the speech words in the target sentence path to obtain target sentence data, including:
summing the target word scores of each candidate sentence path to obtain candidate sentence scores of the candidate sentence paths;
screening the candidate sentence paths according to the candidate sentence scores to obtain the target sentence paths;
and splicing the voice words in the target sentence path according to a preset sentence template to obtain the target sentence data.
In some embodiments, the filtering the candidate sentence path according to the candidate sentence score to obtain the target sentence path includes:
comparing the candidate sentence scores of all the candidate sentence paths;
and taking the candidate sentence path with the smallest candidate sentence score as the target sentence path.
In some embodiments, the acquiring the target voice data includes:
acquiring original voice data;
performing framing processing on the original voice data to obtain initial voice data;
and performing frequency spectrum transformation on the initial voice data to obtain the target voice data, wherein the target voice data is a Mel cepstrum.
To achieve the above object, a second aspect of the embodiments of the present application proposes a speech recognition apparatus, the apparatus comprising:
the data acquisition module is used for acquiring target voice data;
the feature extraction module is used for extracting acoustic features of the target voice data to obtain target acoustic features;
the decoding module is used for decoding the target voice data based on a preset decoding model to obtain a target word graph; the target word graph comprises word nodes and a voice feature sequence, wherein the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes;
the path construction module is used for constructing a plurality of candidate sentence paths based on the word nodes of the target word graph and the voice feature sequence;
the preliminary scoring module is used for carrying out preliminary scoring on the voice words in each candidate sentence path to obtain preliminary word scores;
the scoring correction module is used for scoring correction of the preliminary word scores based on the target acoustic features to obtain target word scores;
and the screening module is used for screening the candidate sentence paths according to the target word score to obtain a target sentence path, and splicing the voice words in the target sentence path to obtain target sentence data.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory, a processor, where the memory stores a computer program, and the processor implements the method described in the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of the first aspect.
According to the voice recognition method, the voice recognition device, the electronic device and the storage medium, target voice data are acquired, and acoustic features are extracted from the target voice data to obtain target acoustic features, so that the acoustic characteristics of the target voice data can be conveniently identified. Further, the target voice data are decoded based on a preset decoding model to obtain a target word graph; the target word graph comprises word nodes and a voice feature sequence, the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes, so that the target voice data can be decoded into graph form, the decoded content is displayed intuitively, and subsequent voice recognition is facilitated. Further, a plurality of candidate sentence paths are constructed based on the word nodes and the voice feature sequence of the target word graph, so that the voice words can be combined according to the connection relations between the word nodes to form the candidate sentence paths. The voice words in each candidate sentence path are preliminarily scored to obtain preliminary word scores, and the preliminary word scores are corrected based on the target acoustic features to obtain target word scores; by taking the candidate sentence path as a batch unit and scoring the voice words belonging to the same candidate sentence path together, the scoring efficiency and hence the voice recognition efficiency are improved. Finally, the candidate sentence paths are screened according to the target word scores to obtain a target sentence path, and the voice words in the target sentence path are spliced to obtain target sentence data; the candidate sentence paths are thus screened in a quantified, score-based manner, and the target sentence data corresponding to the target voice data are determined based on the target sentence path, which better improves the accuracy of voice recognition. As a result, the intelligent customer service robot can more accurately recognize the appeal expressed in the voice data of a service object during a dialogue, provide targeted responses and service feedback, effectively improve the dialogue quality and effectiveness in financial transactions, realize intelligent voice dialogue services, and improve the service quality, customer satisfaction and business success rate.
Drawings
FIG. 1 is a flow chart of a speech recognition method provided by an embodiment of the present application;
fig. 2 is a flowchart of step S101 in fig. 1;
fig. 3 is a flowchart of step S102 in fig. 1;
fig. 4 is a flowchart of step S104 in fig. 1;
fig. 5 is a flowchart of step S106 in fig. 1;
fig. 6 is a flowchart of step S107 in fig. 1;
fig. 7 is a flowchart of step S602 in fig. 6;
fig. 8 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application;
fig. 9 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional blocks are divided in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the block division in the device or the order in the flowchart. The terms "first", "second" and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms referred to in this application are explained:
Artificial intelligence (artificial intelligence, AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Natural language processing (natural language processing, NLP): NLP is a branch of artificial intelligence at the intersection of computer science and linguistics, often referred to as computational linguistics, concerned with processing, understanding and applying human languages (e.g., Chinese, English). Natural language processing includes syntactic analysis, semantic analysis, discourse understanding and the like. It is commonly used in technical fields such as machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research and linguistic research related to language computation.
Information extraction (Information Extraction, IE): a text processing technique that extracts factual information of specified types, such as entities, relations and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is made up of specific units such as sentences, paragraphs and chapters, and text information is made up of smaller specific units such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names and the like from text data is text information extraction, and the information extracted by text information extraction technology can be of various types.
Decoding (Decoder): converts a previously generated fixed vector into an output sequence, where the input sequence can be text, speech, images or video, and the output sequence may be text or images.
Hidden Markov model (Hidden Markov Model, HMM): a statistical model used to describe a Markov process with hidden, unknown parameters. The difficulty is to determine the hidden parameters of the process from the observable parameters; these parameters are then used for further analysis, such as pattern recognition. The HMM is the structurally simplest dynamic Bayesian network; it is a directed graph model mainly used for modeling time-series data.
Fourier transform: represents a function satisfying certain conditions as a linear combination, or an integral, of trigonometric functions (sine and/or cosine functions). In different research fields, the Fourier transform has many variants, such as the continuous Fourier transform and the discrete Fourier transform.
Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC): a set of key coefficients used to construct the mel-frequency cepstrum. From a segment of a music signal, a set of cepstra sufficient to represent the signal is obtained, and the mel-frequency cepstral coefficients are the coefficients derived from this cepstrum. Unlike the general cepstrum, the most distinctive feature of the mel-frequency cepstrum is that its frequency bands are uniformly distributed on the mel scale, which is closer to the nonlinear human auditory system than the general, linearly spaced cepstrum representation. For example, in audio compression techniques, the mel-frequency cepstrum is often used for processing.
With the development of network, communication and computer technologies, enterprises are becoming increasingly electronic, remote, virtual and networked, and more and more online enterprises are emerging. Communication and dialogue between clients and enterprises have also evolved from face-to-face consultation and interaction to exchanges based on remote means such as the network and the telephone. Against this background, intelligent voice interaction is widely applied in fields such as finance, logistics and customer service.
Financial transaction platforms based on voice interaction face a large volume of telephone voice services every day and handle diversified customer service needs, including pre-sales consultation, purchase, after-sales service, complaints and the like. During telephone service, the intelligent customer service robot needs to deal with different service objects and react appropriately. If the intelligent customer service cannot accurately identify the appeal expressed by a service object in the voice data during the dialogue, the service response fed back based on the voice data cannot meet the object's needs, which affects the service quality and object satisfaction.
For example, in the process of recommending products through a virtual character, when a service object has a question and needs consultation and communication, the current virtual character can only search for an answer among preset options, and therefore cannot accurately identify the appeal expressed by the service object in the voice data, resulting in answers that miss the point and low accuracy of the virtual character's voice interaction.
Most current voice recognition methods rely on neural network models, and the recognition process of a common neural network model is complex and prone to partial semantic recognition errors, which affects the accuracy of voice recognition. How to improve the accuracy of voice recognition has therefore become an urgent technical problem to be solved.
Based on this, the embodiment of the application provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium, aiming at improving the accuracy of voice recognition.
The voice recognition method, the voice recognition device, the electronic apparatus and the storage medium provided in the embodiments of the present application are specifically described through the following embodiments, and the voice recognition method in the embodiments of the present application is first described.
The embodiments of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a voice recognition method, and relates to the technical field of artificial intelligence. The voice recognition method provided by the embodiment of the application can be applied to a terminal, to a server side, or to software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer or the like; the server side may be configured as an independent physical server, as a server cluster or distributed system composed of a plurality of physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms; the software may be an application that implements the voice recognition method, but is not limited to the above forms.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to data related to user identity or characteristics, such as user information, user behavior data, user voice data, user history data, and user location information, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
Fig. 1 is an optional flowchart of a voice recognition method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, acquiring target voice data;
step S102, extracting acoustic features of target voice data to obtain target acoustic features;
step S103, decoding processing is carried out on the target voice data based on a preset decoding model, and a target word graph is obtained; the target word graph comprises word nodes and a voice feature sequence, wherein the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes;
step S104, constructing a plurality of candidate sentence paths based on word nodes and voice feature sequences of the target word graph;
step S105, performing preliminary scoring on the voice words in each candidate sentence path to obtain preliminary word scores;
step S106, scoring and correcting the score of the preliminary word based on the target acoustic characteristics to obtain the score of the target word;
step S107, screening the candidate sentence paths according to the target word score to obtain target sentence paths, and splicing the voice words in the target sentence paths to obtain target sentence data.
In steps S101 to S107 illustrated in this embodiment of the present application, target voice data are acquired, and acoustic features are extracted from the target voice data to obtain target acoustic features, so that the acoustic characteristics of the target voice data can be conveniently identified. Further, the target voice data are decoded based on a preset decoding model to obtain a target word graph; the target word graph comprises word nodes and a voice feature sequence, the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes, so that the target voice data can be decoded into graph form, the decoded content is displayed intuitively, and subsequent voice recognition is facilitated. Further, a plurality of candidate sentence paths are constructed based on the word nodes and the voice feature sequence of the target word graph, so that the voice words can be combined according to the connection relations between the word nodes to form the candidate sentence paths. The voice words in each candidate sentence path are preliminarily scored to obtain preliminary word scores, and the preliminary word scores are corrected based on the target acoustic features to obtain target word scores; by taking the candidate sentence path as a batch unit and scoring the voice words belonging to the same candidate sentence path together, the scoring efficiency and hence the voice recognition efficiency are improved. Finally, the candidate sentence paths are screened according to the target word scores to obtain a target sentence path, and the voice words in the target sentence path are spliced to obtain target sentence data; the candidate sentence paths are thus screened in a quantified, score-based manner, and the target sentence data corresponding to the target voice data are determined based on the target sentence path, which better improves the accuracy of voice recognition.
Referring to fig. 2, in some embodiments, step S101 may include, but is not limited to, steps S201 to S203:
step S201, obtaining original voice data;
step S202, performing framing processing on the original voice data to obtain initial voice data;
step S203, performing frequency spectrum transformation on the initial voice data to obtain target voice data, wherein the target voice data is a Mel cepstrum.
In step S201 of some embodiments, data may be crawled in a targeted manner after a web crawler is written and a data source is set, so as to obtain the original voice data. The data source may be various types of network platforms, social media, or some specific audio databases, and the original voice data may be voice material of speaking objects, such as lecture reports, chat dialogues and the like. The original voice data may also be acquired by other means, which is not limited here.
For example, in a financial transaction scenario, the original voice data is audio data containing conversations commonly used in the financial field; in an insurance promotion scenario, the original voice data is audio data containing descriptions of the risk, cost, applicable population and the like of a certain insurance product.
In step S202 of some embodiments, the original voice data is subjected to framing processing, so as to obtain initial voice data. Specifically, signal framing and windowing processing is performed on original voice data to obtain multi-frame voice fragments, short-time Fourier transformation is performed on the voice fragments of each frame, time domain features of the voice fragments are converted into frequency domain features, finally, stacking processing is performed on the frequency domain features of each frame in a time dimension to obtain a target spectrogram, and the target spectrogram is used as initial voice data.
In step S203 of some embodiments, the initial voice data is filtered through a preset mel filter bank and subjected to a logarithmic operation to obtain a target logarithmic spectrum, and an inverse Fourier transform is then performed on the target logarithmic spectrum to obtain a target mel cepstrum. Further, feature extraction is performed on the target mel cepstrum to obtain target mel-frequency cepstral coefficients, and the target mel-frequency cepstral coefficients are used as the target voice data.
Through the steps S201 to S203, the original voice data can be converted into the spectrum features more conveniently, and the spectrum features are filtered to obtain the target mel-frequency cepstrum coefficient (i.e. the target voice data), so that the voice content of the target voice data can be identified through the spectrum features, thereby improving the accuracy of voice identification.
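As an illustration of steps S201 to S203, the following sketch derives mel-frequency cepstral coefficients from an audio file. It is not the implementation of this application; the frame length, hop size, number of mel bands and the use of the librosa library are assumptions made for the example.

    import librosa
    import numpy as np
    from scipy.fftpack import dct

    def extract_target_voice_data(path, n_mfcc=13):
        # Step S201: acquire the original voice data.
        y, sr = librosa.load(path, sr=16000)
        # Step S202: framing and windowing via a short-time Fourier transform
        # (a 400-sample frame and 160-sample hop are illustrative values).
        power_spec = np.abs(librosa.stft(y, n_fft=400, hop_length=160, window="hann")) ** 2
        # Step S203: mel filtering, logarithmic operation and a DCT
        # (the discrete form of the inverse transform) to obtain the mel cepstrum.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=400, n_mels=40)
        log_mel = np.log(mel_fb @ power_spec + 1e-10)
        mfcc = dct(log_mel, axis=0, norm="ortho")[:n_mfcc]
        return mfcc.T  # (frames, n_mfcc) matrix used as the target voice data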
Referring to fig. 3, in some embodiments, step S102 may include, but is not limited to, steps S301 to S303:
step S301, inputting target voice data into a preset acoustic model, wherein the acoustic model comprises a time domain convolution layer and a full connection layer;
step S302, extracting features of target voice data based on a time domain convolution layer to obtain preliminary acoustic features;
step S303, feature screening is carried out on the preliminary acoustic features based on the full connection layer, and target acoustic features are obtained.
In step S301 of some embodiments, the acoustic model may be built based on a deep neural network or based on a hidden markov model. For example, deep neural networks are employed to construct acoustic models that include time domain convolutional layers and fully-connected layers.
In step S302 of some embodiments, feature extraction is performed on the target voice data based on the time domain convolution layer, so as to obtain time sequence feature information in the target voice data, and obtain a preliminary acoustic feature.
In step S303 of some embodiments, the initial acoustic features are classified based on a prediction function (e.g., a softmax function, etc.) of the full-connection layer to obtain labeled acoustic features, where each labeled acoustic feature includes an initial acoustic feature and an acoustic tag of the initial acoustic feature, the acoustic tag includes a tone type, a pitch type, and so on, and a target acoustic feature that meets the current requirement is selected from a plurality of labeled acoustic features according to the acoustic tag, where the target acoustic feature may include multiple audio information such as a pitch, a duration length, a sounding frequency, and so on of target voice data.
For example, the speech features of the target voice data include semantic features, emotional features, regional features and speech-speed features. The semantic feature may be "consulting a credit card question"; the historical business feature may be "the credit card is overdue, the number of historical overdue occurrences is more than two, and the payroll amount is low"; the user statistical feature may be "prefers a credit card with a high limit"; and the interactive content may be "Hello, your card has been overdue several times, which has a great influence on your credit; please complete the repayment as soon as possible." Meanwhile, the emotional feature may be "normal emotion", the regional feature may be "dialect of city A" and the speech-speed feature may be "normal speech speed"; the user static features may be "male", "lives in city A", "age 38", "introverted character" and "likes cartoons"; the user statistical feature may be "prefers to handle deposit business"; the interactive emotional feature may be "normal emotion"; the interactive regional feature may be "dialect of city A"; the interactive stylized feature may be "game style"; and the interactive speech-speed feature may be "normal speech speed".
Through the steps S301 to S303, multiple kinds of acoustic characteristic information such as pitch, duration length, sounding frequency and the like of the target voice data can be conveniently identified, so that the obtained target acoustic characteristic can be used for a subsequent voice recognition process, and the accuracy of voice recognition is improved.
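A minimal sketch of an acoustic model consisting of a time-domain convolution layer followed by a fully connected layer, as described in steps S301 to S303; the channel sizes, the number of acoustic classes and the use of PyTorch are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class AcousticModel(nn.Module):
        def __init__(self, n_mfcc=13, n_classes=64):
            super().__init__()
            # Time-domain convolution layer: extracts preliminary acoustic features.
            self.conv = nn.Conv1d(n_mfcc, 128, kernel_size=5, padding=2)
            # Fully connected layer: maps the features to acoustic labels.
            self.fc = nn.Linear(128, n_classes)

        def forward(self, x):
            # x: (batch, frames, n_mfcc) target voice data
            h = torch.relu(self.conv(x.transpose(1, 2)))      # (batch, 128, frames)
            # Per-frame softmax prediction over the acoustic labels,
            # from which target acoustic features can then be screened.
            return torch.softmax(self.fc(h.transpose(1, 2)), dim=-1)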
In step S103 of some embodiments, the preset decoding model may be a static decoder based on a WFST (weighted finite-state transducer). The decoding model is used to decode the target voice data, and the decoded content of the target voice data can be converted into the form of a directed graph to obtain a target word graph (lattice). The target word graph stores the recognized word sequences and can be used to represent the recognized voice content of the target voice data; it comprises a plurality of word nodes and a voice feature sequence, the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes. In this way, the target voice data can be decoded into the form of a directed graph, the decoded content of the target voice data is displayed intuitively, and subsequent voice recognition is facilitated.
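One possible in-memory representation of such a word graph is sketched below; the class and field names are assumptions for illustration and are not an interface defined by this application.

    from dataclasses import dataclass, field

    @dataclass
    class WordArc:
        word: str       # the voice word carried by this edge
        weight: float   # decoding weight between the two adjacent word nodes

    @dataclass
    class WordNode:
        node_id: int
        arcs: dict = field(default_factory=dict)   # successor node id -> WordArc

    @dataclass
    class Lattice:
        start: int                                  # start node id
        end: int                                    # end node id
        nodes: dict = field(default_factory=dict)   # node id -> WordNode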
Referring to fig. 4, in some embodiments, the word nodes include a start node, an end node, and a plurality of intermediate nodes, and step S104 may include, but is not limited to, steps S401 to S404:
step S401, calculating a first target weight between a starting node and an intermediate node according to a preset algorithm;
step S402, calculating a second target weight between the end node and the intermediate node according to the preset algorithm;
step S403, calculating a third target weight between each intermediate node and other intermediate nodes according to a preset algorithm;
step S404, performing voice word traversal on each intermediate node according to the first target weight, the second target weight and the third target weight to obtain a plurality of candidate sentence paths.
In step S401 of some embodiments, the preset algorithm may be any of a plurality of algorithms included in the hidden Markov model, for example a forward algorithm, the Viterbi algorithm or a forward-backward algorithm. A hidden Markov model is used to perform sequence probability calculation on the start node and the intermediate nodes to obtain a weight value between the start node and each intermediate node; the smallest weight value is used as the first target weight α, and the intermediate node corresponding to the first target weight α is used as the optimal successor node starting from the start node.
In step S402 of some embodiments, a hidden Markov model is used to perform sequence probability calculation on the end node and the intermediate nodes to obtain a weight value between the end node and each intermediate node; the smallest weight value is used as the second target weight β, and the intermediate node corresponding to the second target weight β is used as the optimal preceding node of the end node.
In step S403 of some embodiments, sequence probability calculation is performed between each intermediate node and the other intermediate nodes by using a hidden Markov model to obtain a weight value between each intermediate node and every other intermediate node, and this weight value is used as the third target weight between the intermediate nodes; the size of the third target weight reflects the optimal preceding node and the optimal successor node of each intermediate node.
In step S404 of some embodiments, voice word traversal is performed for each intermediate node according to the first target weight, the second target weight and the third target weight: each edge leaving an intermediate node, i.e. the voice words between the intermediate node and all of its optimal successor nodes, is traversed, thereby generating a plurality of candidate sentence paths. It should be noted that one candidate sentence path includes the word nodes and voice words on one effective path, where an effective path is a path between the start node and the end node.
Through the steps S401 to S404, the connection relation between the word nodes can be conveniently constructed by utilizing the voice words based on the weight between the word nodes, so that a plurality of candidate sentence paths are formed, and the generated candidate sentence paths are more reasonable.
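Using the lattice sketch above, the candidate sentence paths of steps S401 to S404 can be enumerated with a simple depth-first traversal once the target weights have been attached to the arcs; this is an illustrative sketch rather than the decoder's actual search.

    def candidate_sentence_paths(lattice):
        # Enumerate every effective path (start node -> end node) and collect
        # the voice words along it as one candidate sentence path.
        paths = []

        def dfs(node_id, words):
            if node_id == lattice.end:
                paths.append(words)
                return
            for succ_id, arc in lattice.nodes[node_id].arcs.items():
                dfs(succ_id, words + [arc.word])

        dfs(lattice.start, [])
        return paths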
In step S105 of some embodiments, a neural network language model (NNLM) may be used to perform preliminary scoring on the voice words in each candidate sentence path to obtain preliminary word scores. Specifically, the voice words of an input candidate sentence path are one-hot encoded based on the neural network language model to obtain voice coding features; each voice coding feature is then multiplied by a preset reference matrix to obtain a target voice word matrix; finally, preliminary scoring is performed on each target voice word matrix by using the prediction function of the neural network language model to obtain the preliminary word score of each voice word, where the prediction function may be a softmax function, a tanh function or the like, without limitation. In this way, the candidate sentence path can conveniently be used as a batch unit, and the neural network language model is used to score the plurality of voice words belonging to the same candidate sentence path, which improves the scoring efficiency and hence the voice recognition efficiency.
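The preliminary scoring described above can be illustrated with the following toy sketch; the vocabulary, the reference matrix and the scoring convention are assumptions for the example, not the trained neural network language model itself.

    import numpy as np

    def preliminary_word_scores(path_words, vocab, ref_matrix):
        # vocab: word -> index; ref_matrix: (vocab_size, vocab_size) reference matrix.
        scores = []
        for word in path_words:
            one_hot = np.zeros(len(vocab))
            one_hot[vocab[word]] = 1.0                        # one-hot encoding of the voice word
            logits = one_hot @ ref_matrix                     # target voice-word matrix (one row here)
            probs = np.exp(logits) / np.exp(logits).sum()     # softmax prediction
            scores.append(float(probs[vocab[word]]))          # preliminary word score
        return scores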
Referring to fig. 5, in some embodiments, step S106 may include, but is not limited to, steps S501 to S502:
step S501, scoring the voice words in each candidate sentence path again to obtain intermediate word scores;
step S502, performing scoring correction on the preliminary word scores according to a preset formula, the target acoustic features and the intermediate word scores to obtain the target word scores.
In step S501 of some embodiments, the voice words in each candidate sentence path may be re-scored using a common language model (e.g., an n-gram model) to obtain intermediate word scores. For example, the n-gram model is used to one-hot encode the voice words of the input candidate sentence path to obtain voice coding features, and the prediction function of the n-gram model is used to compute a score for each voice coding feature to obtain the intermediate word score of each voice word, where the prediction function may be a softmax function, a tanh function or the like, without limitation.
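For the re-scoring of step S501, a simple bigram table can stand in for the common language model; the probability table and the unknown-pair fallback below are hypothetical.

    def intermediate_word_scores(path_words, bigram_prob, unk=1e-4):
        # bigram_prob: {(previous word, word): probability}, a hypothetical table.
        scores = []
        prev = "<s>"
        for word in path_words:
            scores.append(bigram_prob.get((prev, word), unk))  # intermediate word score
            prev = word
        return scores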
In step S502 of some embodiments, the correlation between each voice word and the target acoustic feature may be calculated first, for example by a cosine similarity algorithm, a collaborative filtering algorithm or the Euclidean distance, to obtain the acoustic relevance M; then, according to the preset formula, the intermediate word score Q is subtracted from the acoustic relevance M, and the result is added to the preliminary word score P to obtain the target word score N. That is, the preset formula is:
target word score N = acoustic relevance M − intermediate word score Q + preliminary word score P.
Through the above steps S501 to S502, the candidate sentence path can conveniently be used as a unit, the common language model is used to re-score the plurality of voice words belonging to the same candidate sentence path to obtain the intermediate word scores, and the preliminary word scores are corrected based on the target acoustic features and the intermediate word scores, so that target word scores with higher accuracy are obtained.
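The preset formula N = M − Q + P translates directly into code; the sketch below uses cosine similarity as one possible way to compute the acoustic relevance M, and the vector representations of the voice word and the target acoustic feature are assumed to be given.

    import numpy as np

    def target_word_score(word_vec, acoustic_vec, preliminary_p, intermediate_q):
        # Acoustic relevance M via cosine similarity (one of the options mentioned above).
        m = float(np.dot(word_vec, acoustic_vec) /
                  (np.linalg.norm(word_vec) * np.linalg.norm(acoustic_vec) + 1e-10))
        # Preset formula: N = M - Q + P.
        return m - intermediate_q + preliminary_p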
Referring to fig. 6, in some embodiments, step S107 includes, but is not limited to, steps S601 to S603:
step S601, summing the target word scores of each candidate sentence path to obtain candidate sentence scores of the candidate sentence paths;
step S602, screening the candidate sentence paths according to the candidate sentence scores to obtain target sentence paths;
step S603, the speech words in the target sentence path are spliced according to a preset sentence template, and target sentence data are obtained.
In step S601 of some embodiments, a sum function or other statistical function may be used to sum the target word scores of each candidate sentence path to obtain candidate sentence scores of the candidate sentence paths. For example, a candidate sentence path includes 5 speech words, each speech word corresponds to a target word score, and then the sum function is used to sum the target word scores of the 5 speech words to obtain a sum of the 5 target word scores, and the sum of the 5 target word scores is used as a candidate sentence score of the candidate sentence path.
In step S602 of some embodiments, the size of the candidate sentence score can intuitively reflect whether the candidate sentence path is the shortest path, and the shorter the candidate sentence path, i.e. the smaller the candidate sentence score, the more the sentence content represented by the candidate sentence path is close to the real semantic content of the current target speech.
In step S603 of some embodiments, the preset sentence template includes preset sentence format content, such as font size, font type, word spacing and paragraph spacing. All the voice words in the target sentence path are spliced according to the node order of the word nodes of the target sentence path to obtain the target sentence data.
For example, candidate sentence path A includes the voice words and their target word scores: orange (0.5), is (0.2), fruit (0.1); candidate sentence path B includes the voice words and their target word scores: sentence (0.2), is (0.2), fruit (0.1). The candidate sentence score of candidate sentence path A is 0.5+0.2+0.1=0.8, and the candidate sentence score of candidate sentence path B is 0.2+0.2+0.1=0.5. Candidate sentence path A is selected as the target sentence path according to the candidate sentence scores of the candidate sentence paths, and the voice words in the target sentence path are spliced to obtain the target sentence data "an orange is a fruit".
Through the steps S601 to S603, the candidate sentence score of each candidate sentence path can be obtained relatively conveniently, the target sentence path is screened out from the candidate sentence paths according to the candidate sentence scores, the candidate sentence paths can be screened in a score quantization mode, the target sentence data corresponding to the target voice data is determined based on the target sentence paths, the voice content of the target voice data is represented by the target sentence data, and the accuracy of voice recognition can be improved better.
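Steps S601 to S603, together with the selection rule of steps S701 and S702 below, reduce to summing the target word scores of each path, keeping the path with the smallest candidate sentence score, and splicing its voice words; the sketch assumes each candidate path is given as a list of (voice word, target word score) pairs.

    def pick_target_sentence(candidate_paths):
        # candidate_paths: list of candidate sentence paths,
        # each a list of (voice word, target word score) pairs.
        best = min(candidate_paths, key=lambda p: sum(s for _, s in p))  # smallest candidate sentence score
        return "".join(w for w, _ in best)  # spliced target sentence data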
Referring to fig. 7, in some embodiments, step S602 may include, but is not limited to, steps S701 to S702:
step S701, comparing candidate sentence scores of all candidate sentence paths;
step S702, taking the candidate sentence path with the smallest candidate sentence score as a target sentence path.
In step S701 of some embodiments, the shorter the candidate sentence path, that is, the smaller the candidate sentence score, the more closely the sentence content represented by the candidate sentence path is to the real semantic content of the current target speech, so the candidate sentence path may be screened according to the candidate sentence score, the candidate sentence scores of all the candidate sentence paths may be compared, and the target sentence path may be selected from the plurality of candidate sentence paths according to the magnitude relation of the candidate sentence scores of the plurality of candidate sentence paths.
In step S702 of some embodiments, in order to improve accuracy of speech recognition, when a target sentence path is selected from a plurality of candidate sentence paths according to a magnitude relation of candidate sentence scores of the plurality of candidate sentence paths, a candidate sentence path with a minimum candidate sentence score is selected, and sentence content of the candidate sentence path with the minimum candidate sentence score is closest to real semantic content of current target speech, and the candidate sentence path with the minimum candidate sentence score is taken as the target sentence path.
Through the steps S701 to S702, the candidate sentence score of each candidate sentence path can be obtained more conveniently, and the target sentence path is screened out from the plurality of candidate sentence paths according to the candidate sentence scores of all candidate sentence paths, so that the screening of the candidate sentence paths can be realized in a score quantization mode, and the accuracy and the rationality of path screening are improved.
According to the voice recognition method of the embodiments of the present application, target voice data are acquired, and acoustic features are extracted from the target voice data to obtain target acoustic features, so that the acoustic characteristics of the target voice data can be conveniently identified. Further, the target voice data are decoded based on a preset decoding model to obtain a target word graph; the target word graph comprises word nodes and a voice feature sequence, the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes, so that the target voice data can be decoded into graph form, the decoded content is displayed intuitively, and subsequent voice recognition is facilitated. Further, a plurality of candidate sentence paths are constructed based on the word nodes and the voice feature sequence of the target word graph, so that the voice words can be combined according to the connection relations between the word nodes to form the candidate sentence paths. The voice words in each candidate sentence path are preliminarily scored to obtain preliminary word scores, and the preliminary word scores are corrected based on the target acoustic features to obtain target word scores; by taking the candidate sentence path as a batch unit and scoring the voice words belonging to the same candidate sentence path together, the scoring efficiency and hence the voice recognition efficiency are improved. Finally, the candidate sentence paths are screened according to the target word scores to obtain a target sentence path, and the voice words in the target sentence path are spliced to obtain target sentence data; the candidate sentence paths are thus screened in a quantified, score-based manner, and the target sentence data corresponding to the target voice data are determined based on the target sentence path, which better improves the accuracy of voice recognition. As a result, the intelligent customer service robot can more accurately recognize the appeal expressed in the voice data of a service object during a dialogue, provide targeted responses and service feedback, effectively improve the dialogue quality and effectiveness in financial transactions, realize intelligent voice dialogue services, and improve the service quality, customer satisfaction and business success rate.
Referring to fig. 8, an embodiment of the present application further provides a voice recognition device, which may implement the foregoing voice recognition method, where the device includes:
a data acquisition module 801, configured to acquire target voice data;
the feature extraction module 802 is configured to perform acoustic feature extraction on the target voice data to obtain a target acoustic feature;
the decoding module 803 is configured to perform decoding processing on the target voice data based on a preset decoding model to obtain a target word graph; the target word graph comprises word nodes and a voice feature sequence, wherein the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes;
a path construction module 804, configured to construct a plurality of candidate sentence paths based on the word nodes and the voice feature sequences of the target word graph;
the preliminary scoring module 805 is configured to perform preliminary scoring on the speech word in each candidate sentence path to obtain a preliminary word score;
the scoring correction module 806 is configured to perform scoring correction on the preliminary term score based on the target acoustic feature to obtain a target term score;
and a screening module 807, configured to perform screening processing on the candidate sentence paths according to the target word score, obtain a target sentence path, and splice the speech words in the target sentence path, so as to obtain target sentence data.
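A hypothetical sketch of how such a device could be composed from the listed modules is given below; the method names mirror the module descriptions, while the callables themselves are placeholders rather than the actual modules 801 to 807.

```python
# Hypothetical composition of the modules 801-807; each callable is a placeholder.
from typing import Callable, List

class SpeechRecognitionDevice:
    def __init__(self,
                 acquire: Callable[[], bytes],
                 extract_features: Callable[[bytes], list],
                 decode: Callable[[bytes], dict],
                 build_paths: Callable[[dict], List[list]],
                 prelim_score: Callable[[List[list]], list],
                 correct_score: Callable[[list, list], list],
                 screen: Callable[[List[list], list], str]):
        # one attribute per module in the device description
        self.acquire, self.extract_features, self.decode = acquire, extract_features, decode
        self.build_paths, self.prelim_score = build_paths, prelim_score
        self.correct_score, self.screen = correct_score, screen

    def run(self) -> str:
        audio = self.acquire()                         # data acquisition module
        feats = self.extract_features(audio)           # feature extraction module
        lattice = self.decode(audio)                   # decoding module
        paths = self.build_paths(lattice)              # path construction module
        prelim = self.prelim_score(paths)              # preliminary scoring module
        corrected = self.correct_score(prelim, feats)  # scoring correction module
        return self.screen(paths, corrected)           # screening module
```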
The specific implementation of the voice recognition device is basically the same as the specific embodiment of the voice recognition method, and will not be described herein.
The embodiment of the application also provides an electronic device, which comprises a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the above voice recognition method. The electronic device may be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 902 may store an operating system and other application programs. When the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program codes are stored in the memory 902 and are invoked by the processor 901 to perform the voice recognition method of the embodiments of the present application;
An input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication interaction between this device and other devices, either in a wired manner (e.g. USB, network cable) or in a wireless manner (e.g. mobile network, Wi-Fi, Bluetooth);
a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the voice recognition method.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the application provides a voice recognition method, a voice recognition device, electronic equipment and a computer readable storage medium. Target voice data are acquired, and acoustic feature extraction is performed on the target voice data to obtain target acoustic features, which facilitates subsequent recognition of the acoustic characteristics of the target voice data. The target voice data are then decoded based on a preset decoding model to obtain a target word graph. The target word graph comprises word nodes and a voice feature sequence, the voice feature sequence comprises at least two voice words, and each voice word is connected to two adjacent word nodes, so that the target voice data can be decoded into graph form, the decoded content is displayed intuitively, and subsequent recognition is facilitated. A plurality of candidate sentence paths are then constructed from the word nodes and the voice feature sequence of the target word graph; that is, the voice words are combined according to the connection relations between the word nodes to form the candidate sentence paths. The voice words in each candidate sentence path are given a preliminary score to obtain preliminary word scores, and the preliminary word scores are corrected based on the target acoustic features to obtain target word scores. Because the voice words belonging to the same candidate sentence path are scored with the path as a batch unit, scoring efficiency, and therefore voice recognition efficiency, is improved. Finally, the candidate sentence paths are screened according to the target word scores to obtain a target sentence path, and the voice words in the target sentence path are spliced to obtain target sentence data. Screening the candidate sentence paths by score quantification and determining the target sentence data from the target sentence path improves recognition accuracy, so that an intelligent customer service robot can more accurately recognize the appeal expressed in the voice data of a service object during a dialogue, give targeted responses and service feedback, and effectively improve the dialogue quality and effectiveness of the financial transaction process, thereby realizing intelligent voice dialogue service, improving service quality and customer satisfaction, and increasing the business success rate.
The embodiments described in the present application are intended to describe the technical solutions of the embodiments more clearly and do not constitute a limitation on the technical solutions provided by the embodiments of the present application. As those skilled in the art will appreciate, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 do not limit the embodiments of the present application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions means any combination of these items, including any combination of single items or plural items. For example, at least one of a, b or c may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where a, b and c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, and thus do not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring target voice data;
extracting acoustic features of the target voice data to obtain target acoustic features;
decoding the target voice data based on a preset decoding model to obtain a target word graph; the target word graph comprises word nodes and a voice feature sequence, wherein the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes;
constructing a plurality of candidate sentence paths based on word nodes of the target word graph and the voice feature sequence;
performing preliminary scoring on the voice words in each candidate sentence path to obtain preliminary word scores;
performing scoring correction on the preliminary word score based on the target acoustic features to obtain a target word score;
and screening the candidate sentence paths according to the target word score to obtain a target sentence path, and splicing the voice words in the target sentence path to obtain target sentence data.
2. The method for recognizing speech according to claim 1, wherein the extracting acoustic features from the target speech data to obtain target acoustic features comprises:
Inputting the target voice data into a preset acoustic model, wherein the acoustic model comprises a time domain convolution layer and a full connection layer;
performing feature extraction on the target voice data based on the time domain convolution layer to obtain a preliminary acoustic feature;
and carrying out feature screening on the preliminary acoustic features based on the full connection layer to obtain the target acoustic features.
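A minimal sketch of an acoustic model of the kind described in claim 2 is shown below, assuming PyTorch; the layer sizes, kernel width, stride, and output feature dimension are illustrative assumptions, since the claim does not specify them.

```python
# Sketch of an acoustic model with a time-domain convolution layer and a fully
# connected layer; all dimensions are assumed for illustration (PyTorch).
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, in_dim: int = 1, hidden: int = 64, feat_dim: int = 40):
        super().__init__()
        # time-domain convolution over the raw waveform (batch, channels, samples)
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=400, stride=160)
        # fully connected layer that screens the preliminary features
        self.fc = nn.Linear(hidden, feat_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        prelim = torch.relu(self.conv(wav))       # preliminary acoustic features
        prelim = prelim.transpose(1, 2)           # (batch, frames, hidden)
        return self.fc(prelim)                    # target acoustic features

# Example: one second of 16 kHz audio -> a frame-level feature sequence
features = AcousticModel()(torch.randn(1, 1, 16000))
print(features.shape)   # torch.Size([1, 98, 40]) with the assumed settings
```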
3. The method of claim 1, wherein the word nodes include a start node, an end node, and a plurality of intermediate nodes, wherein the constructing a plurality of candidate sentence paths based on the word nodes of the target word graph and the speech feature sequence includes:
calculating a first target weight between the initial node and the intermediate node according to a preset algorithm;
calculating a second target weight between the end node and the intermediate node according to the preset algorithm;
calculating a third target weight between each intermediate node and other intermediate nodes according to the preset algorithm;
and traversing the voice word of each intermediate node according to the first target weight, the second target weight and the third target weight to obtain a plurality of candidate sentence paths.
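The "preset algorithm" for the first, second, and third target weights is not specified in claim 3; the sketch below substitutes a toy transition-cost table so that the weight computation and the traversal from the start node through intermediate nodes to the end node can be illustrated.

```python
# Stand-in weight computation and path traversal over a tiny word graph.
# The real "preset algorithm" is not specified in the claim; a lookup table of
# transition costs is assumed here purely for illustration.
from itertools import product

START, END = "<s>", "</s>"
intermediate = {1: "查询", 2: "余额", 3: "转账"}          # node id -> speech word

def weight(a: str, b: str) -> float:
    costs = {("<s>", "查询"): 0.2, ("查询", "余额"): 0.1, ("查询", "转账"): 0.4,
             ("余额", "</s>"): 0.1, ("转账", "</s>"): 0.3}
    return costs.get((a, b), 5.0)                        # large cost if unlinked

# first / second / third target weights, in the order named by the claim
w_start = {n: weight(START, w) for n, w in intermediate.items()}
w_end   = {n: weight(w, END) for n, w in intermediate.items()}
w_mid   = {(i, j): weight(intermediate[i], intermediate[j])
           for i, j in product(intermediate, repeat=2) if i != j}

# traverse the intermediate nodes to enumerate two-word candidate sentence paths
candidates = [([i, j], w_start[i] + w_mid[(i, j)] + w_end[j])
              for i, j in product(intermediate, repeat=2) if i != j]
for nodes, cost in sorted(candidates, key=lambda c: c[1])[:3]:
    print("".join(intermediate[n] for n in nodes), round(cost, 2))
```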
4. The method of claim 1, wherein scoring the preliminary term scores based on the target acoustic features to obtain target term scores comprises:
scoring the voice words in each candidate sentence path again to obtain intermediate word scores;
and grading and correcting the preliminary word scores according to a preset formula, the target acoustic characteristics and the intermediate word scores to obtain the target word scores.
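The "preset formula" in claim 4 is likewise not given; as one plausible reading, the sketch below corrects the preliminary word score by linearly interpolating it with the intermediate word score and subtracting an acoustic term, with the interpolation weights chosen purely for illustration.

```python
# Assumed linear-interpolation reading of the scoring correction; the actual
# preset formula, rescoring model, and weights are not specified in the claim.
def correct_word_score(prelim: float, intermediate: float, acoustic: float,
                       alpha: float = 0.5, beta: float = 0.3) -> float:
    """target = (1 - alpha) * preliminary + alpha * intermediate - beta * acoustic"""
    return (1.0 - alpha) * prelim + alpha * intermediate - beta * acoustic

# e.g. a word whose second-pass score and acoustic evidence both look better
print(correct_word_score(prelim=2.4, intermediate=1.8, acoustic=0.5))  # 1.95
```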
5. The method of claim 1, wherein the filtering the candidate sentence paths according to the target word score to obtain a target sentence path, and splicing the voice words in the target sentence path to obtain target sentence data, includes:
summing the target word scores of each candidate sentence path to obtain candidate sentence scores of the candidate sentence paths;
screening the candidate sentence paths according to the candidate sentence scores to obtain the target sentence paths;
and splicing the voice words in the target sentence path according to a preset sentence template to obtain the target sentence data.
6. The method of claim 5, wherein the filtering the candidate sentence paths according to the candidate sentence scores to obtain the target sentence paths comprises:
comparing the candidate sentence scores of all the candidate sentence paths;
and taking the candidate sentence path with the smallest candidate sentence score as the target sentence path.
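Claims 5 and 6 together describe summing the target word scores of each candidate sentence path, keeping the path with the smallest candidate sentence score, and splicing its voice words with a preset sentence template. A toy sketch under those assumptions follows; the template used here (plain concatenation plus a full stop) is an assumption, not the claimed template.

```python
# Toy screening and splicing step for claims 5-6; the sentence template here
# is an assumed placeholder.
from typing import Dict, List, Tuple

def pick_and_splice(paths: Dict[str, List[Tuple[str, float]]],
                    template: str = "{sentence}。") -> str:
    # candidate sentence score = sum of target word scores on the path (claim 5)
    totals = {name: sum(score for _, score in words) for name, words in paths.items()}
    best = min(totals, key=totals.get)            # smallest score wins (claim 6)
    sentence = "".join(word for word, _ in paths[best])
    return template.format(sentence=sentence)     # splice via a preset template

candidate_paths = {
    "path_a": [("你好", 1.2), ("查询", 0.9), ("余额", 0.6)],
    "path_b": [("您好", 0.8), ("查询", 0.9), ("余额", 0.6)],
}
print(pick_and_splice(candidate_paths))           # -> 您好查询余额。
```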
7. The voice recognition method according to any one of claims 1 to 6, wherein the acquiring the target voice data includes:
acquiring original voice data;
carrying out framing treatment on the original voice data to obtain initial voice data;
and performing frequency spectrum transformation on the initial voice data to obtain the target voice data, wherein the target voice data is a Mel cepstrum.
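One common way to realize the framing and spectral transformation of claim 7 is MFCC extraction; the librosa-based sketch below is such a realization, with the 25 ms frame length, 10 ms frame shift, and 13 cepstral coefficients chosen as assumptions rather than values taken from the claim.

```python
# One possible realization of claim 7 using librosa; frame length, hop size,
# and coefficient count are assumptions, not values from the claim.
import numpy as np
import librosa

def to_mel_cepstrum(raw: np.ndarray, sr: int = 16000) -> np.ndarray:
    # framing + spectral transform + Mel filterbank + cepstral step in one call
    return librosa.feature.mfcc(y=raw, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms frames
                                hop_length=int(0.010 * sr))  # 10 ms frame shift

if __name__ == "__main__":
    audio = np.random.randn(16000).astype(np.float32)        # 1 s of dummy audio
    print(to_mel_cepstrum(audio).shape)                       # (13, ~101) frames
```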
8. A speech recognition device, the device comprising:
the data acquisition module is used for acquiring target voice data;
the feature extraction module is used for extracting acoustic features of the target voice data to obtain target acoustic features;
the decoding module is used for decoding the target voice data based on a preset decoding model to obtain a target word graph; the target word graph comprises word nodes and a voice feature sequence, wherein the voice feature sequence comprises at least two voice words, and each voice word is connected with two adjacent word nodes;
The path construction module is used for constructing a plurality of candidate sentence paths based on the word nodes of the target word graph and the voice feature sequence;
the preliminary scoring module is used for carrying out preliminary scoring on the voice words in each candidate sentence path to obtain preliminary word scores;
the scoring correction module is used for scoring correction of the preliminary word scores based on the target acoustic features to obtain target word scores;
and the screening module is used for screening the candidate sentence paths according to the target word score to obtain a target sentence path, and splicing the voice words in the target sentence path to obtain target sentence data.
9. An electronic device comprising a memory storing a computer program and a processor that when executing the computer program implements the speech recognition method of any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech recognition method of any one of claims 1 to 7.
CN202310636139.4A 2023-05-31 2023-05-31 Speech recognition method, speech recognition device, electronic apparatus, and storage medium Pending CN116543753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310636139.4A CN116543753A (en) 2023-05-31 2023-05-31 Speech recognition method, speech recognition device, electronic apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310636139.4A CN116543753A (en) 2023-05-31 2023-05-31 Speech recognition method, speech recognition device, electronic apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN116543753A true CN116543753A (en) 2023-08-04

Family

ID=87447049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310636139.4A Pending CN116543753A (en) 2023-05-31 2023-05-31 Speech recognition method, speech recognition device, electronic apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN116543753A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination