CN114267341A

CN114267341A - Voice recognition processing method and device based on ATM service logic

Info

Publication number: CN114267341A
Application number: CN202111629658.5A
Authority: CN
Inventors: 梁升荣; 王曼; 罗秉安; 王永隆
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2022-04-01

Abstract

The application relates to a speech recognition processing method, apparatus, computer device, storage medium, and computer program product based on automatic teller machine (ATM) service logic, which can be used in the technical field of artificial intelligence. The method and device can improve the efficiency of recognizing voice instructions input by the user; compared with traditional speech recognition schemes, the amount of computation is reduced and the recognition speed and accuracy are improved, thereby improving the user experience on the automatic teller machine. The method comprises the following steps: acquiring user voice audio and identifying the current service state; screening an executable instruction set from a preset instruction library based on the current service state; constructing a state network based on the executable instruction set; based on the user voice audio, searching the state network using a path search algorithm to obtain a global optimal path; converting the global optimal path into a voice instruction; and instructing the automatic teller machine to perform the corresponding action according to the voice instruction.

Description

Voice recognition processing method and device based on ATM service logic

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for speech recognition processing based on ATM service logic, a computer device, a storage medium, and a computer program product.

Background

With the development of information technology and the internet, all industries have begun to digitize. Traditional banking services have also started to go digital: people no longer need to spend a large amount of time going to a bank outlet to handle business in person as in the past, services such as withdrawal on an Automatic Teller Machine (ATM) are becoming more and more intelligent, and faster and more efficient service can be provided to users through text and voice.

In current ATM self-service, after the user is identified, the user follows the ATM's voice prompts and enters instructions on a keypad, such as service item codes and the withdrawal amount, to complete the related service functions. However, in this service mode the user must enter instructions manually; the ATM cannot accurately acquire the user's voice instruction and accurately perform the corresponding action. The service mode is not fast or intelligent enough and cannot meet the user's need for efficient deposit or withdrawal.

Disclosure of Invention

In view of the above, it is necessary to provide a method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product for speech recognition processing based on ATM service logic.

In a first aspect, the application provides a speech recognition processing method based on ATM service logic. The method comprises the following steps:

acquiring a user voice audio and identifying the current service state;

screening an executable instruction set from a preset instruction library based on the current service state;

constructing a state network based on the set of executable instructions;

searching from the state network by utilizing a path search algorithm based on the user voice audio to obtain a global optimal path;

converting the global optimal path into a voice instruction;

and instructing the automatic teller machine to perform corresponding actions according to the voice instruction.

In one embodiment, the constructing a state network based on the set of executable instructions comprises:

converting the set of executable instructions to a word network;

converting the word network to a phoneme network;

converting the phoneme network to the state network; wherein each phoneme corresponds to a plurality of states.

In one embodiment, the searching for the globally optimal path from the state network by using a path search algorithm based on the user voice audio includes:

carrying out voice boundary detection on the user voice audio to obtain effective voice audio;

performing framing processing on the effective voice audio to obtain a multi-frame voice signal;

obtaining a multidimensional vector corresponding to each frame of voice signal according to the characteristic parameters of each frame of voice signal;

and searching the optimal path in the state network corresponding to the executable instruction set by utilizing a path search algorithm based on the multi-dimensional vector to obtain the global optimal path.

In one embodiment, the global optimal path comprises a plurality of target states; the converting the global optimal path into a voice instruction includes:

combining the target states into a plurality of target phonemes according to the corresponding relation between the target states and the target phonemes;

combining the plurality of target phonemes into a target word;

and constructing the voice instruction according to the target word.

In one embodiment, before the obtaining the user voice audio, the method further includes:

acquiring a user card through card detection equipment, and displaying a password input interface;

in response to a user entering a user password on the password entry interface, verifying the user password;

if the password passes the verification, verifying the card state of the user card;

and outputting verification passing information on the premise that the card state is verified to pass.

In one embodiment, the path search algorithm comprises the Viterbi algorithm.

In a second aspect, the application further provides a speech recognition processing device based on the automatic teller machine service logic. The device comprises:

the audio acquisition module is used for acquiring the voice audio of the user and identifying the current service state;

the instruction screening module is used for screening an executable instruction set from a preset instruction library based on the current service state;

a state network construction module for constructing a state network based on the set of executable instructions;

the optimal path searching module is used for searching the state network by utilizing a path searching algorithm based on the user voice audio to obtain a global optimal path;

the voice instruction conversion module is used for converting the global optimal path into a voice instruction;

and the action execution module is used for indicating the automatic teller machine to perform corresponding actions according to the voice instruction.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the steps in the embodiment of the speech recognition processing method based on the ATM service logic.

In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps in an embodiment of a method for speech recognition processing based on automated teller machine business logic.

In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps in an embodiment of the method for speech recognition processing based on ATM service logic.

According to the above speech recognition processing method, apparatus, computer device, storage medium, and computer program product based on ATM service logic, the user voice audio is acquired and the current service state is identified; an executable instruction set is screened from a preset instruction library based on the current service state; a state network is constructed based on the executable instruction set; based on the user voice audio, the state network is searched using a path search algorithm to obtain a global optimal path; the global optimal path is converted into a voice instruction; and the automatic teller machine is instructed to perform the corresponding action according to the voice instruction. The method and device can improve the efficiency of recognizing voice instructions input by the user; compared with traditional speech recognition schemes, the amount of computation is reduced and the recognition speed and accuracy are improved, thereby improving the user experience on the automatic teller machine.

Drawings

FIG. 1 is a diagram of an application environment of a method for speech recognition processing based on ATM service logic in one embodiment;

FIG. 2 is a flow diagram of a method for speech recognition processing based on ATM service logic in one embodiment;

FIG. 3 is a flow diagram of a speech recognition processing method based on ATM service logic in another embodiment;

FIG. 4 is a block diagram of a speech recognition processing device based on ATM service logic in one embodiment;

FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment;

FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The speech recognition processing method based on ATM service logic provided by the embodiments of the application can be applied to the application environment shown in FIG. 1, in which the terminal 101 communicates with the server 102 via a network. A data storage system may store the data that the server 102 needs to process; it may be integrated on the server 102, or located on the cloud or on another network server. The terminal 101 is an ATM (automatic teller machine) of any form, and the ATM includes a sound pickup device (for example, a microphone) for acquiring the user voice audio. The server 102 may be located with the ATM or separately from it, and may be implemented as a stand-alone server or as a server cluster comprising a plurality of servers.

In one embodiment, as shown in FIG. 2, a speech recognition processing method based on ATM service logic is provided. The method is described as applied to the server 102 in FIG. 1 and includes the following steps:

Step S201, obtaining the user voice audio and identifying the current service state.

The user voice audio refers to the speech segment spoken by the user at the ATM. Because each person's voice characteristics differ, the speech segment can also uniquely identify the user's voiceprint and thus the user's identity. The current service state refers to the current service node determined according to the ATM service logic. For example, when the ATM detects that the user has not inserted a card, the current service state is the card-to-be-inserted state; when the ATM detects the user's card (for example, a bank card) and the user enters a wrong password, the current service state is the identity-to-be-verified state; when the ATM detects the user's card and identity verification passes, the current service state is the service-type-to-be-confirmed state (for example, the service type is a withdrawal service or a deposit service). These service states are service nodes designed in advance according to the ATM service logic.

Specifically, a sound pickup device, which may be a microphone, is integrated into the ATM. When a user speaks within a certain range of the ATM, the sound pickup device captures the user voice audio, and the current service state is detected at the same time. For example, when the ATM detects that the user has not inserted a card (current service state: no medium), the current service state is the card-to-be-inserted state; when the ATM detects the user's card (for example, a bank card) and the user enters a wrong password, the current service state is the identity-to-be-verified state.
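The state identification described above can be sketched as a small decision function. This is only an illustrative sketch: the names `ServiceState` and `identify_service_state` and the two boolean inputs are assumptions made for the example, not part of the application, and a real ATM would have more service nodes.

```python
from enum import Enum


class ServiceState(Enum):
    # Hypothetical labels for the three service nodes named in the text.
    CARD_TO_BE_INSERTED = "card to be inserted"
    IDENTITY_TO_BE_VERIFIED = "identity to be verified"
    SERVICE_TYPE_TO_BE_CONFIRMED = "service type to be confirmed"


def identify_service_state(card_detected: bool, identity_verified: bool) -> ServiceState:
    """Map simple ATM observations to the current service node."""
    if not card_detected:
        # No medium: the user has not inserted a card yet.
        return ServiceState.CARD_TO_BE_INSERTED
    if not identity_verified:
        # Card present, but the password has not (yet) passed verification.
        return ServiceState.IDENTITY_TO_BE_VERIFIED
    # Card present and identity verified: wait for deposit, withdrawal, etc.
    return ServiceState.SERVICE_TYPE_TO_BE_CONFIRMED
```

For instance, `identify_service_state(True, False)` corresponds to the case where a card is detected and a wrong password was entered.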

Step S202, screening an executable instruction set from a preset instruction library based on the current service state;

The preset instruction library is a list of all commands designed and stored in advance according to the ATM service logic; the commands include balance inquiry, deposit, withdrawal, transfer, password modification, and the like. The executable instruction set refers to the commands that can be processed in the current service state. For example, in the no-medium state (that is, the user has not inserted a card), the instructions that can be processed next include prompting card insertion, identity recognition, cardless withdrawal, and the like, whereas operations such as transfer and deposit cannot be processed. The executable instructions can be set flexibly according to each bank's regulations and service logic.

Specifically, in step S202, after identifying the current service state (that is, the current service node), the server 102 associates the current service node with one or more executable instructions to form the executable instruction set, where the executable instructions are screened from a preset full instruction library. The full instruction library is obtained in advance according to the design of the banking service logic. The banking service logic refers to the instructions that can be executed in each state of the banking service. For example, the service logic at the ATM is as follows. First, when the customer begins operating the ATM, the interface allows balance inquiry and deposit, and the current state and executable instructions are { state: no medium; list of commands: query balance, deposit }. After an operation medium (such as a debit card) is inserted, the available operations are deposit, withdrawal, transfer, balance inquiry, and password modification, and the service state and executable instructions screened for recognition are { state: card number identified; list of commands: deposit, withdraw, transfer, query balance, modify password }. If withdrawal is selected, then after a password or other verification information is entered and verification passes, the available operation is withdrawal, and the service state and executable instructions screened for recognition are { state: verification passed; list of commands: withdraw 100 yuan, withdraw 200 yuan, 200 yuan }.
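The screening in step S202 amounts to a lookup from service state to command list. A minimal sketch follows, assuming the instruction library is held as an in-memory dictionary; the state keys, the variable `PRESET_INSTRUCTION_LIBRARY`, and the function name are hypothetical, and the command lists simply mirror the example above.

```python
# Hypothetical preset instruction library: service state -> executable commands.
PRESET_INSTRUCTION_LIBRARY = {
    "no medium": ["query balance", "deposit"],
    "card number identified": [
        "deposit", "withdraw", "transfer", "query balance", "modify password",
    ],
    "verification passed": ["withdraw 100 yuan", "withdraw 200 yuan"],
}


def screen_executable_instructions(state, library=PRESET_INSTRUCTION_LIBRARY):
    """Return the commands that may be recognized in the given service state.

    An unknown state yields an empty list, so nothing can be recognized.
    """
    # Copy the list so callers cannot modify the library by accident.
    return list(library.get(state, []))
```

Because the lookup is repeated before every recognition, changing a bank's rules only requires editing the library, not the recognizer.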

Step S203, constructing a state network based on the executable instruction set;

In the field of speech recognition, the pronunciation of a word is generally considered to be composed of phonemes. For English, a commonly used phoneme set is the set of 39 phonemes from Carnegie Mellon University; see the CMU Pronouncing Dictionary. For Chinese, all initials and finals are generally used directly as the phoneme set. A state can be understood as a speech unit finer than a phoneme; one phoneme is generally divided into 3 states. That is, every 3 states combine into one phoneme, and one or more phonemes combine into one word.

In step S203, since the instructions in the executable instruction set are generally simple sentences or words, where the sentences are also composed of words, and each word corresponds to a plurality of states, all the words in the executable instruction set can be decomposed into corresponding states, and all the decomposed states form a state network.

Step S204, searching the state network by using a path search algorithm based on the user voice audio to obtain a global optimal path.

The path search algorithm is an algorithm from the field of speech recognition: after the speech segment to be recognized is obtained, an optimal state path is found in a known state network according to the acoustic waveform characteristics of the speech segment. The state path includes the most probable states and the transition relationships between them. A transition relationship may be the connection between two words; for example, the word before a noun is more likely to be a verb and less likely to be another noun.

Specifically, in step S204, after the speech segment to be recognized is obtained, an optimal state path is found in the known state network according to the acoustic waveform characteristics of the speech segment; the states on the optimal path correspond to the most probable words, and the combination of these words is the most probable content of the audio.

Step S205, converting the global optimal path into a voice command.

The voice instruction refers to the specific content corresponding to the voice information. What the machine acquires is only a speech segment, which to the machine is merely a waveform; unlike a human, the machine cannot immediately know what the waveform means. The meaning (including semantics and grammar) corresponding to the speech waveform therefore needs to be decoded by an algorithm, and this combined meaning is the voice instruction, that is, the specific content of the user voice audio.

Specifically, the states in the global optimal path are mapped to the corresponding words, the transition relationships between the states are converted into the corresponding word order, and the words are combined to obtain the content of the user voice audio. For example, the content of a speech segment from the user is recognized as "deposit", "card withdrawal", "withdrawal", and so on.

Step S206, instructing the automatic teller machine to perform corresponding actions according to the voice instruction.

Specifically, after recognizing the instruction content corresponding to the voice audio of the user, the server 102 may instruct the automatic teller machine to perform a corresponding action according to the voice instruction of the user, for example, if the voice content of the user is recognized as "card exit", the automatic teller machine is instructed to perform a card exit action, and if the voice content of the user is recognized as "deposit", the server performs a next action according to the current service status, for example, opening a banknote receiving device to prompt the user to deposit banknotes.

In this embodiment, the user voice audio is acquired and the current service state is identified; an executable instruction set is screened from a preset instruction library based on the current service state; a state network is constructed based on the executable instruction set; based on the user voice audio, the state network is searched using a path search algorithm to obtain a global optimal path; the global optimal path is converted into a voice instruction; and the automatic teller machine is instructed to perform the corresponding action according to the voice instruction. Each time user voice audio is obtained, the executable instruction set is screened according to the current service state, which narrows the search range of the path search algorithm. This improves the efficiency of recognizing voice instructions input by the user; compared with traditional speech recognition schemes, the amount of computation is reduced and the recognition speed and accuracy are improved, thereby improving the user experience on the automatic teller machine.

In an embodiment, the step S203 includes: converting the executable instruction set into a word network; converting the word network into a phoneme network; and converting the phoneme network into a state network, wherein each phoneme corresponds to a plurality of states.

Specifically, after obtaining the executable instruction set corresponding to the current service state, the server 102 splits each executable instruction into words to form a word network. Because each word is composed of one or more phonemes, each word is then split into its phonemes, expanding the word network into a phoneme network. Because each phoneme is composed of 3 states, the phoneme network is in turn expanded into a state network. Each state actually corresponds to a number sequence, which is convenient for computer processing.
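The three-level expansion can be illustrated with a toy example. The sketch below assumes instructions are written as space-separated pinyin syllables and uses a tiny hypothetical lexicon (`LEXICON`) that maps each syllable to an initial and a final; the state labels such as `t_0` are likewise illustrative, and each instruction is kept as a simple linear chain rather than a full graph.

```python
STATES_PER_PHONEME = 3  # the text assumes 3 states per phoneme

# Hypothetical pronunciation lexicon: word -> phonemes (initial, final).
LEXICON = {
    "tui": ["t", "ui"],
    "ka": ["k", "a"],
    "cun": ["c", "un"],
    "kuan": ["k", "uan"],
}


def to_word_network(instructions):
    """Split each executable instruction into its words."""
    return [instruction.split() for instruction in instructions]


def to_phoneme_network(word_network, lexicon=LEXICON):
    """Replace every word by its phonemes, keeping one chain per instruction."""
    return [[ph for word in words for ph in lexicon[word]] for words in word_network]


def to_state_network(phoneme_network, n_states=STATES_PER_PHONEME):
    """Replace every phoneme by n_states sub-phoneme states."""
    return [
        [f"{ph}_{i}" for ph in phonemes for i in range(n_states)]
        for phonemes in phoneme_network
    ]


def build_state_network(instructions):
    """Executable instruction set -> word network -> phoneme network -> state network."""
    return to_state_network(to_phoneme_network(to_word_network(instructions)))
```

For example, `build_state_network(["tui ka"])` yields a single chain of 12 states, since the instruction has 4 phonemes with 3 states each.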

Further, in the present embodiment, an HMM (Hidden Markov Model) is used to construct the state network. An HMM describes the process by which a hidden Markov chain randomly generates an observation sequence, and is a generative model. Taking the HMM as an example, the specific optimization for speech recognition is as follows:

The present embodiment builds the state network dynamically. Other speech recognition schemes construct a full state network in advance, so the result is limited to that preset network. The smaller the network, the smaller the recognition range: for example, if only the two words "eating" and "sleeping" are defined, the result is always one of these two words. Conversely, the larger the network, the more difficult it is to achieve good recognition accuracy. Dynamically selecting the state network according to the bank's currently executable instruction set therefore limits the network to a smaller and more accurate one. Before each recognition, the instructions that can be executed in the current service state are detected, and these instructions are then formed into a state network:

Z = {Z1, Z2, Z3, Z4, ..., Zn}

where Z is the executable instruction set in a given service state and Zn is the nth instruction in the set (n is a natural number). This instruction set may be stored on the server or on a local terminal device such as the ATM.

In the above embodiment, the executable instruction set is expanded into a state network, each state actually corresponds to a number sequence, which is beneficial for the computer to correspond each frame of the user voice audio to the corresponding state, and provides a mathematical basis for semantic analysis.

In an embodiment, FIG. 3 is a schematic flowchart of a speech recognition processing method based on ATM service logic in another embodiment. As shown in FIG. 3, the step S204 includes: performing voice boundary detection on the user voice audio to obtain effective voice audio; performing framing processing on the effective voice audio to obtain a multi-frame voice signal; obtaining a multi-dimensional vector corresponding to each frame of voice signal according to the characteristic parameters of that frame; and searching for the optimal path in the state network corresponding to the executable instruction set by using a path search algorithm based on the multi-dimensional vectors to obtain the global optimal path.

Voice boundary detection, namely VAD (Voice Activity Detection, also called voice endpoint detection), means that before formal speech recognition starts, the silence or background noise at the head and tail of the user voice audio is removed to reduce interference with subsequent steps. The characteristic parameters generally refer to MFCC (Mel-frequency cepstral coefficient) parameters, a set of feature vectors obtained by encoding the physical information of the speech (the spectral envelope and its details).

Specifically, after the user voice audio is obtained, it needs to be preprocessed to improve subsequent processing efficiency. The preprocessing includes VAD: the original user voice audio is subjected to noise suppression and silence removal through VAD to obtain the effective voice audio. The effective voice audio is then framed, that is, cut into small segments, each of which is called a frame. Framing is generally not a simple cut but is implemented with a moving window function. After framing, the speech becomes many small segments, that is, a multi-frame voice signal. However, the waveform has little descriptive power in the time domain, so it must be transformed. The transformation used in this embodiment is MFCC (Mel-frequency cepstral coefficient) extraction, which, based on the physiological characteristics of the human ear, turns each frame's waveform into a multi-dimensional vector; this vector can be loosely understood as containing the content information of that frame of speech. This process is called acoustic feature extraction. MFCC feature extraction yields the characteristic parameters of each frame, and the characteristic parameters of each frame form a multi-dimensional vector. At this point the sound has become a matrix of 12 rows (assuming 12-dimensional acoustic features) and N columns, called the observation sequence, where N is the total number of frames. How this matrix is converted into text is described next.

How is each frame obtained above matched to a specific state? The present embodiment uses a path search algorithm: the frame is assigned to whichever known state it most probably corresponds to.

Where do these probabilities come from? They come from the so-called "acoustic model", which stores a large set of parameters from which the probability of a frame given a state can be obtained. The process of obtaining this large set of parameters is called "training" and requires a large amount of speech data. Hidden Markov Models (HMMs) are used here. Before speech recognition actually starts, the HMM needs to be trained on historical sample audio to obtain the model parameters, where the historical sample audio is audio produced by users during ATM service. With the trained model, the first step of speech recognition is to construct a state network, namely the state network built from the executable instruction set in this embodiment. Each time a user utters voice audio, the server detects the current node, re-acquires the executable instruction set, and reconstructs the state network, so the state network constructed by the server differs from one recognition to the next. The second step is to search the state network for the path that best matches the sound.

The speech recognition process is in fact a search for an optimal path in the state network, that is, the path that the speech to be recognized most probably corresponds to; this is called "decoding". The path search algorithm is a dynamic programming algorithm with pruning; the Viterbi algorithm is used here to find the global optimal path.

The path search algorithm uses a cumulative probability made up of three parts: the probability of each frame given each state; the transition probability of each state transitioning to itself or to the next state; and the language probability, which is obtained from the statistical regularities of the language. The first two probabilities are obtained from the acoustic model and the last from the language model. The language model is trained on a large amount of text and exploits the statistical regularities of a language (including the statistical distribution of parts of speech such as nouns, verbs, and adjectives, and of sentence grammar such as subject-predicate-object structure) to help improve recognition accuracy. The language model is important: without it, when the state network is large, the recognized result is basically a jumble.
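The silence removal and framing steps can be sketched as follows. This is a simplified stand-in, not the embodiment's actual preprocessing: a plain amplitude threshold takes the place of real VAD, a Hamming window is assumed as the moving window function, and the function names and parameters are hypothetical. MFCC extraction is not shown.

```python
import math


def trim_silence(samples, threshold):
    """Crude endpoint detection: drop leading and trailing samples whose
    magnitude is below threshold (real VAD is considerably more elaborate)."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:
        return []
    return samples[voiced[0]:voiced[-1] + 1]


def frame_signal(samples, frame_len, hop_len):
    """Cut samples into overlapping frames and apply a Hamming window to each.

    frame_len and hop_len are in samples; hop_len < frame_len gives overlap.
    """
    if frame_len < 2 or hop_len < 1:
        raise ValueError("frame_len must be >= 2 and hop_len >= 1")
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    # Slide the window over the signal; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

With a frame length of 4 samples and a hop of 2, a 10-sample signal produces 4 overlapping frames; each frame would then be passed to feature extraction to obtain one column of the observation sequence.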
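The decoding step can be illustrated with a minimal Viterbi sketch over log probabilities. The assumptions are stated loudly: the emission and transition tables are supplied directly rather than coming from a trained acoustic model, the language-model probability is omitted (it could be folded into the transitions between words), and no pruning is performed. The function and parameter names are hypothetical.

```python
import math


def viterbi(log_emit, log_trans, log_init):
    """Find the most probable state path.

    log_emit[t][s]  : log probability of frame t given state s
    log_trans[i][j] : log probability of moving from state i to state j
    log_init[s]     : log probability of starting in state s
    Returns (best_path, best_log_probability).
    """
    n_states = len(log_init)
    # Cumulative log probability of the best path ending in each state.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at frame t.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]
```

A small usage example with two left-to-right states and three frames:

```python
NEG_INF = float("-inf")
log_init = [0.0, NEG_INF]                      # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],   # state 0: stay or advance
             [NEG_INF, 0.0]]                   # state 1: stay only
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.2), math.log(0.8)],
            [math.log(0.1), math.log(0.9)]]
path, log_prob = viterbi(log_emit, log_trans, log_init)
```

Working in the log domain turns the cumulative product of probabilities into a sum, which avoids numerical underflow over long frame sequences.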

In this embodiment, the path search algorithm finds the most probable state path in the constructed state network for the user speech to be recognized, thereby recognizing the text content of the speech.

Further, the above embodiment forms the state network from the screened instruction set and dynamically matches one state network to each current operation, which not only greatly reduces the size of the state network but also restricts the result to valid instructions. The beneficial effects are as follows:

1. The size of the state network is reduced, so the amount of computation is greatly reduced, unnecessary computation is avoided, and recognition efficiency is improved.

2. The matching range is reduced and the state network contains only the instructions that are valid in the current state, so recognition accuracy is improved to a certain extent.

In an embodiment, the global optimal path includes a plurality of target states, and the step S205 includes: combining the target states into a plurality of target phonemes according to the corresponding relation between the target states and the target phonemes; combining a plurality of target phonemes into a target word; and constructing a voice instruction according to the target word.

The target states are the most probable states, found by the search algorithm, for the speech to be recognized. Every three states correspond to one phoneme.

Specifically, the target states are combined into a plurality of target phonemes according to the correspondence between target states and target phonemes; the target phonemes are combined into target words; and the voice instruction is constructed from the target words. At this point the specific text content of the user voice audio has been recognized.

In this embodiment, the recognized global optimal path is converted into the corresponding words through the correspondence between states and phonemes and between phonemes and words, and the words form a sentence. The user's voice instruction is thus recognized, which is the prerequisite for performing the corresponding operation according to the user's instruction.
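The conversion from target states back to words can be sketched as the reverse of the network expansion. The sketch assumes per-frame state labels of the form `phoneme_index` (for example `t_0`), a small hypothetical lexicon, and a greedy longest-match when grouping phonemes into words; none of these details are specified in the application.

```python
from itertools import groupby

# Hypothetical pronunciation lexicon: word -> phonemes.
LEXICON = {"tui": ["t", "ui"], "ka": ["k", "a"]}


def states_to_phonemes(state_path):
    """Collapse a per-frame state path such as t_0 t_0 t_1 t_2 ... into phonemes."""
    # A state may last several frames, so drop consecutive repeats first.
    distinct = [state for state, _ in groupby(state_path)]
    phonemes = []
    for state in distinct:
        phoneme, index = state.rsplit("_", 1)
        if index == "0":  # the first state of a phoneme marks its start
            phonemes.append(phoneme)
    return phonemes


def phonemes_to_words(phonemes, lexicon=LEXICON):
    """Group phonemes into words by greedy longest match against the lexicon."""
    by_pronunciation = {tuple(p): w for w, p in lexicon.items()}
    max_len = max(len(p) for p in by_pronunciation)
    words, i = [], 0
    while i < len(phonemes):
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            word = by_pronunciation.get(tuple(phonemes[i:i + n]))
            if word is not None:
                words.append(word)
                i += n
                break
        else:
            raise ValueError(f"no word matches phonemes starting at index {i}")
    return words


def path_to_instruction(state_path, lexicon=LEXICON):
    """Target states -> target phonemes -> target words -> voice instruction."""
    return " ".join(phonemes_to_words(states_to_phonemes(state_path), lexicon))
```

Because the state network is built only from the executable instruction set, every decoded path is expected to map to one of the instructions in that set.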

In an embodiment, before the step S201, the method further includes: acquiring a user card through card detection equipment, and displaying a password input interface; verifying the user password in response to the user inputting the user password on the password input interface; if the password passes the verification, the card state of the user card is verified; and outputting verification passing information on the premise that the card state is verified to pass.

Specifically, as shown in fig. 3, the dotted lines in the figure represent the logic flow and the solid lines represent the actual flow. When a user comes to the ATM to handle business, the initial interface of the ATM prompts the user to insert a card; after the card is inserted, the ATM interface prompts for a password (that is, the password input interface is displayed). The user inputs the password, and the ATM performs password verification and card state verification through the server 102. Card state verification includes, for example, checking whether the card is within its validity period and whether the card is barred from transactions. After verification passes, verification passing information is output, either on the display interface or by voice, to prompt the user to perform the next operation, and the user voice audio is received. After receiving the voice, the ATM switches to background voice processing, and the voice recognition module enters the voice recognition stage once it has received the audio source. First, the current service state of the ATM equipment is detected. Once the current service state has been acquired, the previous state network (the one used in the last recognition) is discarded; each recognition is independent and builds a new state network. Next, the executable instruction set is screened according to the current service state; for example, operations that require login, such as withdrawal, cannot be performed while the user is not logged in.
Then, a word network is formed from the screened instruction set; for example, when the user is not logged in, only the card ejection operation can be executed, so the word network at that moment consists only of the card ejection instruction. The word network is then expanded into a phoneme network (the pronunciation of each word is composed of phonemes), which is in turn expanded into a state network (states are speech units finer than phonemes). Next, VAD (voice activity detection, i.e., voice endpoint detection) processing is performed on the audio, followed by framing, so that the waveform of each frame is converted into a multi-dimensional vector. Then, the optimal path, that is, the path with the highest probability of corresponding to the voice, is searched for in the state network; this process is called decoding, and the Viterbi algorithm is used to find the global optimal path. Finally, the states are combined into phonemes and the phonemes into words, which completes the speech recognition. After recognition is complete, the result is converted into an executable command and transmitted back to the ATM, the ATM executes the returned command, and one round of human-machine interaction is complete.
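The decoding step can be illustrated with a generic textbook Viterbi search, sketched below in Python. The three-state left-to-right model, its probabilities, and the discrete observation symbols are assumptions chosen for illustration; in a real recognizer the emission scores would come from an acoustic model applied to the per-frame feature vectors.

```python
import math


def log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for the observation sequence."""
    # best[t][s]: log-probability of the best path that ends in s at frame t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state to recover the whole path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Toy left-to-right model (all numbers are illustrative assumptions).
states = ["s0", "s1", "s2"]
start = {"s0": 1.0, "s1": 0.0, "s2": 0.0}
trans = {"s0": {"s0": 0.5, "s1": 0.5, "s2": 0.0},
         "s1": {"s0": 0.0, "s1": 0.5, "s2": 0.5},
         "s2": {"s0": 0.0, "s1": 0.0, "s2": 1.0}}
emit = {"s0": {"a": 0.8, "b": 0.1, "c": 0.1},
        "s1": {"a": 0.1, "b": 0.8, "c": 0.1},
        "s2": {"a": 0.1, "b": 0.1, "c": 0.8}}

log_start = {s: log(p) for s, p in start.items()}
log_trans = {s: {t: log(p) for t, p in row.items()} for s, row in trans.items()}
log_emit = {s: {o: log(p) for o, p in row.items()} for s, row in emit.items()}

score, path = viterbi(["a", "a", "b", "c", "c"], states,
                      log_start, log_trans, log_emit)
print(path)  # ['s0', 's0', 's1', 's2', 's2']
```

Working in log-probabilities avoids numerical underflow on long utterances, and the cost of the search grows with the number of states, which is why a smaller state network reduces the amount of calculation.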

According to the embodiment, the user card is obtained, and the user identity and the card state are verified, so that a precondition is provided for the subsequent voice detection process.
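The order of the checks (password first, then card state) can be sketched as follows. The `Card` fields, the expiry-only card state check, and the local comparison are simplifying assumptions for illustration; in the description these verifications are carried out through the server 102, and a real system would not hold a plain-text password.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Card:
    number: str
    expiry: date
    password: str  # toy field only; never store plain-text passwords in practice


def authenticate(card, entered_password, today):
    """Verify the password, then the card state; return (passed, message)."""
    if entered_password != card.password:
        return False, "password verification failed"
    # Card state check reduced to a validity-period check for this sketch.
    if today > card.expiry:
        return False, "card state verification failed: card expired"
    return True, "verification passed"


card = Card(number="0000", expiry=date(2030, 1, 31), password="123456")
print(authenticate(card, "123456", date(2025, 1, 1)))
```

Only when this function reports that verification passed would the voice recognition flow described above begin.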

It should be understood that, although the steps in the flowcharts related to the embodiments described above are displayed sequentially as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts related to the embodiments described above may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, an embodiment of the present application further provides a speech recognition processing apparatus based on ATM service logic for implementing the above speech recognition processing method based on ATM service logic. The solution provided by the apparatus is similar to that described for the method above; therefore, for the specific limitations in one or more embodiments of the apparatus provided below, reference may be made to the limitations of the method above, which are not repeated here.

In one embodiment, as shown in fig. 4, there is provided a speech recognition processing apparatus 400 based on ATM service logic, comprising: an audio acquisition module 401, an instruction screening module 402, a state network construction module 403, an optimal path search module 404, a voice instruction conversion module 405, and an action execution module 406, wherein:

the audio acquisition module 401 is configured to acquire a user voice audio and identify a current service state;

an instruction screening module 402, configured to screen an executable instruction set from a preset instruction library based on the current service state;

a state network construction module 403 for constructing a state network based on the set of executable instructions;

an optimal path searching module 404, configured to search, based on the user voice audio, a global optimal path from the state network by using a path searching algorithm;

a voice instruction conversion module 405, configured to convert the global optimal path into a voice instruction;

and the action execution module 406 is configured to instruct the automatic teller machine to perform a corresponding action according to the voice instruction.

In one embodiment, the state network construction module 403 is further configured to convert the set of executable instructions into a word network; convert the word network into a phoneme network; and convert the phoneme network into the state network, wherein each phoneme corresponds to a plurality of states.
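The three conversions can be sketched in Python as below. The sketch represents each network as a list of linear paths, assumes three states per phoneme, and uses a hypothetical pronunciation dictionary with romanized toy words; a practical network would usually be a graph that shares common prefixes.

```python
STATES_PER_PHONEME = 3  # assumption: each phoneme expands into three states

# Hypothetical pronunciation dictionary: word -> phonemes.
PRONUNCIATIONS = {"tui": ["t", "ui"], "ka": ["k", "a"]}


def build_word_network(instructions):
    """Turn each executable instruction into one path of words."""
    return [instruction.split() for instruction in instructions]


def build_phoneme_network(word_network):
    """Expand every word on every path into its phonemes."""
    return [[phoneme for word in path for phoneme in PRONUNCIATIONS[word]]
            for path in word_network]


def build_state_network(phoneme_network):
    """Expand every phoneme into its (phoneme, state_index) states."""
    return [[(phoneme, i) for phoneme in path for i in range(STATES_PER_PHONEME)]
            for path in phoneme_network]


word_network = build_word_network(["tui ka"])
phoneme_network = build_phoneme_network(word_network)
state_network = build_state_network(phoneme_network)
print(word_network)
print(phoneme_network)
print(len(state_network[0]))  # 12 states for 4 phonemes
```

Because the network is rebuilt from the screened instruction set on every recognition, its size follows directly from how many instructions survive the screening.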

In an embodiment, the optimal path searching module 404 is further configured to perform voice boundary detection on the user voice audio to obtain an effective voice audio; perform framing processing on the effective voice audio to obtain a multi-frame voice signal; obtain a multi-dimensional vector corresponding to each frame of voice signal according to the characteristic parameters of that frame; and search for the optimal path in the state network corresponding to the executable instruction set by using a path search algorithm based on the multi-dimensional vectors, to obtain the global optimal path.
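The front-end steps (boundary detection, framing, and per-frame vectors) might be sketched as follows. The energy threshold detector and the two-dimensional feature (log energy and zero-crossing rate) are stand-ins chosen for brevity; the application does not specify the detection method or the characteristic parameters, and practical systems commonly use richer features such as MFCCs.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def detect_speech(samples, frame_len, hop, threshold):
    """Energy-based boundary detection: keep the span from the first to the last active frame."""
    frames = frame_signal(samples, frame_len, hop)
    active = [i for i, frame in enumerate(frames) if energy(frame) > threshold]
    if not active:
        return []
    return samples[active[0] * hop:active[-1] * hop + frame_len]


def features(frame):
    """Toy 2-dimensional feature vector: log energy and zero-crossing rate."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return [math.log(energy(frame) + 1e-10), crossings / (len(frame) - 1)]


# Synthetic input: silence, a tone standing in for speech, then silence.
samples = ([0.0] * 160
           + [math.sin(2 * math.pi * 0.05 * n) for n in range(320)]
           + [0.0] * 160)
speech = detect_speech(samples, frame_len=80, hop=40, threshold=0.01)
vectors = [features(frame) for frame in frame_signal(speech, frame_len=80, hop=40)]
print(len(speech), len(vectors), len(vectors[0]))
```

The resulting list of vectors, one per frame, is the kind of input that the path search over the state network would then score.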

In an embodiment, the global optimal path comprises a plurality of target states; the voice instruction converting module 405 is further configured to combine the target states into a plurality of target phonemes according to the corresponding relationship between the target states and the target phonemes; combine the plurality of target phonemes into a target word; and construct the voice instruction according to the target word.

In one embodiment, the apparatus further comprises an identity authentication unit, configured to acquire the user card through the card detection equipment and display the password input interface; verify the user password in response to the user entering the user password on the password input interface; verify the card state of the user card if the password passes verification; and output verification passing information if the card state passes verification.

In one embodiment, the path search algorithm comprises the Viterbi algorithm.

The modules in the above speech recognition processing apparatus based on ATM service logic can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing user identity information, card information and executable instruction data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of speech recognition processing based on automated teller machine business logic.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of speech recognition processing based on automated teller machine business logic. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the configurations shown in figs. 5 and 6 are only block diagrams of the parts relevant to the present disclosure and do not limit the computer device to which the present disclosure may be applied; a particular computer device may include more or fewer components than those shown in the figures, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:

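acquiring a user voice audio and identifying the current service state;

screening an executable instruction set from a preset instruction library based on the current service state;

constructing a state network based on the set of executable instructions;

searching, based on the user voice audio, a global optimal path from the state network by using a path search algorithm;

converting the global optimal path into a voice instruction;

and instructing the automatic teller machine to perform a corresponding action according to the voice instruction.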

In one embodiment, the processor, when executing the computer program, further performs the steps of:

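converting the set of executable instructions to a word network;

converting the word network to a phoneme network;

converting the phoneme network to the state network; wherein each phoneme corresponds to a plurality of states.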

In one embodiment, the global optimal path includes a plurality of target states; the processor, when executing the computer program, further performs the steps of:

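combining the target states into a plurality of target phonemes according to the corresponding relationship between the target states and the target phonemes;

combining the plurality of target phonemes into a target word;

and constructing the voice instruction according to the target word.

In one embodiment, the path search algorithm comprises the Viterbi algorithm.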

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

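acquiring a user voice audio and identifying the current service state;

screening an executable instruction set from a preset instruction library based on the current service state;

constructing a state network based on the set of executable instructions;

searching, based on the user voice audio, a global optimal path from the state network by using a path search algorithm;

converting the global optimal path into a voice instruction;

and instructing the automatic teller machine to perform a corresponding action according to the voice instruction.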

In one embodiment, the computer program when executed by the processor further performs the steps of:

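converting the set of executable instructions to a word network;

converting the word network to a phoneme network;

converting the phoneme network to the state network; wherein each phoneme corresponds to a plurality of states.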

In one embodiment, the global optimal path includes a plurality of target states; the computer program when executed by the processor further performs the steps of:

combining the target states into a plurality of target phonemes according to the corresponding relationship between the target states and the target phonemes;

combining the plurality of target phonemes into a target word;

and constructing the voice instruction according to the target word.

In one embodiment, the path search algorithm comprises the Viterbi algorithm.

In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:

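acquiring a user voice audio and identifying the current service state;

screening an executable instruction set from a preset instruction library based on the current service state;

constructing a state network based on the set of executable instructions;

searching, based on the user voice audio, a global optimal path from the state network by using a path search algorithm;

converting the global optimal path into a voice instruction;

and instructing the automatic teller machine to perform a corresponding action according to the voice instruction.

In one embodiment, the computer program when executed by the processor further performs the steps of:

converting the set of executable instructions to a word network;

converting the word network to a phoneme network;

converting the phoneme network to the state network; wherein each phoneme corresponds to a plurality of states.

In one embodiment, the global optimal path includes a plurality of target states; the computer program when executed by the processor further performs the steps of:

combining the target states into a plurality of target phonemes according to the corresponding relationship between the target states and the target phonemes;

combining the plurality of target phonemes into a target word;

and constructing the voice instruction according to the target word.

In one embodiment, the path search algorithm comprises the Viterbi algorithm.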

It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of the technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims

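1. A speech recognition processing method based on ATM service logic, the method comprising:

acquiring a user voice audio and identifying the current service state;

screening an executable instruction set from a preset instruction library based on the current service state;

constructing a state network based on the set of executable instructions;

searching, based on the user voice audio, a global optimal path from the state network by using a path search algorithm;

converting the global optimal path into a voice instruction;

and instructing the automatic teller machine to perform a corresponding action according to the voice instruction.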

2. The method of claim 1, wherein constructing a state network based on the set of executable instructions comprises:

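converting the set of executable instructions to a word network;

converting the word network to a phoneme network;

converting the phoneme network to the state network; wherein each phoneme corresponds to a plurality of states.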

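3. The method of claim 1, wherein searching for a global optimal path from the state network using a path search algorithm based on the user voice audio comprises:

performing voice boundary detection on the user voice audio to obtain an effective voice audio;

performing framing processing on the effective voice audio to obtain a multi-frame voice signal;

obtaining a multi-dimensional vector corresponding to each frame of voice signal according to the characteristic parameters of each frame of voice signal;

and searching for the optimal path in the state network corresponding to the executable instruction set by using the path search algorithm based on the multi-dimensional vectors, to obtain the global optimal path.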

4. The method of claim 1, wherein the global optimal path comprises a plurality of target states; the converting the global optimal path into a voice instruction includes:

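combining the target states into a plurality of target phonemes according to the corresponding relationship between the target states and the target phonemes;

combining the plurality of target phonemes into a target word;

and constructing the voice instruction according to the target word.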

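5. The method of claim 1, wherein prior to the acquiring of the user voice audio, the method further comprises:

acquiring a user card through card detection equipment, and displaying a password input interface;

verifying the user password in response to the user inputting the user password on the password input interface;

verifying the card state of the user card if the password passes verification;

and outputting verification passing information if the card state passes verification.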

6. The method according to any of claims 1 to 5, wherein the path search algorithm comprises the Viterbi algorithm.

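7. A speech recognition processing apparatus based on ATM service logic, the apparatus comprising:

an audio acquisition module, configured to acquire a user voice audio and identify a current service state;

an instruction screening module, configured to screen an executable instruction set from a preset instruction library based on the current service state;

a state network construction module, configured to construct a state network based on the set of executable instructions;

an optimal path searching module, configured to search, based on the user voice audio, a global optimal path from the state network by using a path search algorithm;

a voice instruction conversion module, configured to convert the global optimal path into a voice instruction;

and an action execution module, configured to instruct the automatic teller machine to perform a corresponding action according to the voice instruction.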

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 6 when executed by a processor.