CN116863935B - Speech recognition method, device, electronic equipment and computer readable medium - Google Patents

Speech recognition method, device, electronic equipment and computer readable medium

Info

Publication number
CN116863935B
Authority
CN
China
Prior art keywords
information
question
voice
attribute
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311127447.0A
Other languages
Chinese (zh)
Other versions
CN116863935A (en)
Inventor
刘子正
蒋斌
廖福燕
王瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Kilakila Technology Co ltd
Original Assignee
Shenzhen Kilakila Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Kilakila Technology Co ltd filed Critical Shenzhen Kilakila Technology Co ltd
Priority to CN202311127447.0A
Publication of CN116863935A
Application granted
Publication of CN116863935B
Legal status: Active

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present disclosure disclose a speech recognition method, apparatus, electronic device, and computer-readable medium. One embodiment of the method comprises: randomly selecting a preset number of pieces of question information from a preset question information set, and executing the following processing steps: in response to determining that address information included in the user information is inconsistent with the address corresponding to the target user, selecting a target speech recognition model corresponding to the setting user from a speech recognition model group; inputting the voice included in each piece of the preset number of pieces of question information into the target speech recognition model to generate a speech recognition text; converting each speech recognition text in the resulting speech recognition text group into voice audio in a target format; and combining each piece of question information with its corresponding voice audio to form target question information. This embodiment improves the user's listening experience and increases user stickiness.

Description

Speech recognition method, device, electronic equipment and computer readable medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a speech recognition method, apparatus, electronic device, and computer-readable medium.
Background
Question-and-answer matchmaking is a distinctive social gameplay feature of certain applications: a user selects questions of interest from a question bank and sets the answers; other users can unlock a chat with that user by correctly answering a certain number of the questions, thereby matching with the other party. Currently, when a user sets questions and answers, the following approach is generally adopted: questions and answers are randomly selected from the question bank, and the accompanying voice is either configured by the system or recorded by the user.
However, the above approach generally suffers from the following technical problems:
first, voice configured by the system sounds stiff, which gives other users a poor listening experience;
second, when new questions are set, the types of the historical questions set by the user are not considered, so the relevance between newly set questions is low, which easily wastes users' answering time;
third, no sensitive-word detection is performed on voice configured by the user, and when that voice contains sensitive information, it hinders iteration of the application and makes the application prone to being banned.
The information disclosed in this background section is only for enhancement of understanding of the background of the inventive concept and, therefore, may contain information that does not form prior art already known in this country to those of ordinary skill in the art.
Disclosure of Invention
This part of the disclosure is intended to introduce concepts in a simplified form that are described in further detail in the detailed description below. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a speech recognition method, apparatus, electronic device, and computer-readable medium to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a speech recognition method, the method comprising: in response to determining that initial authentication of a target user passes, randomly selecting a preset number of pieces of question information from a preset question information set, and performing the following processing steps: determining user information of a setting user corresponding to a preset question-and-answer library; determining whether address information included in the user information is consistent with an address corresponding to the target user; in response to determining that the address information included in the user information is inconsistent with the address corresponding to the target user, selecting a speech recognition model corresponding to the setting user from a pre-trained speech recognition model group as a target speech recognition model; inputting the voice included in each piece of the preset number of pieces of question information into the target speech recognition model to generate a speech recognition text, thereby obtaining a speech recognition text group; converting each speech recognition text in the speech recognition text group into voice audio in a target format to obtain a voice audio group; combining each piece of question information with its corresponding voice audio to obtain a target question information group; and sending the target question information group to a user terminal of the target user.
In a second aspect, some embodiments of the present disclosure provide a speech recognition apparatus, the apparatus comprising: a speech recognition unit configured to, in response to determining that initial authentication of a target user passes, randomly select a preset number of pieces of question information from a preset question information set and perform the following processing steps: determining user information of a setting user corresponding to a preset question-and-answer library; determining whether address information included in the user information is consistent with an address corresponding to the target user; in response to determining that the address information included in the user information is inconsistent with the address corresponding to the target user, selecting a speech recognition model corresponding to the setting user from a pre-trained speech recognition model group as a target speech recognition model; inputting the voice included in each piece of the preset number of pieces of question information into the target speech recognition model to generate a speech recognition text, thereby obtaining a speech recognition text group; converting each speech recognition text in the speech recognition text group into voice audio in a target format to obtain a voice audio group; and combining each piece of question information with its corresponding voice audio to obtain a target question information group; and a sending unit configured to send the target question information group to a user terminal of the target user.
In a third aspect, some embodiments of the present disclosure provide an electronic device comprising: one or more processors; and a storage device having one or more programs stored thereon, which, when executed by the one or more processors, cause the one or more processors to implement the method described in any of the implementations of the first aspect above.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect above.
The above embodiments of the present disclosure have the following beneficial effects: the speech recognition method of some embodiments of the present disclosure improves the user's listening experience and increases user stickiness. Specifically, the reason other users have a poor listening experience is that system-configured voice sounds stiff. On this basis, the speech recognition method of some embodiments of the present disclosure, in response to determining that initial authentication of the target user passes, randomly selects a preset number of pieces of question information from a preset question information set and performs the following processing steps. First, the user information of the setting user corresponding to a preset question-and-answer library is determined, and it is determined whether address information included in the user information is consistent with the address corresponding to the target user. This makes it convenient to determine whether the setting user and the target user belong to the same region, and thus whether the target user can understand the audio recorded by the setting user. Next, in response to determining that the address information included in the user information is inconsistent with the address corresponding to the target user, a speech recognition model corresponding to the setting user is selected from a pre-trained speech recognition model group as the target speech recognition model. Thus, when the target user and the setting user do not belong to the same region, the audio recorded by the setting user can be recognized by the target speech recognition model. Then, the voice included in each piece of the preset number of pieces of question information is input into the target speech recognition model to generate a speech recognition text, yielding a speech recognition text group; in this way, the text corresponding to the voice included in each piece of question information is recognized. Each speech recognition text in the group is then converted into voice audio in a target format to obtain a voice audio group, so that audio the target user cannot understand is converted into standard audio whose meaning the user can follow. Next, the voice audio corresponding to each piece of question information is combined with that question information to obtain a target question information group; both standard audio and the setting user's own voice are thus present in the question information, so the target user can understand the meaning of the audio while also hearing the setting user's voice. Finally, the target question information group is sent to the user terminal of the target user. Through this dual playback mode of standard audio plus the setting user's own voice, the user's listening experience is improved and user stickiness is increased.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flow chart of some embodiments of a speech recognition method according to the present disclosure;
FIG. 2 is a schematic diagram of the structure of some embodiments of a speech recognition device according to the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "one" and "a plurality of" mentioned in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a flow chart of some embodiments of a speech recognition method according to the present disclosure. A flow 100 of some embodiments of a speech recognition method according to the present disclosure is shown. The voice recognition method comprises the following steps:
Step 101, in response to determining that the initial authentication of the target user passes, randomly selecting a preset number of question information from a preset question information set, and executing the following processing steps:
In some embodiments, an executing body (e.g., a computing device) of the above speech recognition method may, in response to determining that initial authentication of the target user passes, randomly select a preset number of pieces of question information from a preset question information set. Here, initial authentication of the target user may mean that the account of the target user has been authenticated. The question information set may be the question information previously set by the local setting user, i.e., the questions that user configured locally. A piece of question information may include a question, answers, and the setting user's voice. For example, the question of a piece of question information may be "What would you do if your partner posted a photo alone with someone of the opposite sex in their friends' circle?", and its answers may be "A: Not happy about it; ask them to delete it" and "B: Out of sight, out of mind". The setting user's voice may be a voice reading of the question, a voice reading of an answer, or some other voice.
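For illustration only, the structure of a piece of question information and the random selection step described above can be sketched as follows in Python; the class name, field names, and helper function are assumptions introduced here, not identifiers from the disclosure.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class QuestionInfo:
    """One piece of question information (field names are illustrative)."""
    question: str        # question text from the question bank
    answers: List[str]   # candidate answers, e.g. ["A: ...", "B: ..."]
    voice: bytes         # audio recorded by the setting user

def select_question_info(question_info_set: List[QuestionInfo],
                         preset_number: int) -> List[QuestionInfo]:
    """Randomly select a preset number of pieces of question information."""
    return random.sample(question_info_set,
                         k=min(preset_number, len(question_info_set)))
```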
Optionally, before randomly selecting the preset number of question information from the preset question information set, the method further includes:
First, a set of historical question reply information sequences for the setting user is acquired; for example, it may be acquired from a local database by means of a wired or wireless connection. Here, a historical question reply information sequence in the set may be a sequence of reply record information from question exchanges between the setting user and other users. The historical question reply information includes a set of actual question attribute information for historical dialogue questions. The actual question attribute information may be the question attribute information of a historical dialogue question (i.e., a question posed by the setting user for other users to answer). The question attributes may include the type of the question, the question itself, and the answer to the question; question types may represent, for example, emotion questions or life questions.
Second, for each of the history question reply information sequences in the history question reply information sequence set, the following processing steps are performed:
A first sub-step of performing vector conversion on each piece of historical question reply information in the historical question reply information sequence to generate a dialogue vector, thereby obtaining a dialogue vector sequence. A dialogue vector may characterize the dialogue feature information of the corresponding historical question reply information. Each piece of historical question reply information in the sequence may be input into a BERT (Bidirectional Encoder Representations from Transformers) model to generate a dialogue vector, resulting in the dialogue vector sequence.
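As a minimal sketch of this vector conversion, the Hugging Face transformers implementation of BERT can produce one dialogue vector per piece of historical question reply information; the checkpoint name and the use of the [CLS] embedding are assumptions, since the disclosure only names "a BERT model".

```python
import torch
from transformers import BertModel, BertTokenizer

# Checkpoint choice is an assumption; the disclosure only names "a BERT model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def to_dialogue_vector(reply_text: str) -> torch.Tensor:
    """Encode one piece of historical question reply information."""
    inputs = tokenizer(reply_text, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Use the [CLS] embedding as the dialogue vector (a common convention).
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

dialogue_vector_sequence = [to_dialogue_vector(text)
                            for text in ["你最近在忙什么？", "周末喜欢做什么？"]]
```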
A second sub-step of determining a question attribute group for the target dialogue question. The question attribute group may be preset; in practice, it may include a question attribute characterizing whether the user is interested.
A third sub-step of inputting the dialogue vector sequence into the pre-trained question attribute attention mechanism model to generate candidate question-and-answer information, corresponding to the question attribute group, for the historical question reply information sequence. The candidate question-and-answer information may be question-and-answer information generated for the question attribute of whether the user is interested.
Optionally, before the third sub-step, the following sub-steps are further included:
Sub-step 1: a first historical question reply information sequence corresponding to the setting user is acquired, wherein the first historical question reply information includes a set of actual question attribute information for historical dialogue questions. The first historical question reply information sequence may be a sequence of exchange record information from question exchanges between the setting user and other users. In practice, the actual question attribute information may be the question attribute information of a historical dialogue question (i.e., a question posed by the setting user for other users to answer), and the question attribute information may be an attribute value of a question attribute.
Sub-step 2: vector conversion is performed on each piece of first historical question reply information in the first historical question reply information sequence to generate a first dialogue vector, obtaining a first dialogue vector sequence. Each piece of first historical question reply information may be input into the BERT (Bidirectional Encoder Representations from Transformers) model to generate a first dialogue vector, resulting in the first dialogue vector sequence.
Sub-step 3: historical question reply information is selected from the first historical question reply information sequence as a target training sample, and the following training steps are executed:
first, a first dialogue vector corresponding to the target training sample is input to an initial attribute attention mechanism model to generate an attribute problem information set corresponding to a problem attribute set. The initial attribute attention mechanism model may be an attribute attention mechanism model that has not been trained to end. The attribute attention mechanism model may be a model that learns dialog context feature information. The attribute-aware mechanism model may be a transducer model. The set of question attributes may be question attributes related to a question that are preset for initial attribute attention mechanism model training. The attribute problem information in the attribute problem information set and the problem attribute in the problem attribute set have a one-to-one correspondence. The attribute problem information may be problem information for a corresponding problem attribute. For example, first, the first dialog vector may be input to an initial sub-attention layer included in the initial attribute attention mechanism model to output a dialog context vector. Wherein the initial sub-attention layer may be a sub-attention layer that has not been trained yet. The sub-attention layer may be a network layer including a multi-headed attention mechanism. The dialog context vector may characterize feature information of dialog context-associated features in the historical question reply information. The dialog context vector may then be input into an initial problem prediction model included in the initial attribute attention mechanism model to output an attribute problem information set. Wherein each attribute problem information includes: sub-problem information corresponding to at least one attribute problem category. The initial problem prediction model may be a problem prediction model that has not been trained. In practice, the problem prediction model may be a model that outputs attribute problem information. The problem prediction model may be a multi-layered serial connection of fully connected layers. The attribute issue category may be an issue category to which the attribute issue corresponds. The attribute problem categories may be: the problem category of the emotion class and the problem category of the life class of the user. The corresponding sub-problem information may be: emotion-based question information and life-based question information.
Second, attribute loss information is generated according to the attribute question information set and the corresponding actual question attribute information set.
First, for each piece of attribute question information in the attribute question information set, the following processing steps are performed: determining, as target sub-question information, the sub-question information of a preset attribute question category in the attribute question information; and generating question loss information with respect to the corresponding actual question attribute information according to that actual question attribute information and the target sub-question information.
Then, the attribute loss information is generated from the individual pieces of question loss information thus generated; for example, the average of the loss values corresponding to the respective pieces of question loss information may be determined as the attribute loss information.
Third, in response to determining that the attribute loss information satisfies a preset condition, the initial attribute attention mechanism model is determined to be the question attribute attention mechanism model. Here, the preset condition may mean that the loss value represented by the attribute loss information is less than or equal to a preset attribute loss value.
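A possible shape for this training step, under the assumptions that each per-attribute question loss is a cross-entropy term and that the preset condition is a simple threshold (the value here is illustrative), is:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # the per-question loss form is an assumption

def attribute_training_step(model, optimizer, first_dialogue_vector,
                            actual_attribute_labels,
                            preset_attribute_loss: float = 0.05) -> bool:
    """One training step: average the per-attribute question losses into the
    attribute loss and report whether the preset condition is met."""
    optimizer.zero_grad()
    attribute_outputs = model(first_dialogue_vector)  # one logits tensor per attribute
    question_losses = [criterion(out, labels)
                       for out, labels in zip(attribute_outputs,
                                              actual_attribute_labels)]
    attribute_loss = torch.stack(question_losses).mean()
    attribute_loss.backward()
    optimizer.step()
    # Preset condition: attribute loss at or below the preset attribute loss value.
    return attribute_loss.item() <= preset_attribute_loss
```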
Thirdly, constructing a question and answer information set according to the generated candidate question and answer information.
The above related content serves as an inventive point of the present disclosure and solves the second technical problem mentioned in the background section, namely that users' answering time is easily wasted. The factor that tends to waste answering time is as follows: when questions are set, the types of the historical questions set by the user are not considered, resulting in low relevance between the newly set questions. If this factor is addressed, the waste of answering time can be reduced. To achieve this effect, first, the set of historical question reply information sequences for the setting user is acquired, which makes it convenient to set new question-and-answer information according to the setting user's historical question replies. Next, for each historical question reply information sequence in the set, the following processing steps are performed: first, vector conversion is performed on each piece of historical question reply information in the sequence to generate a dialogue vector, resulting in a dialogue vector sequence; this vector conversion extracts the dialogue feature information of each piece of historical question reply information and facilitates subsequent input into the initial attribute attention mechanism model. Next, the question attribute group for the target dialogue question is determined. Then, the dialogue vector sequence is input into the pre-trained question attribute attention mechanism model to generate candidate question-and-answer information, corresponding to the question attribute group, for the historical question reply information sequence. Finally, a question-and-answer information set is constructed according to the generated candidate question-and-answer information. In this way, new questions can be set according to the setting user's historical question reply information sequences, ensuring relevance between the questions and thereby reducing the waste of users' answering time.
In practice, the third step may comprise the sub-steps of:
A first sub-step of executing the following processing steps for each piece of the generated candidate question-and-answer information:
first, a voice corresponding to the candidate answer information issued by the setting user is received. That is, the voice corresponding to the candidate question information issued by the setting user may be received through a headset or a speaker.
Next, the speech is input into the target speech recognition model to generate speech recognition text.
And then, inputting the voice recognition text into a pre-trained sensitive information recognition model to obtain a sensitive information recognition result. The sensitive information recognition result may indicate whether there is a sensitive word in the speech recognition text.
Then, in response to determining that the sensitive information recognition result indicates that no sensitive information is present, the candidate question-and-answer information and the voice are combined into a piece of question-and-answer information. Here, combining may refer to merging.
A second sub-step of taking the obtained pieces of question-and-answer information as the question-and-answer information set.
Optionally, before the voice recognition text is input into the pre-trained sensitive information recognition model to obtain the sensitive information recognition result, the method further includes:
First, a speech text training sample set is acquired. The speech text training sample set may be obtained from a terminal device by means of a wired or wireless connection. A speech text training sample may include a speech text and a corresponding sample label; the sample label may indicate whether a sensitive word is present in the speech text.
Second, a speech text training sample is randomly selected from the speech text training sample set.
Third, word segmentation is performed on the speech text included in the speech text training sample to obtain a segmented speech text. For example, the speech text may be segmented into words by a word segmentation tool (e.g., Stanford CoreNLP, HanLP) to obtain the segmented speech text.
Fourth, the segmented speech text is converted into at least one word sequence matched to a preset length. When the length of the speech text is greater than the preset length, the segmented speech text may be truncated to obtain word sequences of the preset length; when the length of the speech text is less than the preset length, the segmented speech text may be padded to obtain a word sequence of the preset length.
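A small helper performing the truncation and padding described in this step might look like the following; the pad token and the chunking of over-long texts are assumptions:

```python
def to_fixed_length_sequences(tokens: list[str], preset_length: int,
                              pad_token: str = "<pad>") -> list[list[str]]:
    """Cut an over-long segmented speech text into chunks of the preset
    length and pad the final (or only) chunk up to that length."""
    sequences = [tokens[i:i + preset_length]
                 for i in range(0, max(len(tokens), 1), preset_length)]
    sequences[-1] = sequences[-1] + [pad_token] * (preset_length - len(sequences[-1]))
    return sequences

# e.g. a 3-token text with preset length 5 yields one padded sequence:
print(to_fixed_length_sequences(["今天", "天气", "好"], 5))
# [['今天', '天气', '好', '<pad>', '<pad>']]
```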
Fifth, candidate speech sample data is generated based on the at least one word sequence. Some word sequences may be selected from the at least one word sequence as candidate speech sample data.
Sixth, the candidate speech sample data is input into an initial sensitive information recognition model to obtain a candidate speech sample data recognition result. Here, the initial sensitive information recognition model may be an untrained convolutional neural network model or an untrained bidirectional LSTM (Long Short-Term Memory) model.
Seventh, a recognition loss value between the candidate speech sample data recognition result and the sample label included in the speech text training sample is determined, e.g., through a preset loss function. The loss function may include, but is not limited to: the mean squared error (MSE) loss function, the hinge loss function, the cross-entropy loss function, and the like.
Eighth, in response to determining that the recognition loss value is less than or equal to a preset loss value, determining the initial sensitive information recognition model as a trained sensitive information recognition model. Here, the setting of the preset loss value is not limited.
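Putting the sixth through eighth steps together, one plausible sketch of the bidirectional-LSTM variant of the initial sensitive information recognition model and its threshold-based training loop is shown below; all sizes, the optimizer, and the choice of cross-entropy loss are assumptions.

```python
import torch
import torch.nn as nn

class SensitiveInfoRecognizer(nn.Module):
    """Bidirectional-LSTM text classifier: outputs logits over two classes
    (sensitive word present / absent). All sizes are illustrative."""
    def __init__(self, vocab_size: int = 30000, embed_dim: int = 128,
                 hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)            # (batch, seq, embed)
        _, (hidden, _) = self.bilstm(embedded)
        # Concatenate the final forward and backward hidden states.
        final = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        return self.classifier(final)

def train_until_threshold(model, batches, preset_loss=0.1, lr=1e-3):
    """Train until the recognition loss value is at or below the preset
    loss value; optimizer and threshold are assumptions."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for token_ids, labels in batches:
        optimizer.zero_grad()
        loss = criterion(model(token_ids), labels)
        loss.backward()
        optimizer.step()
        if loss.item() <= preset_loss:
            break
    return model
```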
The above related content likewise serves as an inventive point of the present disclosure and solves the third technical problem mentioned in the background section, namely that iteration of the application is hindered and the application is prone to being banned. The contributing factor is as follows: no sensitive-word detection is performed on the voice configured by the user, so when that voice contains sensitive information, iteration of the application is hindered. If this factor is addressed, iteration of the application is facilitated and the possibility of being banned is reduced. To achieve this, first, a speech text training sample set is acquired, and a speech text training sample is randomly selected from it; this facilitates training of the sensitive information recognition model. Then, word segmentation is performed on the speech text included in the training sample to obtain a segmented speech text, and the segmented speech text is converted into at least one word sequence matched to the preset length; this makes it convenient for the model to recognize the words in the text one by one. Next, candidate speech sample data is generated based on the at least one word sequence and input into the initial sensitive information recognition model to obtain a recognition result; this makes it convenient to determine whether a sensitive word is present in the sample. Finally, the recognition loss value between the recognition result and the sample label included in the training sample is determined, and in response to determining that the recognition loss value is less than or equal to the preset loss value, the initial sensitive information recognition model is determined to be the trained sensitive information recognition model. The trained model can then perform sensitive-word recognition on the text corresponding to a voice, so that voice containing sensitive words is prevented from flowing out, which facilitates iteration of the application and reduces the possibility of being banned.
Optionally, a user audio data sample set corresponding to the set user is obtained.
In some embodiments, the executing body may obtain a user audio data sample set corresponding to the setting user. A user audio data sample in the set includes user audio data and a sample label, where the sample label may represent the text corresponding to the user audio data, and the user audio data may be audio recorded by the setting user.
Optionally, a user audio data sample is selected from the set of user audio data samples.
In some embodiments, the executing entity may randomly select one user audio data sample from the set of user audio data samples.
Optionally, inputting the user audio data included in the user audio data sample into an initial speech recognition model to obtain a user audio data sample recognition result.
In some embodiments, the executing body may input the user audio data included in the user audio data sample into an initial speech recognition model, so as to obtain a recognition result of the user audio data sample. The initial speech recognition model may be an untrained deep neural network-hidden Markov model (DNN-HMM). The user audio data sample recognition result may represent recognized audio text.
Optionally, a loss value between the user audio data sample identification result and a sample label corresponding to the user audio data sample is determined.
In some embodiments, the executing body may determine, through a preset loss function, a loss value between the user audio data sample recognition result and the sample label corresponding to the user audio data sample. The loss function may include, but is not limited to: the mean squared error (MSE) loss function, the hinge loss function, the cross-entropy loss function, and the like.
Optionally, in response to determining that the loss value is less than or equal to a preset threshold, determining the initial speech recognition model as the target speech recognition model.
In some embodiments, the executing entity may determine the initial speech recognition model as the target speech recognition model in response to determining that the loss value is equal to or less than a preset threshold.
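The optional steps above (sample selection, recognition, loss check against a preset threshold) can be summarized as a loop like the following sketch; the parameter update is deliberately omitted, and the iteration cap, threshold, and the `compute_loss` callable are assumptions.

```python
import random

def train_target_speech_model(initial_model, sample_set, compute_loss,
                              preset_threshold=0.2, max_iterations=10000):
    """Return the target speech recognition model once the loss between the
    recognition result and the sample label is at or below the preset
    threshold (threshold and iteration cap are illustrative)."""
    for _ in range(max_iterations):
        user_audio_data, sample_label = random.choice(sample_set)
        recognition_result = initial_model(user_audio_data)  # recognized audio text
        loss_value = compute_loss(recognition_result, sample_label)
        if loss_value <= preset_threshold:
            return initial_model  # adopted as the target speech recognition model
        # In a full implementation the model parameters would be updated here.
    return initial_model
```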
Step 1011, determining user information of the set user corresponding to the preset question bank.
In some embodiments, the executing body may determine the user information of the setting user corresponding to the preset question-and-answer library; that is, the user information of the setting user who set the question bank may be determined. The user information may include the usual address information of the setting user.
Step 1012, determining whether the address information included in the user information is consistent with the address corresponding to the target user.
In some embodiments, the execution body may determine whether address information included in the user information is consistent with an address corresponding to the target user. The address corresponding to the target user may refer to the usual address information of the target user. That is, it may be determined whether or not the address information included in the user information is identical to the address corresponding to the target user.
In step 1013, in response to determining that the address information included in the user information does not match the address corresponding to the target user, a speech recognition model corresponding to the set user is selected from a pre-trained speech recognition model group as a target speech recognition model.
In some embodiments, the executing entity may select, as the target speech recognition model, a speech recognition model corresponding to the set user from a pre-trained speech recognition model group in response to determining that address information included in the user information is inconsistent with an address corresponding to the target user. Wherein the set of speech recognition models may be pre-trained speech recognition models that recognize different accents. For example, the set of speech recognition models may include speech recognition models that recognize the A-region accent, and may include speech recognition models that recognize the B-region accent. For example, the speech recognition models in the speech recognition model set may be pre-trained convolutional neural network models with speech as input and speech recognition text as output. That is, a speech recognition model corresponding to the address area of the set user may be selected from a pre-trained speech recognition model group as the target speech recognition model.
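A minimal sketch of this per-region model selection, with hypothetical region keys and model handles (the disclosure only states that each model in the group recognizes a different regional accent), might be:

```python
# Hypothetical region keys and model handles.
speech_recognition_model_group = {
    "region_a": "asr_model_region_a",  # recognizes the A-region accent
    "region_b": "asr_model_region_b",  # recognizes the B-region accent
}

def select_target_speech_model(setting_user_region: str) -> str:
    """Select the model matching the setting user's address area, falling back
    to a default handle (the fallback is an assumption) when none matches."""
    return speech_recognition_model_group.get(setting_user_region,
                                              "asr_model_default")
```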
Step 1014, inputting the voice included in each question information in the preset number of question information into the target voice recognition model to generate a voice recognition text, so as to obtain a voice recognition text group.
In some embodiments, the execution body may input a voice included in each question information of the preset number of question information into the target voice recognition model to generate a voice recognition text, so as to obtain a voice recognition text group. The voice recognition text may refer to text corresponding to a voice included in the recognized question information.
Step 1015, converting each speech recognition text in the speech recognition text set to speech audio in a target format, thereby obtaining a speech audio set.
In some embodiments, the executing body may convert each speech recognition text in the speech recognition text group into voice audio in a target format to obtain a voice audio group. The target format may refer to a Mandarin format; that is, the speech recognition text may be converted into voice audio in Mandarin. Here the conversion may be performed by a text-to-audio converter.
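As one possible text-to-audio converter, the off-the-shelf pyttsx3 library can render a speech recognition text to an audio file; the library choice is an assumption (the disclosure only says "a text-to-audio converter"), and whether a Mandarin voice is available depends on the voices installed on the host system.

```python
import pyttsx3  # one possible offline text-to-speech engine (an assumption)

def to_target_format_audio(speech_recognition_text: str, out_path: str) -> str:
    """Convert one speech recognition text into voice audio in the target
    format and save it to out_path."""
    engine = pyttsx3.init()
    # Voice selection is omitted; a Mandarin voice is used only if one is
    # installed on the host system.
    engine.save_to_file(speech_recognition_text, out_path)
    engine.runAndWait()
    return out_path

voice_audio_group = [to_target_format_audio(text, f"audio_{i}.wav")
                     for i, text in enumerate(["你好", "今天天气怎么样"])]
```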
Step 1016, combining the voice audio of each question information in the preset number of question information and the corresponding question information into target question information, and obtaining a target question information group.
In some embodiments, the execution body may combine the voice audio of each question information in the preset number of question information and the corresponding question information as target question information, to obtain a target question information group. Combining may be referred to as merging.
And 102, transmitting the target question answering information group to a user terminal of the target user.
In some embodiments, the executing body may send the target answer information set to a user terminal of the target user, so as to facilitate the target user to answer.
With further reference to fig. 2, as an implementation of the method shown in the above figures, the present disclosure provides some embodiments of a speech recognition apparatus, which correspond to those method embodiments shown in fig. 1, and which are particularly applicable in various electronic devices.
As shown in fig. 2, the voice recognition apparatus 200 of some embodiments includes: a speech recognition unit 201 and a transmission unit 202. Wherein the voice recognition unit 201 is configured to, in response to determining that the target user is initially authenticated, randomly select a preset number of question information from a preset set of question information, perform the following processing steps: determining user information of a set user corresponding to a preset question and answer library; determining whether address information included in the user information is consistent with an address corresponding to the target user; in response to determining that address information included in the user information is inconsistent with an address corresponding to the target user, selecting a voice recognition model corresponding to the set user from a pre-trained voice recognition model group as a target voice recognition model; inputting the voice included in each question information in the preset number of question information into the target voice recognition model to generate a voice recognition text, so as to obtain a voice recognition text group; converting each voice recognition text in the voice recognition text group into voice audio in a target format to obtain a voice audio group; combining each question information in the preset number of question information and voice audio corresponding to the question information to obtain target question information groups; and a transmitting unit 202 configured to transmit the target question information set to a user terminal of the target user.
It will be appreciated that the elements recited in the speech recognition device 200 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations, features and advantages described above with respect to the method are equally applicable to the speech recognition device 200 and the units contained therein, and are not described here again.
Referring now to FIG. 3, a schematic diagram of an electronic device (e.g., computing device) 300 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic devices in some embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, as well as stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 3 is merely an example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage device 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic device 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 3 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 309, or from storage device 308, or from ROM 302. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that, the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to determining that the target user initial authentication passes, randomly selecting a preset number of question information from a preset question information set, performing the following processing steps: determining user information of a set user corresponding to the question bank; determining whether address information included in the user information is consistent with an address corresponding to the target user; in response to determining that address information included in the user information is inconsistent with an address corresponding to the target user, selecting a voice recognition model corresponding to the set user from a pre-trained voice recognition model group as a target voice recognition model; inputting the voice included in each question information in the preset number of question information into the target voice recognition model to generate a voice recognition text, so as to obtain a voice recognition text group; converting each voice recognition text in the voice recognition text group into voice audio in a target format to obtain a voice audio group; combining each question information in the preset number of question information and voice audio corresponding to the question information to obtain target question information groups; and sending the target question answering information group to a user terminal of the target user.
Computer program code for carrying out operations for some embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, for example, described as: a processor comprising: a voice recognition unit and a transmission unit. The names of these units do not constitute a limitation on the unit itself in some cases, and for example, the transmitting unit may also be described as "a unit that transmits the above-described target question-answer information group to the user terminal of the above-described target user".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
The foregoing description is merely of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combinations of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (5)

1. A method of speech recognition, comprising:
acquiring a set of historical question reply information sequences for a set user;
for each historical question reply information sequence in the set of historical question reply information sequences, performing the following processing steps:
performing vector conversion on each piece of historical question reply information in the historical question reply information sequence to generate a dialogue vector, thereby obtaining a dialogue vector sequence;
determining a question attribute group for a target dialogue question;
acquiring a first historical question reply information sequence corresponding to the set user, wherein each piece of first historical question reply information comprises: an actual question attribute information set for a historical dialogue question;
performing vector conversion on each piece of first historical question reply information in the first historical question reply information sequence to generate a first dialogue vector, thereby obtaining a first dialogue vector sequence;
selecting a piece of historical question reply information from the first historical question reply information sequence as a target training sample, and executing the following training steps (see the first sketch following this claim):
inputting the first dialogue vector corresponding to the target training sample into an initial attribute attention mechanism model to generate an attribute question information set corresponding to the question attribute group;
generating attribute loss information according to the attribute question information set and the corresponding actual question attribute information set;
determining the initial attribute attention mechanism model as a question attribute attention mechanism model in response to determining that the attribute loss information meets a preset condition;
inputting the dialogue vector sequence into the pre-trained question attribute attention mechanism model to generate candidate question-answer information which is specific to the historical question reply information sequence and corresponds to the question attribute group;
constructing a question-answer information set according to each piece of generated candidate question-answer information;
in response to determining that initial authentication of a target user passes, randomly selecting a preset number of pieces of question information from a preset question information set, and performing the following processing steps:
determining user information of the set user corresponding to a preset question-answer library;
determining whether address information included in the user information is consistent with an address corresponding to the target user;
in response to determining that the address information included in the user information is inconsistent with the address corresponding to the target user, selecting a voice recognition model corresponding to the set user from a pre-trained voice recognition model group as a target voice recognition model;
inputting the voice included in each piece of question information among the preset number of pieces of question information into the target voice recognition model to generate a voice recognition text, thereby obtaining a voice recognition text group;
converting each voice recognition text in the voice recognition text group into voice audio in a target format to obtain a voice audio group;
combining each piece of question information among the preset number of pieces of question information and the voice audio corresponding to the question information into target question-answer information to obtain a target question-answer information group;
sending the target question-answer information group to a user terminal of the target user;
wherein the constructing a question-answer information set according to each piece of generated candidate question-answer information comprises:
for each piece of candidate question-answer information, performing the following processing steps:
receiving a voice, sent by the set user, corresponding to the candidate question-answer information;
inputting the voice into the target voice recognition model to generate a voice recognition text;
acquiring a voice text training sample set;
randomly selecting a voice text training sample from the voice text training sample set (see the second sketch following this claim);
performing word segmentation processing on the voice text included in the voice text training sample to obtain a segmented voice text;
converting the segmented voice text into at least one word sequence matching a preset length;
generating candidate speech sample data based on the at least one word sequence;
inputting the candidate speech sample data into an initial sensitive information recognition model to obtain a candidate speech sample data recognition result;
determining a recognition loss value between the candidate speech sample data recognition result and a sample label included in the voice text training sample;
in response to determining that the recognition loss value is less than or equal to a preset loss value, determining the initial sensitive information recognition model as a trained sensitive information recognition model;
inputting the voice recognition text into the pre-trained sensitive information recognition model to obtain a sensitive information recognition result;
in response to determining that the sensitive information recognition result represents that no sensitive information is present, combining the candidate question-answer information with the voice to obtain question-answer information;
determining each piece of obtained question-answer information as the question-answer information set;
wherein the generating attribute loss information according to the attribute question information set and the corresponding actual question attribute information set comprises:
for each piece of attribute question information in the attribute question information set, performing the following processing steps:
determining sub-question information of a preset attribute question category in the attribute question information as target sub-question information;
generating question loss information for the corresponding actual question attribute information according to the actual question attribute information corresponding to the attribute question information and the target sub-question information;
generating the attribute loss information from each piece of generated question loss information.
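The first sketch below gives one concrete reading of the training steps in this claim, written in Python with PyTorch. The claim fixes neither the attention architecture, the form of the loss, nor the preset condition; the multi-head attention layer, per-attribute cross-entropy, summed attribute loss, and the loss-threshold stopping condition used here are all assumptions.

import torch
import torch.nn as nn

class InitialAttributeAttentionModel(nn.Module):
    def __init__(self, dim, num_attributes, num_classes):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        # One classification head per question attribute in the question attribute group.
        self.heads = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_attributes)]
        )

    def forward(self, dialogue_vectors):
        # dialogue_vectors: (batch, seq_len, dim) first dialogue vectors.
        attended, _ = self.attn(dialogue_vectors, dialogue_vectors, dialogue_vectors)
        pooled = attended.mean(dim=1)  # (batch, dim)
        # "Attribute question information set": one prediction per attribute.
        return [head(pooled) for head in self.heads]

def attribute_loss(predictions, actual_attribute_labels):
    # "Question loss information" per attribute, summed into the attribute loss.
    ce = nn.CrossEntropyLoss()
    per_attribute = [ce(p, t) for p, t in zip(predictions, actual_attribute_labels)]
    return torch.stack(per_attribute).sum()

def train_until_condition(model, samples, labels, threshold=0.1, max_steps=1000):
    # Train on the target training samples until the attribute loss information
    # meets the preset condition (here assumed to be a loss threshold).
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_steps):
        loss = attribute_loss(model(samples), labels)
        if loss.item() <= threshold:
            break  # now the "question attribute attention mechanism model"
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

model = InitialAttributeAttentionModel(dim=64, num_attributes=3, num_classes=4)
vectors = torch.randn(8, 10, 64)  # 8 samples, 10 dialogue turns, dim 64
labels = [torch.randint(0, 4, (8,)) for _ in range(3)]
trained = train_until_condition(model, vectors, labels)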
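The second sketch covers the sample-preparation steps for the sensitive information recognition model: randomly selecting a voice text training sample, word segmentation, cutting the result into word sequences of a preset length, and packaging them as candidate speech sample data. The claim names no segmenter and no preset length; jieba and a length of 16 are used here purely as examples, and padding the last sequence is likewise an assumption.

import random
import jieba  # one common Chinese word segmenter; pip install jieba

PRESET_LENGTH = 16  # assumed value of the "preset length"

def build_candidate_sample(training_sample_set):
    # Randomly select one voice text training sample, as the claim requires.
    sample = random.choice(training_sample_set)
    # Word segmentation processing -> "segmented voice text".
    words = jieba.lcut(sample["voice_text"])
    # At least one word sequence matching the preset length (tail padded).
    sequences = [
        words[i:i + PRESET_LENGTH] for i in range(0, len(words), PRESET_LENGTH)
    ]
    if sequences and len(sequences[-1]) < PRESET_LENGTH:
        sequences[-1] += ["<pad>"] * (PRESET_LENGTH - len(sequences[-1]))
    # "Candidate speech sample data", paired with the sample label that the
    # recognition loss value is later computed against.
    return {"sequences": sequences, "label": sample["label"]}

candidate = build_candidate_sample(
    [{"voice_text": "这是一个示例语音文本", "label": 0}]
)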
2. The method of claim 1, wherein before the selecting a voice recognition model corresponding to the set user from the pre-trained voice recognition model group as the target voice recognition model, the method further comprises (sketched below):
acquiring a user audio data sample set corresponding to the set user;
selecting a user audio data sample from the user audio data sample set;
inputting the user audio data included in the user audio data sample into an initial voice recognition model to obtain a user audio data sample recognition result;
determining a loss value between the user audio data sample recognition result and a sample label corresponding to the user audio data sample;
determining the initial voice recognition model as the target voice recognition model in response to determining that the loss value is less than or equal to a preset threshold.
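A condensed sketch of this claim's acceptance check: one user audio data sample is run through the initial voice recognition model, a loss against the sample label is computed, and the initial model is adopted as the target model once the loss is at or below the preset threshold. The claim does not specify the loss; the CTC loss and the assumed (T, N=1, C) model output shape below are illustrative choices only.

import torch
import torch.nn as nn

def accept_if_converged(model, sample, preset_threshold=0.5):
    # sample = {"audio": Tensor, "label": LongTensor of token ids, all > 0
    # since id 0 is reserved for the CTC blank in this sketch}.
    ctc = nn.CTCLoss(blank=0)
    log_probs = model(sample["audio"]).log_softmax(dim=-1)  # assumed (T, 1, C)
    input_lengths = torch.tensor([log_probs.size(0)])
    target_lengths = torch.tensor([sample["label"].numel()])
    loss = ctc(log_probs, sample["label"].unsqueeze(0), input_lengths, target_lengths)
    # At or below the preset threshold, the initial model becomes the target model.
    return model if loss.item() <= preset_threshold else None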
3. A speech recognition apparatus comprising:
an acquisition unit configured to acquire a set of historical question reply information sequences for a set user;
an input unit configured to: for each historical question reply information sequence in the set of historical question reply information sequences, perform the following processing steps:
performing vector conversion on each piece of historical question reply information in the historical question reply information sequence to generate a dialogue vector, thereby obtaining a dialogue vector sequence;
determining a question attribute group for a target dialogue question;
acquiring a first historical question reply information sequence corresponding to the set user, wherein each piece of first historical question reply information comprises: an actual question attribute information set for a historical dialogue question;
performing vector conversion on each piece of first historical question reply information in the first historical question reply information sequence to generate a first dialogue vector, thereby obtaining a first dialogue vector sequence;
selecting a piece of historical question reply information from the first historical question reply information sequence as a target training sample, and executing the following training steps:
inputting the first dialogue vector corresponding to the target training sample into an initial attribute attention mechanism model to generate an attribute question information set corresponding to the question attribute group;
generating attribute loss information according to the attribute question information set and the corresponding actual question attribute information set;
determining the initial attribute attention mechanism model as a question attribute attention mechanism model in response to determining that the attribute loss information meets a preset condition;
inputting the dialogue vector sequence into the pre-trained question attribute attention mechanism model to generate candidate question-answer information which is specific to the historical question reply information sequence and corresponds to the question attribute group;
wherein the generating attribute loss information according to the attribute question information set and the corresponding actual question attribute information set comprises:
for each piece of attribute question information in the attribute question information set, performing the following processing steps:
determining sub-question information of a preset attribute question category in the attribute question information as target sub-question information;
generating question loss information for the corresponding actual question attribute information according to the actual question attribute information corresponding to the attribute question information and the target sub-question information;
generating the attribute loss information from each piece of generated question loss information;
wherein the constructing a question-answer information set according to each piece of generated candidate question-answer information comprises:
for each piece of candidate question-answer information, performing the following processing steps:
receiving a voice, sent by the set user, corresponding to the candidate question-answer information;
inputting the voice into a target voice recognition model to generate a voice recognition text;
acquiring a voice text training sample set;
randomly selecting a voice text training sample from the voice text training sample set;
performing word segmentation processing on the voice text included in the voice text training sample to obtain a segmented voice text;
converting the segmented voice text into at least one word sequence matching a preset length;
generating candidate speech sample data based on the at least one word sequence;
inputting the candidate speech sample data into an initial sensitive information recognition model to obtain a candidate speech sample data recognition result;
determining a recognition loss value between the candidate speech sample data recognition result and a sample label included in the voice text training sample;
in response to determining that the recognition loss value is less than or equal to a preset loss value, determining the initial sensitive information recognition model as a trained sensitive information recognition model;
inputting the voice recognition text into the pre-trained sensitive information recognition model to obtain a sensitive information recognition result;
in response to determining that the sensitive information recognition result represents that no sensitive information is present, combining the candidate question-answer information with the voice to obtain question-answer information;
determining each piece of obtained question-answer information as the question-answer information set;
a construction unit configured to construct the question-answer information set according to each piece of generated candidate question-answer information;
a voice recognition unit configured to: in response to determining that initial authentication of a target user passes, randomly select a preset number of pieces of question information from a preset question information set, and perform the following processing steps (see the sketch following this claim): determining user information of the set user corresponding to a preset question-answer library; determining whether address information included in the user information is consistent with an address corresponding to the target user; in response to determining that the address information included in the user information is inconsistent with the address corresponding to the target user, selecting a voice recognition model corresponding to the set user from a pre-trained voice recognition model group as the target voice recognition model; inputting the voice included in each piece of question information among the preset number of pieces of question information into the target voice recognition model to generate a voice recognition text, thereby obtaining a voice recognition text group; converting each voice recognition text in the voice recognition text group into voice audio in a target format to obtain a voice audio group; and combining each piece of question information among the preset number of pieces of question information and the voice audio corresponding to the question information into target question-answer information to obtain a target question-answer information group;
a sending unit configured to send the target question-answer information group to a user terminal of the target user.
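To tie the units together, here is an end-to-end Python sketch of the voice recognition unit's processing steps and the hand-off to the sending unit. Every interface in it (the tts stub, the transcribe method, the dictionary layouts, the "wav" target format) is an assumed stand-in; the claim dictates only the control flow.

import random

def tts(text, fmt="wav"):
    # Stub: a real system would synthesize voice audio in the target format.
    return f"<{fmt}:{text}>".encode()

class StubAsrModel:
    def transcribe(self, voice):
        # Stub: a real model would decode the question's voice into text.
        return voice.decode(errors="ignore")

def handle_target_user(target_user, question_set, qa_library, model_group,
                       send, preset_number=3):
    if not target_user["authenticated"]:  # initial authentication gate
        return
    questions = random.sample(question_set, preset_number)
    set_user = qa_library["set_user_info"]
    # Only when the two addresses differ is the set user's model selected.
    if set_user["address"] != target_user["address"]:
        asr = model_group[set_user["user_id"]]
        texts = [asr.transcribe(q["voice"]) for q in questions]  # recognition
        audio = [tts(t) for t in texts]                          # target-format audio
        group = [{"question": q, "voice_audio": a}
                 for q, a in zip(questions, audio)]              # combination
        send(group)                                              # sending unit

handle_target_user(
    {"authenticated": True, "address": "A"},
    [{"voice": b"q1"}, {"voice": b"q2"}, {"voice": b"q3"}],
    {"set_user_info": {"address": "B", "user_id": "u1"}},
    {"u1": StubAsrModel()},
    send=print,
)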
4. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-2.
5. A computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-2.
CN202311127447.0A 2023-09-04 2023-09-04 Speech recognition method, device, electronic equipment and computer readable medium Active CN116863935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311127447.0A CN116863935B (en) 2023-09-04 2023-09-04 Speech recognition method, device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN116863935A (en) 2023-10-10
CN116863935B (en) 2023-11-24

Family

ID=88221942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311127447.0A Active CN116863935B (en) 2023-09-04 2023-09-04 Speech recognition method, device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN116863935B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932919B (en) * 2023-09-15 2023-11-24 中关村科学城城市大脑股份有限公司 Information pushing method, device, electronic equipment and computer readable medium
CN117609462A (en) * 2023-11-29 2024-02-27 广州方舟信息科技有限公司 Medicine question-answering method, medicine question-answering robot, electronic device and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103516915A (en) * 2012-06-27 2014-01-15 百度在线网络技术(北京)有限公司 Method, system and device for replacing sensitive words in call process of mobile terminal
KR20140122100A (en) * 2013-04-09 2014-10-17 얄리주식회사 Question answering method using speech recognition by radio wire communication and portable apparatus thereof
CN105260160A (en) * 2015-09-25 2016-01-20 百度在线网络技术(北京)有限公司 Voice information output method and apparatus
CN110162613A (en) * 2019-05-27 2019-08-23 腾讯科技(深圳)有限公司 A kind of problem generation method, device, equipment and storage medium
CN110263131A (en) * 2019-03-05 2019-09-20 腾讯科技(深圳)有限公司 Return information generation method, device and storage medium
CN110442675A (en) * 2019-06-27 2019-11-12 平安科技(深圳)有限公司 Question and answer matching treatment, model training method, device, equipment and storage medium
CN111737499A (en) * 2020-07-27 2020-10-02 平安国际智慧城市科技股份有限公司 Data searching method based on natural language processing and related equipment
CN111949761A (en) * 2020-07-06 2020-11-17 合肥工业大学 Dialogue question generation method and system considering emotion and theme, and storage medium
CN112579757A (en) * 2020-12-25 2021-03-30 泰康保险集团股份有限公司 Intelligent question and answer method and device, computer readable storage medium and electronic equipment
CN113971243A (en) * 2021-10-12 2022-01-25 上海众言网络科技有限公司 Data processing method, system, equipment and storage medium applied to questionnaire survey
CN114416949A (en) * 2022-01-19 2022-04-29 北京京东尚科信息技术有限公司 Dialogue generation model training method, dialogue reply generation method, dialogue generation device, dialogue reply generation medium
CN115129829A (en) * 2021-03-26 2022-09-30 阿里巴巴新加坡控股有限公司 Question-answer calculation method, server and storage medium
CN115905458A (en) * 2021-09-30 2023-04-04 四川大学 Event extraction method based on machine reading understanding model

Also Published As

Publication number Publication date
CN116863935A (en) 2023-10-10

Similar Documents

Publication Publication Date Title
CN110516059B (en) Question answering method based on machine learning, question answering model training method and question answering model training device
CN116863935B (en) Speech recognition method, device, electronic equipment and computer readable medium
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
JP7488871B2 (en) Dialogue recommendation method, device, electronic device, storage medium, and computer program
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
CN113094481A (en) Intention recognition method and device, electronic equipment and computer readable storage medium
CN111602133A (en) Compression of word embedding for natural language processing systems
JP2023550211A (en) Method and apparatus for generating text
WO2021068493A1 (en) Method and apparatus for processing information
CN115908640A (en) Method and device for generating image, readable medium and electronic equipment
CN112883967A (en) Image character recognition method, device, medium and electronic equipment
WO2022188534A1 (en) Information pushing method and apparatus
CN115270717A (en) Method, device, equipment and medium for detecting vertical position
CN110379406A (en) Voice remark conversion method, system, medium and electronic equipment
CN111681661B (en) Speech recognition method, apparatus, electronic device and computer readable medium
CN112242143B (en) Voice interaction method and device, terminal equipment and storage medium
CN115129845A (en) Text information processing method and device and electronic equipment
CN111754984B (en) Text selection method, apparatus, device and computer readable medium
CN110852043B (en) Text transcription method, device, equipment and storage medium
CN111914535B (en) Word recognition method and device, computer equipment and storage medium
CN114333772A (en) Speech recognition method, device, equipment, readable storage medium and product
CN113986958A (en) Text information conversion method and device, readable medium and electronic equipment
CN111859902A (en) Text processing method, device, equipment and medium
CN112017685A (en) Voice generation method, device, equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant