CN117995172A - Speech recognition method and device, electronic equipment and computer readable storage medium

Info

Publication number: CN117995172A
Application number: CN202410138775.9A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 全刚, 王佳, 李奇龙
Assignee: Jingdong City Beijing Digital Technology Co Ltd; Jingdong Technology Information Technology Co Ltd
Legal status: Pending
Prior art keywords: voice, sentence, speech, recognized, processing

Abstract

The disclosure provides a speech recognition method and apparatus, an electronic device, and a computer-readable storage medium, which can be applied to the fields of computer technology and speech processing. The method comprises the following steps: in response to receiving a speech recognition request, processing the speech to be recognized in the request to obtain a first speech segment corresponding to a first speech channel and a second speech segment corresponding to a second speech channel; processing the first speech segment and the second speech segment respectively to obtain at least one first speech sentence and at least one second speech sentence, where each first speech sentence and each second speech sentence corresponds to a speech channel identifier, a start time and an end time; sorting the at least one first speech sentence and the at least one second speech sentence by start time to obtain an initial ranking result; and processing the initial ranking result according to the speech channel identifiers and end times to obtain a speech recognition result.

Description

Speech recognition method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technology and speech processing technology, and more particularly, to a speech recognition method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
With the development of computer technology, speech recognition technology has emerged. Speech recognition technology may refer to the technology by which a machine converts a speech signal into corresponding text or commands through a process of recognition and understanding. For example, speech recognition techniques may be applied to quality inspection.
Quality inspection may refer to the process of performing speech recognition on telephone recording files generated when customer service agents answer user calls, and detecting quality inspection items based on the speech recognition results, so as to determine whether the agents' call handling meets the requirements. However, because voices from different channels may overlap during quality inspection, the recognized text order can be inconsistent with the actual speech order.
In the process of implementing the disclosed concept, the inventors found that the related art has at least the following problem: when the voices of different channels overlap, the speech recognition result is difficult to display clearly, resulting in a poor user experience.
Disclosure of Invention
In view of this, the present disclosure provides a speech recognition method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to one aspect of the present disclosure, there is provided a voice recognition method including: responding to a received voice recognition request, and processing voice to be recognized in the voice recognition request to obtain a first voice segment corresponding to a first voice channel and a second voice segment corresponding to a second voice channel; processing the first voice segment and the second voice segment respectively to obtain at least one first voice sentence and at least one second voice sentence, wherein each first voice sentence and each second voice sentence respectively correspond to a voice channel identifier, a starting time and an ending time; sorting the at least one first voice sentence and the at least one second voice sentence according to the starting time to obtain an initial sorting result; and processing the initial sorting result according to the voice channel identifier and the ending time to obtain a voice recognition result.
According to an embodiment of the present disclosure, the sorting the at least one first voice sentence and the at least one second voice sentence according to the starting time to obtain an initial sorting result includes: sorting the at least one first voice sentence and the at least one second voice sentence according to the respective starting time of each first voice sentence and each second voice sentence to obtain the initial sorting result.
According to an embodiment of the present disclosure, the initial sorting result includes at least one initial voice sentence arranged by start time and a voice channel identifier corresponding to each initial voice sentence.
According to an embodiment of the present disclosure, the processing the initial sorting result according to the voice channel identifier and the ending time to obtain a voice recognition result includes: determining a p-th voice sentence in the at least one first voice sentence, wherein p is a positive integer; determining a (p+1)-th voice sentence in the at least one second voice sentence; carrying out voice channel identifier detection on the p-th voice sentence and the (p+1)-th voice sentence to obtain a voice channel identifier detection result; in response to the voice channel identifier detection result characterizing that the p-th voice sentence and the (p+1)-th voice sentence have the same voice channel identifier, carrying out voice sentence time detection on the p-th voice sentence and the (p+1)-th voice sentence to obtain a voice sentence time detection result; and in response to the voice sentence time detection result characterizing that the sentence end time of the p-th voice sentence matches the sentence start time of the (p+1)-th voice sentence, combining the p-th voice sentence and the (p+1)-th voice sentence to obtain a combined p-th voice sentence.
According to an embodiment of the present disclosure, the speech recognition result includes at least one speech sentence and a speech channel identifier corresponding to each of the speech sentences.
According to an embodiment of the present disclosure, after the initial sorting result is processed according to the voice channel identifier and the ending time to obtain the voice recognition result, the method further includes: sequentially determining a q-th voice sentence in the at least one voice sentence, wherein q is a positive integer; in response to the voice channel identifier corresponding to the q-th voice sentence indicating that it belongs to the first voice channel, displaying the q-th voice sentence in a first target area of a target page; and in response to the voice channel identifier corresponding to the q-th voice sentence indicating that it belongs to the second voice channel, displaying the q-th voice sentence in a second target area of the target page.
According to an embodiment of the present disclosure, the speech to be recognized includes at least two speech channels.
According to an embodiment of the present disclosure, the processing, in response to receiving a speech recognition request, speech to be recognized in the speech recognition request to obtain a first speech segment corresponding to a first speech channel and a second speech segment corresponding to a second speech channel includes: in response to receiving the voice recognition request, carrying out channel splitting processing on the voice to be recognized to obtain a first voice to be recognized corresponding to the first voice channel and a second voice to be recognized corresponding to the second voice channel; respectively performing voice activation detection processing on the first voice to be recognized and the second voice to be recognized to obtain at least one first sub-voice to be recognized and at least one second sub-voice to be recognized; and respectively carrying out voice recognition processing on the at least one first sub-voice to be recognized and the at least one second sub-voice to be recognized to obtain at least one first voice segment and at least one second voice segment.
According to an embodiment of the present disclosure, the performing a speech recognition process on the at least one first sub-speech to be recognized and the at least one second sub-speech to be recognized, respectively, to obtain at least one first speech segment and at least one second speech segment includes: performing voice recognition processing on each first sub-voice to be recognized in the at least one first sub-voice to be recognized to obtain a first number of first characters and a second number of first words, wherein each first word corresponds to first word timestamp information; and performing voice recognition processing on each second sub-voice to be recognized in the at least one second sub-voice to be recognized to obtain a third number of second characters and a fourth number of second words, wherein each second word corresponds to second word timestamp information.
According to an embodiment of the present disclosure, the first vocabulary timestamp information includes a first vocabulary start time and a first vocabulary end time.
According to an embodiment of the present disclosure, the processing the first speech segment and the second speech segment to obtain at least one first speech sentence and at least one second speech sentence includes: performing character detection processing on the first number of first characters in the first speech segment according to a predetermined character to obtain a first character detection result; determining a fifth number of first words corresponding to the first character in response to the first character detection result characterizing that the first character matches the predetermined character; splitting the first speech segment according to the fifth number of first words to obtain the first speech sentence; and determining a start time and an end time corresponding to the first speech sentence according to the first vocabulary start time and the first vocabulary end time corresponding to each of the fifth number of first words.
According to an embodiment of the present disclosure, the second vocabulary timestamp information includes a second vocabulary start time and a second vocabulary end time.
According to an embodiment of the present disclosure, the processing the first speech segment and the second speech segment to obtain at least one first speech sentence and at least one second speech sentence includes: performing character detection processing on the third number of second characters in the second speech segment according to a predetermined character to obtain a second character detection result; determining a sixth number of second words corresponding to the second character in response to the second character detection result characterizing that the second character matches the predetermined character; splitting the second speech segment according to the sixth number of second words to obtain the second speech sentence; and determining a start time and an end time corresponding to the second speech sentence according to the second vocabulary start time and the second vocabulary end time corresponding to each of the sixth number of second words.
According to another aspect of the present disclosure, there is provided a voice recognition apparatus including: a first processing module configured to, in response to a received voice recognition request, process voice to be recognized in the voice recognition request to obtain a first voice segment corresponding to a first voice channel and a second voice segment corresponding to a second voice channel; a second processing module configured to process the first voice segment and the second voice segment respectively to obtain at least one first voice sentence and at least one second voice sentence, wherein each first voice sentence and each second voice sentence respectively correspond to a voice channel identifier, a starting time and an ending time; a sorting module configured to sort the at least one first voice sentence and the at least one second voice sentence according to the starting time to obtain an initial sorting result; and a third processing module configured to process the initial sorting result according to the voice channel identifier and the ending time to obtain a voice recognition result.
According to another aspect of the present disclosure, there is provided an electronic device including: one or more processors; and a memory for storing one or more instructions that, when executed by the one or more processors, cause the one or more processors to implement a method as described in the present disclosure.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement a method as described in the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer executable instructions which, when executed, are adapted to carry out the method as described in the present disclosure.
According to the embodiments of the present disclosure, the first speech segment and the second speech segment are obtained by processing the speech to be recognized in the speech recognition request, so the speech to be recognized can be analyzed according to the speech channel to which it belongs. Because the first speech sentence and the second speech sentence are obtained by processing the first speech segment and the second speech segment respectively, complete speech sentences can be extracted from the speech segments, improving the accuracy of subsequent speech recognition. On this basis, the first and second speech sentences are sorted by start time to obtain an initial ranking result, and the initial ranking result is processed according to the speech channel identifiers and end times to obtain a speech recognition result, so the speech sentences can be arranged in chronological order. This at least partially solves the technical problem in the related art that the speech recognition result is difficult to display clearly when voices from different channels overlap, which leads to a poor user experience.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 schematically illustrates a system architecture to which a speech recognition method may be applied, according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a speech recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates an example schematic diagram of a process of obtaining a first speech segment corresponding to a first speech channel and a second speech segment corresponding to a second speech channel according to an embodiment of the disclosure;
FIG. 4A schematically illustrates an example schematic diagram of a process for processing a first speech segment to obtain at least one first speech statement in accordance with an embodiment of the disclosure;
FIG. 4B schematically illustrates an example schematic diagram of a process of processing a second speech segment to obtain at least one second speech statement, in accordance with an embodiment of the disclosure;
FIG. 5 schematically illustrates an example schematic diagram of a process for ranking at least one first speech statement and at least one second speech statement according to a start time, resulting in an initial ranking result, according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates an example schematic diagram of a process of presenting speech recognition results according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a block diagram of a speech recognition apparatus according to an embodiment of the present disclosure; and
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement a speech recognition method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, such a convention should in general be interpreted as one of ordinary skill in the art would understand it (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together, etc.).
In the embodiments of the present disclosure, the collection, updating, analysis, processing, use, transmission, provision, disclosure, storage, etc., of the data involved (including but not limited to user personal information) all comply with the relevant laws and regulations, are used for legitimate purposes, and do not violate public order and good custom. In particular, necessary measures are taken for the personal information of users to prevent illegal access to users' personal information data and to maintain users' personal information security, network security and national security.
In embodiments of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
For example, after the speech recognition result is collected, your information may be desensitized by means including de-identification or anonymization to protect the security of your information.
The speech recognition method may comprise at least one of: a speech recognition method based on acoustic modeling, a speech recognition method based on language modeling, a speech recognition method based on pronunciation dictionary, and a speech recognition method based on feature extraction.
A speech recognition method based on acoustic modeling may refer to a method that converts a speech signal into a series of feature vectors representing the frequency, time-domain and acoustic properties of the sound. The acoustic model may include at least one of: a Gaussian Mixture Model (GMM) and a Deep Neural Network (DNN).
A speech recognition method based on language modeling may refer to a method that statistically derives probability distributions of words and sentences from a large amount of text data so as to select the most probable text decoding result during recognition. The language model may include at least one of: an n-gram model and a Recurrent Neural Network (RNN) model.
The pronunciation dictionary-based speech recognition method may refer to a method of recognizing a speech to be recognized based on a pronunciation dictionary that stores a phoneme (phonetic unit) sequence of each word and provides a mapping of phonemes to acoustic models.
A feature-extraction-based speech recognition method may refer to a method that converts the original speech signal into a feature representation that can be processed by machine learning algorithms. The feature extraction method may include at least one of: Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC).
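As an illustration of the feature-extraction route, the following is a minimal MFCC-extraction sketch, assuming the third-party librosa library is available; the coefficient count is an illustrative parameter, not something specified by this disclosure.

```python
# A minimal sketch of MFCC feature extraction, assuming librosa is
# available; the number of coefficients (n_mfcc) is illustrative.
import librosa

def extract_mfcc(path: str, n_mfcc: int = 13):
    # Load the audio as a mono waveform at its native sampling rate.
    y, sr = librosa.load(path, sr=None, mono=True)
    # Compute n_mfcc Mel-frequency cepstral coefficients per frame.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
```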
Speech recognition techniques may be applied to quality inspection. However, because voices from different channels may overlap during quality inspection, the recognized text order may be inconsistent with the actual speech order; that is, the accuracy of speech recognition cannot be guaranteed when voices overlap.
In order to at least partially solve the technical problems in the related art, the present disclosure provides a speech recognition method and apparatus, an electronic device, and a computer-readable storage medium, which can be applied to the fields of computer technology and speech processing. The speech recognition method includes the following steps: in response to receiving a speech recognition request, processing the speech to be recognized in the request to obtain a first speech segment corresponding to a first speech channel and a second speech segment corresponding to a second speech channel; processing the first speech segment and the second speech segment respectively to obtain at least one first speech sentence and at least one second speech sentence, where each sentence corresponds to a speech channel identifier, a start time and an end time; sorting the at least one first speech sentence and the at least one second speech sentence by start time to obtain an initial ranking result; and processing the initial ranking result according to the speech channel identifiers and end times to obtain a speech recognition result.
Fig. 1 schematically illustrates a system architecture to which a speech recognition method may be applied according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the voice recognition method provided by the embodiments of the present disclosure may generally be performed by the server 105. Accordingly, the voice recognition apparatus provided by the embodiments of the present disclosure may generally be provided in the server 105. The voice recognition method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105. Accordingly, the voice recognition apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105.
Alternatively, the voice recognition method provided by the embodiment of the present disclosure may be performed by the first terminal device 101, the second terminal device 102, or the third terminal device 103, or may be performed by other terminal devices different from the first terminal device 101, the second terminal device 102, or the third terminal device 103. Accordingly, the voice recognition apparatus provided by the embodiments of the present disclosure may also be provided in the first terminal device 101, the second terminal device 102, or the third terminal device 103, or in other terminal devices different from the first terminal device 101, the second terminal device 102, or the third terminal device 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely representative of the operations for the purpose of description, and should not be construed as representing the order of execution of the respective operations. The method need not be performed in the exact order shown unless explicitly stated.
Fig. 2 schematically illustrates a flow chart of a speech recognition method according to an embodiment of the present disclosure.
As shown in fig. 2, the voice recognition method includes operations S210 to S240.
In response to receiving the voice recognition request, the voice to be recognized in the voice recognition request is processed to obtain a first voice segment corresponding to the first voice channel and a second voice segment corresponding to the second voice channel in operation S210.
In operation S220, the first speech segment and the second speech segment are respectively processed to obtain at least one first speech sentence and at least one second speech sentence, where each first speech sentence and each second speech sentence corresponds to a speech channel identifier, a start time and an end time.
In operation S230, at least one first speech sentence and at least one second speech sentence are ranked according to the start time, resulting in an initial ranking result.
In operation S240, the initial sorting result is processed according to the voice channel identifier and the end time to obtain a voice recognition result.
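Taken together, operations S210 to S240 can be pictured with the following orchestration sketch. The helper names (split_into_segments, segments_to_sentences, merge_adjacent) are hypothetical stand-ins for the steps detailed below, and sentences are assumed, for illustration only, to be dictionaries with channel/text/start/end fields.

```python
# A high-level sketch of operations S210-S240; helper names are
# hypothetical and stand in for the steps described in this section.
def recognize(speech_to_be_recognized):
    # S210: split the audio by channel and transcribe each side.
    first_segments, second_segments = split_into_segments(speech_to_be_recognized)
    # S220: cut each transcribed segment into sentences that carry a
    # speech channel identifier, a start time and an end time.
    sentences = [s for seg in first_segments
                 for s in segments_to_sentences(seg, channel="first")]
    sentences += [s for seg in second_segments
                  for s in segments_to_sentences(seg, channel="second")]
    # S230: sort all sentences by start time (initial ranking result).
    ranked = sorted(sentences, key=lambda s: s["start"])
    # S240: merge adjacent same-channel sentences whose times match.
    return merge_adjacent(ranked)
```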
According to the embodiments of the present disclosure, code for generating a voice recognition request can be written in a script in advance. In response to detecting voice to be recognized input by a user, a server can run the script and generate a voice recognition request message from the voice to be recognized. The server side can send the voice recognition request message to the client side so that the client side can perform voice recognition on the voice to be recognized according to the message.
According to an embodiment of the present disclosure, the speech to be recognized may include a plurality of speech channels, which may refer to speech signal transmission channels. The plurality of voice channels may each correspond to a voice channel identification, which may be used to characterize different voice channels. After receiving the voice recognition request, the voice to be recognized may be processed to obtain at least one voice segment corresponding to each of the plurality of voice channels. The processing means for obtaining the speech segment may include at least one of: channel splitting processing, voice activation detection processing, and voice recognition processing.
For example, the speech to be recognized may be obtained from an intelligent quality inspection platform that may be used to store voice call audio files between customer service and users for subsequent transcription and quality inspection of the audio files. In this case, the first voice channel may refer to a customer service channel, the second voice channel may refer to a user channel, and at least one of a channel splitting process, a voice activation detection process, and a voice recognition process may be performed on the voice to be recognized, so as to obtain a first voice segment corresponding to the customer service channel and a second voice segment corresponding to the user channel. The number of first speech segments may be one or more. The number of second speech segments may be one or more.
According to an embodiment of the present disclosure, the first speech segment may include a first number of first characters, a second number of first words, and first word timestamp information corresponding to each of the second number of first words. The first vocabulary timestamp information may include a first vocabulary start time and a first vocabulary end time. The second speech segment may include a third number of second characters, a fourth number of second words, and second word timestamp information corresponding to each of the fourth number of second words.
The second vocabulary timestamp information may include a second vocabulary start time and a second vocabulary end time.
According to the embodiment of the disclosure, after the first voice segment corresponding to the first voice channel and the second voice segment corresponding to the second voice channel are obtained, character detection processing may be performed on the first voice segment and the second voice segment, respectively, so that the first voice segment is split into at least one first voice sentence and the second voice segment is split into at least one second voice sentence according to a character detection result.
According to the embodiments of the present disclosure, in the process of splitting the first voice segment to obtain first voice sentences, character detection processing may be performed on the first voice segment according to a predetermined character to obtain position information of at least one first character in the first voice segment that matches the predetermined character. At least one first vocabulary corresponding to each first character, that is, the first vocabularies belonging to the same first voice sentence, may be determined according to the position information of the at least one first character. On this basis, the start time and end time corresponding to the first voice sentence may be determined according to the first vocabulary start time and first vocabulary end time corresponding to each first vocabulary. The process of splitting the second voice segment to obtain second voice sentences is the same as that of obtaining the first voice sentences and will not be repeated here.
According to the embodiment of the disclosure, after at least one first voice sentence and at least one second voice sentence are obtained, the voice sentences may be ranked according to the starting time corresponding to each first voice sentence and each second voice sentence, so as to obtain an initial ranking result. The initial ranking result may include at least one first speech sentence and at least one second speech sentence arranged in the starting time order, and a speech channel identification corresponding to each speech sentence.
According to the embodiments of the present disclosure, after the initial ranking result is obtained, voice channel identifier detection may be performed on every two adjacent voice sentences in the initial ranking result to obtain a voice channel identifier detection result. For example, when two adjacent voice sentences belong to the same voice channel, time detection may be performed on the two sentences to obtain a time detection result. When the end time of the preceding sentence matches the start time of the following sentence, the two adjacent sentences may be merged. Alternatively, when two adjacent voice sentences do not belong to the same voice channel, or when the end time of the preceding sentence does not match the start time of the following sentence, processing of the two sentences may be stopped.
According to the embodiments of the present disclosure, after the above processing is performed on every two adjacent voice sentences in the initial ranking result, the voice recognition result can be obtained. The voice recognition result may be in text form and may include at least one merged voice sentence. After the voice recognition result is obtained, the at least one voice sentence can be displayed to the user on the intelligent quality inspection platform, arranged by voice channel according to the voice channel identifier corresponding to each sentence.
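A minimal display sketch for this step follows; representing sentences as dictionaries with channel/start/end/text fields, and the two area names, are illustrative assumptions of this sketch rather than the disclosure's own interface.

```python
# Route each sentence to a display area by its channel identifier;
# the area names and dictionary fields are illustrative assumptions.
def render(sentences):
    for s in sentences:
        area = ("first target area" if s["channel"] == "first"
                else "second target area")
        print(f"[{area}] {s['start']:.2f}-{s['end']:.2f}s: {s['text']}")
```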
According to the embodiments of the present disclosure, the first speech segment and the second speech segment are obtained by processing the speech to be recognized in the speech recognition request, so the speech to be recognized can be analyzed according to the speech channel to which it belongs. Because the first speech sentence and the second speech sentence are obtained by processing the first speech segment and the second speech segment respectively, complete speech sentences can be extracted from the speech segments, improving the accuracy of subsequent speech recognition. On this basis, the first and second speech sentences are sorted by start time to obtain an initial ranking result, and the initial ranking result is processed according to the speech channel identifiers and end times to obtain a speech recognition result, so the speech sentences can be arranged in chronological order. This at least partially solves the technical problem in the related art that the speech recognition result is difficult to display clearly when voices from different channels overlap, which leads to a poor user experience.
The speech recognition method according to embodiments of the present disclosure is further described below with reference to FIGS. 3, 4A, 4B, 5 and 6.
According to an embodiment of the present disclosure, operation S210 may include the following operations.
In response to the received voice recognition request, channel splitting processing is performed on the voice to be recognized to obtain a first voice to be recognized corresponding to the first voice channel and a second voice to be recognized corresponding to the second voice channel. Voice activation detection processing is performed on the first voice to be recognized and the second voice to be recognized respectively to obtain at least one first sub-voice to be recognized and at least one second sub-voice to be recognized. Voice recognition processing is then performed on the at least one first sub-voice to be recognized and the at least one second sub-voice to be recognized respectively to obtain at least one first voice segment and at least one second voice segment.
According to embodiments of the present disclosure, the speech to be recognized may include at least two speech channels.
According to embodiments of the present disclosure, the speech to be recognized may refer to dual-channel audio of customer service and user interaction. The first voice channel may refer to the customer service channel and the second voice channel may refer to the user channel. In this case, channel splitting processing may be performed on the voice to be recognized to obtain a first voice to be recognized and a second voice to be recognized, each of which is single-channel audio. The first voice to be recognized may refer to an audio stream containing the customer service agent's voice. The second voice to be recognized may refer to an audio stream containing the user's voice.
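A minimal channel-splitting sketch, assuming a dual-channel WAV file and the soundfile library; treating column 0 as the customer service channel and column 1 as the user channel is an illustrative convention, not something mandated by the disclosure.

```python
# Split a dual-channel recording into two single-channel waveforms,
# assuming the soundfile library; channel assignment is illustrative.
import soundfile as sf

def split_channels(path: str):
    data, sample_rate = sf.read(path)  # shape: (frames, channels)
    assert data.ndim == 2 and data.shape[1] >= 2, "expected dual-channel audio"
    first = data[:, 0]   # first voice to be recognized (customer service)
    second = data[:, 1]  # second voice to be recognized (user)
    return first, second, sample_rate
```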
According to embodiments of the present disclosure, a voice endpoint detection technique may refer to a technique that locates the start time and end time of speaking segments in a voice stream. After the first voice to be recognized is obtained, voice activation detection processing may be performed on the first voice to be recognized using a voice endpoint detection (Voice Activity Detection, VAD) technique to obtain at least one first sub-voice to be recognized. A first sub-voice to be recognized may refer to a passage of the first voice to be recognized in which the speaker actually speaks. Each of the at least one first sub-voice to be recognized may correspond to a start time and an end time.
According to the embodiments of the present disclosure, after the second voice to be recognized is obtained, voice activation detection processing may likewise be performed on it using the voice endpoint detection technique to obtain at least one second sub-voice to be recognized. A second sub-voice to be recognized may refer to a passage of the second voice to be recognized in which the speaker actually speaks. Each of the at least one second sub-voice to be recognized may correspond to a start time and an end time.
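The following is a simplified, energy-threshold stand-in for the voice activation detection step; production systems typically use a trained VAD model, and the frame length and threshold here are illustrative assumptions.

```python
# A simplified energy-threshold stand-in for VAD: returns the
# (start_time, end_time) pairs of frames whose RMS energy exceeds a
# threshold. Frame length and threshold are illustrative assumptions.
import numpy as np

def vad_segments(wave: np.ndarray, sample_rate: int,
                 frame_ms: int = 30, threshold: float = 0.01):
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(wave) // frame_len
    segments, start = [], None
    for i in range(n_frames):
        frame = wave[i * frame_len:(i + 1) * frame_len]
        active = float(np.sqrt(np.mean(frame ** 2))) > threshold  # RMS energy
        t = i * frame_ms / 1000.0
        if active and start is None:
            start = t                    # a speaking segment begins
        elif not active and start is not None:
            segments.append((start, t))  # the speaking segment ends
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_ms / 1000.0))
    return segments  # list of (start_time, end_time) in seconds
```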
According to an embodiment of the present disclosure, performing a speech recognition process on at least one first sub-speech to be recognized and at least one second sub-speech to be recognized, respectively, to obtain at least one first speech segment and at least one second speech segment may include the following operations.
For each first sub-voice to be recognized in the at least one first sub-voice to be recognized, voice recognition processing is performed on it to obtain a first number of first characters and a second number of first words, wherein each first word corresponds to first word timestamp information.
For each second sub-voice to be recognized in the at least one second sub-voice to be recognized, voice recognition processing is performed on it to obtain a third number of second characters and a fourth number of second words, wherein each second word corresponds to second word timestamp information.
According to embodiments of the present disclosure, speech recognition may refer to the process of recognizing specific speaking content from speech to be recognized and converting the speaking content into text output.
According to the embodiment of the disclosure, after at least one first sub-voice to be recognized is obtained, voice recognition processing may be performed on each first sub-voice to be recognized, so as to obtain a first voice segment in text form. The first speech segment may include a first number of first characters and a second number of first words. On the basis, first vocabulary time stamp information corresponding to each first vocabulary can be obtained in the process of carrying out voice recognition processing on the first sub-voices to be recognized.
According to the embodiment of the disclosure, after at least one second sub-voice to be recognized is obtained, voice recognition processing may be performed on each second sub-voice to be recognized, so as to obtain a second voice segment in text form. The second speech segment includes a third number of second characters and a fourth number of second words. On the basis, second vocabulary time stamp information corresponding to each second vocabulary can be obtained in the process of carrying out voice recognition processing on the second sub-voices to be recognized.
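The per-segment output described above can be pictured with the following sketch of data types; the field names are illustrative assumptions rather than the disclosure's own naming.

```python
# A sketch of the data carried by each transcribed speech segment:
# recognized characters, words, and per-word timestamp information.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Word:
    text: str
    start: float  # vocabulary start time (seconds)
    end: float    # vocabulary end time (seconds)

@dataclass
class SpeechSegment:
    channel: str                  # voice channel identifier
    characters: str               # recognized text, incl. punctuation
    words: List[Word] = field(default_factory=list)
```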
According to the embodiments of the present disclosure, since the first voice to be recognized and the second voice to be recognized are obtained by channel splitting the voice to be recognized, the voice channels can be separated, enabling parallel processing of multi-channel voice input. Because the first and second sub-voices to be recognized are obtained by performing voice activation detection processing on the first voice to be recognized and the second voice to be recognized respectively, portions with and without speech can be effectively distinguished, and background noise, silence and invalid audio can be removed, improving the accuracy and stability of voice recognition. On this basis, the first voice segment and the second voice segment are obtained by performing voice recognition processing on the first and second sub-voices to be recognized respectively, which improves the ability to handle complex voice scenarios and the effect and performance of subsequent voice recognition.
Fig. 3 schematically illustrates an example schematic diagram of a process of obtaining a first speech segment corresponding to a first speech channel and a second speech segment corresponding to a second speech channel according to an embodiment of the disclosure.
As shown in fig. 3, in 300, in response to receiving a voice recognition request 301, a channel splitting process is performed on a voice to be recognized 301_1 in the voice recognition request 301, so as to obtain a first voice to be recognized 302 corresponding to a first voice channel and a second voice to be recognized 303 corresponding to a second voice channel.
The first speech to be recognized 302 is subjected to a speech activation detection process to obtain at least one first sub-speech to be recognized 304. And performing voice activation detection processing on the second voice to be recognized 303 to obtain at least one second sub-voice to be recognized 305.
For each first sub-voice to be recognized 304 in the at least one first sub-voice to be recognized 304, performing voice recognition processing on the first sub-voice to be recognized 304 to obtain a first number of first characters 306 and a second number of first words 307.
For each second sub-voice to be recognized 305 in the at least one second sub-voice to be recognized 305, voice recognition processing is performed on the second sub-voice to be recognized 305 to obtain a third number of second characters 308 and a fourth number of second words 309.
According to an embodiment of the present disclosure, operation S220 may include the following operations.
Character detection processing is performed on the first number of first characters in the first voice segment according to a predetermined character to obtain a first character detection result. In response to the first character detection result characterizing that a first character matches the predetermined character, a fifth number of first words corresponding to the first character is determined. The first voice segment is split according to the fifth number of first words to obtain a first voice sentence. The start time and end time corresponding to the first voice sentence are determined according to the first vocabulary start time and first vocabulary end time corresponding to each of the fifth number of first words.
According to an embodiment of the present disclosure, the first vocabulary timestamp information may include a first vocabulary start time and a first vocabulary end time. The specific content of the predetermined character may be set according to the actual service requirement, which is not limited herein. For example, the predetermined character may include at least one of: commas and periods.
According to the embodiment of the disclosure, for each first character in the first number of first characters, a character detection process may be performed on the first character according to a predetermined character, so as to obtain a first character detection result. For example, in the case where the first character detection result indicates that the first character matches a predetermined character, it may be determined that the first vocabulary corresponding to the first character belongs to a first vocabulary that needs to be split into clauses of smaller granularity.
According to the embodiments of the present disclosure, a fifth number of first words corresponding to the first character can be determined, and splitting processing is performed according to the fifth number of first words to obtain a first voice sentence comprising the fifth number of first words. On this basis, the start time and end time corresponding to the first voice sentence may be determined based on the vocabulary timestamp information corresponding to each of the fifth number of first words.
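A minimal sentence-splitting sketch over the Word/SpeechSegment types sketched earlier. It assumes, for illustration, that the predetermined character (a comma or period) appears at the end of the word that closes a clause; the sentence start and end times are taken from the first and last word timestamps, as described above.

```python
# Split a transcribed segment into sentences at predetermined
# characters; punctuation set and dictionary fields are illustrative.
PREDETERMINED = ("，", "。", ",", ".")

def split_sentences(segment: SpeechSegment):
    sentences, bucket = [], []
    for word in segment.words:
        bucket.append(word)
        if word.text.endswith(PREDETERMINED):   # character detection
            sentences.append({
                "channel": segment.channel,
                "text": "".join(w.text for w in bucket),
                "start": bucket[0].start,       # first word's start time
                "end": bucket[-1].end,          # last word's end time
            })
            bucket = []
    if bucket:  # trailing words without a closing punctuation mark
        sentences.append({
            "channel": segment.channel,
            "text": "".join(w.text for w in bucket),
            "start": bucket[0].start,
            "end": bucket[-1].end,
        })
    return sentences
```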
According to the embodiment of the disclosure, since the first character detection result is obtained by performing character detection processing on the first number of first characters in the first voice segment according to the predetermined characters, the specific characters in the first voice segment can be identified, and the subsequent splitting from the first voice segment to obtain the first voice sentence is facilitated. Because the first voice sentence is obtained by splitting according to the fifth number of first vocabularies corresponding to the first character under the condition that the first character detection result represents that the first character is matched with the preset character, the first voice section can be divided according to the sentence unit, and more accurate voice text alignment and voice content extraction are facilitated. On the basis, the starting time and the ending time corresponding to the first voice sentences are determined according to the first vocabulary starting time and the first vocabulary ending time corresponding to the fifth number of first vocabularies, so that the positions of the first voice sentences in the first voice sections can be more accurately positioned, the order of the first voice sentences is adjusted based on the time stamp information of the vocabulary granularity, and the accuracy of the follow-up voice recognition results is improved.
Fig. 4A schematically illustrates an example schematic diagram of a process of processing a first speech segment to obtain at least one first speech sentence according to an embodiment of the present disclosure.
As shown in fig. 4A, in 400A, for a first voice segment 401, a first number of first characters 401_1 in the first voice segment 401 are subjected to a character detection process according to a predetermined character 402, resulting in a first character detection result 403. After the first character detection result 403 is obtained, operation S410 may be performed.
In operation S410, it is determined whether the first character detection result 403 indicates that the first character 401_1 matches the predetermined character 402.
If so, a fifth number of first words 404 corresponding to the first character 401_1 may be determined. On this basis, the first speech segment 401 may be split according to the fifth number of first vocabularies 404, to obtain a first speech sentence 405. If not, the flow may end.
According to an embodiment of the present disclosure, operation S220 may include the following operations.
Character detection processing is performed on the third number of second characters in the second voice segment according to a predetermined character to obtain a second character detection result. In response to the second character detection result characterizing that a second character matches the predetermined character, a sixth number of second words corresponding to the second character is determined. The second voice segment is split according to the sixth number of second words to obtain a second voice sentence. The start time and end time corresponding to the second voice sentence are determined according to the second vocabulary start time and second vocabulary end time corresponding to each of the sixth number of second words.
According to an embodiment of the present disclosure, the second vocabulary timestamp information may include a second vocabulary start time and a second vocabulary end time.
According to the embodiment of the disclosure, for each second character in the third number of second characters, the second character may be subjected to a character detection process according to a predetermined character, to obtain a second character detection result. For example, in the case where the second character detection result indicates that the second character matches the predetermined character, it may be determined that the second vocabulary corresponding to the second character belongs to the second vocabulary that needs to be split into clauses of smaller granularity.
According to the embodiments of the present disclosure, a sixth number of second words corresponding to the second character can be determined, and splitting processing is performed according to the sixth number of second words to obtain a second voice sentence comprising the sixth number of second words. On this basis, the start time and end time corresponding to the second voice sentence may be determined based on the vocabulary timestamp information corresponding to each of the sixth number of second words.
According to the embodiment of the disclosure, since the second character detection result is obtained by performing character detection processing on the second number of second characters in the second voice segment according to the predetermined characters, the specific characters in the second voice segment can be identified, and the second voice sentence can be obtained by splitting from the second voice segment. The second voice sentence is obtained by splitting according to a sixth number of second vocabularies corresponding to the second characters under the condition that the second character detection result represents that the second characters are matched with the preset characters, so that the second voice section can be divided according to the sentence units, and more accurate voice text alignment and voice content extraction are facilitated. On the basis, the starting time and the ending time corresponding to the second voice sentences are determined according to the second vocabulary starting time and the second vocabulary ending time corresponding to the sixth number of second vocabularies, so that the positions of the second voice sentences in the second voice sections can be more accurately positioned, the order of the second voice sentences is adjusted based on the timestamp information of the vocabulary granularity, and the accuracy of the subsequent voice recognition results is improved.
Fig. 4B schematically illustrates an example schematic diagram of a process of processing a second speech segment to obtain at least one second speech sentence according to an embodiment of the present disclosure.
As shown in fig. 4B, in 400B, for the second speech segment 406, a character detection process is performed on a second number of second characters 406_1 in the second speech segment 406 according to a predetermined character 407, resulting in a second character detection result 408. After the second character detection result 408 is obtained, operation S420 may be performed.
In operation S420, it is determined whether the second character detection result 408 indicates that the second character 406_1 matches the predetermined character 407.
If so, a sixth number of second words 409 corresponding to the second character 406_1 may be determined. On this basis, the second speech segment 406 may be split according to the sixth number of second words 409, to obtain a second speech sentence 410. If not, the flow may end.
According to an embodiment of the present disclosure, operation S230 may include the following operations.
Sorting the at least one first voice sentence and the at least one second voice sentence according to the respective starting time of each first voice sentence and each second voice sentence to obtain an initial ranking result.
According to the embodiments of the present disclosure, after S first voice sentences and T second voice sentences are obtained, the (S+T) voice sentences may be sorted according to the sentence start time corresponding to each voice sentence to obtain the initial ranking result, that is, the result of ranking the (S+T) voice sentences.
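The initial ranking step reduces to a single sort; a minimal sketch, assuming the sentence dictionaries produced by the splitting sketch above:

```python
# Pool the S first sentences and T second sentences, then order the
# (S+T) sentences by their sentence start time (initial ranking).
def initial_ranking(first_sentences, second_sentences):
    return sorted(first_sentences + second_sentences,
                  key=lambda s: s["start"])
```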
According to the embodiments of the present disclosure, since the order is determined according to the respective start times of the first voice sentences and the second voice sentences, the relative order relationship between different sentences in the first voice segment and the second voice segment can be obtained. On this basis, sorting the first voice sentences and the second voice sentences by start time to obtain the initial ranking result facilitates reasonable organization and arrangement of the voice sentences, improving the efficiency and accuracy of subsequent voice recognition.
According to an embodiment of the present disclosure, operation S240 may include the following operations.
And determining a p-th voice sentence and a p+1-th voice sentence in at least one initial voice sentence, wherein p is a positive integer. And carrying out voice channel identification detection on the p-th voice sentence and the p+1th voice sentence to obtain a voice channel identification detection result. Responding to the voice channel identification detection result to represent that the p-th voice sentence and the p+1-th voice sentence have the same voice channel identification, and carrying out voice sentence time detection on the p-th voice sentence and the p+1-th voice sentence to obtain a voice sentence time detection result. Responding to the voice sentence time detection result to represent that the sentence end time of the p-th voice sentence is matched with the sentence start time of the p+1th voice sentence, and combining the p-th voice sentence and the p+1th voice sentence to obtain the combined p-th voice sentence.
According to an embodiment of the present disclosure, the initial sequencing result includes at least one initial speech statement arranged by a start time and a speech channel identification corresponding to each of the initial speech statements.
According to an embodiment of the present disclosure, after the initial ranking result is obtained, every two adjacent speech sentences, that is, the p-th speech sentence and the p+1th speech sentence, may be sequentially determined in the initial ranking result.
According to the embodiment of the disclosure, the voice channel identification detection can be performed on the p-th voice sentence and the p+1-th voice sentence, so as to obtain a voice channel identification detection result. The speech channel identification detection result may be used to characterize whether the speech channel identification of the p-th speech sentence and the speech channel identification of the p+1th speech sentence match.
According to the embodiment of the disclosure, in the case that the voice channel identification detection result characterizes that the voice channel identifications of the p-th voice sentence and the p+1th voice sentence match, voice sentence time detection may be performed on the p-th voice sentence and the p+1th voice sentence to obtain a voice sentence time detection result. The voice sentence time detection result may be used to characterize whether the sentence end time of the p-th voice sentence matches the sentence start time of the p+1th voice sentence. In the case that the voice sentence time detection result characterizes that the sentence end time of the p-th voice sentence matches the sentence start time of the p+1th voice sentence, the p-th voice sentence and the p+1th voice sentence may be combined to obtain the combined p-th voice sentence.
According to the embodiment of the disclosure, in the case that the voice channel identification detection result characterizes that the voice channel identifications of the p-th voice sentence and the p+1th voice sentence do not match, the p-th voice sentence and the p+1th voice sentence may be left unprocessed. Likewise, in the case that the voice sentence time detection result characterizes that the sentence end time of the p-th voice sentence does not match the sentence start time of the p+1th voice sentence, the p-th voice sentence and the p+1th voice sentence may be left unprocessed.
According to the embodiment of the present disclosure, since the voice channel identification detection result is obtained by performing voice channel identification detection on the p-th voice sentence and the p+1th voice sentence, the voice channel identification detection result can be used to determine whether or not every two voice sentences are from the same voice channel. Since the speech sentence time detection result is obtained by performing speech sentence time detection on the p-th speech sentence and the p+1th speech sentence in the case where the p-th speech sentence and the p+1th speech sentence have the same speech channel identification, the speech sentence time detection result can be used to determine whether the sentence end time of the p-th speech sentence matches the sentence start time of the p+1th speech sentence. On the basis, when the sentence ending time of the p-th voice sentence is matched with the sentence starting time of the p+1th voice sentence, the p-th voice sentence and the p+1th voice sentence are combined to obtain the combined p-th voice sentence, so that adjacent voice sentences can be combined, and the continuity and the integrity of a voice recognition result are improved.
Fig. 5 schematically illustrates an example schematic diagram of a process of processing the initial ranking result according to the voice channel identification and the end time, resulting in a voice recognition result, according to an embodiment of the present disclosure.
As shown in fig. 5, in 500, a p-th speech sentence 502 and a p+1th speech sentence 503 are determined in at least one initial speech sentence 501. The p-th speech sentence 502 and the p+1th speech sentence 503 are subjected to speech channel identification detection, and a speech channel identification detection result 504 is obtained. After the voice channel identification detection result 504 is obtained, operation S510 may be performed.
In operation S510, it is determined whether the speech channel identification detection result 504 characterizes that the p-th speech sentence 502 and the p+1th speech sentence 503 have the same speech channel identification.
If yes, the p-th speech sentence 502 and the p+1th speech sentence 503 can be subjected to speech sentence time detection, so as to obtain a speech sentence time detection result 505. If not, the flow may end. After the speech statement time detection result 505 is obtained, operation S520 may be performed.
In operation S520, it is determined whether the speech sentence time detection result 505 characterizes that the sentence end time of the p-th speech sentence 502 matches the sentence start time of the p+1th speech sentence 503.
If so, the p-th speech sentence 502 and the p+1th speech sentence 503 may be combined to obtain a combined p-th speech sentence 506. If not, the flow may end.
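The merging walk of operations S510 and S520 can be sketched as follows, again over the hypothetical Sentence type. The disclosure does not quantify when an end time "matches" the next start time, so the gap threshold below is an assumed stand-in.

```python
def merge_adjacent(ranked: list[Sentence], max_gap: float = 0.5) -> list[Sentence]:
    """Scan adjacent pairs in the initial ranking result and combine each
    pair that shares a voice channel identifier and whose times match."""
    merged: list[Sentence] = []
    for sentence in ranked:
        if merged:
            previous = merged[-1]
            # Voice channel identification detection.
            same_channel = previous.channel == sentence.channel
            # Speech sentence time detection (assumed matching criterion).
            times_match = 0.0 <= sentence.start - previous.end <= max_gap
            if same_channel and times_match:
                # Combine the p-th and (p+1)-th sentences into one.
                merged[-1] = Sentence(
                    text=previous.text + " " + sentence.text,
                    channel=previous.channel,
                    start=previous.start,
                    end=sentence.end,
                )
                continue
        merged.append(sentence)
    return merged
```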
According to an embodiment of the present disclosure, the speech recognition method 200 may further include the following operations.
Sequentially determining the q-th voice sentence in at least one voice sentence, wherein q is a positive integer. And in response to the voice channel identification corresponding to the q-th voice sentence characterizing that the q-th voice sentence belongs to the first voice channel, displaying the q-th voice sentence in a first target area of the target page. And in response to the voice channel identification corresponding to the q-th voice sentence characterizing that the q-th voice sentence belongs to a second voice channel, displaying the q-th voice sentence in a second target area of the target page.
According to an embodiment of the present disclosure, the speech recognition result may include at least one speech statement and a speech channel identification corresponding to each speech statement.
According to embodiments of the present disclosure, after the voice recognition result is obtained, the voice recognition result may be returned to the intelligent quality inspection platform. The intelligent quality inspection platform may sequentially determine the q-th voice sentence in the voice recognition result, from front to back by sentence start time, and display the q-th voice sentence on the target page according to the voice channel identification corresponding to the q-th voice sentence. The target page may refer to a chat window.
According to the embodiment of the disclosure, the specific display manner may be set according to the actual service requirement, which is not limited herein. For example, in the case that the speech channel identifier corresponding to the qth speech sentence belongs to the first speech channel, the qth speech sentence may be displayed in a first target area of the target page, where the first target area may refer to the right side of the target page. Alternatively, in the case that the speech channel identifier corresponding to the q-th speech sentence belongs to the second speech channel, the q-th speech sentence may be displayed in a second target area of the target page, where the second target area may refer to the left side of the target page.
According to the embodiment of the disclosure, the q-th voice sentence is sequentially determined in at least one voice sentence, so that processing and displaying according to the sequence of the voice sentences are facilitated, and the display order of the voice recognition results can be ensured. On the basis, the display area of the voice sentence is determined according to the voice channel identification corresponding to the voice sentence, so that the voice contents of different voice channels can be distinguished, the voice contents of different voice channels are respectively displayed in different areas of a target page, the effect that the voice sentences are displayed one by one can be presented on the target page, and the understanding and the use experience of a user to the voice to be recognized can be improved.
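A minimal sketch of this presentation step; the print call stands in for the intelligent quality inspection platform's actual rendering of chat bubbles, and the channel name is an assumption.

```python
def present_results(sentences: list[Sentence],
                    first_channel: str = "customer_service") -> None:
    """Walk the recognized sentences in order and route each one to the
    target area that corresponds to its voice channel identifier."""
    for q, sentence in enumerate(sentences, start=1):
        # First channel -> first target area (right side of the target page);
        # otherwise -> second target area (left side).
        area = "right" if sentence.channel == first_channel else "left"
        print(f"[{area}] #{q} {sentence.start:.1f}-{sentence.end:.1f}s: {sentence.text}")
```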
Fig. 6 schematically illustrates an example schematic diagram of a process of presenting speech recognition results according to an embodiment of the disclosure.
As shown in fig. 6, in 600, the speech recognition process of the embodiments of the present disclosure is described by way of example, with the first voice channel being the customer service channel, the second voice channel being the user channel, the first target area on the right side of the target page, and the second target area on the left side. The first voice segment corresponding to the first voice channel is "In this case we see that the address is known. That dispute order. It is closed for you. Sorry to trouble you, please explain this to the family.", and the second voice segment corresponding to the second voice channel is "Oh. Good. Okay.".

The first speech segment may be processed to obtain 4 first speech sentences, namely speech sentence A "In this case we see that the address is known.", speech sentence B "That dispute order.", speech sentence C "It is closed for you.", and speech sentence D "Sorry to trouble you, please explain this to the family.". The second speech segment may be processed to obtain 3 second speech sentences, namely speech sentence E "Oh.", speech sentence F "Good.", and speech sentence G "Okay.".
The 4 first voice sentences and the 3 second voice sentences may be ranked according to their respective start times, and the initial ranking result obtained is: speech sentence A -> speech sentence B -> speech sentence E -> speech sentence C -> speech sentence F -> speech sentence D -> speech sentence G.
On this basis, since the speech sentence A and the speech sentence B have the same speech channel identification and the sentence end time of the speech sentence A matches the sentence start time of the speech sentence B, the speech sentence A and the speech sentence B can be combined to obtain the speech sentence H.
After the speech recognition result is obtained, speech sentence H601 may be sequentially presented on the right side of the target page, speech sentence E602 on the left side of the target page, speech sentence C603 on the right side of the target page, speech sentence F604 on the left side of the target page, speech sentence D605 on the right side of the target page, and speech sentence G606 on the left side of the target page.
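Running the sketches above on this example reproduces the behavior of fig. 6; the timestamps are invented purely for illustration.

```python
first = [
    Sentence("In this case we see that the address is known.", "customer_service", 0.0, 2.0),  # A
    Sentence("That dispute order.", "customer_service", 2.1, 3.0),                             # B
    Sentence("It is closed for you.", "customer_service", 4.0, 5.0),                           # C
    Sentence("Sorry to trouble you, please explain this to the family.",
             "customer_service", 6.0, 8.0),                                                    # D
]
second = [
    Sentence("Oh.", "user", 3.2, 3.5),    # E
    Sentence("Good.", "user", 5.2, 5.5),  # F
    Sentence("Okay.", "user", 8.2, 8.5),  # G
]

ranked = rank_sentences(first, second)  # order: A, B, E, C, F, D, G
merged = merge_adjacent(ranked)         # A and B combine into H; six sentences remain
present_results(merged)                 # H, C, D routed to the right; E, F, G to the left
```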
The above is only an exemplary embodiment, but is not limited thereto, and other speech recognition methods known in the art may be included as long as more accurate and ordered speech recognition results can be obtained.
Fig. 7 schematically shows a block diagram of a speech recognition apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the speech recognition apparatus 700 may include a first processing module 710, a second processing module 720, a ranking module 730, and a third processing module 740.
The first processing module 710 is configured to process, in response to receiving the voice recognition request, a voice to be recognized in the voice recognition request, and obtain a first voice segment corresponding to the first voice channel and a second voice segment corresponding to the second voice channel.
The second processing module 720 is configured to process the first speech segment and the second speech segment respectively to obtain at least one first speech sentence and at least one second speech sentence, where each first speech sentence and each second speech sentence corresponds to a speech channel identifier, a start time and an end time.
The ranking module 730 is configured to rank at least one first speech sentence and at least one second speech sentence according to the start time, so as to obtain an initial ranking result.
And a third processing module 740, configured to process the initial sorting result according to the voice channel identifier and the end time, so as to obtain a voice recognition result.
According to an embodiment of the present disclosure, the sorting module 730 may include a sorting unit.
The ranking unit is used for ranking the at least one first voice sentence and the at least one second voice sentence according to the respective start time of each first voice sentence and the respective start time of each second voice sentence to obtain an initial ranking result.
According to an embodiment of the present disclosure, the third processing module 740 may include a first determining unit, a voice channel identification detecting unit, a voice sentence time detecting unit, and a combining processing unit.
A first determining unit, configured to determine a p-th speech sentence and a p+1th speech sentence in at least one initial speech sentence, where p is a positive integer.
The voice channel identification detection unit is used for carrying out voice channel identification detection on the p-th voice statement and the p+1-th voice statement to obtain a voice channel identification detection result.
The voice sentence time detection unit is used for responding to the voice channel identification detection result to represent that the p-th voice sentence and the p+1-th voice sentence have the same voice channel identification, and carrying out voice sentence time detection on the p-th voice sentence and the p+1-th voice sentence to obtain a voice sentence time detection result.
And the merging processing unit is used for responding to the voice sentence time detection result to represent that the sentence end time of the p-th voice sentence is matched with the sentence start time of the p+1th voice sentence, and carrying out merging processing on the p-th voice sentence and the p+1th voice sentence to obtain the merged p-th voice sentence.
According to an embodiment of the present disclosure, the speech recognition result includes at least one speech statement and a speech channel identification corresponding to each speech statement.
According to an embodiment of the present disclosure, the speech recognition apparatus 700 may further include a determination module, a first presentation module, and a second presentation module.
And the determining module is used for sequentially determining the q-th voice sentence in at least one voice sentence, wherein q is a positive integer.
The first display module is used for responding to the voice channel identification representation corresponding to the voice statement and belonging to the first voice channel, and displaying the voice statement in a first target area of the target page.
And the second display module is used for responding to the voice channel identification representation corresponding to the voice statement and belonging to a second voice channel, and displaying the voice statement in a second target area of the target page.
According to an embodiment of the present disclosure, the speech to be recognized comprises at least two speech channels.
According to an embodiment of the present disclosure, the first processing module 710 may include a first split processing unit, a voice activation detection processing unit, and a voice recognition processing unit.
The first splitting processing unit is used for responding to the received voice recognition request, carrying out channel splitting processing on the voice to be recognized, and obtaining a first voice to be recognized corresponding to the first voice channel and a second voice to be recognized corresponding to the second voice channel.
The voice activation detection processing unit is used for respectively carrying out voice activation detection processing on the first voice to be recognized and the second voice to be recognized to obtain at least one first sub voice to be recognized and at least one second sub voice to be recognized.
The voice recognition processing unit is used for respectively carrying out voice recognition processing on at least one first sub voice to be recognized and at least one second sub voice to be recognized to obtain at least one first voice segment and at least one second voice segment.
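The front half of this pipeline can be sketched as follows, assuming a two-channel (stereo) recording with one speaker per channel. The energy-threshold voice activation detection is a crude stand-in for a real VAD, and a production system would likewise replace the downstream recognition call with an actual ASR engine that emits the per-vocabulary timestamps (the Word records of the earlier sketch).

```python
import numpy as np

def split_channels(audio: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Channel splitting: a (num_samples, 2) stereo array yields the first
    voice to be recognized and the second voice to be recognized."""
    return audio[:, 0], audio[:, 1]

def vad_split(mono: np.ndarray, sample_rate: int) -> list[np.ndarray]:
    """Voice activation detection: cut one channel into sub-voices at
    silences, using mean absolute energy over 100 ms frames (stub)."""
    frame = sample_rate // 10
    num_frames = len(mono) // frame
    energy = np.abs(mono[: num_frames * frame]).reshape(num_frames, frame).mean(axis=1)
    voiced = energy > 0.01  # assumed silence threshold
    pieces: list[np.ndarray] = []
    start = None
    for i, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = i
        elif not is_voiced and start is not None:
            pieces.append(mono[start * frame : i * frame])
            start = None
    if start is not None:
        pieces.append(mono[start * frame :])
    return pieces
```

Each sub-voice produced here would then be passed to the recognition engine to obtain the characters and vocabularies consumed by the sentence-splitting step.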
According to an embodiment of the present disclosure, the speech recognition processing unit may comprise a first speech recognition processing subunit and a second speech recognition processing subunit.
The first voice recognition processing subunit is configured to, for each first sub voice to be recognized in the at least one first sub voice to be recognized, perform voice recognition processing on the first sub voice to be recognized to obtain a first number of first characters and a second number of first vocabularies, where each first vocabulary corresponds to first vocabulary timestamp information.

The second voice recognition processing subunit is configured to, for each second sub voice to be recognized in the at least one second sub voice to be recognized, perform voice recognition processing on the second sub voice to be recognized to obtain a third number of second characters and a fourth number of second vocabularies, where each second vocabulary corresponds to second vocabulary timestamp information.
According to an embodiment of the present disclosure, the first vocabulary timestamp information includes a first vocabulary start time and a first vocabulary end time.
According to an embodiment of the present disclosure, the second processing module 720 may include a first character detection processing unit, a second determination unit, a second split processing unit, and a third determination unit.
And the first character detection processing unit is used for carrying out character detection processing on a first number of first characters in the first voice segment according to the predetermined characters to obtain a first character detection result.
And the second determining unit is used for responding to the first character detection result to represent that the first character is matched with the preset character, and determining a fifth number of first vocabularies corresponding to the first character.
The second splitting processing unit is used for splitting the first voice segment according to the fifth number of first vocabularies to obtain a first voice sentence.
And the third determining unit is used for determining the starting time and the ending time corresponding to the first voice sentences according to the first vocabulary starting time and the first vocabulary ending time corresponding to the fifth number of first vocabularies.
According to an embodiment of the present disclosure, the second vocabulary timestamp information includes a second vocabulary start time and a second vocabulary end time.
According to an embodiment of the present disclosure, the second processing module 720 may include a second character detection processing unit, a fourth determination unit, a third split processing unit, and a fifth determination unit.
And the second character detection processing unit is used for carrying out character detection processing on a second number of second characters in the second voice segment according to the predetermined characters to obtain a second character detection result.
And a fourth determining unit for determining a sixth number of second words corresponding to the second character in response to the second character detection result characterizing that the second character matches the predetermined character.
And the third splitting processing unit is used for splitting the second voice segment according to the sixth number of second vocabularies to obtain a second voice sentence.
And a fifth determining unit, configured to determine a start time and an end time corresponding to the second speech sentence according to a second vocabulary start time and a second vocabulary end time corresponding to each of the sixth number of second vocabularies.
Any number of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure, or at least part of the functionality of any number of them, may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or in any other reasonable manner of integrating or packaging a circuit in hardware or firmware, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be at least partially implemented as computer program modules which, when executed, may perform the corresponding functions.
For example, any of the first processing module 710, the second processing module 720, the sorting module 730, and the third processing module 740 may be combined into one module/unit/sub-unit, or any one of them may be split into multiple modules/units/sub-units. Alternatively, at least part of the functionality of one or more of these modules/units/sub-units may be combined with at least part of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the present disclosure, at least one of the first processing module 710, the second processing module 720, the sorting module 730, and the third processing module 740 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or in any other reasonable manner of integrating or packaging a circuit in hardware or firmware, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the first processing module 710, the second processing module 720, the sorting module 730, and the third processing module 740 may be at least partially implemented as a computer program module which, when executed, may perform the corresponding function.
It should be noted that, in the embodiments of the present disclosure, the voice recognition apparatus portion corresponds to the voice recognition method portion; for its details, reference may be made to the description of the voice recognition method portion, which is not repeated here.
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement a speech recognition method according to an embodiment of the disclosure. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)). The processor 801 may also include on-board memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the program may also be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the input/output (I/O) interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the input/output (I/O) interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read out therefrom is installed into the storage section 808 as needed.
According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable medium 811. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program, the computer program comprising program code which, when the computer program product runs on an electronic device, causes the electronic device to implement the voice recognition method provided by the embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed via the communication section 809, and/or installed from the removable medium 811. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
According to embodiments of the present disclosure, program code for carrying out computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, the "C" language, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (12)

1. A method of speech recognition, comprising:
Responding to a received voice recognition request, and processing voice to be recognized in the voice recognition request to obtain a first voice segment corresponding to a first voice channel and a second voice segment corresponding to a second voice channel;
Processing the first voice segment and the second voice segment respectively to obtain at least one first voice sentence and at least one second voice sentence, wherein each first voice sentence and each second voice sentence respectively correspond to a voice channel identifier, a starting time and an ending time;
Sorting the at least one first voice sentence and the at least one second voice sentence according to the starting time to obtain an initial sorting result; and
And processing the initial sequencing result according to the voice channel identifier and the ending time to obtain a voice recognition result.
2. The method of claim 1, wherein the ranking the at least one first speech sentence and the at least one second speech sentence according to the start time, resulting in an initial ranking result comprises:
and sequencing the at least one first voice sentence and the at least one second voice sentence according to the respective starting time of each first voice sentence and the respective starting time of each second voice sentence, so as to obtain the initial sequencing result.
3. The method of claim 2, wherein the initial sequencing result comprises at least one initial speech statement arranged by a start time and a speech channel identification corresponding to each of the initial speech statements;
And processing the initial sequencing result according to the voice channel identifier and the ending time to obtain a voice recognition result, wherein the step of obtaining the voice recognition result comprises the following steps:
determining a p-th voice sentence and a p+1-th voice sentence in the at least one initial voice sentence, wherein p is a positive integer;
Performing voice channel identification detection on the p-th voice sentence and the p+1th voice sentence to obtain a voice channel identification detection result;
Responding to the voice channel identification detection result to represent that the p-th voice sentence and the p+1th voice sentence have the same voice channel identification, and carrying out voice sentence time detection on the p-th voice sentence and the p+1th voice sentence to obtain a voice sentence time detection result; and
Responding to the voice sentence time detection result to represent that the sentence ending time of the p-th voice sentence is matched with the sentence starting time of the p+1th voice sentence, and carrying out merging processing on the p-th voice sentence and the p+1th voice sentence to obtain a merged p-th voice sentence.
4. A method according to any one of claims 1 to 3, wherein the speech recognition result comprises at least one speech statement and a speech channel identity corresponding to each of the speech statements;
The method further comprises, after the processing the initial sorting result according to the voice channel identifier and the ending time to obtain a voice recognition result:
Sequentially determining a q-th voice sentence in the at least one voice sentence, wherein q is a positive integer;
responding to the voice channel identification corresponding to the q-th voice sentence characterizing that the q-th voice sentence belongs to a first voice channel, and displaying the q-th voice sentence in a first target area of a target page; and

And responding to the voice channel identification corresponding to the q-th voice sentence characterizing that the q-th voice sentence belongs to a second voice channel, and displaying the q-th voice sentence in a second target area of the target page.
5. A method according to any one of claims 1 to 3, wherein the speech to be recognized comprises at least two speech channels;
the responding to the received voice recognition request, processing the voice to be recognized in the voice recognition request, and obtaining a first voice segment corresponding to the first voice channel and a second voice segment corresponding to the second voice channel comprises the following steps:
In response to receiving the voice recognition request, carrying out channel splitting processing on the voice to be recognized to obtain a first voice to be recognized corresponding to the first voice channel and a second voice to be recognized corresponding to the second voice channel;
respectively performing voice activation detection processing on the first voice to be recognized and the second voice to be recognized to obtain at least one first sub voice to be recognized and at least one second sub voice to be recognized; and
And respectively carrying out voice recognition processing on the at least one first sub voice to be recognized and the at least one second sub voice to be recognized to obtain at least one first voice section and at least one second voice section.
6. The method of claim 5, wherein performing speech recognition processing on the at least one first sub-speech to be recognized and the at least one second sub-speech to be recognized, respectively, to obtain at least one first speech segment and at least one second speech segment comprises:
for each first sub-voice to be recognized of the at least one first sub-voice to be recognized,
Performing voice recognition processing on the first sub-voice to be recognized to obtain a first number of first characters and a second number of first words, wherein each first word corresponds to first word timestamp information; and
For each of the at least one second sub-voices to be recognized,
And performing voice recognition processing on the second sub-voices to be recognized to obtain a third number of second characters and a fourth number of second words, wherein each second word corresponds to second word timestamp information.
7. The method of claim 6, wherein the first vocabulary timestamp information comprises a first vocabulary start time and a first vocabulary end time;
The processing the first voice segment and the second voice segment respectively to obtain at least one first voice sentence and at least one second voice sentence includes:
Performing character detection processing on the first number of first characters in the first voice segment according to predetermined characters to obtain a first character detection result;
determining a fifth number of first words corresponding to the first character in response to the first character detection result characterizing that the first character matches the predetermined character;
Splitting the first voice segment according to the fifth number of first vocabularies to obtain the first voice sentence; and
And determining the starting time and the ending time corresponding to the first voice sentence according to the first vocabulary starting time and the first vocabulary ending time corresponding to the fifth number of first vocabularies.
8. The method of claim 6, wherein the second vocabulary timestamp information comprises a second vocabulary start time and a second vocabulary end time;
The processing the first voice segment and the second voice segment respectively to obtain at least one first voice sentence and at least one second voice sentence includes:
Performing character detection processing on the second number of second characters in the second voice segment according to predetermined characters to obtain a second character detection result;
Determining a sixth number of second words corresponding to the second character in response to the second character detection result characterizing that the second character matches the predetermined character;
Splitting the second voice segment according to the sixth number of second vocabularies to obtain the second voice sentence; and
And determining the starting time and the ending time corresponding to the second voice sentence according to the second vocabulary starting time and the second vocabulary ending time corresponding to the sixth number of second vocabularies.
9. A speech recognition apparatus comprising:
The first processing module is used for responding to a received voice recognition request, processing voice to be recognized in the voice recognition request and obtaining a first voice segment corresponding to a first voice channel and a second voice segment corresponding to a second voice channel;
The second processing module is used for respectively processing the first voice section and the second voice section to obtain at least one first voice sentence and at least one second voice sentence, wherein each first voice sentence and each second voice sentence respectively correspond to a voice channel identifier, a starting time and an ending time;
The ordering module is used for ordering the at least one first voice statement and the at least one second voice statement according to the starting time to obtain an initial ordering result; and
And the third processing module is used for processing the initial sequencing result according to the voice channel identifier and the ending time to obtain a voice recognition result.
10. An electronic device, comprising:
One or more processors;
a memory for storing one or more instructions,
Wherein the one or more instructions, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 8.
11. A computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to implement the method of any of claims 1 to 8.
12. A computer program product comprising computer executable instructions for implementing the method of any one of claims 1 to 8 when executed.