CN114242067A - Speech recognition method, apparatus, device and storage medium

Speech recognition method, apparatus, device and storage medium

Info

Publication number
CN114242067A
CN114242067A (application CN202111294034.2A)
Authority
CN
China
Prior art keywords
voice
speaker
queue
speech
packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111294034.2A
Other languages
Chinese (zh)
Inventor
鄢发鹏
张洪彬
王璞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111294034.2A priority Critical patent/CN114242067A/en
Publication of CN114242067A publication Critical patent/CN114242067A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H04L65/403 Arrangements for multi-party communication, e.g. for conferences

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure provides a speech recognition method, apparatus, device and storage medium, relating to the field of computer technology, in particular to the field of artificial intelligence such as speech recognition and deep learning. The voice recognition method comprises the following steps: determining a voice signal of a speaker; adopting a special voice recognition channel to obtain a voice recognition result of the voice signal of the speaker, wherein different speakers have different special voice recognition channels; and processing the voice recognition result. The present disclosure can achieve continuous speech recognition.

Description

Speech recognition method, apparatus, device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to the field of artificial intelligence, such as speech recognition and deep learning, and more particularly, to a speech recognition method, apparatus, device, and storage medium.
Background
The continuous speech recognition technology in real-time communication is a technology for carrying out speech recognition on continuous audio of multiple persons in real-time communication and distinguishing speakers, and can provide multiple intelligent functions for real-time communication.
In the related art, some problems in real-time communication can be solved, but a solution of continuous speech recognition as a whole is lacking.
Disclosure of Invention
The present disclosure provides a voice recognition method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a speech recognition method including: determining a voice signal of a speaker; adopting a special voice recognition channel to obtain a voice recognition result of the voice signal of the speaker, wherein different speakers have different special voice recognition channels; and processing the voice recognition result.
According to another aspect of the present disclosure, there is provided a voice recognition apparatus including: the determining module is used for determining the voice signal of the speaker; the acquisition module is used for acquiring a voice recognition result of the voice signal of the speaker by adopting a special voice recognition channel, wherein different speakers have different special voice recognition channels; and the processing module is used for processing the voice recognition result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
According to the technical scheme of the present disclosure, continuous speech recognition can be realized.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIGS. 3a-3b are schematic diagrams of storing voice signals into voice queues for different numbers of received voice signal paths according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic diagram of an electronic device for implementing any one of the speech recognition methods of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The continuous speech recognition technology in real-time communication is a technology for performing speech recognition on the continuous audio of multiple persons in real-time communication while distinguishing the speakers, and it can provide multiple intelligent functions for real-time communication. An example of real-time communication is a real-time audio/video conference, and an example of such an intelligent function is real-time on-screen subtitles for the conference. Real-time on-screen subtitles means that, during a real-time audio/video conference, the subtitles are continuously refreshed to follow the speakers' voices, the voice and the text stay synchronized, and different speakers can be distinguished.
In the related art, there is no systematic, end-to-end solution for continuous speech recognition in real-time communication.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure, which provides a speech recognition method, including:
101. a speech signal of the speaker is determined.
102. And obtaining a voice recognition result of the voice signal of the speaker by adopting a special voice recognition channel, wherein different speakers have different special voice recognition channels.
103. And processing the voice recognition result.
The execution subject of this embodiment may be a speech recognition apparatus; the specific form of this apparatus is not limited and may be hardware, software, or a combination of the two. The apparatus may be located in an electronic device, which may be a user terminal or a server; the server may be a local server or a cloud server, and the user terminal may include a mobile device (such as a mobile phone or a tablet computer), a wearable device (such as a smart watch or a smart band), a smart home device (such as a smart TV or a smart speaker), and the like.
The speech recognition method of the embodiment can be applied to real-time communication, such as a real-time audio and video conference.
In the real-time audio and video conference, the participants can be multiple persons, the multiple persons refer to at least two persons, and the participants can be dozens of persons, hundreds of persons and the like when the number of the participants is large.
Taking voice recognition at the server as an example, as shown in fig. 2, a client 201 for participating in a conference may be installed on a user terminal of a participant, and the client 201 may collect voice signals of the participant. The user terminal may be provided with a voice collecting device, such as a microphone or a microphone array, and the client may collect a voice signal through the voice collecting device. Then, the client 201 sends the collected voice signal to the server 202. Referring to fig. 2, assuming that there are N participants, where N is a positive integer, the clients are respectively represented by end 1 to end N.
Any client side which collects the voice signals can send the respective voice signals to the server side. Referring to fig. 2, a voice signal transmitted from a client to a server may be referred to as an upstream.
After the server receives the voice signals sent by at least one client, it can determine the voice signals of the speakers; this process may be called voice excitation. After voice excitation, the server can select at most K paths of voice signals as the voice signals of the speakers, where K is a positive integer and a settable value, such as 6.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
After the server determines the voice signal of the speaker, the server can perform voice recognition on the voice signal of the speaker, and different special voice recognition channels are adopted for voice recognition corresponding to different speakers.
The server side can perform voice recognition on the voice signal of the speaker and can also forward the voice signal of the speaker to each client side in the N client sides. The voice signal sent by the server to the client may be referred to as downstream.
The speech signal may also be referred to as speech, audio signal, etc. Further, in real-time communication, the voice signal is a real-time voice signal, which may also be referred to as a voice stream, an audio stream, and the like. For example, voice streams can be divided into the above-mentioned upstream and downstream streams according to the transmission direction.
In real-time communication, a voice packet may also be referred to as a Real-time Transport Protocol (RTP) voice packet.
The client can collect voice signals at preset periods to form voice packets. For example, the predetermined period is 20 milliseconds (ms), so that the duration of each voice packet is 20ms, and 1 second can generate 50 voice packets.
In real-time communication, as the participants keep speaking, the client continuously collects voice packets and continuously sends them to the server; this process may also be called continuous collection and sending of voice packets, or real-time collection and sending.
The speaker's voice signal is the voice signal to be transmitted to each participant.
In contrast to the speakers, the participants who are currently producing sound may be called sounding participants; to guarantee the listening effect, the voice signals of the sounding participants are not all transmitted.
In a real-time audio/video conference scene with many participants, in order to ensure listening quality, the voice packets of every sounding participant are not all forwarded; instead, the number of speakers is limited. That is, assuming there are N participants and N is large (e.g., greater than 10), even if all N persons produce sound, K (K < N) of them are selected as speakers rather than all N.
K is a preset value; in general, K may be set to 6, i.e., at most 6 persons are treated as speakers.
Taking K = 6 as an example: if the number of sounding participants is less than or equal to 6, all of them may be selected as speakers; if it is greater than 6, 6 of them are selected as speakers.
After the voice signal of the speaker is determined, different special voice recognition channels can be adopted to obtain the voice recognition result of the voice signal of the corresponding speaker corresponding to different speakers.
For example, if there are three speakers, the first dedicated speech recognition channel may be used to obtain the speech recognition result of the first speaker, the second dedicated speech recognition channel may be used to obtain the speech recognition result of the second speaker, and the third dedicated speech recognition channel may be used to obtain the speech recognition result of the third speaker. The first dedicated voice recognition channel, the second dedicated voice recognition channel and the third dedicated voice recognition channel are different.
Because different speakers have different voice recognition channels, the voice recognition results of different speakers are obtained through different voice recognition channels, so that different speakers can be distinguished based on different voice recognition channels.
After the speech recognition result of a speaker is obtained, the result may be processed. The processing may include: sending the result to the client of each participant, and/or sending it to a business server, i.e., a server for recording the conference content. After a client obtains the recognition result, it can display the result on screen: the recognition result is the text corresponding to the speaker's voice packets, so the text content corresponding to each speaker's voice content can be shown on the client's display interface. After receiving each speaker's recognition result, the business server can record it, thereby completing the recording of the conference content.
In the embodiment of the disclosure, the voice recognition of the speaker can be realized by determining the voice signal of the speaker and obtaining the voice recognition result of the voice signal of the speaker, and the speaker can be distinguished by obtaining the voice recognition result by adopting the special voice recognition channel, so that the continuous voice recognition can be realized.
In some embodiments, the determining the voice signal of the speaker includes: receiving at least one path of voice signals; if the number of paths is less than or equal to a preset number, taking the at least one path of voice signals as the voice signals of the speaker; or, if the number of paths is greater than the preset number, selecting the preset number of voice signals, in descending order of energy value, as the voice signals of the speaker.
For example, as shown in fig. 2, the server may receive three voice signals, and assume that the preset number K is 6, at this time, since 3<6, the three voice signals may be used as the voice signal of the speaker.
For another example, the server receives 10 voice signals, and since 10>6, in this case, 6 voice signals can be selected as the voice signal of the speaker from the 10 voice signals in the order of the energy value of the voice signal from large to small.
When the client sends a voice packet, the voice packet may carry a Voice Activity Detection (VAD) value and a packet energy value; the server may calculate the energy value of the corresponding path of voice signal based on the VAD values and packet energy values of the voice packets received within a preset time period.
For example, taking end 1, the voice packets sent by end 1 may carry their VAD values and packet energy values, where a VAD value may be 1 or 0. After receiving the voice packets sent by end 1, the server may multiply the VAD value and the packet energy value of each voice packet received within a preset duration (for example, several hundred milliseconds) and accumulate the products, thereby obtaining the energy value of the voice signal of end 1, expressed by the formula:
E = \sum_{i=1}^{M} VAD_i \cdot e_i
where E is the energy value of one path of voice signal, M is the number of voice packets received on that path within the preset duration, VAD_i is the VAD value of the i-th voice packet, e_i is the packet energy value of the i-th voice packet, and i is a positive integer in [1, M]. VAD_i is 0 or 1, where 0 indicates that the voice packet is a mute packet and 1 indicates a non-mute packet; e_i may be determined based on the amplitude of the voice packet.
It is understood that the above-mentioned calculation method of the energy value of the speech signal is an example, and other related calculation methods may also be adopted for determination.
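As an illustration of this energy computation and the top-K speaker selection described above, here is a minimal Python sketch; the names VoicePacket, stream_energy, and select_speakers are illustrative, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class VoicePacket:
    seq_num: int       # uint16 sequence number carried in the packet
    vad: int           # 1 = non-mute packet, 0 = mute packet
    energy: float      # per-packet energy value, e.g. derived from amplitude
    payload: bytes = b""

def stream_energy(packets):
    """E = sum over i of VAD_i * e_i for packets received in the time window."""
    return sum(p.vad * p.energy for p in packets)

def select_speakers(streams, k=6):
    """Return up to k stream ids, ranked by descending stream energy."""
    ranked = sorted(streams, key=lambda sid: stream_energy(streams[sid]), reverse=True)
    return ranked[:k]

# 10 uplink streams received: the 6 with the highest energy become speakers.
streams = {sid: [VoicePacket(i, 1, float(sid)) for i in range(5)] for sid in range(1, 11)}
print(select_speakers(streams))  # -> [10, 9, 8, 7, 6, 5]
```

With 3 streams received, all 3 would be selected; with 10, only the 6 highest-energy streams are, matching the examples above.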
By selecting no more than the preset number of voice signals as the speakers' voice signals, the noise caused by transmitting a large number of voice signals can be avoided, improving the listening effect at the clients. Further, the voice signals of speakers in a real-time audio/video conference generally have higher energy, so selecting the higher-energy signals as the speakers' voice signals matches the actual application scenario of real-time audio/video conferences and is better suited to practical use.
In some embodiments, obtaining the speech recognition result of the speaker's voice signal using the dedicated speech recognition channel includes: sending the speaker's voice signal to an Automatic Speech Recognition (ASR) service over a long connection, and receiving the speech recognition result of the speaker's voice signal obtained by the ASR service.
The ASR service is a service that performs speech recognition on a voice signal to obtain a speech recognition result, i.e., the text corresponding to the voice signal. The ASR service may also be referred to as the ASR server, and it may be located on the same server as, or on a different server from, the service that receives the voice signals sent by the clients.
As shown in fig. 3a and fig. 3b, a preset number (e.g. 6) of voice queues may be pre-established in the service end, and each voice queue establishes an independent long connection with the ASR service.
The server refers to a server interacting with the client, unless otherwise specified.
Independent means that different speech queues correspond to different long connections so that different speakers can have different dedicated speech recognition channels.
The long connection may also be called a persistent connection: after a data transmission completes, the connection is kept open rather than broken, so the server can continuously send voice packets to the ASR service over it.
Referring to fig. 3a, if the number of received voice signals is less than or equal to the preset number, for example, the preset number is 6, and three voice signals are received, the received three voice signals may be stored in three voice queues of the 6 voice queues.
Referring to fig. 3b, if the number of received voice signals is greater than the preset number, for example, the preset number is 6, and 10 voice signals are received, 6 voice signals may be selected from the 10 received voice signals and stored in 6 voice queues.
In fig. 3a and 3b, the voice queue with the voice signal stored therein is indicated by black fill.
Further, a voice signal may be stored and transmitted as a sequence of voice packets.
Each voice packet may carry a stream identifier (stream_id) and a sequence number (seq_num).
The stream identifier may be of type uint64 and the sequence number of type uint16.
The stream identifier identifies the voice signal stream to which a voice packet belongs. In general, one client corresponds to one voice signal stream, so the stream identifier can be used to identify a client or a user, and voice signals generated by different clients or users have different stream identifiers. When a real-time audio/video conference is held, a unique stream identifier can be assigned to the client of each participant.
The sequence number identifies the order of the voice packets and increases continuously in the order in which the packets are collected. For example, a client generates 50 voice packets in sequence within 1 s after entering a meeting, with sequence numbers 1 to 50; when seq_num reaches the maximum value of a uint16, it automatically wraps around to 0.
Although the voice packets of a given stream are sent in order from the client, audio/video media are transmitted over the User Datagram Protocol (UDP), so the order in which the voice packets arrive at the server is random; the network cannot guarantee that data sent first arrives first. Therefore, to perform speech recognition on the voice signal, the received voice packets must be reordered and sent to the ASR service in the order in which they were collected before they can be recognized correctly.
To simultaneously recognize up to K audio streams in a conference, after conference recognition is started the server allocates 6 ordered voice-packet queues (similar to a Map data structure) for each conference. Each ordered queue has a unique queue identifier (stream_queue_id). The 6 queues receive the voice packets of the K (K <= 6) streams that need to be recognized in the conference: when K = 6, every queue receives voice packets; when K < 6, some queues receive no audio stream. The elements stored in an ordered queue are key-value pairs, where the key is the seq_num of a voice packet and the value is the voice packet data (payload). The elements in the ordered queue are sorted by seq_num from small to large; each time the server receives a voice packet belonging to an audio stream that needs to be recognized, the packet is inserted into the corresponding ordered queue and automatically sorted.
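A minimal sketch of such an ordered queue follows; this is an assumed implementation, since the patent only specifies the Map-like, seq_num-sorted behavior:

```python
import bisect
from typing import Optional

class OrderedVoiceQueue:
    """Ordered voice-packet queue: key = seq_num, value = packet payload."""

    def __init__(self, stream_queue_id: int):
        self.stream_queue_id = stream_queue_id
        self.seq_nums = []      # keys kept sorted ascending; smallest = queue head
        self.packets = {}       # seq_num -> payload
        self.read_state = {}    # seq_num -> True once read and sent to ASR

    def insert(self, seq_num: int, payload: bytes) -> None:
        # Insertion keeps the key list sorted, i.e. the "automatic ordering".
        if seq_num not in self.packets:
            bisect.insort(self.seq_nums, seq_num)
        self.packets[seq_num] = payload
        self.read_state[seq_num] = False

    def get(self, seq_num: int) -> Optional[bytes]:
        return self.packets.get(seq_num)

q = OrderedVoiceQueue(stream_queue_id=1)
for s in (200, 199, 201):       # out-of-order arrival over UDP
    q.insert(s, b"audio-frame")
print(q.seq_nums)               # -> [199, 200, 201]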
The server can establish a long connection with the ASR service for each voice queue, so each path of voice signal uses its own independent long connection to send voice packets to the ASR service and receive the speech recognition results it returns. Through the mapping between stream identifiers and queue identifiers, the client or user corresponding to each path's recognition result can be accurately distinguished.
By adopting a long connection mode, the processing speed can be improved, and the time delay can be reduced.
In some embodiments, the voice signal is stored in the voice queue as a sequence of voice packets, and the sending of the speaker's voice signal to the ASR service and receiving of the speech recognition result obtained by the ASR service includes: sending the voice packets to the ASR service in order; and receiving the speech recognition results of the voice packets obtained by the ASR service.
For example, corresponding to a certain path of voice signal, the voice signals are sorted in the voice queue according to the sequence of the first voice packet, the second voice packet, and the like, so that during voice recognition, the voice signals are sequentially sent to the ASR service according to the sequence of the first voice packet and the second voice packet, and the voice recognition results of the voice packets are sequentially obtained.
Because voice packets are sent to the ASR service and their recognition results received packet by packet, processing is done at the granularity of voice packets; recognition results can thus be obtained essentially in real time, keeping voice and text synchronized. Further, sending the packets in order ensures the accuracy of the recognition results.
Further, the voice packet carries a sequence number, and the sending the voice packet to the ASR service in sequence includes: determining a current sequence number, wherein the current sequence number increases in sequence; and sending the voice packet with the sequence number as the current sequence number to the ASR service.
The current sequence number increases monotonically: starting from an initial value, it is incremented by 1 each time one voice packet is read from the voice queue. For example, if the initial value of the current sequence number is 1, the voice packet with sequence number 1 is read and sent to the ASR service; then the current sequence number is incremented by 1, i.e., from 1 to 2, and accordingly the voice packet with sequence number 2 is read and sent to the ASR service.
For example, when voice packets are collected at the client, their sequence numbers are 1, 2, 3; but since transmission may reorder them, the server may receive them in the order 1, 3, 2. If the packets were processed in the order received, packet 3 would be processed before packet 2, causing speech recognition errors. By incrementing the current sequence number in order and reading the packet whose sequence number equals it for ASR processing, the packets are guaranteed to be processed in order: after packet 1 is processed, the current sequence number becomes 2 and the packet with sequence number 2 is processed, so even if packet 3 arrives first, packet 2 is processed before packet 3. The processing order thus matches the collection order, ensuring the accuracy of the speech recognition result.
Further, the method further comprises: if the voice packet with the serial number being the current serial number is not obtained, waiting for a preset time length; and if the voice packet of the current serial number is not obtained after the preset time length, sequentially obtaining a next voice packet of the current serial number, and sending the next voice packet to the ASR service.
Since the voice queue (or queue for short) automatically sorts voice packets by seq_num, all packets inserted into a voice queue are automatically ordered.
As shown in fig. 4, in each voice queue (for example, queue_1), the voice packets may be arranged in order of seq_num from small to large (e.g., starting from 1).
While receiving voice packets, the server records the current sequence number (current_seq_num); its initial value is the sequence number of the first voice packet received by the queue.
The server may look up the corresponding voice packet in the queue according to the current sequence number; because the packets in the queue are already ordered, the packet can be found quickly by binary search. Once the packet with the current sequence number is found, it is sent to the ASR service. As shown in fig. 4, the voice packets sent to the ASR service (denoted ASR in fig. 4) are transmitted in order, e.g., in seq_num order 199, 200, 201. Further, if a voice packet is encoded, it may be decoded first and the decoded packet sent to the ASR service.
The connection between the service end and the ASR service is a long Transmission Control Protocol (TCP) connection, which can ensure data continuity.
Each time a voice packet is read from a queue, current_seq_num is incremented by 1. A read voice packet is not immediately deleted but is marked as read. In fig. 4, white-filled voice packets indicate the read state is true (read), and gray-filled voice packets indicate the read state is false (unread).
When the packet with a certain seq_num cannot be found, it means the packet has not arrived yet due to network delay, or has been lost due to network packet loss. The server waits a certain time (jitter) for this packet; once this time is exceeded, the server skips it and directly reads the packet corresponding to the next seq_num.
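A sketch of this read-wait-skip logic, building on the OrderedVoiceQueue sketch above; the jitter default and the send_to_asr callback are placeholders, not values from the patent:

```python
import time
from typing import Callable, Optional

def read_next(queue, current_seq_num: int,
              send_to_asr: Callable[[bytes], None],
              jitter: float = 0.2, poll: float = 0.01) -> int:
    """Read the packet at current_seq_num, waiting up to `jitter` seconds;
    send it to the ASR service if found, then return the next sequence number."""
    deadline = time.monotonic() + jitter
    payload: Optional[bytes] = queue.get(current_seq_num)
    while payload is None and time.monotonic() < deadline:
        time.sleep(poll)                          # packet delayed: keep polling
        payload = queue.get(current_seq_num)
    if payload is not None:
        send_to_asr(payload)                      # decode + send over the long connection
        queue.read_state[current_seq_num] = True  # mark as read; not deleted yet
    # If the packet never arrived (lost), it is simply skipped.
    return (current_seq_num + 1) & 0xFFFF         # seq_num wraps as a uint16
```

Calling read_next in the server's per-queue loop reproduces the behavior described: wait up to jitter for a late packet, skip it if lost, and always advance in collection order.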
In the process of continuously transmitting audio to the ASR service, the ASR service can continuously return a speech recognition result in real time. The server side can forward the voice recognition result to each client side, and therefore functions such as subtitle screen display can be achieved.
Because a long connection is established between the server and the ASR service, recognition results are received and called back very quickly; typically the ASR service returns a result every 100-200 ms, roughly one result per recognized word. Each queue is processed with this logic in a thread loop: the uplink voice-packet buffer queue is scanned every 20 ms and packets are inserted into the corresponding ordered queues, while every 10 ms a packet is read from each queue, decoded, and sent for recognition. Even if reading waits on a missing packet so that recognition of one path temporarily lags the real-time audio, the server can catch up in time and quickly recover real-time performance over the course of the conference.
Furthermore, the waiting time (jitter) can be adjusted for different service requirements to achieve different goals. Setting a smaller jitter avoids long waits caused by a delayed or lost packet and achieves a faster recognition speed; setting a larger jitter ensures that higher-quality audio is sent for recognition and thus a better recognition result, which is very important in certain scenarios.
Because the current sequence number increases in order and packets are read according to it, the voice packets can be sent to the ASR service in order.
Waiting a preset duration before retrying improves reliability; and if the packet with the current sequence number still has not been obtained after that duration, moving on to the next packet avoids long waits and reduces latency.
In some embodiments, the dedicated speech recognition channel is a long connection between a speech queue and an ASR service, and different speakers correspond to different speech queues, the method further comprising: and storing the voice signal of the speaker into a voice queue corresponding to the speaker.
Further, the voice signal is stored in the voice queue by taking a voice packet as a sequence unit, where the voice packet carries a sequence number, and the method further includes: if the sequence number of the currently stored voice packet is larger than the maximum queue length of the voice queue and the emptying operation is not triggered, deleting the voice packets in the voice queue in sequence; or, if the sequence number of the currently stored voice packet triggers an emptying operation, emptying the voice queue.
For example, referring to FIGS. 3 a-3 b, the speaker's speech signals may be stored in separate speech queues.
In order to limit the memory usage, the server side limits the size of each queue.
The maximum queue length (max_cached_num) can be set according to service requirements; for example, max_cached_num = 200. The server then buffers 200 voice packets for each audio stream, i.e., roughly the past 4 s of audio (because of possible packet delay and packet loss, the sequence numbers of the packets in the queue are not necessarily contiguous, so it is not exactly 4 s).
When the number of voice packets in the queue accumulates to max_cached_num, each time a new voice packet is inserted, the head packet of the queue, i.e., the packet with the smallest seq_num, is deleted; at that point the smallest-seq_num packet is data from about 4 s ago and has generally already been read and sent to the ASR service for recognition.
As shown in fig. 4, for example, with a maximum queue length of 200 and the voice packet with sequence number 101 lost, when the voice packet with sequence number 202 is inserted, the voice packet with seq_num 1 is deleted (packets 1 to 201 minus the lost packet 101 make exactly 200).
It can be understood that the value of the queue identifier and the value of the flow identifier in fig. 4 are only examples, and in practical application, assuming that the number of created voice queues is 6, the value of the queue identifier may specifically be any integer value from 1 to 6; if the method is applied to a real-time audio and video conference scene and the number of the participators is N, the value of the stream identifier can be any integer value from 1 to N.
By storing the voice signals in the voice queue, buffering of the voice signals can be achieved.
By deleting the speech signals in the speech queue in order, overflow of the speech signals can be avoided.
The emptying operation may be triggered when the absolute value of the difference between the sequence number of the incoming voice packet and the largest sequence number already in the queue is greater than half the maximum sequence number.
When seq_num reaches 65535, the maximum value of a uint16, the seq_num of subsequent voice packets restarts from the minimum value (e.g., 0 or 1). If the head of the queue were still deleted to keep the queue size below max_cached_num, the newest voice packets, which now have the smallest seq_num values, would be deleted directly.
To avoid deleting the packets just inserted, the queue is emptied instead, which also ensures that the newest voice packets remain buffered in the queue.
That is, when a voice packet is to be stored in the voice queue, the queue is emptied if the sequence number of the incoming packet (new_seq_num) and the largest sequence number already in the queue (max_seq_num) satisfy:
abs(new_seq_num - max_seq_num) > max_uint16 / 2;
where abs is the absolute-value function and max_uint16 = 65535 is the maximum value of a uint16.
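Combining the two deletion rules, here is a sketch of the store path, again building on the OrderedVoiceQueue sketch above; max_cached_num = 200 follows the example given earlier, and the wraparound threshold follows the formula, assumed here to be half the uint16 range:

```python
MAX_CACHED_NUM = 200          # ~4 s of audio at one 20 ms packet per period
MAX_UINT16 = 65535

def store(queue, new_seq_num: int, payload: bytes) -> None:
    # Wraparound check: the new seq_num restarted near 0, so head-of-queue
    # eviction would delete the newest packets. Empty the queue instead.
    if queue.seq_nums and abs(new_seq_num - queue.seq_nums[-1]) > MAX_UINT16 // 2:
        queue.seq_nums.clear()
        queue.packets.clear()
        queue.read_state.clear()
    queue.insert(new_seq_num, payload)
    # Ordinary overflow: drop the head packet (smallest seq_num, oldest audio).
    if len(queue.seq_nums) > MAX_CACHED_NUM:
        oldest = queue.seq_nums.pop(0)
        queue.packets.pop(oldest, None)
        queue.read_state.pop(oldest, None)
```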
Through the above process, voice packet reception, reading, and sending for recognition are all handled in real time, so receiving, reading, and recognizing happen almost simultaneously, and a received voice packet is quickly read, decoded, and recognized. Most voice packets buffered in a queue have therefore already been read, and the occasional emptying of a queue (roughly once every 21 minutes, since 65536 packets x 20 ms is about 1311 s) does not affect speech recognition. The queue's voice packet buffer solves the problem of having to wait for audio when some packets arrive late or are lost, and ensures that uplink voice packets can still be buffered normally while a packet is awaited. The general state of an ordered queue can be seen in fig. 4.
In some embodiments, the voice signal carries a stream identification, the method further comprising: and establishing a mapping relation between the flow identifier and a queue identifier of a voice queue in which the voice packet is stored.
As described above, a voice signal received by the server carries the stream identifier stream_id. After the server determines that a certain path of voice signal is a speaker's voice signal, it can allocate different voice queues to different speakers' voice signals, and different voice queues have different queue identifiers (stream_queue_id). A mapping between the stream identifier and the queue identifier can therefore be established; for example, referring to fig. 4, a mapping between stream identifier 0x238967 and queue identifier 0x34fd45.
It is understood that when the speaker is updated, the mapping relationship can be updated.
As described above, the server can perform speech recognition on K paths of voice signals (also called audio streams) in the real-time audio/video conference. In an actual conference, the K audio streams are not fixed but change from time to time. Specifically, the server's voice excitation algorithm may recalculate the speaker list every few hundred ms. If the calculated list does not change, the speakers in the conference have not changed. When a speaker changes, the new audio stream needs to be forwarded to each end and simultaneously sent to the ASR service for speech recognition. Since the downlink has only 6 fixed channels and there are only 6 queues for recognition, some stream_queue_id queues need to start receiving the audio stream of a new stream_id and stop receiving the audio stream of a participant who is no longer a speaker; this process may be called recognition switching. Recognition switching can occur in the following cases:
(1) A new speaker is added. Previously only k (k < 6) persons were speaking and k audio streams were being recognized; now another person in the conference starts speaking, so the audio of k + 1 persons must be recognized.
(2) A participant who was not speaking begins to speak and crowds out someone who was previously in the speaker list, as determined by the voice excitation algorithm.
When the order of the speaker list calculated by voice excitation changes but its members do not, the audio streams to be recognized are unchanged, so no recognition switching occurs.
When case (1) occurs, a queue starts to receive the new audio stream; the server creates a long connection with the ASR service, establishes the mapping between the corresponding stream_queue_id and stream_id, and performs speech recognition on this audio stream as described above.
When case (2) occurs, no new mapping needs to be created; the existing mapping only needs to be updated. That is, the mapping of the affected stream_queue_id is updated to the stream_id of the new stream, the queue's current sequence number current_seq_num is updated to the seq_num of the first packet of the new audio stream, the queue is emptied (because it caches the audio data of the previous speaker), the previous long ASR connection is actively disconnected, and a new long ASR connection is established; speech recognition is then performed on the new audio stream as described above.
For example, after a recognition switch occurs, the voice stream with stream_id 0x233578 is stored into the voice queue whose stream_queue_id is 0x34fd45, where the first packet in the queue is the first voice packet received, e.g., seq_num = 367. At this point, the stream_id corresponding to queue identifier 0x34fd45 in the mapping is updated to 0x233578, and the long connection between queue 1 and the ASR service is switched from ASR connection_1 to ASR connection_2.
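A sketch of the bookkeeping for case (2); asr_disconnect and asr_connect stand in for however the long connections are actually torn down and set up, and are not APIs from the patent:

```python
def switch_recognition(queue, mapping: dict, new_stream_id: int,
                       first_packet_seq: int, asr_connect, asr_disconnect) -> int:
    """Reuse an existing recognition queue for a new speaker's stream."""
    mapping[queue.stream_queue_id] = new_stream_id   # stream_queue_id -> new stream_id
    # Empty the queue: it still caches the previous speaker's audio.
    queue.seq_nums.clear()
    queue.packets.clear()
    queue.read_state.clear()
    asr_disconnect(queue.stream_queue_id)            # drop the old long ASR connection
    asr_connect(queue.stream_queue_id)               # establish a fresh one
    return first_packet_seq                          # becomes the queue's current_seq_num
```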
By establishing the mapping relationship between the stream identifier and the queue identifier, each speech recognition result can be conveniently associated with the corresponding user.
In some embodiments, the method is applied to an audio-video conference, and the method further comprises:
the voice signal of the speaker is forwarded to the client of the participant of the audio and video conference; and/or,
the processing the speech recognition result comprises: sending the voice recognition result to the client; and/or sending the voice recognition result to a server for recording conference content.
For example, the audio and video conference is a real-time audio and video conference, and when the real-time audio and video conference is performed, the server determines voice signals of 6 speakers, and then the voice signals of the 6 speakers can be forwarded to the clients of N (N is greater than or equal to 6) individuals participating in the conference.
In addition, the server can also obtain the voice recognition result of the voice signals of 6 speakers from the ASR service and send the voice recognition result to the N clients, and after receiving the voice recognition result, the clients can display the voice recognition result on a screen. Further, the speech recognition result may include a user identifier, such as a stream identifier, so that the speech recognition result of each speaker can be distinguished.
The server side can also send the voice recognition result to a server for recording the conference content, so that the recording of the conference content is realized.
By forwarding the speaker's voice signal, rather than forwarding all of the voice signal, the listening effect can be guaranteed. By sending the voice recognition result to the client, the client can perform processing such as on-screen display on the voice recognition result. Conference recording can be achieved by sending the speech recognition result to a server for recording conference content.
In the embodiment of the disclosure, the server performs speech recognition on each client's audio stream, and the recognition results are distributed to each client through the network to realize real-time on-screen subtitles, or sent to the business server for automatic text recording of the conference. Real-time performance is extremely important to the on-screen subtitle effect and is one of the most important guarantees of product experience. In the disclosed embodiment, delay mainly arises while waiting for a voice packet; the maximum wait is jitter, and this parameter is configurable, so if a product particularly emphasizes real-time behavior, a smaller jitter can be set. Meanwhile, sending a recognition result from the server to each client over the network incurs some delay, but the downlink audio from the server to each client travels the same path, so voice and text reach the client almost simultaneously. In addition, when the client plays the audio, waiting for downlink voice packets causes a certain lag (about 50 ms), which ensures that the recognition result reaches the client just in time, or even ahead of the downlink playback, guaranteeing the synchronization of the voice and the recognized subtitles.
Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. As shown in fig. 6, the present embodiment provides a speech recognition method, including:
601. the server receives the voice signal sent by the client.
The voice signal transmitted by the client may also be referred to as an upstream.
602. The server determines the voice signal of the speaker through a voice excitation algorithm.
603. The server stores the voice signals of the speaker into the corresponding voice queues respectively.
The voice excitation algorithm can calculate at most 6 paths of current high-energy audio streams (fewer when fewer than 6 paths are active), and these streams are sent to the clients as downlink audio streams. At the same time, these streams are sent to the 6 ordered queues of the recognition module.
604. The server performs voice recognition on the voice signal of the speaker to obtain a voice recognition result.
The voice signal can be stored in the voice queue as a sequence of voice packets; the server reads voice packets from the voice queue in order, decodes them, and sends them to the ASR service, and the ASR service performs speech recognition on each received voice packet to obtain a recognition result and returns it to the server.
Specifically, the 6 ordered queues receive the audio streams of the speaker list calculated by the voice excitation algorithm; when a speaker changes, the audio streams are switched, corresponding to the recognition switching described above. Each queue independently maintains its own connection to the ASR service, and each path of audio is recognized separately, so the speaker corresponding to each recognition result is automatically distinguished.
Each queue reads voice packets according to the current read sequence number (current_seq_num) it records, decodes them, and sends them to the recognition (ASR) service; after each read, current_seq_num is incremented by 1. When the packet to be sent cannot be found, the queue keeps searching for up to the jitter time and skips the packet if it still has not arrived after the wait.
605. The server side sends the voice signal of the speaker and the voice recognition result thereof to each client side.
In real-time communication, such as a real-time audio/video conference, steps 601-605 are executed repeatedly. That is, while the server continuously sends voice packets to the ASR service, the ASR service continuously produces recognition results; after receiving a result, the server issues it, e.g., to the clients for on-screen subtitles, and/or to the business server to form the conference record.
Through 601-605, the whole system continuously performs speech recognition while receiving the uplink audio streams, realizing continuous real-time speech recognition in real-time communication.
In the embodiment of the disclosure, speech recognition is performed on the speakers' voice signals through the long connections between the voice queues and the ASR service, so the speech in a real-time audio/video conference can be recognized and each speaker distinguished. By recognizing at most 6 paths of voice signals, unnecessary recognition connections are avoided and recognition resources are saved. By recognizing voice packets in real time, on-screen subtitles can be produced with essentially zero recognition delay, and the subtitles can be refreshed word by word.
Fig. 7 is a schematic diagram according to a seventh embodiment of the present disclosure, which provides a speech recognition apparatus. As shown in fig. 7, the apparatus 700 includes: a determining module 701, an obtaining module 702 and a processing module 703.
The determining module 701 is used for determining a voice signal of a speaker; the obtaining module 702 is configured to obtain a speech recognition result of the speech signal of the speaker by using a dedicated speech recognition channel, where different speakers have different dedicated speech recognition channels; the processing module 703 is configured to process the speech recognition result.
In some embodiments, the determining module 701 is specifically configured to:
receiving at least one path of voice signals;
if the number of the at least one path is less than or equal to the preset number, taking the voice signal of the at least one path as the voice signal of the speaker; or,
and if the number of the at least one path is greater than the preset number, sequentially selecting the preset number of voice signals as the voice signals of the speaker based on the energy value of the voice signals of the at least one path.
In some embodiments, the dedicated speech recognition channel is a long connection between a speech queue and an ASR service, and different speakers correspond to different speech queues, and the obtaining module 702 is specifically configured to:
and sending the voice signal of the speaker to the ASR service by adopting the long connection, and receiving a voice recognition result of the voice signal of the speaker obtained by the ASR service.
In some embodiments, the voice signal is stored in the voice queue with a voice packet as a sequence unit, and the obtaining module 702 is further specifically configured to:
sending the voice packets to the ASR service in sequence;
and receiving a voice recognition result of the voice packet obtained by the ASR service.
In some embodiments, the voice packet carries a sequence number, and the obtaining module 702 is further specifically configured to:
determining a current sequence number, wherein the current sequence number increases in sequence;
and sending the voice packet with the sequence number as the current sequence number to the ASR service.
In some embodiments, the apparatus 700 further comprises:
the holding module is used for waiting for a preset time length if the voice packet with the serial number as the current serial number is not acquired;
and the sending module is used for obtaining the next voice packet of the voice packet with the current serial number in sequence and sending the next voice packet to the ASR service if the voice packet with the current serial number is not obtained after the preset duration.
In some embodiments, the dedicated speech recognition channel is a long connection between a speech queue and an ASR service, and different speakers correspond to different speech queues, the apparatus 700 further comprising:
and the storage module is used for storing the voice signal of the speaker into the voice queue corresponding to the speaker.
In some embodiments, the voice signal is stored in the voice queue in units of voice packets as sequence numbers, the voice packets carrying sequence numbers, the apparatus 700 further comprises:
the deleting module is used for deleting the voice packets in the voice queue in sequence if the sequence number of the currently stored voice packet is larger than the maximum queue length of the voice queue and the emptying operation is not triggered; or,
and the emptying module is used for emptying the voice queue if the sequence number of the currently stored voice packet triggers an emptying operation.
In some embodiments, the voice signal carries a stream identification, and the apparatus 700 further comprises:
and the mapping module is used for establishing a mapping relation between the stream identifier and the queue identifier of the voice queue in which the voice packet is stored.
In some embodiments, the apparatus is applied to an audio-video conference, and the apparatus 700 further includes:
the forwarding module is used for forwarding the voice signal of the speaker to the client of the participant of the audio and video conference; and/or,
the processing module 703 is specifically configured to:
sending the voice recognition result to the client; and/or,
and sending the voice recognition result to a server for recording the conference content.
In the embodiment of the disclosure, the voice recognition of the speaker can be realized by determining the voice packet of the speaker and obtaining the voice recognition result of the voice packet of the speaker, and the speaker can be distinguished by obtaining the voice recognition result by adopting the special voice recognition channel, so that the continuous voice recognition can be realized.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
It is to be understood that in the disclosed embodiments, the same or similar elements in different embodiments may be referenced.
It is to be understood that "first", "second", and the like in the embodiments of the present disclosure are used for distinction only, and do not indicate the degree of importance, the order of timing, and the like.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the electronic device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the methods and processes described above, such as the speech recognition method. For example, in some embodiments, the speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the speech recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the speech recognition method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and is coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. A speech recognition method comprising:
determining a voice signal of a speaker;
adopting a dedicated speech recognition channel to obtain a voice recognition result of the voice signal of the speaker, wherein different speakers have different dedicated speech recognition channels; and
and processing the voice recognition result.
2. The method of claim 1, wherein the determining the voice signal of the speaker comprises:
receiving at least one path of voice signals;
if the number of the at least one path is less than or equal to a preset number, taking the at least one path of voice signals as the voice signal of the speaker; or
if the number of the at least one path is greater than the preset number, selecting, in descending order of energy value, the preset number of voice signals from the at least one path of voice signals as the voice signals of the speakers.
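A non-limiting sketch of the selection described in claim 2, assuming each path is a list of PCM samples and using root-mean-square energy; the preset number of 3 is an illustrative value, not one fixed by the claim:

    import math

    def select_speaker_signals(paths, preset_number=3):
        # paths maps a path id to its audio samples.
        if len(paths) <= preset_number:
            return list(paths)               # every path is a speaker signal
        def rms(samples):
            return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
        ranked = sorted(paths, key=lambda p: rms(paths[p]), reverse=True)
        return ranked[:preset_number]        # highest-energy paths are selected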
3. The method of claim 1, wherein the dedicated speech recognition channel is a long connection between a voice queue and an Automatic Speech Recognition (ASR) service, different speakers correspond to different voice queues, and the obtaining the voice recognition result of the voice signal of the speaker by adopting the dedicated speech recognition channel comprises:
sending the voice signal of the speaker to the ASR service over the long connection, and receiving the voice recognition result of the voice signal of the speaker obtained by the ASR service.
4. The method according to claim 3, wherein the voice signal is stored in the voice queue in units of voice packets, and the sending the voice signal of the speaker to the ASR service and receiving the voice recognition result of the voice signal of the speaker obtained by the ASR service comprises:
sending the voice packets to the ASR service in sequence; and
receiving the voice recognition result of each voice packet obtained by the ASR service.
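One way the long connection of claims 3 and 4 might look, sketched over a raw TCP socket with a 4-byte length prefix per message; the host, port, and wire format are assumptions of this sketch, as the claims do not fix a transport protocol:

    import socket

    def _recv_exact(conn, n):
        # Read exactly n bytes from the connection.
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("ASR service closed the connection")
            buf += chunk
        return buf

    def recognize_packets(packets, host="asr.example.com", port=9000):
        results = []
        # One long-lived connection per speaker, reused for every packet.
        with socket.create_connection((host, port)) as conn:
            for packet in packets:
                conn.sendall(len(packet).to_bytes(4, "big") + packet)
                size = int.from_bytes(_recv_exact(conn, 4), "big")
                results.append(_recv_exact(conn, size).decode("utf-8"))
        return results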
5. The method according to claim 4, wherein each voice packet carries a sequence number, and the sending the voice packets to the ASR service in sequence comprises:
determining a current sequence number, wherein the current sequence number increases in sequence; and
sending the voice packet whose sequence number is the current sequence number to the ASR service.
6. The method of claim 5, further comprising:
if the voice packet whose sequence number is the current sequence number is not obtained, waiting for a preset duration; and
if the voice packet with the current sequence number is still not obtained after the preset duration, obtaining the next voice packet in sequence and sending the next voice packet to the ASR service.
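Claims 5 and 6 together describe in-order sending with a bounded wait for missing packets. A minimal sketch, assuming the buffered packets are kept in a dictionary keyed by sequence number and using 0.5 s as an illustrative stand-in for the preset duration:

    import time

    def send_in_order(buffered, send_to_asr, preset_wait=0.5):
        if not buffered:
            return
        current = min(buffered)                  # start from the smallest buffered number
        while buffered:
            if current not in buffered:
                time.sleep(preset_wait)          # wait the preset duration for the packet
                if current not in buffered:
                    current = min(buffered)      # still missing: skip to the next packet
            send_to_asr(buffered.pop(current))
            current += 1

For instance, send_in_order({1: b"a", 2: b"b", 4: b"c"}, print) sends packets 1 and 2, waits briefly for the missing packet 3, and then skips to packet 4.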
7. The method of any of claims 1-6, wherein the dedicated speech recognition channel is a long connection between a voice queue and an ASR service, and different speakers correspond to different voice queues, the method further comprising:
storing the voice signal of the speaker in the voice queue corresponding to the speaker.
8. The method of claim 7, wherein the voice signal is stored in the voice queue in units of voice packets, each voice packet carrying a sequence number, the method further comprising:
if the sequence number of the currently stored voice packet is greater than the maximum queue length of the voice queue and no emptying operation is triggered, deleting the voice packets in the voice queue in sequence; or
if the sequence number of the currently stored voice packet triggers an emptying operation, emptying the voice queue.
9. The method of claim 7, wherein the voice signal carries a stream identifier, the method further comprising:
establishing a mapping relationship between the stream identifier and the queue identifier of the voice queue in which the voice packets are stored.
10. The method according to any one of claims 1-6, applied to an audio-video conference, further comprising:
forwarding the voice signal of the speaker to the clients of the participants of the audio-video conference; and/or
the processing the speech recognition result comprises:
sending the voice recognition result to the clients; and/or
sending the voice recognition result to a server that records the conference content.
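The fan-out of claim 10 amounts to delivering one result to several sinks; a sketch in which the send method on each client and on the recording server is a hypothetical interface used only for illustration:

    def deliver_result(result, clients, recording_server=None):
        # Send the recognition result to every conference client, and/or to a
        # server that records the conference content.
        for client in clients:
            client.send(result)
        if recording_server is not None:
            recording_server.send(result)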
11. A speech recognition apparatus comprising:
a determining module, used for determining the voice signal of the speaker;
an obtaining module, used for obtaining the voice recognition result of the voice signal of the speaker by adopting a dedicated speech recognition channel, wherein different speakers have different dedicated speech recognition channels; and
a processing module, used for processing the voice recognition result.
12. The apparatus of claim 11, wherein the determining module is specifically configured to:
receive at least one path of voice signals;
if the number of the at least one path is less than or equal to a preset number, take the at least one path of voice signals as the voice signal of the speaker; or
if the number of the at least one path is greater than the preset number, select, in descending order of energy value, the preset number of voice signals from the at least one path of voice signals as the voice signals of the speakers.
13. The apparatus of claim 11, wherein the dedicated speech recognition channel is a long connection between a voice queue and an ASR service, different speakers correspond to different voice queues, and the obtaining module is specifically configured to:
send the voice signal of the speaker to the ASR service over the long connection, and receive the voice recognition result of the voice signal of the speaker obtained by the ASR service.
14. The apparatus according to claim 13, wherein the voice signal is stored in the voice queue in units of voice packets, and the obtaining module is further specifically configured to:
send the voice packets to the ASR service in sequence; and
receive the voice recognition result of each voice packet obtained by the ASR service.
15. The apparatus according to claim 14, wherein each voice packet carries a sequence number, and the obtaining module is further specifically configured to:
determine a current sequence number, wherein the current sequence number increases in sequence; and
send the voice packet whose sequence number is the current sequence number to the ASR service.
16. The apparatus of claim 14, further comprising:
a holding module, used for waiting for a preset duration if the voice packet whose sequence number is the current sequence number is not obtained; and
a sending module, used for obtaining the next voice packet in sequence and sending the next voice packet to the ASR service if the voice packet with the current sequence number is still not obtained after the preset duration.
17. The apparatus according to any of claims 11-16, wherein the dedicated speech recognition channel is a long connection between a voice queue and an ASR service, and different speakers correspond to different voice queues, the apparatus further comprising:
a storage module, used for storing the voice signal of the speaker in the voice queue corresponding to the speaker.
18. The apparatus of claim 17, wherein the voice signal is stored in the voice queue in units of voice packets, each voice packet carrying a sequence number, the apparatus further comprising:
a deleting module, used for deleting the voice packets in the voice queue in sequence if the sequence number of the currently stored voice packet is greater than the maximum queue length of the voice queue and no emptying operation is triggered; or
an emptying module, used for emptying the voice queue if the sequence number of the currently stored voice packet triggers an emptying operation.
19. The apparatus of claim 17, wherein the voice signal carries a stream identifier, the apparatus further comprising:
a mapping module, used for establishing a mapping relationship between the stream identifier and the queue identifier of the voice queue in which the voice packets are stored.
20. The apparatus according to any one of claims 11-16, applied to an audio-video conference, further comprising:
a forwarding module, used for forwarding the voice signal of the speaker to the clients of the participants of the audio-video conference; and/or
wherein the processing module is specifically configured to:
send the voice recognition result to the clients; and/or
send the voice recognition result to a server that records the conference content.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.
CN202111294034.2A 2021-11-03 2021-11-03 Speech recognition method, apparatus, device and storage medium Pending CN114242067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111294034.2A CN114242067A (en) 2021-11-03 2021-11-03 Speech recognition method, apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111294034.2A CN114242067A (en) 2021-11-03 2021-11-03 Speech recognition method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
CN114242067A true CN114242067A (en) 2022-03-25

Family

ID=80743715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111294034.2A Pending CN114242067A (en) 2021-11-03 2021-11-03 Speech recognition method, apparatus, device and storage medium

Country Status (1)

Country Link
CN (1) CN114242067A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106033339A (en) * 2015-03-13 2016-10-19 联想(北京)有限公司 Information processing method and electronic device
CN107545898A (en) * 2017-08-07 2018-01-05 清华大学 A kind of processing method and processing device for distinguishing speaker's voice
CN110298252A (en) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 Meeting summary generation method, device, computer equipment and storage medium
CN110491370A (en) * 2019-07-15 2019-11-22 北京大米科技有限公司 A kind of voice stream recognition method, device, storage medium and server
CN111276152A (en) * 2020-04-30 2020-06-12 腾讯科技(深圳)有限公司 Audio processing method, terminal and server
CN112581941A (en) * 2020-11-17 2021-03-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116386644A (en) * 2023-06-07 2023-07-04 百融至信(北京)科技有限公司 Route control method and device for ASR resource side
CN116386644B (en) * 2023-06-07 2024-03-22 百融至信(北京)科技有限公司 Route control method and device for ASR resource side

Similar Documents

Publication Publication Date Title
US9325809B1 (en) Audio recall during voice conversations
US8363673B2 (en) Method and device for processing network time delay characteristics
US9313250B2 (en) Audio playback method, apparatus and system
CN106792073B (en) Method, playback equipment and the system that the audio, video data of striding equipment is played simultaneously
CN108184135A (en) Method for generating captions and device, storage medium and electric terminal
EP2936489A1 (en) Audio processing apparatus and audio processing method
CN110634511B (en) Audio data processing method and device
WO2014194641A1 (en) Audio playback method, apparatus and system
CN113490055A (en) Data processing method and device
WO2017059678A1 (en) Real-time voice receiving device and delay reduction method in real-time voice call
CN114285830B (en) Voice signal processing method, device, electronic equipment and readable storage medium
CN114242067A (en) Speech recognition method, apparatus, device and storage medium
US9912617B2 (en) Method and apparatus for voice communication based on voice activity detection
CN114760425A (en) Digital human generation method, device, computer equipment and storage medium
US10936823B2 (en) Method and system for displaying automated agent comprehension
CN117880579A (en) Picture control method and device, electronic equipment and storage medium
CN105847709A (en) Cloud program directing station and multi-channel video stitching method
CN116033235B (en) Data transmission method, digital person production equipment and digital person display equipment
CN115102931B (en) Method for adaptively adjusting audio delay and electronic equipment
WO2022179306A1 (en) Audio/video playing method and apparatus, and electronic device
CN115914746A (en) Video processing method and device, electronic equipment and storage medium
CN105743806B (en) The voice buffering loss treating method based on energy threshold for tcp agreement
CN108924465B (en) Method, device, equipment and storage medium for determining speaker terminal in video conference
US9548998B1 (en) Asynchronous communication system architecture
CN114743540A (en) Speech recognition method, system, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination