CN111755008B - Information processing method, information processing apparatus, electronic device, and medium - Google Patents


Info

Publication number
CN111755008B
Authority
CN
China
Prior art keywords
audio transmission
transmission channel
voice
information
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010531056.5A
Other languages
Chinese (zh)
Other versions
CN111755008A (en)
Inventor
韩晓
赵立
童剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202010531056.5A
Publication of CN111755008A
Application granted
Publication of CN111755008B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/40 Support for services or applications
    • H04L 65/403 Arrangements for multi-party communication, e.g. for conferences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Abstract

The embodiments of the disclosure disclose an information processing method, an information processing apparatus, an electronic device, and a medium. The method includes: collecting and caching voice information of each speaking user; if it is determined that the cached voice duration of the target speaking user reaches a preset duration threshold, determining a target audio transmission channel from an audio transmission channel pool; and sending the cached voice information to a voice-to-text module based on the target audio transmission channel, and processing the cached voice information into text based on the voice-to-text module. Because only the voice information carried by a limited number of audio transmission channels has to be processed, the technical scheme of the embodiments improves the utilization rate of the audio transmission channels and saves resources.

Description

Information processing method, information processing apparatus, electronic device, and medium
Technical Field
The embodiment of the disclosure relates to the technical field of computer data processing, and in particular, to an information processing method, an information processing device, an electronic device, and a medium.
Background
Currently, in a video conference, a video chat, or a multi-user chat based on a network, a server may receive voice information of each speaking user, process the voice information, and play the processed voice information.
However, in practical applications, in order to convert the audio information of each participating user into text information, it is necessary to perform audio-to-text processing on multiple audio streams, and at this time, it is necessary to maintain multiple audio-to-text processing channels.
Disclosure of Invention
The embodiments of the present disclosure provide an information processing method, an information processing apparatus, an electronic device, and a medium, which optimize the way an audio transmission channel is determined and thereby achieve the technical effect of saving resources.
In a first aspect, an embodiment of the present disclosure provides an information processing method, where the method includes:
collecting and caching voice information of each speaking user;
if the cached voice time of the target speaking user reaches a preset voice time threshold, determining a target audio transmission channel from an audio transmission channel pool;
and transmitting the cached voice information to a voice-to-text module based on the target audio transmission channel, so as to process the cached voice information into text based on the voice-to-text module.
In a second aspect, an embodiment of the present disclosure further provides an information processing apparatus, including:
the information acquisition module is used for acquiring and caching voice information of each speaking user;
the connection establishing module is used for determining a target audio transmission channel from the audio transmission channel pool if the cached voice time of the target speaking user reaches a preset voice time threshold;
and the voice information conversion module is used for transmitting the cached voice information to the voice-to-text module based on the target audio transmission channel, so as to process the cached voice information into text based on the voice-to-text module.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the information processing method according to any one of the embodiments of the present disclosure.
In a fourth aspect, the present disclosure further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are used for executing the information processing method according to any one of the embodiments of the present disclosure.
According to the technical scheme of this embodiment, when it is determined, based on the collected voice information, that an audio transmission channel needs to be established between the message queue to which the voice information belongs and the voice-to-text module, the target audio transmission channel is selected from the audio transmission channel pool. The voice information is then sent to the voice-to-text module over the target audio transmission channel, and the text expression corresponding to the voice information is obtained. Because the voice information is processed through only a limited number of audio transmission channels, the utilization rate of the audio transmission channels is improved and resources are saved.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of an information processing method according to a first embodiment of the disclosure;
fig. 2 is a schematic flow chart of an information processing method according to a second embodiment of the disclosure;
fig. 3 is a schematic flow chart of an information processing method according to a third embodiment of the present disclosure;
fig. 4 is a schematic flow chart of an information processing method according to a fourth embodiment of the present disclosure;
FIG. 5 is a block diagram of a module provided by an embodiment of the present disclosure;
fig. 6 is a schematic flow chart of an information processing method according to a fifth embodiment of the disclosure;
fig. 7 is a schematic structural diagram of an information processing apparatus according to a sixth embodiment of the disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to a seventh embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
Example one
Fig. 1 is a flowchart illustrating an information processing method according to a first embodiment of the present disclosure. The method is applicable to a multi-user interactive scenario supported by the internet, in which an audio transmission channel corresponding to a target speaking user is determined. The method may be executed by an information processing apparatus, which may be implemented in software and/or hardware, and optionally by an electronic device such as a mobile terminal, a PC, or a server. The interaction scenario is usually implemented by cooperation of a client and a server, and the method provided in this embodiment may be executed by the server alone or by the client and the server in cooperation.
As shown in fig. 1, the method of the present embodiment includes:
and S110, collecting and caching voice information of each speaking user.
It should be noted that the technical solution of this embodiment of the present disclosure can be applied to real-time interactive scenarios, such as video conferencing and live streaming.
The real-time interactive interface is any interactive interface in a real-time interactive application scenario. The real-time interactive application scenario may be implemented through the internet and a computer, for example an interactive application implemented through a native program or a web program. In the real-time interactive interface, multiple users may be allowed to interact through various forms of interactive behavior, such as entering text, voice, video, or sharing content objects. The users participating in the real-time interaction are referred to as interactive users, their number may be more than one, and the users who actually speak are referred to as speaking users. When each speaking user speaks, the voice information of that speaking user can be collected and cached.
Specifically, when multiple users interact based on the real-time interactive interface, the server or the client can collect the voice information of each speaking user and cache it, so that the collected voice information can be converted into the corresponding text expression. Other interactive users can then conveniently learn, from the converted text, what each speaking user said, which improves the efficiency of the interaction.
It should be noted that, in this embodiment, the audio stream may be collected in Real Time based on a Real-Time Communication (RTC) technology, that is, end-to-end Communication may be implemented based on the RTC technology, so as to provide technical support for transmitting conference audio data in Real Time, and further collect voice information of each speaking user.
And S120, if the cached voice duration of the target speaking user reaches a preset voice duration threshold, determining a target audio transmission channel from the audio transmission channel pool.
The target speaking user is the current speaking user, namely the user corresponding to the collected voice information. The caching of the voice information can be realized by a voice information caching module. When the voice information is cached, the effective voice time of the voice information, namely the voice time of the voice information including the substantial meaning content, can be determined, and meanwhile, the effective voice time can be accumulated and used as the cached voice time. For example, if the speech information includes 5 audio frames, and the cumulative valid speech duration of the 5 audio frames is 80ms, the buffered speech duration in the buffered speech information is 80 ms. The preset voice time length threshold is preset and is used as a trigger condition for judging whether to call an audio transmission channel to transmit the cached voice information. The target audio transmission channel is an audio transmission channel for transmitting the cached voice information of the target speaking user. An audio transmission channel pool may be understood as a container storing audio transmission channels. If the audio transmission channel needs to be selected, the audio transmission channel can be selected from the audio transmission channel pool.
Specifically, when the voice information is cached, it may be detected whether the cached voice duration has reached the preset voice duration threshold. If so, an audio transmission channel is selected from the audio transmission channel pool and used as the target audio transmission channel.
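As a rough illustration of this trigger (the class, the method names, and the 80 ms value below are assumptions made for the sketch, not details disclosed by the embodiment), the caching and threshold check could be organized as follows:

```python
from collections import defaultdict, deque

PRESET_VOICE_DURATION_MS = 80  # assumed value of the preset voice duration threshold

class VoiceCache:
    """Caches audio frames per speaking user and accumulates their valid speech duration."""

    def __init__(self):
        self.frames = defaultdict(deque)            # user_id -> cached audio frames
        self.cached_duration_ms = defaultdict(int)  # user_id -> accumulated valid duration

    def cache_frame(self, user_id, frame, valid_ms):
        """Store one frame and add its valid-speech duration to the running total."""
        self.frames[user_id].append(frame)
        self.cached_duration_ms[user_id] += valid_ms

    def threshold_reached(self, user_id):
        """True once the cached voice duration reaches the preset threshold (step S120)."""
        return self.cached_duration_ms[user_id] >= PRESET_VOICE_DURATION_MS
```

Only when threshold_reached returns True would the server go on to pick a target audio transmission channel from the pool and forward the cached frames, as in S130 below.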
And S130, sending the cached voice information to a voice-to-text module based on the target audio transmission channel, and processing the cached voice information into text based on the voice-to-text module.
The voice-to-text module based on Automatic Speech Recognition (ASR) can extract corresponding text information from the audio data. I.e., based on the voice-to-text module, text corresponding to the voice message may be determined.
Specifically, after the target audio transmission channel is determined from the audio transmission channel pool, the cached voice information can be sent to the voice-to-text module through the target audio transmission channel, so that the voice-to-text module converts it into the corresponding text.
According to the technical scheme of this embodiment, when the cached voice duration of the target speaking user reaches the preset voice duration threshold, the target audio transmission channel is determined from the audio transmission channel pool, and the cached voice information is sent to the voice-to-text module based on the target audio transmission channel to obtain the text expression of the voice information. In other words, only the voice information carried by a limited number of audio transmission channels has to be processed, which improves the utilization rate of the audio transmission channels and saves resources.
On the basis of the above scheme, after the target audio transmission channel corresponding to the message queue of the cached voice information is determined, the method further includes: marking the identity of the speaking user on the target audio transmission channel, so that when the cached voice duration of the message queue reaches the preset voice duration threshold, the target audio transmission channel corresponding to that identity is obtained from the audio transmission channel pool based on the identity of the message queue, and the cached voice information is transmitted over it.
It should be noted that after the target audio transmission channel is determined, an identity corresponding to the speaking user may be marked on the target audio transmission channel, so that when it is detected that the cached voice time corresponding to the speaking user reaches the preset voice time threshold, the target audio transmission channel corresponding to the identity of the speaking user is obtained, and the voice information of the speaking user is transmitted based on the target audio transmission channel.
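A minimal sketch of that identity mark, assuming a plain dictionary keyed by the speaking user's identity (the class and method names are illustrative, not taken from the disclosure):

```python
class ChannelIdentityMarks:
    """Remembers which pooled audio transmission channel carries which user's identity mark."""

    def __init__(self):
        self._by_user_id = {}

    def mark(self, user_id, channel):
        """Mark the target audio transmission channel with the speaking user's identity."""
        self._by_user_id[user_id] = channel

    def channel_for(self, user_id):
        """Return the channel previously marked with this identity, or None if there is none."""
        return self._by_user_id.get(user_id)
```

When the cached voice duration of a message queue next reaches the threshold, channel_for lets the server reuse the marked channel instead of searching the pool again.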
On the basis of the above scheme, after the voice information is converted into the corresponding text based on the voice-to-text module, the method further includes: sending the text corresponding to the cached voice information to the client, so that the text is displayed in a preset area of the client.
Specifically, after the voice information of the target speaking user is converted into the corresponding text based on the voice-to-text module, the text can be displayed at a preset position on the interactive interface. The advantage of this is that the interactive users can clearly understand the voice information of each speaking user from the text displayed on the interactive interface, which improves the efficiency of the interaction.
Example two
Fig. 2 is a schematic flow chart of an information processing method according to a second embodiment of the disclosure. On the basis of the foregoing embodiment, optimization may be performed on "collecting and caching voice information of each speaking user" in the foregoing embodiment, and specific optimization steps are shown in fig. 2:
S210, determining the valid voice duration of each audio frame in the voice information, and determining the cached voice duration based on the valid voice duration.
The valid voice duration of each audio frame in the voice information may be determined based on a valid-voice-duration detection algorithm. That is, the detection algorithm determines the valid voice duration in each audio frame, from which the cached voice duration of the cached voice information is then determined.
In other words, the accumulated duration of the voice cached in the target buffer may be computed from the valid voice duration of each audio frame and used as the cached voice duration.
And S220, sequentially storing each audio frame into a target buffer.
Wherein, the target cache can be understood as a space for storing the cached voice information.
Specifically, after the valid voice duration in each audio frame is determined, each audio frame may be sequentially cached in the target cache, so that when the cached voice duration reaches the preset voice duration threshold, the voice information cached in the target cache is converted into corresponding text information.
Optionally, the target buffer includes at least one message queue, and sequentially storing each audio frame into the target buffer includes: and sequentially storing the audio frames into a message queue corresponding to the speaking user, and clearing the cached voice information in the message queue after detecting that all the cached voice information in the message queue is converted into characters.
In this embodiment, the number of audio transmission channels in the audio transmission channel pool is limited, and therefore, when the buffered speech information is transmitted to the speech-to-text module, a target audio transmission channel corresponding to the speech information needs to be determined. Communication channels between the message queues corresponding to all speaking users in the cache module and the voice-to-text module can be established based on the target audio transmission channel, so that the voice information is sent to the voice-to-text module based on the audio transmission channel, and then the text expression corresponding to the voice information is obtained.
The target buffer comprises at least one message queue, and each message queue corresponds to a speaking user. Illustratively, the number of speaking users is 10, and there may be 10 message queues in the voice message buffer module. When a new speaking user speaks, a message queue corresponding to the new speaking user can be added in the voice message buffer module.
It can be understood that when the voice information of each speaking user is collected, the voice information can be sequentially stored in the message queue corresponding to the voice information. In this embodiment, the advantage of establishing the message queue is that when the buffered voice duration reaches the preset voice duration threshold, the target audio transmission channel corresponding to the message queue may be determined, so that the voice information of the speaking user corresponding to the message queue is sent to the voice to text module through the target audio transmission channel.
It should be noted that, after the buffered voice information in the message queue is detected to be completely converted into the corresponding text information, the buffered voice information stored in the message queue may be deleted, so as to avoid the situation of redundant information storage in the buffer.
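A sketch of such a target buffer with one message queue per speaking user (the names are assumptions for illustration); the queue is cleared once its cached voice information has been confirmed as converted into text:

```python
class TargetBuffer:
    """Target buffer holding one message queue of audio frames per speaking user."""

    def __init__(self):
        self._queues = {}  # user_id -> list of cached audio frames

    def enqueue(self, user_id, frame):
        """Store the frame in the message queue corresponding to the speaking user."""
        self._queues.setdefault(user_id, []).append(frame)

    def clear_after_conversion(self, user_id):
        """Delete the cached voice information once it has all been converted into text,
        which avoids keeping redundant data in the cache."""
        self._queues[user_id] = []
```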
And S230, if the cached voice duration reaches the preset voice duration threshold, determining a target audio transmission channel from the audio transmission channel pool.
And S240, sending the cached voice information to the voice-to-text module based on the target audio transmission channel, and processing the cached voice information into text based on the voice-to-text module.
According to the technical scheme of this embodiment of the present disclosure, the cached voice duration is determined from the valid voice duration in the voice information, and the voice information is stored in the corresponding message queue, so that audio transmission communication between each message queue and the voice-to-text module can be established conveniently and quickly, and the text information corresponding to the voice information is obtained.
On the basis of the above technical scheme, when the cached voice duration is lower than the preset voice duration threshold but an audio transmission channel has already been established between the message queue to which the cached voice information belongs and the voice-to-text module, it is judged whether the cached voice duration is smaller than a preset closing threshold; if not, the cached voice information continues to be transmitted based on the target audio transmission channel.
In practical applications, if the cached voice duration reaches the preset voice duration threshold when speaking user A speaks for the first time, a target audio transmission channel corresponding to the message queue to which speaking user A belongs is established. If speaking user A speaks again one or several minutes later and the cached voice duration corresponding to speaking user A is now lower than the preset voice duration threshold, how to handle the audio transmission channel of speaking user A is determined by the relationship between this second cached voice duration and the preset closing threshold. Optionally, if the cached voice duration is greater than the preset closing threshold, the cached voice information may continue to be transmitted based on the target audio transmission channel.
During application, if the cached voice duration is less than the preset closing threshold, the target audio transmission channel is released, and its current identification information is updated to the idle state once the release duration reaches a preset release-duration threshold.
Specifically, when the cached voice duration is less than the preset closing threshold, the target audio transmission channel may be marked as released, and its state is updated to the idle state when the release duration reaches the preset release-duration threshold, so that the channel can be multiplexed once the cached voice duration of another speaking user reaches the preset threshold. The advantage of this arrangement is as follows. If, when the cached voice duration is less than the preset closing threshold, the state of the audio transmission channel were directly updated from the busy state to the idle state, that is, the communication between the audio transmission channel and the voice-to-text module were immediately disconnected, then, should the voice information cached for that speaking user reach the preset cached-duration threshold again within a short time, all the steps for obtaining an audio transmission channel provided by the embodiments of the present disclosure would have to be repeated in order to re-establish the channel for that speaking user; re-establishing the audio transmission channel within a short time would reduce the audio transmission efficiency.
If the cached voice duration is greater than the preset closing threshold, the cached voice information can continue to be sent to the voice-to-text module based on the target audio transmission channel, so as to obtain the text expression corresponding to the voice information.
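The closing-threshold handling described above could be sketched as follows; the Channel type, the 40 ms closing threshold, and the 30 s release timeout are assumptions for illustration, not values given by the disclosure:

```python
import time
from dataclasses import dataclass

CLOSING_THRESHOLD_MS = 40  # assumed preset closing threshold
RELEASE_TIMEOUT_S = 30.0   # assumed preset release-duration threshold

@dataclass
class Channel:
    state: str = "busy"     # "busy", "released" or "idle"
    released_at: float = 0.0

def handle_established_channel(channel, cached_ms, send_cached_audio):
    """Cached duration is below the voice-duration threshold, but the channel already
    connects this user's message queue to the voice-to-text module."""
    if cached_ms >= CLOSING_THRESHOLD_MS:
        send_cached_audio()                     # keep transmitting on the same channel
    else:
        channel.state = "released"              # mark as released, do not tear down yet
        channel.released_at = time.monotonic()

def refresh_state(channel):
    """A later check turns a long-released channel idle so other users can reuse it."""
    if channel.state == "released" and time.monotonic() - channel.released_at >= RELEASE_TIMEOUT_S:
        channel.state = "idle"
```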
EXAMPLE III
Fig. 3 is a schematic flow chart of an information processing method according to a third embodiment of the present disclosure. On the basis of the foregoing embodiments, this embodiment details the step of "determining a target audio transmission channel from the audio transmission channel pool if the cached voice duration of the target speaking user reaches the preset voice duration threshold". As shown in fig. 3, the method includes:
and S310, collecting and caching voice information of each speaking user.
S320, if the cached voice duration of the target speaking user reaches the preset voice duration threshold and it is detected that the message queue to which the cached voice information belongs is not yet communicating with the voice-to-text module, establishing a target audio transmission channel between the message queue and the voice-to-text module.
The preset voice time threshold is predetermined, and the specific time of the preset voice time threshold can be set according to experience or actual requirements. The voice-to-text module is used for converting the received voice information into corresponding text expressions.
Specifically, if the cached voice duration reaches the preset duration threshold, the cached voice information needs to be converted into the corresponding text, that is, a connection between the message queue to which the cached voice information belongs and the voice-to-text module needs to exist. If the message queue has not yet established communication with the voice-to-text module, an available audio transmission channel is selected from the audio transmission channel pool and used as the target audio transmission channel.
Optionally, if the cached voice time reaches a preset voice time threshold, and when it is detected that a message queue to which the cached voice information belongs and a corresponding audio transmission channel exist, the target audio transmission channel is determined based on the audio transmission channel.
That is to say, when it is determined that the cached voice information needs to be converted into a word expression based on the voice caching duration, and it is detected that the message queue to which the cached voice information belongs has established communication with the voice-to-word module, the audio transmission channel for establishing communication may be used as the target audio transmission channel, and the cached voice information is continuously transmitted based on the target audio transmission channel.
According to the technical scheme of this embodiment of the present disclosure, when the cached voice duration reaches the preset voice duration threshold, it can be judged whether the message queue to which the cached voice information belongs is already communicating with the voice-to-text module, that is, whether the cached voice information can be sent to the voice-to-text module directly. If so, an audio transmission channel between the message queue and the voice-to-text module has already been established, and the cached voice information is transmitted based on that channel. If not, an available audio transmission channel is selected from the audio transmission channel pool to establish communication between the message queue and the voice-to-text module, and the cached voice information is sent to the voice-to-text module based on that channel. The advantage of this arrangement is that the number of audio transmission channels established with the voice-to-text module is far smaller than the number of interactive users; only the audio transmission channels communicating with the voice-to-text module, that is, a limited number of channels, need to be managed, which solves the resource waste caused in the prior art by having to manage the audio channels of all interactive users.
S330, based on the target audio transmission channel, sending the cached voice information to the voice-to-character module, and processing the cached voice information into characters based on the voice-to-character module.
On the basis of the above scheme, determining a target audio transmission channel from the audio transmission channel pool includes: acquiring the current identification information of each audio transmission channel in the audio transmission channel pool, where the current identification information is used to represent the current state of the audio transmission channel, and the current state is either an idle state or a busy state; and determining the target audio transmission channel from the audio transmission channels whose current identification information indicates the idle state.
In this embodiment, in order to ensure that the voice information of a speaking user is converted into the corresponding text, whenever it is detected that the cached voice duration reaches the preset voice duration threshold while no audio transmission channel exists for the message queue to which the cached voice information belongs, an audio transmission channel needs to be selected from the audio transmission channel pool, so as to establish communication between that message queue and the voice-to-text module and send the cached voice information to the voice-to-text module over the channel.
The current identification information may be used to characterize the current state of each audio transmission channel in the audio transmission channel pool, that is, whether the channel is in a busy state or an idle state. The busy state means that the channel is already communicating with the voice-to-text module and cannot currently be called; the idle state means that the channel has not established communication with the voice-to-text module and can currently be called. In other words, if the current state of an audio transmission channel is busy, the channel has established communication with the voice-to-text module and cannot be called at this moment; if the current state is idle, the channel has not established communication with the voice-to-text module, and the target audio transmission channel can be determined from the channels in the idle state.
Specifically, when the cached voice time reaches the preset voice time threshold and communication is not established with the voice-to-text module, the current identification information of each audio transmission channel in the audio transmission channel pool can be acquired, and the target audio transmission channel is determined from the audio transmission channel of which the current identification information is in an idle state.
In the process of practical application, if the current identification information of each audio transmission channel in the audio transmission channel pool is in a busy state, that is, each audio transmission channel has established communication with the voice-to-text module, at this moment, the following steps can be executed: optionally, if the current states of the audio transmission channels in the audio transmission channel pool are busy states, obtaining the current channel number of the audio transmission channels in the audio transmission channel pool; and if the current channel number is smaller than a preset channel number threshold value, reestablishing a new audio transmission channel, and determining the target audio transmission channel based on the new audio transmission channel.
The preset channel number threshold is preset, and optionally, the preset channel number threshold is 20. The current number of channels is the number of channels already in the audio transmission channel pool, and optionally, the number of channels already in the transmission channel pool is 10. If the current channel number is smaller than the preset channel number threshold value, the audio transmission channel can be reestablished, communication between the message queue and the voice-to-text module is established based on the reestablished audio transmission channel, and the reestablished audio transmission channel is used as a target audio transmission channel.
Of course, if the conventional audio transmission mode is switched to the audio-transmission-channel calling mode provided in this embodiment of the present disclosure, the number of existing audio transmission channels may already be greater than or equal to the preset channel-number threshold, in which case no new audio transmission channel can be established. In that situation, the system usually waits until an audio transmission channel whose current identification information is in the idle state appears in the audio transmission channel pool. Once the state of some audio transmission channel is updated to the idle state, that channel can be used as the target audio transmission channel, which realizes multiplexing of the audio transmission channels and improves their utilization rate.
Optionally, if the current channel number is greater than or equal to a preset channel number threshold, waiting for an audio transmission channel whose identification information is in an idle state to exist in the audio transmission channel pool, and determining the target audio transmission channel based on the audio transmission channel.
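Putting the selection rules of this embodiment together, a channel pool could be sketched like this (the 20-channel limit matches the optional value mentioned above; the class itself and its locking strategy are assumptions):

```python
import threading

MAX_CHANNELS = 20  # preset channel-number threshold (20 is the optional value mentioned above)

class AudioChannelPool:
    """Container of audio transmission channels with idle/busy identification information."""

    def __init__(self):
        self._channels = []               # each channel: {"state": "idle" | "busy"}
        self._cond = threading.Condition()

    def acquire(self):
        """Reuse an idle channel, create a new one while under the limit,
        otherwise wait until some channel becomes idle."""
        with self._cond:
            while True:
                for channel in self._channels:
                    if channel["state"] == "idle":
                        channel["state"] = "busy"
                        return channel
                if len(self._channels) < MAX_CHANNELS:
                    channel = {"state": "busy"}  # stands in for opening a real connection
                    self._channels.append(channel)
                    return channel
                self._cond.wait()                # block until release() marks a channel idle

    def release(self, channel):
        """Return a channel to the pool and wake up a waiting caller."""
        with self._cond:
            channel["state"] = "idle"
            self._cond.notify()
```

acquire first reuses an idle channel, then creates a new one while the pool is below the limit, and otherwise blocks until release marks some channel idle, which mirrors the three cases discussed in this embodiment.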
Example four
Fig. 4 is a schematic flow chart of an information processing method according to a fourth embodiment of the disclosure. On the basis of the foregoing embodiments, the method further includes: regularly detecting the release duration of each audio transmission channel whose current identification information is in the busy state, where the release duration is the duration for which the target audio transmission channel has been marked as released, and the released state is the state in which the audio transmission channel has not transmitted voice information within a preset time; and when the release duration reaches a preset release-duration threshold, updating the current identification information of the audio transmission channel from the released state to the idle state, so as to determine the target audio transmission channel based on the updated identification information. As shown in fig. 4, the method includes:
s410, collecting and caching voice information of each speaking user;
and S420, if the cached voice time of the target user reaches a preset voice time threshold, determining a target voice frequency transmission channel from the voice frequency transmission channel pool.
And S430, sending the cached voice information to the voice-to-character module based on the target audio transmission channel, and processing the cached voice information into characters based on the voice-to-character module.
S440, regularly detecting the release duration of each audio transmission channel whose current identification information is in the busy state; the release duration is the duration for which the target audio transmission channel has been marked as released; the released state is the state in which the audio transmission channel has not transmitted voice information within a preset time.
In order to determine the current state of each audio transmission channel in the audio transmission channel pool in real time, and thereby select the target audio transmission channel, the duration for which each target audio transmission channel has not transmitted voice information can be obtained. When this duration reaches the preset time, the audio transmission channel can be released, that is, marked as released. Further, to avoid the problem that a directly released channel would have to be re-established if voice information were sent over it again within a short time, the duration for which the channel has been marked as released can be recorded, and whether the channel can be updated from the busy state to the idle state is then determined based on that duration.
S450, when the release duration reaches a preset release duration threshold, updating the current identification information of the audio transmission channel from the release state to an idle state, and determining the target audio transmission channel based on the updated identification information.
Specifically, when the release duration reaches the preset duration threshold, it indicates that the target audio transmission channel does not transmit the cached voice information any more, and the current state of the target audio transmission channel may be changed from the busy state to the idle state, so that when the voice information is collected, the audio transmission channel in the idle state may be multiplexed. According to the technical scheme of the embodiment of the disclosure, only the audio transmission channel which is communicated with the voice-to-text module is maintained, and the technical effect of multiplexing the audio transmission channel is realized.
According to the technical scheme of this embodiment of the present disclosure, the release duration of each target audio transmission channel is detected at regular intervals. When the release duration reaches the preset duration threshold, the target audio transmission channel no longer transmits cached voice information, and its identification information can be updated to the idle state, so that the channel can be multiplexed when new voice information is received, which improves the utilization rate of the audio transmission channels.
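The regular detection described in this embodiment could be sketched as a periodic sweep over the pool; the 30 s timeout, the 5 s period, and the data layout are assumptions for illustration:

```python
import threading
import time

RELEASE_TIMEOUT_S = 30.0  # assumed preset release-duration threshold
SWEEP_PERIOD_S = 5.0      # assumed detection period

def start_release_sweeper(channels, lock):
    """Regularly scan busy channels that were marked as released and flip them to idle
    once their release duration reaches the threshold."""
    def sweep():
        now = time.monotonic()
        with lock:
            for channel in channels:
                released_at = channel.get("released_at")
                if released_at is not None and now - released_at >= RELEASE_TIMEOUT_S:
                    channel["state"] = "idle"
                    channel["released_at"] = None
        threading.Timer(SWEEP_PERIOD_S, sweep).start()

    sweep()
```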
EXAMPLE five
As an alternative illustration of the above embodiments, fig. 5 is a schematic block diagram provided by an embodiment of the present disclosure. As shown in fig. 5, the module that executes the technical scheme of the embodiments of the present disclosure includes a VAD sub-module, a VAD information caching sub-module, an ASR multiplexing judgment sub-module, and an ASR processing module.
Voice Activity Detection (VAD) is a technique for detecting the valid speech contained in voice information. The VAD sub-module is used to collect the voice information of each speaking user and determine, based on VAD, the valid voice duration of each audio frame in the voice information, from which the cached voice duration is determined. The VAD information caching sub-module is used to cache the voice information, whose valid voice duration has been determined, into the message queue corresponding to each speaking user. After the cached voice information in a message queue has been sent to the voice-to-text module, it can be deleted from the queue to avoid redundant data in the cache. The ASR multiplexing judgment sub-module is mainly used to determine, according to the cached voice duration in the VAD information cache, whether to multiplex an audio transmission channel in the audio transmission channel pool. The audio transmission channel pool is responsible for managing the audio transmission channels. The ASR module is the voice-to-text module: after receiving the cached voice information, it converts it into the corresponding text.
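To make the division of labor concrete, the four sub-modules could be wired together roughly as follows; every class and method name here is an assumption for illustration, not the naming used in the disclosed implementation:

```python
class TranscriptionPipeline:
    """Illustrative wiring of the VAD, caching, multiplexing-judgment and ASR sub-modules."""

    def __init__(self, vad, vad_cache, multiplex_judge, asr):
        self.vad = vad                          # measures valid speech per audio frame
        self.vad_cache = vad_cache              # per-user message queues + cached duration
        self.multiplex_judge = multiplex_judge  # decides whether/which pooled channel to use
        self.asr = asr                          # speech-to-text processing

    def on_frame(self, user_id, frame):
        """Process one incoming audio frame and return text once a transcription happens."""
        valid_ms = self.vad.valid_duration(frame)
        self.vad_cache.append(user_id, frame, valid_ms)
        channel = self.multiplex_judge.decide(user_id, self.vad_cache)
        if channel is None:
            return None                          # threshold not reached yet, keep caching
        frames = self.vad_cache.drain(user_id)   # clear the queue after handing frames over
        return self.asr.transcribe(channel, frames)
```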
Fig. 6 is a schematic flowchart of executing a technical solution according to an embodiment of the present disclosure based on the above modules, that is, fig. 6 is a schematic flowchart of an information processing method according to a fifth embodiment of the present disclosure. As shown in fig. 6, the method includes:
s601, receiving voice information.
Specifically, in the speaking process of the speaking user, the client can collect the voice information of each speaking user and send the voice information to the server, and the server can receive the voice information.
S602, determining the effective voice duration in the voice information.
In this embodiment, each audio frame in the speech information or several audio frames at the same time may be processed in sequence based on the VAD sub-module to determine the effective speech duration in the audio frame.
S603, determining the cache voice time length based on the effective voice time length.
After the effective voice time length of each audio frame or a plurality of audio frames is determined, the audio frames can be sequentially stored in the VAD information cache submodule, and the cache voice time length in the VAD information cache submodule can be accumulated based on the effective voice time length. In order to facilitate processing of the voice information of each speaking user, when the audio frame corresponding to the voice information is stored in the VAD information cache sub-module, the voice information may be stored in a message queue corresponding to the speaking user.
That is, message queues corresponding to different speaking users may be established, and when voice information is buffered, the voice information is sequentially stored in the message queue corresponding thereto, so as to determine a target audio transmission channel corresponding thereto based on the message queue.
S604, judging whether the cache voice time length is larger than a preset voice time length threshold value, if so, executing S605, and if not, executing S606.
The preset voice time threshold is preset, optionally 80ms, and is used for judging whether voice information of a speaking user needs to be converted into corresponding text information.
Specifically, the cached voice duration corresponding to the voice information of the speaking user can be determined based on the valid voice duration. When the cached voice duration reaches the preset duration threshold, the voice information of the speaking user needs to be converted into text, and it can then be determined whether a target audio transmission channel needs to be selected from the audio transmission channel pool, that is, S605 is executed. When the cached voice duration is lower than the preset voice duration threshold, it is determined whether an audio transmission channel has already been established between the message queue to which the cached voice information belongs and the voice-to-text module, and the corresponding operation is then performed, that is, S606 is executed.
S605, determining whether an audio transmission channel needs to be selected from the audio transmission channel pool, if so, performing S607, and if not, performing S608.
Specifically, when it is determined that the cached voice time reaches the preset voice time threshold, the cached voice information needs to be converted into corresponding text information, and at this time, it needs to be determined whether the message queue to which the cached voice information belongs and the voice-to-text module establish an audio transmission channel. That is, whether an audio transmission channel needs to be selected from a connection pool of audio transmission channels.
In practical applications, the message queue corresponding to a certain speaking user may already have established communication with the voice-to-text module, so it needs to be determined whether an audio transmission channel has to be selected from the audio transmission channel pool. If so, an audio transmission channel is selected from the pool, that is, S607 is executed; if not, the message queue has already established an audio transmission channel with the voice-to-text module, and the cached voice information can be transmitted based on the already determined channel, that is, S608 is executed.
S606, judging whether the message queue to which the voice information belongs is undergoing speech-to-text processing; if so, executing S609, and if not, executing S610.
Specifically, when the cached voice duration is lower than the preset voice duration threshold, the voice information of the speaking user contains no content with substantial meaning, for example only a brief filler sound, and communication between the message queue to which the cached voice information belongs and the voice-to-text module does not have to be established. Instead, it can be determined whether the voice information in that message queue is currently being transmitted to the voice-to-text module, that is, whether the message queue has already established an audio transmission channel with the voice-to-text module. If so, the cached voice information can continue to be sent to the voice-to-text module, and it is judged whether the cached voice duration is smaller than the closing threshold, that is, S609 is executed; if not, the audio transmission channel can be disconnected and marked as released, that is, S610 is executed.
In other words, if the cached voice duration is lower than the preset voice duration threshold, it is judged whether the message queue to which the cached voice information belongs has established communication with the voice-to-text module. If so, the cached voice information is sent to the voice-to-text module based on the established audio transmission channel; if not, the audio transmission channel can be marked as released, that is, it waits in a disconnected state from the voice-to-text module, and the channel can be updated from the busy state to the idle state after a certain release time has elapsed.
S607, judging whether a new audio transmission channel needs to be created, if so, executing S611; if not, go to S612.
Specifically, if an audio transmission channel needs to be selected from the audio transmission channel pool, an audio transmission channel whose current identification information is in an idle state may be selected. Whether the audio transmission channel needs to be recreated may be determined based on current identification information for each audio transmission channel in the pool of audio transmission channels.
For example, if there is an audio transmission channel whose current identification information is in the idle state in the audio transmission channel pool, that channel may be selected, that is, no new channel needs to be created, and S612 may be executed; if there is no audio transmission channel in the idle state in the pool, a new audio transmission channel needs to be created, that is, S611 is executed.
And S608, continuing to process the voice information based on the established audio transmission channel.
That is, if the message queue to which the buffered voice message belongs has established communication with the voice-to-text module, the buffered voice message may be sent to the voice-to-text module based on the audio transmission channel.
S609, transmitting the cached voice information, judging whether the voice time length is less than a closing threshold, if so, executing S610; if not, S613 is executed.
The closing threshold is preset and is used for determining whether to mark an audio transmission channel between a message queue to which the buffered voice information belongs and the voice-to-text module as released.
In the process of transmitting the voice information, it can be judged whether the cached voice duration is less than the closing threshold. If so, the speech-to-text channel can be released, and when the release duration reaches the preset duration threshold, the state of the audio transmission channel can be marked as the idle state, that is, S610 is executed. If the cached voice duration is greater than the closing threshold, the cached voice information can continue to be sent to the voice-to-text module through the audio transmission channel.
S610, releasing the audio transmission channel, marking the audio transmission channel as released, and marking the audio transmission channel as an idle state when the release duration reaches a preset duration.
Specifically, in order to avoid the resource waste caused by re-establishing communication between the message queue to which the cached voice information belongs and the voice-to-text module, the audio transmission channels whose current identification information is busy can be monitored in real time. If such a channel has not transmitted cached voice information for a certain time, the target audio transmission channel can be marked as released. When it is detected that the release duration has reached the preset time, the current identification information of the channel can be updated from the busy state to the idle state. In this way, when the cached voice information in another message queue reaches the preset voice duration threshold, the channel can be multiplexed to establish communication between that message queue and the voice-to-text module, the cached voice information is sent to the voice-to-text module, and the text corresponding to the cached voice information, that is, the text corresponding to each speaking user's speech, is obtained.
S611, judging whether the number of the audio transmission channels in the audio transmission channel pool reaches a preset number threshold, if so, executing S614; if not, go to S615.
And acquiring the number of the audio transmission channels in the audio transmission channel pool and a preset number threshold preset by the server. Whether to reestablish the audio transmission channel can be determined according to a preset number threshold value and the number of the existing audio transmission channels.
If there is no audio transmission channel in the idle state in the audio transmission channel pool, whether a new audio transmission channel can be created is determined by whether the number of channels in the pool is smaller than the preset channel-number threshold. If it is smaller, a new audio transmission channel is created, communication between the message queue and the voice-to-text module is established based on the newly created channel, and S615 is executed. If not, this indicates that the number of audio transmission channels in the pool has reached the preset number threshold and no new channel can be created; in this case the system waits for an audio transmission channel whose identification information is in the idle state to appear in the pool.
And S612, calling the audio transmission channel marked as the idle state in the audio transmission channel pool.
And acquiring an audio transmission channel with idle identification information in the audio transmission channel pool, and establishing communication between the voice-to-text module and a message queue to which the cached voice information belongs on the basis of the audio transmission channel.
S613, processing the voice information based on the determined audio transmission channel.
That is, the buffered speech information may be sent to the speech-to-text module via an audio transmission channel.
S614, waiting for the audio transmission channel with the identification information in the idle state to exist in the audio transmission channel pool.
It can be understood that, if the number of audio transmission channels in the audio transmission channel pool has reached the preset number threshold, the system can wait until an audio transmission channel whose identification information is in the idle state appears in the pool, and the channel determined at that moment is used as the target audio transmission channel.
S615, a new audio transmission channel is created, and voice information transmission is achieved based on the new audio transmission channel.
That is, if the number of audio transmission channels in the audio transmission channel pool has not reached the preset number threshold, a new audio transmission channel may be established, and communication between the message queue to which the cached voice information belongs and the voice-to-text module is set up based on the new channel. The cached voice information is then transmitted from the message queue to the voice-to-text module, and the text corresponding to the cached voice information, that is, the text corresponding to the voice information of the speaking user, is obtained.
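The branch structure of S611 to S615 can be summarised in the following sketch, which reuses the AudioChannel class from the earlier release-monitor sketch; MAX_CHANNELS stands in for the preset channel number threshold and its value is an assumption, and locking for concurrent access is omitted for brevity.

    import time

    MAX_CHANNELS = 8  # assumed preset channel-number threshold

    def acquire_channel(pool):
        # S612/S613: if an idle channel exists, reuse it as the target channel.
        for ch in pool:
            if ch.state == "idle":
                ch.state = "busy"
                return ch
        # S611: all channels are busy -- compare the pool size with the threshold.
        if len(pool) < MAX_CHANNELS:
            # S615: create a new channel and add it to the pool.
            ch = AudioChannel(channel_id=len(pool))
            ch.state = "busy"
            pool.append(ch)
            return ch
        # S614: the pool is full -- wait until some channel becomes idle.
        while True:
            for ch in pool:
                if ch.state == "idle":
                    ch.state = "busy"
                    return ch
            time.sleep(0.05)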
According to the technical solution of this embodiment, when the cached voice duration reaches the preset duration threshold, an audio transmission channel between the message queue to which the cached voice information belongs and the voice-to-text module can be established, and the cached voice information is sent to the voice-to-text module over it. The server then only needs to manage the limited number of audio transmission channels connected to the voice-to-text module. This solves the inefficiency caused in the prior art by having to process and manage a large number of audio transmission channels simultaneously, and achieves the technical effect of improving resource utilization by processing only a limited number of audio channels.
EXAMPLE six
Fig. 7 is a schematic structural diagram of an information processing apparatus according to a sixth embodiment of the present disclosure. As shown in fig. 7, the apparatus includes: an information collection module 710, a connection establishment module 720, and a voice information conversion module 730.
The apparatus comprises an information acquisition module, a connection establishing module and a voice information conversion module. The information acquisition module is used for collecting and caching voice information of each speaking user; the connection establishing module is used for determining a target audio transmission channel from the audio transmission channel pool if the cached voice duration of the target user reaches a preset voice duration threshold; and the voice information conversion module is used for transmitting the cached voice information to the voice-to-text module based on the target audio transmission channel, so that the cached voice information is processed into text by the voice-to-text module.
According to the technical solution of this embodiment, when it is determined, based on the collected voice information, that an audio transmission channel between the message queue to which the voice information belongs and the voice-to-text module needs to be established, the target audio transmission channel is selected from the audio transmission channel pool, and the voice information is sent to the voice-to-text module over that channel to obtain the corresponding text. Because only a limited number of audio transmission channels have to be processed, the utilization rate of the audio transmission channels is improved and resources are saved.
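Read as software, the three modules of the apparatus could be organised roughly as follows. This is only a structural sketch: the class names mirror the module names of this embodiment, while the recogniser callable and the acquire_channel helper from the previous sketch are illustrative assumptions.

    class InformationCollectionModule:
        # Module 710: collects audio frames and caches them per speaking user.
        def __init__(self, cache):
            self.cache = cache  # target cache holding one message queue per user

        def collect(self, user_id, frame):
            self.cache.append_frame(user_id, frame)

    class ConnectionEstablishmentModule:
        # Module 720: picks the target audio transmission channel from the pool
        # once the cached voice duration of a user reaches the preset threshold.
        def __init__(self, channel_pool):
            self.channel_pool = channel_pool

        def get_target_channel(self):
            return acquire_channel(self.channel_pool)

    class VoiceInformationConversionModule:
        # Module 730: sends the cached voice over the target channel to the
        # speech-to-text module and returns the recognised text.
        def __init__(self, recogniser):
            self.recogniser = recogniser  # stand-in for the speech-to-text module

        def convert(self, channel, cached_audio):
            return self.recogniser(channel, cached_audio)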
On the basis of the technical scheme, the information acquisition module is further used for determining the effective voice time length of each audio frame in the voice information so as to determine the cache voice time length based on the effective voice time length; and sequentially storing each audio frame into a target buffer.
On the basis of the above technical solution, the target cache includes at least one message queue, and the information acquisition module is further configured to: and sequentially storing the audio frames into a message queue corresponding to the speaking user, and clearing the cached voice information in the message queue after detecting that all the cached voice information in the message queue is converted into characters.
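One way to realise the per-user message queues and the cached-voice-duration bookkeeping described above is sketched below; the 20 ms default frame duration and the value of VOICE_THRESHOLD are assumptions chosen only for illustration.

    from collections import defaultdict, deque

    VOICE_THRESHOLD = 2.0  # assumed preset voice-duration threshold, in seconds

    class TargetCache:
        # Holds one message queue per speaking user together with the accumulated
        # effective voice duration of the audio frames cached in that queue.
        def __init__(self):
            self.queues = defaultdict(deque)     # user_id -> queue of audio frames
            self.durations = defaultdict(float)  # user_id -> cached voice duration (s)

        def append_frame(self, user_id, frame, effective_duration=0.02):
            self.queues[user_id].append(frame)
            self.durations[user_id] += effective_duration

        def threshold_reached(self, user_id):
            return self.durations[user_id] >= VOICE_THRESHOLD

        def drain(self, user_id):
            # Clear the queue once all cached voice information has been converted to text.
            frames = list(self.queues[user_id])
            self.queues[user_id].clear()
            self.durations[user_id] = 0.0
            return frames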
On the basis of the technical scheme, the connection establishing module is further configured to establish a target audio transmission channel between the message queue and the voice-to-text module if the cached voice time reaches a preset voice time threshold and it is detected that the message queue to which the cached voice information belongs does not establish communication with the voice-to-text module.
On the basis of the above technical solution, the connection establishing module is further used for acquiring the current identification information of each audio transmission channel in the audio transmission channel pool, wherein the current identification information is used for representing the current state of the audio transmission channel, and the current state comprises an idle state or a busy state; and determining the target audio transmission channel from the audio transmission channels whose current identification information is in the idle state.
On the basis of the above technical solution, the connection establishing module is further used for acquiring the current channel number of the audio transmission channels in the audio transmission channel pool if the current states of all the audio transmission channels in the audio transmission channel pool are busy states; and if the current channel number is smaller than a preset channel number threshold, establishing a new audio transmission channel and determining the target audio transmission channel based on the new audio transmission channel.
On the basis of the above technical solution, the connection establishing module is further configured to wait for an audio transmission channel whose identification information is in an idle state to exist in the audio transmission channel pool if the current channel number is greater than or equal to a preset channel number threshold, and determine the target audio transmission channel based on the audio transmission channel.
On the basis of the technical scheme, the method further comprises the following steps: and if the cache voice time reaches a preset voice time threshold, detecting that an audio transmission channel corresponding to a message queue to which the cache voice information belongs exists, and determining the target audio transmission channel based on the audio transmission channel.
On the basis of the technical scheme, the method further comprises the following steps: if the cached voice time length is lower than a preset voice time length threshold value, and a message queue to which the cached voice information belongs is communicated with the voice-to-character module, judging whether the cached voice time length is smaller than a preset closing threshold value; and if not, transmitting the cached voice information based on the target audio transmission channel.
On the basis of the technical scheme, if the cache voice time length is less than the preset closing threshold value, the target audio transmission channel is released, and the current identification information of the target audio transmission channel is updated to be in an idle state when the release time length reaches the preset release time length threshold value.
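For the case just described, where the cached duration is below the voice duration threshold but the message queue already has an open channel, the decision against the closing threshold might look like the following sketch; CLOSE_THRESHOLD, the send callable and the release_marked_at field are illustrative assumptions carried over from the earlier sketches.

    import time

    CLOSE_THRESHOLD = 0.3  # assumed preset closing threshold, in seconds

    def handle_open_channel(channel, cache, user_id, send):
        # The message queue is already connected to the speech-to-text module but
        # the cached duration is below the voice duration threshold.
        duration = cache.durations[user_id]
        if duration >= CLOSE_THRESHOLD:
            # Not below the closing threshold: keep transmitting over the target channel.
            send(channel, cache.drain(user_id))
        else:
            # Below the closing threshold: mark the channel as released; the timed
            # monitor will set its identification to idle once the release
            # duration threshold has passed.
            channel.release_marked_at = time.time()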
On the basis of the above technical solution, the apparatus further comprises: a timing monitoring module, configured to periodically detect the release duration of an audio transmission channel whose current identification information is in the busy state, wherein the release duration is the duration for which the target audio transmission channel has been marked as the release state, and the release state is the state in which the audio transmission channel has not transmitted voice information within a preset time length; and when the release duration reaches a preset release duration threshold, update the current identification information of the audio transmission channel from the release state to the idle state, so that the target audio transmission channel is determined based on the updated identification information.
On the basis of the above technical solution, after determining the target audio transmission channel corresponding to the message queue to which the buffered voice information belongs, the method further includes: and marking the identity of a speaking user on the target audio transmission channel, so that when the cache voice time of the message queue reaches a preset voice time threshold, acquiring the target audio transmission channel corresponding to the identity from an audio transmission channel pool based on the identity of the message queue, and transmitting the cache voice information.
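Marking the target channel with the speaking user's identity, so that the same message queue can find it again the next time its cached duration reaches the threshold, could be sketched as below; the owner_id attribute and the lookup helper are assumed names.

    def bind_channel_to_user(channel, user_id):
        channel.owner_id = user_id  # tag the channel with the speaking user's identity

    def find_channel_for_user(pool, user_id):
        # Reuse the channel already marked with this identity, if one exists.
        for ch in pool:
            if getattr(ch, "owner_id", None) == user_id and ch.state == "busy":
                return ch
        return None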
On the basis of the above technical solution, the apparatus further includes: a display module, configured to send the text corresponding to the cached voice information to the client for display in a preset area of the client.
The information processing apparatus provided by the embodiment of the present disclosure can execute the information processing method provided by any embodiment of the present disclosure, and has the corresponding functional modules and beneficial effects of the executed method.
It should be noted that, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the embodiments of the present disclosure.
EXAMPLE seven
Referring now to fig. 8, a schematic diagram of an electronic device (e.g., a terminal device or a server in fig. 8) 800 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include a processing apparatus (e.g., a central processing unit, a graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic device 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage means 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
The electronic device provided by the embodiment of the present disclosure and the information processing method provided by the above embodiment belong to the same inventive concept, and technical details that are not described in detail in the embodiment can be referred to the above embodiment, and the embodiment has the same beneficial effects as the above embodiment.
Example eight
The disclosed embodiments provide a computer storage medium on which a computer program is stored, which when executed by a processor implements the information processing method provided by the above-described embodiments.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
collecting and caching voice information of each speaking user;
if the cached voice time of the target user reaches a preset voice time threshold, determining a target audio transmission channel from an audio transmission channel pool;
and transmitting the cached voice information to a voice-to-character module based on the target audio transmission channel, and processing the cached voice information into characters based on the voice-to-character module.
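Putting the pieces of the earlier sketches together, the three operations listed above could be driven by a loop of roughly the following shape; incoming_frames and speech_to_text are stand-ins for the real audio source and speech-to-text module, and the helpers reused here (TargetCache, acquire_channel, bind_channel_to_user, find_channel_for_user) come from the illustrative sketches given with the method embodiments.

    def process_audio(incoming_frames, cache, pool, speech_to_text):
        # incoming_frames yields (user_id, frame) pairs from the conference audio.
        for user_id, frame in incoming_frames:
            # Collect and cache the speaking user's voice information.
            cache.append_frame(user_id, frame)
            # Once the cached duration reaches the threshold, determine the target channel.
            if cache.threshold_reached(user_id):
                channel = find_channel_for_user(pool, user_id) or acquire_channel(pool)
                bind_channel_to_user(channel, user_id)
                # Send the cached voice over the channel and obtain the corresponding text.
                text = speech_to_text(channel, cache.drain(user_id))
                yield user_id, text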
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit/module does not in some cases constitute a limitation on the unit itself, for example, a behavioral data collection module may also be described as a "collection module".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, [ example one ] there is provided an information processing method including:
collecting and caching voice information of each speaking user;
if the cached voice time of the target user reaches a preset voice time threshold, determining a target audio transmission channel from an audio transmission channel pool;
and transmitting the cached voice information to a voice-to-character module based on the target audio transmission channel, and processing the cached voice information into characters based on the voice-to-character module.
According to one or more embodiments of the present disclosure, [ example two ] there is provided an information processing method, further comprising:
optionally, the collecting and caching voice information of each speaking user includes:
determining an effective voice duration of each audio frame in the voice information to determine the cache voice duration based on the effective voice duration;
and sequentially storing each audio frame into a target buffer.
According to one or more embodiments of the present disclosure, [ example three ] there is provided an information processing method, further comprising:
optionally, the target buffer includes at least one message queue, and the sequentially storing each audio frame into the target buffer includes:
and sequentially storing the audio frames into a message queue corresponding to the speaking user, and clearing the cached voice information in the message queue after all the cached voice information in the message queue is detected to be converted into characters.
According to one or more embodiments of the present disclosure [ example four ] there is provided an information processing method, further comprising:
optionally, if the cached voice time of the target user is determined to satisfy the preset voice time threshold, determining the target audio transmission channel from the audio transmission channel pool includes:
and if the cache voice time reaches a preset voice time threshold value and the message queue to which the cache voice information belongs is detected not to be communicated with the voice-to-text module, establishing a target audio transmission channel between the message queue and the voice-to-text module.
According to one or more embodiments of the present disclosure [ example five ] there is provided an information processing method, further comprising:
optionally, the determining a target audio transmission channel from the audio transmission channel pool includes:
acquiring current identification information of each audio transmission channel in an audio transmission channel pool; the current identification information is used for representing the current state of the audio transmission channel; the current state comprises an idle state or a busy state;
and determining a target audio transmission channel from the audio transmission channels with the current identification information in the idle state.
According to one or more embodiments of the present disclosure, [ example six ] there is provided an information processing method further comprising:
optionally, the determining a target audio transmission channel from the audio transmission channel pool includes:
if the current states of all the audio transmission channels in the audio transmission channel pool are busy states, acquiring the current channel number of the audio transmission channels in the audio transmission channel pool;
and if the current channel number is smaller than a preset channel number threshold value, reestablishing a new audio transmission channel, and determining the target audio transmission channel based on the new audio transmission channel.
According to one or more embodiments of the present disclosure, [ example seven ] there is provided an information processing method, further comprising:
optionally, if the current channel number is greater than or equal to a preset channel number threshold, waiting for an audio transmission channel whose identification information is in an idle state to exist in the audio transmission channel pool, and determining the target audio transmission channel based on the audio transmission channel.
According to one or more embodiments of the present disclosure, [ example eight ] there is provided an information processing method further comprising:
optionally, if the cached voice time reaches a preset voice time threshold, it is detected that an audio transmission channel corresponding to a message queue to which the cached voice information belongs exists, and the target audio transmission channel is determined based on the audio transmission channel.
According to one or more embodiments of the present disclosure, [ example nine ] there is provided an information processing method further comprising:
optionally, if the cached voice time length is lower than a preset voice time length threshold value, and a message queue to which the cached voice information belongs is already communicated with the voice-to-text module, judging whether the cached voice time length is smaller than a preset closing threshold value;
and if not, transmitting the cached voice information based on the target audio transmission channel.
According to one or more embodiments of the present disclosure, [ example ten ] there is provided an information processing method, further comprising:
optionally, if the duration of the buffered voice is less than the preset closing threshold, the target audio transmission channel is released, so that when the release duration reaches the preset release duration threshold, the current identification information of the target audio transmission channel is updated to an idle state.
According to one or more embodiments of the present disclosure, [ example eleven ] there is provided an information processing method, further comprising:
optionally, the release duration of the audio transmission channel with the busy current identification information is detected at regular time; the release duration is the duration that the target audio transmission channel is marked as a release state; the release state is a state that the audio transmission channel does not transmit the voice information within a preset time length;
and when the release duration reaches a preset release duration threshold, updating the current identification information of the audio transmission channel from the release state to an idle state, so as to determine the target audio transmission channel based on the updated identification information.
According to one or more embodiments of the present disclosure, [ example twelve ] there is provided an information processing method further comprising:
optionally, after determining a target audio transmission channel corresponding to a message queue to which the buffered voice information belongs, the method further includes:
and marking the identity of a speaking user on the target audio transmission channel, so that when the cache voice time of the message queue reaches a preset voice time threshold, acquiring the target audio transmission channel corresponding to the identity from an audio transmission channel pool based on the identity of the message queue, and transmitting the cache voice information.
According to one or more embodiments of the present disclosure, [ example thirteen ] provides an information processing method, further comprising:
optionally, the text corresponding to the cached voice information is sent to the client for displaying in a preset area of the client.
According to one or more embodiments of the present disclosure, [ example fourteen ] there is provided an information processing apparatus comprising:
the information acquisition module is used for acquiring and caching voice information of each speaking user;
the connection establishing module is used for determining a target audio transmission channel from the audio transmission channel pool if the cached voice time of the target user is determined to meet a preset voice time threshold;
and the voice information conversion module is used for sending the cached voice information to the voice-to-character module based on the target audio transmission channel so as to process the cached voice information into characters based on the voice-to-character module.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (15)

1. An information processing method characterized by comprising:
collecting and caching voice information of each speaking user;
if the cached voice time of the target speaking user reaches a preset voice time threshold, determining a target audio transmission channel from an audio transmission channel pool;
transmitting the cached voice information to a voice-to-text module based on the target audio transmission channel, and processing the cached voice information into text based on the voice-to-text module;
the determining a target audio transmission channel from the audio transmission channel pool includes:
if the current states of all the audio transmission channels in the audio transmission channel pool are busy states, acquiring the current channel number of the audio transmission channels in the audio transmission channel pool;
and if the current channel number is smaller than a preset channel number threshold value, reestablishing a new audio transmission channel, and determining the target audio transmission channel based on the new audio transmission channel.
2. The method of claim 1, wherein the collecting and buffering voice information of each speaking user comprises:
determining an effective voice duration of each audio frame in the voice information to determine the cache voice duration based on the effective voice duration;
and sequentially storing each audio frame into a target buffer.
3. The method of claim 2, wherein the target buffer comprises at least one message queue, and the sequentially storing each audio frame into the target buffer comprises:
and sequentially storing the audio frames into a message queue corresponding to the speaking user, and clearing the cached voice information in the message queue after detecting that all the cached voice information in the message queue is converted into characters.
4. The method of claim 1, wherein determining the target audio transmission channel from the pool of audio transmission channels if the cached voice time of the target speaking user is determined to satisfy the predetermined voice time threshold comprises:
and if the cache voice time reaches a preset voice time threshold value and the message queue to which the cache voice information belongs is detected not to be communicated with the voice-to-text module, establishing a target audio transmission channel between the message queue and the voice-to-text module.
5. The method of claim 4, wherein determining the target audio transmission channel from the pool of audio transmission channels comprises:
acquiring current identification information of each audio transmission channel in an audio transmission channel pool; the current identification information is used for representing the current state of the audio transmission channel; the current state comprises an idle state or a busy state;
and determining a target audio transmission channel from the audio transmission channels with the current identification information in the idle state.
6. The method of claim 1, further comprising:
and if the current channel number is larger than or equal to a preset channel number threshold value, waiting for the audio transmission channel with the identification information in the idle state in the audio transmission channel pool, and determining the target audio transmission channel based on the audio transmission channel.
7. The method of claim 1, further comprising:
and if the cache voice time reaches a preset voice time threshold, detecting that an audio transmission channel corresponding to the message queue to which the cache voice information belongs exists, and determining the target audio transmission channel based on the audio transmission channel.
8. The method of claim 1, further comprising:
if the cached voice time length is lower than a preset voice time length threshold value, and a message queue to which the cached voice information belongs is communicated with the voice-to-character module, judging whether the cached voice time length is smaller than a preset closing threshold value;
and if not, transmitting the cached voice information based on the target audio transmission channel.
9. The method of claim 8, further comprising:
and if the cache voice time length is less than the preset closing threshold value, releasing the target audio transmission channel, and updating the current identification information of the target audio transmission channel to be in an idle state when the release time length reaches the preset release time length threshold value.
10. The method of claim 9, further comprising:
regularly detecting the release duration of the audio transmission channel with the current identification information in a busy state; the release duration is the duration that the target audio transmission channel is marked as a release state; the release state is a state that the audio transmission channel does not transmit the voice information within a preset time length;
and when the release duration reaches a preset release duration threshold, updating the current identification information of the audio transmission channel from the release state to an idle state, so as to determine the target audio transmission channel based on the updated identification information.
11. The method according to any one of claims 1 to 10, further comprising, after determining a target audio transmission channel corresponding to a message queue to which the buffered voice information belongs:
and marking the identity of a speaking user on the target audio transmission channel, so that when the cache voice time of the message queue reaches a preset voice time threshold, acquiring the target audio transmission channel corresponding to the identity from an audio transmission channel pool based on the identity of the message queue, and transmitting the cache voice information.
12. The method of claim 1, further comprising:
and sending the characters corresponding to the cached voice information to the client so as to display the characters in a preset area of the client.
13. An information processing apparatus characterized by comprising:
the information acquisition module is used for acquiring and caching voice information of each speaking user;
the connection establishing module is used for determining a target audio transmission channel from the audio transmission channel pool if the cached voice time of the target speaking user reaches a preset voice time threshold;
a voice information conversion module for transmitting the cached voice information to a voice-to-character module based on the target audio transmission channel, and processing the cached voice information into characters based on the voice-to-character module;
wherein the connection establishing module is further configured to:
if the current states of all the audio transmission channels in the audio transmission channel pool are busy states, acquiring the current channel number of the audio transmission channels in the audio transmission channel pool;
and if the current channel number is smaller than a preset channel number threshold value, reestablishing a new audio transmission channel, and determining the target audio transmission channel based on the new audio transmission channel.
14. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the information processing method of any one of claims 1-12.
15. A storage medium containing computer-executable instructions for performing the information processing method of any one of claims 1 to 12 when executed by a computer processor.
CN202010531056.5A 2020-06-11 2020-06-11 Information processing method, information processing apparatus, electronic device, and medium Active CN111755008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010531056.5A CN111755008B (en) 2020-06-11 2020-06-11 Information processing method, information processing apparatus, electronic device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010531056.5A CN111755008B (en) 2020-06-11 2020-06-11 Information processing method, information processing apparatus, electronic device, and medium

Publications (2)

Publication Number Publication Date
CN111755008A CN111755008A (en) 2020-10-09
CN111755008B (en) 2022-05-27

Family

ID=72675939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010531056.5A Active CN111755008B (en) 2020-06-11 2020-06-11 Information processing method, information processing apparatus, electronic device, and medium

Country Status (1)

Country Link
CN (1) CN111755008B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651125A (en) * 2020-12-22 2021-04-13 郑州捷安高科股份有限公司 Simulated train communication method, device, equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3835032B2 (en) * 1998-12-18 2006-10-18 富士通株式会社 User verification device
EP1185976B1 (en) * 2000-02-25 2006-08-16 Philips Electronics N.V. Speech recognition device with reference transformation means
CN103325385B (en) * 2012-03-23 2018-01-26 杜比实验室特许公司 Voice communication method and equipment, the method and apparatus of operation wobble buffer
US9437205B2 (en) * 2013-05-10 2016-09-06 Tencent Technology (Shenzhen) Company Limited Method, application, and device for audio signal transmission
CN103312907B (en) * 2013-05-15 2016-04-13 腾讯科技(深圳)有限公司 Voice channel allocation management method, voice server and communication system
CN108346429B (en) * 2017-01-22 2022-07-08 腾讯科技(深圳)有限公司 Data transmission method and device based on voice recognition
CN106953797B (en) * 2017-04-05 2020-05-26 苏州浪潮智能科技有限公司 RDMA data transmission method and device based on dynamic connection
CN108074570A (en) * 2017-12-26 2018-05-25 安徽声讯信息技术有限公司 Surface trimming, transmission, the audio recognition method preserved
CN111148159B (en) * 2019-12-26 2023-06-09 拉扎斯网络科技(上海)有限公司 Data transmission method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN111755008A (en) 2020-10-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant