CN113724735A - Voice stream processing method and device, computer readable storage medium and electronic equipment - Google Patents


Info

Publication number
CN113724735A
CN113724735A
Authority
CN
China
Prior art keywords
voice
voice stream
segmented
stream
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111021151.1A
Other languages
Chinese (zh)
Inventor
李峰
陈海齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Boguan Information Technology Co Ltd
Original Assignee
Guangzhou Boguan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Boguan Information Technology Co Ltd filed Critical Guangzhou Boguan Information Technology Co Ltd
Priority to CN202111021151.1A priority Critical patent/CN113724735A/en
Publication of CN113724735A publication Critical patent/CN113724735A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Abstract

The disclosure belongs to the technical field of computers, and relates to a voice stream processing method and apparatus, a computer-readable storage medium and an electronic device. The method comprises the following steps: acquiring a segmented voice stream sent by a client, and determining a voice stream identifier corresponding to the segmented voice stream; and aggregating the segmented voice streams with the same voice stream identifier to obtain voice packets, and supervising the segmented voice streams based on the voice packets. In the disclosure, on the one hand, the processing of the voice stream does not depend on the client, which avoids having to acquire the voice stream by monitoring events in the client, reduces the complexity of acquiring the voice stream, and improves the efficiency of acquiring the voice stream; on the other hand, the client no longer has to resend a segmented voice stream whenever it needs to be modified, which increases the flexibility of modifying segmented voice streams and expands the application scenarios of voice stream supervision.

Description

Voice stream processing method and device, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a voice stream processing method, a voice stream processing apparatus, a computer-readable storage medium, and an electronic device.
Background
With the development of computer technology, many ways of enriching people's lives have emerged, such as playing virtual games, watching live broadcasts by anchors, and chatting in voice chat rooms. In these scenarios, voice streams need to be supervised in order to maintain a healthy network environment.
In the prior art, recording of a voice stream is usually implemented in the terminal. Specifically, during recording, the terminal needs to continuously monitor a corresponding event; once the event is detected, the voice stream is forwarded, in the manner of a voice robot, to a downstream server that supervises the voice stream.
In view of the above, there is a need in the art to develop a new voice stream processing method and apparatus.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a voice stream processing method, a voice stream processing apparatus, a computer-readable storage medium, and an electronic device, so as to overcome, at least to a certain extent, the problem in the related art that voice stream supervision is inefficient because of its high complexity.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of embodiments of the present invention, there is provided a voice stream processing method, the method including: acquiring a segmented voice stream sent by a client, and determining a voice stream identifier corresponding to the segmented voice stream; and aggregating the segmented voice streams with the same voice stream identifier to obtain voice packets, and supervising the segmented voice streams based on the voice packets.
In an exemplary embodiment of the present invention, before obtaining the segmented voice stream sent by the client, the method further includes: accessing a client to a voice transmission system to establish data communication between the client and the voice transmission system; recording the voice in the client by utilizing a voice transmission node in the voice transmission system based on the data communication between the client and the voice transmission system to obtain a segmented voice stream; and acquiring a target forwarding address, and sending the segmented voice stream based on the target forwarding address, so that a target server corresponding to the target forwarding address receives the segmented voice stream sent by the client.
In an exemplary embodiment of the present invention, the sending the segmented voice stream based on the target forwarding address includes: and transmitting the segmented voice stream by using a connectionless transmission protocol based on the target forwarding address.
In an exemplary embodiment of the invention, the segmented voice stream comprises a voice stream of a single user; the aggregating of the segmented voice streams with the same voice stream identifier to obtain a voice packet includes: if the segmented voice stream is of a single user, dividing the segmented voice stream according to the voice stream identifier to obtain a first division result; acquiring a user identifier corresponding to the segmented voice stream, and determining a hash value corresponding to the user identifier; and in the first division result, according to the hash value, aggregating the segmented voice streams with the same user identifier to obtain a voice packet.
In an exemplary embodiment of the invention, the segmented voice stream is a mixed voice stream of a plurality of the users; the aggregating the segmented voice streams with the same voice stream identifier to obtain a voice packet includes: if the segmented voice stream is the mixed voice stream, dividing the mixed voice stream according to the voice stream identification to obtain a second division result; acquiring a channel identifier corresponding to the second division result, and aggregating the mixed voice streams with the same channel identifier in the second division result to obtain an aggregated result; determining user identifications of a plurality of users in the mixed voice stream in the aggregation result, and determining hash values corresponding to the user identifications; and in the aggregation result, according to the hash value, extracting the user segmented voice stream with the same user identifier in the mixed voice stream so as to aggregate the user segmented voice stream to obtain a voice packet.
In an exemplary embodiment of the present invention, the voice packet includes at least one user segmented voice stream; the supervising of the segmented voice stream based on the voice packet includes: acquiring the user segmented voice stream in the voice packet, and acquiring a volume value and a sound continuity value corresponding to the user segmented voice stream, wherein the sound continuity value is used to characterize the continuity of the sound; determining a first target voice stream whose volume value is smaller than a volume threshold in the user segmented voice stream, and determining a second target voice stream whose sound continuity value is smaller than a continuity threshold in the user segmented voice stream; deleting the first target voice stream and the second target voice stream from the voice packet to obtain the deleted voice packet; and supervising the segmented voice stream based on the deleted voice packet.
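The pruning described in this embodiment can be sketched as a filter over (volume value, sound continuity value, segment) entries. The threshold values and data layout below are assumptions for illustration, not details taken from the disclosure:

```python
# Hypothetical sketch: delete the first target voice streams (volume below
# threshold) and second target voice streams (continuity below threshold)
# from a voice packet before supervision. Thresholds are assumed values.
VOLUME_THRESHOLD = 0.1       # assumed volume value threshold
CONTINUITY_THRESHOLD = 0.5   # assumed sound continuity value threshold

def prune_voice_packet(packet):
    """Keep only entries loud and continuous enough to supervise.

    Each entry is a (volume_value, continuity_value, segment) tuple."""
    return [
        (vol, cont, seg)
        for vol, cont, seg in packet
        if vol >= VOLUME_THRESHOLD and cont >= CONTINUITY_THRESHOLD
    ]

packet = [
    (0.05, 0.9, b"first target voice stream"),   # too quiet: deleted
    (0.80, 0.2, b"second target voice stream"),  # too discontinuous: deleted
    (0.80, 0.9, b"kept for supervision"),
]
print(len(prune_voice_packet(packet)))  # 1
```

Filtering these streams out up front avoids spending recognition effort on audio that carries no usable speech.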
In an exemplary embodiment of the present invention, the supervising of the segmented voice stream includes: recognizing the segmented voice stream to obtain text information corresponding to the segmented voice stream; and transcoding the segmented voice stream to obtain voice corresponding to the segmented voice stream, and judging sensitive words in the text information in combination with the context corresponding to the voice, so as to supervise the segmented voice stream.
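A minimal sketch of the sensitive-word judgment: after recognition yields text information, flag sensitive words together with their surrounding words, used here as a crude textual stand-in for the context the embodiment derives from the transcoded voice. The word list, window size, and function name are assumptions:

```python
# Hypothetical sketch: flag sensitive words in recognized text along with
# a small window of surrounding words as context. The word list is assumed.
SENSITIVE_WORDS = {"scam"}

def supervise_text(text_information: str):
    """Return (sensitive word, surrounding context) pairs for flagged words."""
    words = text_information.split()
    hits = []
    for i, word in enumerate(words):
        if word in SENSITIVE_WORDS:
            # Take up to two words on each side as context for the judgment.
            context = " ".join(words[max(0, i - 2):i + 3])
            hits.append((word, context))
    return hits

print(supervise_text("join my scam group now"))
# [('scam', 'join my scam group now')]
```

Judging a word together with its context, rather than in isolation, is what lets the supervision step distinguish genuinely sensitive usage from harmless mentions.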
According to a second aspect of the embodiments of the present invention, there is provided a voice stream processing apparatus, including: an acquisition module configured to acquire a segmented voice stream sent by a client and determine a voice stream identifier corresponding to the segmented voice stream; and an aggregation module configured to aggregate the segmented voice streams with the same voice stream identifier to obtain voice packets, and to supervise the segmented voice streams based on the voice packets.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus including: a processor and a memory; wherein the memory has stored thereon computer readable instructions, which when executed by the processor, implement the voice stream processing method of any of the above exemplary embodiments.
According to a fourth aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the voice stream processing method in any of the above-described exemplary embodiments.
As can be seen from the foregoing technical solutions, the voice stream processing method, the voice stream processing apparatus, the computer-readable storage medium and the electronic device in the exemplary embodiments of the present invention have at least the following advantages and positive effects:
in the method and the apparatus provided by the exemplary embodiments of the disclosure, on the one hand, the processing of the voice stream does not depend on the client, which avoids having to acquire the voice stream by monitoring events in the client, reduces the complexity of acquiring the voice stream, and improves the efficiency of acquiring the voice stream; on the other hand, the client no longer has to resend a segmented voice stream whenever it needs to be modified, which increases the flexibility of modifying segmented voice streams and expands the application scenarios of voice stream supervision.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 schematically illustrates a flow chart of a voice stream processing method in an embodiment of the present disclosure;
fig. 2 schematically illustrates a flow chart before a segmented voice stream sent by a client is obtained in an embodiment of the present disclosure;
fig. 3 is a schematic flow chart illustrating aggregation of segmented voice streams having the same voice stream identifier into voice packets according to an embodiment of the present disclosure;
fig. 4 is a schematic flow chart illustrating aggregation of segmented voice streams having the same voice stream identifier into voice packets according to an embodiment of the present disclosure;
FIG. 5 is a flow chart schematically illustrating supervising a segmented voice stream based on voice packets in an embodiment of the present disclosure;
FIG. 6 is a flow diagram schematically illustrating supervising a segmented voice stream in an embodiment of the present disclosure;
FIG. 7 is a flow chart schematically illustrating an implementation of supervising segmented voice streams in an embodiment of the present disclosure;
FIG. 8 schematically illustrates an interface of a speech supervision background in an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart illustrating a process of sending warning information to a target client in the disclosed embodiment;
FIG. 10 is a flow chart of a voice stream processing method in an application scenario;
fig. 11 is a schematic structural diagram of a voice stream processing apparatus according to an embodiment of the present disclosure;
fig. 12 schematically illustrates an electronic device for a voice stream processing method in an embodiment of the present disclosure;
fig. 13 schematically illustrates a computer-readable storage medium for a voice stream processing method in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of their objects.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
In order to solve the problems in the related art, the present disclosure provides a voice stream processing method. Fig. 1 shows a flow chart of a voice stream processing method, as shown in fig. 1, the voice stream processing method at least includes the following steps:
step S110, the segmented voice stream sent by the client is obtained, and the voice stream identification corresponding to the segmented voice stream is determined.
In step S120, the segmented voice streams with the same voice stream identifier are aggregated to obtain voice packets, and the segmented voice streams are supervised based on the voice packets.
In the method and the apparatus provided by the exemplary embodiments of the disclosure, on the one hand, the processing of the voice stream does not depend on the client, which avoids having to acquire the voice stream by monitoring events in the client, reduces the complexity of acquiring the voice stream, and improves the efficiency of acquiring the voice stream; on the other hand, the client no longer has to resend a segmented voice stream whenever it needs to be modified, which increases the flexibility of modifying segmented voice streams and expands the application scenarios of voice stream supervision.
The following describes each step of the voice stream processing method in detail.
In step S110, a segmented voice stream sent by the client is obtained, and a voice stream identifier corresponding to the segmented voice stream is determined.
In an exemplary embodiment of the present disclosure, the segmented voice stream is voice information segmented in units of minutes before being sent from the client. Segmenting in units of minutes prevents any single voice stream from occupying too much memory. For example, in a live broadcast client on a terminal, anchor A broadcasts for 3 hours; the 3-hour voice stream cannot be sent from the client all at once, since doing so would consume excessive resources on the terminal where the client is located and, in turn, reduce the speed at which the server receives the voice stream. Therefore, the voice stream of anchor A is segmented in units of minutes in the live broadcast client to obtain segmented voice streams, which are then sent.
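The minute-based segmentation described above can be sketched as splitting a raw recording into fixed-length pieces. The sample rate, sample width, and function name below are assumptions for illustration, not details from the disclosure:

```python
# Hypothetical sketch: split a long recording into minute-long segmented
# voice streams so the client never uploads hours of audio in one piece.
SAMPLE_RATE = 16000       # samples per second, assumed 16 kHz mono
BYTES_PER_SAMPLE = 2      # assumed 16-bit PCM
SEGMENT_SECONDS = 60      # the embodiment segments in units of minutes

def segment_voice(pcm: bytes, segment_seconds: int = SEGMENT_SECONDS) -> list[bytes]:
    """Split raw PCM audio into fixed-duration segmented voice streams."""
    step = SAMPLE_RATE * BYTES_PER_SAMPLE * segment_seconds
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]

# A 3-hour broadcast would yield 180 segments; a 3-minute recording yields 3.
three_minutes = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE * 60 * 3)
print(len(segment_voice(three_minutes)))  # 3
```

Each chunk is small enough to upload as it is produced, which is what keeps terminal resource usage bounded.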
Specifically, the segmented voice stream may be transmitted from a virtual game client in the terminal, may be transmitted from a spectator client in the terminal, or may be transmitted from a live client in the terminal, which is not particularly limited in this exemplary embodiment. The segmented voice stream may be obtained by a server or a server cluster, which is not particularly limited in this exemplary embodiment.
Based on this, the segmented voice stream may include voice information of a single user, and may also include voice information of multiple users in the same client, for example, in a live client, the segmented voice stream may be voice information of a single person, and in a virtual game or a voice chat room, the segmented voice stream may be voice information of multiple users in a specific virtual game team, and may be voice information of multiple users in a voice chat room, which is not particularly limited in this exemplary embodiment.
It should be noted that each segmented voice stream has a corresponding voice stream identifier. For example, when the segmented voice stream is the anchor's voice stream in a live broadcast client, the voice stream identifier is the identifier corresponding to the live broadcast client, indicating that the voice stream is generated from the live broadcast client. Likewise, when the segmented voice stream is a voice stream of a plurality of audience users in a voice chat room, the voice stream identifier is the identifier corresponding to the plurality of audience clients in the voice chat room, indicating that the voice streams of the plurality of users are generated from the audience clients in that voice chat room.
When the client sends the segmented voice stream to the server or the server cluster, the mapping relation between the segmented voice stream and the voice stream identifier is automatically established, the voice stream identifier corresponds to the client which generates the segmented voice stream, and when the server or the server cluster receives the segmented voice stream, the server or the server cluster can know which client the segmented voice stream comes from by determining the voice stream identifier corresponding to the segmented voice stream, so that the subsequent supervision on the segmented voice stream is facilitated.
For example, the server cluster acquires a segmented voice stream 1 from the client a, a segmented voice stream 2 from the client B, a segmented voice stream 3 and a segmented voice stream 4 from the client C, and acquires voice stream identifiers corresponding to the segmented voice stream 1, the segmented voice stream 2, the segmented voice stream 3 and the segmented voice stream 4, respectively, so as to obtain a voice stream identifier D, a voice stream identifier E, a voice stream identifier F and a voice stream identifier G.
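The mapping between received segmented voice streams and their voice stream identifiers, as in the example above, could be kept in a simple lookup structure on the server side. This is a minimal sketch; the tuples and names are hypothetical placeholders for real stream payloads:

```python
# Hypothetical sketch: a receiving server records which voice stream
# identifier accompanies each segmented voice stream, mirroring the
# example of streams 1-4 and identifiers D-G above.
received = [
    ("segmented voice stream 1", "D"),  # from client A
    ("segmented voice stream 2", "E"),  # from client B
    ("segmented voice stream 3", "F"),  # from client C
    ("segmented voice stream 4", "G"),  # from client C
]

stream_to_identifier = dict(received)
print(stream_to_identifier["segmented voice stream 3"])  # F
```

With this mapping in hand, the server can later group streams by identifier without consulting the client again.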
In an alternative embodiment, fig. 2 shows a flow chart before a segmented voice stream sent by a client is obtained in a voice stream processing method, as shown in fig. 2, the method at least includes the following steps: in step S210, the client is connected to the voice transmission system to establish data communication between the client and the voice transmission system.
The voice transmission system refers to a system provided with a voice transmission node, and specifically, the voice transmission system enables a client to transmit a segmented voice stream to the voice transmission node in the voice transmission system by establishing data communication with the client. When the voice transmission node receives the segmented voice stream sent by the client, the segmented voice stream is transmitted to a subsequent required server or a server cluster.
For example, there are 2 clients: a live broadcast client and an audience client, each containing a voice software development kit. The live broadcast client and the audience client both access the voice transmission system, so that each of them establishes data communication with the voice transmission system.
In step S220, based on the data communication between the client and the voice transmission system, the voice in the client is recorded by using the voice transmission node in the voice transmission system to obtain a segmented voice stream.
After data communication is established between the client and the voice transmission system, voice can be acquired from a voice software development kit in the client by using a voice transmission node in the voice transmission system, so that the voice in the client is recorded, and a segmented voice stream is obtained.
For example, after the live broadcast client and the audience client have each established data communication with the voice transmission system, the voice transmission node in the voice transmission system records the voices in the live broadcast client and the audience client in real time, typically in units of minutes, to obtain the segmented voice streams in the two clients.
In step S230, a target forwarding address is obtained, and the segmented voice stream is sent based on the target forwarding address, so that the target server corresponding to the target forwarding address receives the segmented voice stream sent by the client.
The target server refers to a downstream server that needs to use the segmented voice stream; it may be, for example, a voice stream supervision server. The target server may be a single receiving server for the segmented voice stream, or a cluster of receiving servers, which is not particularly limited in this exemplary embodiment. Based on this, the target forwarding address refers to the receiving address of the target server. As the voice transmission system obtains segmented voice streams in real time, it sends them to the target server corresponding to the target forwarding address, so that the target server receives them.
For example, the target forwarding address is XXX, and the segmented voice stream obtained in real time by the voice transmission system is transmitted to the server cluster corresponding to the target forwarding address XXX.
In the present exemplary embodiment, on the one hand, the voice in the client is recorded through the voice transmission system, so that the voice recording process no longer depends on the client; this avoids having to acquire the voice stream by monitoring an event in the client, reduces the complexity of acquiring the voice stream, and improves the efficiency of acquiring the voice stream. On the other hand, if a segmented voice stream subsequently needs to be modified, it can be acquired directly from the voice transmission system without relying on the client to resend it, which increases the flexibility of modifying segmented voice streams and expands the application scenarios of voice stream supervision.
In an alternative embodiment, sending the segmented voice stream based on the target forwarding address includes: transmitting the segmented voice stream using a connectionless transport protocol based on the target forwarding address.
Through a connectionless transport protocol, the voice transmission system can successfully send the segmented voice stream to the target server even when no connection exists between the two, and it can distribute a large number of segmented voice streams across multiple processing units during transmission.
For example, the segmented voice stream is sent to the server cluster corresponding to the target forwarding address XXX using a connectionless transport protocol.
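A connectionless transport protocol here would typically be UDP. The sketch below sends one segment as a single UDP datagram; the function name, the loopback address standing in for the unspecified target forwarding address, and the one-datagram-per-segment simplification are all assumptions (real minute-long segments would exceed datagram size limits and need chunking):

```python
# Hypothetical sketch: forward a segmented voice stream over UDP, a
# connectionless transport, so no connection to the target server is
# established first.
import socket

def send_segment(segment: bytes, target_forwarding_address: tuple[str, int]) -> None:
    """Send one segmented voice stream as a single UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # connectionless
    try:
        sock.sendto(segment, target_forwarding_address)
    finally:
        sock.close()

# Loopback address used purely for illustration.
send_segment(b"segmented voice stream", ("127.0.0.1", 9999))
```

Because `sendto` neither performs a handshake nor waits for acknowledgement, a slow or briefly unreachable target server cannot stall the sender, which matches the stability claim above.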
In the exemplary embodiment, the transmission of the segmented voice stream utilizes a connectionless transmission protocol, which enhances the system stability of the whole supervision process of the segmented voice stream.
In step S120, the segmented voice streams having the same voice stream identifier are aggregated to obtain voice packets, and the segmented voice streams are supervised based on the voice packets.
In an exemplary embodiment of the present disclosure, voice streams with the same voice stream identifier are aggregated, an obtained aggregation result is a voice packet, and the voice packet is supervised to implement supervision on the voice stream sent by the client.
For example, a segmented voice stream A, a segmented voice stream B, a segmented voice stream C, and a segmented voice stream D sent by different clients are acquired; the voice stream identifier corresponding to segmented voice stream A is a1, that of segmented voice stream B is a1, that of segmented voice stream C is c1, and that of segmented voice stream D is d1.
Based on this, the voice streams having the same voice stream identifier are grouped together to obtain 3 voice packets: voice packet a, which includes segmented voice stream A and segmented voice stream B; voice packet b, which includes segmented voice stream C; and voice packet c, which includes segmented voice stream D.
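The aggregation step in this example can be sketched as grouping by identifier. The literal byte strings below stand in for real audio payloads:

```python
# Hypothetical sketch: aggregate segmented voice streams that share a
# voice stream identifier into one voice packet, as in the example above.
from collections import defaultdict

segments = [
    ("a1", b"segmented voice stream A"),
    ("a1", b"segmented voice stream B"),
    ("c1", b"segmented voice stream C"),
    ("d1", b"segmented voice stream D"),
]

voice_packets: dict[str, list[bytes]] = defaultdict(list)
for identifier, segment in segments:
    voice_packets[identifier].append(segment)

print(len(voice_packets))  # 3 voice packets, matching the example
```

Grouping by identifier first means each voice packet already contains only streams from one source, which simplifies every later supervision step.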
In an alternative embodiment, fig. 3 shows a flow chart of aggregating segmented voice streams with the same voice stream identifier to obtain voice packets in a voice stream processing method, where the segmented voice streams include voice streams of individual users. As shown in fig. 3, the method at least includes the following steps: in step S310, if the segmented voice stream is of a single user, the segmented voice stream is divided according to the voice stream identifier to obtain a first division result.
Because different types of clients exist, a segmented voice stream may contain only a single user's voice. Such streams are first divided according to the voice stream identifier corresponding to the client to obtain a first division result; it is worth noting that each part of the first division result consists of segmented voice streams from the same client.
For example, the obtained segmented voice streams include a segmented voice stream A, a segmented voice stream B, and a segmented voice stream C; the voice stream identifier of segmented voice stream A is 123XXX, that of segmented voice stream B is 123XXX, and that of segmented voice stream C is 234XXX.
Based on the above, the 3 segmented voice streams are divided into two parts according to the voice stream identification, wherein the first part comprises a segmented voice stream a and a segmented voice stream B, and the second part comprises a segmented voice stream C.
In step S320, a user identifier corresponding to the segmented voice stream is obtained, and a hash value corresponding to the user identifier is determined.
The client to which a segmented voice stream belongs can be determined through its voice stream identifier, and the segmented voice stream is determined to be the voice stream of a single user according to the type of the client, so there is also a user identifier corresponding to the segmented voice stream. When a client carrying only a single user's voice stream generates a voice stream, it establishes both a mapping between the segmented voice stream and the voice stream identifier and a mapping between the segmented voice stream and the user identifier, so that it is clear which user in which client produced the segmented voice stream.
The hash value is a fixed-length value used to distinguish user identifiers; through the hash values corresponding to the user identifiers, it is possible to determine which segmented voice streams share the same user identifier.
For example, the segmented voice streams include a segmented voice stream A, a segmented voice stream B, and a segmented voice stream C. The user identifier of segmented voice stream A is XXXX1, that of segmented voice stream B is XXXX1, and that of segmented voice stream C is 2XXXX; the hash value corresponding to the user identifier XXXX1 is XX0, and the hash value corresponding to the user identifier 2XXXX is XX2.
In step S330, in the first division result, the segmented voice streams with the same user identifier are aggregated according to the hash value to obtain a voice packet.
That is, the segmented voice streams in the first division result are aggregated into voice packets based on the hash values corresponding to their user identifiers.
Continuing the example above, segmented voice stream A and segmented voice stream B share the user identifier XXXX1 and hence the hash value XX0, while segmented voice stream C has the user identifier 2XXXX and the hash value XX2. Further, the 3 segmented voice streams are divided into two parts according to the voice stream identifier: the first part includes segmented voice stream A and segmented voice stream B, and the second part includes segmented voice stream C.
Based on this, two voice packets are obtained: a voice packet containing segmented voice stream A and segmented voice stream B, and a voice packet containing segmented voice stream C.
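The hash-and-aggregate steps (S320/S330) can be sketched as follows. The use of a truncated MD5 digest as the fixed-length hash value, and the dictionary mapping stream names to user identifiers, are assumptions for illustration; the patent does not specify a hash function.

```python
import hashlib
from collections import defaultdict

def user_hash(user_id: str) -> str:
    """One possible realization of the fixed-length 'hash value' derived
    from a user identifier (truncated MD5; an assumption, not the
    patent's choice)."""
    return hashlib.md5(user_id.encode()).hexdigest()[:8]

def aggregate_by_user(division_part, user_ids):
    """Within one part of the first division result, aggregate segmented
    voice streams whose user identifiers hash to the same value into
    voice packets."""
    packets = defaultdict(list)
    for stream in division_part:
        packets[user_hash(user_ids[stream])].append(stream)
    return list(packets.values())

# Example from the text: within the first division part, A and B share
# the user identifier XXXX1, hence the same hash value, and end up in
# one voice packet.
part1 = ["A", "B"]
ids = {"A": "XXXX1", "B": "XXXX1", "C": "2XXXX"}
packets = aggregate_by_user(part1, ids)
```

Here `packets` contains a single voice packet holding both A and B, matching the first voice packet of the example.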
In the exemplary embodiment, a way of aggregating the segmented voice streams generated by a single user is provided, which facilitates the subsequent supervision of each single user's segmented voice streams.
In an alternative embodiment, fig. 4 shows a flow chart of aggregating segmented voice streams with the same voice stream identifier to obtain voice packets in a voice stream processing method, where the segmented voice streams include a mixed voice stream of multiple users. As shown in fig. 4, the method at least includes the following steps: in step S410, if the segmented voice stream is a mixed voice stream, the mixed voice stream is divided according to the voice stream identifier to obtain a second division result.
Here, a mixed voice stream refers to a segmented voice stream containing the voices of multiple users; for example, in a voice chat room, the segmented voice stream generated every minute may contain the voices of multiple users in that room.
For example, if there is a segmented voice stream a, a segmented voice stream B, and a segmented voice stream C, and the voice stream identifier corresponding to the segmented voice stream a is XXXX23, the voice stream identifier corresponding to the segmented voice stream B is XXXX23, and the voice stream identifier corresponding to the segmented voice stream C is XXXX31, the second division result includes two parts, where the first part includes the segmented voice stream a and the segmented voice stream B, and the second part includes the segmented voice stream C.
In step S420, a channel identifier corresponding to a second division result is obtained, and in the second division result, the mixed voice streams with the same channel identifier are aggregated to obtain an aggregated result.
The channel identifier refers to the identifier of the channel where the multiple users are located. For example, in a viewer client, multiple voice chat rooms may exist, including chat room A and chat room B, and the channel identifier indicates which chat room a segmented voice stream comes from.
Continuing the example above, the second division result includes two parts: the first part includes segmented voice stream A and segmented voice stream B, and the second part includes segmented voice stream C.
The channel identifier of segmented voice stream A is XXX01, that of segmented voice stream B is XXX01, and that of segmented voice stream C is XXX01; the aggregation result then has three parts, the first part being segmented voice stream A, the second part being segmented voice stream B, and the third part being segmented voice stream C.
In step S430, in the aggregation result, the user identities of the plurality of users in the mixed voice stream are determined, and the hash value corresponding to the user identities is determined.
Since mixed voice streams comprising multiple users may exist in the aggregation result, the user identifiers of the multiple users corresponding to each mixed voice stream are determined, so that the hash values corresponding to those user identifiers can be determined.
For example, the aggregation result has three parts, the first part is a segmented voice stream a, the second part is a segmented voice stream B, and the third part is a segmented voice stream C.
The segmented voice stream A contains four user voice segments with the user identities X1, X2, X1, and X3 (three distinct users), segmented voice stream B contains two user identities, X4 and X5, and segmented voice stream C contains two user identities, X6 and X1.
In step S440, in the aggregation result, according to the hash value, the user segmented voice streams having the same user identifier in the mixed voice stream are extracted, so as to aggregate the user segmented voice streams to obtain a voice packet.
And extracting the user segmented voice stream with the same user identification from the aggregation result according to the hash value to obtain the voice packet.
For example, the aggregation result has three parts, the first part is a segmented voice stream a, the second part is a segmented voice stream B, and the third part is a segmented voice stream C.
As above, segmented voice stream A contains the user identities X1, X2, X1, and X3 (three distinct users), segmented voice stream B contains the user identities X4 and X5, and segmented voice stream C contains the user identities X6 and X1.
Based on this, in the first part of the aggregation result, three voice packets can be extracted: a voice packet containing the 2 segmented voice streams corresponding to the user identity X1, a voice packet containing the 1 segmented voice stream corresponding to the user identity X2, and a voice packet containing the 1 segmented voice stream corresponding to the user identity X3. In addition, in the second part of the aggregation result, two voice packets can be extracted: a voice packet containing the 1 segmented voice stream corresponding to the user identity X4 and a voice packet containing the 1 segmented voice stream corresponding to the user identity X5. In the third part of the aggregation result, two voice packets can be extracted: a voice packet containing the 1 segmented voice stream corresponding to the user identity X6 and a voice packet containing the 1 segmented voice stream corresponding to the user identity X1.
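The extraction of step S440 can be sketched as follows. Representing each part of the aggregation result as a list of (user_identity, segment) pairs is an assumption made for illustration.

```python
from collections import defaultdict

def extract_user_packets(mixed_part):
    """From one part of the aggregation result, collect the user
    segmented voice streams that share a user identity into voice
    packets. Each element of `mixed_part` is (user_identity, segment);
    this layout is assumed for the example."""
    packets = defaultdict(list)
    for user_id, segment in mixed_part:
        packets[user_id].append(segment)
    return dict(packets)

# First part of the example: identities X1, X2, X1, X3 yield three
# voice packets, with two segments grouped under X1.
part_a = [("X1", "seg1"), ("X2", "seg2"), ("X1", "seg3"), ("X3", "seg4")]
packets = extract_user_packets(part_a)
```

The X1 packet ends up with two user segmented voice streams, while the X2 and X3 packets each hold one, mirroring the example above.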
In the exemplary embodiment, a way of aggregating mixed voice streams is provided, so that not only the segmented voice streams of a single user but also mixed voice streams can be supervised. This not only enlarges the usage scenarios of voice stream supervision, but also allows the subsequent supervision process to combine the two dimensions of single-user voice streams and mixed voice streams, yielding a more accurate supervision result.
In an alternative embodiment, fig. 5 shows a flow chart of supervising a segmented voice stream based on voice packets in a voice stream processing method. As shown in fig. 5, the method at least includes the following steps: in step S510, a user segmented voice stream in the voice packet is obtained, and a volume value and a sound continuity value corresponding to the user segmented voice stream are obtained, where the sound continuity value is used to characterize the continuity of the sound.
The volume value represents the volume of the sound corresponding to the segmented voice stream, and the sound continuity value represents the continuity of the sound corresponding to the segmented voice stream.
For example, the voice packet includes 3 user segmented voice streams, which are a user segmented voice stream 1, a user segmented voice stream 2, and a user segmented voice stream 3, respectively, wherein the volume value corresponding to the user segmented voice stream 1 is 1, the sound continuity value corresponding to the user segmented voice stream 1 is 0.5, the volume value corresponding to the user segmented voice stream 2 is 7, the sound continuity value corresponding to the user segmented voice stream 2 is 8, the volume value corresponding to the user segmented voice stream 3 is 6, and the sound continuity value corresponding to the user segmented voice stream 3 is 7.
In step S520, a first target voice stream whose volume value is smaller than a volume threshold is determined among the user segmented voice streams, and a second target voice stream whose sound continuity value is smaller than a continuity threshold is determined among the user segmented voice streams.
The volume threshold is a value for measuring the loudness of the sound; if the volume value is smaller than the volume threshold, the sound is too quiet for its content to be distinguished. The continuity threshold is a value for measuring the continuity of the sound; if the continuity value is smaller than the continuity threshold, the continuity of the sound is poor and its content is likewise difficult to distinguish.
For example, the voice packet includes 3 user segmented voice streams: user segmented voice stream 1, user segmented voice stream 2, and user segmented voice stream 3. The volume value of user segmented voice stream 1 is 1 and its sound continuity value is 6; the volume value of user segmented voice stream 2 is 7 and its sound continuity value is 4; the volume value of user segmented voice stream 3 is 6 and its sound continuity value is 7.
With a volume threshold of 5 and a continuity threshold of 5, among user segmented voice streams 1, 2, and 3, the volume value of user segmented voice stream 1 is smaller than the volume threshold, so user segmented voice stream 1 is the first target voice stream; the sound continuity value of user segmented voice stream 2 is smaller than the continuity threshold, so user segmented voice stream 2 is the second target voice stream.
In step S530, the first target voice stream and the second target voice stream are deleted from the voice packet to obtain a deleted voice packet.
The first target voice stream and the second target voice stream are deleted from the voice packet so that no voice stream whose sound content cannot be distinguished remains in the voice packet.
For example, the voice packet includes 3 user segmented voice streams, which are a user segmented voice stream 1, a user segmented voice stream 2, and a user segmented voice stream 3, respectively, where the user segmented voice stream 1 is a first target voice stream, and the user segmented voice stream 2 is a second target voice stream, and then the user segmented voice stream 1 and the user segmented voice stream 2 are deleted, so as to obtain the voice packet including the user segmented voice stream 3.
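Steps S510 through S530 amount to a threshold filter over the voice packet. A minimal sketch follows; the (name, volume, continuity) tuple layout and the default thresholds of 5 are taken from the example, not from any normative definition.

```python
def filter_voice_packet(packet, volume_threshold=5, continuity_threshold=5):
    """Delete streams whose volume value is below the volume threshold
    (first target voice stream) or whose sound continuity value is
    below the continuity threshold (second target voice stream).
    Each entry is (name, volume, continuity); the layout is assumed."""
    kept = []
    for name, volume, continuity in packet:
        if volume < volume_threshold:          # first target voice stream
            continue
        if continuity < continuity_threshold:  # second target voice stream
            continue
        kept.append(name)
    return kept

# Example from the text: stream 1 (volume 1) and stream 2 (continuity 4)
# are deleted; only stream 3 remains in the deleted voice packet.
packet = [("stream1", 1, 6), ("stream2", 7, 4), ("stream3", 6, 7)]
remaining = filter_voice_packet(packet)
```

After filtering, only user segmented voice stream 3 remains, as in the example.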
In step S540, the segmented voice stream is supervised based on the deleted voice packet.
And according to the deleted voice packet, realizing the supervision of the segmented voice stream.
For example, the character recognition is performed on the segmented voice stream in the deleted voice packet, and the supervision on the segmented voice stream is realized according to the character recognition result.
In the exemplary embodiment, the first target voice stream whose volume value is smaller than the volume threshold and the second target voice stream whose sound continuity value is smaller than the continuity threshold are deleted from the voice packet. This ensures that every segmented voice stream remaining in the voice packet has distinguishable sound, avoids wasting supervision on segmented voice streams whose sound content cannot be distinguished, and improves the efficiency of supervising segmented voice streams.
In an alternative embodiment, fig. 6 shows a flow diagram of policing segmented voice streams in a voice stream processing method, as shown in fig. 6, the method at least includes the following steps: in step S610, the segmented voice stream is recognized to obtain text information corresponding to the segmented voice stream.
The segmented voice stream is recognized, for example by a text recognition module, to obtain the text information corresponding to the segmented voice stream.
For example, the segmented voice stream includes a segmented voice stream a and a segmented voice stream B, and the segmented voice stream a is recognized by the voice recognition module to obtain text information a1 corresponding to the segmented voice stream a, and the segmented voice stream B is recognized to obtain text information B1 corresponding to the segmented voice stream B.
In step S620, the segmented voice stream is transcoded to obtain the voice corresponding to the segmented voice stream, and sensitive vocabulary determination is performed on the text information in combination with the context of the voice, so as to implement supervision of the segmented voice stream.
The segmented voice stream is transcoded into playable voice. Sensitive vocabulary determination of the text information can be performed in two ways: one is to first screen the text information with a sensitive vocabulary determination module, then play the voice as needed and determine sensitive vocabulary again in combination with the context of the voice; the other is to play the voice directly and determine sensitive vocabulary in the text information in combination with the context of the voice.
If sensitive vocabulary exists in the text information, corresponding warning information can be issued to supervise the voice stream. The sensitive vocabulary may be abusive vocabulary, politically sensitive vocabulary, legally prohibited vocabulary, or any vocabulary that should not appear in the client, which is not limited in this exemplary embodiment.
For example, the sensitive vocabulary determination module determines the text information and the determination result indicates that sensitive information exists in it; warning information is then generated to warn the user who produced the segmented voice stream corresponding to the text information, thereby achieving supervision of the segmented voice stream.
After receiving the warning information, if the user believes that no sensitive vocabulary exists in the voice, the user can send challenge information. The relevant staff then play the voice and, in combination with the context of the voice, perform sensitive vocabulary determination again on the segmented voice stream generated by the user, so as to maintain or modify the original determination result.
In the exemplary embodiment, by performing sensitive vocabulary determination on the text information in combination with the context of the voice, the accuracy of supervising the segmented voice stream is improved.
In an alternative embodiment, fig. 7 is a schematic flow chart of implementing supervision of a segmented voice stream in a voice stream processing method. As shown in fig. 7, the method at least includes the following steps: in step S710, sensitive vocabulary recognition is performed on the text information, and if the text information includes sensitive vocabulary, the supervision result of the segmented voice stream is determined to be a first result.
That is, if the text information contains sensitive vocabulary, the supervision result of the segmented voice stream is determined to be the first result.
For example, fig. 8 shows the interface of the voice supervision background. In fig. 8, the upload time is the time at which the voice packet was acquired, the source refers to the client to which the segmented voice stream corresponding to the voice packet belongs, the role identifier indicates the game role played by the user corresponding to the user identifier of the voice packet, and the text refers to the text information obtained by performing text recognition on the voice packet.
In the first row of fig. 8, the word "neuropathy" is included in the text, so the first result is displayed in the cell of the first row corresponding to the supervision result.
In step S720, if the text message does not include the sensitive vocabulary, the supervision result of the segmented voice stream is determined as the second result.
If the text information does not contain sensitive vocabulary, the supervision result is the second result, that is, the text information meets the supervision requirements.
For example, if the text message does not include the sensitive words, the second result is displayed in the cell corresponding to the supervision result in fig. 8.
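Steps S710 and S720 reduce to a lookup of the recognized text against a sensitive vocabulary list. The sketch below uses simple substring matching against a tiny placeholder word list; a real deployment would use a maintained lexicon and more robust matching, which the patent does not specify.

```python
# Placeholder list for illustration only; "neuropathy" is the example
# word from the text above.
SENSITIVE_WORDS = {"neuropathy"}

def supervise_text(text, sensitive_words=SENSITIVE_WORDS):
    """Return 'first result' if the text information contains sensitive
    vocabulary (step S710), otherwise 'second result' (step S720)."""
    if any(word in text for word in sensitive_words):
        return "first result"
    return "second result"

r1 = supervise_text("you neuropathy")
r2 = supervise_text("good game everyone")
```

With the example inputs, `r1` is the first result (warning issued) and `r2` is the second result (supervision requirement met).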
In the exemplary embodiment, text recognition is performed on the voice packet to obtain text information, and sensitive vocabulary recognition is performed on the text information to obtain the first result or the second result, thereby providing a method of supervising the segmented voice stream and realizing supervision of the voice stream.
In an optional embodiment, after determining that the supervision result of the segmented voice stream is the first result, the method further includes: determining the user identifier corresponding to the voice packet, and determining, according to the user identifier, the target client where the user corresponding to the user identifier is located, so as to send warning information to the target client.
And when the supervision result is the first result, acquiring a user identifier corresponding to the voice packet, determining a target client corresponding to the user identifier, and sending warning information to the target client.
For example, the user identifier corresponding to voice packet A is 123XXXX, the user corresponding to this user identifier is determined to be "XX", and the target client where "XX" is located is client A; at this time, warning information is sent to target client A to warn "XX" against using the sensitive vocabulary.
In the exemplary embodiment, when the supervision result is the first result, the warning information is sent to the target client to warn the user generating the voice packet, so that the supervision efficiency is improved, and the user experience of other users in the target client is also improved.
In an alternative embodiment, fig. 9 shows a flow chart after sending the warning message to the target client in the voice stream processing method, as shown in fig. 9, the method at least includes the following steps: in step S910, if the challenge information for the first result is obtained, the generation time corresponding to the voice packet is obtained.
It should be noted that the first result is the determination that the voice packet includes sensitive vocabulary. Since errors may occur during sensitive vocabulary recognition, or the user may not accept the first result, challenge information for the first result sent by the user may be received. The voice corresponding to the voice packet then needs to be obtained so that contextual analysis can be performed on it to determine whether the first result is correct; and to determine the voice corresponding to the voice packet, the generation time of the voice packet must first be determined.
For example, after sensitive vocabulary recognition is performed on a segmented voice stream in a voice packet generated by user A in a virtual game client, the supervision result is the first result; the result is sent to the virtual game client and user A is prohibited from speaking. If user A considers the supervision result wrong, user A can send challenge information for the first result. Based on this, the generation time of voice packet A is obtained as 2:23.
In step S920, a segmented voice stream corresponding to the voice packet is determined according to the generation time, and the segmented voice stream is subjected to contextual analysis.
Since a segmented voice stream is usually obtained by division with minutes as the dividing unit, the segmented voice stream corresponding to the voice packet can be determined among multiple segmented voice streams according to the generation time of the voice packet, so that contextual analysis can be performed on that segmented voice stream and the correctness of the first result can be re-determined.
For example, the generation time of voice packet A is 2:23, and there are 3 voice streams having the same voice stream identifier as voice packet A: segmented voice stream 1, segmented voice stream 2, and segmented voice stream 3, whose generation times are 2:22, 2:25, and 2:30 respectively. According to the generation time, the segmented voice stream corresponding to voice packet A is therefore determined to be segmented voice stream 1, and contextual analysis is performed in segmented voice stream 1.
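The matching of step S920 can be sketched as picking, among the minute-granularity segments, the latest one whose generation time does not exceed the voice packet's generation time. The minute-window interpretation and the (name, start_time) layout are assumptions based on the example.

```python
from datetime import datetime

def locate_segment(packet_time, segments):
    """Given the voice packet generation time and segmented voice
    streams as (name, start_time), select the segment the packet falls
    into: the latest segment starting no later than the packet.
    Assumes minute-granularity segments, as in the example."""
    candidates = [(name, start) for name, start in segments
                  if start <= packet_time]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]

# Example: voice packet A generated at 2:23; segments start at 2:22,
# 2:25, and 2:30, so segmented voice stream 1 is selected.
t = lambda h, m: datetime(2021, 9, 1, h, m)
segments = [("stream1", t(2, 22)), ("stream2", t(2, 25)),
            ("stream3", t(2, 30))]
chosen = locate_segment(t(2, 23), segments)
```

`chosen` is segmented voice stream 1, the stream on which the contextual analysis of step S920 is then performed.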
In step S930, the first result is updated according to the result of the contextual analysis.
Specifically, if the result of the contextual analysis is opposite to the first result, the first result is updated; otherwise, if the result of the contextual analysis is the same as the first result, the first result is maintained.
For example, since the generation time of voice packet A is 2:23, the segmented voice stream corresponding to voice packet A can be determined to be segmented voice stream 1, which contains the voice of the user who generated voice packet A within that minute and may also contain the voices of other users. Contextual analysis is performed on segmented voice stream 1, and the obtained contextual analysis result is the second result; based on this, the first result is updated to the second result.
In the exemplary embodiment, in addition to sensitive vocabulary recognition on the voice packet, contextual analysis is performed on the segmented voice stream, and the supervision result is determined by combining the recognition result of the sensitive vocabulary recognition with the result of the contextual analysis, which improves the accuracy of the determined supervision result and avoids erroneous supervision results.
In the method and apparatus provided by the exemplary embodiments of the disclosure, on one hand, the processing of the voice stream does not depend on the client, which avoids having to acquire the voice stream by monitoring events in the client, reduces the complexity of acquiring the voice stream, and improves the efficiency of acquiring the voice stream; on the other hand, the situation in which the client must resend a segmented voice stream whenever it needs to be modified is avoided, which increases flexibility in modifying segmented voice streams and expands the application scenarios of voice stream supervision.
The following describes a speech stream processing method in the embodiment of the present disclosure in detail with reference to an application scenario.
Fig. 10 is a schematic flow chart of the voice stream processing method in an application scenario. As shown in fig. 10, the toolkit 1001 is the voice software development kit on the virtual game client side, and the node 1002 is a voice transmission node in a server cluster. Step S1010 is the process in which the node 1002 acquires the voice sent from the toolkit 1001. The server cluster 1003 is the target server cluster. Step S1020 is the process in which the node 1002 copies the segmented voice streams collected by different threads to a forwarding thread, which then distributes the different segmented voice streams across the server cluster 1003 in a load-balanced manner based on a connectionless data transmission protocol.
Step S1030 is the process of aggregating the segmented voice streams according to the voice stream identifiers to obtain voice packets. Step S1040 is the process of storing the voice packets, and step S1050 is the process of encoding the voice packets. Step S1060 is the process of determining the first target voice stream and the second target voice stream in the voice packets and deleting them from the voice packets. Step S1070 is the process of performing text recognition and sensitive vocabulary recognition on the voice packets to obtain supervision results. The server 1004 is a downstream server, and step S1080 is the process of transmitting the voice packets to the downstream server to serve as a data source there.
In this application scenario, on one hand, the processing of the voice stream does not depend on the client, which avoids having to acquire the voice stream by monitoring events in the client, reduces the complexity of acquiring the voice stream, and improves the efficiency of acquiring the voice stream; on the other hand, the situation in which the client must resend a segmented voice stream whenever it needs to be modified is avoided, which increases flexibility in modifying segmented voice streams and expands the application scenarios of voice stream supervision.
Furthermore, in an exemplary embodiment of the present disclosure, a voice stream processing apparatus is also provided. Fig. 11 shows a schematic structural diagram of a voice stream processing apparatus, and as shown in fig. 11, the voice stream processing apparatus 1100 may include: an acquisition module 1110 and a login module 1120. Wherein:
an obtaining module 1110, configured to obtain the segmented voice stream sent by the client and determine the voice stream identifier corresponding to the segmented voice stream; and a login module 1120, configured to aggregate segmented voice streams with the same voice stream identifier to obtain a voice packet and implement supervision of the segmented voice streams based on the voice packet.
The details of the voice stream processing apparatus 1100 are described in detail in the corresponding voice stream processing method, and therefore are not described herein again.
It should be noted that although several modules or units of the voice stream processing apparatus 1100 are mentioned in the above detailed description, such division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
An electronic device 1200 according to such an embodiment of the invention is described below with reference to fig. 12. The electronic device 1200 shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 12, the electronic device 1200 is embodied in the form of a general purpose computing device. The components of the electronic device 1200 may include, but are not limited to: the at least one processing unit 1210, the at least one memory unit 1220, the bus 1230 connecting the various system components (including the memory unit 1220 and the processing unit 1210), and the display unit 1240.
Wherein the memory unit stores program code that is executable by the processing unit 1210 to cause the processing unit 1210 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification.
The storage unit 1220 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 1221 and/or a cache storage unit 1222, and may further include a read-only memory unit (ROM) 1223.
Storage unit 1220 may also include a program/utility 1224 having a set (at least one) of program modules 1225, such program modules 1225 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of these examples or some combination thereof possibly including an implementation of a network environment.
Bus 1230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1200 may also communicate with one or more external devices 1270 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1200 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 1250. Also, the electronic device 1200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 1260. As shown, the network adapter 1260 communicates with the other modules of the electronic device 1200 via the bus 1230. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of this specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to carry out the steps according to the various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of this specification.
Referring to fig. 13, a program product 1300 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A method for processing a voice stream, the method comprising:
acquiring a segmented voice stream sent by a client, and determining a voice stream identifier corresponding to the segmented voice stream;
and aggregating the segmented voice streams with the same voice stream identifier to obtain voice packets, and supervising the segmented voice streams based on the voice packets.
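The aggregation step of claim 1 — grouping segmented streams that share a voice stream identifier into a voice packet — could be sketched as follows. The tuple layout (`stream_id`, `seq_no`, `payload`) and the byte-level reassembly are illustrative assumptions, not taken from the claim:

```python
from collections import defaultdict

def aggregate_segments(segments):
    """Group segmented voice-stream chunks by their stream identifier
    and reassemble each group in sequence order into one voice packet."""
    packets = defaultdict(list)
    for stream_id, seq_no, payload in segments:
        packets[stream_id].append((seq_no, payload))
    # Sort each group's chunks by sequence number before concatenating.
    return {
        sid: b"".join(payload for _, payload in sorted(chunks))
        for sid, chunks in packets.items()
    }
```

Segments arriving out of order are still joined correctly because each packet is sorted by sequence number before concatenation.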
2. The voice stream processing method according to claim 1, wherein before the acquiring of the segmented voice stream sent by the client, the method further comprises:
accessing a client to a voice transmission system to establish data communication between the client and the voice transmission system;
recording the voice in the client by utilizing a voice transmission node in the voice transmission system based on the data communication between the client and the voice transmission system to obtain a segmented voice stream;
and acquiring a target forwarding address, and sending the segmented voice stream based on the target forwarding address, so that a target server corresponding to the target forwarding address receives the segmented voice stream sent by the client.
3. The voice stream processing method according to claim 2, wherein the sending the segmented voice stream based on the target forwarding address comprises:
and transmitting the segmented voice stream by using a connectionless transmission protocol based on the target forwarding address.
4. The voice stream processing method according to claim 1, wherein the segmented voice stream comprises a voice stream of a single user;
the aggregating the segmented voice streams with the same voice stream identifier to obtain a voice packet includes:
if the segmented voice stream is the voice stream of a single user, dividing the segmented voice stream according to the voice stream identifier to obtain a first division result;
acquiring a user identifier corresponding to the segmented voice stream, and determining a hash value corresponding to the user identifier;
and in the first division result, according to the hash value, aggregating the segmented voice flows with the same user identification to obtain a voice packet.
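Claim 4 determines a hash value from the user identifier and aggregates segments with the same identifier by that hash. A minimal sketch of such stable bucketing follows; the MD5 digest and the bucket count are illustrative assumptions, since the claim does not name a hash function:

```python
import hashlib

def user_bucket(user_id: str, num_buckets: int = 16) -> int:
    """Map a user identifier to a stable hash bucket, so that all
    segments from the same user land in the same aggregation slot."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

Because the bucket depends only on the identifier, segments of one user always hash to the same slot, which is what makes per-user aggregation across many arriving segments cheap.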
5. The voice stream processing method according to claim 4, wherein the segmented voice stream is a mixed voice stream of a plurality of the users;
the aggregating the segmented voice streams with the same voice stream identifier to obtain a voice packet includes:
if the segmented voice stream is the mixed voice stream, dividing the mixed voice stream according to the voice stream identifier to obtain a second division result;
acquiring a channel identifier corresponding to the second division result, and aggregating the mixed voice streams with the same channel identifier in the second division result to obtain an aggregated result;
determining user identifications of a plurality of users in the mixed voice stream in the aggregation result, and determining hash values corresponding to the user identifications;
and in the aggregation result, according to the hash value, extracting the user segmented voice stream with the same user identifier in the mixed voice stream so as to aggregate the user segmented voice stream to obtain a voice packet.
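The two-level grouping of claim 5 — first by channel identifier, then extracting each user's segments by user identifier — could be sketched like this. The segment fields (`channel_id`, `user_id`, `payload`) are assumed names for illustration only:

```python
from collections import defaultdict

def split_mixed_stream(mixed_segments):
    """Group mixed-stream segments by channel identifier, then pull out
    each user's segments within a channel by user identifier."""
    by_channel = defaultdict(lambda: defaultdict(list))
    for seg in mixed_segments:
        by_channel[seg["channel_id"]][seg["user_id"]].append(seg["payload"])
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {ch: dict(users) for ch, users in by_channel.items()}
```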
6. The voice stream processing method according to claim 5, wherein the voice packet comprises at least one user segmented voice stream;
the supervising of the segmented voice stream based on the voice packet comprises:
acquiring the user segmented voice stream in the voice packet, and acquiring a volume value and a sound continuity value corresponding to the user segmented voice stream, wherein the sound continuity value is used to characterize the continuity of the sound;
determining, in the user segmented voice stream, a first target voice stream whose volume value is smaller than a volume threshold, and determining, in the user segmented voice stream, a second target voice stream whose sound continuity value is smaller than a continuity threshold;
deleting the first target voice stream and the second target voice stream from the voice packet to obtain a deleted voice packet;
and supervising the segmented voice stream based on the deleted voice packet.
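The pruning step of claim 6 — dropping streams whose volume or continuity value falls below a threshold before supervision — could look like the following. The dict fields and the threshold values are illustrative assumptions:

```python
def filter_segments(segments, volume_threshold=0.1, continuity_threshold=0.5):
    """Keep only user-segmented streams whose volume value and sound
    continuity value both meet their thresholds; the rest are deleted
    from the voice packet before supervision."""
    return [
        seg for seg in segments
        if seg["volume"] >= volume_threshold
        and seg["continuity"] >= continuity_threshold
    ]
```

Discarding near-silent or heavily broken streams up front reduces the volume of audio that the downstream recognition and supervision steps must process.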
7. The voice stream processing method according to claim 1 or 6, wherein the supervising of the segmented voice stream comprises:
identifying the segmented voice stream to obtain text information corresponding to the segmented voice stream;
and transcoding the segmented voice stream to obtain voice corresponding to the segmented voice stream, and performing sensitive-word judgment on the text information in combination with the context of the voice, so as to supervise the segmented voice stream.
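The sensitive-word judgment of claim 7 could, in its simplest form, be matching a word list against the recognized text; a production system would additionally weigh the surrounding context, as the claim describes. The word list and the plain substring matching here are illustrative assumptions:

```python
def flag_sensitive(text: str, sensitive_words: set) -> list:
    """Return the sensitive words found in the transcribed text.
    This sketch does only substring matching; context-aware scoring
    would be layered on top of these raw hits."""
    return [word for word in sensitive_words if word in text]
```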
8. A voice stream processing apparatus, comprising:
the acquisition module is configured to acquire a segmented voice stream sent by a client and determine a voice stream identifier corresponding to the segmented voice stream;
and the aggregation module is configured to aggregate the segmented voice streams with the same voice stream identifier to obtain voice packets, and to supervise the segmented voice streams based on the voice packets.
9. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the voice stream processing method of any one of claims 1-7 via execution of the executable instructions.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the voice stream processing method according to any one of claims 1 to 7.
CN202111021151.1A 2021-09-01 2021-09-01 Voice stream processing method and device, computer readable storage medium and electronic equipment Pending CN113724735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111021151.1A CN113724735A (en) 2021-09-01 2021-09-01 Voice stream processing method and device, computer readable storage medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN113724735A true CN113724735A (en) 2021-11-30

Family

ID=78680537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111021151.1A Pending CN113724735A (en) 2021-09-01 2021-09-01 Voice stream processing method and device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113724735A (en)


Citations (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9925878D0 (en) * 1999-11-01 1999-12-29 Mitel Corp Marker packet system and method for measuring audio network delays
CN101150448A (en) * 2006-09-22 2008-03-26 中兴通讯股份有限公司 Method and system for legal monitoring for public service of IP multimedia subsystem
CN101364901A (en) * 2007-08-10 2009-02-11 华为技术有限公司 Legal listening method, legal listening system, packet data gateway, listening center
CN102340426A (en) * 2010-07-26 2012-02-01 中国移动通信集团重庆有限公司 Method and device for evaluating voice quality of voice over Internet protocol (VOIP)
CN102682090A (en) * 2012-04-26 2012-09-19 焦点科技股份有限公司 System and method for matching and processing sensitive words on basis of polymerized word tree
CN102687196A (en) * 2009-10-08 2012-09-19 西班牙电信公司 Method for the detection of speech segments
CN103179122A (en) * 2013-03-22 2013-06-26 马博 Telcom phone phishing-resistant method and system based on discrimination and identification content analysis
US20140032219A1 (en) * 2012-07-27 2014-01-30 Solomon Z. Lerner Parsimonious Protection of Sensitive Data in Enterprise Dialog Systems
CN104078044A (en) * 2014-07-02 2014-10-01 深圳市中兴移动通信有限公司 Mobile terminal and sound recording search method and device of mobile terminal
CN105006230A (en) * 2015-06-10 2015-10-28 合肥工业大学 Voice sensitive information detecting and filtering method based on unspecified people
CN105391708A (en) * 2015-11-02 2016-03-09 北京锐安科技有限公司 Audio data detection method and device
CN106024014A (en) * 2016-05-24 2016-10-12 努比亚技术有限公司 Voice conversion method and device and mobile terminal
US9544438B1 (en) * 2015-06-18 2017-01-10 Noble Systems Corporation Compliance management of recorded audio using speech analytics
CN106603381A (en) * 2016-11-24 2017-04-26 北京小米移动软件有限公司 Chat information processing method and device
EP3176779A1 (en) * 2015-12-02 2017-06-07 Tata Consultancy Services Limited Systems and methods for sensitive audio zone rearrangement
US20180268824A1 (en) * 2015-09-01 2018-09-20 Samsung Electronics Co., Ltd. Communication method, and electronic device therefor
CN109065023A (en) * 2018-08-23 2018-12-21 广州势必可赢网络科技有限公司 A kind of voice identification method, device, equipment and computer readable storage medium
CN109740053A (en) * 2018-12-26 2019-05-10 广州灵聚信息科技有限公司 Sensitive word screen method and device based on NLP technology
CN109935241A (en) * 2017-12-18 2019-06-25 上海智臻智能网络科技股份有限公司 Voice information processing method
CN110060681A (en) * 2019-04-26 2019-07-26 广东昇辉电子控股有限公司 The control method of intelligent gateway with intelligent sound identification function
CN110085213A (en) * 2019-04-30 2019-08-02 广州虎牙信息科技有限公司 Abnormality monitoring method, device, equipment and the storage medium of audio
CN110837615A (en) * 2019-11-05 2020-02-25 福建省趋普物联科技有限公司 Artificial intelligent checking system for advertisement content information filtering
CN111031329A (en) * 2018-10-10 2020-04-17 北京默契破冰科技有限公司 Method, apparatus and computer storage medium for managing audio data
CN111539744A (en) * 2019-01-21 2020-08-14 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
US20200312327A1 (en) * 2019-03-29 2020-10-01 Shanghai Bilibili Technology Co., Ltd. Method and system for processing comment information
CN111814872A (en) * 2020-07-07 2020-10-23 长沙理工大学 Power equipment environmental noise identification method based on time domain and frequency domain self-similarity
US10867612B1 (en) * 2017-11-13 2020-12-15 United Services Automobile Association (Usaa) Passive authentication through voice data analysis
CN114945099A (en) * 2022-05-18 2022-08-26 广州博冠信息科技有限公司 Voice monitoring method and device, electronic equipment and computer readable medium
CN115767111A (en) * 2022-08-11 2023-03-07 广州博冠信息科技有限公司 Live stream processing method and device, electronic equipment and storage medium


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DEMETRIS PASCHALIDES: "MANDOLA: A Big-Data Processing and Visualization Platform for Monitoring and Detecting Online Hate Speech", ACM Transactions on Internet Technology *
LI Wei et al.: "A Voice Sensitive Information Detection System for Massive Data", Journal of Information Engineering University *
LI Feng: "Research on Monitoring Technology for Internet Public-Opinion Audio-Visual Programs", Digital Communication World *
Baizhan.net: "How to lift the voice-chat ban in Game for Peace", Retrieved from the Internet <URL:https://www.baizhan.net/a/1621221335646090.html> *
XIE Jun: "A Method for Identifying Sensitive Information in Broadcast Speech", Shitingjie (Radio and Television Technology) *
Yunying Xiaoxinxin: "Duozhi Business School: How to cancel muting in a Pinduoduo live-stream room, and by what means?", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/371974314> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363280A (en) * 2022-03-18 2022-04-15 深圳市欧乐智能实业有限公司 Mobile phone chat auxiliary system based on multi-section voice summary type transmission
CN114363280B (en) * 2022-03-18 2022-06-17 深圳市欧乐智能实业有限公司 Mobile phone chat auxiliary system based on multi-section voice summary type transmission

Similar Documents

Publication Publication Date Title
US9686329B2 (en) Method and apparatus for displaying webcast rooms
US10073823B2 (en) Generating a form response interface in an online application
WO2014183427A1 (en) Method and apparatus for displaying webcast rooms
WO2022020092A1 (en) Content adaptive data center routing and forwarding in cloud computing environments
CN113825031A (en) Live content generation method and device
CN106105242A (en) For testing the method for high in the clouds streaming media server and device thereof and system
CN111432228B (en) Method, device and equipment for matching anchor and storage medium
US20240028565A1 (en) Information sharing method, apparatus, electronic device, and storage medium
US20230285854A1 (en) Live video-based interaction method and apparatus, device and storage medium
CN111163330A (en) Live video rendering method, device, system, equipment and storage medium
CN109446204B (en) Data storage method and device for instant messaging, electronic equipment and medium
US20230289384A1 (en) Auto-populating image metadata
CN108337535B (en) Client video forwarding method, device, equipment and storage medium
US20240119074A1 (en) Recognizing polling questions from a conference call discussion
CN113724735A (en) Voice stream processing method and device, computer readable storage medium and electronic equipment
CN108696713B (en) Code stream safety test method, device and test equipment
US20210227005A1 (en) Multi-user instant messaging method, system, apparatus, and electronic device
CN107920019A (en) Pay close attention to method and device, computer equipment and the storage medium of good friend
CN110351577B (en) Live broadcast information processing method and device, storage medium and electronic equipment
CN110109597B (en) Singing list switching method, device, system, terminal and storage medium
CN113852835A (en) Live broadcast audio processing method and device, electronic equipment and storage medium
US20180239759A1 (en) Segmenting a Set of Media Data Using a Set of Social Networking Data
CN112433697B (en) Resource display method and device, electronic equipment and storage medium
CN114679602A (en) Data processing method and device, electronic equipment and storage medium
CN114566173A (en) Audio mixing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination