CN113393842A - Voice data processing method, device, equipment and medium - Google Patents

Voice data processing method, device, equipment and medium

Info

Publication number
CN113393842A
CN113393842A
Authority
CN
China
Prior art keywords
voice
queue
identifier
target
voice message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011295049.6A
Other languages
Chinese (zh)
Inventor
黄铁鸣
李娜芬
顾华阳
林莉
李斌
梁百怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011295049.6A priority Critical patent/CN113393842A/en
Publication of CN113393842A publication Critical patent/CN113393842A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail, characterised by the inclusion of specific contents
    • H04L51/10 Multimedia information

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the present application provides a voice data processing method, apparatus, device, and medium. The method relates to the field of artificial intelligence and comprises the following steps: when an application client obtains a voice message on a session interface, obtaining a voice identifier corresponding to the voice message, adding the voice identifier to an initial identifier queue, and taking the initial identifier queue to which the voice identifier has been added as a target identifier queue; generating, based on the queue position of the voice identifier in the target identifier queue, a voice conversion request carrying the voice identifier, and sending the voice conversion request to a server so that the server can obtain converted text information corresponding to the voice identifier; and receiving the converted text information returned by the server and outputting it to the location area of the voice message in the session interface, wherein the voice message and the converted text information in that location area are associated with each other. With this method and apparatus, converted text information can be delivered proactively, and the conversion efficiency of voice messages can be improved.

Description

Voice data processing method, device, equipment and medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a medium for processing voice data.
Background
When an existing application client with an instant messaging function (e.g., a social client) converts a voice message into text information, a target user (e.g., user A) must manually operate on the voice message (e.g., voice message 1) to initiate a voice conversion request and thereby obtain the text information (e.g., text information 1) corresponding to voice message 1. In other words, existing speech-to-text schemes are passive conversion schemes, and it is difficult for them to deliver text information proactively.
In addition, the existing speech-to-text scheme of a social client depends on the client's local data: to convert the voice message 1 received by user A, the social client must upload the local voice data of voice message 1 to the server when initiating the voice conversion request. Clearly, under an unstable network environment, the upload is slow and the text information cannot be provided to user A promptly; the upload may even fail outright, which reduces the conversion efficiency of voice messages.
Disclosure of Invention
Embodiments of the present application provide a method, an apparatus, a device, and a medium for processing voice data, which can deliver converted text information proactively and can improve the conversion efficiency of voice messages.
An embodiment of the present application provides a method for processing voice data, including:
when an application client obtains a voice message on a session interface, obtaining a voice identifier corresponding to the voice message, adding the voice identifier to an initial identifier queue, and taking the initial identifier queue to which the voice identifier has been added as a target identifier queue;
generating, based on the queue position of the voice identifier in the target identifier queue, a voice conversion request carrying the voice identifier, and sending the voice conversion request to a server so that the server can obtain, based on the voice conversion request, converted text information corresponding to the voice identifier;
receiving the converted text information returned by the server, and outputting the converted text information to the location area of the voice message in the session interface, wherein the voice message and the converted text information in the location area are associated with each other.
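The three steps of the method above can be sketched as follows. This is a minimal illustration under stated assumptions: all class and method names are hypothetical, the "server" is a stand-in stub, and only the identifier (never the audio) crosses the client-server boundary, as the method requires.

```python
from collections import deque

class FakeServer:
    """Hypothetical stand-in for the server, which already holds the audio
    of every forwarded voice message keyed by its voice identifier."""
    def __init__(self, store):
        self.store = store  # voice_id -> voice data

    def convert(self, voice_id):
        # The server queries the audio by identifier and converts it;
        # a real implementation would run speech recognition here.
        return "[text of %s]" % self.store[voice_id]

class VoiceClient:
    """Minimal sketch of the claimed client-side flow."""
    def __init__(self, server):
        self.server = server
        self.id_queue = deque()      # the "target identifier queue"
        self.session_interface = {}  # voice_id -> message location area

    def on_voice_message(self, voice_id, voice_msg):
        # Step 1: obtain the identifier and add it to the queue.
        self.id_queue.append(voice_id)
        self.session_interface[voice_id] = {"voice": voice_msg, "text": None}

    def flush(self):
        # Step 2: send one conversion request per queued identifier;
        # only the identifier is uploaded, never the audio itself.
        while self.id_queue:
            voice_id = self.id_queue.popleft()
            text = self.server.convert(voice_id)
            # Step 3: output the text into the message's location area.
            self.session_interface[voice_id]["text"] = text
```

Note that no user trigger is involved anywhere: conversion is driven entirely by the arrival of the identifier in the queue, which is what makes the delivery proactive.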
An aspect of an embodiment of the present application provides a voice data processing apparatus, including:
the voice acquisition module is used for acquiring a voice identifier corresponding to the voice message when the application client side acquires the voice message of the session interface, adding the voice identifier to the initial identifier queue, and taking the initial identifier queue with the voice identifier as a target identifier queue;
the request sending module is used for generating a voice conversion request carrying the voice identification based on the queue position of the voice identification in the target identification queue and sending the voice conversion request to the server so that the server can obtain conversion text information corresponding to the voice identification based on the voice conversion request;
the text receiving module is used for receiving the converted text information returned by the server and outputting the converted text information to the position area where the voice message is located in the conversation interface; there is an association between the voice message and the converted text information in the location area.
Wherein the session interface comprises a second user associated with the first user; the initial identifier queue comprises a first sub-queue and a second sub-queue; the first sub-queue is used for storing a first voice identifier, which identifies a first voice message in the application client for which a voice conversion request is yet to be sent; the second sub-queue is used for storing a second voice identifier, which identifies a second voice message in the application client for which a voice conversion request has already been sent;
the voice acquisition module comprises:
the voice receiving unit is used for receiving the voice message forwarded by the second user through the server by the application client corresponding to the first user and receiving the voice identifier configured for the voice message by the server;
the timestamp determining unit is used for acquiring a voice conversion condition associated with the session interface, determining the received voice identifier as a target voice identifier based on the voice conversion condition, taking the voice message received by the application client as a target voice message, and marking a receiving timestamp corresponding to the target voice message as a target receiving timestamp;
the identification adding unit is used for determining the queue position of the target voice identification of the target voice message in a first sub-queue containing the identification of the first voice message based on the target receiving timestamp, and adding the target voice identification to the first sub-queue based on the queue position to obtain an initial first sub-queue;
a queue determining unit, configured to determine a target identification queue based on the initial first sub-queue and a second sub-queue containing an identification of the second voice message.
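The two-sub-queue layout and timestamp-based positioning described by these units can be sketched as follows. This is an illustrative assumption about the data structure, not the patent's actual implementation: `pending` plays the role of the first sub-queue (requests not yet sent, ordered by receive timestamp) and `sent` the role of the second sub-queue (requests already sent).

```python
import bisect

class TargetIdQueue:
    """Hypothetical sketch of the target identifier queue built from
    two sub-queues, as described in the claims above."""

    def __init__(self):
        self.pending = []  # first sub-queue: (recv_ts, voice_id) pairs
        self.sent = []     # second sub-queue: voice_id

    def add_pending(self, voice_id, recv_ts):
        # The queue position of the target voice identifier is
        # determined by the target receive timestamp.
        bisect.insort(self.pending, (recv_ts, voice_id))

    def mark_sent(self, voice_id):
        # Move an identifier from the first sub-queue to the second
        # once its conversion request has been sent.
        self.pending = [p for p in self.pending if p[1] != voice_id]
        self.sent.append(voice_id)
```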
The request priority of the second sub-queue is greater than that of the first sub-queue;
the voice acquisition module further comprises:
the first triggering unit is used for responding to triggering operation aiming at a conversation interface where a second user is located, outputting a target voice message to the conversation interface and acquiring an initial grade adjusting instruction in a voice conversion condition;
the first adjusting unit is used for determining the queue position of the target voice identifier as a first position in the initial first sub-queue based on the initial grade adjusting instruction, and adjusting the queue position of the target voice identifier from the first position to a second position in the initial first sub-queue to obtain an adjusted initial first sub-queue; the request priority of the identifier corresponding to the second position is greater than that of the identifier corresponding to the first position;
and the first updating unit is used for updating the target identification queue based on the adjusted initial first sub-queue and the adjusted second sub-queue.
Wherein, the voice acquisition module further comprises:
the second trigger unit is used for responding to the trigger operation aiming at the target voice message in the session interface and acquiring a target grade adjusting instruction in the voice conversion condition;
the second adjusting unit is used for determining the adjusted initial first sub-queue as a target first sub-queue based on the target grade adjusting instruction, and adjusting the queue position of the target voice identifier from the second position to the third position in the target first sub-queue to obtain an adjusted target first sub-queue; the request priority of the identifier corresponding to the third position is greater than that of the identifier corresponding to the second position;
and the second updating unit is used for updating the target identifier queue based on the adjusted target first sub-queue and the second sub-queue.
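A minimal sketch of this grade-adjustment step, under the assumption that a sub-queue is a list whose head has the highest request priority; moving an identifier one position toward the head (first position to second, second to third in the claims' numbering) raises its priority. The fixed step count is an illustrative simplification of the adjustment instruction.

```python
def promote(sub_queue, voice_id, steps=1):
    """Move `voice_id` `steps` positions toward the head of
    `sub_queue`, raising its request priority (hypothetical sketch
    of the grade-adjustment instruction)."""
    i = sub_queue.index(voice_id)
    j = max(0, i - steps)          # clamp at the head of the queue
    sub_queue.insert(j, sub_queue.pop(i))
    return sub_queue
```

In the claims' terms, the first call (triggered by opening the session interface) moves the target identifier from its initial position to a higher-priority one, and the second call (triggered by tapping the voice message itself) promotes it again.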
The target identifier queue comprises an identifier queue to be requested and a requested identifier queue; the voice identifier is located in the identifier queue to be requested; the requested identifier queue comprises M queue positions, each used for storing the identifier of one voice message to be converted; M is the total number of identifiers of voice messages to be converted for which a voice conversion request has been sent;
the request sending module comprises:
the information receiving unit is used for receiving conversion success information returned by the server for the M voice messages to be converted for which voice conversion requests have been sent, and recording the number of pieces of conversion success information received as N; N is a positive integer less than or equal to M;
the position determining unit is used for acquiring the queue position of the voice identifier in the identifier queue to be requested of the target identifier queue, and determining the target queue position of the voice identifier in the requested identifier queue when the queue position of the voice identifier meets a voice conversion condition;
and the request generating unit is used for adding the voice identifier to the requested identifier queue based on the target queue position, generating a voice conversion request carrying the voice identifier based on the requested identifier queue added with the voice identifier, and sending the voice conversion request to the server.
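The gating behaviour these units describe amounts to a bounded in-flight window: the requested identifier queue holds at most M identifiers, and an identifier waiting in the to-be-requested queue is promoted only when a conversion-success acknowledgement frees a slot. A sketch under that reading (class and method names are illustrative):

```python
from collections import deque

class RequestWindow:
    """Hypothetical sketch: at most `m` conversion requests in flight;
    success acknowledgements free slots for waiting identifiers."""

    def __init__(self, m):
        self.m = m
        self.to_request = deque()  # identifier queue to be requested
        self.requested = []        # requested identifier queue (<= m slots)

    def enqueue(self, voice_id):
        self.to_request.append(voice_id)
        self._fill()

    def on_success(self, voice_id):
        # Conversion success information deletes the identifier from
        # the requested queue and frees a queue position.
        self.requested.remove(voice_id)
        self._fill()

    def _fill(self):
        # Promote waiting identifiers while free slots remain.
        while self.to_request and len(self.requested) < self.m:
            self.requested.append(self.to_request.popleft())
```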
Wherein the apparatus further includes:
and the identification deleting module is used for acquiring target conversion success information aiming at the voice message when receiving the conversion text information returned by the server and deleting the voice identification from the target identification queue based on the target conversion success information.
An embodiment of the present application provides a method for processing voice data, including:
when the voice message of the application client is acquired, generating a voice identifier corresponding to the voice message, and sending the voice message and the voice identifier to a user terminal so that the user terminal adds the voice identifier to an initial identifier queue, and taking the initial identifier queue added with the voice identifier as a target identifier queue;
receiving a voice conversion request sent by a user terminal, and acquiring a voice identifier from the voice conversion request; the voice conversion request is generated based on the queue position of the voice identifier in the target identifier queue;
when the voice message corresponding to the voice identification is inquired, converting the voice message to obtain converted text information corresponding to the voice message;
and returning the converted text information to the user terminal so that the user terminal outputs the converted text information to the position area where the voice message is located in the session interface of the application client.
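The server-side steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `transcribe` is a hypothetical stand-in for the real speech-recognition backend, and the identifier format is likewise an assumption, not something the patent specifies.

```python
import uuid

class VoiceServer:
    """Hypothetical sketch of the claimed server-side flow."""

    def __init__(self, transcribe):
        self.store = {}              # voice_id -> voice message
        self.transcribe = transcribe # assumed speech-recognition backend

    def receive_voice(self, voice_msg):
        # Generate an identifier and keep the audio server-side; the
        # identifier is forwarded to the user terminal with the message.
        voice_id = uuid.uuid4().hex
        self.store[voice_id] = voice_msg
        return voice_id

    def handle_conversion_request(self, voice_id):
        # The request carries only the identifier: query the stored
        # voice message and convert it if found.
        voice_msg = self.store.get(voice_id)
        if voice_msg is None:
            return None
        return self.transcribe(voice_msg)
```

Because the audio already lives on the server, the conversion request needs to carry only the identifier, which is what makes the scheme robust to slow or failed uploads.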
An aspect of an embodiment of the present application provides a voice data processing apparatus, including:
the voice sending module is used for generating a voice identifier corresponding to the voice message when the voice message of the application client is obtained, and sending the voice message and the voice identifier to the user terminal so that the user terminal adds the voice identifier to the initial identifier queue and takes the initial identifier queue added with the voice identifier as a target identifier queue;
the request receiving module is used for receiving a voice conversion request sent by a user terminal and acquiring a voice identifier from the voice conversion request; the voice conversion request is generated based on the queue position of the voice identifier in the target identifier queue;
the text acquisition module is used for converting the voice message when the voice message corresponding to the voice identifier is inquired to obtain converted text information corresponding to the voice message;
and the text sending module is used for returning the converted text information to the user terminal so that the user terminal outputs the converted text information to the position area where the voice message is located in the session interface of the application client.
An aspect of the embodiments of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the method in the aspect of the embodiments of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions that, when executed by a processor, perform the steps of the method as in an aspect of the embodiments of the present application.
An aspect of an embodiment of the present application provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the method provided in the various alternatives of the above aspect.
In the embodiment of the application, when the application client acquires the voice message of the session interface, the user terminal may acquire the voice identifier corresponding to the voice message, add the voice identifier to the initial identifier queue, and use the initial identifier queue with the voice identifier added as the target identifier queue. The voice message of the session interface may be a voice message sent by the application client, or a voice message received by the application client. Further, the user terminal may generate a voice conversion request carrying the voice identifier based on the queue position of the voice identifier in the target identifier queue, and send the voice conversion request to the server, so that the server queries a voice message corresponding to the voice identifier based on the voice conversion request, and performs conversion processing on the voice message to obtain converted text information. Further, the user terminal may receive the converted text information returned by the server, and output the converted text information to the location area where the voice message is located in the session interface. The voice message and the converted text information in the location area have an association relationship therebetween, for example, the voice message and the converted text information may have an adjacent location relationship in the conversation interface. 
It should be understood that, by introducing the target identifier queue, once the voice message and its corresponding voice identifier are obtained, no trigger operation is required from the first user of the user terminal: the user terminal can output the converted text information corresponding to the voice message directly in the session interface of the application client. The voice message is thus converted automatically, and converted text information is delivered proactively. Moreover, when conversion is driven by the voice identifier in the target identifier queue, the user terminal does not need to upload the locally stored voice message to the server; the server queries the voice message corresponding to the uploaded identifier and converts the queried message itself. This avoids problems such as voice message upload failures under an unstable network environment, and effectively improves the conversion efficiency of voice messages.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
fig. 2 is a schematic view of a scenario for performing data interaction according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a voice data processing method according to an embodiment of the present application;
fig. 4 is a schematic view of a scenario for adding a voice identifier according to an embodiment of the present application;
fig. 5 is a schematic view of a scenario in which a user opens a session according to an embodiment of the present application;
fig. 6 is a schematic view of a scenario for a user to make a selection according to an embodiment of the present application;
fig. 7 is a schematic view of a scenario in which a converted text message is received according to an embodiment of the present application;
fig. 8 is a schematic flowchart of a voice data processing method according to an embodiment of the present application;
fig. 9 is a schematic view of a scenario for forwarding a voice message according to an embodiment of the present application;
fig. 10 is a schematic flow chart of a speech-to-text scheme according to an embodiment of the present application;
fig. 11 is a schematic view of a scene for converting speech into text according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a speech data processing apparatus according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a speech data processing apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a computer device according to an embodiment of the present application;
fig. 16 is a speech data processing system according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Specifically, please refer to fig. 1, which is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 3000 and a user terminal cluster. The user terminal cluster may include one or more user terminals; the number of user terminals in the cluster is not limited here. As shown in fig. 1, the user terminals may specifically include a user terminal 3000a, a user terminal 3000b, a user terminal 3000c, …, and a user terminal 3000n. The user terminals 3000a, 3000b, 3000c, …, and 3000n may each be directly or indirectly connected to the server 3000 through wired or wireless communication, so that each user terminal can interact with the server 3000 over the network connection.
The server 3000 shown in fig. 1 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform.
Each user terminal in the user terminal cluster shown in fig. 1 may be an intelligent terminal with a voice data processing function, such as a smartphone, a tablet computer, or a notebook computer. It should be understood that each user terminal in the cluster may have an application client installed; when the application client runs on a user terminal, it may interact with the server 3000 shown in fig. 1 based on a Client/Server (C/S) architecture. The application client may be understood as an instant messaging client capable of loading and displaying voice data, and may specifically include: social clients (e.g., the WeChat client), office clients (e.g., the enterprise WeChat client), entertainment clients (e.g., gaming clients), in-vehicle clients, and the like.
The voice data processing method provided in the embodiments of the present application may relate to the speech technology direction within the field of artificial intelligence. Artificial Intelligence (AI) is the science and technology of using a digital computer, or a machine controlled by a digital computer (e.g., the server 3000 of fig. 1), to simulate, extend, and expand human intelligence. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The key technologies of Speech Technology include Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes.
It can be understood that the voice data processing method provided in the embodiments of the present application may also relate to the field of cloud technology. Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to implement the computation, storage, processing, and sharing of data. It is a general term for the network, information, integration, management platform, and application technologies applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, picture websites, and other web portals, require large amounts of computing and storage resources. With the rapid development of the internet industry, each item may carry its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong system background support, which can be realized through cloud computing.
For convenience of understanding, in the embodiment of the present application, one user terminal may be arbitrarily selected from the user terminal cluster as the first terminal, for example, to describe a specific process of performing voice data processing in the first terminal, for example, the user terminal 3000c in the user terminal cluster shown in fig. 1 may be used as the first terminal in the embodiment of the present application. It should be understood that, in the embodiment of the present application, a user who logs in the application client through the first account information (for example, account information 1) in the first terminal may be referred to as a first user, that is, the first user may be a user using the first terminal. It is understood that the first user in the embodiment of the present application may be a user that receives a voice message through an application client in the first terminal, i.e., a message recipient.
It can be understood that, in the embodiment of the present application, a user who logs in the application client through second account information (for example, account information 2) may be referred to as a second user, and a user terminal corresponding to the second user may be referred to as a second terminal, that is, the second user may be a user using the second terminal. In the embodiment of the present application, one user terminal may be arbitrarily selected as the second terminal in the user terminal cluster, for example, the user terminal 3000a in the user terminal cluster shown in fig. 1 may be used as the second terminal in the embodiment of the present application. It is understood that the second user in the embodiment of the present application may be a user who sends a voice message through an application client in the second terminal, i.e., a message sender.
It should be understood that the first user in the embodiment of the present application may serve as both the message receiver and the message sender, for example, the first user may serve as the message receiver through an application client in the first terminal, and the first user may also serve as the message sender through an application client in the first terminal. Similarly, the second user in the embodiment of the present application may serve as both the message sender and the message receiver, for example, the second user may serve as the message sender through an application client in the second terminal, and the second user may also serve as the message receiver through the application client in the second terminal.
It is understood that the message sender and the message receiver may be connected through a server (e.g., the server 3000 described above), the server synchronizes the voice message from the user terminal corresponding to the message sender (e.g., the second terminal corresponding to the second user) to the user terminal corresponding to the message receiver (e.g., the first terminal corresponding to the first user), and in the subsequent steps, the voice message may be converted, so that the first user and the second user may directly obtain the converted text information corresponding to the voice message. The first terminal and the second terminal both operate application clients corresponding to the server, and the transmission and the reception of the voice message between the first terminal and the second terminal can be realized through the application clients.
For ease of understanding, please refer to fig. 2, and fig. 2 is a schematic diagram of a scenario for performing data interaction according to an embodiment of the present application. The server shown in fig. 2 may be the server 3000 in the embodiment corresponding to fig. 1, and the user terminal X shown in fig. 2 may be any one of the user terminals in the user terminal cluster in the embodiment corresponding to fig. 1, and for convenience of understanding, in this embodiment of the application, the user terminal 3000c shown in fig. 1 is taken as the user terminal X, so as to describe a specific process of data interaction between the user terminal X shown in fig. 2 and the server.
It is understood that the application database shown in fig. 2 may specifically include a plurality of databases, and the plurality of databases may specifically include database 10a, database 10b, …, and database 10n shown in fig. 2. This means that the application database can be used to store the voice content corresponding to different voice messages in an application client (e.g., an office client). For example, database 10a may be used to store voice content 1 corresponding to the voice message x1, database 10b may be used to store voice content 2 corresponding to the voice message x2, …, and database 10n may be used to store voice content n corresponding to the voice message xn (not shown in the figure).
The application database may be referred to as a database (Database) for short. A database may be regarded as an electronic filing cabinet, that is, a place for storing electronic files, in which a user may add, query, update, and delete data. A "database" is a collection of data that is stored together in a manner that can be shared by multiple users, has as little redundancy as possible, and is independent of the application.
As shown in fig. 2, when acquiring the voice message of the session interface 2a, the user terminal X may add the voice identifier corresponding to the voice message to the initial identifier queue. It should be understood that in the session interface 2a, the number of the voice messages received by the user terminal X may be one or more, and the specific number of the received voice messages will not be limited herein.
For ease of understanding, the number of received voice messages is taken as four for example. Specifically, the voice messages may include, but are not limited to, the voice message X1, the voice message X2, the voice message X3 and the voice message X4 shown in fig. 2, and at this time, the user terminal X may also receive the voice identifiers of these voice messages together, that is, one voice message corresponds to one voice identifier. Then, the user terminal X may add the voice identifier X1 corresponding to the received voice message X1, the voice identifier X2 corresponding to the voice message X2, the voice identifier X3 corresponding to the voice message X3, and the voice identifier X4 corresponding to the voice message X4 to the initial identifier queue in sequence according to their receiving timestamps; for example, the voice identifiers may be added to a first sub-queue of the initial identifier queue to obtain a target identifier queue. The initial identifier queue may include a first sub-queue and a second sub-queue, where the first sub-queue may be used to store the voice identifiers of voice messages for which a voice conversion request is to be sent, and the second sub-queue may be used to store the voice identifiers of voice messages for which a voice conversion request has already been sent. It should be understood that, in the embodiments of the present application, the first sub-queue, to which new voice identifiers are currently added, may be collectively referred to as a pending identification queue, and the second sub-queue, which holds the voice identifiers of voice messages for which a conversion request has been sent but conversion is still pending (i.e., to-be-converted voice messages), may be collectively referred to as a requested identification queue.
It should be understood that, if the user terminal adds the above 4 voice identifiers to the first sub-queue at time T2, the above pending identification queue is obtained. At a time immediately before time T2 (for example, at time T1, that is, immediately before the 4 voice identifiers are added to the first sub-queue), the first sub-queue may specifically include L queue positions, and each queue position in the first sub-queue may be used to store one voice identifier of a voice conversion request to be sent. L is a positive integer, and the value of L is not limited herein. For example, at time T1, if the first sub-queue currently stores 6 voice identifiers of voice conversion requests to be sent, 6 of the L queue positions of the first sub-queue are in an occupied state and (L-6) are in an unoccupied state; therefore, when the 4 voice identifiers are added to the first sub-queue at time T2, 10 (i.e., 6+4) of the L queue positions of the first sub-queue will be in an occupied state, and the first sub-queue at time T2 may be used as the pending identification queue in the embodiment of the present application.
For another example, the second sub-queue may specifically include M queue positions, and each queue position in the second sub-queue may correspond to one voice identifier for which a voice conversion request has been sent. M is a positive integer. Here, for ease of understanding, it is taken as an example that the second sub-queue includes 5 (for example, M is equal to 5) queue positions at time T1, which means that 5 voice identifiers of sent voice conversion requests, for example, the voice identifier Y1, the voice identifier Y2, the voice identifier Y3, the voice identifier Y4, and the voice identifier Y5, are currently stored in the M positions of the second sub-queue. If the voice identifiers of these 5 to-be-converted voice messages, for which voice conversion requests have been sent, are still in the to-be-converted state at time T2, the second sub-queue corresponding to these 5 voice messages may be used as the above-mentioned requested identification queue at time T2.
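The occupancy bookkeeping of the two sub-queues described above can be sketched in Python. This is a minimal sketch under assumed names; the class, attribute and identifier names are illustrative only and are not prescribed by the application.

```python
from collections import deque

# Sketch of the target identification queue: a first ("pending") sub-queue
# with L positions for conversion requests not yet sent, and a second
# ("requested") sub-queue with M positions for requests already sent.
class TargetIdentificationQueue:
    def __init__(self, l_positions, m_positions):
        self.pending = deque()        # first sub-queue (to-be-sent requests)
        self.requested = deque()      # second sub-queue (sent requests)
        self.l_positions = l_positions  # L
        self.m_positions = m_positions  # M

    def add_voice_ids(self, ids_with_timestamps):
        # Earlier receiving timestamps get queue positions closer to the
        # head, i.e. a higher request priority.
        for voice_id, _ts in sorted(ids_with_timestamps, key=lambda p: p[1]):
            if len(self.pending) < self.l_positions:
                self.pending.append(voice_id)

# At time T1, six positions of the first sub-queue are occupied; at time
# T2 the four new identifiers are added, leaving ten positions occupied.
q = TargetIdentificationQueue(l_positions=20, m_positions=5)
q.add_voice_ids([(f"old{i}", i) for i in range(6)])
q.add_voice_ids([("X1", 10), ("X2", 11), ("X3", 12), ("X4", 13)])
```

After the second call, `q.pending` holds 10 identifiers, matching the 6+4 occupied positions in the example above.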
It should be understood that, optionally, when the server successfully acquires, at the next time after time T2 (for example, time T3), the converted text information of 3 (for example, N is equal to 3) voice identifiers in the requested identification queue for which voice conversion requests have been sent (for example, the converted text information 1 of the voice identifier Y1, the converted text information 2 of the voice identifier Y2, and the converted text information 3 of the voice identifier Y3), the conversion success information for these 3 voice identifiers may be transmitted to the user terminal X. At this time, the user terminal X may output the converted text information of the 3 voice identifiers to the session interface of the user terminal X for display. It should be understood that, when acquiring the conversion success information for the 3 voice identifiers, the user terminal X may further release the queue positions occupied by the 3 voice identifiers (i.e., the voice identifiers Y1, Y2, Y3) in the requested identification queue (i.e., in the target identification queue). This means that, at this point, three queue positions in the requested identification queue are currently unoccupied. In this way, the user terminal can adjust the queue position of each voice identifier currently stored in the target identification queue according to the preset voice conversion condition.
For example, the 3 voice identifiers with the highest request priority (e.g., the voice identifier X1, the voice identifier X2, and the voice identifier X3) may be picked out from the pending identification queue of the target identification queue according to the receiving timestamps of the 10 voice identifiers and added to the requested identification queue, so that a voice conversion request for each of these 3 voice identifiers may be generated, and then the 3 voice conversion requests may be sent to the server shown in fig. 2.
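The release-and-promote step described above can be sketched as follows. The function and variable names are illustrative assumptions; the application does not prescribe this exact interface.

```python
from collections import deque

def release_and_promote(pending, requested, converted_ids, m_positions):
    """On conversion success, free the positions of the converted
    identifiers in the requested identification queue, then move the
    highest-priority identifiers from the head of the pending
    identification queue into the freed positions."""
    for vid in converted_ids:
        requested.remove(vid)            # e.g. Y1, Y2, Y3 are released
    promoted = []
    while pending and len(requested) < m_positions:
        vid = pending.popleft()          # head = highest request priority
        requested.append(vid)
        promoted.append(vid)             # a conversion request is sent for each
    return promoted

pending = deque(["X1", "X2", "X3", "X4"])
requested = deque(["Y1", "Y2", "Y3", "Y4", "Y5"])
sent = release_and_promote(pending, requested, ["Y1", "Y2", "Y3"], m_positions=5)
```

With the example values above, the three freed positions are refilled by X1, X2 and X3, for which new conversion requests would then be generated.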
As shown in fig. 2, the server may receive the voice conversion requests for the 3 voice identifiers (e.g., the voice identifier X1, the voice identifier X2, and the voice identifier X3), and may distribute the voice contents corresponding to the voice identifiers in the 3 voice conversion requests to the voice processing server cluster, so as to improve the voice conversion efficiency in a distributed processing manner. The voice processing server cluster may include one or more voice processing servers. For ease of understanding, the plurality of voice processing servers may specifically include the voice processing server 100a, the voice processing server 100b, and the voice processing server 100c. For example, the server shown in fig. 2 may forward the voice content 1, corresponding to the voice identifier X1 and found in the database 10a, to the voice processing server 100a, so that the voice processing server 100a may perform conversion processing on the voice content 1 and return the converted text information (e.g., the text information 1 shown in fig. 2) to the server shown in fig. 2, so that the server may output the converted text information to the session interface (e.g., the session interface 2b shown in fig. 2) of the user terminal X shown in fig. 2. For another example, the server shown in fig. 2 may forward the voice content 2, corresponding to the voice identifier X2 and found in the database 10b, to the voice processing server 100b, so that the voice processing server 100b may perform conversion processing on the voice content 2 and return the converted text information (e.g., the text information 2 shown in fig. 2) to the server, so that the server may output the converted text information to the session interface of the user terminal X.
By analogy, the server shown in fig. 2 may forward the voice content 3, corresponding to the voice identifier X3 and found in the database 10c, to the voice processing server 100c, so that the voice processing server 100c can perform conversion processing on the voice content 3. It should be understood that the text information obtained by the above conversion processing may be collectively referred to as converted text information in the embodiments of the present application.
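A minimal sketch of the distributed dispatch described above, assuming a simple round-robin assignment (the application does not prescribe a particular assignment policy) and illustrative server and database names:

```python
import itertools

def dispatch_conversions(voice_ids, content_db, servers):
    """Look up each voice identifier's content in the application
    database and assign it to a voice processing server, cycling
    through the cluster in round-robin fashion."""
    assignments = {}
    server_cycle = itertools.cycle(servers)
    for vid in voice_ids:
        content = content_db[vid]               # e.g. voice content 1 for X1
        assignments[vid] = (next(server_cycle), content)
    return assignments

# Hypothetical contents of databases 10a-10c, keyed by voice identifier.
db = {"X1": "content-1", "X2": "content-2", "X3": "content-3"}
plan = dispatch_conversions(["X1", "X2", "X3"], db,
                            ["server-100a", "server-100b", "server-100c"])
```

Each of the three requests lands on a different processing server, which is what allows the conversions to run concurrently.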
The voice processing server 100a, the voice processing server 100b, and the voice processing server 100c may be the same voice processing server for providing the conversion processing service, or may be independent voice processing servers for providing the conversion processing service, and the present invention is not limited thereto. Alternatively, as shown in fig. 2, one or more voice processing servers with conversion processing services may run in the server shown in fig. 2, or may exist independently from the server shown in fig. 2, which is not limited herein.
The voice conversion condition may include one or more of the following conversion conditions: a first conversion condition, a second conversion condition, and a third conversion condition. It can be understood that, for the voice identifier of the same voice message currently located in the first sub-queue, if the queue position (for example, queue position 1) of the voice identifier is adjusted by using the three conversion conditions respectively, the request priority of the new queue position (for example, queue position A) of the voice identifier adjusted by the first conversion condition is higher than the request priority of the new queue position (for example, queue position B) of the voice identifier adjusted by the second conversion condition; meanwhile, the request priority of the new queue position (for example, queue position B) of the voice identifier adjusted by the second conversion condition is higher than the request priority of the new queue position (for example, queue position C) of the voice identifier adjusted by the third conversion condition. It will be appreciated that, in the first sub-queue, the request priority of queue position C will still be higher than the request priority of queue position 1, which the voice identifier originally occupied.
For example, the first conversion condition may be understood as follows: when a user corresponding to the user terminal X performs a trigger operation on a certain voice message (e.g., the voice message 1) in the session interface, the queue position 1 of the voice identifier of the voice message 1 may be adjusted in the target identification queue according to the first conversion condition of the preset voice conversion conditions, so that the voice identifier may subsequently be added to the requested identification queue most quickly. For another example, the second conversion condition may be understood as follows: when the user opens the current session interface through the user terminal X, the user terminal X may adjust, according to the second conversion condition of the preset voice conversion conditions, the queue position 2 of the voice identifier of the voice message read when the current session interface is opened (for example, the voice message 2 whose reading state changes from the unread state to the read state) in the target identification queue, so that the voice identifier may subsequently be added to the requested identification queue relatively quickly. For another example, the third conversion condition may be understood as follows: the user terminal X adjusts the queue position 3 of the voice identifier of a voice message (e.g., the voice message 3) satisfying the third conversion condition in the target identification queue according to the order of the receiving timestamps of the received voice messages and the third conversion condition of the preset voice conversion conditions, so that the voice identifier may subsequently be added to the requested identification queue relatively quickly.
In the embodiment of the present application, the user terminal can send a voice conversion request to the server based on a voice identifier in the target identification queue, without needing to upload the voice message in its local memory to the server; the server queries the voice message corresponding to the uploaded voice identifier and converts the queried voice message, so that the conversion efficiency of the voice message can be improved.
In the embodiment of the present application, on the basis of acquiring a voice message, a user terminal integrated with an application client performs a conversion process on the voice message through the user terminal and a server to obtain a specific process of converting text information, which may be referred to as the following embodiments corresponding to fig. 3 to 11.
Further, please refer to fig. 3, where fig. 3 is a schematic flowchart of a voice data processing method according to an embodiment of the present application. As shown in fig. 3, the method may be executed by a computer device, where the computer device may be a user terminal installed with the office client, and the user terminal may be the user terminal X in the embodiment corresponding to fig. 2; optionally, the computer device may also be a server corresponding to the office client, and the server may be the server in the embodiment corresponding to fig. 2. In other words, the method according to the embodiment of the present application may be executed by the user terminal, may be executed by the server, or may be executed by both the user terminal and the server. For the convenience of understanding, the present embodiment is described as an example in which the method is executed by the user terminal to describe a specific process of acquiring the converted text information corresponding to the voice message in the user terminal. Wherein, the method at least comprises the following steps S101-S103:
step S101: when the application client side obtains the voice message of the session interface, obtaining a voice identifier corresponding to the voice message, adding the voice identifier to an initial identifier queue, and taking the initial identifier queue with the voice identifier as a target identifier queue;
specifically, an application client in a user terminal corresponding to a first user may receive a voice message forwarded by a second user through a server, and receive a voice identifier configured by the server for the voice message. Wherein the session interface includes a second user associated with the first user. The initial identification queue comprises a first sub-queue and a second sub-queue, the first sub-queue is used for storing a first voice identification, and the first voice identification is used for representing the identification of a first voice message of a voice conversion request to be sent in an application client; the second sub-queue is used for storing a second voice identifier, and the second voice identifier is used for representing the identifier of a second voice message of the sent voice conversion request in the application client. Further, the user terminal may obtain a voice conversion condition associated with the session interface, determine the received voice identifier as a target voice identifier based on the voice conversion condition, take the voice message received by the application client as a target voice message, and stamp a receiving time stamp corresponding to the target voice message as a target receiving time stamp. Further, the user terminal may determine, based on the target receiving timestamp, a queue position of the target voice identifier of the target voice message in the first sub-queue including the identifier of the first voice message, and add the target voice identifier to the first sub-queue based on the queue position, to obtain an initial first sub-queue. Further, the user terminal may determine a target identification queue based on the initial first sub-queue and a second sub-queue containing an identification of the second voice message.
For ease of understanding, please refer to fig. 4; fig. 4 is a schematic diagram of a scenario in which a voice identifier is added according to an embodiment of the present application. As shown in fig. 4, the initial identification queue includes a first sub-queue and a second sub-queue; the first sub-queue can be used for storing the voice identifiers of voice conversion requests to be sent, and the second sub-queue can be used for storing the voice identifiers of voice conversion requests that have been sent. The voice identifiers of the voice conversion requests to be sent may be collectively referred to as identifiers of the first voice messages, and the voice identifiers of the voice conversion requests that have been sent may be collectively referred to as identifiers of the second voice messages.
The first sub-queue may specifically include L queue positions (e.g., the 6 queue positions shown in fig. 4, that is, L equals 6), and the second sub-queue may specifically include M queue positions (e.g., the 3 queue positions shown in fig. 4, that is, M equals 3), where L and M are positive integers. Assuming that the time corresponding to the initial identification queue shown in fig. 4 is T1, at this time, 1 voice identifier of a voice conversion request to be sent may be stored in the first sub-queue, for example, the voice identifier Z1; and 3 voice identifiers of sent voice conversion requests may be stored in the second sub-queue, for example, the voice identifier Y1, the voice identifier Y2, and the voice identifier Y3.
It should be understood that the user terminal may obtain the voice message of the session interface, determine the voice message as the target voice message at a time next to time T1 (e.g., time T2), determine the voice identifier corresponding to the target voice message as the target voice identifier, and sequentially add the target voice identifiers (e.g., the voice identifier X1, the voice identifier X2, the voice identifier X3, and the voice identifier X4) to the first sub-queue according to the precedence order of the target receiving timestamps (i.e., receiving timestamps) to obtain an initial first sub-queue { Z1, X1, X2, X3, X4 }. It should be understood that, in the embodiment of the present application, the first sub-queue (e.g., the initial first sub-queue) to which the new voice identifier is currently added may be collectively referred to as the pending identification queue, and the second sub-queue and the initial first sub-queue may be collectively referred to as the target identification queue, that is, the identification queue 4 shown in fig. 4. The number of the voice messages acquired by the user terminal may be one or more, and the specific number of the acquired voice messages is not limited here. Further, the user terminal may send a voice conversion request carrying the target voice identifier to the server in a subsequent step based on the voice conversion condition.
It can be understood that the earlier the receiving timestamp of a voice identifier obtained by the user terminal, the higher the request priority of the queue position where that voice identifier is located after it is added to the first sub-queue. For example, when the target receiving timestamp of the target voice identifier X1 is time T1 and the receiving timestamp of the target voice identifier X2 is time T2, where time T1 is a certain time before time T2, then after the target voice identifier X1 and the target voice identifier X2 are added to the first sub-queue, the request priority of the queue position where the target voice identifier X1 is located is greater than the request priority of the queue position where the target voice identifier X2 is located. As shown in fig. 4, it can be seen that the request priority of the queue position where the target voice identifier X2 is located is greater than the request priority of the queue position where the target voice identifier X3 is located, and the request priority of the queue position where the target voice identifier X3 is located is greater than the request priority of the queue position where the target voice identifier X4 is located.
It is understood that the user terminal may change the request priority of a target voice identifier in the initial first sub-queue of the target identification queue in response to a trigger operation (e.g., a first trigger operation or a second trigger operation) performed by the first user (here, the first user may be a user using the user terminal), i.e., adjust the queue position of the target voice identifier in the first sub-queue, so that the following step S102 may subsequently be performed. The first trigger operation and the second trigger operation may include contact operations such as clicking, long pressing and sliding, and may also include non-contact operations such as voice and gesture, which are not limited herein.
It should be understood that the user terminal may output the target voice message to the session interface in response to a trigger operation (here, the trigger operation may be the first trigger operation) for the session interface where the second user is located, and obtain an initial level adjustment instruction in the voice conversion condition. The request priority of the second sub-queue is higher than that of the first sub-queue, that is, the request priority of the second sub-queue is higher than that of the initial first sub-queue. Further, the user terminal may determine, based on the initial level adjustment instruction, the queue position of the target voice identifier in the initial first sub-queue as a first position, and adjust, in the initial first sub-queue, the queue position of the target voice identifier from the first position to a second position, to obtain an adjusted initial first sub-queue. The request priority of the identifier corresponding to the second position is greater than that of the identifier corresponding to the first position. Further, the user terminal may update the target identification queue based on the adjusted initial first sub-queue and the second sub-queue.
For ease of understanding, please refer to fig. 5, and fig. 5 is a schematic view of a scenario in which a user opens a session according to an embodiment of the present application. As shown in fig. 5, the identification queue 5a may be a target identification queue corresponding to the session interface 50a, the identification queue 5b may be a target identification queue corresponding to the session interface 50b, and the identification queue 5a may be the identification queue 4 in the embodiment corresponding to fig. 4. It can be understood that, when the user terminal acquires the voice message sent by the user "BBB" (and opens a session interface corresponding to the user "BBB", where the voice message may be the voice message Z1), a session interface 50a may be displayed, and at this time, if a target voice message sent by the user "AAA" (where the target voice message may be the voice message X1, the voice message X2, the voice message X3, and the voice message X4) is received, an identification queue 5a shown in fig. 5 may be obtained, where the second sub-queue of the identification queue 5a may include the voice identifications { Y1, Y2, Y3}, and the initial first sub-queue may include the voice identifications { Z1, X1, X2, X3, X4 }. The voice identifier X1 is a voice identifier corresponding to the voice message X1, the voice identifier X2 is a voice identifier corresponding to the voice message X2, the voice identifier X3 is a voice identifier corresponding to the voice message X3, and the voice identifier X4 is a voice identifier corresponding to the voice message X4.
It is understood that, if the first user performs the first trigger operation (for example, a click operation) on the session of the user "AAA" shown in fig. 5 (the user "AAA" may be referred to as the second user), the user terminal may open the session interface (i.e., the session interface 50b) corresponding to the user "AAA" in response to the click operation, and adjust the queue positions of the voice identifier X1, the voice identifier X2, the voice identifier X3 and the voice identifier X4 in the identification queue 5a to change their request priorities, so as to obtain the identification queue 5b shown in fig. 5, where the second sub-queue of the identification queue 5b may include the voice identifiers {Y1, Y2, Y3}, and the adjusted initial first sub-queue may include the voice identifiers {X1, X2, X3, X4, Z1}.
Optionally, it can be understood that, if the first user receives a target voice message (here, the target voice message may be a voice message X5, which is not shown in the figure) sent by a user "CCC" (the user "CCC" may be referred to as a third user), the second sub-queue may include the voice identifiers {Y1, Y2, Y3}, and the initial first sub-queue may include the voice identifiers {X1, X2, X3, X4, Z1, X5}. The voice identifier X5 is the voice identifier corresponding to the voice message X5. If the first user performs the first trigger operation for the session of the user "BBB" shown in fig. 5, and then performs another first trigger operation (for example, a click operation) for the session of the user "CCC", the user terminal may open the session interface corresponding to the user "CCC" in response to the click operation, and adjust the queue position of the voice identifier X5 corresponding to the voice message X5 in the identification queue 5b to change the request priority of the voice identifier X5, where the second sub-queue of the identification queue 5b may include the voice identifiers {Y1, Y2, Y3}, and the adjusted initial first sub-queue may include the voice identifiers {X5, X1, X2, X3, X4, Z1}. Optionally, the second sub-queue of the identification queue 5b may include the voice identifiers {Y1, Y2, Y3}, and the adjusted initial first sub-queue may include the voice identifiers {X1, X2, X3, X4, X5, Z1}.
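The session-opening adjustment illustrated above can be sketched as moving the opened session's identifiers to the head of the pending identification queue. The function name and arguments are illustrative assumptions.

```python
from collections import deque

def promote_session_ids(pending, session_ids):
    """When the first user opens a session, move that session's voice
    identifiers to the head of the pending identification queue,
    preserving their relative (timestamp) order; the remaining
    identifiers keep their order behind them."""
    opened = [vid for vid in pending if vid in session_ids]
    others = [vid for vid in pending if vid not in session_ids]
    return deque(opened + others)

# Identification queue 5a: opening user "AAA"'s session promotes
# X1..X4 past Z1, yielding the adjusted initial first sub-queue of 5b.
pending = deque(["Z1", "X1", "X2", "X3", "X4"])
pending = promote_session_ids(pending, {"X1", "X2", "X3", "X4"})
```

The result matches the adjusted initial first sub-queue {X1, X2, X3, X4, Z1} of identification queue 5b.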
It should be understood that the user terminal may obtain a target level adjustment instruction in the voice conversion condition in response to a trigger operation for the target voice message in the session interface (here, the trigger operation may be the second trigger operation). Further, the user terminal may determine, based on the target level adjustment instruction, the adjusted initial first sub-queue as a target first sub-queue, and adjust the queue position of the target voice identifier from the second position to a third position in the target first sub-queue, to obtain an adjusted target first sub-queue. The request priority of the identifier corresponding to the third position is greater than that of the identifier corresponding to the second position. Further, the user terminal may update the updated target identification queue based on the adjusted target first sub-queue and the second sub-queue.
For ease of understanding, please refer to fig. 6; fig. 6 is a schematic view of a scene in which a voice message is selected by a user according to an embodiment of the present application. As shown in fig. 6, the identification queue 6a and the identification queue 6b may be target identification queues corresponding to the session interface 60, and the identification queue 6a may be the identification queue 5b in the embodiment corresponding to fig. 5. It can be understood that, when acquiring a target voice message sent by a user "AAA" (the user "AAA" may be referred to as the second user) and opening the session interface corresponding to the user "AAA" (where the target voice message may be the voice message X1, the voice message X2, the voice message X3, and the voice message X4 in the embodiment corresponding to fig. 5 described above), the user terminal may display the session interface 60, where the second sub-queue of the identification queue 6a may include the voice identifiers {Y1, Y2, Y3}, and the target first sub-queue may include the voice identifiers {X1, X2, X3, X4, Z1}.
It is understood that, if the first user performs the second triggering operation on the voice message X3 (for example, the second triggering operation may be a click operation performed after performing a right click operation), the user terminal may adjust the queue position of the voice identifier X3 corresponding to the voice message X3 in the identification queue 6a to change the request priority of the voice identifier X3, resulting in the identification queue 6b shown in fig. 6, wherein the second sub-queue of the identification queue 6b may include the voice identifiers { Y1, Y2, Y3}, and the adjusted target first sub-queue may include the voice identifiers { X3, X1, X2, X4, Z1 }.
Alternatively, it is understood that, if the first user performs the second trigger operation on the voice message X3 and then performs another second trigger operation on the voice message X4 (for example, another second trigger operation may be a click operation performed after performing a right click operation), in response to the click operation, the user terminal may adjust the queue position of the voice identifier X4 corresponding to the voice message X4 in the identifier queue 6b to change the request priority of the voice identifier X4, wherein the second sub-queue of the identifier queue 6b may include the voice identifiers { Y1, Y2, Y3}, and the adjusted target first sub-queue may include the voice identifiers { X4, X3, X1, X2, Z1 }. Optionally, the second sub-queue of the identifier queue 6b may include a speech identifier { Y1, Y2, Y3}, and the adjusted target first sub-queue may include a speech identifier { X3, X4, X1, X2, Z1 }.
Alternatively, it is understood that the first user may perform a second trigger operation (e.g., the second trigger operation may be a click operation performed after performing a multiple selection operation) on a plurality of target voice messages (e.g., the voice message X3 and the voice message X4) at the same time in the session interface 60, so that the user terminal may respond to the click operation by adjusting the queue positions of the voice identifier X3 and the voice identifier X4 in the identifier queue 6a, so as to change the request priorities of the voice identifier X3 and the voice identifier X4, thereby obtaining the adjusted target first sub-queue { X3, X4, X1, X2, Z1 }.
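The queue-position adjustments described above can be sketched as a small helper that moves the selected voice identifiers to the head of the to-be-requested queue. This is a minimal illustration; the function and variable names are our own, not from the patent, and the exact reordering rule is an assumption.

```python
from collections import deque

def prioritize(pending, selected):
    """Move the selected voice identifiers to the head of the
    to-be-requested queue, preserving the selection order and the
    relative order of the remaining identifiers."""
    remaining = [vid for vid in pending if vid not in selected]
    return deque(selected + remaining)

# Scenario from fig. 6: multi-selecting X3 and X4 in identifier queue 6a
queue_6a = deque(["X1", "X2", "X3", "X4", "Z1"])
adjusted = prioritize(queue_6a, ["X3", "X4"])
print(list(adjusted))  # ['X3', 'X4', 'X1', 'X2', 'Z1']
```

Selecting the messages one at a time (first X3, then X4) instead yields { X4, X3, X1, X2, Z1 }, matching the single-selection case in the text.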
In this regard, it can be understood that the first user need not perform the second trigger operation on the target voice message only after performing the first trigger operation on the session interface; the second trigger operation may instead be performed directly on the current session interface. In this case, the current session interface of the application client is already the session interface corresponding to the second user, and the voice message acquired by the user terminal is the voice message of that session interface. Since the session interface of the second user is already open, the first user does not need to execute the first trigger operation for it, yet the same effect is achieved, that is, the "unread" message is converted into the "read" message, and the second trigger operation can then be performed on the target voice message in the session interface of the second user in the subsequent step.
It can be understood that, in the embodiment of the present application, the initial first sub-queue, the adjusted initial first sub-queue (i.e., the target first sub-queue), and the adjusted target first sub-queue may be collectively referred to as a to-be-requested identifier queue, and the sub-queue holding the voice identifiers of voice messages for which a voice conversion request has been sent but which are still in a to-be-converted state (i.e., the second sub-queue) may be referred to as a requested identifier queue; the to-be-requested identifier queue and the requested identifier queue may then be collectively referred to as a target identifier queue. Based on this, the second sub-queue together with the initial first sub-queue, the second sub-queue together with the adjusted initial first sub-queue (i.e., the target first sub-queue), and the second sub-queue together with the adjusted target first sub-queue may each be referred to as a target identifier queue.
It will be appreciated that the voice conversion conditions may include one or more of the following: a first conversion condition, a second conversion condition, and a third conversion condition, where the first conversion condition corresponds to the target level adjustment command and the second conversion condition corresponds to the initial level adjustment command. It can be understood that, for the voice identifier of the same voice message currently located in the first sub-queue (or to-be-requested identifier queue), if the queue position of the voice identifier (for example, queue position 1) is adjusted under each of the three conversion conditions respectively, the request priority of the new queue position obtained under the first conversion condition (for example, queue position A) is higher than the request priority of the new queue position obtained under the second conversion condition (for example, queue position B); in turn, the request priority of queue position B is higher than the request priority of the new queue position obtained under the third conversion condition (for example, queue position C). It will be appreciated that, in the first sub-queue, the request priority of queue position C is still higher than that of queue position 1, which the voice identifier originally occupied.
For example, when the first user corresponding to the user terminal performs a trigger operation on a certain voice message (e.g., voice message 1) in the session interface, the first conversion condition may be understood as follows: the queue position 1 of the voice identifier of voice message 1 is adjusted in the target identifier queue according to the preset first conversion condition, so that this voice identifier can subsequently be added to the requested identifier queue most quickly. For another example, the second conversion condition may be understood as follows: when the first user opens the current session interface through the user terminal, the user terminal may adjust, according to the preset second conversion condition, the queue position 2 of the voice identifier of a voice message read upon opening the current session interface (for example, voice message 2, whose reading state changes from unread to read) in the target identifier queue, so that the voice message can subsequently be added to the requested identifier queue relatively quickly. For another example, the third conversion condition may be understood as follows: the user terminal adjusts, in order of the receiving timestamps of the received voice messages and according to the preset third conversion condition, the queue position 3 of the voice identifier of a voice message (e.g., voice message 3) satisfying the third conversion condition in the target identifier queue, so that the voice message can be added to the requested identifier queue in its turn.
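Under stated assumptions, the three conversion conditions can be modeled as priority tiers: an identifier adjusted under a stronger condition is reinserted ahead of identifiers adjusted under weaker ones. The patent does not specify the exact insertion rule, so the sketch below is illustrative only.

```python
from enum import IntEnum

class Condition(IntEnum):
    """Lower value = stronger condition = higher request priority."""
    FIRST = 0   # direct trigger operation on a voice message
    SECOND = 1  # message read when the session interface is opened
    THIRD = 2   # ordered only by receiving timestamp (default)

def adjust(queue, vid, condition):
    """Reinsert vid behind every identifier whose condition is at least
    as strong; `queue` is a list of (identifier, condition) pairs."""
    queue = [(v, c) for v, c in queue if v != vid]
    pos = sum(1 for _, c in queue if c <= condition)
    queue.insert(pos, (vid, condition))
    return queue

q = [("Y1", Condition.THIRD), ("Y2", Condition.THIRD)]
q = adjust(q, "Y2", Condition.SECOND)  # Y2 read on opening the interface
q = adjust(q, "Y1", Condition.FIRST)   # Y1 directly triggered afterwards
print([v for v, _ in q])  # ['Y1', 'Y2']
```

The directly triggered identifier Y1 ends up ahead of Y2 even though Y2 was promoted first, reflecting the priority ordering first > second > third described above.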
Step S102: generating a voice conversion request carrying the voice identification based on the queue position of the voice identification in the target identification queue, and sending the voice conversion request to a server so that the server can acquire conversion text information corresponding to the voice identification based on the voice conversion request;
Specifically, the user terminal may receive conversion success information returned by the server for the M to-be-converted voice messages for which the voice conversion request has been sent, and record the number of received pieces of conversion success information as N. The target identifier queue includes a to-be-requested identifier queue and a requested identifier queue; the voice identifier is located in the to-be-requested identifier queue, and the requested identifier queue includes M queue positions, each of which is used for storing the identifier of one to-be-converted voice message, where M may be the total number of identifiers of to-be-converted voice messages for which the voice conversion request has been sent, and N may be a positive integer less than or equal to M. Further, the user terminal may obtain the queue position of the voice identifier in the to-be-requested identifier queue of the target identifier queue, and determine the target queue position of the voice identifier in the requested identifier queue when the queue position of the voice identifier satisfies a voice conversion condition. Further, the user terminal may add the voice identifier to the requested identifier queue based on the target queue position, generate a voice conversion request carrying the voice identifier based on the requested identifier queue to which the voice identifier has been added, and send the voice conversion request to the server.
When receiving the conversion success information of the N voice messages to be converted, the user terminal may send a voice conversion request to the server based on the voice identifiers of the voice messages to be requested in the identifier queue to be requested. It can be understood that the user terminal may obtain the voice identifiers of N voice messages to be requested in the queue of identifiers to be requested, add the N voice identifiers to the target queue position of the queue of identifiers already requested, and send a voice conversion request to the server based on the N voice identifiers. Optionally, if the to-be-requested identifier queue does not include the voice identifiers of N to-be-requested voice messages, for example, the to-be-requested identifier queue may include the voice identifiers of K to-be-requested voice messages, where K may be a positive integer smaller than N, the user terminal may obtain the voice identifiers of K to-be-requested voice messages in the to-be-requested identifier queue, add the K voice identifiers to the target queue position of the requested identifier queue, and send a voice conversion request to the server based on the K voice identifiers.
Optionally, when acquiring the voice message, the user terminal may directly send the voice conversion request to the server based on the voice identifiers of the to-be-requested voice messages in the to-be-requested identifier queue. At this time, all M queue positions in the requested identifier queue are in an unoccupied state, so the user terminal may acquire the voice identifiers of M to-be-requested voice messages in the to-be-requested identifier queue, add the M voice identifiers to the target queue positions of the requested identifier queue, and send the voice conversion request to the server based on the M voice identifiers. Optionally, if the requested identifier queue already contains to-be-converted voice messages, for example the voice identifiers of (M-L) to-be-converted voice messages, where L may be a positive integer smaller than M, then L queue positions in the requested identifier queue are in an unoccupied state, and the user terminal may obtain the voice identifiers of L to-be-requested voice messages in the to-be-requested identifier queue, add the L voice identifiers to the target queue positions of the requested identifier queue, and send a voice conversion request to the server based on the L voice identifiers. Optionally, if the to-be-requested identifier queue does not contain the voice identifiers of M (or L) to-be-requested voice messages, for example it may contain the voice identifiers of only K to-be-requested voice messages, where K may be a positive integer smaller than M (or L), the user terminal may obtain the K voice identifiers in the to-be-requested identifier queue, add them to the target queue positions of the requested identifier queue, and send a voice conversion request to the server based on the K voice identifiers.
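The three optional cases above (M free slots, L free slots after partial occupancy, or only K pending identifiers) all reduce to moving min(free slots, pending identifiers) at a time. A hedged sketch, with names of our own choosing:

```python
def fill_requested(pending, requested, capacity):
    """Move min(free slots, len(pending)) identifiers from the head of
    the to-be-requested queue into the requested identifier queue, and
    return the new queues plus the batch to carry in the request."""
    free = max(capacity - len(requested), 0)
    batch = pending[:free]
    return pending[len(batch):], requested + batch, batch

# L = 2 free slots remain, so only 2 of the 3 pending identifiers move
pending, requested, batch = fill_requested(
    ["X1", "X2", "X3"], ["Y1", "Y2", "Y3"], capacity=5)
print(batch)      # ['X1', 'X2']
print(requested)  # ['Y1', 'Y2', 'Y3', 'X1', 'X2']
```

With an empty requested queue the same call moves up to M identifiers, and with fewer than `free` pending identifiers it moves only the K that exist, covering all three cases.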
For ease of understanding, please refer to fig. 7, which is a schematic diagram of a scenario of receiving converted text information according to an embodiment of the present application. As shown in fig. 7, the identifier queue 7a may be the identifier queue 4 in the embodiment corresponding to fig. 4. At the time (e.g., time T3) next to time T2 corresponding to the identifier queue 4 shown in fig. 4, that is, the time next to time T2 corresponding to the identifier queue 7a shown in fig. 7, the server may return conversion success information to the user terminal based on the 3 voice identifiers for which voice conversion requests have been sent in the requested identifier queue (e.g., conversion success information returned for the voice identifier Y1 and the voice identifier Y2), at which time the user terminal may output the converted text information corresponding to the voice identifier Y1 and the voice identifier Y2 to the session interface of the user terminal. It should be understood that, upon acquiring the conversion success information returned for the voice identifier Y1 and the voice identifier Y2, the user terminal may release the queue positions occupied by these 2 voice identifiers in the requested identifier queue of the target identifier queue to obtain the identifier queue 7b shown in fig. 7, which means that two queue positions in the requested identifier queue are now in an unoccupied state. Based on this, the user terminal can pick the two voice identifiers with the highest request priority (e.g., the voice identifier Z1 and the voice identifier X1) from the to-be-requested identifier queue of the identifier queue 7b and add these two voice identifiers to the target queue positions of the requested identifier queue. Depending on the target queue positions, the requested identifier queue may include the voice identifiers { Y3, Z1, X1 }, in which case the voice identifiers in the requested identifier queue are sorted according to the time of sending the voice conversion request.
Optionally, the requested identifier queue may include the voice identifiers { Z1, X1, Y3 }, in which case the voice identifiers in the requested identifier queue need not be sorted according to the time of sending the voice conversion request, and there is no limitation on the queue positions of the identifiers in the requested identifier queue.
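The release-and-refill step of fig. 7 can be sketched as follows. In that figure the requested identifier queue behaves as if it holds three positions, so the example call uses a capacity of 3; that figure-specific capacity, like the function names, is an assumption for illustration.

```python
def on_conversion_success(pending, requested, done, capacity):
    """Release the queue positions of successfully converted
    identifiers, then pull the highest-priority identifiers from the
    head of the to-be-requested queue into the freed positions."""
    requested = [v for v in requested if v not in done]
    free = max(capacity - len(requested), 0)
    refill, pending = pending[:free], pending[free:]
    return pending, requested + refill

# Fig. 7: Y1 and Y2 convert successfully; Z1 and X1 are pulled in next
pending, requested = on_conversion_success(
    ["Z1", "X1", "X2"], ["Y1", "Y2", "Y3"], {"Y1", "Y2"}, capacity=3)
print(requested)  # ['Y3', 'Z1', 'X1']
```

The resulting { Y3, Z1, X1 } matches the requested identifier queue of fig. 7 when identifiers are ordered by request-sending time.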
Alternatively, the user terminal may, in response to a trigger operation (e.g., the first trigger operation or the second trigger operation) performed by the first user, adjust the queue position of a voice identifier in the to-be-requested identifier queue of the identifier queue 7b so as to change the request priority of that voice identifier, so that step S101 may be performed while step S102 and step S103 are performed. The first trigger operation and the second trigger operation may include contact operations such as clicking, long pressing, and sliding, and may also include non-contact operations such as voice and gesture, which are not limited herein.
Step S103: and receiving the converted text information returned by the server, and outputting the converted text information to the position area where the voice message is located in the conversation interface.
The voice message and the converted text information in the location area have an association relationship; for example, they may have an adjacent location relationship in the conversation interface (e.g., the converted text information may be displayed directly below the voice message).
It can be understood that the converted text information received by the user terminal and returned by the server may be complete converted text information corresponding to one voice message, or complete converted text information corresponding to a plurality of voice messages. For example, the voice conversion request may include a voice identifier X1, a voice identifier X2, and a voice identifier X3, and when returning the converted text information, the server may return text information 1 corresponding to the voice identifier X1 and text information 2 corresponding to the voice identifier X2 to the user terminal at the first time stamp, and may also return text information 3 corresponding to the voice identifier X3 to the user terminal at the second time stamp.
Similarly, it can be understood that the converted text information received by the user terminal and returned by the server may also be a converted portion of a single voice message. For example, when the server performs conversion processing on the voice message X2 corresponding to the voice identifier X2, the server may convert the content of text information 2 in batches, and return the resulting partial text information to the user terminal in batches. Suppose text information 2 is "I have sorted the last week's documents this morning!". At time T11, the server may return "I have sorted", at which point the session interface of the user terminal may display the partial converted text of text information 2, i.e., "I have sorted …"; at time T22, the server may return "the last week's documents", at which point the session interface may display "I have sorted the last week's documents …"; at time T33, the server may return "this morning!", at which point the session interface of the user terminal may display the complete converted text of text information 2, i.e., "I have sorted the last week's documents this morning!". Here, time T11 is earlier than time T22, and time T22 is earlier than time T33.
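On the client side, the batched return of partial text can be modeled as simple per-identifier accumulation. A minimal sketch with hypothetical names (the patent does not prescribe a buffering structure):

```python
class PartialTextBuffer:
    """Accumulate partial converted text per voice identifier until the
    complete converted text information has arrived."""
    def __init__(self):
        self._parts = {}

    def append(self, voice_id, fragment):
        """Append one returned fragment and return the text so far."""
        self._parts[voice_id] = self._parts.get(voice_id, "") + fragment
        return self._parts[voice_id]

buf = PartialTextBuffer()
buf.append("X2", "I have sorted ")
buf.append("X2", "the last week's documents ")
text = buf.append("X2", "this morning!")
print(text)  # I have sorted the last week's documents this morning!
```

Each intermediate return value is what the session interface would display (with a trailing ellipsis) until the final fragment completes the text.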
Further, it can be understood that, if the converted text information received by the user terminal is only part of the converted text information corresponding to a voice message, the partial converted text information is output to the session interface of the application client, and the user terminal waits to receive the complete converted text information corresponding to that voice message. If the converted text information received by the user terminal is the complete converted text information corresponding to one or more voice messages, the user terminal obtains the conversion success information sent by the server for the one or more voice messages (for example, the target conversion success information for the target voice message), and deletes the corresponding voice identifier from the requested identifier queue of the target identifier queue based on the conversion success information (for example, deletes the target voice identifier based on the target conversion success information), so that the user terminal can continue to send voice conversion requests to the server in the subsequent steps.
It can be understood that, when the user terminal receives the converted text information returned by the server, the converted text information may be stored in a local memory. When the first user later views the converted text information corresponding to a historically received voice message in the application client (assuming that this converted text information is not directly displayed in the conversation interface, or has already been hidden by the first user), the user terminal may directly acquire the converted text information corresponding to the voice message from the local memory, without performing conversion processing on the historically received voice message again, and output the converted text information to the position where the voice message is located in the conversation interface of the application client.
Alternatively, it may be understood that, when the first user views the converted text information corresponding to the historically received voice messages, the first user may perform a trigger operation on one or more voice messages in the historically received voice messages, so that the user terminal may generate a voice conversion request based on the voice identifiers corresponding to the one or more voice messages and send the voice conversion request to the server. Therefore, when the voice conversion algorithm of the server side is updated, the first user can obtain the latest converted text information so as to improve the accuracy of the converted text information obtained by the first user, and at the moment, the user terminal can use the latest converted text information to update the converted text information in the local memory.
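The local-memory behavior described in the two paragraphs above can be sketched as a small cache keyed by voice identifier; a re-triggered conversion (e.g., after the server's voice conversion algorithm is updated) simply overwrites the cached entry. Names are illustrative assumptions.

```python
class ConversionCache:
    """Local store of converted text keyed by voice identifier, so a
    historically received voice message needs no re-conversion."""
    def __init__(self):
        self._store = {}

    def put(self, voice_id, text):
        self._store[voice_id] = text  # also refreshes stale text

    def get(self, voice_id):
        return self._store.get(voice_id)  # None -> request conversion

cache = ConversionCache()
cache.put("X1", "old converted text")
cache.put("X1", "latest converted text")  # server algorithm was updated
print(cache.get("X1"))  # latest converted text
```

A `get` miss is the signal to generate a fresh voice conversion request; a hit serves the stored text straight into the session interface.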
It should be understood that the voice identifiers sent by the user terminal at adjacent time instants and the converted text information it receives do not necessarily correspond one to one. For example, the user terminal may send a voice conversion request to the server based on the voice identifier X1 and the voice identifier X2 at time T1, and send a voice conversion request to the server based on the voice identifier X3 at time T2. At time T3, the user terminal may receive converted text information returned by the server, which may be the converted text information corresponding to the voice identifier X1 (or the voice identifier X2), or the converted text information corresponding to the voice identifier X3. Time T1, time T2, and time T3 may be adjacent in chronological order, where time T1 is earlier than time T2, and time T2 is earlier than time T3.
In the embodiment of the application, when the application client acquires the voice message of the session interface, the user terminal may acquire the voice identifier corresponding to the voice message, add the voice identifier to the initial identifier queue, and use the initial identifier queue with the voice identifier added as the target identifier queue. The voice message of the session interface may be a voice message sent by the application client, or a voice message received by the application client. Further, the user terminal may generate a voice conversion request carrying the voice identifier based on the queue position of the voice identifier in the target identifier queue, and send the voice conversion request to the server, so that the server queries a voice message corresponding to the voice identifier based on the voice conversion request, and performs conversion processing on the voice message to obtain converted text information. Further, the user terminal may receive the converted text information returned by the server, and output the converted text information to the location area where the voice message is located in the session interface. The voice message and the converted text information in the location area have an association relationship therebetween, for example, the voice message and the converted text information may have an adjacent location relationship in the conversation interface. 
It should be understood that, by introducing the target identifier queue, the user terminal can output the converted text information corresponding to a voice message in the session interface of the application client without requiring the first user corresponding to the user terminal to perform a trigger operation when the voice message and its corresponding voice identifier are obtained; the voice message can thus be automatically converted into its corresponding converted text information, achieving proactive delivery of the converted text information. When the voice message is converted based on the voice identifier in the target identifier queue, the user terminal does not need to upload the voice message in the local memory to the server; instead, the server queries the voice message corresponding to the uploaded voice identifier and converts the queried voice message. In this way, problems such as voice message upload failures under an unstable network environment can be avoided, and the conversion efficiency of the voice message can be effectively improved.
Further, please refer to fig. 8, where fig. 8 is a flowchart illustrating a voice data processing method according to an embodiment of the present application. As shown in fig. 8, the method may be executed by a computer device, where the computer device may be a user terminal installed with the office client, and the user terminal may be the user terminal X in the embodiment corresponding to fig. 2; optionally, the computer device may also be a server corresponding to the office client, and the server may be the server in the embodiment corresponding to fig. 2. In other words, the method according to the embodiment of the present application may be executed by the user terminal, may be executed by the server, or may be executed by both the user terminal and the server. For ease of understanding, the present embodiment is described as an example in which the method is executed by the user terminal and the server together. Wherein the method may comprise the steps of:
step S201: when the server acquires the voice message of the application client, generating a voice identifier corresponding to the voice message, and sending the voice message and the voice identifier to the user terminal so that the user terminal adds the voice identifier to the initial identifier queue, and taking the initial identifier queue added with the voice identifier as a target identifier queue;
it can be understood that the user terminal in this embodiment may be a first terminal, and a user corresponding to the first terminal is referred to as a first user, that is, the first user may be a user using the first terminal, and the first terminal may be the user terminal 3000c in the user terminal cluster in the embodiment corresponding to fig. 1. Similarly, in this embodiment of the application, the user corresponding to the second terminal may be referred to as a second user, that is, the second user may be a user using the second terminal, and the second terminal may be the user terminal 3000a in the user terminal cluster in the embodiment corresponding to fig. 1. The first user may be a user who logs in the office client through first account information (for example, account information 1) in the first terminal, and the second user may be a user who logs in the office client through second account information (for example, account information 2) in the second terminal.
It can be understood that, the first user in the embodiment of the present application may be a user that receives a voice message through an application client in the first terminal, that is, a message recipient; the second user in the embodiment of the present application may be a user who sends a voice message through an application client in the second terminal, that is, a message sender. It should be understood that the first user in the embodiment of the present application may serve as both the message receiver and the message sender, for example, the first user may serve as the message receiver through the first terminal, and the first user may also serve as the message sender through the first terminal. Similarly, the second user in the embodiment of the present application may serve as both the message sender and the message receiver, for example, the second user may serve as the message sender through the second terminal, and the second user may also serve as the message receiver through the second terminal.
It can be understood that the message sender and the message receiver may be connected through a server, the server synchronizes the voice message from a user terminal corresponding to the message sender (e.g., a second terminal corresponding to a second user) to a user terminal corresponding to the message receiver (e.g., a first terminal corresponding to a first user), and in a subsequent step, the voice message may be converted, so that the first user and the second user may directly obtain converted text information corresponding to the voice message. The first terminal and the second terminal both operate application clients corresponding to the server, and the transmission and the reception of the voice message between the first terminal and the second terminal can be realized through the application clients.
It can be understood that, if the second user is used as a message sender, the server may receive a voice message sent by the application client of the second terminal, generate a voice identifier corresponding to the voice message (that is, configure a voice identifier corresponding to the voice message), store the voice content and the voice identifier corresponding to the voice message in the application database, and forward the voice message and the voice identifier to the first terminal, so that the first terminal may output the voice message in a session interface of the application client. Meanwhile, the server may return the voice message sent by the second terminal to the second terminal so that the second terminal may output the voice message in the session interface of the application client, and at the same time, the server may return the voice identifier corresponding to the voice message to the second terminal so that the second terminal may send the voice conversion request to the server based on the voice identifier corresponding to the voice message sent by the second user.
For ease of understanding, please refer to fig. 9, and fig. 9 is a schematic diagram of a scenario for forwarding a voice message according to an embodiment of the present application. As shown in fig. 9, the second user (i.e., user "AAA") uses the second terminal to send a voice message to the first user (i.e., user "FFF"), which may be forwarded by the server. The server may receive the voice message sent by the second terminal, generate a voice identifier corresponding to the voice message (i.e., configure a voice identifier corresponding to the voice message), and store the voice identifier and a voice content corresponding to the voice message in the application database. Further, the server may send the voice message and the voice identifier to the first terminal, so that the first terminal outputs the voice message to a session interface of the first terminal, and in a subsequent step, the first terminal may send a voice conversion request to the server based on the voice identifier. Meanwhile, when the server shown in fig. 9 sends the voice message and the voice identifier to the first terminal, the server may send the voice message and the voice identifier to the second terminal, so that the second terminal outputs the voice message to a session interface of the second terminal, and in a subsequent step, the second terminal may send a voice conversion request to the server based on the voice identifier.
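The server-side flow of fig. 9 (configure an identifier, persist the voice content, deliver message plus identifier to both terminals) can be sketched as below. The `app_db` and `send` interfaces, and the use of a UUID as the voice identifier, are assumptions standing in for details the patent does not specify.

```python
import uuid

def handle_voice_message(app_db, send, sender_id, receiver_id, voice_content):
    """Server-side sketch: generate a voice identifier, store the voice
    content in the application database, then forward the message with
    its identifier to both the receiver's and the sender's terminals."""
    voice_id = uuid.uuid4().hex
    app_db[voice_id] = voice_content
    payload = {"voice_id": voice_id, "voice": voice_content}
    send(receiver_id, payload)  # first terminal outputs the message
    send(sender_id, payload)    # second terminal gets the identifier back
    return voice_id

db, outbox = {}, []
vid = handle_voice_message(db, lambda to, msg: outbox.append((to, msg)),
                           "AAA", "FFF", b"...voice bytes...")
print(len(outbox))  # 2
```

Returning the identifier to the sender's terminal as well is what lets the second terminal later issue its own voice conversion request, as described above.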
For a voice message sent by the second user, the second terminal may add the voice identifier to the initial identifier queue when acquiring the voice identifier corresponding to the voice message, so that the second terminal may automatically send a voice conversion request to the server based on the voice identifier. Optionally, when acquiring the voice identifier corresponding to the voice message, the second terminal may instead refrain from adding the voice identifier to the initial identifier queue, and add the acquired voice identifier to the initial identifier queue only when the second user performs a trigger operation on the voice message (the trigger operation here may be the second trigger operation in the embodiment corresponding to fig. 3), so that the second terminal may then send the voice conversion request to the server based on the voice identifier.
Step S202: when an application client side obtains a voice message of a session interface, a user terminal obtains a voice identifier corresponding to the voice message, the voice identifier is added to an initial identifier queue, and the initial identifier queue with the voice identifier added is used as a target identifier queue;
the user terminal may be the first terminal in step S201, and the voice message acquired by the user terminal may be a voice message received by the first user as a message receiver, or a voice message sent by the first user as a message sender.
It is understood that the initial identifier queue and the target identifier queue (abbreviated as "identifier queue" or "queue") may be used to store voice identifiers. The initial identifier queue in this embodiment may include a first sub-queue and a second sub-queue, and the target identifier queue may include a to-be-requested identifier queue and a requested identifier queue. In general terms, a queue is an ordered sequence of items waiting to be served or processed.
It should be understood that the queue position of a voice message in the first sub-queue (or the to-be-requested identifier queue) is determined by the receiving timestamp of the voice message. In response to a trigger operation performed by the first user, the user terminal may adjust the queue position of the voice message in the first sub-queue (or the to-be-requested identifier queue), so that the request priority of the adjusted queue position is greater than that of the queue position before the adjustment. When a voice identifier is added to the initial identifier queue, an enqueue operation may be executed on the first sub-queue; when a voice conversion request is sent to the server, a dequeue operation may be executed on the to-be-requested identifier queue.
It should be understood that the queue positions of voice messages in the second sub-queue (or the requested identifier queue) are determined by the timestamps of their voice conversion requests. The M queue positions in the second sub-queue (or the requested identifier queue) indicate that the sub-queue can hold at most M identifiers, so the user terminal may have at most M outstanding voice conversion requests at the server. If M is too large, the load on the server becomes excessive; if M is too small, voice messages are converted too slowly. In the embodiment of the present application, M may therefore be set to 5. When a voice conversion request is sent to the server, an enqueue operation may be executed on the second sub-queue; when the user terminal receives the converted text information, a dequeue operation may be executed on the requested identifier queue.
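The two-sub-queue mechanics described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the class and method names are invented, and M = 5 follows the suggestion in the text.

```python
from collections import deque

class IdentifierQueue:
    """Sketch of the identifier queue: a pending (to-be-requested) queue
    ordered by receiving timestamp, and an in-flight (requested) queue
    capped at M entries."""

    def __init__(self, max_in_flight=5):  # M = 5 as suggested in the text
        self.pending = deque()      # first sub-queue / to-be-requested
        self.in_flight = deque()    # second sub-queue / requested
        self.max_in_flight = max_in_flight

    def enqueue(self, voice_id, recv_ts):
        # New identifiers are positioned by their receiving timestamp.
        self.pending.append((recv_ts, voice_id))
        self.pending = deque(sorted(self.pending))

    def next_requests(self):
        # Dequeue from pending / enqueue to in-flight until M entries are
        # outstanding; each move corresponds to one voice conversion request.
        sent = []
        while self.pending and len(self.in_flight) < self.max_in_flight:
            _, voice_id = self.pending.popleft()
            self.in_flight.append(voice_id)
            sent.append(voice_id)
        return sent

    def on_converted(self, voice_id):
        # Dequeue from the requested queue when converted text arrives.
        self.in_flight.remove(voice_id)
```

Seven queued identifiers would yield five immediate requests; the sixth is sent only after a conversion result frees a slot, which matches the back-pressure rationale given for choosing M.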
For easy understanding, please refer to fig. 10, fig. 10 is a schematic flowchart of a speech-to-text scheme according to an embodiment of the present application. The application client shown in fig. 10 may be an office client, which may be a client installed on a user terminal, and the target user shown in fig. 10 may be a first user using the user terminal, for example, the user "FFF" described above. When the application client receives the voice message (i.e. the voice message of the display interface of the application client is acquired) and the voice identifier, the voice identifier corresponding to the voice message may be added to the initial identifier queue to obtain a target identifier queue, and the target identifier queue is sorted according to priority (i.e. request priority), that is, the voice identifier is added to the initial identifier queue according to the receiving timestamp of the voice message.
It should be understood that, as shown in fig. 10, when the target user opens a session (i.e., the target user performs the first triggering operation in the embodiment corresponding to fig. 5) or clicks a voice to text (i.e., the target user performs the second triggering operation in the embodiment corresponding to fig. 6), the application client may adjust the queue position of the voice message in the target identification queue, so that the request priority of the adjusted queue position is greater than that of the queue position before adjustment, i.e., the request priority of the voice message is updated. The voice message corresponding to the second trigger operation may have a first priority, the voice message corresponding to the first trigger operation may have a second priority, and the other voice messages (i.e., the voice messages except the first priority and the second priority) may have a third priority. The request priority of the first priority is higher than that of the second priority, and the request priority of the second priority is higher than that of the third priority.
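The three priority tiers can be expressed as a simple reordering rule: a tapped message outranks messages in an opened session, which outrank everything else, with receiving timestamp as the tie-breaker. The constant names and function below are illustrative assumptions, not terms from the patent.

```python
# Tier 1: message the user tapped "convert to text" on (second trigger op).
# Tier 2: messages in the session the user opened (first trigger op).
# Tier 3: all other messages. Lower tier number = requested first.
PRIORITY_CLICKED = 1
PRIORITY_OPENED = 2
PRIORITY_OTHER = 3

def reorder(queue, tiers):
    """queue: list of (recv_ts, voice_id) pairs;
    tiers: mapping voice_id -> priority tier."""
    return sorted(queue, key=lambda e: (tiers.get(e[1], PRIORITY_OTHER), e[0]))
```

Re-running this sort whenever a trigger operation updates a tier reproduces the "adjusted queue position has greater request priority" behavior described above.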
It can be understood that the first trigger operation may be a trigger operation executed by the target user for a session interface of the group message, and at this time, the user terminal may obtain voice identifiers corresponding to voice messages forwarded by multiple users (e.g., the second user and the third user) through the server, and add the voice identifiers to the initial identifier queue to obtain the target identifier queue. When the user terminal responds to the first trigger operation and outputs the voice message to the session interface, the user terminal can adjust the queue position of the voice identifier corresponding to the voice message in the target identifier queue. For a specific implementation manner of the user terminal adjusting the queue position of the voice identifier corresponding to the voice message of the group message in the target identifier queue, reference may be made to the description of the user terminal adjusting the queue position of the voice identifier corresponding to the voice message of the second user in the target identifier queue, which will not be described herein again.
The specific implementation manner of the user terminal dynamically adjusting the queue position of the voice identifier in the target identifier queue according to the request priority may be referred to the description of step S101 in the embodiment corresponding to fig. 3, which will not be described herein again.
Step S203: the user terminal generates a voice conversion request carrying the voice identification based on the queue position of the voice identification in the target identification queue, and sends the voice conversion request to the server so that the server can obtain conversion text information corresponding to the voice identification based on the voice conversion request;
as shown in fig. 10, when the voice conversion condition is satisfied, the application client may initiate a text conversion request, that is, send a voice conversion request to the server.
For a specific implementation manner of sending the voice conversion request from the user terminal to the server, reference may be made to the description of step S102 in the embodiment corresponding to fig. 3, which will not be described herein again.
Step S204: the server receives a voice conversion request sent by the user terminal and acquires a voice identifier from the voice conversion request;
wherein the voice conversion request is generated based on a queue position of the voice tag in the target tag queue.
It can be understood that, when the voice identifier in a voice conversion request corresponds to a voice message in a group message, the server may receive multiple voice conversion requests sent by multiple user terminals (e.g., the second terminal and the third terminal) and obtain the same voice identifier from those requests; the same voice content queried for that identifier in the subsequent steps may then end up being converted multiple times.
Optionally, to speed up conversion, the server may convert the voice content as soon as the voice message is acquired and store the resulting converted text information in the application database; when a voice conversion request later arrives from the user terminal, the server queries the application database for the converted text information using the voice identifier carried in the request. Similarly and also optionally, when the server receives a given voice identifier for the first time, it may query the application database for the corresponding voice content, convert it, and store the converted text information in the application database; when the same voice identifier is received again, the server may then directly query the application database for the stored converted text information.
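The convert-once-then-cache behavior in the second option can be sketched as below. The handler, the in-memory dicts standing in for the application database, and the identifier `"v1"` are all assumptions for illustration.

```python
text_cache = {}   # voice_id -> converted text (stands in for the app database)
voice_store = {"v1": b"...audio bytes..."}  # voice_id -> stored voice content

def handle_conversion_request(voice_id, convert):
    """Convert each voice identifier at most once; repeated requests for
    the same identifier (e.g. from several group-chat members) reuse the
    cached text instead of re-running the conversion."""
    if voice_id not in text_cache:
        text_cache[voice_id] = convert(voice_store[voice_id])
    return text_cache[voice_id]
```

A second request for `"v1"` returns the cached text without invoking the converter again, which is exactly the duplicate-work problem the group-message scenario raises.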
Step S205: when the voice message corresponding to the voice identification is inquired, the server converts the voice message to obtain converted text information corresponding to the voice message;
It should be understood that the speech conversion algorithm used in the embodiment of the present application to convert the voice content of a voice message into converted text information may be a pattern-matching method: in the training stage, each word in the vocabulary is spoken once and its feature vector is stored in a template library as a template; in the recognition stage, the feature vector of the input voice is compared in turn with each template in the template library, and the word whose template is most similar is output as the recognition result. Optionally, the speech conversion algorithm may instead be a Hidden Markov Model (HMM) method based on a parametric model, an algorithm based on Dynamic Time Warping (DTW), or a Vector Quantization (VQ) method based on a non-parametric model; the embodiment of the present application does not limit the specific type of speech conversion algorithm. Converting a voice message into converted text information may be referred to as voice recognition, and the voice conversion algorithm may be referred to as a voice recognition method.
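As a toy illustration of the template-matching-with-DTW approach mentioned above (not the patent's recognizer): a real system would compare sequences of acoustic feature vectors such as MFCC frames, whereas here each "frame" is a single number.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two sequences,
    allowing the input to stretch or compress in time against a template."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

def recognize(sample, templates):
    # Template matching: output the vocabulary word whose stored template
    # is closest to the input under the DTW distance.
    return min(templates, key=lambda w: dtw_distance(sample, templates[w]))
```

An input that repeats a frame (time-stretched speech) still matches its template at distance zero, which is the property that makes DTW suitable for isolated-word matching.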
For ease of understanding, please refer to fig. 11, which is a schematic view of a scenario for converting speech into text according to an embodiment of the present application. As shown in fig. 11, upon receiving voice conversion requests sent by the user terminal based on voice identifiers (e.g., voice identifier X1 and voice identifier X2), the server may query the application database corresponding to the server for voice content 1 and voice content 2 based on the identifiers carried in the requests, and forward voice content 1 and voice content 2 to the voice processing server for conversion processing. The converted text information (e.g., text information 1 and text information 2 shown in fig. 11) may then be returned to the server, so that the server may output text information 1 and text information 2 to the user terminal shown in fig. 11.
The voice processing server may be a single voice processing server providing the conversion service, or a cluster of mutually independent voice processing servers, e.g., a cluster comprising voice processing server 100a, voice processing server 100b, …, and voice processing server 100n (voice content 1 may then be converted by voice processing server 100a, voice content 2 by voice processing server 100b, and likewise voice content 3 by voice processing server 100c); this is not limited here. Optionally, the one or more voice processing servers providing the conversion service may run inside the server shown in fig. 11 or exist independently of it, which is likewise not limited here.
Step S206: the server returns the converted text information to the user terminal so that the user terminal outputs the converted text information to the position area where the voice message is located in the session interface of the application client;
As shown in fig. 10, when the server performs conversion processing on the voice message and the conversion succeeds, the server may return the text information (i.e., the converted text information) corresponding to the voice message to the application client, so that the application client displays it and the target user can visually check the converted text information in the session interface of the application client. Conversely, if the conversion processing of the voice message fails, the server may return rejection prompt information to the application client, so that the target user may send the voice conversion request to the server again. The conversion may fail for various reasons: for example, the speech in the voice message is too fast, is spoken in a dialect, is too noisy, or is in an unsupported language type.
It can be understood that, due to network instability, the server's return of the converted text information to the user terminal may fail, in which case the user terminal may also receive rejection prompt information. Similarly, due to network instability, the user terminal's sending of the voice conversion request to the server may fail, in which case the user terminal may likewise receive rejection prompt information.
The method for returning the rejection prompt information to the user terminal by the server may be to pop up a sub-interface independent of the original session interface on the session interface of the application client, where the sub-interface may prompt: "voice conversion failed, please retry". It will be appreciated that the prompt on the sub-interface may vary depending on the reason for returning the rejection prompt.
Step S207: and the user terminal receives the converted text information returned by the server and outputs the converted text information to the position area where the voice message is located in the conversation interface.
Wherein, the voice message in the position area has an association relation with the converted text information.
It can be understood that, in this embodiment of the present application, the first user may further select one or more users of interest from the group list corresponding to the conversation interface of the group message, so that when the user terminal receives voice messages sent by those users, the converted text information corresponding to their voice messages is output on the conversation interface. In addition, optionally, the user terminal may display the converted text information corresponding to the selected users' voice messages on a separate display interface, so that on that interface the first user can listen to, and view the converted text of, only the voice messages of the selected users.
It can be understood that, when acquiring a voice message (for example, a voice message a sent by a second user to a first user), the server may configure a unique voice identifier for the voice message a, so that the voice identifier and the voice message a may be distributed to a user terminal corresponding to the first user (i.e., the first terminal) and a user terminal corresponding to the second user (i.e., the second terminal), so that the second user may view the voice message a sent by the second user on a session interface of the second terminal. At this time, the first terminal and the server may acquire the converted text information corresponding to the voice message a in a data interaction manner described in the above step S201 to step S207.
Optionally, for convenience of understanding, in the embodiment of the present application, taking the user terminal that acquires the voice identifier of the voice message a as an example of a user terminal corresponding to a first user (that is, the first terminal described above), another implementation manner that the first terminal automatically receives the converted text information corresponding to the voice message a sent by the server is described.
For example, considering that the voice content corresponding to voice message A may already be stored on the server, and in order to speed up conversion, the server in this embodiment of the application may convert the voice content of voice message A locally while sending voice message A and its voice identifier to the user terminal corresponding to the first user (i.e., the first terminal); it then need not wait for a voice conversion request sent by the user terminal based on the queue position of the voice identifier in the target identifier queue. When the conversion of the voice content of voice message A is complete, the server may directly push the resulting converted text information to the user terminal corresponding to the first user, so that the converted text information corresponding to voice message A is displayed in the session interface of that terminal.
Optionally, in the process of converting the voice content of the voice message a, the server may further perform semantic analysis on the voice content corresponding to the voice message a, so that when it is detected that semantic information of a preset keyword exists in the voice message a, the preset keyword may be further identified in the converted text information corresponding to the voice message a (for example, the preset keyword is highlighted in the converted text information), and then the converted text information carrying the keyword after the identification processing may be returned to the user terminal. It can be understood that, when receiving the converted text information carrying the keyword after the identification processing, the user terminal may display the converted text information carrying the keyword and the voice message after the identification processing in the current session interface.
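The keyword-marking step above can be sketched as below. The markup tag, the keyword list, and the function name are all illustrative assumptions; the patent does not specify how marked keywords are encoded for the client.

```python
import re

# Hypothetical preset keywords detected during semantic analysis.
PRESET_KEYWORDS = ["meeting", "deadline"]

def mark_keywords(text, keywords=PRESET_KEYWORDS):
    """Wrap each preset keyword found in the converted text so the client
    can highlight it in the session interface."""
    for kw in keywords:
        text = re.sub(re.escape(kw), lambda m: f"<em>{m.group(0)}</em>", text)
    return text
```

The client would then render the wrapped spans with highlighting, and (per the following paragraph) route specific keyword types to a pop-up window independent of the session interface.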
Optionally, if the type of the keyword belongs to a specific type of keyword in a group session, the user terminal may further output and display the received converted text information carrying the keyword after the identifier processing on another display interface independent from the session interface, for example, the converted text information carrying the keyword after the identifier processing may be displayed in a popup window independent from the current session interface.
It should be understood that, by introducing the target identifier queue, the user terminal can output the converted text information corresponding to a voice message in the session interface of the application client without requiring any trigger operation from the first user once the voice message and its voice identifier are obtained; the voice message is thus converted into text automatically, enabling the converted text to be actively delivered to the user. Moreover, when conversion is driven by the voice identifier in the target identifier queue, the user terminal need not upload the voice message from its local memory to the server: the server queries the voice message corresponding to the uploaded voice identifier and converts the queried message itself. This avoids problems such as failed uploads of the voice message under an unstable network environment, and can effectively improve the conversion efficiency of voice messages.
Further, please refer to fig. 12, where fig. 12 is a schematic structural diagram of a voice data processing apparatus according to an embodiment of the present application. The voice data processing apparatus 1 may be applied to the user terminal, which may be the user terminal 3000c in the embodiment corresponding to fig. 1. Wherein, the voice data processing apparatus 1 may include: the voice acquisition module 10, the request sending module 20 and the text receiving module 30; further, the voice data processing apparatus 1 may further include: an identity deletion module 40;
the voice obtaining module 10 is configured to, when the application client obtains the voice message of the session interface, obtain a voice identifier corresponding to the voice message, add the voice identifier to the initial identifier queue, and use the initial identifier queue to which the voice identifier is added as a target identifier queue;
wherein, the session interface comprises a second user associated with the first user; the initial identification queue comprises a first sub-queue and a second sub-queue; the first sub-queue is used for storing a first voice identifier; the first voice identification is used for representing the identification of a first voice message of a voice conversion request to be sent in the application client; the second sub-queue is used for storing a second voice identifier; the second voice identification is used for representing the identification of the second voice message of the sent voice conversion request in the application client;
the voice acquisition module 10 includes: a voice receiving unit 101, a time stamp determining unit 102, an identifier adding unit 103, a queue determining unit 104; optionally, the voice acquiring module 10 may further include: a first trigger unit 105, a first adjusting unit 106, a first updating unit 107, a second trigger unit 108, a second adjusting unit 109, and a second updating unit 110;
the voice receiving unit 101 is configured to receive, by an application client corresponding to a first user, a voice message forwarded by a second user through a server, and receive a voice identifier configured by the server for the voice message;
a timestamp determining unit 102, configured to obtain a voice conversion condition associated with the session interface, determine, based on the voice conversion condition, a received voice identifier as a target voice identifier, use a voice message received by the application client as a target voice message, and mark a receiving timestamp corresponding to the target voice message as a target receiving timestamp;
an identifier adding unit 103, configured to determine, based on the target receiving timestamp, a queue position of a target voice identifier of the target voice message in a first sub-queue including an identifier of the first voice message, and add the target voice identifier to the first sub-queue based on the queue position, to obtain an initial first sub-queue;
a queue determining unit 104, configured to determine a target identification queue based on the initial first sub-queue and a second sub-queue containing an identification of the second voice message.
Optionally, the request priority of the second sub-queue is greater than the request priority of the first sub-queue;
a first triggering unit 105, configured to respond to a triggering operation for a session interface where a second user is located, output a target voice message to the session interface, and obtain an initial level adjustment instruction in a voice conversion condition;
a first adjusting unit 106, configured to determine, based on the initial level adjustment instruction, a queue position of the target voice identifier in the initial first sub-queue as a first position, and adjust, in the initial first sub-queue, the queue position of the target voice identifier from the first position to a second position, to obtain an adjusted initial first sub-queue; the request priority of the identifier corresponding to the second position is greater than that of the identifier corresponding to the first position;
a first updating unit 107, configured to update the target identification queue based on the adjusted initial first sub-queue and second sub-queue.
Optionally, the second triggering unit 108 is configured to respond to a triggering operation for a target voice message in the session interface, and obtain a target level adjustment instruction in the voice conversion condition;
a second adjusting unit 109, configured to determine, based on the target level adjustment instruction, the adjusted initial first sub-queue as a target first sub-queue, and adjust a queue position of the target voice identifier from a second position to a third position in the target first sub-queue to obtain an adjusted target first sub-queue; the request priority of the identifier corresponding to the third position is greater than that of the identifier corresponding to the second position;
and a second updating unit 110, configured to update the updated target identifier queue based on the adjusted target first sub-queue and second sub-queue.
For specific implementation manners of the voice receiving unit 101, the timestamp determining unit 102, the identifier adding unit 103, and the queue determining unit 104, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which will not be described herein again. Optionally, for a specific implementation manner of the first triggering unit 105, the first adjusting unit 106, the first updating unit 107, the second triggering unit 108, the second adjusting unit 109, and the second updating unit 110, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which will not be described again here.
A request sending module 20, configured to generate a voice conversion request carrying a voice identifier based on a queue position of the voice identifier in the target identifier queue, and send the voice conversion request to the server, so that the server obtains conversion text information corresponding to the voice identifier based on the voice conversion request;
the target identification queue comprises an identification queue to be requested and an identification queue requested; the voice identifier is positioned in the identifier queue to be requested; the requested identification queue comprises M queue positions; a queue position in the requested identification queue is used for storing an identification of the voice message to be converted; m is the total number of the identifications of the voice message to be converted, which has sent the voice conversion request;
the request transmission module 20 includes: an information receiving unit 201, a position determining unit 202, a request generating unit 203;
an information receiving unit 201, configured to receive conversion success information, which is returned by the server and is for the M voice messages to be converted that have sent the voice conversion request, and record the conversion number of the received conversion success information as N; n is a positive integer less than or equal to M;
a position determining unit 202, configured to obtain a queue position of the voice identifier in a to-be-requested identifier queue of the target identifier queue, and determine a target queue position of the voice identifier in the requested identifier queue when the queue position of the voice identifier meets a voice conversion condition;
and the request generating unit 203 is configured to add the voice identifier to the requested identifier queue based on the target queue position, generate a voice conversion request carrying the voice identifier based on the requested identifier queue to which the voice identifier is added, and send the voice conversion request to the server.
For specific implementation manners of the information receiving unit 201, the position determining unit 202, and the request generating unit 203, reference may be made to the description of step S102 in the embodiment corresponding to fig. 3, which will not be described herein again.
The text receiving module 30 is configured to receive the converted text information returned by the server, and output the converted text information to the location area where the voice message is located in the session interface; there is an association between the voice message and the converted text information in the location area.
Optionally, the identifier deleting module 40 is configured to, when receiving the converted text information returned by the server, obtain target conversion success information for the voice message, and delete the voice identifier from the target identifier queue based on the target conversion success information.
For specific implementation manners of the voice obtaining module 10, the request sending module 20, and the text receiving module 30, reference may be made to the description of step S101 to step S103 in the embodiment corresponding to fig. 3, which will not be described herein again. Optionally, for a specific implementation manner of the identifier deleting module 40, reference may be made to the description of step S103 in the embodiment corresponding to fig. 3, which will not be described again here. In addition, the beneficial effects of the same method are not described in detail.
Referring to fig. 13, fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 13, the computer apparatus 1000 may include: the processor 1001, the network interface 1004, and the memory 1005, and the computer apparatus 1000 may further include: a user interface 1003, and at least one communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a standard wireless interface. Optionally, the network interface 1004 may include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory 1005 may also be at least one memory device located remotely from the processor 1001. As shown in fig. 13, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 13, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
when the application client obtains a voice message of the session interface, obtaining a voice identifier corresponding to the voice message, adding the voice identifier to an initial identifier queue, and taking the initial identifier queue with the voice identifier added as a target identifier queue;
generating a voice conversion request carrying the voice identifier based on the queue position of the voice identifier in the target identifier queue, and sending the voice conversion request to a server, so that the server acquires converted text information corresponding to the voice identifier based on the voice conversion request;
receiving the converted text information returned by the server, and outputting the converted text information to the position area where the voice message is located in the session interface; there is an association between the voice message and the converted text information in the location area.
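The three client-side steps above can be sketched in Python as follows. This is only an illustrative sketch: the class name, method names, and the `voice_id`/request field names are assumptions for illustration, not part of the embodiment.

```python
from collections import deque


class VoiceClient:
    """Illustrative sketch of the client-side flow; all names are assumed."""

    def __init__(self, server):
        self.server = server            # stand-in for the real server
        self.target_id_queue = deque()  # the "target identifier queue"
        self.texts = {}                 # voice identifier -> converted text

    def on_voice_message(self, voice_id):
        # Step 1: add the voice identifier to the identifier queue; the
        # queue holding it serves as the target identifier queue.
        self.target_id_queue.append(voice_id)

    def request_conversion(self):
        # Step 2: build a conversion request from the identifier at the
        # head of the target identifier queue and send it to the server.
        voice_id = self.target_id_queue.popleft()
        text = self.server.convert({"voice_id": voice_id})
        # Step 3: keep the converted text so it can be output in the
        # location area of the corresponding voice message.
        self.texts[voice_id] = text
        return text
```

In this sketch the queue position directly determines which identifier is requested next; the sub-queue and priority refinements appear in the dependent claims below.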
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the data processing method in the embodiment corresponding to fig. 3 or fig. 8, and may also perform the description of the data processing apparatus 1 in the embodiment corresponding to fig. 12, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
It should further be noted that an embodiment of the present application also provides a computer-readable storage medium storing the computer program executed by the aforementioned data processing apparatus 1. The computer program includes program instructions; when a processor executes the program instructions, the voice data processing method described in the embodiment corresponding to fig. 3 or fig. 8 can be performed, and details are not repeated here. The beneficial effects of the same method are likewise not described in detail. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, reference is made to the description of the method embodiments of the present application.
Further, please refer to fig. 14, fig. 14 is a schematic structural diagram of a voice data processing apparatus according to an embodiment of the present application. The voice data processing apparatus 2 may be applied to the server, which may be the server 3000 in the embodiment corresponding to fig. 1. The voice data processing apparatus 2 may include: the voice sending module 100, the request receiving module 200, the text acquiring module 300 and the text sending module 400;
the voice sending module 100 is configured to generate a voice identifier corresponding to a voice message when the voice message of the application client is acquired, and send the voice message and the voice identifier to the user terminal, so that the user terminal adds the voice identifier to the initial identifier queue, and the initial identifier queue with the voice identifier added is used as a target identifier queue;
a request receiving module 200, configured to receive a voice conversion request sent by a user terminal, and obtain a voice identifier from the voice conversion request; the voice conversion request is generated based on the queue position of the voice identifier in the target identifier queue;
the text acquisition module 300 is configured to, when a voice message corresponding to the voice identifier is queried, perform conversion processing on the voice message to obtain conversion text information corresponding to the voice message;
the text sending module 400 is configured to return the converted text information to the user terminal, so that the user terminal outputs the converted text information to the location area where the voice message is located in the session interface of the application client.
For specific implementation manners of the voice sending module 100, the request receiving module 200, the text obtaining module 300, and the text sending module 400, reference may be made to the description of step S201 to step S207 in the embodiment corresponding to fig. 8, and details will not be described here. In addition, the beneficial effects of the same method are not described in detail.
Referring to fig. 15, fig. 15 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 15, the computer device 2000 may include: a processor 2001, a network interface 2004, and a memory 2005; the computer device 2000 may further include: a user interface 2003 and at least one communication bus 2002. The communication bus 2002 is used to implement connection and communication between these components. The user interface 2003 may include a display (Display) and a keyboard (Keyboard); optionally, the user interface 2003 may further include a standard wired interface and a standard wireless interface. Optionally, the network interface 2004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 2005 may be a high-speed RAM, or a non-volatile memory, such as at least one disk memory. Optionally, the memory 2005 may also be at least one storage device located remotely from the processor 2001. As shown in fig. 15, the memory 2005, as a computer-readable storage medium, may store an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 2000 shown in fig. 15, the network interface 2004 may provide a network communication function; the user interface 2003 mainly provides an interface for user input; and the processor 2001 may be used to invoke the device control application stored in the memory 2005 to implement:
when the voice message of the application client is acquired, generating a voice identifier corresponding to the voice message, and sending the voice message and the voice identifier to a user terminal, so that the user terminal adds the voice identifier to an initial identifier queue and takes the initial identifier queue with the voice identifier added as a target identifier queue;
receiving a voice conversion request sent by a user terminal, and acquiring a voice identifier from the voice conversion request; the voice conversion request is generated based on the queue position of the voice identifier in the target identifier queue;
when the voice message corresponding to the voice identifier is queried, converting the voice message to obtain converted text information corresponding to the voice message;
and returning the converted text information to the user terminal so that the user terminal outputs the converted text information to the position area where the voice message is located in the session interface of the application client.
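The server-side steps above (generate an identifier, store the message, and convert it on request) can likewise be sketched in Python. This is a minimal sketch under assumed names; the identifier format and the pluggable `transcribe` backend are illustrative, not from the embodiment.

```python
import itertools


class VoiceServer:
    """Illustrative sketch of the server-side flow; all names are assumed."""

    def __init__(self, transcribe):
        self._ids = itertools.count(1)   # source of unique voice identifiers
        self._messages = {}              # voice identifier -> audio payload
        self._transcribe = transcribe    # pluggable speech-to-text backend

    def on_voice_message(self, audio):
        # Generate a voice identifier for the message and store the message
        # under it; both would be sent to the user terminal.
        voice_id = "voice-%d" % next(self._ids)
        self._messages[voice_id] = audio
        return voice_id, audio

    def handle_conversion_request(self, request):
        # Query the voice message by its identifier; if found, convert it
        # and return the converted text information to the terminal.
        audio = self._messages.get(request["voice_id"])
        if audio is None:
            return None
        return self._transcribe(audio)
```

The sketch deliberately separates identifier assignment from conversion: the terminal can queue and schedule conversion requests by identifier without holding the audio itself.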
It should be understood that the computer device 2000 described in this embodiment of the present application may perform the description of the voice data processing method in the embodiment corresponding to fig. 8, and may also perform the description of the data processing apparatus 2 in the embodiment corresponding to fig. 14, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
It should further be noted that an embodiment of the present application also provides a computer-readable storage medium storing the computer program executed by the aforementioned data processing apparatus 2. The computer program includes program instructions; when a processor executes the program instructions, the voice data processing method described in the embodiment corresponding to fig. 8 can be performed, and details are not repeated here. The beneficial effects of the same method are likewise not described in detail. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, reference is made to the description of the method embodiments of the present application.
Further, please refer to fig. 16; fig. 16 is a schematic diagram of a voice data processing system according to an embodiment of the present application. The voice data processing system 3 may include a user terminal 1 and a server 2, where the user terminal 1 may be the voice data processing apparatus 1 in the embodiment corresponding to fig. 12, and the server 2 may be the voice data processing apparatus 2 in the embodiment corresponding to fig. 14. It is understood that the beneficial effects of the same method are not described in detail.
Further, it should be noted that: embodiments of the present application also provide a computer program product or computer program, which may include computer instructions, which may be stored in a computer-readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor can execute the computer instruction, so that the computer device executes the description of the voice data processing method in the embodiment corresponding to fig. 3 or fig. 8, which will not be described herein again. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer program product or the computer program referred to in the present application, reference is made to the description of the embodiments of the method of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is merely a preferred embodiment of the present application and is not intended to limit the scope of the claims of the present application; therefore, equivalent variations made in accordance with the claims of the present application shall still fall within the scope of the present application.

Claims (10)

1. A method for processing voice data, comprising:
when an application client obtains a voice message of a session interface, obtaining a voice identifier corresponding to the voice message, adding the voice identifier to an initial identifier queue, and taking the initial identifier queue with the voice identifier added as a target identifier queue;
generating a voice conversion request carrying the voice identifier based on the queue position of the voice identifier in the target identifier queue, and sending the voice conversion request to a server, so that the server obtains converted text information corresponding to the voice identifier based on the voice conversion request;
receiving the converted text information returned by the server, and outputting the converted text information to a location area where the voice message is located in the session interface; the voice message and the converted text information in the location area have an association relationship.
2. The method of claim 1, wherein the session interface comprises a first user and a second user associated with the first user; the initial identifier queue comprises a first sub-queue and a second sub-queue; the first sub-queue is used for storing a first voice identifier, the first voice identifier being the identifier of a first voice message for which a voice conversion request is to be sent in the application client; the second sub-queue is used for storing a second voice identifier, the second voice identifier being the identifier of a second voice message for which a voice conversion request has been sent in the application client;
the obtaining, when the application client obtains the voice message of the session interface, the voice identifier corresponding to the voice message, adding the voice identifier to the initial identifier queue, and taking the initial identifier queue with the voice identifier added as the target identifier queue comprises:
receiving, by the application client corresponding to the first user, the voice message forwarded by the second user through the server, and receiving the voice identifier configured by the server for the voice message;
acquiring a voice conversion condition associated with the session interface, determining the received voice identifier as a target voice identifier based on the voice conversion condition, taking the voice message received by the application client as a target voice message, and marking a receiving timestamp corresponding to the target voice message as a target receiving timestamp;
determining, based on the target receiving timestamp, a queue position of the target voice identifier of the target voice message in the first sub-queue containing the identifier of the first voice message, and adding the target voice identifier to the first sub-queue based on the queue position to obtain an initial first sub-queue;
determining a target identifier queue based on the initial first sub-queue and the second sub-queue containing the identifier of the second voice message.
3. The method of claim 2, wherein the request priority of the second sub-queue is greater than the request priority of the first sub-queue;
the method further comprises the following steps:
in response to a trigger operation on the session interface where the second user is located, outputting the target voice message to the session interface, and acquiring an initial level adjustment instruction in the voice conversion condition;
determining, based on the initial level adjustment instruction, that the queue position of the target voice identifier is a first position in the initial first sub-queue, and adjusting the queue position of the target voice identifier from the first position to a second position in the initial first sub-queue to obtain an adjusted initial first sub-queue; the request priority of the identifier corresponding to the second position being greater than that of the identifier corresponding to the first position;
updating the target identifier queue based on the adjusted initial first sub-queue and the second sub-queue.
4. The method of claim 3, further comprising:
in response to a trigger operation on the target voice message in the session interface, acquiring a target level adjustment instruction in the voice conversion condition;
determining the adjusted initial first sub-queue as a target first sub-queue based on the target level adjustment instruction, and adjusting the queue position of the target voice identifier from the second position to a third position in the target first sub-queue to obtain an adjusted target first sub-queue; the request priority of the identifier corresponding to the third position being greater than that of the identifier corresponding to the second position;
updating the updated target identifier queue based on the adjusted target first sub-queue and the second sub-queue.
5. The method of claim 1, wherein the target identifier queue comprises a to-be-requested identifier queue and a requested identifier queue; the voice identifier is located in the to-be-requested identifier queue; the requested identifier queue comprises M queue positions, one queue position in the requested identifier queue being used for storing an identifier of a voice message to be converted; M is the total number of identifiers of voice messages to be converted for which a voice conversion request has been sent;
the generating a voice conversion request carrying the voice identifier based on the queue position of the voice identifier in the target identifier queue, and sending the voice conversion request to a server comprises:
receiving conversion success information, returned by the server, for the M voice messages to be converted for which voice conversion requests have been sent, and recording the number of pieces of received conversion success information as N; N being a positive integer less than or equal to M;
acquiring the queue position of the voice identifier in the to-be-requested identifier queue of the target identifier queue, and determining a target queue position of the voice identifier in the requested identifier queue when the queue position of the voice identifier meets a voice conversion condition;
adding the voice identifier to the requested identifier queue based on the target queue position, generating a voice conversion request carrying the voice identifier based on the requested identifier queue with the voice identifier added, and sending the voice conversion request to the server.
6. The method of claim 1, further comprising:
when the converted text information returned by the server is received, acquiring target conversion success information for the voice message, and deleting the voice identifier from the target identifier queue based on the target conversion success information.
7. A method for processing voice data, comprising:
when a voice message of an application client is acquired, generating a voice identifier corresponding to the voice message, and sending the voice message and the voice identifier to a user terminal, so that the user terminal adds the voice identifier to an initial identifier queue, and the initial identifier queue added with the voice identifier is used as a target identifier queue;
receiving a voice conversion request sent by the user terminal, and acquiring the voice identifier from the voice conversion request; the voice conversion request is generated based on a queue position of the voice identifier in the target identifier queue;
when the voice message corresponding to the voice identifier is queried, converting the voice message to obtain converted text information corresponding to the voice message;
and returning the converted text information to the user terminal so that the user terminal outputs the converted text information to the position area where the voice message is located in the session interface of the application client.
8. A speech data processing apparatus, comprising:
the voice acquisition module is used for acquiring a voice identifier corresponding to a voice message when an application client side acquires the voice message of a session interface, adding the voice identifier to an initial identifier queue, and taking the initial identifier queue added with the voice identifier as a target identifier queue;
a request sending module, configured to generate a voice conversion request carrying the voice identifier based on the queue position of the voice identifier in the target identifier queue, and send the voice conversion request to a server, so that the server obtains converted text information corresponding to the voice identifier based on the voice conversion request;
the text receiving module is used for receiving the converted text information returned by the server and outputting the converted text information to the location area where the voice message is located in the session interface; the voice message and the converted text information in the location area have an association relationship.
9. A speech data processing apparatus, comprising:
the voice sending module is used for generating a voice identifier corresponding to the voice message when the voice message of the application client is obtained, and sending the voice message and the voice identifier to the user terminal so that the user terminal adds the voice identifier to an initial identifier queue and takes the initial identifier queue added with the voice identifier as a target identifier queue;
a request receiving module, configured to receive a voice conversion request sent by the user terminal, and obtain the voice identifier from the voice conversion request; the voice conversion request is generated based on a queue position of the voice identifier in the target identifier queue;
the text acquisition module is used for converting the voice message when the voice message corresponding to the voice identifier is inquired to obtain converted text information corresponding to the voice message;
and the text sending module is used for returning the converted text information to the user terminal so that the user terminal outputs the converted text information to the position area where the voice message is located in the session interface of the application client.
10. A computer device, comprising: a processor, a memory, a network interface;
wherein the processor is connected to the memory and the network interface; the network interface is configured to provide a data communication function; the memory is configured to store a computer program; and the processor is configured to invoke the computer program to perform the method of any one of claims 1 to 7.
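The identifier-queue scheduling of claims 2 through 6 — a timestamp-ordered to-be-requested sub-queue, trigger-driven priority promotion, a requested sub-queue capped at M in-flight requests, and deletion on conversion success — can be sketched as follows. This is an illustrative sketch only: the class and method names, the tuple layout, and the exact promotion policy are assumptions, not the claimed implementation.

```python
class TargetIdQueue:
    """Illustrative sketch of the two-sub-queue scheme; names are assumed."""

    def __init__(self, capacity):
        self.pending = []         # to-be-requested: (priority, timestamp, id)
        self.requested = []       # identifiers with a request in flight
        self.capacity = capacity  # M: max outstanding conversion requests
        self._boost = 0           # decreases with each promotion

    def add(self, voice_id, timestamp):
        # Claim 2: the queue position in the to-be-requested sub-queue
        # follows the receive timestamp of the voice message.
        self.pending.append((0, timestamp, voice_id))
        self.pending.sort()

    def promote(self, voice_id):
        # Claims 3-4: a trigger operation raises the identifier's request
        # priority; a more negative priority sorts ahead of older boosts.
        self._boost -= 1
        self.pending = [(self._boost if vid == voice_id else prio, ts, vid)
                        for prio, ts, vid in self.pending]
        self.pending.sort()

    def dispatch(self):
        # Claim 5: move the head identifier into the requested sub-queue
        # only while fewer than M conversion requests are outstanding.
        if self.pending and len(self.requested) < self.capacity:
            _, _, voice_id = self.pending.pop(0)
            self.requested.append(voice_id)
            return voice_id  # a voice conversion request would carry this id
        return None

    def on_conversion_success(self, voice_id):
        # Claim 6: delete the identifier once conversion succeeds.
        self.requested.remove(voice_id)
```

Sorting on `(priority, timestamp)` tuples keeps un-promoted identifiers in receive order while letting a promoted identifier jump ahead, which matches the position adjustments described in claims 3 and 4.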
CN202011295049.6A 2020-11-18 2020-11-18 Voice data processing method, device, equipment and medium Pending CN113393842A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011295049.6A CN113393842A (en) 2020-11-18 2020-11-18 Voice data processing method, device, equipment and medium


Publications (1)

Publication Number Publication Date
CN113393842A true CN113393842A (en) 2021-09-14

Family

ID=77616498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011295049.6A Pending CN113393842A (en) 2020-11-18 2020-11-18 Voice data processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113393842A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113923177A (en) * 2021-09-30 2022-01-11 完美世界(北京)软件科技发展有限公司 Voice processing system, method and device for instant messaging
CN113923177B (en) * 2021-09-30 2023-01-06 完美世界(北京)软件科技发展有限公司 Voice processing system, method and device for instant messaging
CN115291995A (en) * 2022-10-08 2022-11-04 荣耀终端有限公司 Message display method, related electronic equipment and readable storage medium
CN115291995B (en) * 2022-10-08 2023-03-31 荣耀终端有限公司 Message display method, related electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US20190272269A1 (en) Method and system of classification in a natural language user interface
US10168800B2 (en) Synchronization of text data among a plurality of devices
US20160034558A1 (en) Generating a clustering model and clustering based on the clustering model
EP3423956A1 (en) Interpreting and resolving conditional natural language queries
US20190221208A1 (en) Method, user interface, and device for audio-based emoji input
US10824664B2 (en) Method and apparatus for providing text push information responsive to a voice query request
CN110209812B (en) Text classification method and device
EP4144095B1 (en) Text-to-speech audio segment retrieval
EP1751936A1 (en) Method for transmitting messages from a sender to a recipient, a messaging system and message converting means
US20210056950A1 (en) Presenting electronic communications in narrative form
CN111837116A (en) Method, computer arrangement and computer-readable storage medium for automatically building or updating a hierarchical dialog flow management model for a conversational AI agent system
EP2139214A1 (en) System and method to provide services based on network
CN113393842A (en) Voice data processing method, device, equipment and medium
CN106713111B (en) Processing method for adding friends, terminal and server
CN110418181B (en) Service processing method and device for smart television, smart device and storage medium
CN110245334B (en) Method and device for outputting information
US10002611B1 (en) Asynchronous audio messaging
CN111951790A (en) Voice processing method, device, terminal and storage medium
US10529323B2 (en) Semantic processing method of robot and semantic processing device
CN115686229A (en) Expression input method, expression input device and computer program storage medium
CN112133306B (en) Response method and device based on express delivery user and computer equipment
CN113449197A (en) Information processing method, information processing apparatus, electronic device, and storage medium
WO2020251672A1 (en) Road map for audio presentation of communications
Tsai et al. Dialogue session: management using voicexml
US20080162560A1 (en) Invoking content library management functions for messages recorded on handheld devices

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40052774
Country of ref document: HK

SE01 Entry into force of request for substantive examination