CN109815360B - Audio data processing method, device and equipment

Audio data processing method, device and equipment

Info

Publication number: CN109815360B
Application number: CN201910079735.0A
Authority: CN (China)
Prior art keywords: marking, audio data, point, button, marking point
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN109815360A
Inventor: 林梅贞
Current Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Tencent Technology Shenzhen Co Ltd
Events: application filed by Tencent Technology Shenzhen Co Ltd; priority to CN201910079735.0A; publication of CN109815360A; application granted; publication of CN109815360B; anticipated expiration

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

An embodiment of the present invention discloses a method, apparatus, and device for processing audio data. The method includes: acquiring audio data; detecting a trigger instruction for starting an annotation function on the audio data; determining an annotation point of the audio data according to the trigger instruction, where the annotation point is located between data frames of the audio data; and creating an annotation icon for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon. Because an annotation icon is created for the annotation point according to the detected trigger instruction once the audio data is acquired, the annotation point can be identified and located through the icon without repeatedly searching the whole audio data. When a user wants to listen to particular content, the corresponding content can thus be found quickly, which improves efficiency and, in turn, the user experience.

Description

Audio data processing method, device and equipment
Technical Field
Embodiments of the present invention relate to the field of data processing technologies, and in particular to a method, apparatus, and device for processing audio data.
Background
In daily life and work, people in various roles inevitably take part in important occasions and conferences, such as a reporter's interviews, a lawyer's meetings, live courses, and important conferences. Key information that appears on these occasions is usually noted down and recorded so that it can be recalled when it is needed later. Among all recording methods, the most effective and accurate is audio recording, where the content is captured as audio data collected by a recording function.
In the related art, when processing the audio data collected after the recording function is started, if a particular piece of audio content in the audio data needs to be played, the whole audio data is played and the user searches for the desired content in the playing audio.
However, this manner requires the user to search continuously through the playing audio data, which is inefficient and degrades the user experience.
Disclosure of Invention
Embodiments of the present invention provide an audio data processing method, apparatus, device, and storage medium, which can be used to solve the above problems in the related art. The technical solutions are as follows:
In one aspect, an embodiment of the present invention provides a method for processing audio data, where the method includes:
acquiring audio data;
detecting a trigger instruction for starting an annotation function on the audio data;
determining an annotation point of the audio data according to the trigger instruction, where the annotation point is located between data frames of the audio data; and
creating an annotation icon for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon.
A method for processing audio data is also provided, where the method includes:
receiving a trigger instruction for starting an annotation function on audio data, where the trigger instruction is detected and sent by a terminal after the terminal acquires the audio data;
determining an annotation point of the audio data according to the trigger instruction, and generating a name of the annotation point and a node identifier (ID) for identifying the annotation point; and
returning the node ID of the annotation point to the terminal, where the terminal stores the node ID of the annotation point, determines the annotation point of the audio data according to the trigger instruction, and creates an annotation icon for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon.
An apparatus for processing audio data is also provided, where the apparatus includes:
an acquisition module, configured to acquire audio data;
a detection module, configured to detect a trigger instruction for starting an annotation function on the audio data;
a determining module, configured to determine an annotation point of the audio data according to the trigger instruction, where the annotation point is located between data frames of the audio data; and
a creating module, configured to create an annotation icon for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon.
An apparatus for processing audio data is also provided, where the apparatus includes:
a receiving module, configured to receive a trigger instruction for starting an annotation function on audio data, where the trigger instruction is detected and sent by a terminal after the terminal acquires the audio data;
a generating module, configured to determine an annotation point of the audio data according to the trigger instruction, and generate a name of the annotation point and a node identifier (ID) for identifying the annotation point; and
a returning module, configured to return the node ID of the annotation point to the terminal, where the terminal stores the node ID of the annotation point, determines the annotation point of the audio data according to the trigger instruction, and creates an annotation icon for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon.
In one aspect, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one instruction that, when executed by the processor, implements any of the audio data processing methods described above.
In one aspect, a computer-readable storage medium is provided, storing at least one instruction that, when executed, implements any of the audio data processing methods described above.
The technical solutions provided by the embodiments of the present invention have at least the following beneficial effects:
After the audio data is acquired, an annotation icon is created for the annotation point of the audio data according to the detected trigger instruction, so that the annotation point is identified and located through the icon. The annotation point can be located without searching through the whole audio data, so when a user wants to listen to particular content, the corresponding content can be found quickly, which improves efficiency and, in turn, the user experience.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present invention;
FIG. 2 is a flowchart of an audio data processing method provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of an interface provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of terminal states provided by an embodiment of the present invention;
FIG. 5 is a flowchart of an audio data processing method provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of an interface provided by an embodiment of the present invention;
FIG. 7 is a flowchart of an audio data processing method provided by an embodiment of the present invention;
FIG. 8 is a flowchart of an audio data processing method provided by an embodiment of the present invention;
FIG. 9 is a flowchart of an audio data processing method provided by an embodiment of the present invention;
FIG. 10 is a flowchart of an audio data processing method provided by an embodiment of the present invention;
FIG. 11 is a flowchart of an audio data processing method provided by an embodiment of the present invention;
FIG. 12 is a schematic diagram of an audio data processing apparatus provided by an embodiment of the present invention;
FIG. 13 is a schematic diagram of an audio data processing apparatus provided by an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of a server provided by an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a terminal provided by an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
An embodiment of the present invention provides a method for processing audio data. Referring to FIG. 1, a schematic diagram of an implementation environment of the method is shown. The implementation environment may include a terminal 11 and a server 12.
An application client, for example a recording application client, is installed on the terminal 11. When the application client is started, the recording function is enabled, and audio data can be collected through a microphone of the terminal 11. The collected audio data may be sent to the server 12 for storage; of course, the terminal 11 may also store the collected audio data itself. Either way, the collected audio data can be acquired when it needs to be processed.
The terminal 11 may collect audio data during recording and process it using the method provided by this embodiment of the present invention. Of course, the collected audio data may equally be processed with this method after recording has finished.
Optionally, the terminal 11 shown in FIG. 1 may be an electronic device such as a mobile phone, a tablet computer, or a personal computer. The server 12 may be a single server, a server cluster including a plurality of servers, or a cloud computing service center. The terminal 11 establishes a communication connection with the server 12 through a wired or wireless network.
Based on the implementation environment shown in FIG. 1, the audio data processing method provided by this embodiment of the present invention may be as shown in FIG. 2, applied to the terminal 11 in the implementation environment of FIG. 1. As shown in FIG. 2, the method may include the following steps:
in step 201, audio data is acquired.
In this embodiment, the audio data is the audio data collected after the recording function is started, and may be audio data being collected during recording; for example, after the recording function is started, the audio data is collected through a microphone of the terminal. The method provided by this embodiment is equally applicable to processing after recording has finished, in which case the audio data acquired in this step is the already-collected audio data. The embodiment of the present invention is not limited in this respect: besides audio data collected by the terminal itself after starting the recording function, the audio data may also have been collected by another terminal, in which case the terminal may obtain it from the other terminal or from a server.
In step 202, a trigger instruction for starting an annotation function on the audio data is detected.
The annotation function can be triggered whenever there is an annotation requirement, whether during recording or after recording has finished. Optionally, in the method provided by this embodiment of the present invention, detecting the trigger instruction for starting the annotation function on the audio data includes, but is not limited to, the following three embodiments:
First embodiment: the processing interface is used for displaying the audio data, wherein a labeling control is displayed on the processing interface, and the processing interface comprises a recording acquisition interface or a playing interface of the audio data; if touch operation on the labeling control is detected, a trigger instruction for starting the labeling function on the audio data is detected based on the touch operation.
The recording acquisition interface is displayed after the terminal starts the recording function. The playing interface may be an interface when the audio data is played after the audio data is obtained after the recording is finished. In any interface, the labeling control can be arranged, so that after touch operation of the labeling control is detected, a triggering instruction for starting a labeling function is obtained.
For example, as shown in fig. 3 (1), taking an audio data recording and collecting interface as an example, the recording and collecting interface may be presented after a recording App of a voice memo on the terminal is opened, and a labeling control named "label" is displayed on the interface. When touch operation, such as clicking operation, is detected on the labeling control, a triggering instruction for starting the labeling function on the audio data is detected based on the touch operation.
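For illustration only, this embodiment can be sketched in Android-style Kotlin roughly as follows; the function and parameter names are placeholders, not identifiers from the patent:

```kotlin
import android.widget.Button

// Hypothetical names: annotateButton and onAnnotationTriggered are
// illustrative placeholders, not identifiers from the patent.
fun bindAnnotationControl(
    annotateButton: Button,
    onAnnotationTriggered: (timeMs: Long) -> Unit
) {
    // A touch operation (here, a click) on the annotation control yields
    // the trigger instruction; the current time marks the trigger point.
    annotateButton.setOnClickListener {
        onAnnotationTriggered(System.currentTimeMillis())
    }
}
```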
Second embodiment: after the recording function is started, a speed sensor built into the terminal is started, where the terminal is the terminal that processes the audio data; if the speed sensor detects that the terminal performs a reference action for starting the annotation function, the trigger instruction for starting the annotation function on the audio data is obtained based on the reference action.
In a real recording scene, having to look at the terminal screen and tap the annotation control is inconvenient and keeps the user from focusing on the work at hand. The method provided by this embodiment of the present invention therefore supports starting the recording annotation function through the speed sensor. For example, the movement of the terminal is detected by an angular velocity sensor to determine whether the terminal has performed the reference action. In this mode, the speed sensor detects the reference action for starting the annotation function, and once the reference action is detected, it can be confirmed that the user wants to start the annotation function.
The reference action is configurable, and the embodiment of the present invention does not limit which action is set. For example, the reference action may be tilting the terminal back and forth, which starts the annotation function; as shown in FIG. 4 (1), the direction indicated by the arrow is the back-and-forth direction. Alternatively, the reference action may be moving the terminal vertically to the right, as shown in FIG. 4 (2), where the arrow indicates the rightward direction; or moving the terminal vertically to the left, as shown in FIG. 4 (3), where the arrow indicates the leftward direction.
Whichever reference action is adopted, the method may set an angle range for the reference action to avoid misjudgment by the speed sensor. For example, for the reference actions of FIG. 4 (1) to (3): the annotation function is started only when the angular velocity sensor determines that the back-and-forth tilt of the terminal exceeds 30 degrees; when the terminal moves vertically to the left, the annotation function is started only when the movement angle exceeds 30 degrees; and the same applies when the terminal moves vertically to the right. For any of these reference actions, the process of starting the recording annotation function may be as shown in FIG. 5: after the recording app on the terminal is opened, the recording function is started and the angular velocity sensor is enabled through an API; whether the angle of the phone has changed is then determined, and if not, the annotation function is not started; if the angle has changed, whether the change exceeds 30 degrees is determined, and the annotation function is started only if it does.
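A minimal Kotlin sketch of this detection loop, under the assumption of an Android gyroscope; the class and callback names are illustrative, and the 30-degree threshold is taken from the example above:

```kotlin
import android.content.Context
import android.hardware.Sensor
import android.hardware.SensorEvent
import android.hardware.SensorEventListener
import android.hardware.SensorManager
import kotlin.math.abs

// Illustrative sketch: the class and callback names are assumptions, and
// the 30-degree threshold comes from the example in the description.
class ReferenceActionDetector(
    context: Context,
    private val onAnnotationTriggered: () -> Unit
) : SensorEventListener {

    private val sensorManager =
        context.getSystemService(Context.SENSOR_SERVICE) as SensorManager
    private val gyroscope: Sensor? =
        sensorManager.getDefaultSensor(Sensor.TYPE_GYROSCOPE)

    private var accumulatedDeg = 0.0  // integrated rotation about one axis
    private var lastTimestampNs = 0L

    // Called after the recording function is started.
    fun start() {
        gyroscope?.let {
            sensorManager.registerListener(this, it, SensorManager.SENSOR_DELAY_GAME)
        }
    }

    // Called when recording finishes, to turn the sensor off.
    fun stop() = sensorManager.unregisterListener(this)

    override fun onSensorChanged(event: SensorEvent) {
        if (lastTimestampNs != 0L) {
            val dtSec = (event.timestamp - lastTimestampNs) / 1_000_000_000.0
            // values[0] is the angular velocity (rad/s) about the x axis,
            // i.e. the back-and-forth tilt of FIG. 4 (1).
            accumulatedDeg += Math.toDegrees(event.values[0] * dtSec)
            if (abs(accumulatedDeg) > 30.0) {  // angle change exceeds 30 degrees
                accumulatedDeg = 0.0
                onAnnotationTriggered()        // trigger instruction obtained
            }
        }
        lastTimestampNs = event.timestamp
    }

    override fun onAccuracyChanged(sensor: Sensor, accuracy: Int) = Unit
}
```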
Third embodiment: a voice instruction is acquired; the voice instruction is recognized to obtain a voice recognition result; if it is detected that the voice recognition result includes instruction information for starting the annotation function on the audio data, the trigger instruction for starting the annotation function on the audio data is obtained based on the instruction information.
In this embodiment, the user does not need to tap a control on the terminal interface or move the terminal, but controls the annotation function directly by voice. Instruction information for starting the annotation function on the audio data can be configured, and when a voice instruction uttered by the user contains this instruction information, the trigger instruction for starting the annotation function is obtained. The embodiment of the present invention does not limit the content of the configured instruction information. For example, the user may directly speak a voice instruction whose content is "start annotation function"; when the terminal acquires this voice instruction, it recognizes it to obtain a voice recognition result. The voice recognition result may be text: the recognized text is compared with the text corresponding to the configured instruction information, and if the two are consistent, the voice recognition result is considered to include the instruction information for starting the annotation function, and the trigger instruction is obtained. Alternatively, the terminal may detect whether the voice recognition result contains a keyword for starting the annotation function, and obtain the trigger instruction if it does.
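The comparison of the recognized text against the configured instruction information might be sketched as follows; the phrase and keyword set are assumptions based on the example above:

```kotlin
// Sketch of the text-matching step only; the instruction phrase and the
// keyword set are assumptions based on the example in the description.
private const val INSTRUCTION_TEXT = "start annotation function"
private val keywords = setOf("annotate", "annotation", "mark")

fun matchesAnnotationInstruction(recognizedText: String): Boolean {
    val normalized = recognizedText.trim().lowercase()
    // Exact comparison against the configured instruction information...
    if (normalized == INSTRUCTION_TEXT) return true
    // ...or the looser keyword-based alternative.
    return keywords.any { normalized.contains(it) }
}
```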
It should be noted that, since the voice instruction must be uttered by the user, this approach is less suitable than the first two embodiments for situations where the user has to stay quiet and cannot make a sound, such as a meeting. Which of the above modes is used to start the annotation function is therefore not limited and can be chosen according to the recording scene.
In step 203, an annotation point of the audio data is determined according to the trigger instruction, where the annotation point is located between data frames of the audio data.
Optionally, determining the annotation point of the audio data according to the trigger instruction includes: taking the point in the audio data that matches the time point of the trigger instruction as the annotation point of the audio data.
The audio data includes a plurality of data frames, and each data frame has a corresponding time point. For example, when the audio data is collected, each data frame corresponds to a collection time point, and when the audio data is played, each data frame corresponds to a playing time point. Optionally, when the point in the audio data matching the time point of the trigger instruction is taken as the annotation point, the data frame whose time point is the same as that of the trigger instruction is taken as the target data frame, and the point between the target data frame and the previous data frame is taken as the annotation point, that is, the point in the audio data that matches the time point of the trigger instruction.
For example, if the trigger instruction for starting the annotation function is detected while the audio data is being collected, the data frame whose collection time point is the same as the time point of the trigger instruction is taken as the target data frame, and the point between the target data frame and the previous data frame is taken as the annotation point. Suppose the 6th data frame is being collected when the trigger instruction is detected, that is, the collection time point of the 6th data frame is the same as the time point of the trigger instruction; the point between the 5th and 6th data frames is then taken as the annotation point of the audio data.
For another example, if the trigger instruction for starting the annotation function is detected while the audio data is being played, the data frame whose playing time point is the same as the time point of the trigger instruction is taken as the target data frame, and the point between the target data frame and the previous data frame is taken as the annotation point. Suppose the 8th data frame is being played, that is, the playing time point of the 8th data frame is the same as the time point of the trigger instruction; the point between the 7th and 8th data frames is then taken as the annotation point of the audio data.
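The mapping from the trigger time point to the annotation point can be sketched as follows, assuming fixed-length data frames (the patent does not fix a frame duration):

```kotlin
// Illustrative only: fixed-length data frames are assumed, and the 20 ms
// default frame duration is an assumption, not a value from the patent.
data class AnnotationPoint(val targetFrameIndex: Long, val timeMs: Long)

fun annotationPointFor(triggerTimeMs: Long, frameDurationMs: Long = 20): AnnotationPoint {
    // The frame being collected/played at the trigger time is the target frame...
    val targetFrameIndex = triggerTimeMs / frameDurationMs
    // ...and the annotation point is the boundary between the target frame
    // and the previous one (e.g. between the 5th and 6th frames above).
    return AnnotationPoint(targetFrameIndex, targetFrameIndex * frameDurationMs)
}
```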
Optionally, determining the annotation point of the audio data according to the trigger instruction includes: determining the annotation point of the audio data according to the trigger instruction only if the time interval between the time when the trigger instruction is detected and the time when the annotation function was last started is greater than a reference threshold.
The reference threshold may be set empirically or by the user, for example 5 seconds; the reference threshold thereby prevents the terminal from misjudging when the recording annotation function is used. For example, the annotation function cannot be used again within 5 seconds of being started: after it has been used once, in whatever mode, it can be used a second time, in whatever mode, only once 5 seconds have passed since the first use.
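A minimal sketch of this check, reusable by all three trigger modes:

```kotlin
// Illustrative debounce sketch; 5 seconds is the reference threshold from
// the example above and could instead be set empirically or by the user.
class AnnotationDebouncer(private val referenceThresholdMs: Long = 5_000) {
    private var lastStartedMs = -1L

    // Returns true only if the interval since the annotation function was
    // last started exceeds the reference threshold, whichever trigger mode
    // (touch, reference action, or voice) was used either time.
    fun shouldAccept(nowMs: Long): Boolean {
        if (lastStartedMs >= 0 && nowMs - lastStartedMs <= referenceThresholdMs) return false
        lastStartedMs = nowMs
        return true
    }
}
```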
In step 204, an annotation icon is created for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon.
Optionally, creating the annotation icon for the annotation point includes: displaying a recording track of the audio data, taking the position of the annotation point in the recording track as the annotation position, and creating the annotation icon at the annotation position.
As shown in FIG. 3 (2), after the annotation function is started, an annotation icon is displayed on the recording progress bar, together with the default name of the annotation point: "annotation 1". As shown in FIG. 3 (3), the annotation function can be used an unlimited number of times during recording, and each time it is started, a record of the annotation point, namely an annotation icon, is added to the page.
Optionally, after the annotation icon is created for the annotation point, the method further includes: displaying the name of the annotation point and a processing button, where the processing button includes at least one of a first button and a second button, the first button being used to edit the name of the annotation point and the second button being used to delete the annotation point; and if it is detected that a processing button is triggered, processing the annotation point based on the triggered button. For example, if the first button is triggered, the name of the annotation point is edited, and the edited name is acquired and stored; if the second button is triggered, the annotation point is deleted.
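For illustration, the two processing buttons might drive logic along these lines; the data class and the in-memory store are assumptions, not structures named in the patent:

```kotlin
// Illustrative sketch; the Annotation data class and in-memory store are
// assumptions, not structures named in the patent.
data class Annotation(val nodeId: String, var name: String, val timeMs: Long)

class AnnotationStore {
    private val annotations = mutableMapOf<String, Annotation>()

    fun add(annotation: Annotation) {
        annotations[annotation.nodeId] = annotation
    }

    // First button: edit the name of the annotation point and store it.
    fun onFirstButton(nodeId: String, editedName: String) {
        annotations[nodeId]?.name = editedName
    }

    // Second button: delete the annotation point.
    fun onSecondButton(nodeId: String) {
        annotations.remove(nodeId)
    }
}
```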
Optionally, considering that annotation points may be set many times while the display area of the terminal interface is limited, when the names of the annotation points and the processing buttons are displayed, if the number of annotation points exceeds a reference number, a dynamic panel is displayed, and the name and processing buttons of each annotation point are displayed through the dynamic panel.
For example, the dynamic panel may provide a pull-down slider through which all annotation points are displayed. As shown in FIG. 6 (1), when there are more than three annotation points, the page displays a dynamic panel through which the names of the annotation points and the processing buttons are shown, and the records of all annotation points can be viewed by pulling the slider down.
Further, after the annotation icon is created, the method provided by this embodiment of the present invention also allows the user to customize the name of the annotation point, meeting personalized requirements. For example, clicking the edit button among the processing buttons edits the name of the annotation point; this process may refer to the flow shown in FIG. 7. After the annotation function is started through the angular velocity sensor or the app page, if the annotation content needs to be edited, the edited content, namely the updated name of the annotation point, is obtained. Once the edited content is stored and the recording is finished, the angular velocity sensor of the phone is turned off.
Considering that most audio data is stored on the server side, the method provided by this embodiment of the present invention still supports processing the audio data when, for example, the user changes terminals. Where the audio data is stored on the server side, whichever way the trigger instruction is detected, the method further includes, after the trigger instruction for starting the annotation function is detected: sending the trigger instruction to the server, where the server determines the annotation point of the audio data according to the trigger instruction, and generates the name of the annotation point and a node identifier (ID) for identifying the annotation point; and receiving the node ID of the annotation point returned by the server and storing it.
For example, as shown in FIG. 8, when the annotation function is started, the trigger instruction is sent to the background server. After receiving it, the background server determines whether a trigger instruction for starting the annotation function was already executed within the last 5 seconds; if not, the annotation function is started, otherwise it is not. After receiving a trigger instruction that starts the annotation function, the background server creates an annotation point at the trigger time point, generating the default node name "annotation 1" and a corresponding node ID, and sends the node ID to the front end, that is, the terminal. The front end creates an annotation icon at the corresponding time point on the recording track (the progress bar of the audio data), and displays the name of the annotation point together with the processing buttons, namely an edit button and a delete button.
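A minimal server-side sketch of this exchange, assuming UUID node IDs and an in-memory store (the patent specifies neither):

```kotlin
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

// Sketch of the server side of FIG. 8. The UUID-based node ID, the
// in-memory map, and all names here are illustrative assumptions.
data class AnnotationNode(val nodeId: String, var name: String, val triggerTimeMs: Long)

class AnnotationService(private val referenceThresholdMs: Long = 5_000) {
    private val nodes = ConcurrentHashMap<String, AnnotationNode>()
    private var lastTriggerMs = -1L
    private var counter = 0

    // Handles a trigger instruction: returns the created node (whose ID is
    // sent back to the terminal), or null when the trigger falls within the
    // 5-second reference threshold and the annotation function is not started.
    @Synchronized
    fun onTriggerInstruction(triggerTimeMs: Long): AnnotationNode? {
        if (lastTriggerMs >= 0 && triggerTimeMs - lastTriggerMs <= referenceThresholdMs) return null
        lastTriggerMs = triggerTimeMs
        counter += 1
        val node = AnnotationNode(
            nodeId = UUID.randomUUID().toString(),
            name = "annotation $counter",   // default generated name
            triggerTimeMs = triggerTimeMs
        )
        nodes[node.nodeId] = node
        return node
    }
}
```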
Optionally, after the annotation point is processed based on the triggered processing button, the method further includes:
acquiring the node ID of the processed target annotation point, and sending the node ID of the target annotation point and the processing information to the server, where the server processes the target annotation point according to the node ID of the target annotation point and the processing information.
For example, as shown in FIG. 6 (1), each annotation icon has two buttons, delete and edit. If the delete button is triggered, the corresponding annotation icon is deleted, and the ID of the annotation point is sent to the server, which deletes the annotation point corresponding to that ID stored on the server side. If the edit button is triggered, the name of the corresponding annotation point is edited, and the ID of the annotation point together with the edited name is sent to the server, which updates the name of the annotation point corresponding to that ID stored on the server side; this process may also refer to the flow shown in FIG. 8.
The above is only an example of the recording process; after recording finishes, the resulting playing interface may be as shown in FIG. 6 (2). When the whole recording needs to be played, the play icon is clicked to play the audio data of the whole recording. When playback should start at an annotation point, the play button after the name of the annotation point is clicked, and the recording starts playing from that annotation point. While the whole recording is playing, clicking the play button after an annotation likewise jumps to that annotation point and continues playback from there; this process may also refer to the method flow shown in FIG. 9 below.
According to the method provided by this embodiment, after the audio data is acquired, an annotation icon is created for the annotation point of the audio data according to the detected trigger instruction, so that the annotation point is identified and located through the icon. The annotation point can be located without searching through the whole audio data, so when the user wants to listen to particular content, the corresponding content can be found quickly, which improves efficiency and, in turn, the user experience.
Because the name of an annotation point can be recorded, the annotation point can be found by name during playback of the recording, with no need for blind searching. Moreover, recording nodes can be annotated quickly and conveniently with the help of the terminal's speed sensor; such an efficient mode of operation can effectively improve working efficiency and thus yield a higher return.
An embodiment of the present invention provides a method for processing audio data, directed at processing the collected audio data after recording has finished. Referring to FIG. 9, the method provided by this embodiment of the present invention includes the following steps.
In step 301, audio data is acquired.
The implementation of this step is detailed in the above step 201, and will not be described here again.
In step 302, a trigger instruction for starting an annotation function on the audio data is detected.
The implementation of this step is detailed in the above step 202, and will not be described here again.
In step 303, an annotation point of the audio data is determined according to the trigger instruction, where the annotation point is located between data frames of the audio data.
The implementation of this step is detailed in the above-mentioned step 203, and will not be described here again.
In addition, during playback of the recording, the method provided by this embodiment of the present invention also supports modifying the name of an annotation point; the modification process may refer to the method flow shown in FIG. 10.
In step 304, an annotation icon is created for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon.
The implementation of this step is detailed in the above-mentioned step 204, and will not be described here again.
In step 305, a selection instruction for an annotation point is acquired, and based on the selection instruction, the audio data including the selected annotation point is played from the time point of the selected annotation point.
Optionally, playing the audio data including the selected annotation point from the time point of the selected annotation point based on the selection instruction includes: acquiring the node ID of the selected annotation point based on the selection instruction; sending the node ID of the selected annotation point to the server, where the server returns the audio data including the selected annotation point according to the node ID; and receiving that audio data and playing it from the time point of the selected annotation point.
The interaction between the terminal and the background server may refer to FIG. 11. When the recording app on the terminal plays a recording, the node ID of the annotation point selected by the user is sent to the background server, and the background server returns the audio data including the annotation point with that node ID, so that the app on the terminal side can play the audio data from the time point of the selected annotation point.
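On the terminal side, playback from the selected annotation point might be sketched as follows; fetchAudioForNode stands in for the terminal-server exchange of FIG. 11 and is not a real API:

```kotlin
import android.media.MediaPlayer

// Playback sketch for the terminal side of FIG. 11. fetchAudioForNode
// stands in for the terminal-server exchange and is not a real API; it is
// assumed to return a playable local path plus the annotation point's
// time offset in milliseconds.
fun playFromAnnotationPoint(
    nodeId: String,
    fetchAudioForNode: (nodeId: String) -> Pair<String, Int>
) {
    val (audioPath, annotationTimeMs) = fetchAudioForNode(nodeId)
    MediaPlayer().apply {
        setDataSource(audioPath)
        prepare()                 // synchronous prepare, fine for a sketch
        seekTo(annotationTimeMs)  // jump to the selected annotation point
        start()                   // play from the annotation point onward
    }
}
```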
According to the method provided by this embodiment, after the audio data is acquired, an annotation icon is created for the annotation point of the audio data according to the detected trigger instruction, so that the annotation point is identified and located through the icon. The annotation point can be located without searching through the whole audio data, so when the user wants to listen to particular content, the corresponding content can be found quickly, which improves efficiency and, in turn, the user experience.
Because the names of the annotation points can be recorded, the annotation points can be found by name during playback of the recording, with no need for blind searching. Moreover, recording nodes can be annotated quickly and conveniently with the help of the terminal's speed sensor; such an efficient mode of operation can effectively improve working efficiency and thus yield a higher return.
Based on the same technical concept, referring to FIG. 12, an embodiment of the present invention provides an apparatus for processing audio data, where the apparatus includes:
an acquisition module 121, configured to acquire audio data;
a detection module 122, configured to detect a trigger instruction for starting an annotation function on the audio data;
a determining module 123, configured to determine an annotation point of the audio data according to the trigger instruction, where the annotation point is located between data frames of the audio data; and
a creating module 124, configured to create an annotation icon for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon.
Optionally, the detection module 122 is configured to display a processing interface of the audio data, where an annotation control is displayed on the processing interface and the processing interface is a recording collection interface or a playing interface of the audio data; and if a touch operation on the annotation control is detected, obtain the trigger instruction for starting the annotation function on the audio data based on the touch operation.
Optionally, the detection module 122 is configured to start a speed sensor built into the terminal after the recording function is started, where the terminal is the terminal that processes the audio data; and if the speed sensor detects that the terminal performs the reference action for starting the annotation function, obtain the trigger instruction for starting the annotation function on the audio data based on the reference action.
Optionally, the detection module 122 is configured to acquire a voice instruction; recognize the voice instruction to obtain a voice recognition result; and if it is detected that the voice recognition result includes instruction information for starting the annotation function on the audio data, obtain the trigger instruction for starting the annotation function on the audio data based on the instruction information.
Optionally, the determining module 123 is configured to determine the annotation point of the audio data according to the trigger instruction if the time interval between the time when the trigger instruction is detected and the time when the annotation function was last started is greater than a reference threshold.
Optionally, the determining module 123 is configured to take the point in the audio data that matches the time point of the trigger instruction as the annotation point of the audio data;
and the creating module 124 is configured to display a recording track of the audio data, take the position of the annotation point in the recording track as the annotation position, and create the annotation icon at the annotation position.
Optionally, the apparatus further includes:
a first sending module, configured to send the trigger instruction to the server, where the server determines the annotation point of the audio data according to the trigger instruction, and generates the name of the annotation point and a node identifier (ID) for identifying the annotation point; and
a receiving module, configured to receive the node ID of the annotation point returned by the server and store the node ID of the annotation point.
Optionally, the apparatus further includes:
a display module, configured to display the name of the annotation point and a processing button, where the processing button includes at least one of a first button and a second button, the first button being used to edit the name of the annotation point and the second button being used to delete the annotation point; and
a processing module, configured to process the annotation point based on the triggered processing button if a processing button is triggered.
Optionally, the display module is configured to display a dynamic panel if the number of annotation points exceeds a reference number, and display the name of each annotation point and its processing buttons through the dynamic panel.
Optionally, the apparatus further includes:
a second sending module, configured to acquire the node ID of the processed target annotation point and send the node ID of the target annotation point and the processing information to the server, where the server processes the target annotation point according to the node ID and the processing information.
Optionally, the apparatus further includes:
an acquisition module, configured to acquire a selection instruction for an annotation point; and
a playing module, configured to play, based on the selection instruction, the audio data including the selected annotation point from the time point of the selected annotation point.
Optionally, the playing module is configured to acquire the node ID of the selected annotation point based on the selection instruction; send the node ID of the selected annotation point to the server, where the server returns the audio data including the selected annotation point according to the node ID; and receive that audio data and play it from the time point of the selected annotation point.
Referring to FIG. 13, an embodiment of the present invention provides an apparatus for processing audio data, where the apparatus includes:
a receiving module 131, configured to receive a trigger instruction for starting an annotation function on audio data, where the trigger instruction is detected and sent by a terminal after the terminal acquires the audio data;
a generating module 132, configured to determine an annotation point of the audio data according to the trigger instruction, and generate a name of the annotation point and a node identifier (ID) for identifying the annotation point; and
a returning module 133, configured to return the node ID of the annotation point to the terminal, where the terminal stores the node ID, determines the annotation point of the audio data according to the trigger instruction, and creates an annotation icon for the annotation point, so as to identify and locate the annotation point of the audio data through the annotation icon.
Optionally, the receiving module 131 is further configured to receive the node ID and the processing information of the processed target annotation point sent by the terminal, and process the target annotation point according to the node ID and the processing information.
Optionally, the receiving module 131 is further configured to receive the node ID of the selected annotation point sent by the terminal; and
the returning module 133 is further configured to return, to the terminal according to the node ID of the selected annotation point, the audio data including the selected annotation point, so that the terminal plays the audio data from the time point of the selected annotation point.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the functional modules described above is merely an example; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for their specific implementation, refer to the method embodiments, which are not repeated here.
FIG. 14 is a schematic structural diagram of an audio data processing device according to an embodiment of the present invention. The device may be a server, which may be a standalone server or a cluster of servers. Specifically:
The server includes a central processing unit (CPU) 1401, a system memory 1404 including a random access memory (RAM) 1402 and a read-only memory (ROM) 1403, and a system bus 1405 connecting the system memory 1404 to the central processing unit 1401. The server also includes a basic input/output system (I/O system) 1406 that facilitates the transfer of information between devices within the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409, such as a mouse or keyboard, through which a user inputs information. The display 1408 and the input device 1409 are both connected to the central processing unit 1401 through an input/output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may also include the input/output controller 1410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1410 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405. Mass storage device 1407 and its associated computer-readable media provide non-volatile storage for the server. That is, mass storage device 1407 may include a computer readable medium (not shown) such as a hard disk or CD-ROM drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 1404 and mass storage device 1407 described above may be collectively referred to as memory.
According to various embodiments of the present invention, the server may also run by being connected to a remote computer over a network such as the Internet. That is, the server may be connected to the network 1412 through a network interface unit 1411 coupled to the system bus 1405; or, the network interface unit 1411 may be used to connect the server to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, which are stored in the memory and configured to be executed by the CPU. The one or more programs contain instructions for performing the audio data processing method provided by the embodiments of the present invention.
FIG. 15 is a schematic structural diagram of an audio data processing device according to an embodiment of the present invention. The device may be a terminal, for example: a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. A terminal may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
Generally, the terminal includes: a processor 1501 and a memory 1502.
The processor 1501 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 1501 may be implemented in at least one hardware form among DSP (digital signal processing), FPGA (field-programmable gate array), and PLA (programmable logic array). The processor 1501 may also include a main processor and a coprocessor; the main processor, also called a CPU (central processing unit), is a processor for processing data in the awake state, while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1501 may be integrated with a GPU (graphics processing unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1501 may also include an AI (artificial intelligence) processor for handling computing operations related to machine learning.
Memory 1502 may include one or more computer-readable storage media, which may be non-transitory. Memory 1502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1502 is used to store at least one instruction for execution by processor 1501 to implement the method of processing audio data provided by the method embodiments herein.
In some embodiments, the terminal may further optionally include: a peripheral interface 1503 and at least one peripheral device. The processor 1501, memory 1502 and peripheral interface 1503 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1503 via a bus, signal lines, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1504, a display 1505, a camera assembly 1506, audio circuitry 1507, and a power supply 1509.
The peripheral interface 1503 may be used to connect at least one I/O (input/output)-related peripheral device to the processor 1501 and the memory 1502. In some embodiments, the processor 1501, the memory 1502, and the peripheral interface 1503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1501, the memory 1502, and the peripheral interface 1503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1504 is configured to receive and transmit RF (radio frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1504 communicates with a communication network and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 1504 may communicate with other terminals via at least one wireless communication protocol, including but not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1504 may also include NFC (near field communication) related circuits, which is not limited in this application.
The display screen 1505 is used to display a UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1505 is a touch display screen, it also has the ability to collect touch signals on or above its surface. The touch signal may be input to the processor 1501 as a control signal for processing. In this case, the display screen 1505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1505, set on the front panel of the terminal; in other embodiments, there may be at least two display screens 1505, set on different surfaces of the terminal or in a folding design; in still other embodiments, the display screen 1505 may be a flexible display screen set on a curved or folded surface of the terminal. The display screen 1505 may even be set in a non-rectangular irregular pattern, namely an irregularly-shaped screen. The display screen 1505 may be made of materials such as LCD (liquid crystal display) or OLED (organic light-emitting diode).
The camera assembly 1506 is used to capture images or video. Optionally, the camera assembly 1506 includes a front camera and a rear camera. Typically, the front camera is set on the front panel of the terminal and the rear camera on the back of the terminal. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so as to implement background blurring by fusing the main camera with the depth-of-field camera, panoramic and VR (virtual reality) shooting by fusing the main camera with the wide-angle camera, or other fused shooting functions. In some embodiments, the camera assembly 1506 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash combines a warm-light flash with a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuitry 1507 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, inputting the electric signals to the processor 1501 for processing, or inputting the electric signals to the radio frequency circuit 1504 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones can be respectively arranged at different parts of the terminal. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1501 or the radio frequency circuit 1504 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1507 may also include a headphone jack.
The power supply 1509 is used to power the various components in the terminal. The power supply 1509 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 1509 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal further includes one or more sensors 1510. The one or more sensors 1510 include, but are not limited to: acceleration sensor 1511, gyro sensor 1512, pressure sensor 1513, optical sensor 1515, and proximity sensor 1516.
The acceleration sensor 1511 can detect the magnitudes of accelerations on three coordinate axes of a coordinate system established with a terminal. For example, the acceleration sensor 1511 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1501 may control the display screen 1505 to display the user interface in a landscape view or a portrait view based on the gravitational acceleration signal acquired by the acceleration sensor 1511. The acceleration sensor 1511 may also be used for the acquisition of motion data of a game or user.
The gyro sensor 1512 may detect the body direction and rotation angle of the terminal, and may cooperate with the acceleration sensor 1511 to collect the user's 3D motion on the terminal. Based on the data collected by the gyro sensor 1512, the processor 1501 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1513 may be disposed on a side frame of the terminal and/or below the display screen 1505. When the pressure sensor 1513 is disposed on the side frame of the terminal, it can detect the user's grip signal on the terminal, and the processor 1501 performs left/right-hand recognition or quick operations according to the grip signal collected by the pressure sensor 1513. When the pressure sensor 1513 is disposed below the display screen 1505, the processor 1501 controls operable controls on the UI according to the user's pressure operation on the display screen 1505. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The optical sensor 1515 is used to collect the ambient light intensity. In one embodiment, the processor 1501 may control the display brightness of the display screen 1505 based on the ambient light intensity collected by the optical sensor 1515: when the ambient light intensity is high, the display brightness of the display screen 1505 is turned up; when the ambient light intensity is low, the display brightness of the display screen 1505 is turned down. In another embodiment, the processor 1501 may also dynamically adjust the shooting parameters of the camera assembly 1506 based on the ambient light intensity collected by the optical sensor 1515.
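A minimal sketch of this ambient-light-driven adjustment follows; the lux saturation point and the brightness floor are illustrative assumptions, not values from this disclosure:

def adjust_display_brightness(ambient_lux: float) -> float:
    # Map the collected ambient light intensity to a display
    # brightness in [0.0, 1.0]: the higher the ambient light
    # intensity, the higher the display brightness.
    max_lux = 1000.0      # assumed saturation point
    min_brightness = 0.1  # assumed floor so the screen stays readable
    return max(min_brightness, min(1.0, ambient_lux / max_lux))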
The proximity sensor 1516, also referred to as a distance sensor, is typically disposed on the front panel of the terminal. The proximity sensor 1516 is used to collect the distance between the user and the front face of the terminal. In one embodiment, when the proximity sensor 1516 detects that the distance between the user and the front face of the terminal gradually decreases, the processor 1501 controls the display screen 1505 to switch from the screen-on state to the screen-off state; when the proximity sensor 1516 detects that the distance gradually increases, the processor 1501 controls the display screen 1505 to switch from the screen-off state to the screen-on state.
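Sketched under assumed names, this screen-state switching can be driven by comparing successive distance samples:

def update_screen_state(prev_distance: float, distance: float, screen_on: bool) -> bool:
    # A decreasing distance (user approaching the front face) switches
    # the display from the screen-on state to the screen-off state; an
    # increasing distance switches it back on.
    if distance < prev_distance and screen_on:
        return False
    if distance > prev_distance and not screen_on:
        return True
    return screen_on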
It will be appreciated by those skilled in the art that the structure shown in fig. 15 does not constitute a limitation of the terminal, and the terminal may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
In an exemplary embodiment, there is also provided a computer device including a processor and a memory, the memory having stored therein at least one instruction, at least one program, a code set, or an instruction set. The at least one instruction, at least one program, code set, or instruction set is configured to be executed by one or more processors to implement the method of processing audio data of any of the above embodiments.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which at least one instruction, at least one program, a code set, or an instruction set is stored. When executed by a processor of a computer device, the instruction, program, code set, or instruction set implements the method of processing audio data of any of the above embodiments.
Alternatively, the above-described computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It should be understood that references herein to "a plurality" mean two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. The foregoing description of the exemplary embodiments of the invention is not intended to limit the invention to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (7)

1. A method of processing audio data, the method comprising:
acquiring audio data obtained after recording is finished, wherein the audio data comprises a plurality of data frames, and each data frame corresponds to a playing time point;
in the process of playing the audio data, detecting a trigger instruction for starting a marking function on the audio data;
if the time interval between the time of detecting the trigger instruction and the time of last starting the marking function is greater than a reference threshold, taking a data frame in the audio data whose playing time point is the same as the time point of the trigger instruction as a target data frame, and taking a point between the target data frame and the data frame preceding the target data frame in the audio data as a marking point of the audio data;
sending the trigger instruction to a server, wherein after receiving the trigger instruction, the server judges whether a trigger instruction for starting the marking function has been executed within the reference threshold, and, if no such trigger instruction has been executed, determines the marking point of the audio data according to the trigger instruction and generates a default name of the marking point and a node identifier (ID) for identifying the marking point;
receiving the default name of the marking point and the node ID of the marking point returned by the server, and storing the default name of the marking point and the node ID of the marking point;
displaying a recording track of the audio data, taking the position of the marking point in the recording track as a marking position, and creating a marking icon at the marking position, so as to identify and locate the marking point of the audio data through the marking icon;
if the number of marking points exceeds a reference number, displaying a dynamic panel, wherein the dynamic panel comprises a slider bar;
when the slider bar is slid to any position, displaying, through the dynamic panel, the default name of each of the at least two marking points exceeding the reference number and the processing buttons corresponding to each marking point, wherein the processing buttons comprise a first button, a second button, and a third button, the first button being used to edit the default name of its corresponding marking point, the second button being used to delete its corresponding marking point, and the third button being used to play the audio data from the time point of its corresponding marking point;
if the first button is detected to be triggered, acquiring, based on the triggered first button, the edited name of the marking point corresponding to the triggered first button, storing the edited name, acquiring the node ID of the edited marking point, and sending the node ID of the edited marking point and the edited name to the server, wherein the server edits the default name of the marking point identified by that node ID into the edited name according to the node ID of the edited marking point and the edited name;
if the second button is detected to be triggered, deleting, based on the triggered second button, the marking point corresponding to the triggered second button, acquiring the node ID of the deleted marking point, and sending the node ID of the deleted marking point and deletion information to the server, wherein the server deletes the marking point according to the node ID of the deleted marking point and the deletion information;
if the third button is detected to be triggered, acquiring a selection instruction for a selected marking point based on the triggered third button, acquiring the node ID of the selected marking point based on the selection instruction, and sending the node ID of the selected marking point to the server, wherein the server returns audio data comprising the selected marking point according to the node ID of the selected marking point; and receiving the audio data comprising the selected marking point and playing it from the time point of the selected marking point, wherein the selected marking point is the marking point corresponding to the triggered third button.
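For illustration only, the debounce-and-locate logic recited in claim 1 can be sketched as follows; the function and parameter names are hypothetical, and the claim language governs:

def determine_marking_point(frame_times, trigger_time, last_mark_time, threshold):
    # Ignore triggers that arrive within the reference threshold of
    # the last time the marking function was started.
    if trigger_time - last_mark_time <= threshold:
        return None
    # Locate the target data frame whose playing time point equals
    # the time point of the trigger instruction.
    for i, t in enumerate(frame_times):
        if t == trigger_time and i > 0:
            # The marking point lies between the target data frame
            # and the data frame preceding it.
            return (i - 1, i)
    return None

A marking point returned as the pair (i - 1, i) would then be mapped to a marking position on the displayed recording track, where the marking icon is created.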
2. The method of claim 1, wherein detecting a trigger instruction for starting a marking function on the audio data comprises:
displaying a processing interface of the audio data, wherein a marking control is displayed on the processing interface and the processing interface comprises a playing interface of the audio data; and
if a touch operation on the marking control is detected, obtaining, based on the touch operation, a trigger instruction for starting the marking function on the audio data.
3. The method of claim 1, wherein detecting a trigger instruction for starting a marking function on the audio data comprises:
starting a speed sensor built into a terminal, wherein the terminal is used to process the audio data; and
if the speed sensor detects that the terminal performs a reference action for starting the marking function, acquiring, based on the reference action, a trigger instruction for starting the marking function on the audio data.
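One way the reference-action detection of claim 3 might look; a shake-style gesture and the two thresholds are assumed purely for illustration, since the claim does not fix a particular action:

def is_reference_action(speed_samples, spike_threshold=15.0, spikes_needed=3):
    # Treat a burst of high-magnitude readings from the built-in speed
    # sensor as the reference action that starts the marking function.
    spikes = sum(1 for s in speed_samples if abs(s) > spike_threshold)
    return spikes >= spikes_needed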
4. The method of claim 1, wherein detecting a trigger instruction for starting a marking function on the audio data comprises:
acquiring a voice instruction;
recognizing the voice instruction to obtain a voice recognition result; and
if the voice recognition result is detected to comprise instruction information for starting the marking function on the audio data, acquiring, based on the instruction information, a trigger instruction for starting the marking function on the audio data.
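A minimal sketch of the voice path in claim 4, assuming a recognizer that returns plain text and a hypothetical keyword set:

MARK_KEYWORDS = ("mark this", "add marking point", "start marking")  # assumed phrases

def trigger_from_voice(recognition_result: str) -> bool:
    # A trigger instruction is raised if the voice recognition result
    # contains instruction information for starting the marking function.
    text = recognition_result.lower()
    return any(keyword in text for keyword in MARK_KEYWORDS)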
5. A method of processing audio data, the method comprising:
receiving a trigger instruction for starting a marking function on audio data, wherein the trigger instruction is detected and sent by a terminal in the process of playing the audio data after the terminal acquires the audio data, the audio data is obtained after recording is finished, the audio data comprises a plurality of data frames, and each data frame corresponds to a playing time point;
judging whether a trigger instruction for starting the marking function has been executed within a reference threshold, and, if no such trigger instruction has been executed, determining a marking point of the audio data according to the trigger instruction and generating a default name of the marking point and a node identifier (ID) for identifying the marking point;
returning the default name of the marking point and the node ID of the marking point to the terminal, wherein the terminal stores the default name of the marking point and the node ID of the marking point; if the time interval between the time of detecting the trigger instruction and the time of last starting the marking function is greater than the reference threshold, the terminal takes a data frame in the audio data whose playing time point is the same as the time point of the trigger instruction as a target data frame, and takes a point between the target data frame and the data frame preceding the target data frame in the audio data as the marking point of the audio data; the terminal displays a recording track of the audio data, takes the position of the marking point in the recording track as a marking position, and creates a marking icon at the marking position, so as to identify and locate the marking point of the audio data through the marking icon; if the number of marking points exceeds a reference number, the terminal displays a dynamic panel comprising a slider bar; when the slider bar is slid to any position, the terminal displays, through the dynamic panel, the default name of each of the at least two marking points exceeding the reference number and the processing buttons corresponding to each marking point, wherein the processing buttons comprise a first button, a second button, and a third button, the first button being used to edit the default name of its corresponding marking point, the second button being used to delete its corresponding marking point, and the third button being used to play the audio data from the time point of its corresponding marking point; if the first button is detected to be triggered, the terminal acquires, based on the triggered first button, the edited name of the marking point corresponding to the triggered first button, stores the edited name, acquires the node ID of the edited marking point, and sends the node ID of the edited marking point and the edited name to the server, and the server edits the default name of the marking point identified by that node ID into the edited name; if the second button is detected to be triggered, the terminal deletes, based on the triggered second button, the marking point corresponding to the triggered second button, acquires the node ID of the deleted marking point, and sends the node ID of the deleted marking point and deletion information to the server, and the server deletes the marking point according to the node ID of the deleted marking point and the deletion information; and if the third button is detected to be triggered, the terminal acquires a selection instruction for a selected marking point based on the triggered third button, acquires the node ID of the selected marking point based on the selection instruction, and sends the node ID of the selected marking point to the server, the server returns audio data comprising the selected marking point according to the node ID of the selected marking point, and the terminal receives the audio data comprising the selected marking point and plays it from the time point of the selected marking point, wherein the selected marking point is the marking point corresponding to the triggered third button.
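The server-side half of this exchange could be sketched as follows; the uuid-based node IDs, the default-name format, and the in-memory store are assumptions of the sketch, not details fixed by the claim:

import uuid

class MarkingServer:
    def __init__(self, reference_threshold: float):
        self.reference_threshold = reference_threshold
        self.last_trigger_time = float("-inf")
        self.points = {}  # node ID -> default name

    def handle_trigger(self, trigger_time: float):
        # Refuse triggers that arrive within the reference threshold
        # of the last executed marking trigger.
        if trigger_time - self.last_trigger_time <= self.reference_threshold:
            return None
        self.last_trigger_time = trigger_time
        node_id = uuid.uuid4().hex  # node identifier (ID) for the marking point
        default_name = f"Marking point {len(self.points) + 1}"
        self.points[node_id] = default_name
        return node_id, default_name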
6. An apparatus for processing audio data, the apparatus comprising:
the audio recording device comprises an acquisition module, a recording module and a recording module, wherein the acquisition module is used for acquiring audio data obtained after the recording is finished, the audio data comprises a plurality of data frames, and each data frame corresponds to a playing time point;
a detection module, configured to detect, in the process of playing the audio data, a trigger instruction for starting a marking function on the audio data;
a determining module, configured to, if the time interval between the time of detecting the trigger instruction and the time of last starting the marking function is greater than a reference threshold, take a data frame in the audio data whose playing time point is the same as the time point of the trigger instruction as a target data frame, and take a point between the target data frame and the data frame preceding the target data frame in the audio data as a marking point of the audio data;
a first sending module, configured to send the trigger instruction to a server, wherein after receiving the trigger instruction, the server judges whether a trigger instruction for starting the marking function has been executed within the reference threshold, and, if no such trigger instruction has been executed, determines the marking point of the audio data according to the trigger instruction and generates a default name of the marking point and a node identifier (ID) for identifying the marking point;
a receiving module, configured to receive the default name of the marking point and the node ID of the marking point returned by the server, and store the default name of the marking point and the node ID of the marking point;
a creating module, configured to display a recording track of the audio data, take the position of the marking point in the recording track as a marking position, and create a marking icon at the marking position, so as to identify and locate the marking point of the audio data through the marking icon;
a display module, configured to display a dynamic panel if the number of marking points exceeds a reference number, wherein the dynamic panel comprises a slider bar, and, when the slider bar is slid to any position, to display, through the dynamic panel, the default name of each of the at least two marking points exceeding the reference number and the processing buttons corresponding to each marking point, wherein the processing buttons comprise a first button, a second button, and a third button, the first button being used to edit the default name of its corresponding marking point, the second button being used to delete its corresponding marking point, and the third button being used to play the audio data from the time point of its corresponding marking point;
a processing module, configured to, if the first button is detected to be triggered, acquire, based on the triggered first button, the edited name of the marking point corresponding to the triggered first button, and store the edited name;
a second sending module, configured to acquire the node ID of the edited marking point and send the node ID of the edited marking point and the edited name to the server, wherein the server edits the default name of the marking point identified by that node ID into the edited name according to the node ID of the edited marking point and the edited name;
the processing module being further configured to, if the second button is detected to be triggered, delete, based on the triggered second button, the marking point corresponding to the triggered second button;
the second sending module being further configured to acquire the node ID of the deleted marking point and send the node ID of the deleted marking point and deletion information to the server, wherein the server deletes the marking point according to the node ID of the deleted marking point and the deletion information;
a selection module, configured to, if the third button is detected to be triggered, acquire a selection instruction for a selected marking point based on the triggered third button, wherein the selected marking point is the marking point corresponding to the triggered third button; and
a playing module, configured to acquire the node ID of the selected marking point based on the selection instruction and send the node ID of the selected marking point to the server, wherein the server returns audio data comprising the selected marking point according to the node ID of the selected marking point, and the playing module receives the audio data comprising the selected marking point and plays it from the time point of the selected marking point.
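To make the three-button protocol of claims 1, 5, and 6 concrete, a client-side sketch follows; the server object and its edit_name, delete_point, and fetch_audio methods are assumptions of the sketch, not an API defined by this disclosure:

class MarkingPanelClient:
    def __init__(self, server):
        self.server = server
        self.names = {}  # node ID -> name, mirrored locally

    def on_first_button(self, node_id, edited_name):
        # Edit: store the edited name locally, then ask the server to
        # replace the default name identified by the node ID.
        self.names[node_id] = edited_name
        self.server.edit_name(node_id, edited_name)

    def on_second_button(self, node_id):
        # Delete: remove the marking point locally and send the node ID
        # and deletion information to the server.
        self.names.pop(node_id, None)
        self.server.delete_point(node_id, deletion_info="user-delete")

    def on_third_button(self, node_id, player):
        # Play: the server returns audio data comprising the selected
        # marking point; playback starts from the marking point's time point.
        audio, start_time = self.server.fetch_audio(node_id)
        player.play(audio, start_at=start_time)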
7. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction which, when executed by the processor, implements the method of processing audio data according to any one of claims 1 to 4, or the method of processing audio data according to claim 5.
CN201910079735.0A 2019-01-28 2019-01-28 Audio data processing method, device and equipment Active CN109815360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910079735.0A CN109815360B (en) 2019-01-28 2019-01-28 Audio data processing method, device and equipment

Publications (2)

Publication Number Publication Date
CN109815360A CN109815360A (en) 2019-05-28
CN109815360B true CN109815360B (en) 2023-12-29

Family

ID=66605348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910079735.0A Active CN109815360B (en) 2019-01-28 2019-01-28 Audio data processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN109815360B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111629267B (en) * 2020-04-30 2023-06-09 腾讯科技(深圳)有限公司 Audio labeling method, device, equipment and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881285A (en) * 2011-07-15 2013-01-16 富士通株式会社 Method for marking rhythm and special marking equipment
CN103345465A (en) * 2013-06-28 2013-10-09 宇龙计算机通信科技(深圳)有限公司 Method and device for labeling and displaying multi-media files
CN104581351A (en) * 2015-01-28 2015-04-29 上海与德通讯技术有限公司 Audio/video recording method, audio/video playing method and electronic device
CN105679349A (en) * 2014-11-20 2016-06-15 乐视移动智能信息技术(北京)有限公司 Method and device for controlling recording marks of intelligent terminal
CN107360444A (en) * 2016-05-10 2017-11-17 纳宝株式会社 Method and system for making and using video tab
CN108090210A (en) * 2017-12-29 2018-05-29 广州酷狗计算机科技有限公司 The method and apparatus for searching for audio
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9066145B2 (en) * 2011-06-30 2015-06-23 Hulu, LLC Commenting correlated to temporal point of video data
CN103400592A (en) * 2013-07-30 2013-11-20 北京小米科技有限责任公司 Recording method, playing method, device, terminal and system

Also Published As

Publication number Publication date
CN109815360A (en) 2019-05-28

Similar Documents

Publication Publication Date Title
WO2019137429A1 (en) Picture processing method and mobile terminal
WO2021036536A1 (en) Video photographing method and electronic device
WO2021136134A1 (en) Video processing method, electronic device, and computer-readable storage medium
CN111078655B (en) Document content sharing method, device, terminal and storage medium
WO2021078116A1 (en) Video processing method and electronic device
CN109905852B (en) Apparatus and method for providing additional information by using caller's telephone number
WO2019120068A1 (en) Thumbnail display control method and mobile terminal
WO2021104197A1 (en) Object tracking method and electronic device
WO2020125365A1 (en) Audio and video processing method and apparatus, terminal and storage medium
WO2021104209A1 (en) Video display method, electronic device and medium
CN108108114A (en) A kind of thumbnail display control method and mobile terminal
EP2753065A2 (en) Method and apparatus for laying out image using image recognition
CN110069181B (en) File processing method, device, equipment and storage medium crossing folders
WO2020238938A1 (en) Information input method and mobile terminal
KR20180133743A (en) Mobile terminal and method for controlling the same
CN109922356B (en) Video recommendation method and device and computer-readable storage medium
CN111628925B (en) Song interaction method, device, terminal and storage medium
CN113936699B (en) Audio processing method, device, equipment and storage medium
CN110798327B (en) Message processing method, device and storage medium
CN112417180B (en) Method, device, equipment and medium for generating album video
WO2020135269A1 (en) Session creation method and terminal device
CN109815360B (en) Audio data processing method, device and equipment
CN112311652B (en) Message sending method, device, terminal and storage medium
CN115129211A (en) Method and device for generating multimedia file, electronic equipment and storage medium
WO2020238913A1 (en) Video recording method and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant